This photograph depicts a large filing system from the time before computers. The people in this picture are reading tables. These days, computers process data at a scale unfathomably larger than this filing system, but those data have to be organized very precisely or the computer cannot work with them. It is much harder for a computer to work with data that has been organized not for automatic processing, but for human reading.

Examples of such tables include web tables, spreadsheets, tables in articles such as scientific papers, and open data sets published by various organizations. Many applications, such as web search, question answering, or automated reporting, could benefit from automatically extracting and combining information from large numbers of these kinds of tables, but this is still a very difficult task, because you would first need a computer program that can interpret what they mean.

In the real world, tables may appear in unexpected places. This is a table on a scoreboard at a baseball game. If you're not a baseball fan, you might already have trouble interpreting what it means. Now imagine that you encountered this table on a website, without its context. Would you know what it was about?

To interpret the meaning of the columns, rows, and cells of a table, you need a certain amount of background knowledge. For baseball fans this is easy: the first column contains baseball teams, not places; the nine columns in the middle give the number of runs per inning; and the last three columns contain runs, hits, and errors. But if you don't have the right background knowledge, you can't do this. So if we want a computer to extract and combine information from a wide variety of tables, it needs a way to represent this kind of knowledge on as many different topics as possible.

We can do this using a class of technology called knowledge bases, which allow us to combine and connect facts about real-world concepts in a coherent way. Knowledge bases are used in business, government, and science for combining information about many topics. The biggest ones have millions of concepts and billions of facts. This tiny fragment shows some statements about famous people and places.
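To give a rough idea of what such a fragment looks like inside a computer, here is a tiny illustrative sketch in Python. The concepts and facts are made up for this example, and real knowledge bases are far larger and richer.

```python
# A toy sketch of how a knowledge base stores facts: each fact is a
# (subject, predicate, object) statement about a real-world concept.
# The concepts and facts below are illustrative examples only.
knowledge_base = {
    ("Amsterdam", "is_a", "city"),
    ("Amsterdam", "capital_of", "Netherlands"),
    ("Rembrandt", "is_a", "painter"),
    ("Rembrandt", "born_in", "Leiden"),
}

def facts_about(concept):
    """Return all statements in which the concept appears as the subject."""
    return [fact for fact in knowledge_base if fact[0] == concept]

print(facts_about("Amsterdam"))  # the two statements about Amsterdam (order may vary)
```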

For automatic table interpretation, we associate elements of a table with concepts in a knowledge base. We still need to make sure that this works even when the table uses uncommon or ambiguous names, or when it contains small mistakes.
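As a small, hedged illustration of how such approximate matching might work, and not the exact method from the thesis, a standard fuzzy string matcher from the Python standard library already tolerates abbreviations and small spelling mistakes. The label list and threshold below are made up.

```python
import difflib

# Candidate matching between table cells and knowledge-base labels: we want to
# find plausible matches even when names are abbreviated or slightly misspelled.
kb_labels = ["New York Yankees", "Boston Red Sox", "Amsterdam", "Rembrandt van Rijn"]

def candidate_matches(cell_text, labels=kb_labels, cutoff=0.5):
    """Return knowledge-base labels that approximately match a cell value."""
    return difflib.get_close_matches(cell_text, labels, n=3, cutoff=cutoff)

print(candidate_matches("NY Yankees"))  # tolerates an abbreviated name
print(candidate_matches("Rembrant"))    # tolerates a small spelling mistake
```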

Then, if we can find a match between the table and our knowledge base, we can extract new facts from the table rows and extend the knowledge base with them. However, when you apply this process in practice, you encounter several challenges.

In my PhD thesis, I have addressed some of these challenges by designing and evaluating software. The challenges are: extracting new facts instead of only recognizing ones we already know; dealing with the diversity and scale of tables in the real world; using tables to build knowledge bases about new domains; and doing research in this area in a modular way.

To address the challenge of extracting new facts, I described the design and evaluation of a computer program that interprets tables using implicit signals.

Consider this table about movies. If it contains many new facts, it is harder to find matches: you might find several possible matches per cell, each connected to other concepts in the knowledge base. Assuming that the concepts in one column are all related, our approach combines these signals to make a coherent interpretation of the table.
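To make the idea of a coherent interpretation concrete, here is a simplified sketch in Python. The movie column, the candidate concepts, and the types are made up, and the actual approach in the thesis is more sophisticated than this.

```python
from collections import Counter

# Candidate concepts per cell of one (made-up) movie column; each candidate
# comes with a type taken from the knowledge base.
candidates = {
    "Alien":   [("Alien (film)", "film"), ("Alien (creature)", "fictional species")],
    "Titanic": [("Titanic (film)", "film"), ("RMS Titanic", "ship")],
    "Up":      [("Up (film)", "film"), ("Up (direction)", "concept")],
}

def coherent_interpretation(candidates):
    """Pick one candidate per cell so that as many cells as possible share a type."""
    type_counts = Counter(t for options in candidates.values() for _, t in options)
    best_type = type_counts.most_common(1)[0][0]
    return {
        cell: next((concept for concept, t in options if t == best_type), options[0][0])
        for cell, options in candidates.items()
    }

print(coherent_interpretation(candidates))
# {'Alien': 'Alien (film)', 'Titanic': 'Titanic (film)', 'Up': 'Up (film)'}
```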

We combine this interpretation approach with an implicit model of the knowledge base, which expresses the likelihood of new extractions. In our evaluation, this technique could extract more new facts with higher precision than other approaches.
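As a very rough sketch of the intuition, and not the actual model from the thesis, one can think of the final score of an interpretation as a blend of how strongly it matches the knowledge base and how plausible a new, unseen fact would be. The weighting below is an illustrative placeholder.

```python
# Blend evidence for a known fact with the plausibility of a new one.
def interpretation_score(match_score, novelty_likelihood, alpha=0.7):
    """Higher scores favour strong knowledge-base matches or plausible new facts."""
    return alpha * match_score + (1 - alpha) * novelty_likelihood

print(interpretation_score(match_score=0.9, novelty_likelihood=0.1))  # strong known match
print(interpretation_score(match_score=0.2, novelty_likelihood=0.8))  # plausible new fact
```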

To measure this, we introduced new metrics that quantify the trade-off between having fewer but more confident interpretations, based on strong knowledge-base matches, and interpretations that are less confident but more likely to result in new facts.
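The following small sketch illustrates the kind of trade-off these metrics capture; the exact metrics in the thesis are defined differently, and the facts below are just examples.

```python
# Out of all extracted facts, how many are correct, and how many of the correct
# ones are actually new to the knowledge base?
def extraction_metrics(extracted, correct, known):
    """extracted, correct, known are sets of (subject, predicate, object) facts."""
    true_extractions = extracted & correct
    precision = len(true_extractions) / len(extracted) if extracted else 0.0
    novel = true_extractions - known
    novelty_rate = len(novel) / len(true_extractions) if true_extractions else 0.0
    return precision, novelty_rate

extracted = {("Alien", "director", "Ridley Scott"), ("Up", "director", "Pete Docter")}
correct   = {("Alien", "director", "Ridley Scott"), ("Up", "director", "Pete Docter")}
known     = {("Alien", "director", "Ridley Scott")}
print(extraction_metrics(extracted, correct, known))  # (1.0, 0.5)
```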

To address the diversity and scale of real-world tables, I developed and evaluated a method for large-scale table integration. The goal was to analyze millions of Wikipedia tables and extract facts that could be integrated with Wikidata, a huge collaborative knowledge base that is already partly linked to Wikipedia, but not to its tables yet. The method consists of several steps.

The first step uses some rules of thumb to find values in the header and context of each table, and reshapes the tables so they are easier to compare with each other.
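One example of such a rule of thumb, purely as an illustration of the idea and not the exact heuristics from the thesis, is to unpivot tables whose column headers are themselves values, such as years.

```python
# When column headers look like values (for example years), unpivot the table
# so every fact becomes a separate, comparable (entity, year, value) row.
def unpivot_year_columns(header, rows):
    year_cols = [i for i, h in enumerate(header) if h.isdigit() and len(h) == 4]
    reshaped = []
    for row in rows:
        for i in year_cols:
            reshaped.append((row[0], header[i], row[i]))
    return reshaped

header = ["Country", "2018", "2019"]
rows = [["Netherlands", "17.2M", "17.3M"], ["Belgium", "11.4M", "11.5M"]]
print(unpivot_year_columns(header, rows))
# [('Netherlands', '2018', '17.2M'), ('Netherlands', '2019', '17.3M'), ...]
```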

The next step clusters the tables by efficiently finding sets of possibly related tables, calculating a graph of their similarity, and dividing this graph into clusters that are estimated to express similar kinds of facts. All the tables in a cluster are then merged into a single table.
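In simplified form, and leaving out the efficiency tricks that make this work at the scale of millions of tables, the clustering step might look like the sketch below: compare table headers, link tables that are similar enough, and merge each connected group into one table.

```python
from itertools import combinations

def header_similarity(a, b):
    """Jaccard similarity of two sets of column headers."""
    sa, sb = set(a["header"]), set(b["header"])
    return len(sa & sb) / len(sa | sb)

def cluster_and_merge(tables, threshold=0.5):
    # Build a similarity graph: an edge between every pair of similar tables.
    neighbours = {i: {i} for i in range(len(tables))}
    for i, j in combinations(range(len(tables)), 2):
        if header_similarity(tables[i], tables[j]) >= threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)
    # Take connected components as clusters and stack the rows of each cluster.
    seen, merged = set(), []
    for i in range(len(tables)):
        if i in seen:
            continue
        stack, cluster = [i], set()
        while stack:
            k = stack.pop()
            if k not in cluster:
                cluster.add(k)
                stack.extend(neighbours[k] - cluster)
        seen |= cluster
        merged.append({
            "header": tables[min(cluster)]["header"],
            "rows": [r for k in sorted(cluster) for r in tables[k]["rows"]],
        })
    return merged

tables = [
    {"header": ["Song", "Year"], "rows": [["Song A", "1999"]]},
    {"header": ["Song", "Year"], "rows": [["Song B", "2001"]]},
    {"header": ["Team", "Wins"], "rows": [["Team X", "90"]]},
]
print(cluster_and_merge(tables))  # the two song tables are merged; the team table stays alone
```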

Because the merged table is much bigger than the originals, it's easier to integrate it with a knowledge base. In this step, we also introduced a new way to extract facts which involve multiple entities and values, like the rank a song had in a music chart at some point in time.
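As an illustration of what such a fact looks like, here is a sketch in the style of a knowledge-base statement with qualifiers; the song, the chart, and the property names are made up for this example.

```python
# A fact involving more than two things: the main statement links a song to a
# chart, and qualifiers carry the rank and the point in time.
def make_chart_statement(song, chart, rank, date):
    return {
        "subject": song,
        "predicate": "charted_in",
        "object": chart,
        "qualifiers": {"rank": rank, "point_in_time": date},
    }

row = {"Song": "Example Song", "Chart": "Example Top 40", "Rank": 3, "Date": "1999-07-10"}
print(make_chart_statement(row["Song"], row["Chart"], row["Rank"], row["Date"]))
```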

With this sequence of steps, we were able to deal with a large diversity of structures, schemas, and semantics that occur in real-world tables. From one and a half million tables on Wikipedia, the method was able to extract 21 million new facts at high precision.

For the challenge of using tables to build a knowledge base in a new domain, we developed a technique to extract information from tables in scientific publications and create an entity-linked knowledge base. Rather than first creating a large amount of background knowledge for the new domain, our goal was to minimize the manual effort needed to perform this task.

Because we assume that there is no knowledge base available yet for this domain, the first step is to manually create definitions of the types of concepts we intend to extract, based on a selection of the data. The next step involves training a machine learning algorithm to interpret the tables. This usually requires creating a large dataset of manually annotated examples, which is a lot of work. Instead, we only wrote heuristics, which were used to train the algorithm in an approximate way. Then we use a powerful logical reasoning program to create entities from matching table cells, based on manually written rules.
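To give a flavour of what writing heuristics instead of annotations means, here is a small illustrative sketch; the labels and patterns are made up and much cruder than the ones used in the thesis.

```python
import re
from collections import Counter

# Small heuristics that guess a label for a table cell; their combined votes
# serve as approximate training labels instead of manual annotations.
def looks_like_score(cell):      # e.g. "85.3" or "0.91"
    return "metric_value" if re.fullmatch(r"\d+(\.\d+)?%?", cell) else None

def looks_like_dataset(cell):    # e.g. "ImageNet", "SQuAD 2.0" (very rough)
    return "dataset" if re.fullmatch(r"[A-Z][\w\-]*( ?\d+(\.\d+)?)?", cell) else None

HEURISTICS = [looks_like_score, looks_like_dataset]

def weak_label(cell):
    """Combine heuristic votes into one approximate training label."""
    votes = [label for label in (h(cell) for h in HEURISTICS) if label]
    return Counter(votes).most_common(1)[0][0] if votes else "other"

print(weak_label("85.3"))      # -> metric_value
print(weak_label("ImageNet"))  # -> dataset
```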

The resulting knowledge base combines information from the metadata of the papers, such as who wrote them and which papers they cite, with predicted concepts and entities mentioned in the tables of those papers. This could be useful for supporting powerful search engines that find scientific results in published research papers. More generally, this research shows the large potential of combining machine learning and logical reasoning for information extraction.

To conclude, the work that I describe in my PhD thesis addresses several challenges that arise when extending knowledge bases from human-readable tables. All the techniques that I described make sensible use of carefully designed background knowledge to overcome these challenges effectively. I believe these techniques can contribute to future solutions for finding information in large collections of data. Thank you for your attention. I am now ready to defend my dissertation.