Welcome, and thank you for watching this talk about building a knowledge base from tables in scientific papers. I am Benno Kruit, and this work was done in Amsterdam in collaboration with Hongyu He and Jacopo Urbani. In the next 10 minutes, I'm going to be presenting a pipeline that allows for the bootstrap creation of a knowledge base from tables in scientific papers, with minimal human annotation effort.

The main thing that I'd like you to remember from this talk is that it is possible to effectively combine machine learning and logical reasoning for semantic information extraction by using weak supervision and an existential rule engine. The use of SPARQL queries allows us to express sophisticated heuristics which weakly supervise the training of statistical models. The existential rule engine then allows us to express the circumstances in which entities should be created and linked to table cells. This way, we avoid having to build expensive labeled datasets while keeping fine-grained control over the generated output. In particular, we ran our pipeline on a set of 73,000 tables extracted from scientific papers in the domain of artificial intelligence. Our code and data are publicly available, and we encourage the community to build on our work to find new approaches for semantic information extraction.

So, why build a knowledge base from tables in scientific papers in the first place? As scientists, we all know how much work it can be to do even a superficial literature review. While much progress has been made over the past years in terms of metadata and search, information extraction from papers still takes a lot of manual annotation effort. We are quite far away from being able to write a SPARQL query that returns information about the content of scientific papers. The tables from these papers seem like a good place to begin. They express structured information about the scientific process and often have a similar structure across documents. This information could be immediately useful for different applications if it were available in a more structured and semantic form. From this angle, this work also helps answer the long-standing research question: how do we convert semi-structured data into high-quality semantic data with as little effort as possible?

The first problem is that scientific papers are published as PDFs. As you probably know, PDF is an even less semantically rich format than most HTML. To process the tables, we have to detect and reconstruct them from text box coordinates. Next, tables in scientific papers do not use standardized vocabularies or structures. To make matters worse, their terminology is highly specific to the domain in question. To annotate it all, you would need lots of work from domain experts, which is prohibitively expensive. The largest challenge is that we are semantically starting from scratch. However, this does allow us to take a flexible, goal-specific approach, where we are free to design the knowledge base to suit our needs, whether those are geared towards a literature review or a specific research question.

To answer our research question and extract semantic information from tables in scientific papers with minimal annotation effort, we create a knowledge base with three layers. First, we transform the paper metadata along with the extracted tables into a simple RDF graph that expresses their structure and relationships to each other. Second, we use weakly supervised machine learning models to predict the types of tables and columns, which we add to the graph as another layer. Third, we perform large-scale existential reasoning to infer which cells refer to the same latent entities, which we also add to the graph. The only input required from domain experts is a set of queries that express table and column type heuristics, which are used to train the machine learning models, and a set of existential rules for entity resolution. This way, we don't need a prohibitively expensive set of manual annotations, while still generalizing to unseen data and keeping the expert in control of the process. I will now describe these layers in more detail.
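Before diving into the details, here is a minimal sketch, using rdflib in Python, of how these three layers could sit together in a single graph. The namespace and property names are placeholders I made up for illustration; they are not the vocabulary we actually use.

```python
# Minimal sketch of the three-layer graph using rdflib.
# All namespaces and property names below are illustrative placeholders,
# not the vocabulary used in the actual pipeline.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, OWL

EX = Namespace("http://example.org/tab2know/")
g = Graph()

# Layer 1: structure and metadata extracted from the PDF.
g.add((EX.table1, RDF.type, EX.Table))
g.add((EX.table1, EX.fromPaper, EX.paper42))
g.add((EX.table1, EX.hasColumn, EX.table1_col0))
g.add((EX.table1_col0, EX.hasCell, EX.table1_cell_0_1))
g.add((EX.table1_cell_0_1, EX.value, Literal("MNIST")))

# Layer 2: predicted semantics from the weakly supervised classifiers.
g.add((EX.table1, RDF.type, EX.ObservationTable))
g.add((EX.table1_col0, RDF.type, EX.DatasetColumn))

# Layer 3: entities inferred by existential reasoning over layers 1 and 2.
g.add((EX.table1_cell_0_1, EX.refersTo, EX.entity_mnist))
g.add((EX.entity_mnist, OWL.sameAs, EX.entity_mnist_other_paper))

print(len(g), "triples across the three layers")
```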

The input to our system is a set of scientific paper PDFs and their metadata. In our case, we use a subset of the Semantic Scholar corpus, which provides harmonized author and venue IDs, along with paper citation links. However, you can easily use other sources of papers, such as the Microsoft Academic Graph, DBLP, or PubMed. We extracted the tables from these papers using the popular open-source tools PDFFigures 2 for detection and Tabula for extraction. Then we converted all this data to RDF using well-known vocabularies and loaded it into a triple store. This already allowed us to perform quite sophisticated queries, such as finding frequent cells and co-occurrences, column headers and data types, and using venue and author IDs to filter on scientific subdomains; I'll show a small sketch of such a structural query in a moment. However, because the graph only expresses structure, not semantics, many queries still return noise, or they grow very large if you want to cover edge cases. Therefore, we'd like users to be able to add the semantics that they need for their use case. To add semantics, we need to perform table interpretation.
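Here is that sketch: a hedged example of a purely structural query, sent to a local endpoint with SPARQLWrapper. The endpoint URL and the CSVW-style properties are assumptions for the example; your own graph may use a different vocabulary.

```python
# Sketch: count frequent column headers in the structural graph.
# The endpoint URL and the csvw-style property names are assumptions
# made for this example; substitute the vocabulary of your own graph.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:3030/tables/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX csvw: <http://www.w3.org/ns/csvw#>
SELECT ?header (COUNT(?col) AS ?n)
WHERE {
  ?col a csvw:Column ;
       csvw:title ?header .
}
GROUP BY ?header
ORDER BY DESC(?n)
LIMIT 20
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["header"]["value"], row["n"]["value"])
```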

The interpretation of web tables has become a well-studied problem in recent years, so much so that there is a whole workshop at this conference devoted to it. But our tables are quite different: not only are they written by experts for experts, we also don't have a knowledge base to link them to. Most concepts in scientific papers cannot be found in DBpedia or even Wikidata. We need to make our own ontology, of which you can see a part on the right, and label our own data. However, labeling lots of scientific tables requires lots of work from experts, and that's often not feasible. Additionally, if your requirements change, you'd have to label all that data over again. Therefore, we use weak supervision, a method for generalizing from expert-written heuristics that can label lots of data at once.

Weak supervision is a machine learning paradigm introduced in recent years, based on aggregating different forms of supervision for noise-resilient machine learning. Instead of hand-labeling data, you write labeling functions that express heuristics. As long as the heuristics don't actively contradict each other, you can aggregate those weak signals and exploit their correlations and sparsity for training statistical models. The reference implementation of this paradigm is Snorkel, a framework developed at Stanford. There are some theoretical guidelines to follow, but its effectiveness has been demonstrated in research papers published at database and AI conferences, and it's already widely used in industry.

The way this works is that the interactions between the labeling functions are captured by a generative model. The generative model aggregates the labeling function outputs according to probabilistic reliability judgments. These aggregated labels can then be used as training data for a discriminative model, which can be any deep or classical machine learning model.
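As a toy illustration of this aggregation step, here is a plain majority vote over a label matrix. This is deliberately simpler than Snorkel's generative LabelModel, which additionally estimates how reliable each labeling function is.

```python
# Toy illustration of aggregating labeling-function outputs.
# A real pipeline would use Snorkel's LabelModel, which also estimates
# the reliability of each labeling function; here we fall back to a
# plain majority vote. -1 means "abstain".
import numpy as np

ABSTAIN = -1

# Rows = tables, columns = labeling functions, values = class ids.
L = np.array([
    [0, 0, ABSTAIN],   # two functions agree on class 0
    [1, ABSTAIN, 1],   # two functions agree on class 1
    [ABSTAIN, 2, 1],   # conflict: tie broken arbitrarily
])

def majority_vote(L):
    labels = []
    for row in L:
        votes = row[row != ABSTAIN]
        if len(votes) == 0:
            labels.append(ABSTAIN)          # no function fired
        else:
            labels.append(np.bincount(votes).argmax())
    return np.array(labels)

y_train = majority_vote(L)   # noisy labels for the discriminative model
print(y_train)
```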

As an example, consider the task of predicting table types. On this slide you can see examples of four table types we want to detect. There are many features that could help us classify these tables, but it is impossible to write rules that cover the entire range. At the same time, it would be prohibitively expensive to label sufficient data to train a good statistical classifier. However, if you're familiar with a domain, it's straightforward to write some queries that find example tables of these types with reasonable precision.

For example, this SPARQL query matches tables that contain the word "example" in their caption. Here the caption is part of the extraction metadata, but given our structural graph, we are free to use the table content, its header row, the publishing venue of the paper, its authors, or any combination of this information to find tables of a certain type. These SPARQL queries make up our labeling functions.
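In sketch form, such a labeling query and a small wrapper around it could look like this. The caption property and the class id are assumptions I'm making for the example, not necessarily the exact vocabulary of our graph.

```python
# Sketch of a labeling function backed by a SPARQL query: it labels every
# table whose caption contains the word "example" as an Example table.
# The caption property and class id used here are assumptions for
# illustration, not necessarily the vocabulary of the real pipeline.
from SPARQLWrapper import SPARQLWrapper, JSON

EXAMPLE_TABLE = 2   # hypothetical class id for the "Example" table type

LF_EXAMPLE_CAPTION = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?table
WHERE {
  ?table dcterms:description ?caption .
  FILTER CONTAINS(LCASE(STR(?caption)), "example")
}
"""

def run_labeling_query(endpoint, query, label):
    """Return {table_uri: label} for every table matched by the query."""
    client = SPARQLWrapper(endpoint)
    client.setReturnFormat(JSON)
    client.setQuery(query)
    rows = client.query().convert()["results"]["bindings"]
    return {row["table"]["value"]: label for row in rows}

labels = run_labeling_query("http://localhost:3030/tables/sparql",
                            LF_EXAMPLE_CAPTION, EXAMPLE_TABLE)
```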

After writing several of them for every class we want to label, we create a label matrix that can be aggregated to produce the training dataset. If multiple queries label the same table, they might disagree. To resolve conflicts, we can simply take the majority vote or fit a Snorkel model that estimates their reliability. The final labels can be used as training data for any discriminative classifier, which can be trained on a restricted set of features that, unlike the labeling queries, don't require expensive graph operations. However, one important factor heavily influences our choice of classifier. Because the labels encode heuristics, they will be strongly biased to correlate with the features used in the queries. In order to actually generalize from these queries, we must prevent the classifier from overfitting. In our work, we opted for relatively simple statistical classifiers to keep this in check. We would be very interested in techniques to mitigate this bias when training deep neural networks in such a setting, but that is future work.
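Putting these pieces together, here is a rough sketch of how the outputs of several labeling queries become a label matrix and then training data for a simple classifier. The bag-of-words features over captions and the helper names are illustrative assumptions, not our exact setup.

```python
# Sketch: turn the outputs of several labeling queries into a label matrix
# and train a simple discriminative classifier on cheap features (here a
# bag-of-words over captions). Feature choice and helper names are
# illustrative assumptions, not the exact setup of the pipeline.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

ABSTAIN = -1

def build_label_matrix(tables, query_outputs):
    """tables: list of table URIs; query_outputs: list of {uri: label} dicts,
    one per labeling query (e.g. produced by run_labeling_query above)."""
    L = np.full((len(tables), len(query_outputs)), ABSTAIN)
    for j, output in enumerate(query_outputs):
        for i, table in enumerate(tables):
            if table in output:
                L[i, j] = output[table]
    return L

def train_table_type_classifier(captions, y):
    """y: aggregated labels, e.g. from a majority vote or Snorkel's LabelModel."""
    keep = np.where(y != ABSTAIN)[0]          # drop tables no query labeled
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(captions)
    clf = LogisticRegression(max_iter=1000)   # simple model to limit overfitting
    clf.fit(X[keep], y[keep])
    return vectorizer, clf
```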

The third layer in our knowledge base is created by entity resolution. For this task, we use VLog, a rule engine that performs existential reasoning. The scalability of this engine won it the best resource award at this conference last year. The rule set consists of two types of rules: tuple-generating dependencies, two of which are shown at the top, and equality-generating dependencies, shown below. This reasoning is performed on the graph generated by the previous steps, which includes all structure, metadata, and predicted semantics. That means, for example, that we can use a rule which states that if some term is used by the same author in two different papers, it must refer to the same entity.
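To make that rule concrete, here is a toy Python sketch of what such an equality-generating dependency does, using a hand-rolled union-find. This is only an illustration of the merging behaviour, not the VLog engine we actually use, and the tuple layout is an assumption for the example.

```python
# Toy illustration of one equality-generating dependency:
#   same author + same surface term in two papers  =>  same latent entity.
# This is a hand-rolled union-find sketch, not the VLog engine used in
# the actual pipeline; the tuple layout is an assumption for the example.
from collections import defaultdict

# (cell_id, term, author, paper) facts taken from the graph
facts = [
    ("cell1", "MNIST", "authorA", "paper1"),
    ("cell2", "MNIST", "authorA", "paper2"),   # same author, same term
    ("cell3", "MNIST", "authorB", "paper3"),   # different author: not merged
]

parent = {cell: cell for cell, _, _, _ in facts}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Apply the EGD: group cells by (term, author) and merge within each group.
groups = defaultdict(list)
for cell, term, author, paper in facts:
    groups[(term, author)].append(cell)
for cells in groups.values():
    for other in cells[1:]:
        union(cells[0], other)

entities = {find(cell) for cell in parent}
print(f"{len(parent)} cells resolved to {len(entities)} entities")
```

Before I describe our experimental results, here's an overview of our system to refresh your memory.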

The input to our system is a set of PDF tables whose extracted structure and metadata we use to create a naïve KB. Then the user writes a set of SPARQL queries over this KB that represent heuristics for table interpretation, which are generalized by weakly supervised classifiers. Finally, we perform existential reasoning for entity linking to create the final semantically enriched graph.

To test the performance of this pipeline, we sampled 400 tables from 17 artificial intelligence venues and had them manually annotated to create a gold standard. We used the four table types shown before as targets, for which two annotators wrote 39 heuristic label queries. We tested three classifiers that were robust against overfitting, of which the Naive Bayes model had the highest cross-validation performance.
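For those who want to reproduce this kind of comparison, here is a hedged sketch using scikit-learn cross-validation. The feature representation, the third classifier, and the hyperparameters are assumptions for illustration; the talk only names Naive Bayes and logistic regression explicitly.

```python
# Sketch: compare simple, overfitting-resistant classifiers by
# cross-validation on the weakly labeled tables. Features, the linear SVM,
# and settings here are illustrative assumptions, not the exact setup.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def compare_classifiers(captions, y, folds=5):
    """captions: table captions (or other cheap features); y: weak labels."""
    models = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "linear_svm": LinearSVC(),
    }
    for name, model in models.items():
        pipe = make_pipeline(CountVectorizer(), model)
        scores = cross_val_score(pipe, captions, y, cv=folds, scoring="f1_macro")
        print(f"{name}: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```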

For predicting column types, we used 22 target classes. We had our annotators write 55 heuristic label queries, which we used to train the same three types of simple classifiers as in the previous task. The test results were obtained by cross-validation. Additionally, we compared our pipeline to the recent Tablepedia work of Yu and collaborators. Their dataset consists of 451 manually cleaned tables, labeled with three column types. Their approach is bootstrapped from 15 seed concepts, which we mapped to label queries in our paradigm for comparison.

From these results, we can see that our dataset is more challenging than that of the previous TablePedia work, due to the larger set of target labels. On the simpler TablePedia dataset, our pipeline outperforms the previous work. In both cases, the weakly supervised logistic regression classifier achieves the highest performance.

Finally, to evaluate the entity resolution component of our pipeline, we performed an ablation study with respect to the number of merging rules activated during reasoning. All rules individually contribute to entity resolution, and their union reduces the number of unique entities roughly by half. Inspecting a random sample of 100 merged entities, we observed that 65% of the entities made sense, while 97% of the mergers were correct. Examples of high-quality and low-quality entities are shown in the table on the right.

We also demonstrate the scalability of our pipeline by extracting tables from over 143,000 PDFs from the Semantic Scholar Open Research Corpus. We created this subset of the corpus by filtering on papers published in the past five years at 17 top-tier artificial intelligence venues. We extracted 73,000 tables from these papers and ran our pipeline to create a knowledge base with 31 million triples. This knowledge base allows us to formulate interesting queries about research in AI, such as which datasets are popular at different venues; a sketch of such a query follows below. If you want to explore it, all our data and code are publicly available at the addresses shown here.
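Here is the promised sketch of such a query, which ranks datasets per venue. The property names and namespace are placeholders for illustration; the real knowledge base uses its own vocabulary.

```python
# Sketch of the kind of query the final knowledge base supports:
# which datasets appear most often in tables, per venue. The property
# names below are assumptions for illustration, not the exact schema.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX ex: <http://example.org/tab2know/>
SELECT ?venue ?dataset (COUNT(DISTINCT ?paper) AS ?papers)
WHERE {
  ?cell ex:refersTo ?dataset .
  ?dataset a ex:Dataset .
  ?cell ex:inTable ?table .
  ?table ex:fromPaper ?paper .
  ?paper ex:publishedAt ?venue .
}
GROUP BY ?venue ?dataset
ORDER BY ?venue DESC(?papers)
"""

client = SPARQLWrapper("http://localhost:3030/tab2know/sparql")
client.setReturnFormat(JSON)
client.setQuery(QUERY)
for row in client.query().convert()["results"]["bindings"]:
    print(row["venue"]["value"], row["dataset"]["value"], row["papers"]["value"])
```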

In conclusion, this work demonstrates the possibility of combining weak supervision and existential rules for creating a knowledge base from tables in scientific papers. It requires minimal annotation from experts by using statistical classifiers that learn from heuristics and existential rules for entity resolution. Our pipeline outperforms previous work and we introduce a new annotated data set for this task, which are both publicly available along with our large-scale generated knowledge base. As future work, we would like to quantify how much effort is needed from experts for them to create task-specific knowledge bases from tables in different scientific fields. Thanks for your attention, and please let me know if you have any questions.