Welcome, and thank you for watching this talk about extracting n-ary facts from Wikipedia table clusters. I am Benno Kruit, and this work was done in Amsterdam as part of my PhD research, supervised by Peter Boncz and Jacopo Urbani. In the next 10 minutes, I'm going to present a pipeline that takes tables from Wikipedia articles as input and outputs binary and n-ary facts that can be added to Wikidata.
So, why extract information from Wikipedia tables in the first place? Well, they contain high-quality encyclopedic background knowledge about known topics that are already covered to a certain extent by most knowledge bases. This makes it possible to integrate them with relatively high precision and recall, and to have them serve as a testbed for general-purpose data integration tasks. As well-known examples of tables made for human readers rather than machines, they help us answer the long-standing open research question: how do we automatically process tables that were not designed for automatic processing?
As a motivating example, here's a table about election results in Canada. This table involves entities that are already in Wikidata, but it expresses facts that are not. In this example, we'd like to be able to figure out which cells in this table express the number of votes that these politicians received in this election, and extract this information as n-ary facts.
As a running example, I'll use this popular type of table about songs and music charts. We'd like to be able to extract n-ary facts about songs charting in some music chart, in some year, at some rank. In this example, you can see that the time and ranking are represented as qualifiers of the charted-in relation. Such n-ary data, while valuable, is currently poorly represented in knowledge graphs. As you can see, the schemas and layout that are used for these tables can differ even on a single page. Additionally, it is essential that we involve the table context, in this case the page entity in the extraction, because all these tables are about the song "Imagine" by John Lennon.
So what makes this hard is that the tables are related because they're all about entities on Wikipedia, but they can look very different because they've been created by different authors. And they very often express n-ary relations, which is something that has not been studied much before. In fact, most work on table interpretation assumes that every column expresses a binary attribute of entities in some key column, but as we've seen, that assumption doesn't work here.
To answer our research question and extract usable facts from real tables, we use three steps. First, to clean up the layouts that people use in practice, we use heuristics that detect when to unpivot a table. This makes the tables easier to compare to each other. Second, to propagate information between similar tables, we cluster them together and create union tables. Third, to integrate them into an n-ary data model, we use strong assumptions for high precision.
In the first step, we clean up subheaders and footnotes and convert the table context into extra columns. These pre-processing steps are pretty straightforward and are explained in detail in the paper. A novel contribution of our work is that we use heuristics to detect value-like header cells. After unpivoting these tables on those cells, it becomes easier to compare their schemas and find table pairs that express similar relations.
Let's look at an example table header for each heuristic. First, we unpivot tables on header cells that start or end with numbers, like this table about music charts. We also unpivot tables on header cells that have links to entities with the agent type, like these tennis tournaments. The third heuristic finds header cells that span repeating values, like these ski races, which span the time and rank cells. You'll see that these heuristics are designed to find header cells that do not describe the column content, but contain values that are needed for expressing n-ary facts. Heuristics four and five here are based on statistics from a corpus of tables: if some cell value mostly occurs spanned in headers, or mostly in the body of tables, that's a signal that it probably doesn't describe the column content. Finally, we unpivot cell values that are rare outliers, that is, values that occur much less frequently in the corpus than others. While these heuristics have low recall individually, when combined they make it easier to compare tables by reducing the diversity of schemas.
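To make the first heuristic concrete, here is a minimal sketch (not the pipeline's actual code) of detecting header cells that start or end with a number and unpivoting a table on them with pandas; the function names and the toy chart table are purely illustrative.

```python
import re
import pandas as pd

def header_is_value_like(header: str) -> bool:
    """Heuristic 1 (sketch): a header cell that starts or ends with a number,
    such as a year like "1971", is treated as a value rather than a column
    description."""
    text = header.strip()
    return bool(re.match(r"^\d", text) or re.search(r"\d$", text))

def unpivot_value_headers(df: pd.DataFrame, key_cols: list) -> pd.DataFrame:
    """Unpivot (melt) the columns whose headers look like values,
    turning each such header into a cell of a new 'variable' column."""
    value_cols = [c for c in df.columns
                  if c not in key_cols and header_is_value_like(str(c))]
    if not value_cols:
        return df
    return df.melt(id_vars=key_cols, value_vars=value_cols)

# Toy example: a chart table with year columns "1971" and "1972".
table = pd.DataFrame({"Chart": ["UK Singles", "Billboard Hot 100"],
                      "1971": [1, 3], "1972": [None, 1]})
print(unpivot_value_headers(table, key_cols=["Chart"]))
```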
Next, we use four steps to create union tables that give us stronger signals for fact extraction. First, we reduce the set of table pairs to analyze by finding four approximate neighborhoods per table. We do this by creating approximate indexes using locality-sensitive hashing on the Jaccard similarity and approximate nearest neighbor indexes on the word embedding similarity for cells from both the header and the body of tables.
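As a rough illustration of the Jaccard-based neighborhood search, here is a small sketch using MinHash LSH from the datasketch library; the table identifiers and cell sets are toy examples, and the embedding-based neighborhood would be built analogously with an approximate nearest neighbor index such as faiss or annoy.

```python
from datasketch import MinHash, MinHashLSH  # assumed dependency for MinHash LSH

def minhash(cells, num_perm=128):
    """Hash a table's header and body cell strings into a MinHash signature."""
    m = MinHash(num_perm=num_perm)
    for cell in cells:
        m.update(cell.lower().encode("utf8"))
    return m

# Toy tables represented by their cell strings.
tables = {"t1": ["chart", "year", "position", "uk singles"],
          "t2": ["chart", "peak position", "year", "billboard hot 100"],
          "t3": ["tournament", "surface", "opponent"]}

# Index all tables, then query each one for its approximate Jaccard neighborhood.
lsh = MinHashLSH(threshold=0.3, num_perm=128)
signatures = {tid: minhash(cells) for tid, cells in tables.items()}
for tid, sig in signatures.items():
    lsh.insert(tid, sig)

for tid, sig in signatures.items():
    neighbours = [n for n in lsh.query(sig) if n != tid]
    print(tid, "->", neighbours)
```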
Then, within the approximate neighborhoods, we compute similarity scores for all table pairs using the Jaccard index and word embedding similarity of their header and body cells, and the similarity of their column types.
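Here is a minimal sketch of how such a pairwise score could be aggregated, assuming one mean-pooled embedding vector per table; the weights and the combination are illustrative, not the ones used in the paper.

```python
import numpy as np

def jaccard(a: set, b: set) -> float:
    """Jaccard index of two cell-value sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity of two (mean-pooled) cell embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def table_pair_score(header_a, header_b, body_a, body_b, emb_a, emb_b,
                     weights=(0.4, 0.3, 0.3)):
    """Combine header Jaccard, body Jaccard and embedding similarity
    into one pair score (weights are illustrative)."""
    w_h, w_b, w_e = weights
    return (w_h * jaccard(set(header_a), set(header_b))
            + w_b * jaccard(set(body_a), set(body_b))
            + w_e * cosine(emb_a, emb_b))

# Toy usage with made-up embeddings.
emb_a = np.array([0.2, 0.9, 0.1])
emb_b = np.array([0.25, 0.85, 0.05])
score = table_pair_score(["chart", "year"], ["chart", "peak position"],
                         ["uk singles", "1"], ["uk singles", "3"],
                         emb_a, emb_b)
print(round(score, 3))
```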
We aggregate these similarity scores to create a weighted graph, which we partition using the Louvain modularity algorithm, which scales to large graphs such as ours.
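With the aggregated pair scores as edge weights, the partitioning step might look roughly like this sketch, using the Louvain implementation shipped with networkx 2.8 and later; the scores are made up for illustration.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities  # networkx >= 2.8

# Aggregated pair scores become edge weights (values here are made up).
pair_scores = {("t1", "t2"): 0.82, ("t1", "t3"): 0.12, ("t2", "t3"): 0.09}

G = nx.Graph()
for (a, b), score in pair_scores.items():
    G.add_edge(a, b, weight=score)

# Louvain modularity partitioning; each partition becomes one union table.
partitions = louvain_communities(G, weight="weight", seed=42)
print(partitions)  # e.g. [{'t1', 't2'}, {'t3'}]
```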
Finally, we create union tables from these partitions by finding the best alignment of columns, using agglomerative clustering on the column similarities we computed before. This allows us to propagate information within the union table, for example about the types of entities in a column, which is useful for entity disambiguation. It also gives us a stronger signal for integration by allowing us to count knowledge-base matches for the entire union table instead of for individual smaller tables.
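As a sketch of the column alignment, assume a precomputed column-to-column similarity matrix; SciPy's average-linkage agglomerative clustering then groups columns whose distance stays below an illustrative threshold, and each cluster becomes one aligned column of the union table.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Illustrative column-to-column similarities across two tables in a partition.
columns = ["t1.Chart", "t1.Position", "t2.Chart", "t2.Peak position"]
sim = np.array([[1.0, 0.1, 0.9, 0.1],
                [0.1, 1.0, 0.1, 0.8],
                [0.9, 0.1, 1.0, 0.1],
                [0.1, 0.8, 0.1, 1.0]])

# Agglomerative clustering on column distances (1 - similarity);
# each cluster of columns is merged into one union-table column.
dist = squareform(1.0 - sim, checks=False)
Z = linkage(dist, method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
for col, lab in zip(columns, labels):
    print(lab, col)
```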
The last step in the pipeline is matching the columns of the union tables to the knowledge base. Our goal was to integrate n-ary tables with Wikidata, which uses qualifiers to represent n-ary relations. You can see that any binary fact can have qualifiers that add extra information, shown here with dotted lines. There are two challenges here. First, n-ary facts are underrepresented in Wikidata, which, in the naive approach, makes it more likely that some row will incorrectly match a binary fact than correctly match an n-ary one. Second, n-ary facts are often expressed incompletely, which means different rows might match different parts of the n-ary relation.
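As a simplified illustration (not actual Wikidata JSON, which uses Q and P identifiers), an n-ary fact from the running example can be thought of as a main binary statement plus qualifiers:

```python
# Simplified representation of an n-ary fact: the main binary statement
# plus qualifiers that add the year and the rank.
statement = {
    "subject": "Imagine (John Lennon song)",
    "property": "charted in",
    "value": "UK Singles Chart",
    "qualifiers": {
        "point in time": "1971",
        "ranking": 1,
    },
}
```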
In order to overcome these challenges, we make use of probabilistic functional dependencies. We find the key column that has the highest harmonic mean of incoming functional dependency scores, and if that value is above a threshold, we assume the table is an entity-attribute table. Otherwise, we assume the table expresses an n-ary relation and try to match qualifiers. We assume that any column with a high functional dependency score on a non-key column expresses extra binary facts. Because our union table gives us a stronger signal, this method outperforms popular baselines on key column detection.
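To sketch the idea, here is one common way to estimate probabilistic functional dependencies and pick the key column by the harmonic mean of its FD scores with the other columns; the direction, exact estimator, and threshold used in the paper may differ.

```python
from collections import Counter
from statistics import harmonic_mean
import pandas as pd

def pfd(df: pd.DataFrame, a: str, b: str) -> float:
    """Probabilistic FD score for A -> B: the fraction of rows whose B value
    agrees with the most frequent B value within their A group."""
    total, consistent = 0, 0
    for _, group in df.groupby(a):
        counts = Counter(group[b])
        consistent += counts.most_common(1)[0][1]
        total += len(group)
    return consistent / total if total else 0.0

def detect_key_column(df: pd.DataFrame):
    """Pick the column with the highest harmonic mean of its pFD scores
    with respect to the other columns."""
    scores = {}
    for key in df.columns:
        others = [pfd(df, key, other) for other in df.columns if other != key]
        scores[key] = harmonic_mean(others) if others else 0.0
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy usage: 'Song' determines the other columns, so it is chosen as key.
toy = pd.DataFrame({"Song": ["Imagine", "Imagine", "Jealous Guy"],
                    "Artist": ["John Lennon", "John Lennon", "John Lennon"],
                    "Year": [1971, 1971, 1971]})
print(detect_key_column(toy))
```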
Lastly, we use strong assumptions for matching union tables to n-ary relations by only matching qualifiers of the best-matching main property. This compensates for the incompleteness of the knowledge base with respect to qualifiers, and for the partial matches of tables. It gives us high precision, but makes recall suffer, which we plan to address in future work.
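Very roughly, and with hypothetical data structures rather than the pipeline's actual interfaces, that assumption could be sketched like this: first pick the main property with the most matches over the whole union table, and only then match the remaining cells against qualifiers of that single property.

```python
from collections import Counter

def integrate_union_table(rows, kb_lookup):
    """Sketch of the strong matching assumption. 'rows' are
    (subject, value, extra_cells) tuples from a union table, and
    'kb_lookup(subject, value)' is a hypothetical helper returning existing
    KB statements as (property, qualifiers) pairs."""
    # 1. Count main-property matches over the whole union table.
    prop_counts = Counter()
    for subj, val, _extra in rows:
        for prop, _quals in kb_lookup(subj, val):
            prop_counts[prop] += 1
    if not prop_counts:
        return None, Counter()
    best_prop = prop_counts.most_common(1)[0][0]

    # 2. Only match qualifiers of the best-matching main property.
    qualifier_counts = Counter()
    for subj, val, extra in rows:
        for prop, quals in kb_lookup(subj, val):
            if prop != best_prop:
                continue
            for qual_prop, qual_val in quals.items():
                if qual_val in extra:
                    qualifier_counts[qual_prop] += 1
    return best_prop, qualifier_counts

# Toy knowledge base: one existing statement with a qualifier.
KB = {("Imagine", "UK Singles Chart"): [("charted in", {"point in time": "1971"})]}
rows = [("Imagine", "UK Singles Chart", {"1971", "1"})]
print(integrate_union_table(rows, lambda s, v: KB.get((s, v), [])))
```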
We evaluated this pipeline using a gold standard that we created with the help of our annotators. At each step, our approach outperforms the baseline, and the pipeline allowed us to extract 15 million binary and 6 million n-ary novel facts from 1.5 million Wikipedia tables.
At the moment, we're working on finding the optimal configuration of our pipeline for running it in practice and contributing novel facts to Wikidata. We'll be improving the steps by using weakly supervised learning, and we'd like to see how to use it for ontology augmentation, that is, finding new entities and discovering new relations that are not yet in Wikidata. Our goal is to involve the community in an open-source effort, and we're looking forward to contributions and feedback on the project from you! Thanks for your attention, and please check out TAKCO and contact me if you have any questions. Thank you.