Hi, my name is Benno Kruit and I will be presenting work done together with Yiming Xu and Jan-Christoph Kalo at the Vrije Universiteit Amsterdam, titled "Retrieval-Based Question Answering with Passage Expansion Using a Knowledge Graph".
This work is based on retrieval-augmented generation, which is quite popular these days. A user asks a question, the engine performs a search query to find relevant documents, those documents are retrieved, and the answer is then generated based on them. There are different ways to retrieve the documents. The two main categories are sparse methods, based on keyword matching, and dense methods, based on text embeddings, and each has its own strengths and weaknesses. There is a lot of research interest in dense embeddings, but they seem to underperform on the long tail of entities, that is, entities that are not very popular. There is also work on sparse retrieval methods, which seem to work fairly well for the long tail, but it is not clear whether they are well suited to entity-centric questions.
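To make the setup concrete, here is a minimal, self-contained Python sketch of the retrieve-then-generate loop just described. The toy corpus, the keyword-overlap scorer standing in for a sparse retriever, and the stub reader are illustrative placeholders, not the components used in this work.

```python
# Minimal sketch of retrieval-augmented generation:
# retrieve passages for a question, then generate an answer conditioned on them.
from collections import Counter
from typing import List

# Toy passage corpus (illustrative only).
CORPUS = [
    "Karlsruhe is a city in the German state of Baden-Wuerttemberg.",
    "Amsterdam is the capital of the Netherlands.",
    "The Vrije Universiteit Amsterdam is a research university.",
]

def sparse_retrieve(question: str, k: int = 2) -> List[str]:
    """Rank passages by keyword overlap (a crude stand-in for BM25-style sparse retrieval)."""
    q_tokens = Counter(question.lower().split())

    def score(passage: str) -> int:
        return sum((q_tokens & Counter(passage.lower().split())).values())

    return sorted(CORPUS, key=score, reverse=True)[:k]

def generate_answer(question: str, passages: List[str]) -> str:
    """Stub reader: a real system would condition a seq2seq model on question + passages."""
    return f"Answer to {question!r} based on {len(passages)} retrieved passages."

question = "Which country is Amsterdam the capital of?"
print(generate_answer(question, sparse_retrieve(question)))
```

A dense retriever would replace the keyword-overlap scorer with embedding similarity between the question and each passage; the surrounding loop stays the same.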
In our work, we investigate what happens if we add a semantic search method on top of an existing retrieval-augmented generation system for question answering. Specifically, we want to know what happens when the answer to a question is already known in a knowledge graph. This is aimed at entity-centric questions.
Take this example: where did Ludwig Will die? Our pipeline works as follows. First, we find and link the subject entity of the question, in this case Ludwig Will, to an identifier in the knowledge graph. Then we predict which relation is expressed by the question, in this case place of death, which also has an identifier in the knowledge graph. Then we simply look up in the knowledge graph whether there is an object for this subject and relation, in this case Karlsruhe. Finally, we add snippets of text about that object to the set of retrieved passages that is used to answer the question.
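The expansion step just described can be sketched roughly as follows. The tiny in-memory knowledge graph and passage store, the string-matching entity linker, and the keyword relation predictor are hypothetical stand-ins for the real components described next; the Wikidata identifiers P20 (place of death) and Q1040 (Karlsruhe) are real, while the subject identifier is a made-up placeholder.

```python
# Sketch of KG-based passage expansion:
# link the subject, predict the relation, look up the object, add passages about it.
from typing import List, Optional

# (subject id, relation id) -> object id; P20 = place of death, Q1040 = Karlsruhe.
KNOWLEDGE_GRAPH = {("Q_LUDWIG_WILL", "P20"): "Q1040"}  # subject id is a placeholder
PASSAGES_ABOUT = {"Q1040": ["Karlsruhe is a city in southwestern Germany ..."]}

def link_subject(question: str) -> Optional[str]:
    """Stand-in for the entity linker."""
    return "Q_LUDWIG_WILL" if "Ludwig Will" in question else None

def predict_relation(question: str) -> Optional[str]:
    """Stand-in for the distantly supervised relation classifier."""
    return "P20" if "die" in question.lower() else None

def expand_passages(question: str, retrieved: List[str]) -> List[str]:
    """Append KG-derived passages to the passages retrieved by the base system."""
    subject = link_subject(question)
    relation = predict_relation(question)
    obj = KNOWLEDGE_GRAPH.get((subject, relation))
    extra = PASSAGES_ABOUT.get(obj, []) if obj else []
    return retrieved + extra

print(expand_passages("Where did Ludwig Will die?", ["some dense-retriever passage ..."]))
```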
Our experiments are structured as follows. The entity linking step is performed by a state-of-the-art entity linking model, ELQ (Entity Linking in Questions) from Meta AI. Our relation extraction component is a distantly supervised BERT classifier that we train ourselves on data from Wikidata. We use Dense Passage Retrieval (DPR) to obtain the set of retrieved passages, which we then expand using the knowledge graph. We evaluate two variants of the KG-based retrieval: first, Wikipedia article title retrieval, in which we add passages from the object's Wikipedia article; and second, Wikipedia hyperlink retrieval, in which we add passages that link to the object we retrieved from the knowledge graph. The last step is the reader that actually generates an answer to the question. Here we use the off-the-shelf Fusion-in-Decoder model: the question is concatenated with each retrieved passage and encoded separately, and the answer is then generated by the decoder.
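The two KG-based expansion variants can be illustrated with a short sketch, assuming each Wikipedia passage is stored together with the title of its source article and the set of article titles it hyperlinks to. The Passage structure and the toy passages are illustrative assumptions, not the indexing used in the actual system.

```python
# Sketch of the two passage-expansion variants for a KG object (e.g. "Karlsruhe").
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Passage:
    text: str
    article_title: str                                # article the passage comes from
    links_to: Set[str] = field(default_factory=set)   # article titles it hyperlinks to

WIKI_PASSAGES = [
    Passage("Karlsruhe is a city in Baden-Wuerttemberg ...", "Karlsruhe"),
    Passage("He spent his final years in Karlsruhe ...", "Some biography",
            links_to={"Karlsruhe"}),
]

def expand_by_article_title(object_title: str) -> List[Passage]:
    """Variant 1: passages taken from the KG object's own Wikipedia article."""
    return [p for p in WIKI_PASSAGES if p.article_title == object_title]

def expand_by_hyperlinks(object_title: str) -> List[Passage]:
    """Variant 2: passages whose text hyperlinks to the KG object's article."""
    return [p for p in WIKI_PASSAGES if object_title in p.links_to]

print(expand_by_article_title("Karlsruhe"))
print(expand_by_hyperlinks("Karlsruhe"))
```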
We evaluate on three different benchmark datasets. Two of them are artificially generated, and one consists of natural data collected from users of the Google search engine. PopQA is an artificially generated question answering dataset built from a knowledge graph. The questions cover 16 different knowledge graph relations, and the entities are linked directly in the data, which means the relation extraction model can achieve very high accuracy. EntityQuestions is another artificially generated dataset of entity-centric questions from a knowledge graph. It is generated from 24 different knowledge graph relations, but neither the relations nor the entities are linked directly, which means the relation extraction model does not achieve as high performance. Finally, NaturalQuestions is a dataset of questions that users posed to the Google search engine, which can be related to Wikipedia articles. Here the questions are not all entity-centric, and the distantly supervised relation extraction model cannot achieve very high performance.
In these graphs, we show the recall of our models on the y-axis against the popularity of the entities in the question on the x-axis. In blue, orange, and green we see the performance of the Dense Passage Retriever model; in red and purple we see our proposed expansion of the retrieved passages using knowledge graph retrieval. On most datasets, all models achieve higher performance on more popular entities. On the artificial data, our approach improves performance by retrieving more relevant passages; on the natural data it does not.
In these graphs we show the accuracy of the entire retrieval-augmented generation question answering pipeline. Overall, the correct answer is easier to generate for popular entities, especially on the artificial data. We also see that our approach only improves performance on these artificial benchmarks. Additionally, the article-title passage expansion is more accurate than the one based on hyperlinks.
In these graphs, we show the percentage of correctly retrieved knowledge graph objects for less popular and more popular entities. We can see a small trend that the correct object is easier to retrieve when it is popular. For the PopQA dataset, we can almost always retrieve the correct knowledge graph entity. For the EntityQuestions benchmark, we have to link the entities and predict a relation with lower accuracy, and this also results in a lower percentage of correctly retrieved knowledge graph entities. In the NaturalQuestions dataset, most of the time the correct entity cannot be found with our method, because the questions are not all entity-centric.
We also perform an error analysis and categorize the different errors that our method makes. Sometimes the passages are simply noisy. Sometimes the reader is not strong enough to predict the correct answer. Sometimes the entity linker links the wrong subject, or the relation predictor predicts the wrong relation. Often, for the natural data, the answer is quite complex, or the question is not entity-centric but a more general question.
In conclusion, the main contribution of our work is an evaluation of adding knowledge graph retrieval to an off-the-shelf question answering system. This pipeline depends on entity linking and relation prediction components, and their errors add up. It helps long-tail performance on artificial benchmarks, but does not generalize to natural questions. In the future, we would like to evaluate the effect of better entity linking, and of detecting entity-centric questions so that we only run our pipeline when it is relevant. Thank you for your attention.