Thanks very much for the introduction. My name is Benno Kruit. I was the supervisor of Lucas Lageweg, a master's student who did this work for his thesis, which we then turned into a paper. So I want to give all the credit to Lucas, who did all the implementation and has already presented this at the Dutch conference. Lucas is an employee of the Dutch National Institute for Statistics, the CBS (Centraal Bureau voor de Statistiek), where he is responsible for the search engine for the institute's statistical datasets. That motivated him to do this kind of research, and he's now pursuing a PhD based on this work. These slides are also mostly Lucas' work, so I want to give him most of the credit.

So what are we going to do here? The system we're presenting is called GECKO, Generative Expression Constraint Knowledge-Based Decoding for Open Data, a nice acronym that spells GECKO. What we're trying to do is query decoding.

Let me first give you an introduction to the data itself. The use case is for Statistics Netherlands, specifically the Information Dialogue team, to help users find the answers to their questions about statistics of the Netherlands more quickly. What we want to do is knowledge base question answering: the input is a question from the user, and the output is a single table cell, a single observation from one of the statistical tables of the Dutch institute, and we need to find that table itself with a query.

Now, the reason we want to do this is that we want to generate good answers to the questions users have, and we want to do this in a trustworthy way. We want to give users the exact provenance of the answer they're getting. They have to know exactly which table it came from, why they're getting this answer, and how we retrieved this information, so that they can check whether the output actually answers their question and perhaps rephrase the question if it doesn't. It also needs to scale to the large amount of data that the National Institute of Statistics has available. In short, the answers need to be justified so that users understand them.

To contrast this with another kind of approach, we can see here what happens if you search Google. This was two years ago; maybe they've changed it since. If you search for the population of Amsterdam, you get a correct answer, and the source of this data is the Dutch National Institute of Statistics. But it's put into the wrong context: in the comparison shown here of Amsterdam and Rotterdam with Utrecht, it looks like the city of Utrecht has a larger population. What's actually happening is that the city of Utrecht shares its name with the province of Utrecht, and the province, of course, has a larger population. Because this is not integrated well semantically, we get these confusing results. So the goal of this project was to actually use the correct metadata of the information that we have in order to prevent these kinds of mistakes from happening.

Our focus is on the Dutch key figures tables. Some of the datasets in the large data repository of the National Institute of Statistics are key figures tables, and we're starting with those because they cover the topics users are most frequently interested in. The data looks like this: it's a statistical table, which is slightly different from the relational tables you might be familiar with.

Here we have different concepts: the dimensions of the table, the measures, and the observations. The observations are the actual data, the content of the dataset. The measures are what is being measured, and the dimensions qualify the observations, such as the point in time, the region, and maybe things like housing characteristics, the kind of aggregates used to present the data. Remember that each observation in a table can be uniquely identified by one measure and several dimensions; together they act like a primary key. The distinction between measures and dimensions is sometimes subtle and hard to draw, and that's why their metadata is also captured in a knowledge graph.
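As a rough illustration (not the institute's actual schema), the "one measure plus one value per dimension acts as a primary key" idea can be sketched like this; all identifiers and the numeric value below are made up for the example:

```python
# A minimal sketch of how one observation in a statistical table is identified:
# one measure plus a value for every dimension together form the key.
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservationKey:
    measure: str                              # e.g. an identifier for a measure
    dimensions: tuple[tuple[str, str], ...]   # (dimension, value) pairs


# Conceptually, a statistical table maps such keys to observed values.
table: dict[ObservationKey, float] = {
    ObservationKey(
        measure="AverageGasConsumption",                     # hypothetical id
        dimensions=(("Region", "NL"), ("Period", "2010")),
    ): 1234.0,                                               # illustrative value only
}

key = ObservationKey("AverageGasConsumption", (("Region", "NL"), ("Period", "2010")))
print(table[key])
```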

Before we get to the knowledge graph, let's look at the way this data is accessed. The institute has an API to access this data, following the well-known OData standard: you provide the measure and dimensions as input, and you get the observation back. For example, here is a data point on the average gas consumption for the total number of dwellings in the Netherlands for a certain period, in this case 2010. This is how we access the data via the API.
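A hedged sketch of that access pattern is shown below. The endpoint URL, table id, and field names are placeholders for illustration, not the institute's actual OData service:

```python
# Sketch: retrieve one observation through an OData-style API by fixing the
# measure and every dimension. All names and the endpoint are placeholders.
import requests

BASE = "https://example-odata-endpoint/datasets"   # placeholder endpoint
table_id = "12345NED"                              # placeholder table id

params = {
    "$filter": "Measure eq 'AverageGasConsumption' "
               "and Region eq 'NL' and Period eq '2010'",
    "$select": "Measure,Region,Period,Value",
}
resp = requests.get(f"{BASE}/{table_id}/Observations", params=params, timeout=30)
resp.raise_for_status()
for row in resp.json().get("value", []):
    print(row)
```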

The information in the API, the structure of all this data, is also integrated into a knowledge graph. This graph makes use of the taxonomies that the institute has. The people who create the tables describe the dimensions and the measures, and this is somewhat integrated, but not perfectly. There are thousands of tables in the institute's repository, and some concepts are reused, but not always in a very predictable way: sometimes, for example, the concept of a total is reused. Mostly, though, this is a messy knowledge graph that has, for every concept identifier, a title, a description, and some labels.
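To make the per-concept metadata concrete, here is a minimal sketch using rdflib; the namespace, predicates, and literals are assumptions for illustration, not the institute's actual vocabulary:

```python
# Sketch: each concept identifier carries a title, a description, and labels.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, SKOS

CBS = Namespace("https://example.org/cbs/concept/")   # placeholder namespace
g = Graph()

concept = CBS["AverageGasConsumption"]
g.add((concept, DCTERMS.title, Literal("Gemiddeld aardgasverbruik", lang="nl")))
g.add((concept, DCTERMS.description,
       Literal("Average natural gas consumption per dwelling.", lang="en")))
g.add((concept, SKOS.altLabel, Literal("gasverbruik", lang="nl")))

for _, p, o in g.triples((concept, None, None)):
    print(p, o)
```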

The way we've structured this is that instead of writing these queries directly, we use an intermediate query representation. Our goal is to map a user question to this intermediate representation, which we call S-expressions, inspired by the decades-old structures from Lisp. Here we have an example: for a question like "how many tourists went abroad by train", the output of the model should have this kind of structure, in which the query contains the identifiers of these concepts.
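As a rough sketch of the idea (the exact grammar in the paper may differ), an S-expression can be assembled from a table identifier, a measure, and dimension values; all identifiers below are hypothetical:

```python
# Sketch: build a bracketed S-expression over a table, a measure, and dimensions.
def s_expression(table_id: str, measure_id: str, dimension_ids: list[str]) -> str:
    dims = " ".join(dimension_ids)
    return f"(QUERY {table_id} (MEASURE {measure_id}) (DIMENSIONS {dims}))"


# "How many tourists went abroad by train" might map to identifiers like these:
print(s_expression("T_Tourism", "M_TripsAbroad", ["D_Transport_Train"]))
```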

We do this in a pipeline approach in which we first perform sparse or dense retrieval of measures, tables, and dimensions, and then construct the query from the retrieved entities.

We've evaluated both sparse and dense retrieval. The dense retrieval works with a sentence transformer applied to the labels and descriptions of the measures, dimensions, and tables in the dataset, and with an inner-product embedding index to find the nearest vector neighbors.
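A small sketch of that dense-retrieval setup, using sentence-transformers and a FAISS inner-product index; the model name and the entity texts are placeholders (the actual work used Dutch models fine-tuned on the CBS domain):

```python
# Sketch: embed label + description per entity, index with inner product, search.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder model
entity_texts = [
    "AverageGasConsumption: average natural gas consumption per dwelling",
    "Period 2010: calendar year 2010",
]  # one string per measure/dimension/table (label + description)

emb = model.encode(entity_texts, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])           # inner-product index
index.add(emb)

query = model.encode(["How much gas does a household use?"],
                     normalize_embeddings=True)
scores, ids = index.search(query, 2)
print([(entity_texts[i], float(s)) for i, s in zip(ids[0], scores[0])])
```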

Once we have the retrieved candidate nodes, we go into a subgraph expansion step in which we take the connected nodes of these entities. These connected nodes are retrieved via SPARQL queries, and this gives us a set of relevant concepts that might be useful to put into the query.
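A hedged sketch of that expansion step is below; the endpoint URL and the query shape are assumptions for illustration, not the actual graph structure:

```python
# Sketch: for each retrieved entity, fetch its directly connected nodes via SPARQL.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/sparql")   # placeholder endpoint
sparql.setReturnFormat(JSON)


def connected_nodes(entity_uri: str) -> set[str]:
    sparql.setQuery(f"""
        SELECT DISTINCT ?neighbour WHERE {{
            {{ <{entity_uri}> ?p ?neighbour . FILTER(isIRI(?neighbour)) }}
            UNION
            {{ ?neighbour ?p <{entity_uri}> . }}
        }}
    """)
    results = sparql.query().convert()
    return {b["neighbour"]["value"] for b in results["results"]["bindings"]}
```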

The query itself is created in two different ways that we evaluate. One is a greedy baseline in which we construct the query from the highest-ranked entities for every category: first the table, then the measures and dimensions associated with that table.
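As a sketch of the greedy baseline (the data structures and helper functions here are illustrative, not the system's actual ones):

```python
# Sketch: pick the top-ranked table, then the top-ranked measure and one value
# per dimension group that belong to that table.
def greedy_query(ranked_tables, ranked_measures, ranked_dimensions,
                 belongs_to, group_of):
    table = ranked_tables[0]                                  # highest-ranked table
    measure = next(m for m in ranked_measures if belongs_to(m, table))
    dims = {}
    for d in ranked_dimensions:                               # best value per group
        if belongs_to(d, table) and group_of(d) not in dims:
            dims[group_of(d)] = d
    return table, measure, list(dims.values())
```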

The main research was about building a query-decoder language model that creates these queries from the identifiers that we put into a prompt together with the question. First, we add the identifiers to the token vocabulary and create fixed embeddings for the entity identifiers. This means that every entity in the knowledge graph has its own embedding, based on its description or the text associated with that entity, and that embedding becomes a token in the decoder of the language model. The decoder can then generate these identifiers as atomic tokens.
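A hedged sketch of that idea with Hugging Face transformers is shown below; the base model, the identifier format, and the choice to initialise each new embedding from its description's token embeddings are assumptions for illustration, not the paper's exact setup:

```python
# Sketch: add entity identifiers as atomic tokens and give them fixed embeddings
# derived from their description text.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")   # placeholder model
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

entity_descriptions = {  # identifier -> description text (hypothetical)
    "[ENT:AverageGasConsumption]": "average natural gas consumption per dwelling",
    "[ENT:Period2010]": "the calendar year 2010",
}

tokenizer.add_tokens(list(entity_descriptions))
model.resize_token_embeddings(len(tokenizer))

# Initialise each new token's embedding as the mean of its description's
# existing token embeddings (one simple choice), then keep it fixed.
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for ident, desc in entity_descriptions.items():
        tok_id = tokenizer.convert_tokens_to_ids(ident)
        desc_ids = tokenizer(desc, add_special_tokens=False).input_ids
        emb[tok_id] = emb[desc_ids].mean(dim=0)
```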

At query time, we perform the entity retrieval and put these identifiers in the prompt. So we have the question that the user asked and the list of candidate identifiers, which are then passed to a sequence-to-sequence model to generate the query. The query is created using constrained inference to ensure that it's syntactically correct. The S-expressions are just hierarchically nested brackets, and we ensure that only tokens associated with the chosen table are generated: first we generate the table, and then only the measures and dimensions associated with that table, so that what we generate actually exists. This is done using constrained inference with a beam search over the admissible tokens.
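Continuing the sketch above (reusing its model and tokenizer), constrained beam search can be expressed with a function that returns the admissible token ids at each step; `grammar_admissible_ids` is a stub standing in for the real grammar and table constraints:

```python
# Sketch: constrained beam search over admissible tokens.
prompt = ("Question: how many tourists went abroad by train\n"
          "Candidates: [ENT:T_Tourism] [ENT:M_TripsAbroad] [ENT:D_Transport_Train]")


def grammar_admissible_ids(generated_ids):
    # Stand-in: a real implementation would consult the S-expression grammar and
    # the chosen table's measures/dimensions; here we allow the full vocabulary.
    return list(range(len(tokenizer)))


def allowed_tokens(batch_id, generated_ids):
    # Called at every decoding step; return only token ids that keep the partial
    # output well formed (first a table id, then that table's measures/dimensions).
    return grammar_admissible_ids(generated_ids)


inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4,
                         prefix_allowed_tokens_fn=allowed_tokens)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```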

To encode the text into vector representations, we evaluated several pre-trained language models. Most of this work was done one and a half years ago, so the language models we had back then were RoBERTa-style models. Since this research is about Dutch data, we needed Dutch language models, which were also fine-tuned on the CBS domain.

We compared two different ones, a RoBERTa model and a GroNLP model. These were fine-tuned on 1,200 hand-labeled samples, labeled by domain experts at the Institute of Statistics, and training went fairly well; the models seemed to converge.

The dense vectors and the sparse vectors were actually fairly comparable in retrieval performance on the in-domain dataset.

The generated expressions on the in-domain dataset also gave usable results. However, we see that the sparse baseline is better, and especially the greedy query generation works better than the pre-trained language model. The fine-tuning of the language model was only done on 1,200 examples, and it seems that this might not be enough data, especially to generalize to an out-of-domain dataset.

We made a split between the tables: the training and test sets of question-answer pairs were already different, but for the out-of-domain dataset we had a whole new set of tables to retrieve observations from. This out-of-domain data was much harder to retrieve from with the trained models; they seemed to overfit on the in-domain data, and BM25 retrieval with greedy decoding worked much better there. This is unfortunate, because we'd like to support the use case of adding new datasets without needing to train on them.

Concluding: the sparse retrieval and baseline method is better in the current setup. The dense retrieval was hard to scale up to a larger number of tables, and the approach we used for the large language model was not easy to train on the small amount of data we had.

To recap the main challenges: we wanted to generate good answers, but with this large language model that didn't go very well, especially not out-of-domain. The baseline approach did work fairly well. The system did not hallucinate, because the design ensured that we only retrieved data that was actually there. It wasn't very scalable to a larger number of tables, but thanks to the query setup, the answers we got back were well justified.

We've already expanded this work, and it's under review at another conference. We've worked on entity re-ranking, combining the sparse and dense search with other retrieval models. We've increased the training data significantly, as we're now able to use large language models to generate very realistic artificial data. We've also investigated fine-tuning larger language models. There's still more work to do, such as handling more complex S-expressions with aggregations and determining whether a question can be answered at all: sometimes the answer is simply not in the data that the Dutch National Institute of Statistics has, so we want to be able to indicate when a question is impossible to answer. Lucas is going to do a PhD, so I expect very interesting results in the future. Thank you very much.