Hi, my name is Benno Kruit from the Vrije Universiteit in Amsterdam, and I will be presenting minimalist entity disambiguation for mid-resource languages.

The task that we're going to look at is named entity disambiguation. In this example, the user is asking an assistant what the average gas mileage of a Lincoln is. The assistant detects that Lincoln is an entity and sends this to the disambiguation module. The named entity disambiguation module should predict that the user here means "Lincoln Motor" and not "Abraham Lincoln" or "Lincoln, Nebraska" in order to be able to answer the user's question correctly. We are interested in creating small named entity disambiguation modules using a minimalist approach.

Why do we want to do this? We observe that named entity disambiguation models are often very large. Some very popular multilingual models from DBpedia Spotlight can grow up to 2 gigabytes for the English model. The current state-of-the-art models are neural, and one of the top-performing ones is mGENRE, a multilingual model that is 11 gigabytes. The state of the art in English is currently Bootleg, especially on entities from the tail of the distribution. This model is five gigabytes. And note that these are compressed sizes. To use them in practice, we need the uncompressed versions, which are even bigger. This makes them very unwieldy to work with for practitioners.

But that's not the only problem. It would be useful to be able to tune these models per use case and language. Practitioners might already know whether they want to aim for the head or the tail of the distribution, that is, whether they're interested in popular or less popular entities. They might also already know which domains are most important for them, for example, which domain their data comes from. Or they might have some simplifying assumptions for their named entity disambiguation system, for example, that they're only interested in entities of a certain granularity. Furthermore, the available training data determines the strategy for tuning these models. In some situations, there's lots of training data available.

In this figure, we can see the number of pages on Wikipedia in different languages, and the number of users and edits. Every language has its own pattern. This also determines the data quality of the training data. Given all of these assumptions, it would be nice to have small entity disambiguation models, because these are flexible and sustainable to tune.

So how do we go about creating small named entity disambiguation models? Let's first look at some observations on benchmark data. We've looked at Mewsli-9, which is a new dataset for named entity disambiguation in multiple languages. This dataset is created from articles from Wikinews. It consists of more than 50,000 non-parallel articles written between 2010 and 2019 by volunteers. The volunteers wrote articles containing hyperlinks to Wikipedia, and we can convert these to annotations on the dataset. It exists in nine languages, including English, but we replaced English with Dutch because we're focusing on mid-resource languages here.

So we're looking at the situation in which we're training on Wikipedia and testing on Mewsli-9. Now let's count the mentions. Mentions are name-entity pairs that occur in training and testing, and we want to know how often the mentions in the test data occur in training. This differs between languages, but overall these mentions occur between 10 and 10,000 times. This means that even the long tail has lots of data, more than enough to train the disambiguation on. We don't necessarily have to resort to few-shot or zero-shot learning.
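As a minimal sketch of this counting, assuming mentions are stored as (surface name, entity) pairs, the idea could look like the following; the data and variable names here are illustrative, not the actual pipeline:

```python
from collections import Counter

# Illustrative sketch: count how often each (name, entity) mention pair from
# the test set was already seen in training. The tuple format is an assumption.

def mention_frequencies(train_mentions, test_mentions):
    """Both arguments are iterables of (surface_name, entity_id) pairs."""
    train_counts = Counter(train_mentions)
    # 0 means the pair is unseen in training and would need zero-shot handling.
    return [train_counts[(name, entity)] for (name, entity) in test_mentions]

# Toy example:
train = [("Lincoln", "Lincoln_Motor"), ("Lincoln", "Abraham_Lincoln"),
         ("Lincoln", "Abraham_Lincoln"), ("London", "London_UK")]
test = [("Lincoln", "Lincoln_Motor"), ("London", "London_Ontario")]
print(mention_frequencies(train, test))  # [1, 0]
```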

We can make other observations by looking at the training rank of test mentions. What we want to know is: when we find a mention in the test data, was that the most popular mention in the training data, or was it less popular? For example, we might find the word "London" in the test data referring to London in Canada. This is not the most popular entity for this name. This is what we call a shadowed entity name, in which London, Canada is shadowed for the name "London" by the more popular meaning, London in the UK. If we only predict the most popular entity for a certain name, we already get very high scores. This is the blue and the orange area in this graph added together.
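To make this concrete, here is a small sketch of the most-popular-entity baseline and the shadowed/unseen distinction, again with made-up names and under the assumption that mentions are (name, entity) pairs:

```python
from collections import Counter, defaultdict

# Sketch of the most-popular-entity baseline and how test mentions split into
# "most popular", "shadowed", and "unseen". Data and identifiers are made up.

def most_popular_entity(train_mentions):
    """For every surface name, return the entity it most often links to in training."""
    per_name = defaultdict(Counter)
    for name, entity in train_mentions:
        per_name[name][entity] += 1
    return {name: counts.most_common(1)[0][0] for name, counts in per_name.items()}

def mention_category(baseline, name, gold_entity):
    if name not in baseline:
        return "unseen"          # name never observed in training
    if baseline[name] == gold_entity:
        return "most popular"    # the baseline already gets this one right
    return "shadowed"            # correct entity is hidden behind a more popular one

baseline = most_popular_entity([("London", "London_UK"), ("London", "London_UK"),
                                ("London", "London_Ontario")])
print(mention_category(baseline, "London", "London_Ontario"))  # shadowed
```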

There are also mentions in the test data that we never observe in training, for example, the name London referring to somebody called Jane London. We cannot hope to handle these situations without zero-shot learning. This means that we have a very narrow margin for learning a named entity disambiguation model that performs better than the baseline: we can only gain in the shadowed situations. This changes a bit when we start stemming the names, that is, removing the inflections. Some languages are heavily inflected, and when we remove the inflections, we match more names, which makes the names more ambiguous but reduces the unseen cases. Based on this, we can build the approach.
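As an illustration of the stemming step, one could use an off-the-shelf Snowball stemmer, as in the sketch below; the talk does not specify the exact stemmer, so this is just one possible setup, with Dutch as the example language:

```python
from nltk.stem.snowball import SnowballStemmer

# Illustrative only: collapse inflected surface forms of a name onto a shared
# stemmed key. The stemmer choice and the example language are assumptions.
stemmer = SnowballStemmer("dutch")

def stem_name(name):
    # Stem each token so inflected variants of the same name map to one key.
    return " ".join(stemmer.stem(token) for token in name.split())

print(stem_name("Utrechtse wijken"))  # inflected forms collapse onto one key
```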

First, we collect and filter candidate mentions and clean these with heuristics. For example, we get rid of situations in which a hyperlink says "click here" or something like that. Then, we can make a small model by only selecting the most popular mentions. Here we have a trade-off between the simplicity of the model and its accuracy, but you can see that, for example, if we choose the top 25% of most frequently occurring mentions, we already get a recall between 55 and 85%, depending on the language. We can learn to disambiguate these situations, and if a mention does not occur in the model, we fall back to the most popular entity for that name, which is the baseline.
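A minimal sketch of this selection-plus-fallback logic, assuming the same (name, entity) pair format as above and an already trained disambiguation model passed in as a callable, could look like this:

```python
from collections import Counter

# Sketch of quantile-based candidate selection with a fallback to the
# most-popular-entity baseline. The threshold, names, and the trained_model
# callable are illustrative assumptions, not the actual implementation.

def select_top_mentions(train_mentions, quantile=0.25):
    """Keep only the most frequent (name, entity) pairs, e.g. the top 25%."""
    ranked = Counter(train_mentions).most_common()
    keep = ranked[: max(1, int(len(ranked) * quantile))]
    return {pair for pair, _ in keep}

def predict(name, context, selected, baseline, trained_model):
    # Use the trained disambiguator only for names covered by the small model;
    # otherwise fall back to the most popular entity for that name.
    if any(selected_name == name for (selected_name, _) in selected):
        return trained_model(name, context)
    return baseline.get(name)
```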

The statistical model that we train is very simple: it's just logistic regression with bag-of-words features. Technically, the model that we use is just Vowpal Wabbit. We use a ranking loss to learn the association between the context, the name, and the entity that should be predicted. Vowpal Wabbit uses feature hashing. This is a trick to train very small models: instead of storing a parameter for every feature and entity combination, we store parameters for their hash. This leads to hash collisions, but these are ignored, so multiple features might share a parameter. This works as regularization and can prevent overfitting. However, it introduces a trade-off between space (and speed) and accuracy. The hyperparameter that we can tune is the number of bits of the hash function, which determines how big the model is.
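To show the feature-hashing idea itself (not our actual Vowpal Wabbit ranking setup), here is a small sketch using scikit-learn's hashing vectorizer, where the number of bits fixes the model size up front and colliding features simply share a parameter:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Conceptual sketch of feature hashing with a plain logistic regression;
# the real system uses Vowpal Wabbit with a ranking loss. Data is toy data.
bits = 18                                    # illustrative; the talk varies this
vectorizer = HashingVectorizer(n_features=2 ** bits, alternate_sign=False)

contexts = ["average gas mileage of a Lincoln", "Lincoln was born in Kentucky"]
entities = ["Lincoln_Motor", "Abraham_Lincoln"]

X = vectorizer.transform(contexts)           # bag-of-words hashed into 2**bits slots
clf = LogisticRegression(max_iter=1000).fit(X, entities)
print(clf.predict(vectorizer.transform(["what mileage does a Lincoln get"])))
```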

Here we plot the normalized F1 score for all of the languages. The line is the normalized average performance, and the area is a 95% confidence interval over all of our languages. When we look at the bottom left corner, we see the model trained on all possible candidates at different sizes, that is, the model where we're not stemming names. We can see that there are strong diminishing returns on model size in basically every situation. We can make the model sixteen times as big by going from 24 to 28 bits, but we hardly gain any performance, so it might not be worth it.

Additionally, we can see that quantile-based candidate selection allows for much smaller models. When we look at the right, we see that if we only keep the 25% most popular candidates, we can train a model that is much smaller, for example 16 megabytes, and still beats the baseline, especially if we fall back to the baseline for the names that we've not trained on. Finally, the effect of stemming is mixed: it helps with some languages, but not always, and we can also see that it leads to overfitting for models that are very large.

When we look at the results per language, we can see that most models are able to beat the baseline, and that the effect of stemming is mixed, but there's clear room for improvement. After all, we're only using very simple bag-of-words features to do the classification. We can also see that the optimal parameters differ per language.

This is another argument for tuning per language, and then it's useful that the models are small, so we can try lots of different options.

Finally, because we're using a very simple linear model, we can look at the model parameters and understand what the model is doing under the hood. Here, for example, we're disambiguating the word 'Utrecht', which could be a city, a province, a place in South Africa, a university, or a football club, and we can see which parameters are most important. For the city, it's important that the name Utrecht means the city only when the word "province" is not in the context. And when we see the word "geography", it's very likely that Utrecht refers to the province. Similarly, when we see the word "club", it's probably about the football club. When we see weird features here, this might mean that there's noise in the data. So this explanation is useful for debugging the data and training better models, not by changing the model but by changing the data.
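As a rough sketch of how such an inspection could be done, assuming a scikit-learn-style linear classifier with a named feature space (for a hashed model one would instead map hash slots back to the feature strings observed in the data):

```python
import numpy as np

# Hedged sketch: list the bag-of-words features with the largest absolute
# weights for one entity class of a linear model. Assumes clf.coef_ has shape
# (n_entities, n_features) and feature_names maps column index to a word.

def top_features(clf, feature_names, entity_index, k=5):
    weights = clf.coef_[entity_index]
    order = np.argsort(np.abs(weights))[::-1][:k]
    return [(feature_names[i], float(weights[i])) for i in order]

# Hypothetical usage: a large negative weight for "province" on the class
# Utrecht-the-city would match the explanation given above.
```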

In conclusion, this work tried to find out how much we really need to perform named entity disambiguation in an acceptable way. We use very simple features and get acceptable models of about 256 megabytes, but with better features, not just bag-of-words features, we would probably be able to get higher performance; this is future work. We're trying to find out what we can leverage: perhaps we can use features of the entities, such as their type, to train better models. We've also observed that which language we're looking at makes a big difference, and that inflection has a large effect.

In the future, we'd like to look at evaluating the robustness of these models, for example, how good they are at popular or unpopular entities. We'd also want to look at hash word embeddings, which are very small embedding models. Finally, if we perform feature selection, the models might become even smaller. The code and the data are on GitHub; this is an ongoing project and will be available soon for practitioners. If you have any questions, don't hesitate to contact me. Thank you.