
Details: how does it work?


Let's start with an example: imagine you are working on a web-app, and you open a file called "BillCustomer.java" in a package "Billing", with a neighboring package "Shipping". The moment you open that file, you'd like an accurate language model there for code completion, style suggestions, bug detection, you name it. The best model is one that really gets the context that you are working in, but what is that context? Firstly, the file itself of course; it's as specific as it gets. But it's not a lot of data for a good model; let's add all the other files in "Billing" as a second tier. After that, maybe all the files in "Shipping", then the rest of your system, and finally maybe all the other files in the world. Just like that, you have a five-tiered language model, from local to global, from specific and small, to general and huge.
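To make the nesting concrete, here is a minimal Java sketch of those five tiers for the example above; the class name and the paths are purely illustrative and not part of the actual tool.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: the five nested localities for BillCustomer.java, from most
 *  specific (the file itself) to most general (all other code in the world). */
public class NestedContext {
    public static void main(String[] args) {
        List<String> tiers = Arrays.asList(
                "Billing/BillCustomer.java",  // 1. the open file: tiny, but most relevant
                "Billing/",                   // 2. the rest of its package
                "Shipping/",                  // 3. the neighboring package
                "<rest of your system>",      // 4. everything else in the project
                "<all other code>"            // 5. the global tier
        );
        // One count-based model per tier; at prediction time they are mixed global-to-local.
        tiers.forEach(tier -> System.out.println("model for: " + tier));
    }
}
```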

But normal language models can't do this! They are usually trained once, often with limited dynamic updating, and they don't mix well either. We take a different route: we create models on demand for every level of context and mix them with an algorithm that gives priority to the most local ones. That doesn't rule out mixing with static models either: we found that mixing in an LSTM can further boost performance, and we hope to integrate such models right into the library soon.

So how exactly does the nested model get its probabilities?

The nested model (the `-n` flag on the Jar) builds models on the fly, one for each locality (think: file, package, super-package, etc.), and mixes them in such a way that the most local model (often a cache on the file itself) gets the final say. Concretely, given a sequence, every model computes two things: a probability p (calculated directly from its counts using MLE) and a confidence λ in that probability, typically based on how much of the sequence it has seen. For instance, if the context is `Bill bill = customer.`, you (and our local models) can probably guess that the next token is something like `getBill(...)`. The local models may have seen this whole context, so their λ will be high, and our most local model will probably also predict `getBill` with high probability, for instance if it has been used in the same file. Our global model might never even have seen `customer.`, so it reports a low confidence and leaves the biggest say to the local models. On the other hand, the global model may have seen rare patterns that never occur locally, so it is definitely still helpful.
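As a rough illustration of what each such model computes, here is a hypothetical Java sketch (not SLP-Core's actual API): it estimates p by MLE over the longest context suffix it has counted, and derives λ from the length of that suffix. The specific formula λ = 1 − 2⁻ⁿ is just one choice that happens to match the made-up values in the table below.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of one locality's count-based model: MLE probability plus a confidence.
 *  Names and formulas are illustrative, not SLP-Core's actual API. */
class ScopeModel {
    // nextTokenCounts.get(context).get(token): how often `token` followed `context` in this scope
    private final Map<List<String>, Map<String, Integer>> nextTokenCounts = new HashMap<>();

    /** Record one occurrence of `token` following `context` in this scope. */
    void count(List<String> context, String token) {
        nextTokenCounts.computeIfAbsent(context, c -> new HashMap<>())
                       .merge(token, 1, Integer::sum);
    }

    /** Returns {p, lambda}: p is the MLE estimate from the longest context suffix this
     *  scope has seen; lambda grows with the length of that suffix. */
    double[] probabilityAndConfidence(List<String> context, String token) {
        for (int start = 0; start <= context.size(); start++) {
            List<String> suffix = context.subList(start, context.size());
            Map<String, Integer> followers = nextTokenCounts.get(suffix);
            if (followers == null) continue;                     // back off to a shorter context
            int total = followers.values().stream().mapToInt(Integer::intValue).sum();
            double p = followers.getOrDefault(token, 0) / (double) total;
            double lambda = 1.0 - Math.pow(2, -suffix.size());   // longer match -> more confident
            return new double[] { p, lambda };
        }
        return new double[] { 0.0, 0.0 };                        // this scope knows nothing useful
    }
}
```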

Let's visualize the probability calculation for the example above (p(`Bill bill = customer.getBill`)) in a table with some (made-up but not unrealistic) values for p and λ. For instance, the Cache may get its 90% probability (p=0.9) by seeing `getBill` follow the context `Bill bill = customer.` 9 out of 10 times in the current file, and may get its 93.75% confidence (λ=0.9375) from having seen the preceding four tokens. These p and λ values are then mixed from global to local (here from left to right):

| Model    | p / λ         | Merge Global/Shipping | Merge Shipping/Billing | Merge Billing/Cache |
|----------|---------------|-----------------------|------------------------|---------------------|
| Global   | 0.05 / 0.5    |                       |                        |                     |
| Shipping | 0.15 / 0.75   | 0.117 / 0.75          |                        |                     |
| Billing  | 0.70 / 0.875  |                       | 0.506 / 0.875          |                     |
| Cache    | 0.90 / 0.9375 |                       |                        | 0.767 / 0.9375      |

As you can see, the most local model gets the final say, and that works out well: it is as confident as it is accurate. Even so, it doesn't get the only say (we lose some probability compared to the cache alone), which is okay: the more global models "pay it back" at times when the local model is the uncertain one.
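For concreteness, the sketch below shows one merge rule that is consistent with the numbers above: each λ is turned into a weight 1/(1 − λ) (so 0.5 → 2, 0.75 → 4, 0.875 → 8, 0.9375 → 16), the probabilities are averaged with those weights, and the merged pair keeps the more local confidence. The method name and the exact weighting are assumptions for illustration, not necessarily the library's implementation, and rounding may differ slightly from the table.

```java
/** Sketch of the global-to-local merge; illustrative, not SLP-Core's actual code. */
public class MixSketch {
    /** Combine a more global (p, λ) with a more local (p, λ). */
    static double[] merge(double pGlobal, double lGlobal, double pLocal, double lLocal) {
        double wGlobal = 1.0 / (1.0 - lGlobal);  // e.g. λ = 0.5    -> weight 2
        double wLocal  = 1.0 / (1.0 - lLocal);   // e.g. λ = 0.9375 -> weight 16
        double p = (wGlobal * pGlobal + wLocal * pLocal) / (wGlobal + wLocal);
        return new double[] { p, lLocal };       // the pair keeps the more local confidence
    }

    public static void main(String[] args) {
        double[] mix = merge(0.05, 0.5, 0.15, 0.75);   // Global + Shipping -> p ≈ 0.117
        mix = merge(mix[0], mix[1], 0.70, 0.875);      // ... + Billing     -> p ≈ 0.506
        mix = merge(mix[0], mix[1], 0.90, 0.9375);     // ... + Cache       -> p ≈ 0.77
        System.out.printf("final: p = %.3f, lambda = %.4f%n", mix[0], mix[1]);
    }
}
```

Rounding aside, this reproduces the progression in the table: the cache dominates, but the more global tiers still pull the estimate slightly toward what they have seen.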

How do you update these models without flooding my memory?

The key to efficiency in our many-models approach is that every model is count-based and estimates probabilities from those counts at run-time. It can thus act like a regular language model while allowing its counts to be updated at will. We can't do bit-twiddling here (like the excellent KenLM does) since our models must always be ready for updates. To make it all feasible, each model internally uses a trie data-structure, which is one of the most space-efficient options that still allows rapid updates. We added some performance enhancements, like storing unique events as simple arrays, and our newest giga-counter reduces overhead from Java's garbage collector by shelving many sub-counters while training.
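As a rough idea of what such an updatable count structure looks like, here is a deliberately simple trie sketch in Java. SLP-Core's real counters are far more optimized (array-backed storage, the giga-counter, etc.), so treat this only as an illustration of "counts you can change at any time".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a count trie: one node per n-gram prefix, counts updatable at any time.
 *  Illustrative only; SLP-Core's actual counters are heavily optimized. */
class CountTrie {
    private int count = 0;                                  // how often this prefix was seen
    private final Map<String, CountTrie> children = new HashMap<>();

    /** Add (delta = +1) or remove (delta = -1) one occurrence of a token sequence. */
    void update(List<String> tokens, int delta) {
        count += delta;
        if (tokens.isEmpty()) return;
        children.computeIfAbsent(tokens.get(0), t -> new CountTrie())
                .update(tokens.subList(1, tokens.size()), delta);
    }

    /** Raw count of a sequence, used for MLE probabilities at prediction time. */
    int count(List<String> tokens) {
        if (tokens.isEmpty()) return count;
        CountTrie child = children.get(tokens.get(0));
        return child == null ? 0 : child.count(tokens.subList(1, tokens.size()));
    }
}
```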

When else would I update my model?

Lots of reasons! Any time you edit a file, pull a new commit, or switch to another package, your development context changes. We found that software developers are really good at grouping similar files into modules, packages, etc., and we want to reward that by teaching our models to capture exactly that structure. At the same time, some models, like LSTMs, are rather static and still give amazing performance. We have already found that combining those models with (nested) n-grams helps both sides out, and we will integrate this into our tool as well.
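Continuing the hypothetical CountTrie sketch above, keeping a model in sync with such context changes can be as simple as subtracting the old counts and adding the new ones; the helper below is illustrative only.

```java
import java.util.List;

/** Illustrative helper: keep a count-based model in sync with an edited file. */
class DynamicUpdate {
    /** Un-count the n-grams of the stale file version, then count the fresh version. */
    static void onFileChanged(CountTrie model,
                              List<List<String>> oldNgrams,
                              List<List<String>> newNgrams) {
        for (List<String> ngram : oldNgrams) model.update(ngram, -1);
        for (List<String> ngram : newNgrams) model.update(ngram, +1);
    }
}
```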