## Teach Me How to Dutchie

Lately I’ve been learning Dutch. It’s been a lot of fun, and I’m currently at the point where I can read a newspaper article with a dictionary and mostly find my way. Some elements of Dutch, like the slippery word order, are basically intractable and are simply going to have to be internalized over time with practice. There is one element, though, that I’ve become annoyed enough with to do something about: vocabulary. Since I’m really very lazy at heart, I wondered:

**What’s the optimal way to learn vocabulary?**

Knowledge of vocabulary doesn’t confer knowledge of a language, far from it, but you obviously can’t know a language without knowing vocabulary. Moreover, facility with vocabulary should help the language learner to pick up new grammatical constructions and idioms from context as well as to reduce overall cognitive burden. Most importantly for the purposes of this blog post, vocabulary is easily quantifiable: you know it or you don’t.

# Spaced Repetition

It turns out that we know a lot about how to efficiently study knowledge like vocabulary. The process of forgetting is well-modeled by simple exponential decay, by which I mean that the probability $p(t)$ of successfully recalling an item at time $t$ can be adequately captured by:

$p(t)=e^{-\lambda t}$

for some rate $\lambda > 0$. If we are bad at remembering the item,
$\lambda$ is high; if we are good, low. The spacing
effect tells us, as a
matter of empirical psychology, that the best way to form a long-term
memory is to recall the desired item just as you’re about to forget
it. This leads straightforwardly to the conclusion that, given a
fixed allotment of time for study, the most effective study regime is
one that *spaces* repetitions of the material. The superior efficacy
of spaced repetition is robust and well-known to memory psychologists;
see Gwern for a nice
overview.
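To make the forgetting model concrete, here is a minimal sketch in Python: given the learner's rate $\lambda$ for an item, it computes the recall probability at time $t$ and solves for the review time at which recall drops to a chosen threshold. The threshold of 0.9 is an illustrative choice, not something prescribed by the model.

```python
import math

def recall_probability(lam: float, t: float) -> float:
    """Probability of recalling an item t time units after the last
    review, under the exponential forgetting model p(t) = exp(-lam*t)."""
    return math.exp(-lam * t)

def review_time(lam: float, threshold: float = 0.9) -> float:
    """Time at which recall drops to `threshold`: solve
    exp(-lam * t) = threshold for t."""
    return -math.log(threshold) / lam

# A learner with a high forgetting rate must review sooner:
print(review_time(lam=0.5))   # weak memory: review early
print(review_time(lam=0.05))  # strong memory: review ten times later
```

Note that halving $\lambda$ exactly doubles the review interval, which is why successful reviews (which lower $\lambda$) let the spacing grow over time.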

Incidentally, the fact that spaced repetition (a) definitely works and (b) probably bears no relation to anything you did in school is something to think about.

Spaced repetition is exactly the sort of task that one would prefer to automate as much as possible: given a collection of material to be learned, one would like to have a machine that creates a model of the learner and schedules the material so as to maximize the learner’s long term recall. There is a lot of software out there that promises to do roughly that; I’ve heard good things from many people about Anki.

In the specific case of language learning, however, there’s a statistical (or really, decision-theoretic) issue that I’ve never seen addressed. To appreciate it, we first need to consider the zeta distribution, better known as Zipf’s Law.

# Zipf’s Law

Words in a language are not used with equal frequency. Some words are
quite frequent, and others extremely rare. As it happens, words in
all known human languages obey a *Zipf distribution*, in which the
frequency of a word is inversely proportional to some power of its
rank. More formally we have:

$p(w)\propto \frac{1}{r(w)^\alpha}$

where $p(w)$ is the probability of observing word $w$ with rank $r(w)\geq 1$, and $\alpha$ controls the rate of decay. To normalize this distribution, we would sum:

$Z = \displaystyle\sum_{i=1}^N\frac{1}{i^\alpha}$

which is just the generalized harmonic number $H_{N,\alpha}$. If the number of words is infinite (mathematicians, always thinking ahead…) then the series is still convergent for $\alpha > 1$. This series is so central in mathematics that it has its own name: the zeta function, written $\zeta(\alpha)$. Hence this distribution is often known as the zeta distribution even though its most celebrated application, Zipf’s law, pertains to languages with a finite number of words. It’s also the discrete counterpart to the Pareto distribution, with which it is often confused.
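A quick Python sketch of the finite case: compute the normalizer $Z = H_{N,\alpha}$ by direct summation and turn the rank weights into probabilities. The vocabulary size of 50,000 and $\alpha = 1$ are illustrative choices.

```python
def zipf_probabilities(n_words: int, alpha: float) -> list[float]:
    """Probabilities of ranks 1..n_words under a finite Zipf
    distribution: p(r) proportional to 1 / r**alpha."""
    weights = [1.0 / r**alpha for r in range(1, n_words + 1)]
    z = sum(weights)  # the generalized harmonic number H_{N, alpha}
    return [w / z for w in weights]

probs = zipf_probabilities(n_words=50_000, alpha=1.0)
# With alpha = 1, probability is exactly inverse to rank:
print(probs[0] / probs[999])  # rank 1 is 1000x as likely as rank 1000
```

The heavy head is the whole story here: a few hundred ranks carry most of the probability mass, which is precisely what makes the frequency-weighted study goal below interesting.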

The upshot of all this is that words in a language are zeta-distributed, and whenever someone mentions “Zipf” or “zeta” or “Pareto”, you really just want to think “frequency goes inversely with a power of rank” and work the details out at your leisure.

# Spaced Repetition à la Zipf

What import does Zipf’s law hold for language learning? First, consider the following learning goal:

**Maximize the probability of recalling a word drawn at random from
a natural corpus.**

Formally, we want to maximize $\mathrm{E}_{w\sim \mathcal{Z}}[p(w)]$, where $\mathcal{Z}$ denotes the natural zeta distribution for words in the target language. This seems like a plausible utility function for study: it simply formalizes the truism that words that occur more often are more important to know, and assumes the simplest possible relationship between frequency and utility that could make that so. Knowing that words are zeta-distributed allows us to determine exactly how much effort we should spend memorizing the 10,000th word versus the 10,001st.
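The expectation is easy to compute once we posit per-word recall probabilities. The sketch below assumes a deliberately crude recall model (the learner knows the first $k$ ranks perfectly and nothing else) just to show how Zipf weighting enters; real recall probabilities would come from the forgetting model above.

```python
def expected_recall(recall: list[float], alpha: float) -> float:
    """E[p(w)] for w drawn from a finite Zipf distribution:
    each word's recall probability, weighted by its Zipf probability.
    recall[i] is the learner's recall chance for the word of rank i+1."""
    weights = [1.0 / r**alpha for r in range(1, len(recall) + 1)]
    z = sum(weights)
    return sum(w * p for w, p in zip(weights, recall)) / z

# Crude model: perfect recall of the 1,000 most frequent words out of
# 50,000, zero recall of the rest.
known = [1.0] * 1_000 + [0.0] * 49_000
print(expected_recall(known, alpha=1.0))
```

Even this tiny head of the vocabulary already yields an expected recall of roughly two thirds, which is the quantitative version of the claim that frequent words matter most.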

This seems like a worthy goal, but what are we maximizing the
expectation *over?* We are maximizing over *training regimens*, which
we can think of as functions of the type:

$f:(W,T,B)^Q \rightarrow (W,T)$

The idea is that the training regimen takes the learner’s training history of $Q$ questions and answers $(W,T,B)^Q$, listing all previously presented words, the times at which they were presented, and the binary responses (correct or incorrect), and returns a word to be presented at some future time $(W,T)$. The training regimen also takes data about words and their ranks as a fixed parameter. Exercise: must the optimal strategy be greedy? If not, the optimization problem isn’t well-posed until we specify some bound on the amount of training left.
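As a type-level sketch, here is what a regimen of type $(W,T,B)^Q \rightarrow (W,T)$ looks like in Python. The scheduling rule inside is a placeholder of my own invention (re-ask the last missed word, otherwise introduce the next rank), not a claim about what the optimal regimen does; the point is only the shape of the function.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One entry in the training history (W, T, B)."""
    word: int      # W: the word, identified by its frequency rank
    time: float    # T: when it was presented
    correct: bool  # B: the learner's binary response

def regimen(history: list[Event], now: float) -> tuple[int, float]:
    """(W,T,B)^Q -> (W,T): pick the next word and presentation time.
    Placeholder rule: reschedule the most recently missed word for
    immediate review; otherwise introduce the next unseen rank."""
    for event in reversed(history):
        if not event.correct:
            return (event.word, now)
    next_rank = 1 + max((e.word for e in history), default=0)
    return (next_rank, now)

# Start from an empty history, then after one miss:
print(regimen([], now=0.0))                          # introduce rank 1
print(regimen([Event(3, 1.0, False)], now=2.0))      # re-ask rank 3
```

Any serious regimen would also maintain a per-word estimate of $\lambda$ and weight its choices by Zipf rank; this skeleton just fixes the interface.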

So, I think that’s enough to think about in one sitting. In the next post we’ll consider how to construct such a training regimen, and see how it works in practice.