Vocabulary Needed for 95% Coverage


I’ve been tinkering with ways of comparing different easy readers for language learners. In previous posts I’ve used a type-token ratio, or vocabulary density, which gives some idea of how likely you are to learn new words through repetition in a text. But for a text to be readable, the general consensus is that you need to know at least 95% of the running words. At that level, readers can usually guess the meaning of the words they don’t know.
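As a rough illustration of the two measures mentioned above, here is a minimal Python sketch. It is not the code behind the figures in these posts. The tokeniser is an assumption (lowercase, split on anything that is not a letter, so “l’homme” becomes two tokens), and the function names are made up for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def type_token_ratio(text):
    """Distinct word forms (types) divided by running words (tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

def coverage(text, known_words):
    """Fraction of running words in the text that appear in known_words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok in known_words)
    return known / len(tokens)

sample = "le chat et le chien"
ttr = type_token_ratio(sample)                  # 4 types over 5 tokens
cov = coverage(sample, {"le", "et", "chat"})    # 4 of 5 tokens are known
```

A lower type-token ratio means more repetition, and a coverage of 0.95 or more corresponds to the 95% threshold discussed above.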

So something I’ve been messing with recently is predicting the general vocabulary size needed for different beginner stories in French, assuming people know all cognates and all proper nouns. I’ve only been working with short samples of text so far, and there are many other assumptions and issues that make it not a perfect comparison – including bugs in my code…

Given a small set of extracts, and assuming you don’t learn the words via their introduction one at a time, as in my comic books, we have the following:

Title                     Vocab Size
Gnomeville Episode 1              25
Gnomeville Episode 2              25
Gnomeville Episode 3              40
Bonjour Berthe                  4179
Easy French Reader              5008
Martine a la Ferme             11854
Bonjour Luc                     6163

Note that this vocabulary size treats each conjugated form of a verb as a separate vocabulary item, and likewise plurals and other inflected forms, so the figures will be much larger than the word-family counts normally used.
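One way to arrive at a figure like those in the table can be sketched as follows. This is a hedged reconstruction, not the actual code used for the table. It assumes a dictionary `rank_of` mapping each word form to its rank in a frequency list, and a set `assumed_known` holding the cognates and proper nouns that are treated as already known. Both names are hypothetical.

```python
import math

def vocab_size_for_coverage(tokens, rank_of, assumed_known=frozenset(), target=0.95):
    """Smallest frequency rank N such that knowing the top-N word forms,
    plus everything in assumed_known, covers at least `target` of the tokens.
    Returns None if words missing from the list make the target unreachable."""
    if not tokens:
        return 0
    ranks = []
    for tok in tokens:
        if tok in assumed_known:
            ranks.append(0)                           # costs no vocabulary
        else:
            ranks.append(rank_of.get(tok, math.inf))  # not in list: never covered
    ranks.sort()
    needed = math.ceil(target * len(ranks))           # tokens that must be covered
    cutoff = ranks[needed - 1]                        # rank of the last token needed
    return None if cutoff == math.inf else cutoff

# Toy example with invented ranks, except "maman", whose rank is quoted in the post.
tokens = ["le"] * 10 + ["chat"] * 5 + ["gnome"] * 4 + ["maman"]
rank_of = {"le": 1, "chat": 500, "maman": 6163}
size_95 = vocab_size_for_coverage(tokens, rank_of, assumed_known={"gnome"})
size_100 = vocab_size_for_coverage(tokens, rank_of, assumed_known={"gnome"}, target=1.0)
```

In this toy sample, one low-frequency word in twenty still fits inside the 5% allowance, so `size_95` stays at 500, while full coverage pushes the requirement up to 6163. With short extracts, a single rare word can move the result a long way.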

You can see that the one text written for native French-speaking children (Martine) has a much richer vocabulary than the texts written for language learners. The figures for the learner texts look worse than they really are, because many words that are typically taught early to allow conversation sit much lower on word frequency lists. For example, “maman” was at rank 6163 in my list. In contrast, my Gnomeville comics are designed to prioritise frequent words and cognates to optimally improve reading, at the expense of conversation. Hence the very small vocabulary sizes required.

Recently I’ve been reading a 1939 paper by Tharp that looked at measuring vocabulary difficulty. He appears to have had similar ideas about measuring vocabulary load based on the general frequency of the words, as well as a measure of the density of difficult words. I also recently acquired yet another very early graded reader, “Si nous lisions”, from 1930, which attempted to introduce new words every ~60 running words, in the style of Michael West, who seems to have been the first to use the approach. However, I have a graded reader published in 1909 in my collection, which was intended for “rapid reading”, and was part of a series that commenced with short easy texts. I’m not sure if they methodically introduced words at specific intervals as was done by West and others following his example.

In searches on-line, I found a French adapted reader from 1790, so we’ve been at it for quite a while. I’d like to say we know more about how to write graded readers these days, but I think West had it fairly right. The only thing we can do now is make them more interesting and relevant.

Here’s one from 1800 published for those with a German background. There seem to be quite a few published in the 1800s.

Anyway, I’ll finish off here with the usual things: we need 95% coverage to read comfortably (on average). To do that with native texts requires quite a large vocabulary. But vocabulary increases as you read more. So we should read as much as possible, at the level that is right for us, of material that interests and motivates us. My Gnomeville comics are ideal first readers in French for those with an English language background and a good vocabulary in English. The Berthe and Luc et Sophie series are reasonable alternatives for children who are possibly too young for Gnomeville, as are the ELI A0 series. Until next time…



Mots fréquents français


I recently came across a new word frequency list for French words, which I’m placing here partly for my own benefit. This one is like some others that combine all conjugations of a verb together, which is not helpful for all applications. Typically present tense is much easier than other less frequently used tenses, particularly for irregular verbs.

Anyway, the list is still useful. It was created by Étienne Brunet, a statistical linguist, based on a corpus of written French.

Here are the top 20 words. Interestingly, compared to the newspaper corpus list I used for designing Episodes 1 and 2 of my comic, this corpus has first person singular (je) occurring much more frequently, as well as “have” (avoir). “ce”, “son” and “elle” also rank higher than “au” in this list, and were not in the newspaper list. “avoir” may be higher because all of its conjugations are grouped together.

1050561 le (dét.)
862100 de (prép.)
419564 un (dét.)
351960 être (verbe)
362093 et (conj.)
293083 à (prép.)
270395 il (pron.)
248488 avoir (verbe)
186755 ne (adv.)
184186 je (pron.)
181161 son (dét.)
176161 que (conj.)
168684 se (pron.)
148392 qui (pron.)
141389 ce (dét.)
139185 dans (prép.)
143565 en (prép.)
127384 du (dét.)
126397 elle (pron.)
123502 au (dét.)

List of frequent words in French.
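For anyone wanting to reuse a list in this shape, here is a minimal parsing sketch. It assumes each line looks like “count lemma (part of speech)”, as in the extract above. The function names are made up for the example. Ranks are derived by sorting on the counts instead of trusting the order of the lines, since a couple of adjacent entries in the extract (“être”/“et” and “dans”/“en”) are not in strict count order.

```python
import re

# One entry per line: a count, a lemma, then a part-of-speech tag in parentheses.
LINE = re.compile(r"^(\d+)\s+(\S+)\s+\((.+)\)$")

def parse_frequency_lines(lines):
    """Return (count, lemma, pos) tuples, skipping lines that do not match."""
    entries = []
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            count, lemma, pos = match.groups()
            entries.append((int(count), lemma, pos))
    return entries

def rank_by_count(entries):
    """Map (lemma, pos) to a 1-based rank, highest count first."""
    ordered = sorted(entries, key=lambda entry: entry[0], reverse=True)
    return {(lemma, pos): rank
            for rank, (count, lemma, pos) in enumerate(ordered, start=1)}

sample = [
    "1050561 le (dét.)",
    "351960 être (verbe)",
    "362093 et (conj.)",
]
ranks = rank_by_count(parse_frequency_lines(sample))
```

Keying on the lemma together with its part of speech keeps homographs apart, such as a determiner and a pronoun with the same spelling.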