Gnomeville Comics are Easier than I Thought

On reviewing my readability measure results for various items in my collection, I suddenly thought, “hang on, how can the expected vocabulary size for Gnomeville Episode 1 be 25 when only 12 very frequent words are introduced?” Clearly something had gone wrong somewhere.

I blame the fact that part of my analysis is manual, and I probably didn’t follow the procedure very well. I run various scripts to produce a list of the words in the text, ranked by their frequency in a large corpus of written French (mostly from Project Gutenberg). The manual bit is counting up cognates, or rather starting at the least frequent end of the list and counting until the words that are not cognates or names add up to 5% of the text. I think I went astray previously by having a less reliable process.
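For anyone who wants to automate the counting step, here is a minimal sketch of one way to read that procedure. It is not my actual scripts. The function name, the toy word list and every rank except "se" at 25 are made up for illustration, and it assumes the cognate and name decisions have already been made by hand.

```python
import math

def expected_vocab_size(tokens, rank, freebies, unknown_percent=5):
    """Smallest corpus rank V such that all but `unknown_percent` percent
    of the tokens are either in the top V corpus words or in `freebies`
    (cognates and names, assumed to need no learning)."""
    # Ranks of the tokens a learner actually has to know. Words missing
    # from the corpus list are treated as rarer than everything in it.
    needed = sorted(
        (rank.get(t, math.inf) for t in tokens if t not in freebies),
        reverse=True,
    )
    # How many tokens may stay unknown (integer arithmetic, rounds down).
    allowance = len(tokens) * unknown_percent // 100
    # Skip the rarest `allowance` tokens; the next one sets the size.
    remaining = needed[allowance:]
    return remaining[0] if remaining else 0

# Toy 20-token extract (hypothetical, not taken from the comic).
tokens = ["le", "gnome", "est", "dans", "la", "village", "et", "le",
          "dragon", "est", "de", "la", "montagne", "le", "gnome", "a",
          "un", "se", "le", "et"]
# Hypothetical corpus ranks, apart from "se" at 25.
rank = {"le": 1, "de": 2, "la": 3, "et": 4, "un": 5,
        "a": 6, "est": 7, "dans": 8, "se": 25}
# Cognates and names, decided manually.
freebies = {"gnome", "village", "dragon", "montagne"}

print(expected_vocab_size(tokens, rank, freebies))  # 8
```

In this toy extract the single "se" falls inside the one-token allowance (5% of 20), so the size comes out as 8 rather than 25. Swap "dans" for a second "se" and the result jumps to 25.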

Results can differ depending on decisions that are made, such as whether to include titles (which I treat as sentences), whether to include the “Présentation” section that has brief notes about each character, and what is counted as a cognate. It is reasonably clear-cut for Gnomeville, but for other texts it is less clear. Should “habiter” be considered a cognate due to its similarity to “inhabit”? And there are other words that are cognates in the linguistic sense but not particularly obvious from a learner’s perspective.

The choice of general frequency list will also make a difference. Spoken text has different characteristics to written text, especially in French. Also, the very frequent words used for Episodes 1 and 2 are the 20 most frequent in French newspapers, which is not the same set of words as the top 20 of any other corpus. The corpus I use for calculating expected vocabulary size has some of those words at lower ranks (“se” at 25, “au” at 31, and “on” at 40), which explains why there was the potential for the expected vocabulary size to be larger than the number of words introduced. But unless those words made up about 5% of the extract, the episodes were unlikely to receive such high scores.

Anyway, on revisiting my incorrect assessments of the Gnomeville episodes, I have the following updated vocabulary sizes.

Episode   Old Expected Vocab Size   New Expected Vocab Size   New Readability Score
1         25                        3                         2.20
2         16                        14                        3.23
3         40                        17                        3.83
4         -                         15                        3.66

You may notice that Episode 4 has a lower expected vocabulary size at 95% and a lower readability score than Episode 3. There’s not a lot in it, but Episode 3 had longer sentences in the extract.

Well, there you are. Gnomeville’s expected vocabulary size is much smaller than originally calculated – at least for Episodes 1 and 3.
