I’ve been musing a lot lately about the first words we learn in a language. Children first communicate in one-word sentences, then two-word sentences, and later in more complex ones. There is evidence that the same happens for second language learners. In my experience of picking up a language via TV series, it seems to hold true. The first words learnt are those that occur frequently in one-word sentences. This happens for exclamations, like “Ah!”, and “yes”, and “no”. As time goes on, it becomes possible to identify the words in longer sentences, and eventually to notice patterns in sentences.
When I first saw the Up Goer Five comic by xkcd, I loved it. It epitomised what I do with my comic book and my research, and is a convenient example to show people, when explaining the idea of constrained vocabulary writing.
Fans figured out that the 1,000 words used by xkcd for it were the contemporary fiction list, shown in Wiktionary. This frequency list is based on over 9 million words of on-line contemporary fiction. It combines plurals and simple verb forms into one listed word (lemmas), which is a good choice, since if the root word is known, then the plurals with s, and simple verb forms are usually also understood.
As someone who writes using lists generated based on frequency, I’ve noticed that several problems arise. One is that, typically, male pronouns and nouns occur at higher frequencies than female ones. The Wiktionary list is not overly biased in this way, possibly because it is based on contemporary fiction. “he” is ranked at 8, “her” and “she” at 12 and 13 respectively, and “his” at 16. However, we find “man” at 163 and “woman” at 452, but “girl” is at 133 and “boy” at 217. This hints at what has been termed the systemic “infantilization” of women in society. These rankings are probably also skewed by the common pairing of “guy” (at 178) with “girl” in colloquial speech. Google’s auto-suggest, which is also based on frequency, has occasionally come up with phrases that are considered racist, sexist or otherwise problematic – and it is purely a reflection of what we as a society tend to write. When writing in a principled manner for language learners, it may be important to balance what word frequency lists tell us against what is a more equitable representation. I didn’t really think very much about this when I started writing Gnomeville years ago, but have become more aware of these issues thanks to some of my friends who are more knowledgeable about them.
Another issue that needs to be considered is what is culturally appropriate to write for the target audience. For example, I have recently been made aware that it is inappropriate to use words referring to alcoholic beverages when the audience is Islamic. Obviously for work intended for children (or for experimental subjects) it is customary to exclude expletives. For this reason, several words on the list would need to be excluded. There seems to be an expressive set of expletives in the list.
For the method of writing I employ in the Gnomeville story, I introduce one new high frequency word per page of story, and somewhat less frequently I introduce a grammatical pattern. Sometimes I’ve changed the order in which I add words to suit the story. This happened in episode one, in which I introduced “se” very early instead of after about a dozen other words. Also, I recall that “le” was added before “de”, even though their ranks are reversed. Having said that, my first 20 words were based on a corpus of newspaper articles. Every corpus gives a different ranking of words. There are some similarities across corpora, however. For example, if the corpus is large enough, the frequency of the word “the” is likely to be about 7% for English text.
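To make the point about corpus-dependent rankings concrete, here is a minimal Python sketch that ranks words by frequency and reports the share of one word. The two toy "corpora" and the function names are invented for illustration; real frequency lists are built from millions of words and usually group inflected forms together, which this sketch does not attempt.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_ranking(text):
    """Return (word, count) pairs, most frequent first; ties are alphabetical."""
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def relative_frequency(text, word):
    """Return the share of tokens in the text that are equal to `word`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return tokens.count(word) / len(tokens)


# Two tiny made-up "corpora" (hypothetical, far too small for real use).
news = "the minister said the budget would cut the deficit"
fiction = "she said she saw the cat and she smiled"

news_rank = [word for word, _ in frequency_ranking(news)]
fiction_rank = [word for word, _ in frequency_ranking(fiction)]

print(news_rank[:3], fiction_rank[:3])
print(round(relative_frequency(news, "the"), 2))
```

Even in these toy texts the top-ranked word differs between the two sources, which is the same effect, in miniature, as a newspaper corpus and a fiction corpus disagreeing on word order.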
Anyway, back to Up Goer Five. The upcoming book “Thing Explainer”, as well as the text uploaded to the Up Goer Five text editor, provides some good reading practice for people still consolidating their first 1,000 words of the English language. Beyond that level, a text should have fewer than 5% of its words outside the reader’s vocabulary set to be suitable for improving language skill while reading fluently for comprehension. A text editor with more flexibility is the OGTE Editor, designed for writing English text for different language learner levels.
There has been a lot of research over the past few decades on the use of extensive reading for language learning, with Paul Nation being a prominent name in the research community. Out of all this research has come some general guidelines on how to use extensive reading to improve your language learning skills, but also how to write or adapt stories to suit language learners. Here’s my version of the basic requirements.
- Decide what your core vocabulary will be, for example 1,000 word families. You may also want to decide what grammatical repertoire you are going to include – at least for lower levels of language skill.
- Decide whether you want to teach a particular set of vocabulary in the story (e.g. colours).
- Ensure that at least 95% of the text consists of words from your core vocabulary or proper nouns.
- For words in your original draft that are outside of the core vocabulary, consider changing them to ones that are within the core vocabulary.
- For the remaining out-of-vocabulary words that occur fewer than 5 times (say) in your story, provide a gloss. (Also for any idiomatic expressions.)
- For words that you want to teach, ensure they occur at least 5 times in the story, but in a way that doesn’t ruin the story. It would be better to have fewer occurrences than to make the story less entertaining.
- Use illustrations, as they help the learner retain meaning.
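The coverage, gloss and repetition rules in this list are easy to check mechanically. The following Python sketch is one rough way to do it, not a reference implementation: the function name, parameters and sample story are all hypothetical, and it matches exact word forms rather than word families, so a real checker would need lemmatisation on top.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def check_story(story, core_vocab, proper_nouns=(), target_words=(),
                min_coverage=0.95, threshold=5):
    """Check a draft against the coverage, gloss and repetition guidelines.

    Assumption: words are compared as exact lowercase forms, not word families.
    """
    known = {w.lower() for w in core_vocab} | {w.lower() for w in proper_nouns}
    tokens = tokenize(story)
    counts = Counter(tokens)
    # Words that are neither core vocabulary nor proper nouns.
    outside = {w: n for w, n in counts.items() if w not in known}
    covered = len(tokens) - sum(outside.values())
    coverage = covered / len(tokens) if tokens else 1.0
    return {
        "coverage": coverage,
        "meets_target": coverage >= min_coverage,
        # Rare out-of-vocabulary words: candidates for a gloss.
        "needs_gloss": sorted(w for w, n in outside.items() if n < threshold),
        # Words to teach that do not yet appear often enough.
        "under_repeated": sorted(
            w.lower() for w in target_words if counts[w.lower()] < threshold
        ),
    }


# Hypothetical core vocabulary and draft text, for illustration only.
core = {"the", "gnome", "sees", "a", "and", "is", "big"}
story = "The gnome sees a dragon. The dragon is big and the gnome is Jacques."
report = check_story(story, core, proper_nouns=["Jacques"], target_words=["big"])
print(report)
```

In this toy draft "dragon" is the only out-of-vocabulary word, so it is flagged for a gloss, and coverage falls short of 95% because the text is so short. The report only flags candidates; whether to swap a word, gloss it or leave it for the sake of the story is still an editorial decision.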
Here’s a vocabulary checker for stories in English.
I found this page when looking for things on vocabulary density – something of relevance for reading books designed for language learners. The guy who wrote it is also interesting, in that he has a non-traditional career path into academia.
He shows his vocabulary-based analysis of ~2000 Gutenberg texts – the kind of thing I like to muck around with. It is “unpublished” work, so lacks a few things, like references and axis labels, making it less useful than it otherwise might be.