Category Archives: Research

These articles are about research related to language learning or text.

Nabokov’s Favourite Word is Mauve – my kind of book

1 September, 2018 | Research, Reviews, Writing | authorship attribution, quantitative linguistics, readability, stylistics, writing | Sandra Bogerd

I recently noticed an article in an issue of Reader’s Digest while in a waiting room. It discussed the vocabulary of Green Eggs and Ham, and other statistical aspects of writing. It was an extract from Ben Blatt’s book “Nabokov’s Favourite Word is Mauve”. Since the article covered a lot of the things that I work on, I felt compelled to get the book, and read it immediately.

The book compares a set of literary classics, best sellers and fan fiction via quantitative statistics. For the text analysis, the author used my old friend, the Natural Language Toolkit (NLTK), for tasks such as part-of-speech tagging.

The first chapter looks at adverbs, which are frowned upon in writing. The next looks at statistical differences in writing by each gender. Chapter 3 discusses the history of authorship attribution, as played out with the Federalist Papers. The relative frequency of different function words tends to be like a fingerprint for an author’s writing. He uses this on co-authored works to figure out who wrote what. Next up is seeing if authors follow their own advice on writing. Then comes the chapter I saw in Reader’s Digest, which discusses Dr Seuss and readability measures. In the same chapter he notes that the average reading age of bestsellers is decreasing, from grade 8 in the 1960s to grade 6.
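To make the fingerprint idea concrete, here is a minimal Python sketch. It is not the method used in the book: the short function-word list, the function names and the distance measure are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, very short function-word list; real studies use many more words.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "by", "upon", "while")

def function_word_profile(text, words=FUNCTION_WORDS):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in words}

def profile_distance(p, q):
    """Sum of absolute differences between two profiles (smaller = more similar)."""
    return sum(abs(p[w] - q[w]) for w in p)
```

In use, one would build a profile from text known to be by each candidate author, then see which profile lies closest to that of the disputed text. With short samples the frequencies are noisy, so this only becomes meaningful on long texts.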

Next up is a comparison of UK and US writing, including an interesting comparison on the loudness of Americans compared to English people. Chapter 7 looks at clichés – another area I have researched, albeit in lyrics instead of novels. The remaining chapters look at book covers, first sentences, and text generation.

This book did a lot of things that are closely related to my own tinkerings, as well as some of my published research. The author is a journalist and statistician, and several of the chapters, if written in an academic rather than journalistic manner, would have made good quantitative linguistics papers, with the amount of research within the book possibly being enough for a PhD in the field.

I only had one gripe about the book, and that is one sentence in which he assumes his reader knows nothing about statistics and says so. It’s one thing to explain something clearly starting from scratch or to state that something is left to an appendix for those interested. It’s quite another to tell the reader that they don’t know enough statistics. While some readers may not be insulted (e.g. the “for dummies” series was popular), I usually am. I may not be the world’s expert on statistics, but I have a working knowledge, and I am capable of learning. I had a similar experience when I wanted to learn about the premise behind the Zone diet, and read a book on it, only to be constantly told by the authors that I’m fat (I’m not). The moral is: explain things clearly, but don’t insult the audience.

All in all, this was a book that’s exactly the kind of thing that I enjoy.

Vocabulary Needed for 95% Coverage

31 August, 2018 | My Comic Books, Research, Reviews | extensive reading, French, language, language acquisition, language learning, readability | Sandra Bogerd

I’ve been tinkering with ways of comparing different easy readers for language learners. In previous posts I’ve used a type-token ratio or vocabulary density, which gives some idea of how likely it is you might learn new words through repetition from a text. But for something to be readable, the general consensus is that you need to know at least 95% of the words that you read. This is a level that allows people to guess the meaning of the words they don’t know.
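For reference, a type-token ratio can be computed along these lines. This is a minimal sketch, assuming a simple tokenizer that lowercases the text and keeps runs of letters; the function names are illustrative and not taken from the earlier posts.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def type_token_ratio(text):
    """Number of distinct words (types) divided by the number of running words (tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

print(type_token_ratio("le chat et le chien"))  # 4 types / 5 tokens -> 0.8
```

A lower ratio means more repetition. The ratio falls as texts get longer, so it is only fair to compare samples of similar length.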

So something I’ve been messing with recently is predicting the general vocabulary size needed for different beginner stories in French, assuming people know all cognates and all proper nouns. I’ve only been working with short samples of text so far, and there are many other assumptions and issues that make it not a perfect comparison – including bugs in my code…

Given a small set of extracts, and assuming you don’t learn the words via their introduction one at a time, as in my comic books, we have the following:

Title	Vocab Size
Gnomeville Episode 1	25
Gnomeville Episode 2	25
Gnomeville Episode 3	40
Bonjour Berthe	4179
Easy French Reader	5008
Martine a la Ferme	11854
Bonjour Luc	6163

Note that this vocabulary size assumes that each conjugation of a verb is a separate vocabulary item, as are plurals etc., so it will be much larger than the word family figures normally used.
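One way such a figure could be calculated is sketched below. This is an assumption about the general approach, not the code behind the table: every token is looked up in a frequency-ranked word list, cognates and proper nouns are treated as already known, and the answer is the rank you need to reach for 95% of the tokens to be covered. The names `vocab_size_for_coverage`, `rank_of` and `known` are hypothetical.

```python
import math

def vocab_size_for_coverage(tokens, rank_of, known=frozenset(), coverage=0.95):
    """Smallest frequency rank N such that knowing the top-N words, plus
    everything in `known`, covers at least `coverage` of the tokens.
    Returns None if the target cannot be reached with words in the list."""
    if not tokens:
        return 0
    # Known words (cognates, proper nouns) cost nothing; unlisted words are unreachable.
    ranks = sorted(0 if t in known else rank_of.get(t, math.inf) for t in tokens)
    needed = math.ceil(coverage * len(ranks))  # number of tokens that must be covered
    rank = ranks[needed - 1]
    return None if rank == math.inf else rank

# Made-up ranks for illustration, apart from "maman", whose rank is quoted in this post.
ranks = {"le": 1, "de": 2, "et": 3, "maman": 6163}
sample = ["le"] * 8 + ["gnome", "maman"]
print(vocab_size_for_coverage(sample, ranks, known={"gnome"}))  # 6163
```

The example shows why a single word that sits low on the frequency list can dominate the result for a short extract.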

You can see that the one text written for native French speaking children (Martine) has a much richer vocabulary than the texts written for language learners. The figures for these look worse than they seem, because there are many words that are typically taught early to allow conversation, but which feature much lower on word frequency lists. For example, “maman” was at rank 6163 in my list. In contrast, my Gnomeville comics are designed to prioritise frequent words and cognates to optimally improve reading, at the expense of conversation. Hence the very small vocabulary sizes required.

Recently I’ve been reading a 1939 paper by Tharp that looked at measuring vocabulary difficulty. He appears to have had similar ideas about measuring vocabulary load based on the general frequency of the words, as well as a measure of the density of difficult words. I also recently acquired yet another very early graded reader, “Si nous lisions”, from 1930, which attempted to introduce new words every ~60 running words, in the style of Michael West, who seems to have been the first to use the approach. However, I have a graded reader published in 1909 in my collection, which was intended for “rapid reading”, and was part of a series that commenced with short easy texts. I’m not sure if they methodically introduced words at specific intervals as was done by West and others following his example.

In searches on-line, I found a French adapted reader from 1790, so we’ve been at it for quite a while. I’d like to say we know more about how to write graded readers these days, but I think West had it fairly right. The only thing we can do now is make them more interesting and relevant.

Here’s one from 1800 published for those with a German background. There seem to be quite a few published in the 1800s.

Anyway, I’ll finish off here with the usual things: we need 95% coverage to read comfortably (on average). To do that with native texts requires quite a large vocabulary. But vocabulary increases as you read more. So we should read as much as possible at the level that is right for us and of reading material that interests and motivates us. My Gnomeville comics are ideal first readers in French for those with an English language background and a good vocabulary in English. The Berthe and Luc et Sophie series are reasonable alternatives for children that are possibly too young for Gnomeville, as are the ELI A0 series. Until next time…

Mots fréquents français

8 August, 2018 | My Comic Books, Research, Resources | French, language, vocabulary | Sandra Bogerd

I recently came across a new word frequency list for French words, which I’m placing here partly for my own benefit. This one is like some others that combine all conjugations of a verb together, which is not helpful for all applications. Typically present tense is much easier than other less frequently used tenses, particularly for irregular verbs.

Anyway, the list is still useful. It was created by Étienne Brunet, a statistical linguist, based on a corpus of written French.

Here are the top 20 words. Interestingly, compared to the newspaper corpus list I used for designing Episodes 1 and 2 of my comic, this corpus has first person singular (je) occurring much more frequently, as well as “have” (avoir). “ce”, “son” and “elle” also occur in this list higher than “au”, and were not in the newspaper list. “avoir” may be higher because of all conjugations of it being grouped together.

1050561	le	(dét.)
862100	de	(prép.)
419564	un	(dét.)
351960	être	(verbe)
362093	et	(conj.)
293083	à	(prép.)
270395	il	(pron.)
248488	avoir	(verbe)
186755	ne	(adv.)
184186	je	(pron.)
181161	son	(dét.)
176161	que	(conj.)
168684	se	(pron.)
148392	qui	(pron.)
141389	ce	(dét.)
139185	dans	(prép.)
143565	en	(prép.)
127384	du	(dét.)
126397	elle	(pron.)
123502	au	(dét.)

List of frequent words in French.

Extensive Reading Musings

26 May, 2018 | Research | extensive reading, language, language acquisition, languages, readability, readers, reading, research, stories, vocabulary | Sandra Bogerd

I’ve been reading some more research on extensive reading and readability lately. One paper showed gains in reading rate, vocabulary and comprehension with students reading about 150K words over 15 weeks at an intermediate level. This was contrasted with another study where learners read ~65K words over 28 weeks and failed to show improvement. I think there is probably a threshold of some kind where you need to read a certain amount per week to improve language skill. The amount probably varies with the level of skill you already have. Someone still improving their knowledge of the most frequent 400 words of the language will not need to read as much to achieve vocabulary gain (assuming appropriate graded readers) as someone reading at the 2000 word level. The study that showed gains had students reading with vocabularies of 800+.

Given the 10K words per week guide, and the typical reading rate in foreign languages often being around 150 words per minute, that equates to about an hour of reading per week, or 10 minutes a day. That’s not a bad aim for maintaining and hopefully improving your language skills.
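The arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch using the figures quoted above; the function name is illustrative.

```python
def reading_minutes_per_week(words_per_week, words_per_minute):
    """Minutes of reading needed per week to reach a weekly word target."""
    return words_per_week / words_per_minute

# Figures from the text: 150K words over 15 weeks, read at about 150 words per minute.
words_per_week = 150_000 / 15
minutes_per_week = reading_minutes_per_week(words_per_week, 150)
minutes_per_day = minutes_per_week / 7

print(f"{words_per_week:.0f} words/week")      # 10000 words/week
print(f"{minutes_per_week:.0f} minutes/week")  # 67 minutes/week
print(f"{minutes_per_day:.1f} minutes/day")    # 9.5 minutes/day
```

By the same calculation, the study that showed no improvement (~65K words over 28 weeks) works out to roughly 2,300 words per week, or about 15 minutes of reading per week at the same rate.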

What is the target age range of your writing?

3 June, 2017 | My Comic Books, Research, Writing | comic books, comics, French, French language, language, readability, stories, word frequencies, writing | Sandra Bogerd

In writing my Gnomeville comic book series, I was mainly focused on making an entertaining story that used French-English cognates and highly frequent words like “le”. As it was a comic book format I seem to have automatically written and drawn in a style that is similar to the main comics of my childhood: Donald Duck, Asterix and the Smurfs. Perhaps that is why people believe it to be targeted at children.

When my fellow French students at the Alliance Française read a draft of Episode 1 of my comic book I heard the occasional chuckle. These were all adults. A recent customer said “BTW, my 11yo read your book and I saw him giggling.” so I guess it works for at least some 11-year-olds. It certainly is a general audience work at the very least. It does, however, include a few challenging words for the very young, such as “matérialise”, which led one native French speaker to rate the book as more difficult to read than other French children’s books. Though in my experience, children have fewer hang-ups about unfamiliar vocabulary than adults do.

In the world of readability measurement, a reading age is often calculated. This is usually based on vocabulary and grammar measures, often approximated by average word length and sentence length. Some vocabulary difficulty measures are based on a set of words that are generally known by children. These measures don’t directly capture conceptual difficulty or age-appropriate content. I may know quite a bit about readability research, but my knowledge of age-appropriate content is purely based on personal experience.
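The two surface proxies mentioned above, average sentence length and average word length, can be computed as in the minimal sketch below. It assumes sentences end at ".", "!" or "?", and it does not implement any particular published readability formula; the function name is illustrative.

```python
import re

def surface_measures(text):
    """Average words per sentence and letters per word, the usual
    stand-ins for grammatical and vocabulary difficulty."""
    # Assumption: sentences end at '.', '!' or '?'; abbreviations will confuse this.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        return {"words_per_sentence": 0.0, "letters_per_word": 0.0}
    return {
        "words_per_sentence": len(words) / len(sentences),
        "letters_per_word": sum(len(w) for w in words) / len(words),
    }

print(surface_measures("I am Sam. Sam I am."))
# {'words_per_sentence': 3.0, 'letters_per_word': 2.0}
```

Published formulas combine measures like these with fitted weights to give a grade level, which is why they say nothing directly about conceptual difficulty or content.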

Generally speaking, stories for children tend to be full of fun, adventure, magic, mystery and silliness. Stories for adolescents start to include relationships as part of the plot, and then stories for adults have more of the complexity of the adult world, such as politics, law, medicine, finance, ethics and bureaucracy. Having stated that, it is clear that my stories are written for children, without my having realised it. That isn’t a bad thing, and I guess it makes sense: it is difficult to express complex and subtle ideas with a small vocabulary.

In other news, my Episode 2 comic book launch was a success. Episode 1 is still available as a countdown deal if you are in one of the few lucky countries that can enjoy those deals on Amazon. The next phase for me will be converting Episode 2 into an ebook.

Comprehensible Input

12 May, 2017 | My Comic Books, Research | comics, French, French easy reader, French for fun, French graded reader, French language, language, language acquisition, language learning, SLA | Sandra Bogerd

I came across this article recently while looking at on-line language learning groups and resources. Apparently there is friction between those who understand the research on language acquisition and those who believe in language lessons. If one tries to learn or memorise language, it uses a different mental process to that used for communication, and doesn’t contribute to communication skill in the language, which explains a lot about people’s frustration with language education.

One point raised in the article is that early stage language acquirers tend to focus on content words, and not absorb the surrounding function words. This agrees with the observation that it is often easier to remember concrete nouns than the words that connect them in sentences.

How does this apply to my comic book? Well, my comic book attempts to make the input as comprehensible as possible for the complete novice. Anecdotal evidence suggests that this works. It also attempts to be as engaging as possible. Having heard chuckles from students of French when reading earlier drafts, I’d say that it does achieve that goal. Also, a recent customer said the following: “BTW, my 11yo read your book and I saw him giggling”. This makes me happy, as I was advised to develop this as a children’s reading resource.

First Words

14 January, 2016 | Research | first words, language, language acquisition | Sandra Bogerd

I’ve been musing a lot lately about the first words we learn in a language. Children first communicate in one-word sentences, then two-word and then later more complex sentences. There is evidence that the same happens for second language learners. In my experience of picking up a language via TV series, it seems to hold true. The first words learnt are those that occur frequently in one-word sentences. This happens for exclamations, like “Ah!”, and “yes”, and “no”. As time goes on, it becomes possible to identify the words in longer sentences, and eventually to be able to notice patterns in sentences.

Thoughts on Up Goer Five and Constrained Vocabulary Writing

10 June, 2015 | My Comic Books, Research, Resources, Reviews, Writing | constrained writing, corpus linguistics, cultural sensitivity, gender, graded readers, race, readability, representation, sociolinguistics, upgoerfive, word frequencies, writing | Sandra Bogerd

When I first saw the Up Goer Five comic by xkcd, I loved it. It epitomised what I do with my comic book and my research, and is a convenient example to show people, when explaining the idea of constrained vocabulary writing.

Fans figured out that the 1,000 words used by xkcd for it were the contemporary fiction list, shown in Wiktionary. This frequency list is based on over 9 million words of on-line contemporary fiction. It combines plurals and simple verb forms into one listed word (lemmas), which is a good choice, since if the root word is known, then the plurals with s, and simple verb forms are usually also understood.

As someone who writes using lists generated based on frequency, I’ve noticed that several problems arise. One is that, typically, male pronouns and nouns occur at higher frequencies than female ones. The Wiktionary list is not overly biased in this way, possibly because it is based on contemporary fiction. “he” is ranked at 8, “her” and “she” at 12 and 13 respectively, and “his” at 16. However, we find “man” at 163 and “woman” at 452, but “girl” is at 133 and “boy” at 217. This hints at what has been termed the systemic “infantilization” of women in society. The figures are probably quite different due to the common pairing of “guy” (at 178) with “girl” in colloquial speech. Google’s auto-suggest, which is also based on frequency, has occasionally come up with phrases that are considered racist, sexist or otherwise problematic – and it is purely a reflection of what we as a society tend to write. When writing in a principled manner for language learners, it may be important to balance what word frequency lists tell us, with what is a more equitable representation. I didn’t really think very much about this when I started writing Gnomeville years ago, but have become more aware of these issues thanks to some of my friends who are more knowledgeable in them.

Another issue that needs to be considered is what is culturally appropriate to write for the target audience. For example, I have recently been made aware that it is inappropriate to use words referring to alcoholic beverages when the audience is Islamic. Obviously for work intended for children (or for experimental subjects) it is customary to exclude expletives. For this reason, several words on the list would need to be excluded. There seems to be an expressive set of expletives in the list.

For the method of writing I employ in the Gnomeville story, I introduce one new high frequency word per page of story, and somewhat less frequently I introduce a grammatical pattern. Sometimes I’ve changed the order in which I add words due to the story. This happened in episode one, in which I introduced “se” very early instead of after about a dozen other words. Also, I recall that “le” was added before “de”, even though their ranks are reversed. Having said that, my first 20 words were based on a corpus of newspaper articles. Every corpus gives a different ranking of words. There are some similarities across corpora however. For example, if the corpus is large enough, the frequency of the word “the” is likely to be about 7% for English text.

Anyway, back to Up Goer Five. The upcoming book “Thing Explainer”, as well as the text uploaded to the Up Goer Five text editor, provides some good practice at reading for people still consolidating their first 1000 words of the English language. If going beyond that, the writing should have fewer than 5% of words outside the vocabulary set to be suitable for improving language skill while fluently reading for comprehension. A text editor with more flexibility is the OGTE Editor, designed for writing English text for different language learner levels.

Writing Easy Readers

14 April, 2014 | Research, Writing | EFL, ESL, extensive reading, language learning, languages, vocabulary, writing | Sandra Bogerd

There has been a lot of research over the past few decades on the use of extensive reading for language learning, with Paul Nation being a prominent name in the research community. Out of all this research has come some general guidelines on how to use extensive reading to improve your language learning skills, but also how to write or adapt stories to suit language learners. Here’s my version of the basic requirements.

Decide what your core vocabulary will be, for example 1,000 word families. You may also want to decide what grammatical repertoire you are going to include – at least for lower levels of language skill.
Decide whether you want to teach a particular set of vocabulary in the story (e.g. colours).
Ensure that at least 95% of the text consists of words from your core vocabulary or proper nouns.
For words in your original draft that are outside of the core vocabulary, consider changing them to ones that are within the core vocabulary.
For the remaining out-of-vocabulary words that occur fewer than 5 times (say) in your story, provide a gloss. (Also for any idiomatic expressions.)
For words that you want to teach, ensure they occur at least 5 times in the story, but in a way that doesn’t ruin the story. It would be better to have fewer occurrences than to make the story less entertaining.
Use illustrations, as they help the learner retain meaning.
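The checks in the list above (95% coverage, glossing rare out-of-vocabulary words, repeating target words) can be sketched in Python as follows. This is a simplified illustration, not the vocabulary checker linked below: it matches exact word forms rather than word families, and the function name, parameters and return format are all assumptions.

```python
from collections import Counter

def check_story(tokens, core_vocab, proper_nouns=frozenset(),
                target_words=frozenset(), min_coverage=0.95, min_repeats=5):
    """Report coverage, words needing a gloss, and under-repeated target words."""
    counts = Counter(tokens)
    allowed = set(core_vocab) | set(proper_nouns)
    covered = sum(n for word, n in counts.items() if word in allowed)
    coverage = covered / len(tokens) if tokens else 1.0
    return {
        "coverage": coverage,
        "meets_coverage": coverage >= min_coverage,
        # Out-of-vocabulary words that are too rare to be learnt from repetition.
        "to_gloss": sorted(w for w, n in counts.items()
                           if w not in allowed and n < min_repeats),
        # Words the story is meant to teach that do not occur often enough.
        "under_repeated": sorted(w for w in target_words if counts[w] < min_repeats),
    }

story = "the cat sat on the mat with tom the zeppelin".split()
core = {"the", "cat", "sat", "on", "mat", "with"}
report = check_story(story, core, proper_nouns={"tom"}, target_words={"cat"})
print(report["coverage"], report["to_gloss"], report["under_repeated"])
# 0.9 ['zeppelin'] ['cat']
```

A real checker would first map each word form to its word family or lemma, so that plurals and simple verb forms count as known when the root word is in the core vocabulary.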

Here’s a vocabulary checker for stories in English.

Vocabulary Analysis of the Gutenberg Collection

8 January, 2014

I found this page when looking for things on vocabulary density – something of relevance for reading books designed for language learners. The guy who wrote it is also interesting, in that he has a non-traditional career path into academia.

He shows his analysis of ~2000 Gutenberg texts based on vocabulary – the kind of thing I like to muck around with. It is “unpublished” work, so lacks a few things, like references and axis labels, making it less useful than it otherwise might be.