Tag Archives: word frequencies

Readability Zones

I’ve just been updating my database of French readers and observing the types of books or stories in the different ranges of my current preferred readability measure.

Scores under 4 are ridiculously easy for people with an English-speaking background. Currently this range consists only of episodes 1 and 2 of my Gnomeville comics. Sentences are short and vocabulary is highly constrained, exploiting French-English cognates.

Scores in the 4-4.99 range are very easy: Bonjour Luc, A First French Reader by Whitmarsh, and Histoires pour les grands. They tend to be conversation-based.

Scores in 5-5.99 tend to be the short illustrated graded readers such as Bibliobus, as well as La Spiga’s Zazar for grands débutants (target vocabulary of 150). Gnomeville Episode 3 sits here due to having longer sentences compared to the first two episodes.

Scores in 6-6.99 tend to have longer sentences, including some classic graded readers such as Si nous lisions and Contes Dramatiques, as well as the 300 word vocabulary Teen Reader Catastrophe au Camping des Roses.

Scores 7-7.99 also have the more text-like graded readers, including Sept-d’un-Coup by Otto Bond, which tends to have long sentences but well-controlled vocabulary.

In the 8-8.99 range I find the first story for native-speaking children, as well as more graded readers, including one with a target vocabulary of 1000 words.

The first books for adult native speakers occur with scores between 10 and 12.

Looking at the stories in the list, my own level seems to be from 7 to 10, suggesting I should continue reading more challenging graded readers in addition to stories written for French children. That is pretty much what I have been doing for a while, as well as incidental reading on the web and elsewhere.

A quick look at the relationship between stated vocabulary sizes and the 95th percentile measure that I have been using indicates that the required vocabulary is roughly 1.5x + 2600, where x is the stated vocabulary size. However, I am using a token-based vocabulary whereas most would use a word family one. If I assume token vocabulary sizes are 5 times word family sizes, then the equivalence point for this model is when the vocabulary is about 770, meaning that the vocabulary load will be excessive for stated vocabulary sizes less than 770 but be ok for sizes greater than 770. That’s reasonably reassuring. Mind you, this is an extremely rough estimate.
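The equivalence point can be reproduced with a few lines of arithmetic. This is only a sketch under the assumptions stated in the paragraph above (a linear model with the rounded coefficients 1.5 and 2600, and a token-to-word-family ratio of 5); the function names are made up for illustration. With those rounded coefficients the crossover lands near 743, in the same ballpark as the 770 quoted above, which presumably reflects the unrounded fit.

```python
def required_tokens(stated_size, slope=1.5, intercept=2600.0):
    """Token vocabulary the model says is needed for a given stated size."""
    return slope * stated_size + intercept


def crossover(slope=1.5, intercept=2600.0, tokens_per_family=5.0):
    """Stated size x where tokens_per_family * x equals slope * x + intercept."""
    return intercept / (tokens_per_family - slope)


x = crossover()
print(round(x))  # crossover with the rounded coefficients

# Below the crossover the model demands more tokens than the reader is
# assumed to know (5 * x); above it, fewer.
for size in (300, 1000):
    print(size, required_tokens(size), 5 * size)
```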

This work was based on about 100 words from the start of the text of 40 stories, but it does seem to sort things fairly usefully. The outlier based on my experience of reading the stories is Aventure en Normandie, with a score of 9.49. I don’t recall it being a difficult read.

Meanwhile I am making more progress on Episode 3 of my comic book. I decided to divide one page into three pages, as it had a lot of text and too many new language concepts for a single page. So Episode 3 will probably be 32 pages long, breaking the standard Gnomeville pattern of 28 page episodes. Hopefully it will be ready within a month.


A few more French graded reader book stats

Since my last graded reader update I’ve looked at a few more books, some of which are “classics”, in the sense they were from the “direct reading” era of the first half of the twentieth century, following the influence of Michael West’s constrained vocabulary for language teaching, the various word and idiom frequency lists created at the time, and the idea of readability. Some of these books I had already acquired earlier; but through reading some papers published at that time, I was able to compile a shopping list of other books written according to the same philosophy.

As a result, I have a new winner in terms of expected vocabulary size at the 95% threshold of reading comfort. A New French Reader by Ford and Hicks received a 95% vocabulary size of 3532, and Otto Bond’s Sept-d’un-Coup was a close second, with 3650. Bond’s book starts with a much smaller initial assumed vocabulary (97 words) than the Ford and Hicks book (523), so Bond’s book may be a better first read despite the slightly higher vocabulary score here. As seen in my first post on expected vocabulary size for 95% coverage, these are much higher scores than my Gnomeville comics, as my comics take readability criteria to the extreme.
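For readers unfamiliar with the idea of a vocabulary size at 95% coverage, here is one plausible way to compute such a figure. It is a sketch, not necessarily the calculation behind the numbers above: it assumes a token-based frequency ranking (the `rank` dictionary is a hypothetical stand-in for a real frequency list) and treats words missing from the list as rarer than everything on it.

```python
import math


def coverage_vocab_size(tokens, rank, coverage_percent=95):
    """Smallest frequency rank r such that at least coverage_percent
    of the tokens in the text have rank <= r."""
    ranks = sorted(rank.get(t, math.inf) for t in tokens)
    if not ranks:
        return 0
    # Number of tokens that must be covered, rounded up (integer arithmetic).
    needed = (coverage_percent * len(ranks) + 99) // 100
    return ranks[needed - 1]


# Toy frequency list: the rank of each token in some reference corpus.
rank = {"le": 1, "de": 2, "gnome": 3000}

easy = ["le"] * 10 + ["de"] * 9 + ["gnome"]
hard = ["le"] * 18 + ["gnome"] * 2
print(coverage_vocab_size(easy, rank))
print(coverage_vocab_size(hard, rank))
```

In the toy example a single rare word in twenty tokens still leaves the text readable with the two most frequent words, while two rare words in twenty push the required vocabulary out to the rank of the rare word.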

So based on the current stats available on vocabulary, I recommend the following first graded readers for English speakers learning French.

For 6-9 year olds: Bonjour Berthe.

For 10+: Gnomeville

For adults who don’t like fantasy comics: Sept-d’un-Coup by Otto Bond – though I think there are some errors in it, and it’s out of print (and it probably counts as fantasy…).

Stay tuned for further updates.

 

Challenges of Representation in a Language Comic Book for Beginners

I often reflect on the content of my comic book, and how I have unconsciously absorbed the default story of a white male character (in my case a group of white male characters) on a quest. In addition I have a wise (white) female character (Chantal), who is an oracle that intends to change the likely outcome if the quest continues as it normally would.

I’ve been made aware that people of colour want to see more people like themselves in stories and movies. I must admit that I have yearned for more female perspectives in literature and movies at times, which is as close as I can come to imagining how people of colour feel about being left out of mainstream media. Similarly for people who are queer, obese or disabled.

The difficulty with comic books is that the illustrations are often caricatures that exaggerate features. It would be tricky to create a PoC character without it seeming racist. There is no opportunity in a comic book for beginners in French, which has an extremely constrained vocabulary, to make things nuanced. I think the best I can do is have a variety of skin colours across the cast of characters, and not make the bad characters the dark-skinned ones. Having a queer character _might_ be possible (more likely a queer couple, as that’s easy to do visually without resorting to stereotype appearances). Given it’s a fantasy world, I could potentially do a genderqueer character that magically goes back and forth between genders all the time. After all I have a python that can make itself look like a dragon and a large gnome. Theoretically, the same could happen with skin colour.

I received only one star from one reader on Goodreads for Episode 1, without explanation. I can only guess why, but my guess is it’s to do with it being an entirely white male cast in the first episode – apart from the griffon, which is a mixture of white, blue and brown. This is partly due to unconsciously absorbing this default, even though my various influences (mainly fairy tales, Astérix, Smurfs, and Uncle Scrooge) do have more female characters than I do in Episode 1. It is also partly an artifact of being a slave to word frequency lists and my rules about what to include in each episode. In Episode 1 I only use French-English cognates that look identical in both languages. As such I only use adjectives that are either identical for both genders, such as “visible”, an exact spelling for masculine nouns only, such as “certain”, or exact for feminine nouns, such as “complète” (first occurs in Episode 2). I also chose to use a very limited palette in the drawings, roughly equivalent to a typical 12-colour set of coloured pencils, crayons or felt pens.

I think my comic books will evolve to have more diversity through the series. Episode 1 is already published, so it is what it is. Episode 2 at least introduces a main female character, who, like me, tends to work on her own to solve problems – at least at this stage in the plot. Episode 3 includes new characters, but since they’re not “good” characters, I won’t make them PoC. I haven’t written the Taxi and La Question du Moment for Episode 3 yet, so there is a bit of scope there to increase diversity. At least now I’m more aware of this, and can consider it in my writing/drawing process. Stay tuned for Episode 3… Meanwhile, here is a first attempt at a PoC for my comics – a recolouring of a panel from Episode 2. Is it OK?

Recoloured panel from Episode 2’s La Question du Moment. I think this is ok. Let me know if it isn’t.

What is the target age range of your writing?

In writing my Gnomeville comic book series, I was mainly focused on making an entertaining story that used French-English cognates and highly frequent words like “le”. As it was a comic book format I seem to have automatically written and drawn in a style that is similar to the main comics of my childhood: Donald Duck, Asterix and the Smurfs. Perhaps that is why people believe it to be targeted at children.

When my fellow French students at the Alliance Française read a draft of Episode 1 of my comic book I heard the occasional chuckle. These were all adults. A recent customer said “BTW, my 11yo read your book and I saw him giggling.” so I guess it works for at least some 11-year-olds. It certainly is a general audience work at the very least. It does, however, include a few challenging words for the very young, such as “matérialise”, which led one native French speaker to rate the book as more difficult to read than other French children’s books. Though in my experience, children have fewer hang-ups about unfamiliar vocabulary than adults do.

In the world of readability measurement, a reading age is often calculated. This is usually based on vocabulary and grammar measures, often approximated by average word length and sentence length. Some vocabulary difficulty measures are based on a set of words that are generally known by children. These measures don’t directly capture conceptual difficulty or age-appropriate content. I may know quite a bit about readability research, but my knowledge of age-appropriate content is purely based on personal experience.
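As an illustration of the surface measures mentioned above, the following sketch computes average sentence length (in words) and average word length (in letters) for a passage. It is a deliberately crude approximation: sentences are split on full stops, exclamation marks and question marks, and words are runs of letters, so a form like “l’arbre” counts as two words. The function name is made up for this example and no particular readability formula is implied.

```python
import re


def surface_measures(text):
    """Return (average words per sentence, average letters per word)."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Runs of letters only (Unicode-aware, so accented letters count).
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        return 0.0, 0.0
    avg_sentence_length = len(words) / len(sentences)
    avg_word_length = sum(len(w) for w in words) / len(words)
    return avg_sentence_length, avg_word_length


sample = "Le gnome est petit. Il danse !"
sentence_len, word_len = surface_measures(sample)
print(sentence_len, round(word_len, 2))
```

Neither number says anything about conceptual difficulty or age-appropriate content, which is exactly the limitation described above.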

Generally speaking, stories for children tend to be full of fun, adventure, magic, mystery and silliness. Stories for adolescents start to include relationships as part of the plot, and then stories for adults have more of the complexity of the adult world, such as politics, law, medicine, finance, ethics and bureaucracy. Having stated that, it is clear that my stories are written for children, without my having realised it. That isn’t a bad thing, and I guess it makes sense: it is difficult to express complex and subtle ideas with a small vocabulary.

In other news, my Episode 2 comic book launch was a success. Episode 1 is still available as a countdown deal if you are in one of the few lucky countries that can enjoy those deals on Amazon. The next phase for me will be converting Episode 2 into an ebook.

Thoughts on Up Goer Five and Constrained Vocabulary Writing

When I first saw the Up Goer Five comic by xkcd, I loved it.  It epitomised what I do with my comic book and my research, and is a convenient example to show people, when explaining the idea of constrained vocabulary writing.

Fans figured out that the 1,000 words used by xkcd for it were the contemporary fiction list, shown in Wiktionary.  This frequency list is based on over 9 million words of on-line contemporary fiction.  It combines plurals and simple verb forms into one listed word (lemmas), which is a good choice, since if the root word is known, then the plurals with s, and simple verb forms are usually also understood.

As someone who writes using lists generated based on frequency, I’ve noticed that several problems arise.  One is that, typically, male pronouns and nouns occur at higher frequencies than female ones.  The Wiktionary list is not overly biased in this way, possibly because it is based on contemporary fiction.  “he” is ranked at 8, “her” and “she” at 12 and 13 respectively, and “his” at 16.  However, we find “man” at 163 and “woman” at 452, but “girl” is at 133 and “boy” at 217.  This hints at what has been termed the systemic “infantilization” of women in society.  The “girl” and “boy” figures are probably skewed by the common pairing of “guy” (at 178) with “girl” in colloquial speech.  Google’s auto-suggest, which is also based on frequency, has occasionally come up with phrases that are considered racist, sexist or otherwise problematic – and it is purely a reflection of what we as a society tend to write.  When writing in a principled manner for language learners, it may be important to balance what word frequency lists tell us with what is a more equitable representation.  I didn’t really think very much about this when I started writing Gnomeville years ago, but have become more aware of these issues thanks to some of my friends who are more knowledgeable in them.

Another issue that needs to be considered is what is culturally appropriate to write for the target audience.  For example, I have recently been made aware that it is inappropriate to use words referring to alcoholic beverages when the audience is Islamic.  Obviously for work intended for children (or for experimental subjects) it is customary to exclude expletives.  For this reason, several words on the list would need to be excluded.  There seems to be an expressive set of expletives in the list.

For the method of writing I employ in the Gnomeville story, I introduce one new high frequency word per page of story, and somewhat less frequently I introduce a grammatical pattern.  Sometimes I’ve changed the order in which I add words due to the story.  This happened in episode one, in which I introduced “se” very early instead of after about a dozen other words.  Also, I recall that “le” was added before “de”, even though their ranks are reversed.  Having said that, my first 20 words were based on a corpus of newspaper articles.  Every corpus gives a different ranking of words.  There are some similarities across corpora, however.  For example, if the corpus is large enough, the frequency of the word “the” is likely to be about 7% for English text.
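The claim about “the” is easy to check against any corpus you have to hand. A minimal sketch, assuming a simple letters-only tokeniser and lower-casing (the function name is hypothetical, and the toy sentence is far too small to say anything about real English):

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Map each token to its share of all tokens, most frequent first."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.most_common()}


freqs = relative_frequencies("The cat saw the dog and the bird")
print(freqs["the"])      # share of tokens that are "the" in this toy text
print(list(freqs)[:3])   # the top of the resulting ranking
```

Running the same function over two different corpora and comparing the order of the keys shows how much the rankings can differ from one corpus to another.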

Anyway, back to Up Goer Five.  The upcoming book “Thing Explainer”, as well as the text uploaded to the up goer five text editor, provides some good reading practice for people still consolidating their first 1000 words of the English language.  If going beyond that, the writing should have fewer than 5% of words outside the vocabulary set to be suitable for improving language skill while fluently reading for comprehension.  A text editor with more flexibility is the OGTE Editor, designed for writing English text for different language learner levels.
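The 5% guideline mentioned above can be checked mechanically once you have a word list. The sketch below is not how the up goer five editor or the OGTE Editor work internally; it is just an illustration with made-up function names, and it assumes the text has already been split into lower-case tokens and that `known` is the learner's vocabulary set.

```python
def outside_share(tokens, known):
    """Fraction of tokens that are not in the known vocabulary set."""
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in known)
    return unknown / len(tokens)


def is_comfortable(tokens, known, limit=0.05):
    """True when fewer than `limit` of the tokens fall outside the set."""
    return outside_share(tokens, known) < limit


known = {"le", "gnome", "est"}
borderline = ["le"] * 19 + ["dragon"]   # 1 unknown token in 20
easier = ["le"] * 39 + ["dragon"]       # 1 unknown token in 40
print(outside_share(borderline, known), is_comfortable(borderline, known))
print(outside_share(easier, known), is_comfortable(easier, known))
```

A real check would also want to fold plurals and simple verb forms into their root words, as the Wiktionary list does, before looking tokens up.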