While in a waiting room recently, I noticed an article in an issue of Reader’s Digest. It discussed the vocabulary of Green Eggs and Ham, along with other statistical aspects of writing, and was an extract from Ben Blatt’s book “Nabokov’s Favourite Word is Mauve”. Since the article covered a lot of the things that I work on, I felt compelled to get the book, and read it immediately.
The book compares a set of literary classics, best sellers and fan fiction via quantitative statistics. For the text analysis, the author used my old friend the Natural Language Toolkit (NLTK), for part-of-speech tagging among other things.
The first chapter looks at adverbs, which are frowned upon in writing. The next looks at statistical differences between writing by men and by women. Chapter 3 discusses the history of authorship attribution, as played out with the Federalist Papers. The relative frequency of different function words tends to act like a fingerprint for an author’s writing, and Blatt uses this on co-authored works to figure out who wrote what. Next up is seeing whether authors follow their own advice on writing. Then comes the chapter I saw in Reader’s Digest, which discusses Dr Seuss and readability measures. In the same chapter he notes that the average reading level of bestsellers is decreasing, from grade 8 in the 1960s to grade 6.
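The fingerprint idea can be sketched in a few lines of Python. This is only a minimal illustration, not Blatt’s actual method: the short word list, the per-1,000-token scaling, the distance measure and the function names are all assumptions made for the example, and a real attribution study would use many more words and much longer texts.

```python
import re
from collections import Counter

# A small, illustrative set of function words.
# Assumption: this is NOT the list used in the book.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "by", "upon", "on")


def function_word_profile(text, words=FUNCTION_WORDS):
    """Return each function word's frequency per 1,000 tokens of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    counts = Counter(tokens)
    return {w: 1000.0 * counts[w] / total for w in words}


def profile_distance(p, q):
    """Sum of absolute differences between two profiles (smaller is closer)."""
    return sum(abs(p[w] - q[w]) for w in p)


def closest_author(unknown_text, known_texts):
    """Pick the author whose sample has the most similar profile.

    known_texts maps a (hypothetical) author label to a writing sample.
    """
    unknown = function_word_profile(unknown_text)
    return min(
        known_texts,
        key=lambda author: profile_distance(
            unknown, function_word_profile(known_texts[author])
        ),
    )
```

With two toy samples, one favouring “upon” and the other “on”, `closest_author` picks the sample that shares the unknown text’s preference. On real prose the signal is much subtler, which is why long texts and many function words are needed.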
Next up is a comparison of UK and US writing, including an interesting look at the loudness of Americans compared to English people. Chapter 7 looks at clichés, another area I have researched, albeit in lyrics instead of novels. The remaining chapters look at book covers, first sentences, and text generation.
This book does a lot of things that are closely related to my own tinkerings, as well as to some of my published research. The author is a journalist and statistician. Several of the chapters, if written in an academic rather than journalistic manner, would have made good quantitative linguistics papers, and the amount of research in the book is possibly enough for a PhD in the field.
I had only one gripe about the book: a single sentence in which he assumes his reader knows nothing about statistics, and says so. It’s one thing to explain something clearly from scratch, or to state that something is left to an appendix for those interested. It’s quite another to tell the reader that they don’t know enough statistics. While some readers may not be insulted (e.g. the “For Dummies” series was popular), I usually am. I may not be the world’s foremost expert on statistics, but I have a working knowledge, and I am capable of learning. I had a similar experience when I wanted to learn about the premise behind the Zone diet and read a book on it, only to be constantly told by the authors that I’m fat (I’m not). The moral is: explain things clearly, but don’t insult the audience.
All in all, this is exactly the kind of book that I enjoy.