The Twelve Essays of Christmas: Swanning About


For the past three months, I've been working on a teaching method, whereby students go quickly through all the elementary grammar, but with minimal vocabulary. Then with that foundation they go on to more advanced grammar and more words.

In fact, these little essays came out of that work.

I think it suits the Arab educational mindset, and incidentally it's how I would prefer to learn. The initial minimal vocabulary would consist of the most common nouns, adjectives and regular verbs - plus 'grammar words' like auxiliaries, modals, prepositions, articles etc.

Which leaves only the nagging question: What are the most common nouns etc.? And what's a reliable way to find out what they are?

I've tried various methods. One was to take the subtitles of 2000 films and BBC TV shows, which I've collected over 2 years, and run them through a word frequency analyser. After trying out several programs, I found they were all either far too expensive...or utter rubbish. Except for one which was free and did exactly what I needed, but crashed after 37,000 words.
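
For what it's worth, the counting itself is the easy part. Here is a minimal Python sketch of a frequency counter that reads its input one line at a time, so the size of the corpus shouldn't matter much. The function name `count_words` and the tokenizing pattern are illustrative assumptions, not what any of the programs above actually do.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters, allowing internal apostrophes ("don't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def count_words(lines):
    """Count word frequencies across any iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Lower-case first so "Know" and "know" are counted together.
        counts.update(WORD_RE.findall(line.lower()))
    return counts

# Toy input standing in for subtitle text.
sample = ["I don't know.", "I know, I know."]
counts = count_words(sample)
print(counts.most_common(3))
```

An open file handle is also an iterable of lines, so `count_words(open(path, encoding="utf-8"))` would work the same way on a real subtitle file, assuming the timing lines have been stripped out first.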

So I tried writing a script which took the same corpus and counted the non-plural nouns, non-adverbial adjectives and non-irregular verbs, using word lists that another script had extracted from the COED. It took a week to run and the result was...unusable. Here for instance are the 'top 25 nouns'.

Noun    Frequency
I       326168
it      242586
S       236762
and     229025
T       125990
he      90903
on      89512
me      70876
are     67004
but     65995
so      60612
re      58068
do      57479
not     57186
No      54522
M       52176
at      48153
don     47514
she     43154
as      42600
if      39252
go      34934
oh      34120
see     30611
yeah    28665
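
Fragments like 'S', 'T', 're', 'don' and 'M' are what you'd expect if contractions such as "it's", "don't" and "I'm" were split at the apostrophe. That is only a guess at the cause, but the effect is easy to reproduce. A small Python sketch, in which both regular expressions are illustrative assumptions rather than what the original script used:

```python
import re

line = "It's fine, I'm sure they're ready but I don't care."

# Treating every non-letter as a separator breaks contractions into fragments.
naive = re.findall(r"[A-Za-z]+", line)

# Allowing an apostrophe inside a word keeps each contraction as one token.
aware = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", line)

print(naive)
print(aware)
```

The first list contains the stray 's', 'm', 're', 'don' and 't'; the second keeps "It's", "I'm", "they're" and "don't" whole.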

Now, there are plenty of other people, much more clever and expert than me, with much more computing power, who are doing similar work. This fellow has produced some very nice word frequency graphs using Google's Ngrams - a frequency list which Google produced from works on their Google Books service. (Hat tip to Ben Goldacre, whose essay on the subject I found by accident while taking a break from staring at wordlists.)

So I thought: Why don't I filter the Ngram list, omitting everything that isn't a noun, verb or adjective from the COED corpus, then sort the result into nouns, verbs and adjectives (again, as defined in the COED), and sort the output by Google's estimate of frequency? And take out the irregular verbs.
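
In outline, that filter-and-sort step might look like the Python sketch below. Everything in it is an assumption for illustration: the function name `rank_by_category`, the shape of the inputs (a word-to-count mapping plus one set of words per part of speech) and the toy numbers, none of which come from the real Ngram list or the COED.

```python
def rank_by_category(freqs, categories, irregular_verbs=frozenset(), top=25):
    """Return the `top` most frequent words for each part of speech.

    freqs: word -> frequency count (e.g. taken from an n-gram list)
    categories: part-of-speech name -> set of words the dictionary lists under it
    irregular_verbs: words to drop from the "verbs" category only
    """
    ranked = {}
    for name, words in categories.items():
        allowed = words - irregular_verbs if name == "verbs" else words
        # Keep only words the dictionary lists under this part of speech.
        hits = [w for w in freqs if w in allowed]
        # Most frequent first.
        hits.sort(key=lambda w: freqs[w], reverse=True)
        ranked[name] = hits[:top]
    return ranked

# Toy data: all counts and word sets below are made up.
freqs = {"and": 900, "to": 800, "time": 300, "good": 200, "walk": 50, "go": 400}
categories = {
    "nouns": {"and", "time", "walk"},
    "adjectives": {"to", "good"},
    "verbs": {"walk", "go", "time"},
}
ranked = rank_by_category(freqs, categories, irregular_verbs={"go"})
print(ranked)
```

Because the count belongs to the spelling rather than to any one sense, a word that the dictionary lists under several parts of speech will turn up, with its full frequency, in every one of those columns.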

Well...this is the result:

Nouns   Adjectives   Regular Verbs
and     the          of
to      to           in
in      in           a
is      a            be
that    that         on
for     it           or
it      on           but
as      not          have
was     he           you
with    this         were
by      his          one
on      which        can
not     have         more
he      they         will
i       you          no
this    their        may
are     one          time
or      all          do
but     her          out
have    more         up
an      will         see
you     so           over
were    no           like
one     she          even
all     other        well


Hm. 'And' is a noun? 'To' is an adjective? 'Over' is a verb? Well, yes.

An 'AND Gate', or just 'An AND', is something you'll find in electronic circuits - it takes two electrical inputs, and if they're both 'on', it'll output an 'on' signal. So 'and' is a noun.

'To' is an abbreviated form of the Middle English pronoun 'Tone', which is a contraction of 'The one' - as in 'The one or the other'. The fact that no one's used it for at least 500 years is just a detail.

To 'Over' means to jump over something, or to get the better of someone. And it's Late Middle English, so almost as obsolete as 'To'.

So what I actually need is a frequency list of exclusively modern English. Like the lists given at Word Detail. Here are their top 25s:

Nouns        Adjectives   Verbs
time         good         be
year         new          have
people       old          do
way          great        say
man          high         get
day          small        make
thing        large        go
child        long         see
mr           young        know
government   right        take
work         early        think
life         big          come
woman        late         give
system       full         look
case         far          use
part         low          find
group        bad          want
number       sure         tell
world        clear        put
house        likely       mean
area         real         become
company      black        leave
problem      white        work
service      free         need
place        easy         feel


'Government'? 'Mr'? 'Great'? What kind of corpus were they using that places 'Government' as the 10th most common noun? It's number 145 in the Google Books corpus, just above 'States' (159) and 'Thus' (170), and I'm fairly sure these words aren't in my top 500.

So it seems the answer to the question is another question. What are the most common words used in the English language? Answer: Used by who?

The usual intuition of the layperson is that the most common nouns are words like 'knife', 'fork', 'spoon', 'plate' and 'cup'. But these are common things, not common topics of conversation. When was the last time you discussed your cutlery?

My own intuition was that the most common verbs would include 'push', 'pull', 'take', 'drop' and 'put'. But as I've found in teaching, students are more likely to know 'sit down' than 'sit', and 'give up' than 'give'. At least, my students - someone else's will be different.

Something else I could try is the WordNet project, which doesn't so much collect words as meanings - like a thesaurus, but aimed at linguists instead of bad novelists.

Perhaps for my purposes it doesn't matter that I use the commonest words, because I'm using them as a way to teach grammar first and communication second. So maybe this has all been a bit pointless. But pointless journeys have a habit of being more interesting than...er, pointed ones.
