For the past three months, I've been working on a teaching method, whereby students go quickly through all the elementary grammar, but with minimal vocabulary. Then with that foundation they go on to more advanced grammar and more words.
In fact, these little essays came out of that work.
I think it suits the Arab educational mindset, and incidentally it's how I would prefer to learn. The initial minimal vocabulary would consist of the most common nouns, adjectives and regular verbs - plus 'grammar words' like auxiliaries, modals, prepositions, articles etc.
Which leaves only the nagging question: What are the most common nouns etc.? And what's a reliable way to find out what they are?
I've tried various methods. One was to take the subtitles of 2000 films and BBC TV shows, which I've collected over 2 years, and run them through a word frequency analyser. After trying out several programs, I found they were all either far too expensive...or utter rubbish. Except for one, which was free and did exactly what I needed, but crashed after 37,000 words.
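For what it's worth, a bare-bones frequency counter is not much code. The sketch below is not any of the programs mentioned above, just a minimal illustration in Python. It reads its input one line at a time, so memory use grows with the size of the vocabulary rather than the size of the corpus. The function names and the sample sentences are made up for the example, and it assumes the subtitles have already been saved as plain text files.

```python
import re
from collections import Counter
from typing import Iterable

# Runs of letters, optionally joined by internal apostrophes (straight or
# curly), so that "don't" is counted as one token rather than "don" and "t".
WORD_RE = re.compile(r"[a-z]+(?:['’][a-z]+)*")


def count_words(lines: Iterable[str]) -> Counter:
    """Count lower-cased word tokens, processing one line at a time."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts


def count_files(paths: Iterable[str]) -> Counter:
    """Add up the counts from several plain-text files (hypothetical paths)."""
    total = Counter()
    for path in paths:
        # errors="replace" keeps going if a file has a few badly encoded bytes.
        with open(path, encoding="utf-8", errors="replace") as handle:
            total.update(count_words(handle))
    return total


if __name__ == "__main__":
    sample = ["I don't know.", "I know, I know!"]  # invented sample lines
    for word, n in count_words(sample).most_common():
        print(word, n)
```

Real subtitle files also carry timestamps and cue numbers. The pattern above ignores digits, but formatting tags would still need stripping first.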
So I tried writing a script which took the same corpus and counted the non-plural nouns, non-adverbial adjectives and non-irregular verbs, using word lists that another script had extracted from the COED. It took a week to run and the result was...unusable. Here, for instance, are the 'top 25 nouns'.
Noun | Frequency |
I | 326168 |
it | 242586 |
S | 236762 |
and | 229025 |
T | 125990 |
he | 90903 |
on | 89512 |
me | 70876 |
are | 67004 |
but | 65995 |
so | 60612 |
re | 58068 |
do | 57479 |
not | 57186 |
No | 54522 |
M | 52176 |
at | 48153 |
don | 47514 |
she | 43154 |
as | 42600 |
if | 39252 |
go | 34934 |
oh | 34120 |
see | 30611 |
yeah | 28665 |
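Entries like 'S', 'T', 're', 'M' and 'don' look like the tails of contractions ('s, 't, 're, 'm, don't), which is what you'd get if words were split at apostrophes. That is a guess about the script, not something the table proves. The Python sketch below only illustrates the effect, using an invented sentence and two hypothetical tokenisers.

```python
import re
from typing import List


def naive_tokens(text: str) -> List[str]:
    """Split on anything that is not a letter, so apostrophes break words."""
    return re.findall(r"[A-Za-z]+", text)


def contraction_aware_tokens(text: str) -> List[str]:
    """Keep internal apostrophes, so "don't" and "I'm" stay whole."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)


if __name__ == "__main__":
    line = "I'm sure it's fine, but they're late and I don't care."
    # The naive version yields stray tokens such as "m", "s", "re" and "t".
    print(naive_tokens(line))
    # The second version keeps "I'm", "it's", "they're" and "don't" intact.
    print(contraction_aware_tokens(line))
```

Even with the contractions fixed, pronouns and conjunctions would presumably still pass a check against any dictionary that lists them as nouns somewhere.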
Now, there are plenty of other people, much more clever and expert than me, with much more computing power, who are doing similar work. This fellow has produced some very nice word frequency graphs using Google's Ngrams - a frequency list which Google produced from works on their Google Books service. (Hat tip to Ben Goldacre, whose essay on the subject I found by accident while taking a break from staring at wordlists.)
So I thought: Why don't I filter the Ngram list, omitting everything that isn't a noun, verb or adjective from the COED corpus, then sort the result into nouns, verbs and adjectives (again, as defined in the COED), and sort the output by Google's estimate of frequency? And take out the irregular verbs.
Well...this is the result:
Nouns | Adjectives | Regular Verbs |
and | the | of |
to | to | in |
in | in | a |
is | a | be |
that | that | on |
for | it | or |
it | on | but |
as | not | have |
was | he | you |
with | this | were |
by | his | one |
on | which | can |
not | have | more |
he | they | will |
i | you | no |
this | their | may |
are | one | time |
or | all | do |
but | her | out |
have | more | up |
an | will | see |
you | so | over |
were | no | like |
one | she | even |
all | other | well |
Hm. 'And' is a noun? 'To' is an adjective? 'Over' is a verb? Well, yes.
An 'AND Gate', or just 'An AND', is something you'll find in electronic circuits - it takes two electrical inputs, and if they're both 'on', it'll output an 'on' signal. So 'and' is a noun.
'To' is an abbreviated form of the Middle English pronoun 'Tone', which is a contraction of 'The one' - as in 'The one or the other'. The fact that no one's used it for at least 500 years is just a detail.
To 'Over' means to jump over something, or to get the better of someone. And it's Late Middle English, so almost as obsolete as 'To'.
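The mechanics of how this happens are easy to show. The Python sketch below is a toy version of the filter, not the actual script: the little dictionary, the irregular verb list and the counts are all invented, and `top_by_pos` is a hypothetical name. The point is that a word passes the test if the dictionary lists it under that part of speech even once, however rare or obsolete the sense.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Invented stand-in for a dictionary export: each word maps to every part of
# speech it is listed under, including rare and obsolete senses.
POS_BY_WORD: Dict[str, Set[str]] = {
    "and": {"conjunction", "noun"},
    "over": {"preposition", "adverb", "verb"},
    "time": {"noun", "verb"},
    "see": {"verb"},
    "look": {"verb", "noun"},
}

# Invented, deliberately short list of irregular verbs to exclude.
IRREGULAR_VERBS: Set[str] = {"see", "be", "have", "do"}


def top_by_pos(freqs: Iterable[Tuple[str, int]], pos: str, limit: int = 25) -> List[str]:
    """Most frequent words that the dictionary lists under `pos` at least once."""
    kept = [
        (word, count)
        for word, count in freqs
        if pos in POS_BY_WORD.get(word.lower(), set())
        and not (pos == "verb" and word.lower() in IRREGULAR_VERBS)
    ]
    # Highest count first; ties keep their original order.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in kept[:limit]]


if __name__ == "__main__":
    # Made-up counts, only to show the shape of the output.
    counts = [("and", 900), ("over", 300), ("time", 250), ("see", 200), ("look", 120)]
    print(top_by_pos(counts, "noun"))
    print(top_by_pos(counts, "verb"))
```

With this toy data 'and' tops the nouns and 'over' tops the verbs, for the same reason as in the real list: the frequency belongs to the common sense of the word, while the part-of-speech label comes from a rare one.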
So what I actually need is a frequency list of exclusively modern English. Like the lists given at Word Detail. Here are their top 25s:
Nouns | Adjectives | Verbs |
time | good | be |
year | new | have |
people | old | do |
way | great | say |
man | high | get |
day | small | make |
thing | large | go |
child | long | see |
mr | young | know |
government | right | take |
work | early | think |
life | big | come |
woman | late | give |
system | full | look |
case | far | use |
part | low | find |
group | bad | want |
number | sure | tell |
world | clear | put |
house | likely | mean |
area | real | become |
company | black | leave |
problem | white | work |
service | free | need |
place | easy | feel |
'Government'? 'Mr'? 'Great'? What kind of corpus were they using that places 'Government' as the 10th most common noun? It's number 145 in the Google Books corpus, just above 'States' (159) and 'Thus' (170), and I'm fairly sure these words aren't in my top 500.
So it seems the answer to the question is another question. What are the most common words used in the English language? Answer: Used by who?
The usual intuition of the layperson is that the most common nouns are words like 'knife', 'fork', 'spoon', 'plate' and 'cup'. But these are common things, not common topics of conversation. When was the last time you discussed your cutlery?
My own intuition was that the most common verbs would include 'push', 'pull', 'take', 'drop' and 'put'. But as I've found in teaching, students are more likely to know 'sit down' than 'sit', and 'give up' than 'give'. At least, my students - someone else's will be different.
Something else I could try is the WordNet project, which doesn't so much collect words as meanings - like a thesaurus, but aimed at linguists instead of bad novelists.
Perhaps for my purposes it doesn't matter that I use the commonest words, because I'm using them as a way to teach grammar first and communication second. So maybe this has all been a bit pointless. But pointless journeys have a habit of being more interesting than...er, pointed ones.