What words should I teach? What words do students actually need to know? I don't know either, and intuition is always a lousy guide, but here's one approach to finding out.
The latest Oxford English Dictionary contains about 290,500 entries. The Concise OED (COED) has 65,795, and I'm using these as my starting point.
I can discard 13,479 entries because they're phrases instead of individual words, plus I can lose 2,727 entries because they're hyphenated terms. That leaves 45,495.
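For anyone wanting to try something similar, here is a minimal sketch of that kind of filtering. It assumes the headwords are already available as a plain list of strings, that a phrase is any entry containing a space, and that lowercasing is wanted; the function name `single_words` and the sample entries are made up for illustration and are not taken from the actual process used here.

```python
def single_words(headwords):
    """Keep only entries that are a single, unhyphenated word."""
    kept = []
    for entry in headwords:
        entry = entry.strip()
        if not entry:
            continue          # skip blank entries
        if " " in entry:
            continue          # assumed test for a phrase, e.g. "ad hoc"
        if "-" in entry:
            continue          # hyphenated term, e.g. "well-known"
        kept.append(entry.lower())  # lowercasing is an assumption
    return kept


# Hypothetical sample, not real COED data
sample = ["menhir", "ad hoc", "well-known", "Tynwald", "usufruct"]
print(single_words(sample))
```

Checking for a space and a hyphen is crude, but it is easy to audit by eye, which matters more than elegance when the list runs to tens of thousands of entries.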
But which ones are absolutely essential, which are kind-of useful, and which are in there to make it look 'comprehensive' or because the compilers just liked them?
I have the subtitles of 20,749 BBC programmes broadcast over the last two years - in effect, transcriptions. By ditching the shortest 749, then filtering out formatting data and punctuation, I've got a pretty large corpus of reasonably authentic utterances.
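The clean-up step might look something like the sketch below. It assumes the subtitles are in an SRT-like layout (cue numbers, `-->` timestamp lines and occasional markup tags), which may well not be the format the BBC files actually use; `clean_subtitle` and `drop_shortest` are hypothetical helper names.

```python
import re

TAG = re.compile(r"<[^>]+>")          # markup such as <i> or <font ...>
TIMING = re.compile(r"^\d+$|-->")     # cue numbers and timestamp lines
# Runs of letters, allowing internal apostrophes ("here's")
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def clean_subtitle(raw):
    """Return the lowercased words of one subtitle file, minus formatting."""
    words = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or TIMING.search(line):
            continue                      # formatting data, not speech
        line = TAG.sub(" ", line)         # remove markup tags
        words.extend(WORD.findall(line.lower()))
    return words


def drop_shortest(transcripts, n):
    """Discard the n transcripts with the fewest words."""
    return sorted(transcripts, key=len)[n:]


# Hypothetical subtitle fragment
raw = "1\n00:00:01,000 --> 00:00:03,000\n<i>Well, here's the thing.</i>\n"
print(clean_subtitle(raw))
```

Because only runs of letters are kept, punctuation and digits disappear without needing a separate pass.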
So, what words from the COED occur with what frequency in the BBC transcriptions? And what words don't occur at all?
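Counting is the easy part. The sketch below assumes each transcript is already a list of lowercased words and that a dictionary word counts only when it matches exactly, with no folding of plurals or verb endings; whether the figures that follow were produced that way is an assumption. The name `dictionary_counts` is hypothetical.

```python
from collections import Counter


def dictionary_counts(headwords, transcripts):
    """Map every headword to its number of occurrences in the corpus."""
    corpus = Counter()
    for words in transcripts:
        corpus.update(words)          # tally every word in every transcript
    # Counter returns 0 for words it has never seen
    return {word: corpus[word] for word in headwords}


# Tiny made-up corpus
counts = dictionary_counts(
    ["the", "menhir"],
    [["the", "cat"], ["the"]],
)
unused = [word for word, n in counts.items() if n == 0]
print(counts, unused)
```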
Well, here's a selection from the 14,633 individual words which occur exactly zero times in two years' worth of BBC TV. I know what ten of them mean.
Word | Occurrences |
backgrounder | 0 |
bouclé | 0 |
chametz | 0 |
contumacious | 0 |
delist | 0 |
dyspepsia | 0 |
externalism | 0 |
gambado | 0 |
headquarter | 0 |
inamorato | 0 |
kaffeeklatsch | 0 |
linstock | 0 |
menhir | 0 |
mutuel | 0 |
orangeman | 0 |
pemphigoid | 0 |
portière | 0 |
raja | 0 |
sandinista | 0 |
siksika | 0 |
stumer | 0 |
tetrastich | 0 |
tynwald | 0 |
usufruct | 0 |
yaar | 0 |
That means 29,140 words occur at least once. Here are 25 of the 19,381 which occur fewer than ten times. I know the meanings of 14 of them - what about you?
Word | Occurrences |
shirty | 9 |
maraschino | 8 |
cortisone | 7 |
serried | 7 |
ganglion | 6 |
turbocharger | 6 |
gilet | 5 |
som | 5 |
convulsion | 4 |
nimbus | 4 |
unlistenable | 4 |
divestment | 3 |
miscast | 3 |
spousal | 3 |
bioactive | 2 |
epsilon | 2 |
leafhopper | 2 |
prelate | 2 |
tambourin | 2 |
angiography | 1 |
chinkara | 1 |
eclampsia | 1 |
honeyguide | 1 |
minuteman | 1 |
piscina | 1 |
9,339 occur a hundred times or more. The following occur more than ten but fewer than 1,000 times.
Word | Occurrences |
honourable | 704 |
underwear | 529 |
vain | 416 |
troop | 327 |
muffin | 264 |
max | 215 |
cam | 177 |
blip | 150 |
uncanny | 129 |
gland | 111 |
aerospace | 96 |
detonate | 83 |
mangle | 72 |
yam | 63 |
ringside | 55 |
chamomile | 47 |
embryonic | 41 |
poncho | 36 |
uneducated | 32 |
bawl | 27 |
gunfight | 24 |
permissible | 21 |
convection | 18 |
lucre | 16 |
morass | 14 |
A more manageable quantity of 520 occur 10,000 or more times. Here are some of those between 1,000 and 10,000:
Word | Occurrences |
summer | 8831 |
offer | 7861 |
including | 6985 |
hopefully | 6285 |
fruit | 5693 |
showing | 5224 |
closed | 4761 |
sight | 4342 |
location | 4031 |
countryside | 3784 |
product | 3474 |
lack | 3211 |
arrive | 3001 |
transport | 2793 |
shower | 2633 |
iron | 2484 |
breathe | 2297 |
panic | 2162 |
twist | 2031 |
cave | 1921 |
purple | 1824 |
innocent | 1718 |
fraud | 1641 |
virtually | 1560 |
assume | 1480 |
Here's a selection between 10,000 and 100,000. Do any surprise you as being more or less common than you thought?
Word | Occurrences |
home | 63526 |
course | 45194 |
lovely | 34349 |
such | 28509 |
front | 24414 |
while | 20752 |
easy | 18113 |
hold | 16047 |
dad | 13846 |
hour | 12535 |
cost | 11389 |
beat | 10660 |
A mere 76 occur 100,000 or more times:
Word | Occurrences |
the | 3812984 |
to | 2404952 |
a | 2098428 |
of | 1699308 |
and | 1640805 |
it | 1432806 |
in | 1209482 |
that | 1173021 |
for | 730257 |
on | 699870 |
have | 633194 |
this | 568081 |
be | 564405 |
are | 532264 |
with | 490772 |
not | 401645 |
at | 396377 |
he | 374352 |
do | 368329 |
me | 350617 |
all | 346828 |
what | 338545 |
there | 337154 |
as | 331112 |
but | 321033 |
like | 319404 |
just | 309660 |
up | 305068 |
can | 297541 |
about | 290147 |
out | 285530 |
so | 285324 |
going | 283807 |
think | 281064 |
from | 275761 |
get | 273618 |
will | 269378 |
know | 267222 |
here | 252640 |
go | 239097 |
an | 220785 |
very | 212718 |
them | 212592 |
see | 201120 |
time | 196690 |
now | 187122 |
if | 187114 |
right | 182193 |
by | 182007 |
more | 179865 |
really | 176809 |
good | 169427 |
people | 168640 |
or | 163126 |
back | 160195 |
some | 159211 |
she | 157138 |
want | 152556 |
no | 148444 |
then | 137351 |
into | 134491 |
down | 130689 |
how | 130228 |
look | 124413 |
come | 124007 |
way | 123631 |
make | 122129 |
over | 119391 |
well | 117864 |
say | 117728 |
much | 117311 |
need | 115195 |
bit | 113799 |
off | 107582 |
little | 103122 |
take | 101002 |
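The banding used in the tables above can be reproduced along these lines. This is a sketch with disjoint power-of-ten bands, which is slightly different from the overlapping "or more" groupings quoted in the text; `BOUNDS`, `LABELS` and `band` are hypothetical names, and the sample counts are copied from the tables.

```python
from bisect import bisect_right
from collections import Counter

# Lower edge of every band except the first ("0")
BOUNDS = [1, 10, 100, 1_000, 10_000, 100_000]
LABELS = ["0", "1-9", "10-99", "100-999",
          "1,000-9,999", "10,000-99,999", "100,000+"]


def band(count):
    """Return the label of the frequency band a count falls into."""
    return LABELS[bisect_right(BOUNDS, count)]


# Counts copied from the tables above
sample = {
    "backgrounder": 0,
    "shirty": 9,
    "morass": 14,
    "honourable": 704,
    "summer": 8831,
    "home": 63526,
    "the": 3812984,
}
tally = Counter(band(n) for n in sample.values())
print(tally)
```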
So, here's one difficulty in learning a language. Once you've got the major meanings of the top 100-200 words, you've got tens of thousands of others to learn, and the additional benefit of knowing each of them - their usefulness - is pretty damn small.
How often do you need to describe something as 'spicy' (position 3,000, 975 occurrences)? Or 'compulsory' (position 5,000, 370 occurrences)? Or describe someone as a 'colonel' (position 7,000, 182 occurrences)?
I may have had the occasional 'manky' cheese sandwich (position 10,000, 85 occurrences) - but I'm not sure I've ever used the word in conversation.
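To put a rough number on that shrinking payoff, the counts quoted above can be compared with the count for 'the'. This is only back-of-envelope arithmetic on figures already given in this post; the helper name `times_rarer` is made up.

```python
# Counts as quoted in this post
counts = {
    "the": 3812984,
    "spicy": 975,
    "compulsory": 370,
    "colonel": 182,
    "manky": 85,
}


def times_rarer(word, reference="the"):
    """How many times more often the reference word occurs than `word`."""
    return counts[reference] / counts[word]


for word in ("spicy", "compulsory", "colonel", "manky"):
    ratio = times_rarer(word)
    print(f"'the' occurs about {ratio:,.0f} times as often as '{word}'")
```

The ratio climbs quickly as you move down the list, which is the same point as above in numerical form: each extra word learned turns up in far less of what you hear.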
The Arabic method of learning languages is "memorise the dictionary". It doesn't work, for obvious reasons. But it's...interesting that they've taken only the most difficult, least interesting and least rewarding part of the process, missing out all the easy, fun and useful parts.
The Arabs are almost British in their ability to miss the point.