Kapitano: The Twelve Essays of Christmas: Swanning About

For the past three months, I've been working on a teaching method, whereby students go quickly through all the elementary grammar, but with minimal vocabulary. Then with that foundation they go on to more advanced grammar and more words.

In fact, these little essays came out of that work.

I think it suits the Arab educational mindset, and incidentally it's how I would prefer to learn. The initial minimal vocabulary would consist of the most common nouns, adjectives and regular verbs - plus 'grammar words' like auxiliaries, modals, prepositions, articles etc.

Which leaves only the nagging question: What are the most common nouns etc.? And what's a reliable way to find out what they are?

I've tried various methods. One was to take the subtitles of 2000 films and BBC TV shows, which I've collected over 2 years, and run them through a word frequency analyser. After trying out several programs, I found they were all either far too expensive...or utter rubbish. Except for one, which was free and did exactly what I needed, but crashed after 37,000 words.

So I tried writing a script which took the same corpus and counted off the non-plural nouns, non-adverbial adjectives and non-irregular verbs, using word lists that another script had extracted from the COED. It took a week to run and the result was...unusable. Here for instance are the 'top 25 nouns'.

Noun	Frequency
I	326168
it	242586
S	236762
and	229025
T	125990
he	90903
on	89512
me	70876
are	67004
but	65995
so	60612
re	58068
do	57479
not	57186
No	54522
M	52176
at	48153
don	47514
she	43154
as	42600
if	39252
go	34934
oh	34120
see	30611
yeah	28665
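
Fragments such as 'S', 'T', 're', 'M' and 'don' look like the halves of contractions ('it's', 'don't', 'we're', 'I'm') split at the apostrophe. That is a guess, since the counting script itself isn't shown here. The Python sketch below, run on two made-up sample lines, contrasts a tokenizer that splits on every non-letter with one that keeps apostrophes inside words. The function names are invented for the illustration.

```python
import re
from collections import Counter

# Splits on anything that is not a letter, so "don't" becomes "don" and "t".
NAIVE_RE = re.compile(r"[A-Za-z]+")
# Keeps an apostrophe when it sits between letters, so "don't" stays whole.
CONTRACTION_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def naive_tokens(line):
    """Lower-case the line and return every run of letters."""
    return NAIVE_RE.findall(line.lower())


def contraction_tokens(line):
    """Lower-case the line and return words, keeping internal apostrophes."""
    return CONTRACTION_RE.findall(line.lower())


def count_words(lines, tokenizer):
    """Count tokens one line at a time, so the whole corpus never sits in memory."""
    counts = Counter()
    for line in lines:
        counts.update(tokenizer(line))
    return counts


# Made-up sample lines, not taken from the subtitle corpus.
sample = ["It's not what I'm after.", "Don't go, we're not done."]

naive = count_words(sample, naive_tokens)
careful = count_words(sample, contraction_tokens)

# Tokens that exist only because contractions were cut in two.
print(sorted(set(naive) - set(careful)))
# ['don', 'i', 'it', 'm', 're', 's', 't', 'we']
```

Even with the contractions kept whole, words like 'I', 'it' and 'and' would still need a part-of-speech check before they could be kept out of a noun list.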

Now, there are plenty of other people, much more clever and expert than me, with much more computing power, who are doing similar work. This fellow has produced some very nice word frequency graphs using Google's Ngrams - a frequency list which Google produced from works on their Google Books service. (Hat tip to Ben Goldacre, whose essay on the subject I found by accident while taking a break from staring at wordlists.)

So I thought: Why don't I filter the Ngram list, omitting everything that isn't listed as a noun, verb or adjective in the COED, then sort the result into nouns, verbs and adjectives (again, as defined in the COED), and order the output by Google's estimate of frequency? And take out the irregular verbs.

Well...this is the result:

Nouns	Adjectives	Regular Verbs
and	the	of
to	to	in
in	in	a
is	a	be
that	that	on
for	it	or
it	on	but
as	not	have
was	he	you
with	this	were
by	his	one
on	which	can
not	have	more
he	they	will
i	you	no
this	their	may
are	one	time
or	all	do
but	her	out
have	more	up
an	will	see
you	so	over
were	no	like
one	she	even
all	other	well

Hm. 'And' is a noun? 'To' is an adjective? 'Over' is a verb? Well, yes.

An 'AND Gate', or just 'An AND', is something you'll find in electronic circuits - it takes two electrical inputs, and if they're both 'on', it'll output an 'on' signal. So 'and' is a noun.

'To' is an abbreviated form of the Middle English pronoun 'Tone', which is a contraction of 'The one' - as in 'The one or the other'. The fact that no one's used it for at least 500 years is just a detail.

To 'Over' means to jump over something, or to get the better of someone. And it's Late Middle English, so almost as obsolete as 'To'.
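
The filter, and the way rare senses slip through it, can be sketched in a few lines of Python. Everything here is a stand-in: the tiny DICTIONARY imitates part-of-speech entries of the kind taken from the COED, the 'label' field is a hypothetical way of marking obsolete or specialist senses, and the frequency figures are invented, not taken from the Ngram data.

```python
from typing import NamedTuple


class Sense(NamedTuple):
    pos: str    # "noun", "verb" or "adjective"
    label: str  # "" for an everyday sense, else e.g. "obsolete" or "technical"


# Toy dictionary (hypothetical): word -> senses recorded for it.
DICTIONARY = {
    "and": [Sense("noun", "technical")],      # the logic gate
    "to": [Sense("adjective", "obsolete")],
    "over": [Sense("verb", "obsolete")],
    "time": [Sense("noun", "")],
    "good": [Sense("adjective", "")],
    "see": [Sense("verb", "")],
    "look": [Sense("verb", "")],
}

# Invented counts, only there to give the words an order.
FREQUENCIES = {"the": 1000, "and": 900, "to": 800, "over": 120,
               "time": 100, "good": 60, "see": 50, "look": 40}

# A short stand-in for a full irregular verb list.
IRREGULAR_VERBS = {"be", "have", "do", "go", "see"}


def top_words(pos, everyday_only, limit=25):
    """Return the most frequent words that the dictionary lists under `pos`."""
    ranked = sorted(FREQUENCIES, key=FREQUENCIES.get, reverse=True)
    result = []
    for word in ranked:
        if pos == "verb" and word in IRREGULAR_VERBS:
            continue  # irregular verbs are taken out
        senses = DICTIONARY.get(word, [])
        if any(s.pos == pos and (s.label == "" or not everyday_only)
               for s in senses):
            result.append(word)
    return result[:limit]


print(top_words("noun", everyday_only=False))  # ['and', 'time']
print(top_words("noun", everyday_only=True))   # ['time']
print(top_words("verb", everyday_only=False))  # ['over', 'look']
print(top_words("verb", everyday_only=True))   # ['look']
```

With everyday_only=False, any word with even one noun sense counts as a noun, however rare that sense is, which is how 'and' reaches the top of the list.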

So what I actually need is a frequency list of exclusively modern English. Like the lists given at Word Detail. Here are their top 25s:

Nouns	Adjectives	Verbs
time	good	be
year	new	have
people	old	do
way	great	say
man	high	get
day	small	make
thing	large	go
child	long	see
mr	young	know
government	right	take
work	early	think
life	big	come
woman	late	give
system	full	look
case	far	use
part	low	find
group	bad	want
number	sure	tell
world	clear	put
house	likely	mean
area	real	become
company	black	leave
problem	white	work
service	free	need
place	easy	feel

'Government'? 'Mr'? 'Great'? What kind of corpus were they using that places 'Government' as the 10th most common noun? It's number 145 in the Google Books corpus, just above 'States' (159) and 'Thus' (170), and I'm fairly sure these words aren't in my top 500.

So it seems the answer to the question is another question. What are the most common words used in the English language? Answer: Used by who?
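
One way to see the 'used by who?' problem is to look up the same word's rank in more than one frequency list. The Python sketch below does that. The two corpora and their counts are invented toy values, chosen only to show a word ranking high in one list and low in another.

```python
def ranks(freqs):
    """Map each word to its rank, 1 being the most frequent."""
    ordered = sorted(freqs, key=freqs.get, reverse=True)
    return {word: position for position, word in enumerate(ordered, start=1)}


def compare(word, corpora):
    """Return the word's rank in every corpus (None if it is absent)."""
    return {name: ranks(freqs).get(word) for name, freqs in corpora.items()}


# Invented counts for two imaginary corpora.
corpora = {
    "news": {"government": 50, "time": 40, "yeah": 1},
    "subtitles": {"yeah": 60, "time": 45, "government": 2},
}

print(compare("government", corpora))  # {'news': 1, 'subtitles': 3}
print(compare("time", corpora))        # {'news': 2, 'subtitles': 2}
```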

The usual intuition of the layperson is that the most common nouns are words like 'knife', 'fork', 'spoon', 'plate' and 'cup'. But these are common things, not common topics of conversation. When was the last time you discussed your cutlery?

My own intuition was that the most common verbs would include 'push', 'pull', 'take', 'drop' and 'put'. But as I've found in teaching, students are more likely to know 'sit down' than 'sit', and 'give up' than 'give'. At least, my students - someone else's will be different.

Something else I could try is the WordNet project, which doesn't so much collect words as meanings - like a thesaurus, but aimed at linguists instead of bad novelists.

Perhaps for my purposes it doesn't matter that I use the commonest words, because I'm using them as a way to teach grammar first and communication second. So maybe this has all been a bit pointless. But pointless journeys have a habit of being more interesting than...er, pointed ones.
