M1: Unit 1: Q6.2 Answer



This is a matter of counting not words as such, but the number of different words in relation to the total number of words. To do this, corpus studies distinguish between ‘types’ and ‘tokens’.

A ‘token’ is any and every word, counted every time it occurs.

A ‘type’ is each new word counted only the first time it occurs.

The sentence ‘The dog bit the man’ contains 5 tokens (i.e. 5 words) but only 4 types, namely: ‘the’, ‘dog’, ‘bit’ and ‘man’. The word ‘the’ occurs twice (2 tokens, 1 type). Think of a text of 100 words in which, say,

2 words each occur 10 times = 2 types, 20 tokens

5 words each occur 5 times = 5 types, 25 tokens

5 words each occur 3 times = 5 types, 15 tokens

10 words each occur twice = 10 types, 20 tokens

20 words occur just once = 20 types, 20 tokens

TOTAL = 42 types, 100 tokens
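
The counts above can be checked with a short Python sketch. It is only an illustration: the function name `count_types_and_tokens` is made up for this example, and it assumes that words are separated by spaces and that ‘The’ and ‘the’ count as the same type. Real corpus tools make more careful decisions about what counts as a word.

```python
from collections import Counter


def count_types_and_tokens(text):
    """Return (types, tokens) for a text.

    Simplifying assumptions: words are split on whitespace, surrounding
    punctuation is stripped, and case is ignored.
    """
    words = [w.strip(".,;:!?'\"()").lower() for w in text.split()]
    words = [w for w in words if w]   # drop anything that was pure punctuation
    counts = Counter(words)           # one entry per distinct word (type)
    return len(counts), len(words)    # types, tokens


types, tokens = count_types_and_tokens("The dog bit the man")
print(types, tokens)  # 4 5

# The imagined 100-word text, written as (number of words, occurrences of each).
distribution = [(2, 10), (5, 5), (5, 3), (10, 2), (20, 1)]
total_types = sum(n for n, _ in distribution)
total_tokens = sum(n * k for n, k in distribution)
print(total_types, total_tokens)  # 42 100
```

Each line of the list contributes its number of words to the type count and that number multiplied by the number of occurrences to the token count, which is how the totals of 42 and 100 arise.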

If we set alongside this another text of 100 words (i.e. 100 tokens) in which the number of types was lower than 42, it would mean that this other text had a smaller number of different words and that the same words were repeated more often.
If we had a third text of 100 words and it showed a number of types higher than 42, it would mean it had more different words than our first text, hence a larger vocabulary, and fewer repetitions.
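
A comparison of this kind can be sketched in Python as the number of types divided by the number of tokens. The function names and the two toy word lists below are invented for illustration. Cutting both texts to the same length is an assumption that mirrors the comparison of 100-word texts described here.

```python
def type_token_ratio(words):
    """Number of distinct words (types) divided by total words (tokens)."""
    if not words:
        raise ValueError("cannot compute a ratio for an empty text")
    return len(set(words)) / len(words)


def compare_vocabulary(words_a, words_b):
    """Return the ratios of two texts, measured on samples of equal length."""
    n = min(len(words_a), len(words_b))   # cut both texts to the shorter length
    return type_token_ratio(words_a[:n]), type_token_ratio(words_b[:n])


# Two made-up five-word texts, already lowercased and split into words.
first = "the dog bit the man".split()     # 4 types, 5 tokens
second = "the dog bit the dog".split()    # 3 types, 5 tokens

ratio_first, ratio_second = compare_vocabulary(first, second)
print(ratio_first, ratio_second)  # 0.8 0.6
```

The text with the lower ratio uses fewer different words and repeats them more often. For the 100-word text described above, the same calculation would give 42 / 100 = 0.42.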

Now think of entire collections of original texts and translations, all stored electronically. The statistics show, with a fair degree of consistency, that original texts have a higher type/token ratio (the number of types divided by the number of tokens) than translations. In other words, originals tend to feature a wider vocabulary and fewer repetitions than translations. Translations use fewer different words and recycle the same words more often.