The effect of corpus size on identifying the highest frequency words: an analysis of frequency data from five corpora

2017. № 2 (34), 70-91

Gota Sayama, Tokyo University of Foreign Studies; Tokyo, Japan


This paper addresses the definition of highest-frequency words as the concept is used in foreign language teaching and computational linguistics. Its aim is to establish the range of core lexical items using corpus data. A method for delimiting this core was developed on the basis of samples of one million tokens each, and the stability of the vocabulary across these samples was evaluated. We created five corpora, each containing one million running words and constituting about one-hundredth of the Russian National Corpus collection used in the frequency dictionary. The Russian National Corpus is considerable in size and well balanced in genre coverage, and thus offers fairly reliable information about word frequency in the written language. Each of our five corpora reproduces the genre balance of the frequency dictionary and thus serves as a miniature Russian National Corpus. The proportion of overlapping lexical items was analyzed by comparing the frequency data from these five corpora with the data from the Russian National Corpus, using the method of A. Kilgarriff. One million tokens were found to be sufficient to describe approximately the first 1,500 words in frequency rank order.
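
The overlap comparison described in the abstract can be illustrated with a minimal sketch. It assumes that "overlapping proportion" means the share of the top-N items of a sample corpus that also appear in the top-N items of a reference corpus. This is a simplification for illustration only, not a reproduction of Kilgarriff's method or of the procedure used in the paper. The function names and the toy token lists are hypothetical.

```python
from collections import Counter


def top_n(tokens, n):
    """Return the set of the n most frequent items in a token list."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def overlap_proportion(sample_tokens, reference_tokens, n):
    """Share of the reference top-n items that also occur in the sample top-n.

    Assumption: both token lists are already lemmatized, and n is positive
    and no larger than either vocabulary.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    shared = top_n(sample_tokens, n) & top_n(reference_tokens, n)
    return len(shared) / n


# Toy data standing in for a reference corpus and a one-million-token sample.
reference = "a a a a b b b c c d".split()
sample = "a a a c c b".split()

print(overlap_proportion(sample, reference, 2))
print(overlap_proportion(sample, reference, 3))
```

With real data, one would compute this proportion for increasing values of N for each of the five samples and look for the rank at which agreement with the reference list begins to fall off.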