Initially it will be helpful to make a distinction. Are we interested in (1) any word that is part of the document, i.e. any word that plays a role in building the document and the meaning it carries, or (2) are we aiming for the crux of the matter (in German: "der springende Punkt")?

Ad 1. It is straightforward to identify all the unique words in a document and calculate their frequencies. It is even possible to predict the size of the unique vocabulary from the total number of words. I have found this formula to be generally applicable (R² value 0,97):

n = total number of words
f(n) = the estimated number of unique words
f(n) = 4,1 * n^0,67

Note: the exponent means that, on the margin, a 1% increase in the length of the document can be expected to add about 0,67% to the number of unique words. If the text is quite short, the proportion of new words is higher; in a longer text there may be hardly any new words, since predominantly existing words are re-used. This is how the general relationship between length of text and number of unique words looks.

Here is an example of a list containing all the words in a text. In this text the most frequent word is "of" (45 occurrences), although in a general list of words in English texts it holds only rank no. 9. Further down the list, the word "ecology" holds rank no. 13. This is somewhat interesting, since this word on average holds rank no. 33.640. I invite you to extract this information for your own texts with Quantitative and Qualitative Research Software & Services - this is the function you want to use (marked in yellow).

In the above list, at rank 17, you encounter a "long-word" (meaning a word-expression with one or more spaces inside it):

Rank 17 - "steps to an ecology of mind" - 5 occurrences

This long-word is an example of a named entity that consists of two or more "short-words". The tool offers not just "short-words" but also a number of such "long-words".

Ad 2.
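The word counting and the vocabulary-size formula under Ad 1 above can be sketched in a few lines of Python. This is a minimal sketch: the regex tokenizer and the sample text are my own illustrations, not the method of any particular tool.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count occurrences of every unique "short-word" (case-insensitive)."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return Counter(words)

def estimated_unique_words(n):
    """Estimated unique vocabulary size: f(n) = 4.1 * n^0.67."""
    return 4.1 * n ** 0.67

# Tiny illustrative text (not the document analysed above).
text = "of the mind an ecology of mind steps to an ecology of mind"
freqs = word_frequencies(text)

print(freqs.most_common(3))                 # [('of', 3), ('mind', 3), ('an', 2)]
print(round(estimated_unique_words(10_000)))  # 1962
```

For a 10.000-word text the formula thus predicts roughly 1.960 unique words.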
At this point you may be wondering if and how "long-words" can be useful for identifying and extracting keywords as the crux of the matter. Let me offer this rule of thumb: the number of potentially interesting keywords is equal to the number of expected unique words times 1,5, or perhaps 2. For example, for a text of n = 10.000 words, f(n) is roughly 1.960 and f(n)*1,5 roughly 2.940.

You may argue that there are many "trivial" words (e.g. function words such as pronouns, modal verbs etc.) that do not meet the criterion of a keyword. It has been my experience that only by carefully looking at even the finer details can you be confident that nothing of potential importance has been overlooked. You may also argue that close to 3.000 keywords in a single text is a (too) huge amount of detail (I would agree!). Fortunately, there are supplementary approaches that may be put to good use.

(a) Looking for words that are unusually frequent. These are the first 13 words (only counting "short-words") that may be key to the document.

(b) Looking for topics that characterize the document. This pie chart presents the relative size of twenty-some topics that may go a long way toward identifying what is key about the document. Here is what emerges ... if we open up the "thinking/cognitive" master-topic, we get this.

In summary: look for frequent short-words, check for long-words, and use topics to create an overview.
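One way to operationalize "unusually frequent" under (a) is to compare each word's frequency in the document with its frequency in general English, as in the "of" vs. "ecology" example above. A minimal sketch, assuming a reference frequency list is available; the per-million figures below are illustrative placeholders, not real corpus values.

```python
from collections import Counter

# Hypothetical general-English frequencies (occurrences per million words).
# A real analysis would substitute a corpus-based frequency list.
GENERAL_FREQ_PER_MILLION = {"the": 56000, "of": 29000, "mind": 130, "ecology": 4}

def keyness(doc_counts, doc_total):
    """Rank words by how much more frequent they are in the document
    than in general English (a simple relative-frequency ratio)."""
    scores = {}
    for word, count in doc_counts.items():
        doc_per_million = count / doc_total * 1_000_000
        general = GENERAL_FREQ_PER_MILLION.get(word, 1)  # floor for rare words
        scores[word] = doc_per_million / general
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative counts for a 10.000-word document.
counts = Counter({"the": 60, "of": 45, "mind": 30, "ecology": 20})
print(keyness(counts, 10_000))  # ['ecology', 'mind', 'of', 'the']
```

Note how "ecology", although far from the most frequent word, comes out on top: it occurs vastly more often here than in general English, which is exactly what makes it a candidate keyword.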
The "Mind as a Machine" (MAM) approach to writing and language is just one of many examples of this new way of approaching content.