Jeffery M. Binder’s ‘Alien Reading’ introduces us to the controversial and largely uncharted world of text mining and language standardization. In an age when written information is growing at an explosive pace, the prospect of quickly breaking down, categorizing, and locating snippets of text is extremely compelling for researchers and linguists. The difficulty, however, lies in the fluidity of language itself. Converting language into data so that it can be subjected to statistical analysis is an inherently fraught task: language is dynamic and constantly changing, and what a word or phrase means to one person may mean something completely different to another. Any method of standardization is therefore controversial. The issue recurs wherever models “overfit” to language, tuning themselves so closely to how words are used in one corpus that they misread the same words used differently elsewhere. Text mining and language standardization need to strike a balance: fast and conclusive enough to be useful, while still accounting for the ever-moving nature of language.
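To make the standardization problem concrete, here is a minimal sketch (my own illustration, not an example from Binder) of the kind of preprocessing a text-mining pipeline performs before counting words. Even a crude normalizer like this one embeds interpretive choices: distinct word forms collapse into a single count, and the distinctions the original sentence drew are gone.

```python
from collections import Counter

def normalize(token):
    """A crude standardization step: lowercase, strip punctuation,
    and remove a common suffix. Real pipelines (stemming,
    lemmatization) make the same kind of interpretive choice."""
    token = token.lower().strip(".,;:!?\"'")
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "The stream streams past; streaming crowds stream into the city."
tokens = [normalize(t) for t in text.split()]
print(Counter(tokens))
# 'stream', 'streams', and 'streaming' all collapse to 'stream':
# the count is tidy, but the nuance of the sentence is lost.
```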

In addition, text mining faces issues of context. When models rely on words, their spellings, and their dictionary definitions, the algorithms run into the problem of true meaning: the same string can refer to very different things. This phenomenon surfaces in Matthew Jockers’s book Macroanalysis, where we see a “particular use of stream [that] is not related to the ‘jet stream’ or to the ‘stream of immigrants’ entering the United States in the 1850s.” Rather, this stream refers to running water.
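A bag-of-words representation makes this context problem visible. In the sketch below (an illustrative example under my own assumptions, not Jockers’s code), each document is reduced to token counts, so the “stream” of running water and the “stream” of immigrants become the same feature:

```python
from collections import Counter

docs = [
    "the cold stream ran down from the mountain",
    "a stream of immigrants entered the port in the 1850s",
]

# Bag-of-words: each document becomes a count of its tokens.
bags = [Counter(doc.split()) for doc in docs]

for bag in bags:
    print(bag["stream"])  # both print 1

# To the model, 'stream' is one and the same feature in both
# documents; without extra context (surrounding words, sense
# disambiguation), it cannot tell running water from a stream
# of people.
```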

Between overfitting and misjudged context, text-mining algorithms face serious obstacles. If they continue down this path without serious consideration and critical analysis by humans on the other side, these algorithms could be responsible for a great deal of confirmation bias down the line. One can easily imagine an algorithm sacrificing nuance for efficiency, leading to a serious misuse of information.