18 December 2010

Culturomics

I came into my room approximately three hours ago to pack, clean, and watch a Psych episode (because the activity needed to be made a little more enjoyable).


I'm still in bed under the covers with no packing, no cleaning, and no Psych. I'd say it's because I feel sick (which I do), but the truth is probably that I'm lazy (which I am). But I'll take advantage of my supine position and tell you what I think about Culturomics. Because I definitely have an opinion now.

First, let's define both corpora and Culturomics. A corpus is a collection, a body (notice the Latin base). A linguistic corpus, then, is a collection, or body, of words, usually in a specific language and covering a specific period of time. There are multiple linguistic corpora in existence, one of the most famous and well-used being the Helsinki Corpus. Culturomics is a linguistic corpus built from the Google Book Scan project. For those unaware, the idea behind Google Book Scan was to literally scan the books in a variety of academic libraries and then create an online library so they would be available electronically. Unfortunately (or not, depending on your viewpoint), a while back Google ran into legal issues over the copyrights to a lot of the scanned books, so the project has been put on hold indefinitely. However, from the books that had already been scanned, they created Culturomics. So it is a corpus of 500 billion words, in English as well as a few other languages, found in texts published from 1800 to 2000 that reside in university libraries.

I appreciated the positive things David Crystal said yesterday. He made good points (he usually does), and I think the idea of having a public corpus is a great one. I love linguistic corpora. However, as linguistic corpora go, Culturomics has some huge flaws. It cannot recognize semantic or syntactic differences in words (e.g., the noun versus verb forms of run, play, drive, show, etc.). It does not allow you to see the context in which the words/phrases were used. The words/phrases are apparently not properly categorized by date (which is a big aspect of a corpus). And the entries are only from books found in university libraries, so they are probably more academic-focused than pop books, with no magazines, newspapers, websites, or spoken language. Apparently some people have said these problems are not that important, or that even if they are, no other corpus this big exists that can account for those difficulties while searching.

That is true. There is not another 500 billion word corpus in existence. However, there is a 400 million word corpus covering American English from 1810 to 2000. Created by Dr. Mark Davies, the Corpus of Historical American English (COHA) became available this summer, and it works just as well as (if not better than) Culturomics. In fact, I will go so far as to say it is a superior corpus, even if it does have fewer words.
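To make the noun-versus-verb problem concrete, here is a rough Python sketch of what changes when a corpus records part of speech alongside each word. The tokens and tags below are made up by me for illustration; this is not real corpus data, and it is not how either Culturomics or COHA is actually built.

```python
from collections import Counter

# Hypothetical hand-tagged tokens as (word, part of speech) pairs.
# These are invented for illustration, not taken from any corpus.
tagged_tokens = [
    ("run", "NOUN"),
    ("run", "VERB"),
    ("run", "VERB"),
    ("show", "NOUN"),
    ("show", "VERB"),
]

# Untagged view: only the word form is counted, so the noun and
# verb uses of "run" collapse into a single number.
form_counts = Counter(word for word, _ in tagged_tokens)

# Tagged view: each (word, part of speech) pair is counted separately.
tagged_counts = Counter(tagged_tokens)

print(form_counts["run"])
print(tagged_counts[("run", "NOUN")], tagged_counts[("run", "VERB")])
```

With only the first kind of count, a search for "run" cannot tell you whether the noun or the verb is the one changing over time. With the second, it can.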

I emailed Dr. Davies last night to ask his opinion about Culturomics as a linguistic tool, particularly compared to COHA, and he responded by sending me this link. I appreciated the amount of work he has already put into this comparison, and from my experience with COHA I completely agree with his assessment. (If you are interested in how I have used COHA, feel free to visit this site created for my capstone class, English Language 495: A Corpus-based Approach to the History of American English.) While Culturomics sounds impressive, COHA has a better search engine, can give you more definite results, has a broader range of linguistic information (adding to the accuracy of results), and gives more context than Culturomics.

Plus, I am not a fan of the name Culturomics. It is difficult to say, and it does not sound like the name of a linguistic corpus. So while I normally applaud Google endeavors, on this one I'm sticking with COHA. You should too.
