What I thought of the reading:
This week’s readings were enlightening because they demonstrate how digital tools are useful not only in presenting history to the public or other audiences, but also in the process of researching and creating historical scholarship.
Franco Moretti’s Graphs, Maps, and Trees was a nice introduction to what exactly can be done with manipulating and visually presenting historical data. For Moretti, visualizations of trends, patterns, and cycles in literary history do not replace close reading of individual texts. Rather, they add new layers of information, and sometimes even debunk generally held assumptions about literature’s history. Tim Burke praises Moretti’s approach for showing how quantitative data about literature can problematize many commonplace assumptions about it. However, Burke cautions that, while numbers can seem quite concrete and infallible, they can still be misleading. For example, quantifying publication does not actually tell us about readership. He also criticizes Moretti’s lack of emphasis on authors’ agency and on the breaks and ruptures (as opposed to gradual divergence) in literary history. Still, I think Moretti is useful in demonstrating how these tools can be applied not just in the social and hard sciences, but also in the humanities. Burke’s criticisms show that despite these visualizations’ seeming authoritativeness, the way in which they are interpreted or presented is still quite subjective.
While Moretti mostly deals with publication data for various genres, the rest of the authors focus on data mining specific texts or corpora of texts in order to analyze them in new ways. Daniel Cohen and Gregory Crane focus on the new scholarly opportunities presented by large digital collections such as Google Books or Project Gutenberg. In conjunction with close examination of a limited number of texts, scholars who use various data mining/text mining tools can, in the words of Cohen, “find patterns, determine relationships, categorize documents, and extract information from massive corpuses.” For example, one might perform a statistical analysis of how often two keywords or phrases appear together, or find specific types of documents (such as syllabi) by assessing frequently used words in these texts.
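To make the keyword co-occurrence idea concrete, here is a minimal sketch of that kind of statistic in Python. It simply counts sentences in which both terms appear, which is one crude way to operationalize "appearing together"; the sample text and the sentence-level window are my own assumptions, not anything from Cohen's actual tools.

```python
import re

def cooccurrence(text, pair):
    """Count sentences in which both keywords of `pair` appear.
    A rough, sentence-level proxy for keyword co-occurrence."""
    a, b = (w.lower() for w in pair)
    count = 0
    for sentence in re.split(r"[.!?]+", text):
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        if a in words and b in words:
            count += 1
    return count

# toy example text (hypothetical)
sample = ("The king rode to the castle. The queen stayed home. "
          "The king and the queen met at the castle.")
print(cooccurrence(sample, ("king", "castle")))  # 2
print(cooccurrence(sample, ("queen", "castle")))  # 1
```

A real project would use a larger window (paragraphs, fixed word spans) and normalize by how often each term appears alone, but the underlying counting is this simple.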
Unfortunately, these large digital libraries can have some drawbacks, such as “noise” from incorrect OCR, missing texts due to copyright restrictions or the cost of digitization, and an inability to present or crawl texts in non-Roman alphabets. For these reasons, scholars need to be careful about drawing conclusions from potentially incomplete data sets.
Trying it out myself:
Playing around with some web-based text mining tools, I found it obvious that some of them are better suited to entertainment than to serious scholarship. Wordle, which generates text clouds of the most frequently used words in a document, creates aesthetically pleasing visualizations. However, aside from giving a general idea about the topics or keywords of a text, I am not sure that this tool has any serious scholarly use. Here is my text cloud for Grimm’s Fairy Tales:
Another tool which was entertaining but probably not statistically sound is Google’s Ngram Viewer. Because you cannot control which texts are included in the analyzed corpus, the data may be misleading. However, for general information rather than scholarly purposes, the Ngram Viewer can give a nice idea of when certain terms may have come in and out of fashion. For example, in the Ngram below, you can see the shift from using the term Great War to the term World War:
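Under the hood, what the Ngram Viewer plots is essentially the relative frequency of a phrase among all phrases of that length published in a given year. Here is a minimal sketch of that computation over a tiny hand-made corpus of year-keyed texts; the corpus, the function name, and the whitespace tokenization are all my own illustrative assumptions, not Google's actual pipeline.

```python
def phrase_freq_by_year(corpus, phrase):
    """For each year, the relative frequency of `phrase` among all
    n-grams of the same length in that year's texts -- roughly the
    quantity an n-gram viewer plots over time."""
    target = tuple(phrase.lower().split())
    n = len(target)
    freq = {}
    for year, texts in corpus.items():
        hits = total = 0
        for text in texts:
            words = text.lower().split()
            ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
            total += len(ngrams)
            hits += ngrams.count(target)
        freq[year] = hits / total if total else 0.0
    return freq

# hypothetical toy corpus: year -> list of texts from that year
corpus = {
    1920: ["the great war ended two years ago", "memories of the great war"],
    1950: ["the world war reshaped europe", "after the world war came recovery"],
}
print(phrase_freq_by_year(corpus, "great war"))
print(phrase_freq_by_year(corpus, "world war"))
```

Even this toy version makes the caveat visible: the numbers depend entirely on which texts happen to be in the corpus for each year, which is exactly why the real viewer's fixed corpus can mislead.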
Because of the user’s ability to choose texts and because of its myriad analytical tools, Voyant was the most promising tool for scholarly research. I chose to analyze the same Grimm’s Fairy Tales text I tried in Wordle, available through Project Gutenberg. I like how the user can manipulate the data provided by Voyant in many ways. Not only can you see the most frequently used words, but you can also compare the frequency of two words against each other and see words in context. Voyant also provides a word cloud, which seems to be generated using a different algorithm than Wordle’s, as they came out differently.
Although I felt like I couldn’t take full advantage of Voyant’s tools since I wasn’t undertaking an actual text-mining project, I did find it interesting that Voyant identified “said” as the most frequently used word in Grimm’s Fairy Tales. This might say something useful about the structure of the tales or how the narrative action is pushed forward. As you can see above, Wordle actually eliminated “said” from its word cloud, perhaps because it is too commonly used; this shows how a lack of control over the algorithms and data filtering in tools like Wordle and the Ngram Viewer can lead to misleading results.
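The Wordle/Voyant discrepancy comes down to stop-word filtering: a raw frequency count surfaces "said", while a tool that silently drops a list of common words never shows it. A small sketch of that effect, assuming a made-up sample text and a made-up stop list (including "said", as Wordle apparently does):

```python
import re
from collections import Counter

# hypothetical stop list -- Wordle's actual list is not published
STOPWORDS = {"the", "and", "a", "of", "to", "he", "she", "it", "was", "in", "said"}

def top_words(text, n=3, drop_stopwords=False):
    """Most frequent words in `text`, optionally filtered through a
    stop list -- the kind of silent filtering a word-cloud tool applies."""
    words = re.findall(r"[a-z']+", text.lower())
    if drop_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return Counter(words).most_common(n)

# invented dialogue-heavy snippet in the style of a fairy tale
tale = ('"Run," said the wolf. "No," said the girl. '
        '"Come closer," said the wolf to the girl.')
print(top_words(tale))                       # raw counts: "the" and "said" dominate
print(top_words(tale, drop_stopwords=True))  # filtered: content words surface
```

Whether dropping "said" is a harmless convenience or the erasure of a genuinely interesting structural feature depends entirely on the research question, which is why controlling the filtering yourself (as Voyant allows) matters.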