DH Lab 12: Text Analysis

Text analysis has been a popular form of computational analysis since its inception. Whether you support close reading, distant reading, or a healthy mixture of both, there is always something to be learned from evaluating, comparing, and considering the words used by scholars, authors, poets, and anyone in between.

Voyant Tools is a popular online resource for analyzing digital texts. Any user can upload a text source and then play with the various visualizations the site offers. Each visualization shows a different relationship among the digitized words, and the views can be connected and presented in unique ways. The image above shows the first page Voyant produces after analyzing the American Medical Association (AMA) Journal of Ethics, July 2018 edition. The screenshot shows the page exactly as it first appeared; I did not make any edits or refine any key terms, which is why abbreviations like “dr” are visible.

The Cirrus tool generates these word clouds: the larger the word, the more frequently it appears in the selected text. The AMA example uses the terms “health”, “spirituality”, “religious”, and “medical” very often, which makes sense considering the edition is about integrating spirituality into healthcare. While interesting and visually appealing, this tool is really only effective for recognizing key terms. It also does not combine terms that share meaning, like “spiritual” and “spirituality” or “patient” and “patients”. I assume this could be adjusted, but as it stands the tool misrepresents how often particular concepts appear, because related word forms are not written exactly the same and so are counted separately.
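
To make that counting issue concrete, here is a minimal Python sketch of how word frequencies are tallied. It is not Voyant’s actual code, and the sample sentence and merge_map are hypothetical stand-ins for the kind of lemmatization that would fold related forms together:

```python
from collections import Counter
import re

def term_frequencies(text, merge_map=None):
    """Count word frequencies, optionally folding related forms
    (e.g. 'patients' -> 'patient') into a single term."""
    words = re.findall(r"[a-z']+", text.lower())
    merge_map = merge_map or {}
    return Counter(merge_map.get(w, w) for w in words)

sample = "Spiritual care and spirituality matter to patients; each patient differs."
merges = {"spirituality": "spiritual", "patients": "patient"}

print(term_frequencies(sample).most_common(3))          # related forms counted separately
print(term_frequencies(sample, merges).most_common(3))  # related forms merged into one concept
```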

The Trends tool in Voyant takes the Cirrus tool a step further. Instead of emphasizing frequency alone, the Trends tool shows how a term’s use fluctuates across the text. This is more useful and interesting because it demonstrates the relationship between particular words and, by extension, themes. Consider how the key terms flow similarly until the fifth document segment. There, the terms “health”, “spiritual”, and “spirituality” see a spike, while the term “religious” does the opposite.
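
A rough sketch of what a Trends-style view computes under the hood might look like the following. This is only an approximation of Voyant’s segmenting, not its actual implementation, and the file name and term list are hypothetical:

```python
from collections import Counter
import re

def term_trend(text, terms, segments=10):
    """Split a text into equal segments and count each term per segment,
    roughly how a Trends-style line graph is built."""
    words = re.findall(r"[a-z']+", text.lower())
    size = max(1, len(words) // segments)
    chunks = [words[i:i + size] for i in range(0, len(words), size)][:segments]
    counts = [Counter(chunk) for chunk in chunks]
    return {term: [c[term] for c in counts] for term in terms}

# Hypothetical usage with a locally saved copy of the journal text:
# text = open("ama_july_2018.txt").read()
# print(term_trend(text, ["health", "spiritual", "spirituality", "religious"]))
```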

The x-axis corresponds to each article in the Journal, so the relationship between religion and spirituality could be due to each author’s topic or preference. The fifth article, where the use of “religious” initially drops, is titled “Fostering Discussions when Teaching Abortion and Other Morally and Spiritually Charged Topics” and is followed by articles that are similarly less explicit about religion, like “Training Physicians as Healers” and “Chaplains Roles as Mediators in Critical Clinical Decisions”. While there is still an assumed religious background in these articles, the term “spirituality” takes the lead.

Much like the Trends tool, the Loom tool represents key words graphically based on their frequency of use throughout the written work. It includes more terms than the Trends tool, which makes the overall visualization feel a bit messy but also makes the outliers more noticeable and interesting.

Unlike the Trends and Loom tools, the TermsBerry tool does not represent keywords on an axis. Like the Cirrus tool, the terms stand alone and can be investigated individually. This visualization provides more information than the others, though. As with the line-graph tools, the TermsBerry shows the frequency of use for each term, but it also shows the relationships between terms. This can be accomplished with the Trends and Loom tools (e.g. you can tell from the Trends tool that the use of “health” increases as the use of “religious” decreases across the articles), but the representation is less conventional in the TermsBerry. Hovering over a term like “spiritual” shows that the word is often used alongside “care” and “religious”. The frequency of this co-occurrence is also represented numerically and with a color gradient (when hovering on “spiritual”, you can see that “care” is darker than “religious” and appears 29 times with “spiritual” compared to 16 times for “religious”). In this way, the TermsBerry tool provides viewers with more information about the terminology used in the AMA journal.
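
The co-occurrence counts behind that hover view can be approximated with a simple sliding window. This is a sketch of the general technique, not the TermsBerry’s actual algorithm, and the window size and file name are assumptions:

```python
from collections import Counter
import re

def cooccurrence(text, window=5):
    """Count how often pairs of words appear within `window` words of each
    other, roughly the relationship a TermsBerry-style view shows on hover."""
    words = re.findall(r"[a-z']+", text.lower())
    pairs = Counter()
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs

# Hypothetical usage:
# pairs = cooccurrence(open("ama_july_2018.txt").read())
# print(pairs[("care", "spiritual")], pairs[("religious", "spiritual")])
```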

As Ted Underwood points out in his blog post “Where to start with text mining”, much can be gained from computational text analysis. But he also acknowledges that comparing multiple sources is usually more useful than mining a single one. He reminds his readers, “If you want to interpret a single passage, you fortunately already have a wrinkled protein sponge that will do a better job than any computer. Quantitative analysis starts to make things easier only when we start working on a scale where it’s impossible for a human reader to hold everything in memory” (Underwood 2012, 1). A single text source can be more easily deciphered by a human than by a machine. The AMA source, for example, is a piece I read thoroughly last semester in a single sitting. While computers are much better at working with numbers than I am, the human brain is already familiar with the context of the reading and can interpret which parts of a document shape the relationship between the words and the world. A computer, as Underwood points out, must be taught this context, and so it is more effective when handling large amounts of data.

All of this is not to say that the above visualizations are useless, but they are a bit repetitive and do not offer a particularly unique or insightful perspective on their own. We, as scholars or as knowledge consumers, must attribute meaning to the visualizations, and some of those attributed meanings are more persuasive than others. The key to successfully mining texts is finding which visualization best supports the argument you are trying to make.
