August 17: Data p.2

The other tool I used through Voyant was Links. This was a different view from the previous visualization because it did not just present the most frequently used words, but searched out how they were connected. This type of visualization is useful in helping the reader or researcher see context and begin to look at alternative ways of broadening a search query. For example, “John” is now connected to “mother”. If we continued to trace either John or mother, we could potentially see what John does or why he is mentioned so many times, and take a deeper reading to figure out how he connects to mother. Similarly, if our focus was on mother, we could explore how she is related to John and start to ask broader questions. In this case, we could ask why the mother is identified as John’s mother, raising questions about identity, gender, power, representation, and the role of newspapers/the printed word in perpetuating or addressing these issues. If we were looking at a huge collection of texts instead of one, we could follow Michelle Moravec’s example in her topic-modelling analysis of feminism and its representation, and undertake a deeper analysis.
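The kind of connection a Links-style view draws can be thought of as co-occurrence counting: for each appearance of a focus word, tally the words that appear near it. A minimal sketch of that idea, with an illustrative sample sentence and window size (these are my assumptions, not Voyant’s actual method):

```python
# Count which words appear within a small window around each occurrence
# of a focus word -- a rough stand-in for a "Links"-style connection.
from collections import Counter
import re

def collocates(text, focus, window=5):
    """Count words appearing within `window` tokens of each `focus` hit."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == focus:
            lo, hi = max(0, i - window), i + window + 1
            # Update counts with neighbours, skipping the focus word itself.
            counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

sample = "John went to see his mother. John wrote to his mother again."
print(collocates(sample, "john").most_common(3))
```

On a real corpus, a high co-occurrence count between “john” and “mother” is what would prompt the tool to draw a link between them.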

Cautions to consider with this type of analysis include the possibility of program errors in connecting words, and, in the case of this data in particular, the concern about poorly OCR’d text and missed or unrecognisable words. In order for this chart to be more than merely illustrative, it would have to be used critically, and the researcher would have to take great care and effort in cleaning up the data for more rigorous scholarly use. In addition, the discussion we had previously about the dangers of being uncritical when using the “John” results applies here as well, because this is really a different side of the same kind of data analysis.


I then explored the Terms tool. This was a list of all the words and their frequencies. I selected all the terms that occurred eight or more times. I chose this number because it reflected the lower end of the “frequent” word list and represented 136 individual terms. That seemed enough to experiment with, but not so many that I would get bogged down.
When I exported it as a tab-separated file (so I could turn it into a csv and use it later), it actually exported all 600 terms. I double-checked what I had clicked on and re-did it, making sure to manually select each of the 136 words I wanted. This is an important note about digital history: it is slow, finicky work, and trying to rush through it or work on things you haven’t taken the time to understand is not useful in the long run.
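The same threshold-and-export step could also be done in a few lines of code, which avoids the manual re-selection problem. A minimal sketch, assuming the terms are already in memory as (term, count) pairs; the sample data and the output filename are illustrative:

```python
# Keep only terms that occur eight or more times, then write them to a CSV
# for later use. The threshold of 8 matches the cutoff chosen above.
import csv

def filter_terms(rows, min_count=8):
    """Keep (term, count) pairs whose count meets the threshold."""
    return [(term, n) for term, n in rows if n >= min_count]

rows = [("john", 42), ("mother", 15), ("friend", 9), ("grocery", 3)]
frequent = filter_terms(rows)

with open("frequent_terms.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["term", "count"])
    writer.writerows(frequent)
```

Scripting the filter also makes the cutoff easy to revisit later: changing `min_count` regenerates the list without any clicking.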

I turned it into a .csv file, recognising that it now had the correctly-sorted values to be used appropriately in Google’s OpenRefine. However, it wasn’t to be. I hadn’t thought it through enough to recognise that there were actually only two columns, word and frequency, and that unlike the tutorial’s Texas correspondence, there wasn’t actually anything that needed “cleaning up” in the top 136 words. This is yet another example of how different tools are suitable for different projects, and it is our job as digital history researchers to be aware of this. Part of doing the work is making sure we understand the tools available and picking the right one. So while using OpenRefine may have made sense at the beginning of the project, it wasn’t possible then; now it was possible, but no longer made sense.


As an interesting side note, I did do a little playing around with the .csv in OpenRefine, specifically with the “text facet sort” function. I wanted to see if it would find two words that were similar enough that they should be merged to increase their word count. It found “friend” and “friends”. I clicked that they should merge, but on second thought realised that in the context of this newspaper, it might actually be unethical to do this.

Unlike a series of correspondences where the names were almost certainly the same, the key words were not necessarily within the same context. “Friend” might be part of a sentence that said “John went to visit his dear friend last week”, whereas “Friends” might refer to people in the more formal and impersonal voice: “at this grocery store, we treat all our customers as friends”. I undid the merge and closed the file.
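One way to check whether a merge is defensible before committing to it is a keyword-in-context (KWIC) pass: pull each occurrence of both forms with a few words of surrounding context and read them side by side. A minimal sketch, using the two example sentences above:

```python
# Show each occurrence of a word with a few tokens of context, so the
# researcher can judge whether "friend" and "friends" are used the same way.
import re

def kwic(text, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == word:
            lo = max(0, i - width)
            hits.append(" ".join(tokens[lo:i + width + 1]))
    return hits

text = ("John went to visit his dear friend last week. "
        "At this grocery store, we treat all our customers as friends.")
print(kwic(text, "friend"))
print(kwic(text, "friends"))
```

Seeing the two contexts printed next to each other makes the problem concrete: one is personal, the other commercial, and a blind merge would erase that distinction.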

Upon deeper reflection, I realised I was confusing OpenRefine and Gephi. Gephi is a network analysis tool. If this project were continuing forward, I would have used it to continue my exploration of words and how they are connected re: the questions I am asking. I would also look briefly at topic modelling to see if the computer program (antconc?) makes the same connections between words as the “Links” tool and my own analysis do.
