August 17: Presenting the data

Today was a big day because I had finally finished the encoding and was ready to begin using some programs to figure things out about the data.

Ideally, I would like to map people’s names and compare the frequency of “important” vs “regular” people. I also want to see which towns are mentioned most frequently and what type of food is mentioned most.

To recap, these are the basic questions I have asked throughout this project with regard to the TEI work:

  1. How often are well-known people vs. regular people mentioned?
  2. How might this speak to the function or readership of the paper?
  3. How frequently are locations mentioned?
  4. Does this speak to the relative “world” they lived in? 
  5. Are some items sold more than others? What and why?

Because of time constraints, and as a proof of concept, my analysis has focused primarily on the first four questions, treating them as inter-related.


I began my analysis with a program called Voyant, which allows you to enter a file or URL and view word frequencies and similar data visualisations. Within Voyant, I started with Cirrus, which is a word cloud generator. I wanted to see what the most frequently used words were throughout the newspaper. Unfortunately, because of the poorly-done OCR, this word cloud is not entirely accurate, and this needs to be stated up front. Failing to mention it could influence the way researchers approach this tool and lead them to draw wrong conclusions.

This is what the Voyant tool looks like, taken as a screenshot. The Cirrus tool is in the upper left-hand corner.

The top key words include: Mr, new, old, time, years, house, men, coun, shawville, John, law, business, and January.
Taken critically and within the context of a paper written shortly after the start of the new year and largely focusing on recent municipal elections, the results make sense. Elected officials are referred to formally and respectfully. References to “new” include “New” York, “New” Years, new councillors and new buildings/work. Interestingly enough, the name “John” is 7th among the top 25 most frequent words. This means that either the same person is being referred to multiple times, or there are many different people named John. In examining the phrases in which the term occurs, it appears that both are the case. Without doing additional research, it is difficult to know if the name John was popular in the Shawville region during this time, if it was popular among the English-speaking elites, or if some other factor is at play.
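For anyone who wants to reproduce this kind of count outside Voyant, the basic idea behind a word cloud can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the sample sentence are invented here, and Voyant applies its own, much longer stop-word list and tokenisation rules.

```python
import re
from collections import Counter

# Hypothetical, very short stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "is", "for", "on", "at"}

def top_terms(text, n=25):
    """Return the n most frequent words, ignoring case and stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Invented sample text, not taken from the newspaper.
sample = "Mr John Donaldson and Mr John Mather met the new council in January"
print(top_terms(sample, 2))
```

Note that a count like this inherits every OCR error in the text: a misread word is simply counted as a different word.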

It is analysis such as this that helps to create a base from which further research into the questions I asked can be explored. For example, the term “January” is also in the list of frequent words and speaks heavily to the advertising tactics of G. H. Hodgins. Upon closer examination, it becomes clear that Hodgins was trying to beat the post-Christmas sales slump by heavily promoting his business throughout the month of January. The variety of his merchandise was extensive, ranging from dry goods, to furs, to hats and carpets, to clothing, to footwear, to various meats, to baked goods. I was able to establish the range of his offerings by examining the encoded .xml version of my file on the internet, quickly scanning the document for words that appear in purple (encoded for sale). Then I did a quick check of the context and included items mentioned at the same time as either Hodgins or January. This is an example of using this Equity file as a proof of concept. Now that I know that I can use this data in this way, I would be able to combine it with potentially hundreds of other editions and see how frequently Hodgins advertises his products, whether what he sells changes (seasonally or otherwise) and whether the extent or percentage of the sale changes.
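The manual scan described above could, in principle, be automated once many editions are encoded. The sketch below is a hedged illustration only: the `<rs type="sale">` markup and the sample paragraphs are assumptions made for this example, not the tagging scheme actually used in the Equity file. Real TEI files also declare the namespace http://www.tei-c.org/ns/1.0, which is left out here to keep the example short.

```python
import xml.etree.ElementTree as ET

# Invented sample; the <rs type="sale"> tagging is hypothetical.
SAMPLE = """<text>
  <p>G. H. Hodgins January sale: <rs type="sale">furs</rs>, <rs type="sale">hats</rs> and <rs type="sale">carpets</rs>.</p>
  <p>Council met on Monday to discuss the <rs type="place">Bristol</rs> road.</p>
  <p>Fresh <rs type="sale">bread</rs> daily at the bakery.</p>
</text>"""

def sale_items_near(xml_text, keywords):
    """Collect sale-tagged items from paragraphs that mention any keyword."""
    root = ET.fromstring(xml_text)
    found = []
    for para in root.iter("p"):
        # Join all text in the paragraph, including text inside child tags.
        para_text = "".join(para.itertext()).lower()
        if any(k.lower() in para_text for k in keywords):
            found.extend(
                rs.text for rs in para.iter("rs") if rs.get("type") == "sale"
            )
    return found

items = sale_items_near(SAMPLE, ["Hodgins", "January"])
print(items)
```

Using the paragraph as the unit of "context" is a design choice for the sketch; an advertisement that runs across several paragraphs would need a wider window.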

A note of caution is needed, however. As historians, we need to continue to be critical about our research methods and data analysis. Just as we analyse the credibility of a written source, we need to analyse digital sources. For instance, while it may seem fair to conclude that Hodgins sells the items he advertises, it does not follow that we know his inventory. It would be best practice to say we know what items he has chosen to advertise as on sale. While this doesn’t seem like a big deal, it makes a difference to the conclusions we draw about his business, the consumer culture, and the items available for daily use in January 1897. A much more specific and in-depth project focusing on Hodgins, advertising culture, and the economic atmosphere would need to be conducted to draw these types of conclusions fairly.


When I broadened the word frequency to 105 terms (as close to 100 as I could get), it included variations on the words “council/councillor/municipal/elected/mayor”. This points to the strong emphasis in the paper on the recent elections in the surrounding townships and also correlates with the frequency of “Mr.”. Other prominent words included those associated with the January sale: “prices/goods/company/cent/sale/January”. Again, this correlates with the top 25 list and its emphasis.


In terms of the tags I encoded for, I was happy to see that names of towns featured in the top 105 list. Bristol was mentioned 14 times, Arnprior 9 times, London 8 times, York (most likely New York) 9 times, and Ottawa 14 times. Closer reading would need to explore in what context these terms were mentioned. It surprised me to see London and (New) York mentioned so often. On closer examination, it appears many of these references were concentrated in two or three distinct stories about criminal proceedings or internationally significant events.

At a preliminary level, this allows me to answer my initial questions about people and place names. In terms of people, names included John, Donaldson and Armstrong. A quick search through the text using the browser’s search function, or the list of terms and frequencies from Voyant, reveals several interesting things.

First, John and Donaldson are often part of the same name: John Donaldson. All we know about him is that he died the previous year and was married to Margaret Evans, who is now also deceased. This demonstrates some of the difficulty in relying on data analysis tools like this without doing the critical thinking. I know John Donaldson is one person because I took the time to check the names against the document and view them in context. Let me demonstrate this further.

John was a popular name and a search through the text reveals a John Mather, John McGuire, John Stewart, John McLellan, John Coyne, and John McIntyre in addition to Donaldson. Of these men, the tags let us know that the majority of them were wealthy and involved in the business or political world. Now, it does not make sense based on this analysis to conclude that John is a name given to the wealthy. However, it is possible to note based on word frequency and the encoding, that a significant number of people in this paper are named John and that the people named John are largely political influencers. Further analysis could determine their age ranges and perhaps infer if the name was popular during a certain time period. Taken against a much larger range of papers and census data, it might be possible to correlate the name “John” with English-speaking, Protestant, politically-influential people. At this stage, however, it is merely possible to note the curiosity as suggestive of a greater trend.
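Checking a first name against its context, as I did by hand for “John”, can also be sketched in code. The example below is a minimal illustration with an invented sample sentence; it assumes a surname is simply the capitalised word that follows the first name, which will miss middle names, initials, and OCR-damaged spellings.

```python
import re
from collections import Counter

def full_names(text, first_name):
    """Count each surname that directly follows first_name in the text."""
    pattern = re.compile(r"\b" + re.escape(first_name) + r"\s+([A-Z][A-Za-z]+)")
    return Counter(m.group(1) for m in pattern.finditer(text))

# Invented sample sentence, not a quotation from the newspaper.
sample = ("The late John Donaldson was remembered. John Mather and "
          "John McGuire spoke; John Donaldson's estate was settled.")
names = full_names(sample, "John")
print(names)
```

A count like this separates “many mentions of one John” from “many different Johns”, but it cannot tell whether two mentions of the same full name refer to the same person. That still requires reading the passages in context.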

Similarly, an analysis of the tags of the people in the first 200 lines of the paper is insufficient to determine if the paper places more emphasis on rich versus poor people and if this emphasis is indicative of a wider social or structural bias. This is because I have not taken the time to encode the rest of the paper, so my tags risk capturing only the few stories that focus on the recent city elections, and little more than the fact that those elections were covered at all. As a colleague pointed out in her TEI work, the time of year or local events largely dictate what is present in the paper and a much bigger sample is needed to get an accurate picture.

In summary, in this paper more important business people and politicians are mentioned than ordinary people, but this may be an unfair generalisation because of the paper’s focus on the local elections and business advertisements. This speaks to the newspaper’s function as a place to share local news (e.g. so-and-so’s funeral is this Tuesday), political events, and business advertisements. Similarly, while local place names are mentioned more frequently than international names, the international names carry more significance and speak most often to a political or criminal story of interest to the broader public. Again, this ties into the purpose of newspapers as having both a local and a general-interest focus.
