August 20: The wrap up

The big project is due today, so I am calling this the big wrap-up. However, as we discussed in class, a digital history project is by its very nature never really “finished,” so this post will be more of a summary of what I have done so far, what questions I have asked and begun to answer, and what steps and tools I envision for carrying this project forward.

So far, I have used wget to download Equity papers in .txt format, including the January 14, 1897 edition that I used for this project. After trying several cleanup methods, I used TEI to clean up and encode that single file. I encoded for people, places, medicines, and sale items. I used these tags to begin answering the questions I settled on:

  1. How often are well-known people vs. regular people mentioned?
  2. How might this speak to the function or readership of the paper?
  3. How frequently are locations mentioned?
  4. Does this speak to the relative “world” they lived in?
  5. Are some items sold more than others? What and why?

 

I used Voyant to visually map word frequencies and word connections using the Cirrus and Links tools. I produced two visualizations which confirmed my previous suspicions about which words would be most frequent throughout the paper. However, I had to re-evaluate my original premise and accept that word frequency could not automatically be equated with popularity, editorial intent, cultural importance, or local reality. This was especially true because the paper had been poorly digitized by OCR. It was unclear what the “real” results would have been had the OCR done a better job and more fully captured the paper’s contents.

I explored and was confronted by the ethical and methodological challenges of doing digital history in public (on the web), and I re-evaluated many of my plans and initial suppositions. My project changed significantly over time, a product of an evolving understanding of the strengths and limitations of the digital mediums I was using, and the practical challenges I faced (described in my large fail-log).

I was able to conclude in a preliminary sense that yes, wealthy and well-known people did seem to appear more often in the paper, but that no, this couldn’t really provide an accurate reading of social dynamics and daily realities in the Shawville region, because of the heavy coverage of the council elections and because the sample size was too small to be reliable. The same conclusion was drawn regarding locations and their relative importance to the region. I did not get far with the sales research I wanted to do, but I did notice an abundance of winter clothing (as it was January) and meat products on sale. I also noticed many quack medicines on sale, a common situation during the late 1800s and early 1900s.

Going forward, I would suggest running several network visualisations through Gephi and trying topic modelling, for example with MALLET (the tool behind the R tutorial I attempted earlier), alongside keyword analysis in a program such as AntConc.

 

The link to my paradata is on this Google Drive. The file will need to be downloaded because it is a .md file and won’t display automatically in the browser.

Please also find it on GitHub.

 

August 18: A series of errors and a helpful friend

This posting is really about the things I tried to do most recently with the project, how they didn’t work, and how I received help to figure them out. I wanted to find a program that would allow me not only to count the number of words in the text, as I did previously, but to sort them by TEI tag so that I could count word frequencies within each collection of tags.

For example, I wanted to be able to address the question of what items were on sale and how they might have changed over time. The sale items all had different names, though, so it would not have been easy to search for them by keyword. However, each sale did have the <sale> tag itself in common, and if I could search on that tag, I could categorise within it and determine how much of the sale content focused on meat, how much on animal furs, and so on.
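To make the idea concrete, here is a minimal sketch of the kind of counting I had in mind. It assumes the sale tags are the <saleType> elements from my August 11 template, that the file is well-formed XML with no namespaces, and the filename is made up:

import xml.etree.ElementTree as ET
from collections import Counter

tree = ET.parse("equity-1897-01-14.xml")   # hypothetical filename for my encoded edition
word_counts = Counter()

for sale in tree.getroot().iter("saleType"):   # every <saleType ...> element
    text = " ".join(sale.itertext())           # the text inside the tag, children included
    for word in text.lower().split():
        word_counts[word] += 1

for word, count in word_counts.most_common(20):   # the 20 most frequent words inside sale tags
    print(word, count)

The same idea would work for the medicine or place tags by swapping the element name.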

I reached out to the professor via our Slack channel and received advice directing me towards a program called Catmandu. After installing it, I began following the tutorial. I also showed it to my brother, hoping he was familiar with it and could give me a head start.

After some exploration, we decided that we would prefer not just to count how many times each tagged word occurred, but to sort the place names and determine their relationship to each other. We abandoned Catmandu and began to work with regex and Python.

The plan was to create a regex searching for the <placeName> tag, capturing everything up to the “ref” attribute, and then use a Python expression to tabulate all instances of the same place name.

I wrote a regex that I thought would do this:
("<placeName" + ">)

The rest of the blog basically reads like a “fail-log”, an account of what I wanted to do, how I tried to do it, what help I had, and how it didn’t work.

I was right about the beginning section of the regex and about the later need for the “+” to grab additional information. I was not precise enough, though: while I remembered some of the conventions for writing the expression, I forgot others, such as escaping with “\” and the strategic use of brackets.
My brother modified my expression so that it continued until the “ref” at the end: <placeName key=\"(.+)\" ref (with a closing "> inserted in the editor to close the tag and keep the rest of the text from changing colour). This correctly identified the place names, but the “ref” it was picking up was not always the one it started with, especially when several place names occurred on the same line.

We then turned to the Python part, looking for ways to isolate each line independently and add re-occurring place names into a counter. This required creating a Python file: a “for” loop finds each match of the regex, and an “if” check decides whether a place name has already been seen. If it has, its count is increased; if the place name is mentioned for the first time, it is simply noted. The structure that holds these counts is referred to as a dictionary.
He kindly created and sent me this Python script:

It works in theory, but the “key=” part causes problems and the script will not run correctly. However, this process demonstrated the initial steps of the analysis. If I had more time and were following this project through, the goal would be to learn why the placenames[m.group(1)] = placenames[m.group(1)] + 1 line isn’t working properly.

Theoretically, if it did work, it would find all the place names inside tags and tabulate how many times each was mentioned. I would then place that information in context, asking what relationship, if any, existed between each location and Shawville. Other data analysis could also be conducted, similar to what I have discussed previously. The same type of process could be applied to well-known vs. regular people, and to items sold at the grocers, asking what, if anything, they had to do with seasonal products.
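For the record, here is a rough sketch of the approach described above. It is not the script my brother sent, just my reconstruction of the logic: a regex that captures the key attribute (non-greedy, so two tags on the same line don’t run together) and a dictionary that counts each place name. The filename is invented:

import re
from collections import defaultdict

with open("equity-1897-01-14.xml", encoding="utf-8") as f:
    text = f.read()

pattern = re.compile(r'<placeName key="(.+?)"')   # .+? stops at the first closing quote

placenames = defaultdict(int)
for m in pattern.finditer(text):
    placenames[m.group(1)] += 1   # same step as placenames[m.group(1)] = placenames[m.group(1)] + 1

for name, count in sorted(placenames.items(), key=lambda item: item[1], reverse=True):
    print(name, count)

A defaultdict removes the need for a separate “first time vs. seen before” check, though an explicit if/else would mirror the plan above more closely.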

August 17: Data p.2

The other tool I used through Voyant was Links. This was a different view from the previous visualization because it did not just present the most frequently used words, but traced how they were connected. This type of visualization helps the reader or researcher see context and begin to look at alternative ways of broadening a search query. For example, “John” is now connected to “mother”. If we continued to trace either John or mother, we could potentially see what John does or why he is mentioned so many times, and take a deeper reading to figure out how he connects to mother. Similarly, if our focus was on mother, we could explore how she is related to John and start to ask broader questions. In this case, we could ask why the mother is identified through her relation to John, raising questions about identity, gender, power, representation, and the role of newspapers/the printed word in perpetuating or addressing these issues. If we were looking at a huge collection of texts instead of one, we could follow Michelle Moravec’s example in her topic-modelling analysis of feminism and its representation, and carry out a deeper analysis.

Cautions to consider with this type of analysis include the possibility of program errors in connecting words, and, in the case of this data in particular, the concern about poorly OCRd text and missed or unrecognisable words. For this chart to be more than merely illustrative, it would have to be used critically, and the researcher would have to take great care in cleaning up the data for more rigorous scholarly use. In addition, the discussion we had previously about the dangers of being uncritical when using the “John” results applies here as well, because this is really a different side of the same kind of data analysis.

 

I then explored the Terms tool. This was a list of all the words and their frequencies. I selected all the terms that occurred eight or more times. I chose this cutoff because it reflected the lower end of the “frequent” word list and yielded 136 individual terms. That seemed enough to experiment with, but not so much that I would get bogged down.
When I exported it as a tab-separated text file (so I could turn it into a csv and use it later), it actually exported all 600 terms. I double-checked what I had clicked on and re-did it, making sure to manually select each of the 136 words I wanted. This is an important note about digital history: it is slow, finicky work, and trying to rush through it or work on things you haven’t taken the time to understand is not useful in the long run.
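If I were redoing this step, a few lines of Python could do the same filtering without the manual clicking. This is only a sketch: it assumes the export is tab-separated with columns named Term and Count, which may not match Voyant’s actual headers, and the filenames are made up:

import csv

with open("voyant-terms.tsv", encoding="utf-8") as src, \
        open("terms-8-or-more.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.writer(dst)
    writer.writerow(["term", "count"])
    for row in reader:
        if int(row["Count"]) >= 8:   # keep only the terms occurring eight or more times
            writer.writerow([row["Term"], row["Count"]])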

I turned it into a .csv file, recognising that it now had correctly-sorted values to be used in OpenRefine (formerly Google Refine). However, it wasn’t to be. I hadn’t thought it through enough to recognise that there were actually only two categories, word and frequency, and that unlike the Texas correspondence tutorial, there wasn’t actually anything that needed “cleaning up” in the top 136 words. This is yet another example of how different tools are suitable for different projects, and it is our job as digital history researchers to be aware of this. Part of doing the work is making sure we understand the tools available and picking the right one. So, while using OpenRefine might have made sense at the beginning of the project, when it wasn’t possible, now it was possible but no longer made sense.

 

As an interesting side note, I did do a little playing around with the .csv in OpenRefine, specifically with the “text facet” function. I wanted to see if it would find two words that were similar enough that they should be merged to increase their word count. It found “friend” and “friends”. I clicked to merge them but, on second thought, realised that in the context of this newspaper it might actually be unethical to do this.

Unlike a series of correspondences where the names were almost certainly the same, the key words were not necessarily within the same context. “Friend” might be part of a sentence that said “John went to visit his dear friend last week”, whereas “Friends” might refer to people in the more formal and impersonal voice: “at this grocery store, we treat all our customers as friends”. I undid the merge and closed the file.

Upon deeper reflection, I realised I was confusing OpenRefine and Gephi. Gephi is a network analysis tool. If this project were continuing forward, I would use it to continue my exploration of words and how they are connected in relation to the questions I am asking. I would also look briefly at topic modelling to see if a computer program (AntConc?) makes the same connections between words as the Links tool and my own analysis do.

August 17: Presenting the data

Today was a big day because I had finally finished the encoding and was ready to begin using some programs to figure things out about the data.

Ideally, I would like to map people’s names and compare the frequency of “important” vs “regular” people. I also want to see which towns are mentioned most frequently and what type of food is mentioned most.

To recap, these are the basic questions I have asked throughout this project in regards to the TEI work:

  1. How often are well-known people vs. regular people mentioned?
  2. How might this speak to the function or readership of the paper?
  3. How frequently are locations mentioned?
  4. Does this speak to the relative “world” they lived in? 
  5. Are some items sold more than others? What and why?

As a result of time constraints, and as a proof of concept, my analysis has focused primarily on the first four questions, looking at them in an inter-related and connected way.

 

I began my analysis with a program called Voyant, which allows you to enter a file or URL and explore word frequencies and similar data visualisations. Within Voyant, I started with Cirrus, which is a word cloud generator. I wanted to see what the most frequently used words were throughout the newspaper. Unfortunately, because of the poorly-done OCR, this word cloud is not entirely accurate, and that needs to be mentioned. Failing to mention it could influence the way researchers approach this tool and lead them to draw wrong conclusions.

This is what the Voyant tool looks like, taken as a screenshot. The Cirrus tool is in the upper left-hand corner.

The top key words include: Mr, new, old, time, years, house, men, coun, shawville, John, law, business, and January.
Taken critically and within the context of a paper written shortly after the start of the new year and largely focusing on recent municipal elections, the results make sense. Elected officials are referred to formally and respectfully. References to “new” include “New” York, “New” Years, new councillors and new buildings/work. Interestingly enough, the name “John” is 7th among the top 25 most frequent words. This means that either the same person is being referred to multiple times, or there are many different people named John. In examining the phrases in which the term occurs, it appears that both are the case. Without doing additional research, it is difficult to know if the name John was popular in the Shawville region during this time, if it was popular among the English-speaking elites, or if some other factor is at play.

It is analysis such as this that helps to create a base from which further research into the questions I asked can be explored. For example, the term “January” is also in the list of frequent words and speaks heavily to the advertising tactics of G. H. Hodgins. Upon closer examination, it becomes clear that Hodgins was trying to beat the post-Christmas sales slump by heavily promoting his business throughout the month of January. The variety of his merchandise was extensive, ranging from dry goods, to furs, to hats and carpets, to clothing, to footwear, to various meats, to baked goods. I was able to establish the range of his offerings by examining the encoded .xml version of my file in the browser, quickly scanning the document for words that appear in purple (encoded for sale). Then I did a quick check of the context and included items mentioned at the same time as either Hodgins or January. This is an example of using this Equity file as a proof of concept. Now that I know I can use the data in this way, I would be able to combine it with potentially hundreds of other editions and see how frequently Hodgins advertised his products, whether what he sold changed (seasonally or otherwise), and whether the extent or percentage of his sales changed.

A note of caution has to be provided, however. As historians, we need to continue to be critical about our research methods and data analysis. Just as we analyse the credibility of a written source, we need to analyse digital sources. For instance, while it may seem fair to conclude that Hodgins sells the items he advertises, it does not follow that we know his inventory. It would be best practice to say we know what items he has chosen to advertise as on sale. While this doesn’t seem like a big deal, it makes a difference to the conclusions we draw about his business, the consumer culture, and the items available for daily use in January 1897. A much more specific and in-depth project focusing on Hodgins, advertising culture, and the economic atmosphere would need to be conducted to make these types of conclusions fairly.

 

When I broadened the word frequency list to 105 terms (as close to 100 as I could get), it included variations on the words “council/councillor/municipal/elected/mayor”. This points to the strong emphasis in the paper on the recent elections in the surrounding townships and also correlates with the frequency of “Mr.”. Other strong words included those associated with the January sale: “prices/goods/company/cent/sale/January”. Again, this correlates with the top 25 list and its emphasis.

 

In terms of the tags I encoded for, I was happy to see that names of towns featured in the top 105 list. Bristol was mentioned 14 times, Arnprior 9 times, London 8 times, York (most likely New York) 9 times, and Ottawa 14 times. Closer reading would need to explore in what context these terms were mentioned. It surprised me to see London and (New) York ranked so highly. On closer examination, it appears many of these references were concentrated in two or three distinct stories about criminal proceedings or internationally significant events.

At a preliminary level, this allows me to answer my initial questions about people and place names. In terms of people, names included John, Donaldson and Armstrong. A quick search through the text using the browser’s search function, or the list of terms and frequencies from Voyant, reveals several interesting things.

First, John and Donaldson are often part of the same name: John Donaldson. All we know about him is that he died the previous year and was married to Margaret Evans, who is now also deceased. This demonstrates some of the difficulty in relying on data analysis tools like this without doing the critical thinking. I know John Donaldson is one person because I took the time to run the names past the document and view them in context. Let me demonstrate this further.

John was a popular name and a search through the text reveals a John Mather, John McGuire, John Stewart, John McLellan, John Coyne, and John McIntyre in addition to Donaldson. Of these men, the tags let us know that the majority of them were wealthy and involved in the business or political world. Now, it does not make sense based on this analysis to conclude that John is a name given to the wealthy. However, it is possible to note based on word frequency and the encoding, that a significant number of people in this paper are named John and that the people named John are largely political influencers. Further analysis could determine their age ranges and perhaps infer if the name was popular during a certain time period. Taken against a much larger range of papers and census data, it might be possible to correlate the name “John” with English-speaking, Protestant, politically-influential people. At this stage, however, it is merely possible to note the curiosity as suggestive of a greater trend.
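Because each person tag carries a role attribute, this kind of comparison could eventually be automated. A hedged sketch, again assuming well-formed XML, no namespaces, and my made-up filename:

import xml.etree.ElementTree as ET
from collections import Counter

tree = ET.parse("equity-1897-01-14.xml")

roles = Counter()
for person in tree.getroot().iter("persName"):
    roles[person.get("role", "unknown")] += 1   # tally people by the role attribute

for role, count in roles.most_common():
    print(role, count)

Counting roles such as merchant or councillor against untitled names would give a rough proxy for the well-known vs. regular split, though the caveats about sample size below still apply.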

Similarly, an analysis of the tags of the people in the first 200 lines of the paper is insufficient to determine whether the paper places more emphasis on rich versus poor people and whether this emphasis is indicative of a wider social or structural bias. This is because I have not taken the time to encode the rest of the paper, so I run the risk of my tags capturing only the few stories that focus on the recent municipal elections, and of the election coverage itself skewing the results. As a colleague pointed out in her TEI work, the time of year or local events largely dictate what is present in the paper, and a much bigger sample is needed to get an accurate picture.

In summary, then: in this paper, important business people and politicians are mentioned more often than ordinary people, but this may be an unfair generalisation because of the paper’s focus on the local elections and business advertisements. This speaks to the newspaper’s function as a place to share local news (i.e. so-and-so’s funeral is this Tuesday), political events, and business advertisements. Similarly, while local place names are mentioned more frequently than international ones, the international names carry more significance and most often belong to a political or criminal story of interest to the broader public. Again, this ties into the purpose of newspapers as having both a local and a general-interest focus.

August 16: Trouble-shooting the Stylesheet

I had some trouble after I finished encoding the medicine and sales tags. This was because they were not part of the initial example that I used as the basis of my stylesheet. I opened the stylesheet and began experimenting with the format so that I could add my own templates. I was unable to figure out which aspect of my versions was not working. My first attempt looked like this:

<li style="color:orange;text-decoration:none;">Medicine</li>

<xsl:template match="medicineName">
<a style="color:orange;text-decoration:none;" href="{@ref}" title="{@key}&#013;({@from}-{@claim})&#013;{@ingredients}"><xsl:value-of select="."/></a>

 

I reached out to my professor and classmates on our Slack channel, and to my brother. I reached out to my professor because he had helped me begin the project by providing the stylesheet. I reached out to my classmates because I suspected that some of them were working on something similar. Indeed, I had a conversation with a fellow student, @victoriabarker, who had similar questions about encoding people’s names and that whole process. I shared what I did and we discussed it in more detail. I also reached out to my brother because, as an expert I knew, I could give him access to my screen, which would help with identifying the problem faster. Here are some of my notes explaining the process and results from this collaboration:

“I received help from the professor and was able to get the medicine code to turn red throughout the text.
However, my brother pointed out that I was missing the href="{@ref}" needed to actually link the item to a web page that would open when clicked. I also wasn’t able to get the medicine categories to all show up in the “hover” function. I realised that it was because I had only included the default variables from the professor, which were “key” and “ref”. Then I forgot to add the “@” sign to the beginning of each item, further frustrating me until I re-read the document and noticed the discrepancy with the original examples.”

I think the biggest difficulty with this last step was that, because I had received the stylesheet from the professor, I was using it without truly understanding what each component meant or how to modify it. I had no frame of reference, so I was totally confused when it didn’t work the first time. I didn’t panic, and reached out to my professor and brother. Once I understood the different components, I was able to closely compare the mistakes I had made at different stages of the project against the original versions.

This marked the end of encoding of the file and I was ready to begin manipulating the data in ways that would finally address the questions I set out to answer.

August 13-15: Continuing TEI encoding

As mentioned previously, TEI encoding is relatively easy once you get the hang of it, but it takes a long time. On August 13-15, I continued working on the TEI encoding, moving from people, to place names, to medicines, and to sale items.

In some of the URLs from the people references, I had to change “&”, which, as discussed previously, the .xml format won’t read, to &amp;, which is the accepted encoding for the same character. When the page loads in the browser, it renders exactly the same way, so the URL still works.
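I did these replacements by hand, but a scripted version is possible. The sketch below is my own illustration rather than part of my workflow at the time: it swaps bare ampersands for the &amp; entity while leaving any existing entities alone.

import re

def escape_ampersands(text):
    # Replace "&" only when it does not already start an entity such as &amp; or &#013;
    return re.sub(r"&(?!amp;|lt;|gt;|quot;|apos;|#)", "&amp;", text)

print(escape_ampersands("geohack.php?pagename=Sheffield&params=53_23_01_N"))
# prints: geohack.php?pagename=Sheffield&amp;params=53_23_01_N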

When encoding places, I decided to work only on recognised place names (mostly cities) rather than grocery stores or other buildings. I also decided to focus only on cities and states, because there really is not much to say about “Canada” or “Portugal”. I did not include any information about the location other than city, province, country, and a reference to the appropriate website.

I chose to present medicine as a separate category from other sales items because I had a personal interest in patent medicines (quack medicine). Here I used the PDF copy of the original newspaper to search for medical advertisements, because I was less familiar with this category and because this would ensure I did not mislabel any of the medicines (they all seemed to focus on the same type of ailments and key words).

I had some trouble with the medicine category precisely because the OCR jumbled up the text and separated labels from product descriptions. As I noted that day:

“The problem with finding these items by keyword (e.g. “Hair” for Hall’s Hair Renewer) is that because the advertisements are often broken up, other references such as scalp, bald, re-growth, etc. are not caught. This is the type of thing that topic modelling would catch and group together to give a more accurate representation of how often Hall’s Hair Renewer was mentioned and in what contexts (especially topic modelling that uses positive and negative sentiments).
This is interestingly the case for Warner’s Safe Kidney and Liver Cure. While “kidney” is sprinkled throughout the document, it appears without an explicit connection to Warner, so it would be unethical and unfair to assign it to his medicine. This is especially true because Doan also has a kidney cure advertised within this edition; without being connected to a specific name, it would be poor methodology to include it.”

The sale items were challenging in that I did not know exactly what I wanted to do with the information. Originally, I thought about categorising them by merchant and then listing what each merchant sold. In the end, I decided it would make better sense later on to search by type of sale item (e.g. staples, meat, clothing). This would present me with a list of meats, for example, and allow me to do further work with the frequency of each type of meat and vendor. If this work were taken to a larger scale, it would be possible for the historian to track these variables over time and compare them by season. For example, we would be able to determine when merchant x started selling fur coats in 1897 and whether this became earlier or later by 1904. This might be an indication of larger consumer patterns or market changes.

In my paradata document, I had a discussion with myself about what qualifies as “luxury” and “staple” items. Conversations like this are important to note because they preserve my integrity as a researcher aware of the potential problems within my methodology, and they raise issues that other researchers may not have considered. Digital history is a community of practice, and it is perfectly fine to work with others, to inform and be informed by the work, failures, and successes of others, and to base your initial work off of others’. In order to properly document this, be honest, and acknowledge the collaboration and expertise of others, I have done my best to note where I received help (largely class tutorials and my brother).

August 11: Beginning the Encoding Journey

Having sorted out the paragraph tags (<p> and </p>) and made sure everything was aligned properly, it was time to begin the actual encoding. For this, I used Professor Graham’s examples set out in his tutorial.

My motivation had changed from the beginning of the project and so too had my questions. As I wrote that day “I will encode it and then people will be able to quickly search it for people’s names and such. If combined with many different but similar Equity files that have also been similarly coded, this can be visualised and used for much more useful research.”

My new research questions focused on the skimming I had done when breaking the text into paragraphs. I had noticed that the paper focused heavily on advertisements and political reporting. From this, I created these questions that framed my work going forward:

Research Questions:
How often are well-known people vs. regular people mentioned?
How might this speak to the function or readership of the paper?

How frequently are locations mentioned? Does this speak to the relative “world” they lived in? E.g. closer contact with locals – what they were interested in?

Are some grocery items sold more than others? What and why?

 

These questions are vastly different from the ones I started out with and represent a shift in focus. Rather than taking a strictly big-data approach and doing topic modelling and comparisons of political sentiment over time, I focused on the local situation. I wanted to know how people were represented in the paper, how and why place names were talked about, and how domestic life was talked about. I chose to look for these answers by identifying, counting, and mapping people’s names, place names, and sale item types.

 

I used the prof’s example encoding to create a stylesheet. A stylesheet is necessary to tell the browser how to display the tags in the .xml file (a web-readable, encodable version of my text file). I began encoding for people, places, medicine, and sales.

Persons <persName key="Last, First" from="YYYY" to="YYYY" role="Occupation" ref="http://www.website.com/webpage.html"> </persName>

Places <placeName key="Sheffield, United Kingdom" ref="http://tools.wmflabs.org/geohack/geohack.php?pagename=Sheffield&params=53_23_01_N_1_28_01_W_type:city_region:GB"> </placeName>

Medicine <medicineName key="Name" from="Business" claim="medProperties" ref="website"> </medicineName>

Sale <saleType key="type" from="Business" to="amount"> </saleType>

 

I began with people since I realised it would take the longest. Here is an excerpt from my fail-log:

“I used the formula below to encode the first name, “Cation Thornloe” (which appears to be a poorly rendered “Captain Thornloe”).
<p> <CationThornloe <key="Thornloe, Cation" from="?" to="?" role="Bishop" ref="none"> </persName>
I received a parse error saying it was improperly formed.”

Prof. Graham asked if I had a stylesheet for the xml file. I did not, nor did I know what it was. He explained and provided a sample. To me, it seems like a legend that formats the tags we put in the xml file.

After much trial and error and help from my brother and the kind professor, I made sure the .xsl was referred to in the .xml file. I also had to go through and make sure the tags were properly closed again. It worked eventually.

I also had some ethical and methodological concerns about the encoding I was doing, specifically about the information I was including in the tags, which is meant to help future researchers. I discussed this at length in my notes and worked out a methodology that I felt comfortable with. A snippet of that conversation concludes this posting. I also made the decision to encode only the first 200 lines of the document. I did this because of the enormous number of names in this issue (a result of the municipal elections reporting), and because I was short on time and believed that 200 lines was sufficient to gain a preliminary understanding of the answers to my research questions and provide a proof of concept.

 

“The information I include in the brief description of the person also needs to be problematized because each person could have at least a full essay written about them, their politics, significance to early Ottawa history, influence at the Bank, trade influence, religious and social views, etc.
Each decision about what to include should be explicitly given, although I am not sure where it would be most appropriate to do so or how that type of paradata should be shared. For example, in looking up George Hay, I found one biography for him that seemed to give a rather complete history of his interactions with several different aspects of life. I chose not to look at any additional resources after briefly searching to make sure George Hay was the person I was looking for (context included his connection to the Bank of Ottawa and that he was wealthy and influential). I also chose to exclude some of the information about his complex political ties and more detailed religious leanings. I did include that he was the leader of the Ottawa Bible Society because it seemed to indicate his broader position within religious circles in Ottawa. I also included it because, as a person of faith myself, I found myself identifying with him on this level.”

 

August 9: Regex tag fixing

The OCR was messy, and things that are not regular text (“*”, “&”, and stray Unicode characters) were causing the file to return error messages and preventing the tags from aligning properly. I spent the evening trying to create regex expressions to remove these, before deciding to use find and replace and do it carefully, one at a time. This took much longer, but ultimately worked better for me in the long run. However, because I was making conscious decisions to modify the text, I wrote about it, noting the ethical and methodological ramifications it could have on my research and on future research.

For example, when I decided to remove some incorrectly formatted Unicode characters that were disrupting the TEI coding, I was doing something fairly straightforward on the surface. However, those Unicode symbols stood for something, and I was making a conscious decision to remove some information from the text that I will theoretically be analysing later on. Since it was mangled Unicode, I have no idea what it actually said, and it could have been important or not. I tried to mitigate the damage I did by making clear notes on what I removed/replaced and why. The hope is that this paradata can be used by future students and historians not only to help them understand the steps I took and my thought process, but also to gain an understanding of the manipulation and analysis I ran, so that they can decide how it impacts their planned work and adjust accordingly.

After some modifications to my regex expression and experimentation using RegExr, I created an expression intended to ignore them: " ["*"])(\"). It didn’t work, and then I realised I could remove the markings in gedit using the find and replace function.

Using the find and replace function, I looked up the Unicode meaning of the problem symbols and replaced them one at a time: “*” and “&” with nothing; the curly single quotes (‘ and ’) with a straight apostrophe; the curly and low double quotes (“, ”, and „) with a straight quotation mark; the em dash (—) with a hyphen; ™ with nothing, because the suggested symbol was a TM and made no sense based on the context; and • with nothing, because the suggested symbol was a picture of a sword and made absolutely no sense. I made sure to examine each case before replacing it, and found there was no better option.
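For anyone repeating this on more editions, the same one-for-one substitutions could be scripted rather than done by hand in gedit. A sketch, using Unicode escapes for the curly quotes and dashes and a made-up filename:

replacements = {
    "*": "", "&": "",
    "\u2019": "'", "\u2018": "'",                  # curly single quotes to straight apostrophe
    "\u201c": '"', "\u201d": '"', "\u201e": '"',   # curly and low double quotes to straight quote
    "\u2014": "-",                                  # em dash to hyphen
    "\u2122": "", "\u2022": "",                     # TM symbol and bullet removed
}

with open("equity-1897-01-14.txt", encoding="utf-8") as f:
    text = f.read()
for old, new in replacements.items():
    text = text.replace(old, new)
with open("equity-1897-01-14-clean.txt", "w", encoding="utf-8") as f:
    f.write(text)

The trade-off, of course, is that a script cannot examine each case in context the way I did.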

I re-ran the command to automatically align the tags, but it presented me with a lot of errors. I will copy the errors into a separate document and clean them up manually. There are over 300 lines of errors similar to the ones below:

/dev/stdin:97: parser error : StartTag: invalid element name
established at Haley a Station.</p> <p>1 hey are <>lltif remarks that the contin
^

To my frustration, I found that the majority of errors were caused by sloppy, late-night tagging: I was missing the forward slash in many of the closing tags. Eventually, I was able to fix it.
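One quick sanity check I could have scripted at this point, assuming the paragraph tags were the only tags in the file so far (and with an invented filename), is simply counting openers against closers:

import re

with open("equity-1897-01-14.xml", encoding="utf-8") as f:
    text = f.read()

print("opening <p> tags:", len(re.findall(r"<p>", text)))
print("closing </p> tags:", len(re.findall(r"</p>", text)))   # a mismatch means a slash went missing somewhere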

I then tried to use OpenRefine to fix some of the spelling and OCR errors. I converted the file into a .csv, but it did not make sense because the text did not really have rows and columns the way a spreadsheet would. I decided against OpenRefine once again.

August 8: TEI continues

As I mentioned in my previous post, TEI is deceptively simple: the tags themselves are easy, but the work takes a long time and requires a sound methodology and careful reading.

For example, I wrote a note to myself as part of my research notes process, reminding myself and future users that I was unclear about what the consequences of using and then modifying this poorly OCRd data were.

“I’m not sure the ethical restrictions on using this data yet because all the stories about land valuation, medical breakthroughs, huge store sales are all intertwined – does this impact topic modelling?
This is why people have to be critical and include close reading of their texts – can’t just begin with topic modelling”

 

After I added all the paragraph tags, I tried to use a plugin installed on my computer to align them properly. As mentioned earlier, I learned the importance of aligning tags when working with Python. I ran into a whole string of errors indicating that the script couldn’t be run because of the extra “<” characters scattered throughout the text as a result of poor OCR.
I realized I could use regex to find and replace these extra, useless characters. With my brother’s help, I was able to construct a regex that found the “less than” symbol and, whenever it was connected to a character that wasn’t “p” or “/”, replaced it with the HTML entity version. This would allow the file to be read easily and solve the problem.

The regex looked like this: <([^p\/]) and the replacement was &lt;\1
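In Python, the substitution would look roughly like this. This is only a sketch of the same idea, not necessarily how I ran it at the time, and the filenames are invented:

import re

with open("equity-1897-01-14.txt", encoding="utf-8") as f:
    text = f.read()

# A "<" not followed by p or / is stray OCR noise, not a real tag opener,
# so it gets swapped for the &lt; entity that the XML parser can live with.
text = re.sub(r"<([^p/])", r"&lt;\1", text)

with open("equity-1897-01-14-escaped.txt", "w", encoding="utf-8") as f:
    f.write(text)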

 

August 7: RStudio

Having previously tried Python and regex to clean my Equity files, I switched to RStudio. RStudio is an easier-to-use interface for the programming language R, which allows you to map things and manipulate data.

I used the RStudio instance in the DH Box set up for the class and based my actions on the tutorial.

Throughout the course of the tutorial and my work, I kept receiving these error messages:

documents <- read.csv(text = x, col.names=c("Article_ID", "Newspaper Title", "Newspaper City", "Newspaper Province", "Newspaper Country", "Year", "Month", "Day", "Article Type", "Text", "Keywords"), colClasses=rep("character", 3), sep=",", quote="")
Error in textConnection(text, encoding = "UTF-8") : object 'x' not found
> topic.model$loadDocuments(mallet.instances)
Loading required package: rJava
Error in topic.model$loadDocuments :
$ operator not defined for this S4 class >

 

I realised that I wasn’t using a .csv file and that it wouldn’t work. Sometimes I really need to slow down and think things through better. Part of this is because I didn’t really understand what I was doing, and part of it was just poor habit.

Either way, my documents did not have clear headings. I tried to think of clear headings, but realised the data was too messy.
I went back to my original plan of cleaning the data, but this time decided to use plain regex.

At this point, I was feeling pretty overwhelmed and worried about my ability to do digital history. I found the TEI tutorial (which I wasn’t able to work through the first time) and noted the Prof’s note that doing this with an Equity file would be an appropriate project. TEI stands for Text Encoding Initiative and presents a standardized method of encoding and reading texts.

I used the blank template to create my TEI file, pasting in one single text file. Here the project changed from one doing a broad reading of six years of files to one focusing on the January 14, 1897 edition. I did not intentionally choose this file for any particular reason; I happened to open the first one I saw and began experimenting with it. That process became part of my final project, so I stuck with the file.

I started by marking the beginning and end of what I considered to be “paragraphs”, or similar ideas, with opening and closing tags. The opening tag was <p>, where “p” stood for paragraph. The closing tag included the forward slash: </p>.

While doing a close skimming of the text to add the paragraph tags, I made some observations. I noticed the OCR had mashed different advertisements together, most likely as a result of too many small letters being placed close together and being mistaken for one column of text.
Sometimes it is hard to figure out whether the abbreviations are a result of the OCR, or whether the original newspaper used them to save space.
Most of the entries are very small; I have not yet come across the “large” feature articles we are used to today.

For any potential digital historians out there, be forewarned. Marking up text is not easy and it requires plenty of time and good documentation of why each decision was made. For example, I thought this looked like a list of items on sale, so I separated it from what appears to be prizes for a contest, even though both sections contained indistinguishable characters.

<p>A large, fln«dy-enulpp*d. old established I lut Ion- NON! BETTER IN CANADA.
Bae-lnf##* Kdueatioo at LowofI PoudMe Graduates always eur#p -ful. Write Š oatnlo.ua W. J. Hl.LlOTT, Prlnolphi
Wrappers
Soap
Samuel Rogers, Pres
ïxjîTxwiwrï-six ilau».
DUNNS
BAKING
POWDER</p>
<p>me
as follows:
10 First Prizes, $100 Stsarns’ Bicycle,! 1,000 26 Sececd ” $25 Odd Watch Bleyolee and Watches given each month 1,625
Total given dur’gyear ’97, $19,500
HOW TO For rules and full particulars, iiv tt Š v eee |hfl Toronto Ôlobb
ŠŠŠŠŠ</p>

I addressed this ethical and methodological dilemma by establishing guidelines to follow, documenting everything, and staying consistent to minimise future concerns.