Having previously tried Python and regex to clean my Equity files, I switched to RStudio. RStudio is an easier-to-use interface for the programming language R; it lets you map and manipulate data.
I used the RStudio instance in the DH Box set up for the class and based my work on the tutorial.
Throughout the course of the tutorial and my work, I kept receiving these error messages:
documents <- read.csv(text = x, col.names=c("Article_ID", "Newspaper Title", "Newspaper City", "Newspaper Province", "Newspaper Country", "Year", "Month", "Day", "Article Type", "Text", "Keywords"), colClasses=rep("character", 3), sep=",", quote="")
Error in textConnection(text, encoding = "UTF-8") : object 'x' not found
Loading required package: rJava
Error in topic.model$loadDocuments :
$ operator not defined for this S4 class
I realised that I wasn't using a .csv file and that it wouldn't work. Sometimes I really need to slow down and think things through. Part of this was because I didn't fully understand what I was doing, and part of it was simply poor habit.
Either way, my documents did not have clear headings. I tried to come up with some, but realised the data was too messy.
I went back to my original plan of cleaning the data, but this time decided to use plain regex.
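To give a sense of what that looks like, here is a minimal sketch of the kind of regex cleanup I mean, written in Python (the patterns here are illustrative examples of common OCR fixes, not the exact expressions used on the Equity files):

```python
import re

def clean_ocr(text):
    """Illustrative OCR cleanup -- each pattern targets a common artifact."""
    text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces and tabs
    text = re.sub(r"-\n", "", text)           # rejoin words hyphenated across line breaks
    text = re.sub(r"[^\x00-\x7F]", "", text)  # strip stray non-ASCII characters (e.g. Š, Ô)
    text = re.sub(r"\n{3,}", "\n\n", text)    # squeeze excess blank lines
    return text.strip()

sample = "Bicy-\ncle  races   were\u00d4 held.\n\n\n\nNext item."
print(clean_ocr(sample))
# Bicycle races were held.
#
# Next item.
```

Each substitution is small on its own, but chained together they remove a lot of the noise that OCR introduces, which is why documenting the order and intent of each pattern matters.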
At this point, I was feeling pretty overwhelmed and worried about my ability to do digital history. I found the TEI tutorial (which I hadn't been able to work through the first time) and noted the Prof's note that doing this with an Equity file would be an appropriate project. TEI stands for Text Encoding Initiative, which provides a standardized method of encoding and reading texts.
I used the blank template to create my TEI file, pasting in a single text file. Here the project changed from a broad reading of six years of files to a close focus on January 14, 1897. I did not choose this file for any particular reason; I happened to open the first one I saw and began experimenting with it. That experimenting became part of my final project, so I stuck with the file.
I started by marking the beginning and end of what I considered to be "paragraphs" or similar units with opening and closing tags. The opening tag was <p>, where "p" stands for paragraph. The closing tag includes a forward slash: </p>.
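For anyone curious what this looks like in practice, here is a minimal sketch of a TEI file with paragraph tags (the header values are placeholders I invented for illustration, not the actual contents of the blank template):

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>The Equity, January 14, 1897</title>
      </titleStmt>
      <publicationStmt>
        <p>Student transcription for a digital history course.</p>
      </publicationStmt>
      <sourceDesc>
        <p>OCR text of the Shawville Equity.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First block of text judged to be a paragraph.</p>
      <p>Second block, marked with its own opening and closing tags.</p>
    </body>
  </text>
</TEI>
```

The header describes the file itself; the actual markup work happens inside the body, one pair of tags per unit you decide counts as a paragraph.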
While doing a close skimming of the text to add the paragraph tags, I made some observations. I noticed the OCR had mashed different advertisements together, most likely because too many small letters placed close together were mistaken for a single column of text.
Sometimes it is hard to figure out if the abbreviations are a result of the OCR, or if the original newspaper used them to save space.
Most of the entries are very small; I have not yet come across the "large" feature articles we are used to today.
For any potential digital historians out there, be forewarned. Marking up text is not easy and it requires plenty of time and good documentation of why each decision was made. For example, I thought this looked like a list of items on sale, so I separated it from what appears to be prizes for a contest, even though both sections contained indistinguishable characters.
<p>A large, fln«dy-enulpp*d. old established I lut Ion- NON! BETTER IN CANADA.
Bae-lnf##* Kdueatioo at LowofI PoudMe Graduates always eur#p -ful. Write Š oatnlo.ua W. J. Hl.LlOTT, Prlnolphi
Samuel Rogers, Pres
10 First Prizes, $100 Stsarns’ Bicycle,! 1,000 26 Sececd ” $25 Odd Watch Bleyolee and Watches given each month 1,625
Total given dur’gyear ’97, $19,500
HOW TO For rules and full particulars, iiv tt Š v eee |hfl Toronto Ôlobb
I addressed this ethical and methodological dilemma by establishing guidelines to follow, documenting everything, and staying consistent to minimise future concerns.