Final Project Work

This week I have been working on my final project. It has been an interesting journey, and the project has changed a lot since I first envisioned it last Sunday. I originally planned to use several years’ worth of data (1897-1902) to do some topic modelling and see whether there was an increase in French/English conflict, as I suspected there would be given the divide created by the conscription crisis for the Boer War. I especially wanted to see what an English paper within Quebec had to say on the subject.

I first had trouble downloading the files because the command I used needed more modification than I had given it. I had further trouble with the newly created DH Box account, so I just downloaded the .txt files onto my computer using the modified command: wget -r --no-parent -w 2 --limit-rate=20k http://collections.banq.qc.ca:8008/jrn03/equity/src/1897/ -A .txt

I was interested in using Python to clean up the messy OCR, so I pulled up that tutorial and started working off the Python script, modifying it as I thought appropriate. I kept receiving error messages that basically all looked like this:
clare@clare-fun1:~/School/Final_EquityProject$ python PthyonScript.py
File "PthyonScript.py", line 19
nodash = re.sub('.(-+)', ',', line)
^
IndentationError: expected an indented block

 

My brother explained that the indents weren’t properly aligned. I tried to re-align everything and ran it again, but it deleted 1700 lines of text, leaving me with 14. I learned how precise Python has to be and got a rough understanding of how it should work.
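A minimal sketch of what the indentation rule means in practice, built around the one line quoted in the error message. Only the `re.sub` call comes from the script; the function name `replace_dashes` and the list-based input are assumptions made for illustration, not the tutorial's actual code.

```python
import re

def replace_dashes(lines):
    """Apply the substitution from the error message to every line."""
    cleaned = []
    for line in lines:
        # Everything that belongs to the loop must be indented under `for`;
        # leaving this line flush left raises the IndentationError shown above.
        nodash = re.sub('.(-+)', ',', line)
        # Appending inside the loop keeps every processed line.
        cleaned.append(nodash)
    return cleaned
```

If the step that saves each line sits outside the loop, only the final line survives. That is one common way for a script to lose most of its output, although whether it explains the missing 1700 lines here is a guess.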

I then tried to fix the readability of the Equity papers by using RStudio to remove poorly OCR’d symbols. I ran into problems because, even though I was using a csv format, my file didn’t really make sense: it didn’t have useful headings or columns that were easy to separate.

I decided on marking up one Equity paper using TEI. I am modifying the class tutorial as I go.  I started by making the <p> and </p> tags where I thought they should go. This was complicated by the fact that the OCR hadn’t respected the newspaper column lines and the lines were intermixed and often did not make very much sense. I had to make methodological and ethical decisions about where to put the breaks. I discuss this more in my ongoing final project notes and fail-log.

Next I had to use a program to properly align the <p> </p> tags and first encountered many errors of incorrectly closed tags and such that I had to fix. Then my program wouldn’t do the aligning, so my brother quickly ran it through his text editor.

I tried to use regex to identify and remove useless symbols that were causing problems with the computer’s recognition. I couldn’t get the expression to pick up just what I wanted in some instances and not others, so I used the find function and replaced them that way, checking the context each time to make sure the switch wasn’t a problem. I made a note of the substitutions in my fail-log and discussed in more depth the reasoning and ethical debates I wanted future users to be aware of.
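The find-and-check approach can also be sketched in Python: rather than deleting symbols automatically, list each suspect character with its surrounding text so the decision stays manual. The character class below is an assumption about what counts as a "useless symbol" (anything other than letters, digits, whitespace and common punctuation), and the function name `suspect_symbols` is hypothetical. It is meant for the plain OCR text, since in a marked-up file it would also flag the angle brackets of the tags.

```python
import re

# Assumed definition of a suspect symbol: anything that is not a letter,
# digit, whitespace, or common punctuation. Adjust the class to suit the text.
SUSPECT = re.compile(r"[^A-Za-z0-9\s.,;:'\"!?()\-]")

def suspect_symbols(text, width=10):
    """Return (line number, symbol, surrounding context) for each suspect character."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in SUSPECT.finditer(line):
            # Keep up to `width` characters on either side for manual review.
            start = max(match.start() - width, 0)
            context = line[start:match.end() + width]
            hits.append((number, match.group(), context))
    return hits
```

Each hit can then be checked by eye and recorded in the fail-log before anything is replaced.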

Closer to the end of the week, I began the actual encoding of people and the other categories I had identified while skimming the text. I began with the people and, as of today, have gotten through the first 200 lines. I didn’t realise just how long it would take, so I may just do a really good job on the first 400 lines and use it as a proof-of-concept.

I used the formula below to encode the first name (Cation Thornloe, which appears to be a poorly rendered Captain Thornloe).
<p> <CationThornloe <key="Thornloe, Cation" from="?" to="?" role="Bishop" ref="none"> </persName>

I received a parse error saying it was improperly formed.
I examined the format and noted that there was an extra “>” before the end tag </persName>. Even though it appeared to follow the format laid out in the template, I modified it to close out the tag:
<p> <CationThornloe/> <key="Thornloe, Cation" from="?" to="?" role="Bishop" ref="none"> </persName>
It then returned a parse error again, saying it was poorly formed.

In short, I was confused, and I hadn’t made sure the stylesheet was referenced in the .xml file.
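For comparison, here is a sketch of a well-formed version of that tag, checked with Python's standard XML parser. The attribute names are the ones used above; putting the name between an opening and closing `persName` is an assumption about what the class template intends, since the template itself is not reproduced here. The check only tests well-formedness, not whether the markup is valid TEI.

```python
import xml.etree.ElementTree as ET

# The first attempt from above, with straight quotes.
broken = ('<p> <CationThornloe <key="Thornloe, Cation" from="?" to="?" '
          'role="Bishop" ref="none"> </persName>')

# Assumed repair: the attributes sit inside the opening persName tag,
# the name is the element's text, and both persName and p are closed.
fixed = ('<p><persName key="Thornloe, Cation" from="?" to="?" '
         'role="Bishop" ref="none">Cation Thornloe</persName></p>')

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as a single XML element."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

Running `is_well_formed(broken)` gives False and `is_well_formed(fixed)` gives True, which matches the parse errors described above.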

I encoded every name, looking up what I could about them online. Some people were easier to find, like the lumber barons or bank board members. Some, like Miss Jennie who left for North Bay to take up dressmaking, were less easy to find. In my fail-log, I discuss that the project would have been more complete if I had had time to properly confirm that each person was correct, and that in cases where lots of information was given, I had to choose what to include, based on the context and influenced (unavoidably) by me.

I ended the week by trying to preview all the work I did and receiving errors from sources with “unusual characters” (=, &). Once I resolved that problem, I had this problem that I have not yet resolved:

**XML Parsing Error:** mismatched tag. Expected: </p>.
Location: file:///home/clare/School/Final_EquityProject/TEI_OCR_Tagging.xml
Line Number 829, Column 15: </body>
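The parser reports the error at `</body>`, which is where it finally gives up, not where the `<p>` was left open. One way to hunt for the real culprit is sketched below. It assumes that paragraphs are never nested and that no `<p>` tag is split across lines; the function name `find_unclosed_paragraphs` is hypothetical.

```python
import re

# Matches an opening <p> (with or without attributes) or a closing </p>.
# The lookahead stops it from matching <persName> and similar tags.
P_TAG = re.compile(r"<(/?)p(?=[\s>])[^>]*>")

def find_unclosed_paragraphs(text):
    """Return the line numbers of <p> tags that are never closed."""
    unclosed = []
    open_line = None  # line of the <p> currently waiting for its </p>
    for number, line in enumerate(text.splitlines(), start=1):
        for match in P_TAG.finditer(line):
            if match.group(1) == "/":
                open_line = None
            else:
                # A new <p> while one is still open means the earlier
                # paragraph was never closed.
                if open_line is not None:
                    unclosed.append(open_line)
                open_line = number
    if open_line is not None:
        unclosed.append(open_line)
    return unclosed
```

Reading TEI_OCR_Tagging.xml into a string and passing it to this function should give a short list of line numbers to inspect by hand.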

Questions that drive my markup at this point include:

  1. How often are well-known people vs. regular people mentioned?
  2. How might this speak to the function or readership of the paper?
