August 6: First Fail – Regex with Python

Today I was excited to begin cleaning up the Equity files I had previously downloaded. I naively thought it would be easy and that some simple regex (regular expression) cleanup would work. I was wrong. Instead, I entered a period of rapidly changing my project design and research questions, based on the limitations of the research medium and my own abilities with the text.

I began with a regex tutorial and started by modifying the Python script it provided. Regex is a way of searching for characters or patterns within a document, and Python is a programming language that can apply regex to find and replace text. I changed the file names and expressions that I could see needed changing, but kept receiving error messages. My error messages all looked similar to this:

clare@clare-fun1:~/School/Final_EquityProject$ python PthyonScript.py
File "PthyonScript.py", line 19
nodash = re.sub('.(-+)', ',', line)
^
IndentationError: expected an indented block

Luckily, my wonderful, computer-oriented brother took a look and noticed that the indentation in my Python script was not as precise as it should be. We worked on it and got the script to fill the .csv file. My goal was to turn the newspaper into a .csv file and clean it using OpenRefine.
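
For anyone hitting the same error, here is a minimal sketch of how that substitution line could sit inside a correctly indented loop. It is not the tutorial's script: the function name replace_dashes and the sample line are made up for illustration, and only the re.sub call is taken from the error message above.

```python
import re


def replace_dashes(lines):
    """Replace any character followed by a run of hyphens with a comma.

    Hypothetical helper for illustration; only the re.sub call comes
    from the error message in the post.
    """
    cleaned = []
    for line in lines:
        # Everything that belongs to the loop must be indented under `for`.
        # Leaving the next line flush with `for` is what triggers
        # "IndentationError: expected an indented block".
        nodash = re.sub('.(-+)', ',', line)
        cleaned.append(nodash)
    return cleaned


# Made-up sample line, not taken from the Equity files.
sample = ["Shawville --- Jan 14", "no dashes here"]
print(replace_dashes(sample))
```

Note that the leading `.` in the pattern also swallows the character just before the hyphens (here, a space), so it is worth checking a few lines by eye before running the pattern over a whole file.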

Unfortunately, when I opened the file, I realised we had done something wrong and deleted 1700 lines of text. I was frustrated and it was late, so I made sure I had a backup file and set the work aside. I decided that the Python script method was not going to work, so I made the first significant change to my strategy and decided that RStudio was what I needed.

August 5th, 2017: The beginning

Today is the start of my final research project for my digital history class. The assignment is to do something digital history-like with a subset of the archived Shawville Equity files that have been scanned and made accessible on the internet.

I am going to use the papers from 1897-1902 (6 years) to see whether French/English conflict in Canada increased after the Second Boer War began in South Africa, compared with the years before it. Questions include whether the French and English differed in their opinions about supporting the British troops in the war effort. This will be especially interesting because the Shawville Equity is an English paper within Quebec and will have its own take on the matter. I picked 6 years because it seemed like a reasonable spread to gauge public opinion or official reports of the war. I recognize that six years’ worth of data is not a large sample, but I hope that the change will be great enough to notice over that period. I also made this decision for a practical reason: I do not have unlimited internet and did not want to download a decade’s worth of materials.

I modified commands from Module 2 to use as a base for downloading the Equity files I was looking for. I also used Ian Milligan’s tutorial on wget.

My original command returned errors because I made spelling and structural mistakes. Compare it (first) with the correct form (second) that I arrived at after playing around with formats.

wget http://collections.banq.qc.ca:8008
/jrn03/equity/src/1897/ -A .txt -r --no-parent -nd --w 2 --limit-rate=20k

wget -r --no-parent -w 2 --limit-rate=20k http://collections.banq.qc.ca:8008/jrn03/equity/src/1897/ -A .txt
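
To make clear what each part of the working command does, here is a small Python sketch that assembles the same wget call piece by piece, with a comment on every flag. The function name build_wget_command and its parameter names are hypothetical; the flags and the URL are the ones from the command above.

```python
import shlex


def build_wget_command(url, wait_seconds=2, rate_limit="20k", suffix=".txt"):
    """Return the wget argument list for a polite recursive download.

    Hypothetical helper for illustration; it only assembles the command
    and does not download anything.
    """
    return [
        "wget",
        "-r",                          # recursive: follow links below the start URL
        "--no-parent",                 # never climb above the starting directory
        "-w", str(wait_seconds),       # wait this many seconds between requests
        f"--limit-rate={rate_limit}",  # cap the download speed
        url,
        "-A", suffix,                  # accept only files ending in this suffix
    ]


command = build_wget_command(
    "http://collections.banq.qc.ca:8008/jrn03/equity/src/1897/"
)
print(shlex.join(command))
# To actually run it, one option is: subprocess.run(command, check=True)
```

Long options such as --no-parent and --limit-rate need two plain hyphens, while single-letter options such as -r, -w and -A take one. Pasting a typographic dash in place of the hyphens is an easy way to end up with an error.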

Having successfully downloaded the files, I began thinking about what I had to do to clean them up enough so that they were usable for extracting the data I wanted. At this point, my questions were focused on public sentiment about conscription and fighting with the British in the Second Boer War in South Africa.

Equity: A Diving-in Note

At the beginning of this project, I did not know how I wanted to present my paradata to my audience. In thinking further about the assignment, I realised I envisioned my audience as English-speaking residents of Canada (most likely those in the Ottawa/Western-Quebec region) or those who were already interested in or familiar with the Shawville Equity. The project could also be useful for (amateur) digital historians who are working with an OCR’d newspaper and could benefit from an explanation of my processes. I decided to use a blog format for easy accessibility and to illustrate visually the changes and progression of the project over time. I also chose this presentation method because it allows me to include graphics, links, and code.

This posting is written near the completion of the project. The rest are written as if they are in the present, but they represent the chronological entries from my original rough project notes. This format should help the reader to engage with the subject matter and to understand the progression of my ideas over the last two weeks. It also divides the paradata into smaller, easier-to-handle chunks. Because I generally tackled a different problem each day, it sets the project up as a serial, in which the audience is invited to follow my journey and explore the different processes as I experienced them. It will include research questions, ideas, successes, and “fails”. My hope is that by following along, you, the reader, will be engaged, entertained, and educated, and will finish each posting with a sense of my process, my ethical and methodological considerations, and an interest in continuing the project’s research aims.

This project is based on the January 14, 1897, edition of the Shawville Equity. I initially downloaded it as a poorly OCR’d text file.

My initial questions focused on using several years’ worth of the paper to explore the changing French/English sentiments regarding the Second Boer War in South Africa. By the end of the project, I had restricted my analysis to one file and asked questions such as: How often are well-known people vs. regular people mentioned? Does this speak to the readership or function of the paper? Does the frequency of place names mentioned in the paper say anything about how close or far away the locations were, or was something else contributing to the frequency of named locations throughout the text?
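
Questions about how often places or people are mentioned come down to counting matches in the text. The sketch below is one possible way to do that in Python, not the method used in this project; the function name count_mentions, the sample sentence, and the list of places are all hypothetical.

```python
import re
from collections import Counter


def count_mentions(text, names):
    """Count case-insensitive whole-word occurrences of each name in text.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for name in names:
        # \b keeps "Ottawa" from matching inside a longer word;
        # re.escape protects names that contain punctuation.
        pattern = r'\b' + re.escape(name) + r'\b'
        counts[name] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# Made-up sample text, not taken from the Equity.
sample_text = "Shawville council met. Visitors from Ottawa reached Shawville on Monday."
places = ["Shawville", "Ottawa", "Montreal"]
print(count_mentions(sample_text, places).most_common())
```

With poorly OCR’d text, exact matching like this will miss names that the OCR has garbled, so the counts are best read as a lower bound.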