Today I was excited to begin cleaning up the Equity files I had previously downloaded. I naively thought it would be easy and that some simple regex (regular expression) cleanup would work. I was wrong. Instead, I entered a period of rapidly changing my project design and research questions, based on the limitations of the research medium and my own abilities with the text.
I began with a regex tutorial and started by modifying the Python script it provided. Regex is a way of describing patterns of characters or phrases to search for within a document, and Python is a programming language that can use regex to find and replace text. I changed the file names and expressions that I could see needed changing, but kept receiving error messages. My error messages all looked similar to this:
clare@clare-fun1:~/School/Final_EquityProject$ python PthyonScript.py
File "PthyonScript.py", line 19
nodash = re.sub('.(-+)', ',', line)
IndentationError: expected an indented block
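Luckily, my wonderful, computer-oriented brother took a look and noticed that my Python indentation was not as precise as it needed to be. Python uses indentation to mark which lines belong inside a loop or condition, so a line in the wrong column stops the script from running. We worked on it and got the script to fill the .csv file correctly. My goal was to turn the newspaper into a .csv file and clean it using OpenRefine.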
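For anyone hitting the same error, here is a minimal sketch of the kind of loop involved. It is not the actual script: the function name replace_dashes and the sample lines are made up for illustration, and only the re.sub call is taken from the error message above.

```python
import re


def replace_dashes(lines):
    """Replace each run of dashes, plus the one character before it, with a comma."""
    cleaned = []
    for line in lines:
        # This line must be indented under the `for`. Leaving it in the same
        # column as `for` raises "IndentationError: expected an indented block".
        nodash = re.sub(r'.(-+)', ',', line)
        cleaned.append(nodash)
    return cleaned


# Hypothetical sample lines, not taken from the Equity files.
sample = ["Shawville ---- 1883", "no dashes here"]
for row in replace_dashes(sample):
    print(row)
```

The pattern matches any single character followed by one or more dashes, so the space before the dashes is swallowed along with them. Lines without dashes pass through unchanged.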
Unfortunately, when I opened the file, I realised we had done something wrong and deleted 1,700 lines of text. I was frustrated and it was late, so I checked that I still had a backup file, which I did, and set the work aside. I decided that this Python script method was not going to work, so I made the first significant change to my strategy and decided that RStudio was what I needed.