As I mentioned in my previous post, TEI is deceptively simple: the markup itself looks straightforward, but doing it well takes a long time and requires a sound methodology and careful reading.
For example, I wrote a note to myself in my research notes, reminding myself and future users that I was unclear about what the consequences of using and then modifying this poorly OCR'd data were:
“I’m not sure the ethical restrictions on using this data yet because all the stories about land valuation, medical breakthroughs, huge store sales are all intertwined – does this impact topic modelling?
This is why people have to be critical and include close reading of their texts – can’t just begin with topic modelling”
After I added all the paragraph tags, I tried to use a plugin installed on my computer to properly align them. As mentioned earlier, I learned the importance of aligning tags when working with Python. I ran into a whole string of errors, which indicated that the script couldn't be run because of the extra "<" characters that poor OCR had scattered throughout the text.
I realized I could use regex to find and replace these extra, useless characters. With my brother's help, I was able to construct a regex that found the "less than" symbol and, whenever it was followed by a character that wasn't "p" or "/", replaced it with the HTML entity version. That way the symbol would still display correctly, but it would no longer be mistaken for the start of a tag, which solved the problem.
The regex looked like this: <([^p\/]), and the match was replaced by &lt;\1
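For anyone who wants to try the same find-and-replace in Python, here is a minimal sketch. It assumes the replacement is the HTML entity for the "less than" symbol (&lt;) followed by the captured character. The function name escape_stray_lt and the sample sentence are made up for illustration and are not from my actual data.

```python
import re

# Pattern from the post: a "<" followed by any one character
# that is not "p" or "/". The character is captured so it can be kept.
STRAY_LT = re.compile(r"<([^p/])")


def escape_stray_lt(text: str) -> str:
    """Replace stray "<" characters with the HTML entity "&lt;".

    Hypothetical helper: "<p" and "</" are left alone so that
    paragraph tags survive.
    """
    # \1 puts the captured character back after the entity.
    return STRAY_LT.sub(r"&lt;\1", text)


# Invented sample text imitating OCR noise inside a paragraph.
sample = "<p>Land valued at < $500 and sales <up this year.</p>"
print(escape_stray_lt(sample))
# prints: <p>Land valued at &lt; $500 and sales &lt;up this year.</p>
```

Two limitations of this pattern are worth knowing: a stray "<" that happens to sit directly before a "p" or a "/" (as in "<pound") is left untouched, and so is a "<" at the very end of the text, because the pattern needs one following character to match. Close reading is still needed to catch those.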