August 9: Regex tag fixing

The OCR output was messy: characters that are not regular text, such as “*” and “&”, along with stray Unicode characters, were causing the file to return error messages and preventing the tags from aligning properly. I spent the evening trying to write regular expressions to remove these before deciding to use find and replace and work through them carefully, one at a time. This took much longer, but ultimately worked better for me in the long run. Because I was making conscious decisions to modify the text, I wrote about it, noting the ethical and methodological ramifications it could have on my research and on future research.

For example, when I decided to remove some incorrectly formatted Unicode characters that were disrupting the TEI coding, I was doing something fairly straightforward on the surface. However, those Unicode symbols stood for something, and I was making a conscious decision to remove information from a text that I will theoretically be analysing later on. Since the characters appeared only as mis-formatted Unicode, I have no idea what they actually said, and they could have been important or not. I tried to mitigate the damage I did by making clear notes on what I removed or replaced and why. The hope is that this paradata can be used by future students and historians not only to help them understand the steps I took and my thought process, but also to gain an understanding of the manipulation and analysis I ran, so that they can decide how it impacts their planned work and adjust accordingly.

After some modifications to my regex and experimentation using RegExr, I created an expression to ignore the ” [“*”])(\”). It didn’t work, and then I realised I could remove the markings in gedit using the find and replace function.

Using the find and replace function, I looked up the Unicode meaning of each of the following symbols before replacing it. I replaced “*” and “&” with nothing; ’ with “'”; “ with “"”; „ with “"”; ‘ with “'”; ” with “"”; and — with “-”. I replaced ™ with nothing because the suggested symbol was a TM and made no sense based on the context, and • with nothing because the suggested symbol was a picture of a sword and made absolutely no sense. I made sure to examine each case before replacing it, and I found that there was no real alternative.
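The substitutions described above could also be scripted so that the paradata is generated alongside the cleaned text. What follows is only a minimal sketch, not what I actually ran in gedit: the function name `clean_text`, the sample sentence, and the choice of straight quotes as replacement targets are all assumptions made for illustration.

```python
from collections import Counter

# Replacement table mirroring the substitutions described in the post.
# The straight-quote targets are an assumption about what was typed into gedit.
REPLACEMENTS = {
    "*": "",
    "&": "",
    "\u2019": "'",   # right single quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u201e": '"',   # double low-9 quotation mark
    "\u2014": "-",   # em dash
    "\u2122": "",    # trade mark sign
    "\u2022": "",    # bullet
}


def clean_text(text):
    """Return the cleaned text plus a count of every character replaced."""
    log = Counter()
    out = []
    for ch in text:
        if ch in REPLACEMENTS:
            log[ch] += 1
            out.append(REPLACEMENTS[ch])
        else:
            out.append(ch)
    return "".join(out), log


# Hypothetical sample line, invented for this example.
sample = "\u201cHaley\u2019s Station\u201d \u2014 est. 1820*"
cleaned, log = clean_text(sample)
print(cleaned)
for ch, n in sorted(log.items()):
    # Record the code point so a later reader can see exactly what was changed.
    print(f"U+{ord(ch):04X} replaced {n} time(s)")
```

Saving the printed counts next to the cleaned file would give future readers a record of which code points were altered and how often, although it cannot replace the case-by-case judgement of checking each symbol in context.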

I re-ran the command to automatically align the tags, but it presented me with a lot of errors. I will copy the errors into a separate document and clean them up manually. There are over 300 lines of errors similar to the ones below:

/dev/stdin:97: parser error : StartTag: invalid element name
established at Haley a Station.</p> <p>1 hey are <>lltif remarks that the contin
^

To my frustration, I found that the majority of the errors were caused by sloppy, late-night tagging: I was missing the forward slash in many of the closing tags. Eventually, I was able to fix them.
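A quick well-formedness check can help locate this kind of mistake before re-running the alignment. The sketch below uses Python's standard `xml.etree.ElementTree` parser rather than the command I used; the helper name `first_xml_error` and the two sample strings are hypothetical. Note that it stops at the first problem, so it would need to be re-run after each fix instead of listing hundreds of errors at once.

```python
import xml.etree.ElementTree as ET


def first_xml_error(xml_text):
    """Return None if the text is well-formed XML, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, col = err.position
        return line, col, str(err)
    return None


# Hypothetical snippets: the second one has a closing tag typed without its slash.
good = "<text><p>First paragraph.</p><p>Second paragraph.</p></text>"
bad = "<text><p>First paragraph.<p><p>Second paragraph.</p></text>"

print(first_xml_error(good))
print(first_xml_error(bad))
```

The parser only notices the problem once a later closing tag fails to match, so the reported position is usually a little after the tag that is actually missing its slash.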

I then tried to use OpenRefine to fix some of the spelling and OCR errors. I converted the file into a .csv, but this did not make sense because the text is continuous prose and has no real rows or columns the way a spreadsheet would. I decided against OpenRefine once again.
