August 18: A series of errors and a helpful friend

This posting is really about things I tried to do most recently with the project, how they didn’t work, and how I received help to figure them out. I wanted to find a program that would allow me to not only count the number of words in the text like I did previously, but sort them by the TEI tags so that I could count word frequencies from within the collection of tags.

For example, I wanted to be able to address the question of what items were on sale and how they might have changed over time. The sale items all had different names, though, so it would not have been possible to search for them easily by keyword. However, every sale entry was wrapped in the same <sale> tag, and if I could search on that, I could categorise what was inside it and determine how much of the sale focused on meat, how much on animal furs, etc.

I reached out to the professor via our Slack channel and received advice directing me towards a program called Catmandu. I installed it and began following the tutorial. I also showed it to my brother, hoping he was familiar with it and could give me a head start.

We decided after some exploration that we would prefer to not only count how often each word appeared within the tags, but also to sort the place names and determine their relationship to each other. We abandoned Catmandu and began to work with regex and Python.

The plan was to create a regex that searched for the <placeName> tag, matching everything up to the end of the “ref”, and then use a Python script to tabulate all instances of the same place name.

I wrote a regex that I thought would do this:
(“<placeName” + “>)

The rest of the blog basically reads like a “fail-log”, an account of what I wanted to do, how I tried to do it, what help I had, and how it didn’t work.

I was right about the beginning section of the regex and the later need for the “+” to grab additional information. However, I was not precise enough: while I remembered some of the conventions for writing the expression, I forgot others, such as the “\” and the strategic use of brackets.
My brother modified my expression so that it continued until the “ref” at the end:
<placeName key=\"(.+)\" ref
(The “> that followed it in my notes was only inserted to close the tag and prevent the rest of the text from changing colour.) This correctly identified the place names, but the “ref” it picked up was not always the one belonging to the tag it started with, especially when several place names occurred on the same line.
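
The behaviour described above is what a greedy pattern would be expected to do: “(.+)” grabs as much as it can, so on a line with two tags it runs to the last “ref” rather than the first. The following is a minimal sketch of that effect and of the lazy “(.+?)” alternative. The sample line, including the key and ref values, is made up for illustration and is not taken from the project files.

```python
import re

# Hypothetical sample line; the key/ref values are invented for illustration.
line = ('<placeName key="Shawville" ref="#shawville">Shawville</placeName> and '
        '<placeName key="Bristol" ref="#bristol">Bristol</placeName>')

# Greedy: (.+) matches as far as it can, so it ends at the LAST '" ref' on the line.
greedy = re.findall(r'<placeName key="(.+)" ref', line)

# Lazy: (.+?) stops at the FIRST '" ref' after each opening tag.
lazy = re.findall(r'<placeName key="(.+?)" ref', line)

print(greedy)  # one long match that runs across both tags
print(lazy)    # one short match per tag
```

With the lazy version, each place name on a shared line is captured separately instead of being swallowed into a single match.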

We then turned to the Python part, looking for ways to read each line independently and then add recurring place names to a counter. This required creating a Python file that uses “for” to go through the lines and search each one for the regex, and “if” to check whether a place name has been seen before: if it has, its count is increased by one and the frequencies build up; if the place name is being mentioned for the first time, it is simply noted. The structure that holds the names and their counts is referred to as the dictionary.
He kindly created and sent me this Python script:

It works in theory, but in practice the “key=” part causes problems and the script will not run. However, this process demonstrated the initial steps of this analysis. If I had more time and was following this project through, the goal would be to learn more about why the placenames[m.group(1)] = placenames[m.group(1)] + 1 line isn’t working properly.
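
For reference, here is a minimal sketch of how the counting step could look. It is not the script mentioned above, and the sample lines and the function name count_place_names are hypothetical. One possible but unconfirmed cause of trouble with a line like placenames[m.group(1)] = placenames[m.group(1)] + 1 is that a plain dictionary raises a KeyError when a name is incremented before it has been added; collections.Counter avoids this by treating missing names as zero.

```python
import re
from collections import Counter

# Assumes each tag looks like <placeName key="..." ref="...">; adjust to the real markup.
PLACE_RE = re.compile(r'<placeName key="([^"]+)"')

def count_place_names(lines):
    """Return a Counter mapping each key value to the number of times it appears."""
    counts = Counter()
    for line in lines:
        # finditer finds every tag on the line, not just the first one.
        for m in PLACE_RE.finditer(line):
            counts[m.group(1)] += 1  # missing names start at 0, so no KeyError
    return counts

# Hypothetical sample text, invented for illustration.
sample = [
    '<placeName key="Shawville" ref="#shawville">Shawville</placeName> and '
    '<placeName key="Bristol" ref="#bristol">Bristol</placeName>',
    'Visitors from <placeName key="Shawville" ref="#shawville">Shawville</placeName> arrived.',
]

placenames = count_place_names(sample)
for name, n in placenames.most_common():
    print(name, n)
```

To run this against a real file, the sample list could be replaced with the lines read from that file.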

Theoretically, if it did, it would search for all place names that were in tags and tabulate how many times each was mentioned. I would then take the information and place it into context, asking what, if any, relationship there was between a place’s location and Shawville. Other data analysis could also be conducted, similar to what I have discussed previously. The same type of process would also be done for well-known vs. regular people, and for items sold at the groceries, asking what, if anything, they had to do with seasonal products.
