August 6: First Fail – Regex with Python

Today I was excited to begin cleaning up the Equity files I had previously downloaded. I naively thought it would be easy and that some simple regex (regular expression) cleanup would work. I was wrong. Instead, I entered a period of rapidly changing my project design and research questions, based on the limitations of the research medium and my own abilities with the text.

I began with a regex tutorial and started by modifying the Python script provided. Regex is a way of searching for characters or phrases within a document, and Python is a programming language that can use regex to carry out commands. I made the modifications to file names and expressions that I could see needed changing, but kept receiving error messages. My error messages all looked similar to this:

clare@clare-fun1:~/School/Final_EquityProject$ python
File "", line 19
nodash = re.sub('.(-+)', ',', line)
IndentationError: expected an indented block


Luckily, my wonderful, computer-oriented brother took a look and noticed that my Python indentations were not all as precise as they should be. My goal was to turn the newspaper into a .csv file and clean it using OpenRefine, and after we worked on the script together, it filled the .csv file correctly.
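For anyone curious what the script was doing, here is a minimal sketch of the kind of Python/regex cleanup I was attempting. The file names and the exact substitutions here are illustrative, not the tutorial's actual script:

```python
import re

def clean_line(line):
    # Replace runs of dashes (common OCR noise) with a comma --
    # a simplified version of the tutorial's re.sub step.
    nodash = re.sub(r'-+', ',', line)
    # Collapse the repeated whitespace the OCR leaves behind.
    return re.sub(r'\s+', ' ', nodash).strip()

def clean_file(in_path, out_path):
    with open(in_path, encoding='utf-8') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:            # note: the loop body MUST be indented,
            cleaned = clean_line(line)  # or Python raises IndentationError
            if cleaned:
                dst.write(cleaned + '\n')
```

This is also where my IndentationError came from: every line inside `for` and `with` blocks has to be indented consistently, or Python refuses to run at all.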

Unfortunately, when I opened the file, I realised we had done something wrong and deleted 1700 lines of text. I was frustrated and it was late, so I set it aside after making sure I had a backup file, which I did. I decided that this Python script method was not going to work, so I made the first significant change to my strategy and decided that RStudio was what I needed.

August 5th, 2017: The beginning

Today is the start of my final research project for my digital history class. The assignment is to do something digital history-like with a subset of the archived Shawville Equity files that have been scanned and made accessible on the internet.

I am going to use the papers from 1897-1902 (6 years) to see if there is an increase in French/English conflict in Canada from before to after the Second Boer War began in South Africa. Questions include whether there was a difference of opinion between the French and English about supporting the British troops in the war effort. This will be especially interesting because the Shawville Equity is an English paper within Quebec and will have an interesting take on the matter. I picked 6 years because it seemed like a reasonable spread to gauge public opinion or official reports of the war. I recognize that six years' worth of data is not a large sample, but I hope that the change will be great enough to notice over the six years. I also made this decision practically because I do not have unlimited internet and did not want to download a decade's worth of materials.

I modified commands from Module 2 to use as a base for downloading the Equity files I was looking for. I also used Ian Milligan’s tutorial on wget.

My original command returned errors because I made spelling and structural mistakes. Compare it to the correct form I arrived at after playing around with formats.

/jrn03/equity/src/1897/ -A .txt -r --no-parent -nd -w 2 --limit-rate=20k


wget -r --no-parent -w 2 --limit-rate=20k -A .txt


Having successfully downloaded the files, I began thinking about what I had to do to clean them up enough so that they were usable for extracting the data I wanted. At this point, my questions were focused on public sentiment about conscription and fighting with the British in the Second Boer War in South Africa.


Equity: A Diving-in Note

At the beginning of this project, I did not know how I wanted to present my paradata to my audience. In thinking further about the assignment, I realised I envisioned my audience as English-speaking residents of Canada (most likely those in the Ottawa/Western-Quebec region) or those already interested in or familiar with the Shawville Equity. The project could also be useful for (amateur) digital historians who are working with an OCRd newspaper and could benefit from an explanation of my processes. I decided to use a blog format for easy accessibility and to visually illustrate the changes and progression of the project over time. I also chose this presentation method because it allows me to include graphics, links, and coding aspects.

This posting is written near the completion of the project. The rest will be written as if they are in the present, but they represent the chronological entries from my original rough project notes. This format should help the reader to engage with the subject matter and encourage them to understand the progression of my ideas over the last two weeks. It also divides the paradata into smaller, easier-to-handle chunks, and, because of the way that I generally tackled different problems each day, sets the project up as a serial, in which the audience is invited to follow through my journey and explore the different processes as I experienced them. It will include research questions, ideas, successes, and “fails”. My hope is that by following along, you, the reader, will be engaged, entertained, educated, and finish each posting with a sense of my process, ethical and methodological considerations, and an interest in continuing the project’s research aims.

This project is based on the January 14, 1897 edition of the Shawville Equity. I initially downloaded it as a poorly OCRd text file.

My initial questions focused on using several years' worth of the paper to explore the changing French/English sentiments regarding the Second Boer War in South Africa. By the end of the project, I had restricted my analysis to one file and asked questions such as: How often are well-known people vs. regular people mentioned? Does this speak to the readership or function of the paper? Does the frequency of place names mentioned in the paper mean anything in terms of how close or far away the locations were, or was there something else contributing to the frequency of named locations throughout the text?

Final Project Work

This week I have been working on my final project. It has been an interesting journey which has changed much since I envisioned it last Sunday. I originally planned to use several years' worth of data (1897-1902) to do some topic modelling and see if there was an increase in French/English conflict, as I would have suspected based on the divide created by the conscription crisis of the Boer War. I especially wanted to see what an English paper within Quebec had to say on the subject.

I first had trouble downloading the files because the command I used needed more modification than I had given it. I had further trouble with the newly created DH Box account, so I just downloaded the .txt files onto my computer using the modified code: wget -r --no-parent -w 2 --limit-rate=20k -A .txt

I was interested in using Python to clean up the messy OCR, so I pulled up that tutorial and started working off the Python script, modifying it as I thought appropriate. I kept receiving error messages that basically all looked like this:

clare@clare-fun1:~/School/Final_EquityProject$ python
File "", line 19
nodash = re.sub('.(-+)', ',', line)
^ IndentationError: expected an indented block


My brother explained that the indents weren't properly aligned. I tried to re-align everything. I ran it again, but it deleted 1700 lines of text, leaving me with 14. I learned how precise Python has to be and got a rough understanding of how it should work.

I then tried to fix the readability of the Equity papers by using RStudio to remove poorly OCRd symbols. I ran into problems because, even though I was using a csv format, my file didn't really make sense: it didn't have useful headings or easy-to-separate columns.

I decided on marking up one Equity paper using TEI. I am modifying the class tutorial as I go.  I started by making the <p> and </p> tags where I thought they should go. This was complicated by the fact that the OCR hadn’t respected the newspaper column lines and the lines were intermixed and often did not make very much sense. I had to make methodological and ethical decisions about where to put the breaks. I discuss this more in my ongoing final project notes and fail-log.

Next I had to use a program to properly align the <p> </p> tags and first encountered many errors of incorrectly closed tags and such that I had to fix. Then my program wouldn’t do the aligning, so my brother quickly ran it through his text editor.

I tried to use regex to identify and remove useless symbols that were causing problems with the computer's recognition. I couldn't get the expression to pick up what I wanted in some instances and not others, so I used the find function and replaced them that way, checking the context to make sure each switch wasn't a problem. I made a note of the substitutions in my fail-log and discussed in more depth the reasoning and ethical debates I wanted future users to be aware of.
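As a sketch of what I mean by picking up a symbol in some instances and not others: a substitution can be restricted with lookarounds so it only removes a symbol when it stands alone, leaving symbols embedded inside words for manual review. The symbols here are hypothetical stand-ins for the ones in my file:

```python
import re

def strip_stray_symbols(text):
    # Remove a stray symbol only when it is flanked by whitespace,
    # so symbols embedded inside words are left for a manual pass.
    cleaned = re.sub(r'(?<=\s)[■¬](?=\s)', '', text)
    # Collapse the double space left behind by the removal.
    return re.sub(r' {2,}', ' ', cleaned)
```

For the cases the expression couldn't safely distinguish, a manual find-and-replace with a look at the surrounding context was the honest fallback.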

Closer to the end of the week, I began the actual encoding of people and the other categories I had identified through my familiarity with the text from skimming. I began with the people and, as of today, have gotten through the first 200 lines. I didn't realise just how long it would take, so I may just do a really good job on the first 400 lines and use it as a proof-of-concept.

I used the formula below to encode the first name, "Cation Thornloe" (which appears to be a poorly rendered "Captain Thornloe").
<p> <CationThornloe <key="Thornloe, Cation" from="?" to="?" role="Bishop" ref="none"> </persName>

I received a parse error saying it was improperly formed.
I examined the format and noted that there was an extra ">" before the end tag </persName>. Even though it appeared to follow the format laid out in the template, I modified it to close out the tag:
<p> <CationThornloe/> <key="Thornloe, Cation" from="?" to="?" role="Bishop" ref="none"> </persName>
It then returned a parse error again, saying it was poorly formed.

In short, I was confused: I had not made sure the stylesheet was referenced in the .xml file.
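In hindsight, a quick well-formedness check would have caught both attempts before the previewer did. This is a sketch using Python's standard XML parser, together with a guess at what the template intended (the attribute names follow the class template rather than standard TEI):

```python
import xml.etree.ElementTree as ET

def is_well_formed(snippet):
    """Return True if the snippet parses as XML, False on a parse error."""
    try:
        ET.fromstring(snippet)
        return True
    except ET.ParseError:
        return False

# My attempt: the '<' before 'key' opens a tag that never closes properly.
attempt = '<p> <CationThornloe <key="Thornloe, Cation" role="Bishop"> </persName>'

# A guess at the intended form: one persName element wrapping the OCRd
# name, with the details carried as attributes of that single tag.
fixed = ('<p><persName key="Thornloe, Cation" role="Bishop" ref="none">'
         'Cation Thornloe</persName></p>')
```

Running every edited chunk through a check like this, rather than waiting for the browser preview, would have turned mystery parse errors into immediate feedback.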

I encoded every name, looking up what I could about each person online. Some people were easier to find, like the lumber barons or bank board members. Some, like Miss Jennie who left for North Bay to take up dressmaking, were less easy to find. In my fail-log, I discuss how the project would have been more complete if I had had time to properly confirm each person's identity, and that in cases where lots of information was given, I had to choose what to include, based on the context and influenced (as is unavoidable) by me.

I ended the week by trying to preview all the work I did and receiving errors from sources with "unusual characters" (=, &). Once I resolved that problem, I hit this one, which I have not yet resolved:

**XML Parsing Error:** mismatched tag. Expected: </p>.
Location: file:///home/clare/School/Final_EquityProject/TEI_OCR_Tagging.xml
Line Number 829, Column 15: </body>

Questions that drive my markup at this point include:

  1. How often are well-known people vs. regular people mentioned?
  2. How might this speak to the function or readership of the paper?

Discovering how to play with data

What I was trying to do
I chose to work through Gephi and RStudio for the two exercises this week.
As I understand it, Gephi lets you see and manipulate data visually, while RStudio is more mathematical
and command-line based, allowing data to be pulled out by topic (which could then be fed into Gephi after cleanup).

What I did
I had trouble running the "force-directed layout" until I read through the rest of the instructions, and then it made sense.
*still having problems with my tendency to not read all the text before trying to do something*
**while working on the data, it looked messy and I was unsure what it
was really showing me because it didn’t include the names in the
“working part” – so it took me extra time to make sure that I was doing it right
**The instructions are pretty easy to follow using this one layout – Force Atlas 2**
I did have some initial trouble understanding what Force Atlas was supposed to do and how it worked – again, simple issues
that I sorted out easily by referring to the Slack channel and running ideas past my brother

I have gotten to this point:
", topic.words[7,])"
Although I don’t really understand the technicalities of how I got here and what is going wrong.
RStudio returns this error:
Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :

I checked Slack for any ideas and posted a note asking for help. Prof. responded with what I feared: the file was empty.
I consulted my brother, who pointed out that I had accidentally skipped a step because I misunderstood the instruction reading "if you already did the above, ignore this step", and that
by doing this, I had deleted the contents of the file. We went back and redid that step, and were able to continue with the tutorial.

The way it manipulated the data reminded me a bit of Gephi. Not sure what I will do with it, but looking forward to trying it out tomorrow.

I also pulled the github repo for this week because I created (with help) a new one since I’m now back on DH Box. Then I synced it again and pushed it back.

Things that were hard
**One thing I did not like about Gephi is that it appeared rather utilitarian and I had trouble intuitively navigating it – not user friendly**
-this was especially true for the *difficulty* I had in figuring out
how to follow the instructions on filtering out the unconnected points – I kept deleting the wrong information while following the instructions until I changed the "0" to "1" – I found that the program-specific language wasn't really clear or easy for beginners to use

Probably my biggest challenge is still not completely understanding what I was doing and why it was either
working or not working (this often means that even for simple mistakes, I need someone to point out
what I am doing so I can see why it is right or wrong).
I found that for this RStudio exercise especially, I did not understand the instructions. However, I have also found that as I begin to work
on my final project using the Equity files, I am more comfortable changing parts of the tutorials and
modifying them to use the files/commands/ etc I want, so I am hoping that I will have success with replicating
this tutorial.
I do know that my brain doesn’t specialize in this type of work and learning and that I probably will
bring forward some ideas and methods for future use, rather than radically switching the way I do research or use the computer.
There are things I am starting to do without the tutorials (like the command line stuff and markup), but
my grasp on the rest is still mostly tied to the tutorials.

Thoughts on where to go next
I am hoping that after doing some
OCR cleanup on my Equity files, I can use RStudio to pull out important themes and words to explore further
and I can use Gephi to visually graph it. My ideas and aims will probably change over the coming week as I work on it.

Final Project
I began working on the final project this week. I used Wikipedia to decide what timeframe I wanted to focus on, based on
topics that might have been in the news during that time. I chose 1897-1902 (before and during the Boer War and its
potential impact on English/French Canadian relations due to conscription). I downloaded the files, using a mixture of the
wget command from the workbook and the command suggested in the tutorial the workbook linked to.
I was proud of the way I did not give up, but added and switched things around until I had a command that did what I wanted it to.
wget -r --no-parent -w 2 --limit-rate=20k -A .txt


At this point, I am most interested in all the different ways that research can be manipulated and displayed. I only touched on two ways, but may use different methods in my final project. I have been thinking more seriously and deliberately about what I do, how I do it, and why. This week, I was especially proud of myself for not only (mostly) correctly following the tutorials without significant assistance, but also beginning to feel comfortable modifying the command lines to create different tables that interested me.

I also started working on my final project and after some trial and error in working with the command line for wget and using critical thinking skills to modify previous examples, I got what I wanted downloaded and stored where I wanted it. Looking forward to the next steps.

Who benefits from this work?

Select one of the articles behind the links above (and/or in the exercises) to annotate, asking yourself, ‘who benefits from this? who is hurt from this?’. Make an entry in your blog on this theme.

I found Michelle Moravec’s blog posting “Corpus Linguistics for Historians” most interesting. I appreciated how she explicitly set out the main reasons why she liked using Corpus Linguistics and then briefly explained what tools she used and why she chose them. She then gave visual examples of each of the tools and explained her thinking/use of them.

I found that this was a good example of how historians can practice principles of digital history such as being "open source" while also producing an educational and easy-to-follow document. I think that anyone interested in using any of the tools she identifies would benefit from her post. Her use of "in-process" visuals and her explanations of what is happening and why are useful in helping the reader to follow the text, while also critically engaging with the tools and methods being presented. From this perspective, I do not see how people could be harmed.

However, the situation does get a bit more ambiguous when we explore the examples she provides and the way she uses the tools. By viewing files based on word frequency, she is assuming that frequently used words are more important. This potentially misses the few instances where a word appears rarely but matters more. Depending on the likelihood that this happened (which is really impossible to know), it is possible that a vital piece of her argument has been left out. This could potentially impact how people use her work and continue forward. However, the fact that she is aware of how and why she focused on particular words or clusters (her methodology), and provides it in narrative form to the reader, is beneficial.

Again, her use of the cluster feature is troubling for the same reason. While it seems logical for people to assume frequency equates to importance, this is a particular assumption held by North Americans. We cannot be sure others hold it, nor that this assumption was meant to be held by those who created the program.

The fact that the author presents her paper as an “exploration” can be helpful or harmful, depending on how familiar the reader is with history, particularly digital history, and how dedicated they are to good practice and methodology (as explored in the previous module). For example, when the author says that she notes that the densest file is once again Stanton’s and that she “[continues] exploring”, she presents both the strengths and weaknesses of digital technology such as this.

One strength of this type of analysis is that it allows for rapid and easy manipulation of data and lets the user quickly see if an idea or theory they have had is feasible, based on how they manipulate data. For example, a csv file can produce many different types of charts and use multiple variables. A user can find an interesting trend and quickly follow it through, checking against the data or cross-referencing with another variable. The problem is part of the same advantage because it is the user who manipulates and changes the way the data is being viewed or compared against. Someone who is not familiar with the ethics and methodology of good digital history can easily work off their biases and produce potentially misleading data. This could harm their reputation and the scholarship at large.

Ultimately, as the author writes, these tools are useful because they allow for broader comparison of "patterns and shifts over time and space". These practices can also be harmful if not done in conjunction with typical academic methods, a fact that the author also points out when she argues that these are good tools to use before beginning close readings.

In conclusion, I think the same tools can be both beneficial and harmful. It is not the tool that has a negative or positive value, but the people and methods using it. That is why this class is about learning how to responsibly and helpfully use the tools to contribute to history at large. The ease of working with vast quantities of data carries the same potential for harm for the unsuspecting user as a casual scan of library shelves or the use of only one archival source. For those inclined to do poor history, these tools will allow them to continue; for those concerned about methodology, these tools help broaden the data they can explore.

Making Headway

What I was trying to do
I was trying to download and “clean” a group of diplomatic correspondence from Texas that had gone
through OCR. I used Regex to clean it into a usable file and Google Refine to make it easy to
search and manipulate for later use.
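The general shape of that workflow can be sketched in Python with a hypothetical header format (the real correspondence headers differed): a capture-group regex pulls out just the sender, recipient, and date lines, and only those fields are written to a CSV.

```python
import csv
import io
import re

# Hypothetical header line format, e.g. "Sam Houston to A. B. Roman, September 12, 1842"
HEADER = re.compile(
    r'^(?P<sender>.+?) to (?P<recipient>.+?), (?P<date>\w+ \d{1,2}, \d{4})')

def headers_to_csv(lines):
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(['sender', 'recipient', 'date'])
    for line in lines:
        m = HEADER.match(line)
        if m:  # letter bodies don't match the pattern, so they are skipped
            writer.writerow([m['sender'], m['recipient'], m['date']])
    return out.getvalue()
```

The same idea drives the Regex-plus-Refine approach: a pattern extracts the structured bits, and the spreadsheet tool takes over from there.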

What I did
I finally had success with the command line and am able to navigate its simple functions
with ease! I made note of this success here:
I’m finally getting to understand this!
**So Cool**
*Look what I can do
#Brief fail-log before actually creating the final one
I’m creating this markup file in nano in my terminal. This in itself is a positive step for me
because it means I found it, opened it, and will be able to “add” “commit” and “push” it to a safe place (github).
I found this module easier than the previous ones, which means I am learning something finally!
Now I’m actually looking forward to mucking around with the final project.

Used built-in Terminal – can see physical files on my computer – way easier
had some trouble at the beginning understanding the task and how to get started
*note: look through command line and find other notes to consolidate into one “fail-log”*
I made much more use of the Slack channel and what other people posted *review and credit people*
and made some initial use of my brother to walk me through the regex concepts *still a bit fuzzy,
but the notes and a place to play around are helping*
Other than simple typing errors and not reading closely enough (yes, we were warned),
things went pretty well.

**Ran it on whole file, not just sender on top**
**deleted letter contents – kept stuff at top, then completed the exercise, highlight**
*problem with reading and understanding – new language issues*
*Tried Palladio and it made a picture of connected words,
but it did not make sense to me and I couldn't figure out how to use it properly – try again*

Had assistance from my brother. Jeffblackadar helped me identify why my OpenRefine file did not look right.
Dr. Graham further assisted by pointing out that the data probably still contained the letter contents
and suggested ways to fix it (somehow I missed or unsuccessfully completed that step).

My main "fails" this week were thinking that I had installed the Java and OpenRefine files (I did all the
command steps, I thought), but I had only extracted the files and not actually run the correct install.
Ex. "git apt install Java" instead of "sudo …. (correct name)".
As mentioned above, I also incorrectly deleted the letter contents from the file I was working on and
completed the steps without getting rid of the contents. This led to confusion and some help via Slack
from Jeffblackadar and the Prof. My brother and I went back to an old version (luckily had many backups)
and properly got rid of the contents and re-ran the tutorial.

Things that were hard
I still have trouble following the instructions when so many new terms and programs are installed.
I tend to try to follow them without knowing what I am doing, thus being unable to identify the mistakes
when they occur and causing more trouble down the line. Thinking before doing still needs to be applied better.
I did not figure out how to use the Palladio software.

Thoughts on where to go to next
I did go down and work with the command strings at the bottom of the tutorial after
thinking about it and trying to piece it together myself.
**this is the next step I will have to work on** – I expect using the testing website to
"play around" with regex a bit more will be useful.
I also need to continue carefully reading the instructions before doing things – it will make the frustrations less later.


NB: This is a copy of the fail-log written straight to github. I realised I was duplicating my work and am trying to only do it once. Please let me know if this is unacceptable. Thanks

Data Cleaning

I have not really had any experience doing data cleaning in other classes. This is mainly because I've worked with comparatively small amounts of written and published data. Anything remotely like what we are doing in our digital history class was not even on my radar until very recently. I picked a topic, found what felt like a reasonable number of documents to support it (say 15 or so), and wrote the paper. The most data cleaning in my history classes probably had more to do with my less-than-stellar note-taking skills and the need to discover exactly what I wrote.

That being said, my current coop placement is with the Federal government and involves large-scale data analysis of information entered into a statistical spreadsheet program. I have been involved in cleaning up that data. This type of data cleanup is similar to what we have been doing this week. It involves making sure the agreed-upon variables are the same (e.g. 9999 encoded as a missing value rather than 999), using search and replace functions, and manually inspecting the data before beginning to turn it into meaningful charts. We discuss this work in detail while we're doing it. We compare methodology, save and share our syntax, double-check each other's work for inaccuracies, and work in a conscious and systematic way. This is necessary because if I do something differently from my colleague, we will not be able to be sure that our values are both "accurate" and comparable. In the world of statistical analysis, how and why data is cleaned up matters, is easy to collaborate on, and is replicable.
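As a toy illustration of that kind of recode (the column name and codes here are made up, and this is plain Python rather than the statistical package we actually use):

```python
def recode_missing(rows, column, old_code='999', new_code='9999'):
    # Bring stray old-style missing values in line with the agreed code,
    # counting the changes so the cleanup can be documented and checked.
    changed = 0
    for row in rows:
        if row[column] == old_code:
            row[column] = new_code
            changed += 1
    return changed
```

Returning a count of what was changed is part of the point: it is what lets a colleague double-check the cleanup instead of taking it on faith.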

I suspect that the lack of close collaboration between historians and the unique nature of each research project hinder the discussion on data cleaning. For those who do it, it is seen as a necessary step before beginning the "real" research. This is also why, as per Ian Milligan's post on online newspapers, historians have not embraced critical analysis of their database usage. While Ian argues for its inclusion, I recognize that I too was unaware of these methodological issues because searching is "just the way it needs to work". It has gotten to the point where we would be lost if the search programs all broke down and we had to figure out how to get what we didn't know we needed without them. Of course this should be included in our scholarly methodology, but it isn't considered part of the research or findings (only a tool), so it is not included.

Part of the problem is also based on the solitary nature and sense of possessiveness historians feel over their research, partly because of the work they have done to find the materials, make them meaningful, and then craft an argument. Just as in our previous discussion about open data and sharing our research, I think historians are afraid that people will unfairly take their data cleanup methods. This is especially true in a time when some historians are still very new to digital history and are much less proficient with the programs available.

However, while these reasons make sense and seem natural, they could be negatively impacting the type and quality of research we put out. This is especially true for those new to digital history and unfamiliar with the additional biases and considerations it involves. As we saw in our use of Google Refine, we can group things, remove things, change names, and do much more. Some changes are obvious and make sense, such as merging entries for "John Smith" with "Jon Smith". Other changes might be riskier, especially when dealing with poorly rendered OCR. We might have chosen to merge files whose dates place them in 1884 into one new file and, in doing so, have included a group of poorly rendered "18J4"s. This decision could have a significant impact, and when it is not discussed, it can lead to different conclusions than if those entries had been left out.
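A sketch of that risky merge in Python (the variant list is hypothetical; in OpenRefine this would be a clustering decision): the point is that the substitution is explicit and countable, so it can be reported in a fail-log rather than silently applied.

```python
def merge_year_cluster(values, canonical='1884', variants=('18J4', 'I884')):
    # Replace known OCR variants with the canonical year, counting the
    # changes so the decision can be recorded. Note that '18J4' could
    # equally be a mangled 1834 or 1854 -- hence the risk.
    changed = 0
    merged = []
    for v in values:
        if v in variants:
            merged.append(canonical)
            changed += 1
        else:
            merged.append(v)
    return merged, changed
```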

As well, the ordinary considerations regarding personal and source bias apply. What data was chosen for cleaning, how it was cleaned, why it was cleaned in that manner, and what the end results did are all important questions to ask.

In addition, if the original data was not kept separately (or backups at various stages), different layers of analysis could be lost. In the data cleanup we did, we turned it into a csv spreadsheet. This is good for some work, but doesn’t give the full picture or allow for easy work on other issues. It also didn’t allow for the searching of the text, but did provide the sender, recipient, and date. If we were to organise the data to indicate frequency of letter writers, we would not see which years are the most written in. Multiple layers would have to be investigated, making it a different source of data, but no less challenging or critical to work with.

I wouldn’t say that failure to disclose full data cleaning methods makes the argument weaker or less relevant, any more so than failing to disclose what traditional sources you did or didn’t use and why, would. It is definitely best practice, and could make the argument stronger, but the typical reader of a monograph also relies on the author to have done a thorough job in their research practices.

However, in terms of replicability and history as a science, following the rules and conventions that are generally agreed on, such as TEI, is necessary. In addition to easing the burden of coming up with new ways of doing things and making it easier (via open access research) for others to replicate and confirm your findings, following standard conventions makes it easy for others to verify and lend credibility to your findings. As I relate in my anecdote, not following conventions and carefully recording your steps and concerns is like not leaving notes on complex statistical programs. Ultimately, as sarahmcole notes, “better digital methods make for better scholarship, period.”

Module 2: An Exercise in Frustration

Last week, it took more time than I thought it should to start understanding command line. This week I learned that while I seem to have clung to the idea of a few of the commands, I didn’t remember them in order, and I did not remember some of the commands at all. They say “practice makes perfect”, so I am fairly confident that by the end of this all, I will have mastered the command line.

Seriously though, this week I found that the instructions were relatively easy to follow and that, while I am not yet sure what I would do with them myself, I understand the general purpose of how they worked. I really enjoyed searching through the databases and using the csv files of war dead to look for patterns by year, location, surname, etc. Within the search I did for Maier, I found that one entry was for a civilian woman killed in Jamaica. There are so many interesting things about this entry that I can see how the whole method would be useful in confirming theories (the Maiers were German and mainly male) through big data. I can also see how it can highlight interesting cases to focus on deeply (the civilian woman in Jamaica).
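The kind of filtering I mean can be sketched in a few lines of Python (the column names and rows here are toy data, not the real file):

```python
import csv
import io

# Toy stand-in for the csv of war dead; the real file had different columns.
SAMPLE = """surname,year,location
Maier,1916,France
Maier,1944,Jamaica
Smith,1917,Belgium
"""

def find_by_surname(csv_text, surname):
    # Pull every record matching a surname, mimicking the command-line
    # searches I ran against the csv.
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row['surname'] == surname]
```

The same pattern generalizes to filtering by year or location: swap the column name in the comparison.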

Unfortunately, my troubles with the command line continued, and I waged a war against incorrectly entered commands, non-existent commands, and out-of-order commands. I had trouble mostly with Exercises 1 and 5. In Exercise 1, I failed to check which repo I was saving files to and accidentally pushed them to the wrong repo on github. I tried to fix it, ended up deleting the file, and panicked while trying to see whether I could recover it from my DH command line. The solution was to pull the incorrect repository from github, extract the files, create a new repository (the key step I had forgotten), and re-push it. My brother helped, and sarahmcole kindly gave it a shot through her advice on Slack.

In Exercise 5, most of the problems appear to have been caused by my not waiting long enough for the Twitter files to download, and by the downloaded data arriving as one long line rather than many. Since I completed the exercise, it appears others have had the same problem, which is somewhat reassuring. Again, my brother and sarahmcole helped me write a python script to break the file into many lines so that it could become a csv.

The good thing about all of this is that it forced me to pay real attention to what I was typing into the command line and to start understanding the sequences of commands better. I now know it is important to check which repo you are in before adding and committing files. And as Dr. Graham pointed out, I now have first-hand knowledge of the importance of version control.

For a much more detailed look at what commands I entered, see my command line files (Exercise 1, Exercise 4, Exercise 5) and my fail-log.

Librarians are heroes!

Librarians are often considered nerdy, anti-social, not much fun, and slightly quirky. These stereotypes may be true of some librarians, but they obscure how smart and technologically sophisticated librarians are. Indeed, while they often look like they are simply typing in key words, it is important not to forget that they have also sorted those keywords, attached them to documents through the library database, and indexed many different types of information for our benefit. The “background” work they have done, and their knowledge of different ways to search and of which key words might help, is invaluable.

Let’s take the scenarios described in Cameron Blevins’ Topic Modeling Martha Ballard’s Diary as our example. Blevins provides a simple example, explaining that searching for “God” does not bring up words that effectively mean “God” but are different. He gives the example of “Author of all my Mercies” also meaning “God”, but in a more descriptive way. Sarahmcole and I had a conversation about this notion, in which sarahmcole agreed with the author about the limited nature of the search function; I agreed with them both, but asked how we are supposed to know our search is missing vital key words if we rely solely on “big data” and data mining. The problem is even harder when cultural differences are involved, such that “God” means something else or is described differently.

I suspect this is where people trained in this task come in: librarians. This is especially true at a university or archive, where librarians are more likely to be specialists or experts in their particular field. Having a greater knowledge of the subject area than we do, they can suggest alternative key terms to broaden our search.

However, before the librarian is able to run that deceptively simple search, they have made the material searchable. From the brief exercises we have done on finding, extracting, mining, comparing, and labelling data, that is clearly not an easy task. When presented with a copy of A Midwife’s Tale, how is the librarian expected to know the contents, put a bibliographic entry together, connect it to search terms, and make it available? It is surely not possible that each document or piece of information destined for the digital world is carefully read and annotated. And yet I suspect this must be the case, unless all key words are generated by digital frequency analysis, in which case the potential for inaccurate and unrealised patterns and discoveries is great. Rachel_johnson alludes to these difficulties in her annotation.

In Academic Journals: The Most Profitable Obsolete Technology in History, the author explains that making and publishing academic journals is hard. I agree, but suggest that as hard as making a journal is, compiling any body of data and making it searchable is harder. Let’s think this through for a minute, taking our exercises and the discussions about GeoCities as our starting point. The old newspapers were originally print. Someone had to scan them and ensure they were readable by OCR. This takes time and technological skill. Then key words had to be identified and a bibliographic entry made for each document. Even well-done OCR may need to be partially retyped, and all the data has to be linked to similar data so that associations and searches can usefully be made. In the case of public work like GeoCities, for it to become available for searching, the process most likely looks like something Ian Milligan and others in our readings have discussed. Even when each web page is archived, someone has to make it searchable. This can be done through a complicated process that I do not fully understand, but sort of attempted (I think) in this week’s assignments.

Assuming the librarian has read all of A Midwife’s Tale, issues of bias present themselves, both in constructing and identifying key words and in the assistance provided to the researcher. Sarahmcole and Csamuelson consider this, with Csamuelson suggesting an “open source” approach to counter the bias. Might this be a good direction for librarians as well? Would it be helpful (and lessen the workload and responsibility) for librarians to share their initial work beforehand? The idea is certainly interesting and fits well with the readings and discussions we have been having recently.

Further questions to consider include the context, time period, and cultural background of the librarian, and what effect these might have on the way a source is viewed. Might a librarian be more conscious of war-themed topics during a time of war? Does an African-American librarian see a text the same way a librarian of European descent sees it? Twenty years after the data was inputted, might a previously glanced-over topic become relevant? Do we need to revisit old search programs regularly and update them? What about political correctness? If data mining and word-frequency analysis of a 1920s US novel identify words we now consider unacceptable, does the librarian include them? Is there a difference between information made available for historians, where identifying such words may be valuable, and the use of such terms in a public library catalogue?

Finally, how is it possible for librarians to accurately and effectively catalogue the billions of entries the public would like to put on the web? Is it reasonable to expect librarians to verify, separate, and log every key word, create unique programs to process and store that information, and spit it out in a neat structure the public can understand? Remember, people are not born knowing how to use various databases; they have to be taught on user-friendly programs. And these thoughts are based solely on relatively simple search-engine work. The more specialised the material gets, the more work is involved and the more support the librarian needs to provide to users.

So, the next time the librarian tells you to “pipe it down a little”, be considerate. After all, he or she may be poring over 100 years of a small-town newspaper on microfiche, trying to decide how to digitise it, how to categorise it, and how to make it easy for you to type in “hospital” and “1919” and discover something truly groundbreaking, or, in the case of big data, something that illustrates, confirms, or challenges what you already know.