Final Project Work

This week I have been working on my final project. It has been an interesting journey, and the project has changed a great deal since I envisioned it last Sunday. I originally planned to use several years’ worth of data (1897-1902) to do some topic modelling and see if there was an increase in French/English conflict, as I would have suspected based on the divide created by the conscription crisis during the Boer War. I especially wanted to see what an English paper within Quebec had to say on the subject.

I first had trouble downloading the files because the command I used needed more modification than I had given it. I had further trouble with my newly created DH Box account, so I just downloaded the .txt files onto my computer using the modified command: wget -r --no-parent -w 2 --limit-rate=20k http://collections.banq.qc.ca:8008/jrn03/equity/src/1897/ -A .txt

I was interested in using python to clean up the messy OCR, so I pulled up that tutorial and started working off the python script, modifying it as I thought appropriate. I kept receiving error messages that basically all looked like this: clare@clare-fun1:~/School/Final_EquityProject$ python PthyonScript.py
File "PthyonScript.py", line 19
nodash = re.sub('.(-+)', ',', line)
^ IndentationError: expected an indented block


My brother explained that the indents weren’t properly aligned. I tried to re-align everything and ran it again, but it deleted 1700 lines of text, leaving me with 14. I learned how precise python has to be and got a rough understanding of how it should work.
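In hindsight, the shape of the fix is easy to show. Below is a minimal sketch (not the tutorial’s actual script) of the kind of cleaning loop involved, with the body of the for loop indented consistently, which is exactly what Python was complaining about:

```python
import re

def clean_lines(lines):
    """Apply the dash substitution from the script to every line."""
    cleaned = []
    for line in lines:
        # the substitution from the error message above: a character
        # followed by one or more dashes becomes a comma
        nodash = re.sub('.(-+)', ',', line)
        cleaned.append(nodash)
    return cleaned

print(clean_lines(['foo--bar']))
```

Every statement inside the `for` has to sit at the same indentation level; mixing levels (or tabs with spaces) is what produces `IndentationError: expected an indented block`.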

I then tried to improve the readability of the Equity papers by using RStudio to remove poorly OCR’d symbols. I ran into problems because even though I was using a csv format, my file didn’t really make sense: it didn’t have useful headings or easily separable fields.

I decided on marking up one Equity paper using TEI. I am modifying the class tutorial as I go.  I started by making the <p> and </p> tags where I thought they should go. This was complicated by the fact that the OCR hadn’t respected the newspaper column lines and the lines were intermixed and often did not make very much sense. I had to make methodological and ethical decisions about where to put the breaks. I discuss this more in my ongoing final project notes and fail-log.

Next I had to use a program to properly align the <p> </p> tags and first encountered many errors of incorrectly closed tags and such that I had to fix. Then my program wouldn’t do the aligning, so my brother quickly ran it through his text editor.

I tried to use Regex to identify and remove useless symbols that were causing problems with the computer’s recognition. I couldn’t get the expression to pick up just what I wanted in some instances and not others, so I used the find function and replaced them that way, checking the context to make sure each switch wasn’t a problem. I made a note of the substitutions in my fail-log and discussed in more depth the reasoning and ethical debates I wanted future users to be aware of.
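As a sketch of what the regex approach would have looked like in Python, here is a minimal example; the symbol set is illustrative only, not my actual substitution list (that lives in the fail-log):

```python
import re

# Illustrative set of junk characters that OCR sometimes inserts.
# The apostrophe and hyphen are deliberately NOT in this set, because
# legitimate words need them.
JUNK = re.compile(r'[\^~|\\_`]+')

def strip_junk(text):
    """Remove stray OCR symbols while leaving real punctuation alone."""
    return JUNK.sub('', text)

print(strip_junk('Shaw^villle Equi|ty'))
```

The danger, as discussed above, is that any character class broad enough to catch all the noise starts catching meaningful text too, which is why I fell back on find-and-replace with manual context checks.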

Closer to the end of the week, I began the actual encoding of people and the other categories I identified through my familiarity with the text from skimming. I began with the people and, as of today, have gotten through the first 200 lines. I didn’t realise just how long it would take, so I may just do a really good job on the first 400 lines and use it as a proof-of-concept.

I used the formula below to encode the first name (Cation Thornloe, which appears to be a poorly rendered Captain Thornloe):
<p> <CationThornloe <key="Thornloe, Cation" from="?" to="?" role="Bishop" ref="none"> </persName>

I received a parse error saying it was improperly formed.
I examined the format and noted that there was an extra ">" before the end tag </persName>. Even though it appeared to follow the format laid out in the template, I modified it to close out the tag:
<p> <CationThornloe/> <key="Thornloe, Cation" from="?" to="?" role="Bishop" ref="none"> </persName>
It then returned a parse error again, saying it was poorly formed.

In short, I was confused and hadn’t made sure the stylesheet was referenced in the .xml file.
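For future reference, here is a sketch of what a well-formed version of that element might look like, with the attributes sitting inside the persName tag itself and the name as its text content. The attribute names follow my template; Python’s built-in XML parser makes a handy well-formedness check:

```python
import xml.etree.ElementTree as ET

# A sketch of well-formed markup: one persName tag that opens, carries
# its attributes, wraps the name, and closes. Attribute values are the
# ones from my template above.
snippet = (
    '<p><persName key="Thornloe, Cation" from="?" to="?" '
    'role="Bishop" ref="none">Cation Thornloe</persName></p>'
)

# ET.fromstring raises ParseError on badly formed XML, so a successful
# parse is a quick sanity check before previewing the whole file.
elem = ET.fromstring(snippet)
print(elem.find('persName').attrib['role'])
```

The key difference from my failed attempts is that the attributes are not a separate `<key=...>` tag of their own; they belong to the element whose close tag (`</persName>`) pairs with them.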

I encoded every name, looking up what I could about each person online. Some people were easier to find, like the lumber barons or bank board members. Others, like Miss Jennie who left for North Bay to take up dressmaking, were harder. In my fail-log, I discuss that the project would have been more complete if I had had time to properly confirm each person, and that in cases where lots of information was given, I had to choose what to include, based on the context and influenced (unavoidably) by me.

I ended the week by trying to preview all the work I did and receiving errors from sources with “unusual characters” (=, &). Once I resolved that, I hit this error, which I have not yet fixed:

**XML Parsing Error:** mismatched tag. Expected: </p>.
Location: file:///home/clare/School/Final_EquityProject/TEI_OCR_Tagging.xml
Line Number 829, Column 15: </body>
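While I haven’t solved it yet, one way to track errors like this down is to run the file through a parser that reports where the first problem is, fix it, and repeat. A minimal sketch (the sample XML here is invented, not my actual TEI_OCR_Tagging.xml):

```python
import xml.etree.ElementTree as ET

def first_error(xml_text):
    """Return None if the XML is well-formed, else the parser's message."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        # the message includes a line and column, like the browser's error
        return str(err)

# An unclosed <p> inside <body>, the same shape as my mismatched-tag error.
bad = '<body><p>an unclosed paragraph<p>another</p></body>'
print(first_error(bad))
```

Each run points at the first mismatch, so working through the file is tedious but mechanical: fix the reported line, re-run, repeat until the parser returns clean.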

Questions that drive my markup at this point include:

  1. How often are well-known people vs. regular people mentioned?
  2. How might this speak to the function or readership of the paper?

Discovering how to play with data

What I was trying to do
I chose to work through Gephi and RStudio for the two exercises this week.
As I understand it, Gephi allows you to see and manipulate data visually, while RStudio is more mathematical
and command-line based, allowing data to be pulled out by topic (which could then be fed into Gephi after it is cleaned up).

What I did
Gephi
I had trouble running the “force-directed layout” until I read through the rest of the instructions, and then it made sense.
*still having problems with my tendency to not read all the text before trying to do something*
**while working on the data, it looked messy and I was unsure what it was really showing me because it didn’t include the names in the “working part” – so it took me extra time to make sure that I was doing it right**
**The instructions are pretty easy to follow using this one layout – Force Atlas 2**
I did have some initial trouble understanding what Force Atlas was supposed to do and how it worked – again, simple issues
that I sorted out easily by referring to the Slack channel and running ideas past my brother.

RStudio
#Stuck
I have gotten to this point:
"mallet.top.words(topic.model, topic.words[7,])"
although I don’t really understand the technicalities of how I got here or what is going wrong.
RStudio returns this error:
Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :
java.lang.NullPointerException

I checked Slack for any ideas and posted a note asking for help. Prof. responded with what I feared: the file was empty.
Consulted my brother, who pointed out that I had accidentally skipped a step because I misunderstood the instructions to read “if you already did the above, ignore this step”, and that
by doing this, I deleted the contents of the file. We went back and redid that step, and were able to continue with the tutorial.

The way it manipulated the data reminded me a bit of Gephi. Not sure what I will do with it, but looking forward to trying it out tomorrow.

I also pulled the github repo for this week because I created (with help) a new one since I’m now back on DH Box. Then I synced it again and pushed it back.

Things that were hard
Gephi
**One thing I did not like about Gephi is that it appeared rather utilitarian and I had trouble intuitively navigating it – not user friendly**
-this was especially true for the *difficulty* I had in figuring out how to follow the instructions on filtering out the unconnected points – I kept deleting the wrong information while following the instructions until I changed the “0” to “1” – I found that the program-specific language wasn’t really clear or easy for beginners to use

RStudio
Probably my biggest challenge is still not completely understanding what I was doing and why it was or wasn’t
working (this often means that even for simple mistakes I need someone to help point out
the problem in what I am working on so I can see why it is right or wrong).
I found that for this RStudio especially, I did not understand the instructions. However, I have also found that as I begin to work
on my final project using the Equity files, I am more comfortable changing parts of the tutorials and
modifying them to use the files/commands/ etc I want, so I am hoping that I will have success with replicating
this tutorial.
I do know that my brain doesn’t specialize in this type of work and learning and that I probably will
bring forward some ideas and methods for future use, rather than radically switching the way I do research or use the computer.
There are things I am starting to do without the tutorials (like the command line stuff and markup), but
my grasp on the rest is still mostly tied to the tutorials.

Thoughts on where to go next
I am hoping that after doing some
OCR cleanup on my Equity files, I can use RStudio to pull out important themes and words to explore further
and I can use Gephi to visually graph it. My ideas and aims will probably change over the coming week as I work on it.

Final Project
I began working on the final project this week. I used wikipedia to decide what timeframe I wanted to focus on, based on
topics that might have been in the news during that time. I chose 1897-1902 (before and during the Boer War and its
potential impact on English/French Canadian relations due to conscription). I downloaded the files using a mixture of the
wget command from the workbook and the command suggested in the tutorial the workbook linked to.
I was proud of the way I did not give up, but added and switched things around until I had a command that did what I wanted it to:
wget -r --no-parent -w 2 --limit-rate=20k http://collections.banq.qc.ca:8008/jrn03/equity/src/1897/ -A .txt


At this point, I am most interested in all the different ways that research can be manipulated and displayed. I only touched on two ways, but may use different methods in my final project. I have been thinking more seriously and deliberately about what I do, how I do it, and why. This week, I was especially proud of myself for not only (mostly) correctly following the tutorials without significant assistance, but also beginning to feel comfortable modifying the command lines to create different tables that interested me.

I also started working on my final project and after some trial and error in working with the command line for wget and using critical thinking skills to modify previous examples, I got what I wanted downloaded and stored where I wanted it. Looking forward to the next steps.

Who benefits from this work?

Select one of the articles behind the links above (and/or in the exercises) to annotate, asking yourself, ‘who benefits from this? who is hurt from this?’. Make an entry in your blog on this theme.

I found Michelle Moravec’s blog post “Corpus Linguistics for Historians” most interesting. I appreciated how she explicitly set out the main reasons why she likes using corpus linguistics and then briefly explained what tools she used and why she chose them. She then gave visual examples of each of the tools and explained her thinking and use of them.

I found that this was a good example of how historians can practice principles of digital history, such as being “open source”, while also producing an educational and easy-to-follow document. I think that anyone interested in using any of the tools she identifies would benefit from her post. Her use of “in-process” visuals and her explanations of what is happening and why help the reader follow the text while also critically engaging with the tools and methods being presented. From this perspective, I do not see how people could be harmed.

However, the situation does get a bit more ambiguous when we explore the examples she provides and the way she uses the tools. By viewing files based on word frequency, she is assuming that frequently used words are more important. This potentially misses the few instances where a word appears rarely but carries more weight. Depending on the likelihood that this happened (which is really impossible to know), it is possible that a vital piece of her argument has been left out. This could affect how people use her work and continue forward. However, the fact that she is aware of how and why she focused on particular words or clusters (her methodology), and provides it in narrative form to the reader, is beneficial.

Again, her use of the cluster feature is troubling for the same reason. While it seems logical for people to assume frequency equates to importance, this is a particular assumption held by North Americans. We cannot be sure others hold it, nor that this assumption was meant to be held by those who created the program.

The fact that the author presents her paper as an “exploration” can be helpful or harmful, depending on how familiar the reader is with history, particularly digital history, and how dedicated they are to good practice and methodology (as explored in the previous module). For example, when the author says that she notes that the densest file is once again Stanton’s and that she “[continues] exploring”, she presents both the strengths and weaknesses of digital technology such as this.

One strength of this type of analysis is that it allows for rapid and easy manipulation of data and lets the user quickly see if an idea or theory they have had is feasible, based on how they manipulate data. For example, a csv file can produce many different types of charts and use multiple variables. A user can find an interesting trend and quickly follow it through, checking against the data or cross-referencing with another variable. The problem is part of the same advantage because it is the user who manipulates and changes the way the data is being viewed or compared against. Someone who is not familiar with the ethics and methodology of good digital history can easily work off their biases and produce potentially misleading data. This could harm their reputation and the scholarship at large.

Ultimately, as the author writes, these tools are useful because they allow for broader comparison of “patterns and shifts over time and space”. These practices can also be harmful if not done in conjunction with typical academic methods, a fact that the author also points out when she argues that these are good tools to use before beginning close readings.

In conclusion, I think the same tools can be both beneficial and harmful. It is not the tool that has a negative or positive value, but the people and methods using it. That is why this class is about learning how to use the tools responsibly and helpfully to contribute to history at large. The ease of working with vast quantities of data poses the same level of risk to the unsuspecting user as a casual scan of library shelves or the use of only one archival source. For those inclined to do poor history, these tools will allow them to continue, and for those concerned about methodology, these tools help broaden the data they can explore.

Making Headway

What I was trying to do
I was trying to download and “clean” a group of diplomatic correspondence from Texas that had gone
through OCR. I used Regex to clean it into a usable file and Google Refine to make it easy to
search and manipulate for later use.
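As a rough illustration of the kind of extraction involved, a few lines of Python can pull header-style lines out of OCR’d letters and drop the body text; the header format below is invented for illustration, not the actual layout of the Texas correspondence:

```python
import re

# Invented sample in the rough shape of OCR'd correspondence: a few
# header lines per letter, followed by body text we want to ignore.
letters = """FROM: S. Houston
DATE: 14 March 1842
Dear Sir, the body of the letter...
FROM: A. Jones
DATE: 2 April 1842
More body text here...
"""

# MULTILINE makes ^ and $ match at each line, so only the header
# lines are captured, as (field, value) pairs ready for a csv.
headers = re.findall(r'^(FROM|DATE): (.+)$', letters, flags=re.MULTILINE)
for field, value in headers:
    print(field, value)
```

The same idea, with patterns matched to the real documents, gets the sender and date into rows that a tool like Google Refine can then deduplicate and cluster.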

What I did
I finally had success with the command line and am able to navigate the simple functions in this
with ease! I made note of this success here:
“#Finally!
I’m finally getting to understand this!
**So Cool**
*Look what I can do
#Brief fail-log before actually creating the final one
I’m creating this markup file in nano in my terminal. This in itself is a positive step for me
because it means I found it, opened it, and will be able to “add” “commit” and “push” it to a safe place (github).
I found this module easier than the previous ones, which means I am learning something finally!
Now I’m actually looking forward to mucking around with the final project.

Used built-in Terminal – can see physical files on my computer – way easier
had some trouble at the beginning understanding the task and how to get started
*note: look through command line and find other notes to consolidate into one “fail-log”*
I made much more use of the Slack channel and what other people posted *review and credit people*
and some initial use of my brother to walk me through the regex concepts *still a bit fuzzy,
but the notes and place to play around are helping*
Other than simple typing errors and not reading closely enough (yes, we were warned),
things went pretty well.

**Ran it on whole file, not just sender on top**
**deleted letter contents – kept stuff at top, then completed the exercise, highlight**
*problem with reading and understanding – new language issues*
*Tried Palladio and it made a picture of connected words,
but it did not make sense to me and I couldn’t figure out how to use it properly – try again*

Had assistance from my brother. Jeffblackadar helped me identify why my OpenRefine file did not look right.
Dr. Graham further assisted by pointing out that the data probably still contained the letter contents
and suggested ways to fix it (somehow I missed or unsuccessfully completed that step).

My main “fails” this week were thinking that I had installed the Java and OpenRefine files (I did all the
command stuff, I thought), but I had only extracted the files and not actually run the correct install.
Ex. “git apt install Java” instead of “sudo …. (correct name)”.
As mentioned above, I also incorrectly deleted the letter contents from the file I was working on and
completed the steps without getting rid of the contents. This led to confusion and some help via Slack
from Jeffblackadar and the Prof. My brother and I went back to an old version (luckily had many backups)
and properly got rid of the contents and re-ran the tutorial.

Things that were hard
I still have trouble following the instructions when so many new terms and programs are installed.
I tend to try to follow them without knowing what I am doing, thus being unable to identify the mistakes
when they occur and causing more trouble down the line. Thinking before doing still needs to be applied better.
I did not figure out how to use the Palladio software.

Thoughts on where to go to next
I did go down and work with the command strings at the bottom of the tutorial after
thinking about it and trying to piece it together myself.
**this is the next step I will have to work on** – I expect using the testing website to
“play around” with regex a bit more will be useful.”
I also need to continue carefully reading the instructions before doing things – it will make the frustrations less later.


NB: This is a copy of the fail-log written straight to github. I realised I was duplicating my work and am trying to only do it once. Please let me know if this is unacceptable. Thanks

Data Cleaning

I have not really had any experience doing data cleaning in other classes, mainly because I’ve worked with comparatively small amounts of written and published data. Anything remotely like what we are doing in our digital history class was not even on my radar until very recently. I would pick a topic, find what felt like a reasonable number of documents to support it (say 15 or so), and write the paper. The closest thing to data cleaning in my history classes probably had more to do with my less-than-stellar note-taking skills and the need to discover exactly what I wrote.

That being said, my current coop placement is with the Federal government and it involves large-scale data analysis of information inputted into a statistical spreadsheet program. I have been involved in cleaning up that data.  This type of data cleanup is similar to what we have been doing this week. It involves making sure the agreed-upon variables are the same (e.g. 9999 encoded as a missing value rather than 999), using search and replace functions, and manually inspecting the data before beginning to turn it into meaningful charts.  We discuss this work in detail while we’re doing it. We compare methodology, save and share our syntax, double-check each other’s work for inaccuracies, and work in a conscious and systematic way. This is necessary because if I do something differently from my colleague, we will not be able to be sure that our values are both “accurate” and comparable. In the world of statistical analysis, how and why data is cleaned up matters, is easy to collaborate on, and is replicable.
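As a minimal sketch of that kind of recode (in plain Python for illustration; at work this happens inside a statistics package):

```python
# Different people have, at times, used different missing-value codes.
# Before analysis, everything is recoded to the agreed-upon sentinel,
# and the rule itself is shared so colleagues can replicate it.
MISSING_CODES = {999, 9999, -1}   # illustrative variants
CANONICAL_MISSING = 9999

def recode(values):
    """Map every known missing-value variant to the canonical code."""
    return [CANONICAL_MISSING if v in MISSING_CODES else v for v in values]

print(recode([42, 999, 7, -1]))
```

The one-line rule is trivial; the point of the paragraph above is that writing it down and agreeing on it is what makes two people’s cleaned data comparable.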

I suspect that the lack of close collaboration between historians, and the unique nature of each research project, hinder the discussion of data cleaning. For those who do it, it is seen as a necessary step before the “real” research begins. This is also why, as per Ian Milligan’s post on online newspapers, historians have not embraced critical analysis of their database usage. While Ian argues for its inclusion, I admit that I too was unaware of these issues surrounding methodology because it is “just the way it needs to work”. It has gotten to the point where we would be lost if the search programs all broke down and we had to figure out how to get what we didn’t know we needed without them. Of course it should be included in our scholarly methodology, but it isn’t considered part of the research or findings (only a tool), so it is not included.

Part of the problem is also based on the solitary nature and sense of possessiveness historians feel over their research, partly because of the work they have done to find the materials, make them meaningful, and then craft an argument. Just as in our previous discussion about open data and sharing our research, I think historians are afraid that people will unfairly take their data cleanup methods. This is especially true in a time when some historians are still very new to digital history and are much less proficient with the programs available.

However, while these reasons make sense and seem natural, they could be negatively impacting the type and quality of research we put out. This is especially true for those new to digital history and unfamiliar with the additional biases and considerations it involves. As we saw in our use of Google Refine, we can group things, remove things, change names, and do much more. Some changes are obvious and make sense, such as merging entries for “John Smith” with “Jon Smith”. Other changes might be riskier, especially when dealing with poorly rendered OCR. We might have chosen to merge files whose dates place them in 1884 into one new file and, in doing so, included a group of poorly rendered “18J4”s. This decision could significantly affect the results, and when it is not discussed, readers can reach different conclusions than they would have if they knew how the 18J4 entries had been handled.
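The risky merge itself is a one-liner, which is exactly why it needs documenting. A sketch, assuming a “J” in that position is a misread “8”:

```python
import re

def normalise_year(token):
    """Treat '18J4'-style tokens as OCR misreads of 18x4 years.

    Assumption (which should be logged, not silent): a 'J' in the tens
    position of an 18xx year is a misrecognised '8'.
    """
    return re.sub(r'\b18J(\d)\b', r'188\1', token)

print(normalise_year('18J4'))
```

Nothing about the code records that the assumption was made; that only happens if the researcher writes it down, which is the whole argument above.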

As well, the ordinary considerations regarding personal and source bias apply. What data was chosen for cleaning, how it was cleaned, why it was cleaned in that manner, and what the end results did are all important questions to ask.

In addition, if the original data was not kept separately (or backups at various stages), different layers of analysis could be lost. In the data cleanup we did, we turned it into a csv spreadsheet. This is good for some work, but doesn’t give the full picture or allow for easy work on other issues. It also didn’t allow for the searching of the text, but did provide the sender, recipient, and date. If we were to organise the data to indicate frequency of letter writers, we would not see which years are the most written in. Multiple layers would have to be investigated, making it a different source of data, but no less challenging or critical to work with.

I wouldn’t say that failure to disclose full data-cleaning methods makes an argument weaker or less relevant, any more than failing to disclose which traditional sources you did or didn’t use, and why, would. It is definitely best practice, and could make the argument stronger, but the typical reader of a monograph also relies on the author to have done a thorough job in their research practices.

However, in terms of replicability and history as a science, following the rules and conventions that are generally agreed on, such as TEI, is necessary. In addition to easing the burden of coming up with new ways of doing things and making it easier (via open access research) for others to replicate and confirm your findings, following standard conventions makes it easy for others to verify and lend credibility to your findings. As I relate in my anecdote, not following conventions and carefully recording your steps and concerns is like not leaving notes on complex statistical programs. Ultimately, as sarahmcole notes, “better digital methods make for better scholarship, period.”

Module 2: An Exercise in Frustration

Last week, it took more time than I thought it should to start understanding command line. This week I learned that while I seem to have clung to the idea of a few of the commands, I didn’t remember them in order, and I did not remember some of the commands at all. They say “practice makes perfect”, so I am fairly confident that by the end of this all, I will have mastered the command line.

Seriously though, this week I found that the instructions were relatively easy to follow, and while I am not yet sure what I would do with these tools myself, I understand the general purpose of how they worked. I really enjoyed searching through the databases and using the csv files of war dead to look for patterns by year, location, surname, etc. Within the search I did for Maier, I found that one entry was for a civilian woman killed in Jamaica. There are so many interesting things about this entry that I can see how the whole method would be useful in confirming theories (the Maiers were German and mainly male) through big data. I can also see how it can highlight interesting cases to focus deeply on (the civilian woman in Jamaica).

Unfortunately, my troubles with the command line continued, and I waged a war against incorrectly entered commands, non-existent commands, and out-of-order commands. I had trouble mostly with Exercises 1 and 5. In Exercise 1, I failed to check which repo I was saving files to and accidentally pushed them to the wrong repo on github. I tried to fix it and ended up deleting the file, causing panic and confusion while I tried to see if I could recover it from my DH command line. The solution was to pull the incorrect repository from github, extract the files, create a new repository (the key step I had forgotten), and re-push it. I had help from my brother, and sarahmcole kindly gave it a shot through her advice on Slack.

In Exercise 5, the majority of the problems now appear to have been caused by me not waiting long enough for the Twitter files to download, and by the results coming grouped as one line rather than many. Since I completed the exercise, it appears others have had the same problem, so that is somewhat reassuring. Again, my brother and sarahmcole assisted me in writing some python code to turn the file into many lines in order to create a csv.
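The fix looked something like the sketch below (the delimiter here is a guess for illustration; the real file may have differed):

```python
def split_records(blob, delimiter='}{'):
    """Split a single-line run of JSON-ish records back into one per line.

    Assumes records arrive back-to-back like '{...}{...}{...}'; the split
    consumes the braces at each boundary, so they are re-inserted.
    """
    parts = blob.split(delimiter)
    if len(parts) == 1:
        return parts  # nothing to split
    return ([parts[0] + '}']
            + ['{' + p + '}' for p in parts[1:-1]]
            + ['{' + parts[-1]])

records = split_records('{"id": 1}{"id": 2}{"id": 3}')
for r in records:
    print(r)
```

Once each record sits on its own line, the usual line-oriented tools (and the csv step in the exercise) work again.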

The good thing about all of this is that it forced me to really be aware of what I was inputting into the command line and start to understand the sequences of commands better. I now know it is important to check which repo you are in before adding and committing files. And as Dr. Graham pointed out, I now have first hand knowledge of the importance of version control.

For a much more detailed look at what commands I entered, see my command line files (Exercise 1, Exercise 4, Exercise 5) and my fail-log.

Librarians are heroes!

Librarians are often considered nerdy, anti-social, not much fun, and slightly quirky. These stereotypes may be true for some librarians, but they don’t accurately reflect how smart and technologically sophisticated librarians are. Indeed, while they often look like they are simply putting in key words, it is important not to forget they have also sorted those keywords, attached them to documents through the library database, and indexed many different types of information for our benefit. The “background” work they have done, and their knowledge of different ways to search and of which key words might be helpful, is invaluable.

Let’s take the scenarios described in Cameron Blevins’ Topic Modeling Martha Ballard’s Diary as our example. Blevins provides a simple example, explaining that searching for “God” does not bring up words that effectively mean “God” but are different. He gives the example of “Author of all my Mercies” also meaning “God”, but in a more descriptive way. Sarahmcole and I had a conversation about this notion, where sarahmcole agreed with the author about the limited nature of the search function and I agreed with them both, but asked how we are supposed to know our search is missing vital key words if we are relying solely on “big data” and data mining. The problem is even more difficult when cultural differences are involved, such that “God” means something else or is described differently.

I suspect this is where people trained in this task come in: Librarians. This is especially true at a university or archive where the librarians are more likely to be specialised or experts in their particular field. Having a greater knowledge of the subject area than ourselves, they can suggest alternative key terms to broaden our search.

However, before the librarian is able to run that deceptively simple search, they have made the material searchable. From the brief exercises we have done on finding, extracting, mining, comparing, and labelling data, it is clearly not an easy task. When presented with a copy of A Midwife’s Tale, how is the librarian expected to know the contents, put a bibliographic entry together, connect it to search terms, and make it available? It is surely not possible that each document or piece of information destined for the digital world is carefully read and annotated. And yet, I suspect this must be the case, unless all key words are generated by digital frequency analysis, in which case the potential for inaccurate and unrealised patterns and discoveries is great. Rachel_johnson alludes to these difficulties in her annotation.

In Academic Journals: The Most Profitable Obsolete Technology in History, the author explains that making and publishing academic journals is hard. I agree, but suggest that as hard as making the journal is, compiling data and making it searchable is harder. Let’s think this one through for a minute, taking our exercises and the discussions about GeoCities as our starting point. The old newspapers were originally print. Someone had to scan them and ensure they were readable by OCR. This takes time and technological skill. Then key words had to be identified and a bibliographic entry made for each document. Even well-done OCR may need to be partially retyped, and all the data has to be linked to similar data so that associations and searches can usefully be made. In the case of public work like GeoCities, for it to become available for searching, the process most likely looks like something Ian Milligan and others in our readings have discussed. Even when each web-page is archived, someone has to make it searchable. This can be done through a complicated process that I do not fully understand, but sort of attempted (I think) in this week’s assignments.

Assuming the librarian has read all of A Midwife’s Tale, issues around bias present themselves, both in constructing and identifying key words and in the assistance provided to the researcher. Sarahmcole and Csamuelson consider this, with Csamuelson suggesting an “open source” approach to counter the bias. Might this be a good place for librarians to go as well? Would it be helpful (and lessen the workload and responsibility) for librarians to share their initial work beforehand? The idea is certainly interesting and fits well with the readings and discussions we have been having recently.

Further questions to consider include the context, time period, and cultural background of the librarian, and the effect these might have on the way a source is viewed. Might a librarian be more conscious of war-themed topics during a time of war? Does an African-American librarian see a text the same way a librarian of European descent does? Twenty years after a text was entered, might a previously glanced-over topic become relevant? Do we need to regularly revisit old search programs and update them? And what about political correctness? If the data mining and word-frequency analysis of a 1920s US novel identifies words we now consider unacceptable, does the librarian include them? Is there a difference between information made available for historians, where the identification of such words may be valuable, and the use of such terms in a public library catalogue?

Finally, how is it possible for librarians to accurately and effectively catalogue the billions of entries that the public would like to put on the web? Is it reasonable to expect librarians to verify, separate, and log every keyword, create unique programs to process and store that information, and spit it out in a neat structure the public can understand? Remember, people are not born knowing how to use databases; they have to be taught, ideally on user-friendly programs. And these thoughts are based solely on relatively simple search-engine work. The more specialised the material gets, the more work is involved and the more support the librarian needs to provide to users.

So, the next time the librarian tells you to “pipe down a little”, be considerate. After all, he or she may be poring over 100 years of a small-town newspaper on microfiche, trying to decide how to digitise it, how to categorise it, and how to make it easy for you to type in “hospital” and “1919” and discover something truly groundbreaking, or, in the case of big data, something that illustrates, confirms, or challenges what you already know.

Open Access Research

One thing that really struck me about the way the writers presented their blogs and themselves was their openness. They were upfront about their topics, presenting their arguments in what was often a very easy-to-follow manner. I think part of this comes from an awareness of how relatively new the field of digital history is. The writers were also very open about their credentials and often included links to work they had published online.

For example, in Trevor Owens‘ piece Sunrise on Methodology and Radical Transparency of Sources in Historical Writing, he explains the shift from text-based work towards big data and online research tools. He highlights the benefits, from ideals of transparency and collaboration to simple but powerful improvements in access to sources and increases in scholarly accountability. He mirrors some of this accountability in his “Bio” section, where he describes the numerous roles he has held and currently holds. This not only establishes his credentials but also lays out his fields of expertise and his biases.

Other authors do something similar on their sites. W. Caleb McDaniel lists all his qualifications, along with the years and institutions at which he acquired them. He also links to the free online introduction to his book. By doing so, he promotes his open-source position and reinforces his argument about the benefits of open intellectual exchange and the value of hyperlinks.

McDaniel’s article raised many questions for my classmates and me. Primarily, these focused on concerns about intellectual property, copyright, and “stealing” ideas. Ktamg’s annotation does some work to clarify things, noting that the author is most likely alluding to version control rather than to modification of the original. Natalie214 raised a question about other researchers taking and modifying another’s work. Csamuelson began a discussion, in which I and several others participated, about “messing” with other people’s notes or work because it is all publicly available.

It is through these conversations that the class began to really wrestle with the dangers and opportunities of “open access research”. On the one hand, the benefits were well laid out by a variety of authors. Open access research can be verified and carefully nuanced through collaboration. It is not restricted to one person and has a much higher likelihood of being retrieved in a readable format than work stored by proprietary means. It allows remote access, avoids duplication of notes and sources, and fosters a closer-knit community of historians and the general public.

On the other hand, our experience as history students within the traditional institution has made us question its utility. While not explicitly taught, the emphasis on rigorous footnoting practices and carefully cited primary sources has instilled a fear of “stealing” from others or of being accused of less-than-honest behaviour. After all, we are all quite aware of what we don’t know and of what others know; that is why we do research and look through other historians’ work. The emphasis on doing primary research and compiling detailed research notes has embedded in us an understanding of the time, commitment, and effort involved in doing history, and we see the unfairness of simply taking that from others. We have been conditioned to do our own work and to value it as our contribution. Ian Milligan touches on these concerns in his blog post, and Kathleen Fitzpatrick criticises the way we were taught: to work in opposition to other historians by criticising their methods and arguments, unpacking their findings, and presenting parallel histories. Finally, Sheila Brennan’s honest discussion of the difficulties she had in producing her publication Stamping American Memory highlights the challenges that still exist in making digital history understandable to traditional publishers and fellow academics, while also exploring the practical challenges yet to be resolved.

So, when I consider the dangers and opportunities of open access research, I am pretty evenly torn right now. I see the dangers of stealing others’ work, of losing credibility through poorly-thought-out ideas or badly researched theories, of trying to get publishers and others on board, and of all the ethical and practical quirks still to be worked out. I also agree with those who are excited about more collaboration, safer data storage, increased access to data, and new opportunities. I guess for me, as a student at this point in my career, open access is a tool to add to the ones I already use. I would love to embrace aspects of it, like increased collaboration with colleagues, which seems especially useful for working internationally or across vast research topics. I like being able to access my work from any appropriate internet-enabled device, and I like having easier access to the thoughts and conversations happening around me. But I also like the physical work of reading old newspapers, of visiting people and buildings, and of taking the time to deeply immerse myself in a particular research question with all the complexities and context surrounding it. I am also a bit concerned about how applicable this approach is at the moment, especially as people are still using “old” methods and are “stuck” in the binary of “my research, my work, my findings”.

I think that for digital history as we have been discussing it to really catch on, we would need to introduce it into more history classes, work seriously on the logistical, ethical, and legal questions it raises, and wait until my generation (or the next) is ready to jump wholeheartedly into the field. Who knows, by then the field may have shifted again. So I say: it seems interesting. Tell me more, and let’s give it a try and see.

Learning Curves are Steep

This week I spent countless hours working through the readings and the exercises. I got about three-quarters of the way through Exercise 4 before calling it quits for the week. I experienced a steep learning curve, with feelings of accomplishment, failure, frustration, and gratitude.

First, full disclaimer: I have the added benefit of having an incredibly helpful younger brother who is in school for web design and other techie things. He was kind enough to look at my mistakes and point out where I went wrong so that I could fix them and move on.

Some of the major breakthroughs and successes I experienced this week were probably quite simple at one level, but they represented a step forward for me. I hope it is a process and that I will continue to build on what I started to understand this week. I was introduced to Markdown, and through the two tutorials I began to understand why it is important and how to use it. Significantly, I learned to use several symbols to differentiate the text, as I demonstrate in this document. However, I had trouble with this task, especially with adding images. Using the free search tool linked by the professor, I searched for “bookshelves” and chose the picture I wanted. I copied the link into my Markdown file, but it only showed up as an image icon and not the actual image. My brother examined my syntax and noticed that the link was incorrect (not a huge deal, but still). After I put in the correct link, the image displayed properly, and I felt like I understood the tool and its uses. This isn’t to say I don’t still get confused, but a quick reminder is all it takes to format **bold**, # headings, and so on.
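For my own future reference, the basic symbols look like this (the image URL below is just a placeholder, not the actual link I used):

```markdown
# A heading
**bold text**
![Bookshelves](https://example.com/bookshelves.jpg)
```

The image syntax was my stumbling block: if the part in parentheses is not a working link to the image file itself, the page shows only the broken-image icon.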

I was able to translate some of that knowledge and experience into Exercise Two, which focused on using the DHBox. Luckily, I was familiar with the concept of command lines and typing in strings of text because I use a Linux computer. However, I still made simple mistakes, such as typing the wrong word (“get” vs. “git”) or forgetting to “add” before “committing” a new file. For this reason, my fail log is most likely unnaturally long and includes many lines of useless commands and backtracking. See my fail log command lines here. I understand the concept, but would like more practice, which I’m sure I will get.
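The add-before-commit sequence I kept forgetting goes roughly like this (the repository and file names here are made up for illustration):

```shell
# Make a scratch repository to practise in
mkdir scratch-repo && cd scratch-repo
git init

# On a fresh machine, git needs to know who is committing
git config user.name "Clare"
git config user.email "clare@example.com"

# A new file must be staged with "add" before "commit" will record it
echo "Week 2 fails" > fail-log.md
git add fail-log.md
git commit -m "Add fail log"
```

Running `git commit` without the `git add` step is exactly the mistake I kept making: git simply reports there is nothing staged to commit.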

Exercise Three was fairly easy and made sense to me; I just haven’t gotten the hang of it yet. I plan on figuring out how to make sub-folders or something similar to separate the files by module, because right now I find it hard to navigate. Perhaps I just need to work on my naming conventions.

I found this week challenging, and it took up most of my after-work hours. I hope that as I get more familiar with the technology, I will be able to work more quickly and confidently. I also need to stop relying on my brother for some of my questions. As the professor is always pointing out, we have classmates to rely on and work with, and it is important to build those more professional relationships. It will also take some time to remember to do my notes and brainstorming in Markdown; I noticed that I still took notes in LibreOffice. I am looking forward to doing better next week and collaborating more.

Digital History as an Aid

In the past few years, and definitely throughout my degree program, my professors have done their best to argue that anthropology is an interesting, vital, helpful, ever-changing, and unique field. This makes sense: as an anthropology major, I have mostly had anthropology professors. Luckily, I have also had some who argued for an interdisciplinary approach, explaining that in order to fully understand a culture, we need to know about its past, its economic system, its language, and so on. This approach encouraged my love of history (hence my minor), and I strongly believe it roots the anthropology we do, making it more valuable and meaningful.

This is kind of the way I see digital history, and the way I understand the readings I did this week. Digital history is not separate from “academic” history, nor is it better than regular old university history. Rather, as Graham, Milligan, and Weingart explain, digital history, expressed through the idea of the macroscope and its application to “big data”, allows us to explore much larger amounts of information at once through a variety of different lenses. The example focusing on the Old Bailey records explains this well. Using various data-analysis and mapping programs that I do not yet understand, the data can be displayed and combined with almost countless other data sets to really broaden the picture of any particular criminal case. The posting on Historyonics, Big Data for Dead People, illustrates this further, as the author explains how, by focusing on the case of one woman within a big data set, he was able to gain a fuller understanding of all sorts of trends in the prison system, politics, publishing, women’s lives, and so much more.

However, classmates bethanypehora and sarahmcole rightly raise concerns about over-reliance on, and misinterpretation of, big data. They stress that big data is useful when used alongside micro-histories. JW Baker’s blog post about soft digital history and sarahmcole’s clarification of his argument support this position and speak to the way I am beginning to understand the usefulness of digital history.

I am excited to be taking this class. Originally, I was not quite sure what it would entail, but I chose it because it was a summer course that could accommodate my co-op placement by being online. Delving into the material this week, I see that the class will be much more technical than I had realised. Coming from a family of open-source Linux users, I should be more tech-savvy, but I am not. I am interested in learning, but my current experience ends with basic computer and internet literacy and a vague ability to use the terminal when prompted by more technologically gifted family members.

My interest in history ties heavily into my love of anthropology, and I argue that in order to do good anthropology, historical awareness is key. I enjoy cultural history focusing on people and their everyday lives, and I really enjoy documentaries such as Victorian Farm. I can see that an analysis of big data would be very useful in public history projects like these.

I am excited to be learning new ways to think about data and about the relationships between events, people, historical trends, and places. These emerging technologies, and the ways they question how academics operate, will help me become a more inquisitive and explorative anthropologist. I am sure they will open new doors and career opportunities as well.