Data Cleaning

I have not really had any experience with data cleaning in other classes, mainly because I’ve worked with comparatively small amounts of written and published data. Anything remotely like what we are doing in our digital history class was not even on my radar until very recently. I picked a topic, found what felt like a reasonable number of documents to support it (say fifteen or so), and wrote the paper. The closest thing to data cleaning in my history classes probably had more to do with my less-than-stellar note-taking skills and the need to figure out exactly what I had written.

That being said, my current co-op placement is with the federal government and involves large-scale analysis of data entered into a statistical spreadsheet program. I have been involved in cleaning up that data, and the work is similar to what we have been doing this week. It involves making sure the agreed-upon codes are applied consistently (e.g. missing values encoded as 9999 rather than 999), using search-and-replace functions, and manually inspecting the data before turning it into meaningful charts. We discuss this work in detail while we’re doing it. We compare methodology, save and share our syntax, double-check each other’s work for inaccuracies, and work in a conscious and systematic way. This is necessary because if I do something differently from my colleague, we cannot be sure that our values are both “accurate” and comparable. In the world of statistical analysis, how and why data is cleaned up matters, is easy to collaborate on, and is replicable.
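To give a concrete sense of what that kind of recoding looks like, here is a minimal sketch in Python with pandas; the file and column names are invented for illustration and are not my actual workplace data.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("survey_responses.csv")

# Make the agreed-upon missing-value code consistent: some entries
# were typed as 999 where the team's convention is 9999.
for col in ["age", "income"]:
    df[col] = df[col].replace(999, 9999)

# A simple search-and-replace fix on a text column.
df["province"] = df["province"].str.replace("Ont.", "Ontario", regex=False)

# Manual inspection before charting: eyeball a sample and the value counts.
print(df.sample(10))
print(df["province"].value_counts())

df.to_csv("survey_responses_cleaned.csv", index=False)
```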

I suspect that the lack of close collaboration among historians, and the unique nature of each research project, hinder the discussion of data cleaning. For those who do it, it is seen as a necessary step before the “real” research begins. This is also why, as Ian Milligan notes in his post on online newspapers, historians have not embraced critical analysis of their database usage. While Ian argues for its inclusion, I admit that I too was unaware of these methodological issues because database searching is treated as “just the way it needs to work”. It has gotten to the point where we would be lost if the search programs all broke down and we had to figure out how to find what we didn’t know we needed without them. Of course this should be included in our scholarly methodology, but because it is considered a tool rather than part of the research or findings, it is not.

Part of the problem also stems from the solitary nature of historical research and the sense of possessiveness historians feel over it, partly because of the work they have done to find the materials, make them meaningful, and then craft an argument. Just as in our previous discussion about open data and sharing our research, I think historians are afraid that people will unfairly take their data cleanup methods. This is especially true at a time when some historians are still very new to digital history and are much less proficient with the programs available.

However, while these reasons make sense and seem natural, they could be negatively affecting the type and quality of research we put out. This is especially true for those new to digital history and unfamiliar with the additional biases and considerations it involves. As we saw in our use of Google Refine, we can group things, remove things, change names, and do much more. Some changes are obvious and make sense, such as merging entries for “John Smith” with “Jon Smith”. Other changes are riskier, especially when dealing with poorly rendered OCR. We might have chosen to merge files whose dates place them in 1884 into one new file and, in doing so, included a group of poorly rendered “18J4”s. That decision could have a significant impact, and if it goes undiscussed it can lead to different conclusions than if the garbled dates had been left out.
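To make the difference between the safe and the risky merge concrete, here is a minimal sketch in Python with pandas; the file, columns, and values are hypothetical, drawn from the examples above rather than from a real dataset.

```python
import pandas as pd

# Hypothetical data echoing the examples above.
df = pd.read_csv("letters.csv")

# The safe, obvious merge: variant spellings of the same person.
df["sender"] = df["sender"].replace({"Jon Smith": "John Smith"})

# The riskier merge: treating OCR-garbled "18J4" as 1884. Whether this
# is justified is exactly the kind of decision worth recording, because
# it silently changes how many letters fall into that year.
df["year"] = df["year"].replace({"18J4": "1884"})
```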

As well, the ordinary considerations regarding personal and source bias apply. What data was chosen for cleaning, how it was cleaned, why it was cleaned in that manner, and what effect that had on the end results are all important questions to ask.

In addition, if the original data is not kept separately (or backed up at various stages), different layers of analysis can be lost. In the data cleanup we did, we turned the correspondence into a CSV spreadsheet. This is good for some work, but it doesn’t give the full picture or allow for easy work on other questions. It didn’t allow for searching the text of the letters, though it did provide the sender, recipient, and date. If we organised the data to show the frequency of letter writers, we would not see which years were written in the most. Multiple layers would have to be investigated, making it a different kind of source, but no less challenging or critical to work with.
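Here is a minimal sketch of what I mean by layers, again in Python with pandas; the column names (sender, recipient, date) match what our CSV provided, but the file name is invented.

```python
import pandas as pd

# Assumed columns: sender, recipient, date. The file name is invented.
df = pd.read_csv("letters.csv")
df["year"] = pd.to_datetime(df["date"], errors="coerce").dt.year

# One layer: who wrote most often. The year detail disappears here.
letters_per_sender = df.groupby("sender").size().sort_values(ascending=False)

# Another layer: which years saw the most letters. It has to be asked
# separately, or by keeping both dimensions together.
letters_per_year = df.groupby("year").size()
letters_per_sender_and_year = df.groupby(["sender", "year"]).size()
```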

I wouldn’t say that failing to disclose full data cleaning methods makes an argument weaker or less relevant, any more than failing to disclose which traditional sources you did or didn’t use, and why, would. It is definitely best practice, and it could make the argument stronger, but the typical reader of a monograph also relies on the author to have been thorough in their research practices.

However, in terms of replicability and history as a science, following generally agreed-upon rules and conventions, such as TEI, is necessary. Besides sparing you from inventing new ways of doing things, standard conventions make it easier (via open-access research) for others to replicate, verify, and lend credibility to your findings. As my anecdote suggests, not following conventions and not carefully recording your steps and concerns is like leaving no notes on complex statistical syntax. Ultimately, as sarahmcole notes, “better digital methods make for better scholarship, period.”
