Digital Humanities - Data Cleaning

Cleaning data is an essential process for anyone working with data in their research. 

“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003). Data preparation is not just a first step, but must be repeated many times over the course of analysis as new problems come to light or new data is collected.”
Hadley Wickham, Tidy Data

Clean or tidy data is best understood in contrast to ‘messy’ data. Simply put, messy data contains inconsistencies or structural issues, such as duplicate records, empty values, and inconsistent spelling or formatting. For example, when the same value is entered in several different ways, its occurrences are split across the variants, which hides or misrepresents how often that value appears. If left unaddressed, these issues can lead to inaccuracies and misrepresentations in any subsequent data analysis, visualisation or research output.
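
To make the counting problem concrete, here is a minimal Python sketch. The place names are invented for illustration, and the normalisation rule (trim whitespace, ignore case) is an assumption; real data usually needs more careful rules.

```python
from collections import Counter

# Hypothetical sample: the same place entered three different ways.
raw_values = ["Dublin", "dublin ", "DUBLIN", "Cork"]

# Counting the raw strings treats each variant as a separate value.
raw_counts = Counter(raw_values)


def normalise(value: str) -> str:
    """Trim surrounding whitespace and ignore case differences."""
    return value.strip().casefold()


# Counting after normalisation groups the variants together.
clean_counts = Counter(normalise(v) for v in raw_values)

print(raw_counts["Dublin"])    # 1
print(clean_counts["dublin"])  # 3
```

In the raw counts, "Dublin" appears only once even though three of the four records refer to it.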

Some examples of data cleaning activities:

  • Standardising abbreviations or spellings
  • Cleaning data generated from OCR
  • Formatting addresses so they can be geocoded in bulk

Data cleaning can be time-consuming, but done properly, it produces a dataset that can be used easily with a variety of analytical tools. We can advise on tools and skills to speed up repetitive tasks and help ensure your data is clean, portable and reusable.


Managing your data effectively

Before you start cleaning your dataset, keep a few best practices in mind. They will make your data easier to manage:

  • Save your files in the UTF-8 encoding to ensure they maintain their integrity and are usable with a variety of tools
  • When possible, opt for human-readable file formats over proprietary, binary file formats (e.g. choose .csv over .xlsx for spreadsheets; choose .txt over .doc for text files). Contents of files in proprietary formats, such as those created by Microsoft Excel and Word, are less transparent and harder to manage over the years. Human-readable formats, like .csv, can be used directly by many tools and can be opened and manipulated with a simple text editor.
  • Back up your data often
  • Consider sharing your data on GitHub, publishing it on your scholarly blog, or sharing it on a research website. We also offer a workshop on Git and GitHub.
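
As an illustration of the first two points, the following Python sketch writes a small table to a .csv file with the UTF-8 encoding stated explicitly, then reads it back. The records, column names and file name are hypothetical, chosen only to include accented characters and a comma inside a value.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical records containing non-ASCII characters.
rows = [
    {"name": "Seán Ó Faoláin", "place": "Corcaigh"},
    {"name": "Brontë, Charlotte", "place": "Haworth"},
]

# Hypothetical output location; a temporary directory keeps the sketch self-contained.
out_path = Path(tempfile.mkdtemp()) / "people.csv"

# encoding="utf-8" states the encoding explicitly;
# newline="" lets the csv module control line endings.
with out_path.open("w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "place"])
    writer.writeheader()
    writer.writerows(rows)

# Reading back with the same encoding returns the original text unchanged.
with out_path.open("r", encoding="utf-8", newline="") as f:
    restored = list(csv.DictReader(f))

print(restored == rows)
```

Stating the encoding when both writing and reading avoids depending on the operating system's default, which can differ between machines.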

Cleaning Data with OpenRefine Workshop

We offer training and support in data cleaning skills using OpenRefine, an open-source tool that allows you to (a) find out whether your dataset is messy and (b) take steps to clean your data prior to analysis or visualisation. It also enables you to carry out powerful bulk operations on your data without having to learn programming syntax or commands.

Our workshop covers how to:

  • Get an overview of a dataset
  • Find and resolve inconsistencies
  • Split data up into more granular parts
  • Enhance a dataset with data from other sources

The workshop is hands-on: you work through and learn the main functions of OpenRefine, such as:

  • Facets and Filters
  • Clusters and Transforms 
  • GREL and Regular Expressions

No prior knowledge of OpenRefine is assumed.