DH Lab 4: Data doesn’t speak for itself

This semester I am taking a Digital Humanities course designed and taught by Dr. Jeri Wieringa. Part of this class includes writing blog posts about various topics discussed in class. I have already crafted a few posts (one on accessibility in DH and another assessing and critiquing a DH project) and there will be several more to follow.

Last class, we read and discussed Hadley Wickham’s “Tidy Data” as a way to re-evaluate the options for organizing and presenting data. For homework, we were tasked with tidying a table from the Pew Research Center on the frequency of prayer. Below is the original table:

A table designed and presented by the Pew Research Center that demonstrates an ‘untidy’ organization of data.

According to Wickham’s argument, a table should be made of columns and rows. Each column should hold a single variable, each row should hold a single observation, and the cells are filled with the values that represent the recorded data. Based on Hadley Wickham’s criteria, this Pew research presentation is a bit untidy. What is being described is the percentage of people within various religious traditions who pray at a given frequency. The frequency of prayer is divided into categories (‘At least daily’, ‘weekly’, ‘monthly’, ‘seldom/never’, ‘don’t know’). These categories are values of one variable, the frequency of prayer, rather than variables in their own right, and as such they should exist in rows, not in column headers. The column headers should represent the variables being measured.

Below is my attempt at tidying the Pew table. The ‘Frequency of Prayer’ categories have been copied into repeating rows, organized according to religious tradition. The percentage of individuals who report each frequency of prayer appears to the right of each religious tradition. I expanded on the Pew data to help organize my thoughts. I converted the percentages to decimals and then multiplied them by the sample size of each religion to estimate the number of individuals in that group's sample who identified with a particular category.
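For readers who want to see this reshaping as code, here is a minimal sketch in Python. The traditions, sample sizes, and percentages are made-up placeholders, not the Pew figures, and the function name `tidy` and the column names are my own assumptions for illustration.

```python
# Hypothetical numbers for illustration only; these are NOT the Pew figures.
FREQUENCIES = ["At least daily", "Weekly", "Monthly", "Seldom/never", "Don't know"]

# "Untidy" layout: one row per tradition, one column per frequency category.
wide_rows = [
    {"tradition": "Tradition A", "sample_size": 200,
     "At least daily": 50, "Weekly": 25, "Monthly": 10,
     "Seldom/never": 14, "Don't know": 1},
    {"tradition": "Tradition B", "sample_size": 150,
     "At least daily": 40, "Weekly": 30, "Monthly": 15,
     "Seldom/never": 13, "Don't know": 2},
]


def tidy(rows):
    """Return one row per (tradition, frequency) pair."""
    tidy_rows = []
    for row in rows:
        for frequency in FREQUENCIES:
            percent = row[frequency]
            tidy_rows.append({
                "tradition": row["tradition"],
                "frequency": frequency,
                "percent": percent,
                # Percentage as a fraction of the group's sample size.
                "estimated_count": row["sample_size"] * percent / 100,
            })
    return tidy_rows


tidy_rows = tidy(wide_rows)
for r in tidy_rows:
    print(r["tradition"], "|", r["frequency"], "|", r["percent"], "|", r["estimated_count"])
```

Two traditions with five categories each become ten rows, which is why the tidy version feels so much longer than the original.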

In tidying this data, I may have made it more complex for viewers. Wickham’s tidying functions best for computational statistics; it’s not so easy on the eyes for non-computers.

Re-sorting the data in this way led me to several conclusions. First, the tidy table is a lot to take in. Wickham’s recommendations seem to work best for computational statistics, not so much for the human mind. I found myself creating another table just to sort out the basics of the information. This table, like any method of presenting data, was also flawed, though still useful for my needs. The horizontal layout satisfied me, likely because English readers move their eyes from left to right when deciphering texts. While I cannot explain the technicalities of it, it turns out that vertical, repetitive data works better for computer brains (as I learned in class last week).

The second conclusion I drew from tidying the Pew table was that data is not self-evident. As a Master’s student in Religious Studies, I know that nothing explains itself just by existing, but this exercise really solidified that concept. If you look at the tidy table, you can see that I calculated the number of individuals for each frequency in each religious group, rounded them to whole numbers, and then added those together.

I realized in this process that I had to choose which numbers to round up and which to round down. At first, I made an active effort to match the rounded sample total to the original sample total given in the Pew data, but I noticed that matching the two totals was difficult, with or without rounding. I also realized that in my choice to round one number up but not the others I was actively changing the data. People obviously don’t exist in halves or quarters, but the way the Pew data adds up makes it appear that way. My decision to add an extra ‘person’ to the “at least daily” category for Buddhists could sway the conclusions of the data.

I ended up rounding every number up or down based on the traditional method: 0.5 and higher gets rounded up, and anything below 0.5 gets rounded down. The results, naturally, did not add up, but they made me feel a smidge more honest.
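The mismatch is easy to reproduce. Here is a small sketch in Python, again using invented numbers rather than the Pew figures; the helper `round_half_up` is a name I made up for the "0.5 and higher goes up" rule.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value):
    """Round to a whole number, sending exact halves upward."""
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Hypothetical group: NOT the Pew figures.
sample_size = 150
percents = [40, 30, 15, 13, 2]  # adds up to 100

# Estimated people per category: 60.0, 45.0, 22.5, 19.5, 3.0
estimated = [sample_size * p / 100 for p in percents]
rounded = [round_half_up(x) for x in estimated]

print(rounded)               # whole "people" per category
print(sum(rounded))          # total after rounding
print(sum(rounded) - sample_size)  # how far the total drifted
```

With these invented numbers, two categories land exactly on a half, both get rounded up, and the rounded total comes out one person larger than the sample. One caution if you try this yourself: Python's built-in `round` sends exact halves to the nearest even number, so it would give a different answer for 22.5 than the rule I used by hand.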

In the end, data is never about data. It can be clean and tidy or messy and untidy, but that all depends on who is calling the shots. Wickham made his decisions based on computational statistics; I made my organizational decisions to clarify the thoughts in my head. Neither is right or wrong, only more or less useful for each of us in that particular moment. The question to ask, as always, is what is accomplished in presenting data in one way rather than another?
