Along the Hudson River in Manhattan (at least as of 2018), an awkwardly cut stump looks like a chair. Upon closer inspection, though, it turns out the tree had grown over some old iron fencing, leaving a stump impenetrable to chainsaw operators.
Hello, new subscribers to the newsletter! This is an uncharacteristically long post; that only happens when the muses show up with a bucket of ideas, which isn’t often.
TL;DR: Some people perceive data cleaning as a menial job that is somehow “below” the attractive “real” data science work. B.S., I say.
Cleaning data imposes values, judgments, and interpretations on it so that downstream analytic algorithms can work and produce results. That is itself a form of data analysis. In truth, “cleaning” is merely a set of repeatable data transformations on the way to a complete data analysis.
The steps we need to take to clean data follow more readily once we accept that framing. We want our analysis code to execute, to control unnecessary variation, to eliminate bias, and to document the work for others to use, all in the service of the future analyses we wish to run.
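To make “a set of repeatable data transformations” concrete, here’s a minimal sketch in Python. The field names and rules are hypothetical, invented for illustration; the point is that each cleaning judgment lives in a small, documented function, and the whole pipeline can be rerun identically on tomorrow’s data:

```python
# A minimal sketch: cleaning as an explicit, repeatable pipeline.
# The field name "text" and the rules below are hypothetical examples.

def strip_whitespace(record):
    """Trim stray spaces so equal values actually compare equal."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def drop_empty_text(records):
    """Remove records with no usable content -- a judgment call,
    and one that is now documented in code."""
    return [r for r in records if r.get("text")]

def clean(records):
    """The whole pipeline in one place, so every rerun applies the
    same transformations in the same order."""
    return drop_empty_text([strip_whitespace(r) for r in records])

raw = [{"text": "  a real observation "}, {"text": "   "}]
print(clean(raw))  # [{'text': 'a real observation'}]
```

Because the judgments live in code rather than in one-off spreadsheet edits, they are inspectable, testable, and repeatable, which is exactly what makes them part of the analysis.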
Existing Data Cleaning documentation is often ineffective.
Let’s begin by outlining the problem with current writing on “Data Cleaning.”
According to Wikipedia’s article on data cleansing, validity, accuracy, completeness, consistency, and uniformity are all significant criteria of data quality. It also has a “process” section that is dry and academic (in a bad sense) and will not help you clean any data.
Next, I’m just going to pick a few posts from the top Google results for “data cleaning.” I’ll include links as examples so you can see what I’m talking about.
This one, which has a high PageRank, reads like a friendlier expansion of the Wikipedia article above. Fortunately, it redeems itself in the process section by presenting a long list of data cleaning strategies, such as cleaning up spaces and eliminating irrelevant values. There are even some examples and images included!
This is a strange blog article from a data product website I’ve never heard of. It’s short and mostly says, “Data quality is critical; you must monitor it regularly.” They’re not entirely wrong, but they’re also not very helpful. Then they show you how to clean, verify, and monitor your data using a process loop… I assume because they’re trying to sell you on some of the product’s features.
Tableau then joins in on the repetition game, explaining why clean data is vital, outlining a similar checklist of activities, and concluding with those data quality criteria! Not terrible, but eerily similar to the rest of the entries so far.
This one comes from a website dedicated to data science education. It says things like “Good data > Fancier algorithms,” and it has cartoon robots (which I enjoy), so that’s a bonus. Then it’s on to the checklists: removing bad observations, correcting errors, filling in missing values, and so on. I can see the sales pitch at work here, but the “how” is still a mystery to me.
The brain’s pattern-spotting ability detects patterns!
Take the concepts “Data quality is vital,” “Audit data, detect mistakes, fix them, monitor, repeat,” and “these are the criteria of good data quality,” and turn them into a top-ranked Data Cleaning post. Include a few examples. Do it in 1,000 words or less.
Make money! (Literally, through ads and product sales)
My main complaint is that these are all superficial checklists that say things like, “Go find the bad stuff and clean it! Make use of these tools and strategies. Easy!” We wouldn’t spend so much time on cleaning if it were so simple. You’ll often be given a list of attributes good data has, then told to check whether your data has them and, if not, to devise a method of imputing those properties. Meanwhile, every reader is asking themselves, “How?”
Note: If you search for “data cleansing theory,” you’ll find some better talks.
Additionally, there are efforts to automate data cleaning (usually with the help of AI, because obviously AI makes everything better). Part of the motivation for these endeavors is that there is too much data for humans to examine. But these products and features aim to reduce the tedium of data cleaning without anyone ever looking at the data, which I find exceedingly dubious. Cleaning should be automated, yes, but it should also be done by hand.
Why do we treat cleaning like laundry when it’s a complex and intricate task? We don’t provide checklists for data analysis, do we? (Oh no, evidently some people do…)
Data cleansing is a type of analysis.
Until this post, I had never given data cleaning much thought. But when I examined it more closely, it became clear that we’ve given a nasty name to a tiresome and painful portion of the data analysis process (who wants to do CLEANING? We might as well call it “scooping the data kitty litter”). Then, through layers of abstraction and a lack of education, we’ve lost sight of how important it is.
We then relay horror stories and cite “concerned” studies showing that data cleaning consumes 80 percent, 60 percent, 40 percent, or whatever percentage of an expensive data scientist’s time. The statistic itself appears to be more a hazy indication of direction than concrete data. Here’s a more in-depth examination of that hazy statistic by Leigh Dodds.
Whatever the reality of the assertion is, the implication and call to action are clear: we’d hit Data Science Productivity NIRVANA if we made the cleaning process easier, faster, and automated. CEOs could reduce staff and compensation costs. Each employee could perform simple analyses on their own from a user-friendly interface. Data scientists would focus on the most interesting challenges and employ the most advanced algorithms. Wine goblets would overflow, and insights would fall from the sky like rain.
Aside from the hilarity, consider what data cleaning entails. Not in the mechanical sense of “removing errors to improve data quality,” nor in the end-goal sense of “greater data quality leads to better results.” Let’s talk about ontology for a moment.
We clean data because we believe there is a meaningful signal in it about a topic we care about. For example, I believe the number of complaints in my tweets increases as a deadline approaches. But there is plenty of noise and inaccuracy around my signal of interest: my shitposting volume isn’t associated with deadlines, and my time zone has shifted over time. Because I obtained the data in odd chunks, there may be gaps. I write in two languages. And because I tweet frequently, I can’t read and hand-code all of the tweets.
Cleaning is necessary because we want to separate the relevant signal from the noise, and we have determined that specific pieces of noise are “correctable” at the data point level for this purpose. As a result, we clean those parts up so they don’t interfere with our analysis.
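As a sketch of what “correctable at the data point level” might look like for the tweet example, here are two point-level corrections in Python: putting shifting time zones onto one common clock, and dropping records judged to carry no deadline signal. The records and the “shitpost” flag are invented for illustration, not from any real dataset:

```python
from datetime import datetime, timezone, timedelta

# Hypothetical tweet records; fields and values are invented examples.
tweets = [
    {"text": "ugh, this deadline",
     "ts": datetime(2021, 5, 1, 9, 0, tzinfo=timezone(timedelta(hours=-4)))},
    {"text": "lol nothing",
     "ts": datetime(2021, 5, 1, 20, 0, tzinfo=timezone(timedelta(hours=-7))),
     "shitpost": True},
]

def normalize_tz(tweet):
    """Point-level correction: my time zone shifted over time, so put
    every timestamp on a common (UTC) clock before analysis."""
    return {**tweet, "ts": tweet["ts"].astimezone(timezone.utc)}

def drop_noise(tweets):
    """Remove records we've judged irrelevant to the deadline signal --
    a judgment call, not a mechanical fix."""
    return [t for t in tweets if not t.get("shitpost")]

cleaned = drop_noise([normalize_tz(t) for t in tweets])
print([t["ts"].isoformat() for t in cleaned])  # ['2021-05-01T13:00:00+00:00']
```

Both steps encode an interpretation of the data (what counts as noise, which clock is canonical), which is precisely why cleaning is analysis rather than janitorial work.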