Exploring a new dataset is exciting, but it’s important to understand the data before jumping in and starting to look around. It’s too easy to ignore basic data cleaning steps or make assumptions about what columns or specific values mean, or how complete the data is.

Below, we list a few steps that will help you make sure that your analysis is true to the data and useful to the people who might be making decisions based on what you’ve uncovered. At the very end, we collect all the steps into a convenient checklist, so you’ll be better equipped for your next data exploration and analysis project.

Before you explore: Know your data, its source and journey

This first issue is often overlooked, and it leads to many problems down the road: know your data! The danger here is making assumptions about the data and not checking them, or getting data without knowing the basics such as when it was last updated, whether it has been filtered, or what else has already been done to it.

First, make sure you know the data’s actual source. Is this coming straight from a database? Which database, exactly? How was it exported (if you’re not accessing the database directly)? When was it exported? Has the data been cleaned?

Even if you know the source, you should find out if the data has been processed in some way. Are you looking at a subset of the data or all of it? Has it been filtered? Have values been added to rows? Has somebody already cleaned it up (and if so, what exactly did they do)?

Depending on your data and how familiar you are with it and its context, a data dictionary can also be crucial. It tells you what the different columns contain and what the values in each column mean. This isn’t always straightforward, because some datasets encode special values using certain numbers. 9999 might mean a missing value, or all 99xx values might mean different things. Knowing what each column contains, and what the values in it mean, are important for the next steps in data checking, but also for analysis and modeling.

Look for inconsistencies and outliers when exploring data

Once you know where your data is coming from and what has been done to it, it’s time to look at the data values themselves. This is best done by opening up your data set in a table that lets you scroll, sort, and filter.

The first thing to look for are values that stick out when just scrolling through the data. Are there any values that seem wrong, is there text in numeric columns, or anything that just looks off? Maybe values that are unusually large, or negative values where there shouldn’t be any.

Next, sort by the different columns and look at the maximum and minimum values. Do they all seem reasonable? It is often difficult to set strict upper and lower limits for numerical values, so doing this by hand can be helpful. Some datasets also use specific numbers as codes for missing values, which makes this more challenging as well.

Extreme and unusual values can be data errors, but they can also be a first hint of outliers to look for in the actual data analysis. 

Find duplicate and missing values in your dataset

The next step is a closer look at your data values to check for problems beyond individual data points. That means both missing and duplicate values.

Duplicate values tend to be easier to spot by simply looking through your data when sorted by some field. You can usually notice the duplication of rows. You might have already spotted them in the previous step, in fact.

Missing values are trickier, since there are two different kinds. The more obvious are individual values within rows that are empty, null, or encoded as missing. These should be easy to find if you know how missing values are encoded. Perhaps they’re obvious like empty fields or NULLs, but they might also be some special, usually large number (or -1 for values that can otherwise only be positive).

Missing rows are more difficult to detect (see also the next point). If missing values inside of rows are known unknowns, these are the unknown unknowns. One way to look for them is by counting the values by some category, perhaps departments or product groups, and comparing those counts. Should they all have roughly similar numbers of records? Another way is to look for values per time period, keeping an eye out for days, weeks, or months with few or no records.

Watch out for incomplete time periods before analyzing data

Perhaps the most insidious problem in data analysis are incomplete time periods at the end of a dataset. If you’re looking at your data by month, for example, the current month might look worse than previous ones. But is that because your sales were lower, you had fewer clicks, etc., or simply because you don’t have a full month’s worth of data?

The difficulty with this is not only that the current period is usually the most interesting, but also that you might not know what granularity will be of interest to your users. On top of that, time periods don’t always line up in a convenient way, especially weeks and months.

Here, it becomes an issue of knowing your users and what they’re looking for. If the current time period has to be included, make yourself a note to point out that it’s incomplete, and present your findings in that context. If you can, it is usually better to exclude incomplete time periods to avoid confusion. However, in the reality of data analysis this is rarely possible or desirable.

Checklist

Here are the above steps for checking your data again, in convenient checklist form:

  • Where does the data come from?

  • When was it exported (if not directly accessed from a database)?

  • Has it been filtered?

  • Has it been cleaned?

  • Have computed fields been added?

  • Do you have the data dictionary with column definitions and special values?

  • Have you checked for extreme values and outliers?

  • Have you checked for duplicate values?

  • Have you checked for missing values within rows?

  • Have you checked for missing records?

  • Do you know what incomplete time periods there are at the beginning and end of your dataset?

Checking your data is not the most exciting part of data analysis, but it is a crucial step. Any analysis you do on data you haven’t carefully checked might end up being invalid or misleading the people you do it for. By following these steps to sanity check your data, you’ll be better prepared to analyze and explore your data. Learn more about data analysis on our blog.