Data analysts are under pressure to churn out more insights, faster. But in the rush, they can skip over critical background research, exploration, and scrutiny needed for robust and trustworthy outputs.
In this post, we outline practical habits that early-career data practitioners should build into their workflows to avoid common gotchas in data analysis. Analysts can deliver more accurate and trustworthy insights with less thrash by taking a beat to deeply understand the data (what’s there, and what’s missing!) before diving in, keeping a close eye on each data wrangling operation, and scrutinizing AI-generated outputs.
Know before you code
We know the feeling: getting your hands on a new dataset is exciting, and you’ve got a whole queue of requests to get through. These pressures can tempt analysts to immediately write code and build visualizations without fully understanding the data. If you don’t have a robust understanding of both the data and the questions you need to answer before jumping in, you’re putting yourself on a path to misinformed analyses and wasted effort.
Start by familiarizing yourself with the data. Get a broad overview of what’s in your database, perhaps using AI to help with quick data profiling. Then, dig further into the metadata to answer:
What does each row represent?
What does each variable represent?
What is each variable’s type (categorical, numeric, temporal)?
What are the variable units?
How fresh is the data, and how frequently is it updated?
Are there duplicate records?
Depending on your database and analyses, you might also ask questions about the data collection, quality assurance, and more.
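To make that first profiling pass concrete, here’s a minimal sketch assuming your table can be loaded into a pandas DataFrame (the file name and the updated_at timestamp column are hypothetical):

```python
import pandas as pd

# Hypothetical export of the table you're profiling.
df = pd.read_csv("accounts.csv")

print(df.shape)                     # how many rows and columns?
print(df.dtypes)                    # what type is each variable?
print(df.head())                    # what does a single row represent?
print(df.describe(include="all"))   # ranges, counts, and rough distributions

# Freshness: if there's an update timestamp, check the most recent record.
if "updated_at" in df.columns:
    print(pd.to_datetime(df["updated_at"]).max())

# Duplicates: count fully duplicated rows.
print(df.duplicated().sum())
```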
Just as important is understanding why this analysis matters, and for whom. Who will use the output from this work? How will it be used? Could they already find what they need in an existing report or dashboard? Taking the time up front to answer these questions may feel like it’s holding you back from pressing analyses, but it will save you from costly mistakes and unused work down the line.
Don’t miss missing values
When working with data, most analysts understandably focus on the data they have — not on what’s missing. When missing values are ignored, your model outputs may be biased, and your analyses might lead to invalid or overstated conclusions. Investigating missingness should be a non-negotiable part of exploratory data analysis. Build these habits into your data exploration so you don’t get bitten by missing values.
Know how missing values show up in your data
Someday, we may live in a world where there are universal standards for recording missing values. But we’re not there yet. Missing values are entered in a number of creative ways: as blank cells, NAs, impossible values for the field (e.g., -999 for a team size), character strings (such as “missing” or “no value”), and more. On top of that, missing values don’t always mean the same thing across datasets, or even across fields. For example, does a missing value mean the value doesn’t exist? Or that the value exists, but couldn’t be recorded? Or the value exists, but is undefined or unknown?
To make informed decisions about handling missing values, you need to know how they are represented, and what missingness actually means.
Look to the metadata to confirm how missing values are stored, and how they should be interpreted. If missing values aren’t documented, investigate further: ask members of your team who have worked with the data previously, and check for the usual suspects in each field (blank character strings, nulls, etc.) to find the answers. Once you do, be sure to update the documentation — your future self, and your colleagues, will appreciate it!
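Once you know which sentinels to look for, a small normalization pass can make them explicit. Here’s an illustrative sketch, assuming a pandas DataFrame df where missing values show up as empty strings, the word “missing”, and a -999 code in a hypothetical team_size field:

```python
import numpy as np

# Sentinels found during investigation -- confirm against your own metadata.
MISSING_STRINGS = ["", "missing", "no value", "NA", "n/a"]

# Replace string sentinels with real nulls across the whole table.
df = df.replace(MISSING_STRINGS, np.nan)

# Replace impossible codes in specific numeric fields (e.g., -999 for team size).
df["team_size"] = df["team_size"].replace(-999, np.nan)
```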
Explore the prevalence and patterns of missing values
Next, get a sense of just how much of your data is missing. Start broadly by finding the proportion of missingness for the entire table. Then, narrow down your exploration to assess missingness by variable, and then by group or category within each field. Doing so can reveal non-random or disproportionate missingness (at times indicating missing-at-random or missing-not-at-random mechanisms), which, if ignored, can bias downstream analyses.
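Here’s a short sketch of that drill-down with pandas, again assuming a DataFrame df, with a hypothetical segment grouping column and team_size field:

```python
# Overall proportion of missing cells in the table.
print(f"{df.isna().mean().mean():.1%} of all cells are missing")

# Proportion of missing values per variable, highest first.
print(df.isna().mean().sort_values(ascending=False))

# Missingness of one field broken down by group -- uneven rates can hint at
# missing-at-random or missing-not-at-random mechanisms worth investigating.
print(df.groupby("segment")["team_size"].apply(lambda s: s.isna().mean()))
```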
Many data visualization tools apply listwise deletion by default, omitting an entire record if any of the variables mapped to a chart channel is missing. So, don’t assume you’ll see missing values in a chart. Be sure to check if and how missing values are handled in whichever visualization tool you’re using.
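One quick way to see what a chart might hide, assuming a pandas DataFrame df and hypothetical field names, is to count how many records listwise deletion would silently drop:

```python
# Fields you plan to map to chart channels (hypothetical names).
chart_fields = ["signup_date", "team_size", "plan"]

dropped = len(df) - len(df.dropna(subset=chart_fields))
print(f"{dropped} of {len(df)} records would be omitted from the chart")
```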
Never stop exploring
Even with rigorous data collection standards and quality checks, data wrangling is an unavoidable part of any analysis. Whether you’re cleaning up inconsistent entries, filtering to assess a particular subset, or aggregating high-frequency or high-resolution values into larger bins, your data will undergo some transformation between raw table and final report.
When multiple data wrangling steps are chained together without checking the intermediate outputs, it’s not always clear if or how something has gone awry. For example, you might botch a unit conversion, which becomes less obvious once you’ve aggregated values by group in the next step. Even minor data entry differences like capitalization (e.g. "enterprise" versus "Enterprise"), whitespace, or punctuation can make it easy to unintentionally filter out entries you meant to include.
While these mistakes are often straightforward to fix, they are also easy to miss if you don’t keep eyes on your data at each step. The solution is to inspect the resulting data after each operation, but doing that by hand can be time-consuming.
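A lightweight way to keep eyes on the data is to check category values and row counts between steps. Here’s a sketch assuming a pandas DataFrame df with a hypothetical plan field that has inconsistent capitalization and whitespace:

```python
# Normalize casing and whitespace, then confirm the categories before filtering.
df["plan"] = df["plan"].str.strip().str.lower()
print(df["plan"].value_counts(dropna=False))

before = len(df)
enterprise = df[df["plan"] == "enterprise"]
print(f"Filter kept {len(enterprise)} of {before} rows")

# A simple guardrail: fail fast if a step unexpectedly drops every record.
assert len(enterprise) > 0, "Filter removed every row -- check the filter value"
```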
Modern BI tools are evolving to keep your data visible by default, so you can see changes to the data at each step without building new views from scratch. When visual summaries are available throughout the data analysis pipeline, analysts can identify mistakes earlier, and collaborators can more confidently interpret, engage with, and trust analyses regardless of their coding experience.
Always scrutinize AI
The growing role of AI in data analysis is undeniable. Data practitioners and teams are rapidly embracing AI as a tool to supercharge data profiling, exploration, wrangling, app development, and beyond. But AI-powered data analysis also comes with risks: AI hallucinates, makes mistakes, and lacks the context and domain expertise needed to deliver trustworthy analyses on its own.
To mitigate these risks and ensure sound results, data analysts must approach AI with scrutiny. We can't have AI work in a black box — it must operate transparently, allowing for close human scrutiny by those with the necessary skills and understanding to responsibly validate its results.
AI can’t replace the need for coding knowledge, contextual understanding, and domain expertise. Keep in mind that just because some AI-generated code runs doesn’t mean that it’s doing the right thing. In most cases, analysts should go even further by validating results against the underlying data and gut-checking them against their own judgment and domain knowledge.
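For instance, if an AI assistant hands back an aggregated summary, a quick independent recomputation from the raw data can surface silent errors. This sketch uses toy data and hypothetical column names:

```python
import pandas as pd

# Toy raw data and a summary an AI assistant claims to have produced
# (all names and numbers here are hypothetical).
df = pd.DataFrame({
    "region": ["EMEA", "EMEA", "APAC", "AMER"],
    "revenue": [120.0, 80.0, 200.0, 150.0],
})
ai_summary = pd.DataFrame({
    "region": ["AMER", "APAC", "EMEA"],
    "revenue": [150.0, 200.0, 200.0],
})

# Recompute the same aggregation directly from the raw data.
manual = df.groupby("region", as_index=False)["revenue"].sum()

# Compare the two: mismatched totals or silently dropped regions fail loudly.
pd.testing.assert_frame_equal(
    ai_summary.sort_values("region").reset_index(drop=True),
    manual.sort_values("region").reset_index(drop=True),
    check_dtype=False,
)
print("AI summary matches a direct recomputation")
```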
Learn more
The habits we describe above take time, but they don’t slow analysis down overall. By doing the necessary research beforehand, and closely inspecting each data wrangling operation and AI-generated output, analysts can avoid costly mistakes that, if eventually discovered 🤞, take much more work to address after the fact.
Looking for more resources to get started with robust data exploration and analysis? Check out our recent posts: