31 Oct, 2016

Developing Robust Data Analysis Techniques

Data analysis can be challenging. Sometimes, it might seem like you’re playing a game of Minesweeper: You’re looking at little white boxes with numbers inside them, and one small error could have a catastrophic effect on the entire system. The difference is that Minesweeper is a harmless but challenging brain game, while data analysis has real-world consequences.

A small typo or mathematical mistake somewhere in your data analysis could lead to a model that predicts drastically different outcomes than what actually happens. “Best practices” are essential when it comes to data analysis and interpretation. Here’s a basic rundown of some of these practices and how they can help streamline and optimize your data analysis techniques for better accuracy and efficiency.

What Causes Data Analysis Errors?

There are a number of factors that can lead to errors in datasets and in data analysis. Some of the most common include the following:

  • Source data. External data supplied by a third party could have errors by the time you receive it.
  • Manual data entry. People make typos, and missing or inaccurate data is a real possibility.
  • Conversion, aggregation, and tracking of data through systems. Errors can occur at every step as data is converted, combined, and passed from one system to the next.
  • Data processing. Data can be accidentally lost at various stages as it moves through a business.
  • Data cleansing. Errors can be introduced when data is changed or removed by anyone other than the source.
  • Overwhelming data storage. “Junk drawer syndrome” occurs when there’s a truly massive amount of data to deal with.
  • Data parsing. Errors can occur while breaking down data into its component sets.

Simple human error can cause plenty of mistakes in a business’s datasets. Whether it’s a data entry employee’s typo or an arithmetic mistake by an analyst, data quality is at risk at multiple stages of its journey from initial collection to final analysis. For data analysts, it’s important to mitigate the role of human error by using best practices that decrease the possibility of seemingly small mistakes that could have disproportionately significant consequences.

Best Practices for Data Analysis Techniques

Making “best practices” a habit serves a data analyst well. Hours and hours of work could be lost if you finish an in-depth review of a massive dataset only to find that you made a small error somewhere. Suddenly, you need to go back and search through your work, inspecting each tiny data point individually. Or you might need to start over completely, which is even more disruptive and time consuming. To prevent this, ensure that your analytical techniques are robust, accurate, and efficient. This saves money, saves time, and ensures accurate results from the data into which you’ve invested so much of your time and mental energy.

Each data analyst has their own methods and techniques for ensuring accuracy and avoiding errors. It can be hard to figure out where to start, but in his book Turning Numbers into Knowledge, Dr. Jonathan Koomey of Stanford University gives some solid advice. Here’s an overview of his guidelines for error-free data analysis.

1) Before analysis, always check the raw data for errors and inconsistencies.

Anomalies in your dataset can mean any number of things. It might mean that there’s an outlier, which you might need to strike from your analysis to prevent it from leading to an inaccurate conclusion. However, anomalies might also provide insight: If you are presented with a number of anomalies within a dataset and you compare them to other variables, you might find that a pattern emerges. This pattern could reveal a flaw in your data collection methodology, or it could provide analytical insight. It’s also possible that after further investigation the anomaly is nothing more than what it appears to be on its surface. Regardless of what you end up finding out, the best time to deal with outliers is before you start analyzing your data.
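One way to surface candidates for this kind of review is to flag values that sit far outside the bulk of the data. The sketch below uses the common interquartile-range rule; the function name, the readings, and the 1.5 multiplier are illustrative assumptions, not something prescribed by Koomey. It only flags values so that you can investigate them; it does not delete anything.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return (index, value) pairs that fall outside the IQR-based fences."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Hypothetical raw readings; 480 may be a typo for 48, or a real anomaly
readings = [42, 45, 47, 44, 480, 46, 43, 45]
print(flag_outliers(readings))  # [(4, 480)]
```

Whether a flagged value is a typo, a collection problem, or a genuine insight is still a judgment call for the analyst.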

2) Make sure to re-perform any necessary calculations, such as verifying columns of formula-driven data.

It’s like your middle school math teacher always told you: check your work. A typo or a very minor arithmetical error could completely alter the outcome of your data analysis, potentially leading to misguided strategic decisions that could cost a business dearly. Human error is always a factor that needs to be accounted for. The old woodworking adage applies figuratively to data analysis: “measure twice, cut once.”
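As a rough illustration of re-performing a calculation, the sketch below recomputes a formula-driven column and reports rows where the stored value disagrees. The order data, the column names, and the half-cent tolerance are all hypothetical.

```python
import math

# Hypothetical order lines; "total" is supposed to equal quantity * unit_price
rows = [
    {"id": 1, "quantity": 3, "unit_price": 9.99, "total": 29.97},
    {"id": 2, "quantity": 10, "unit_price": 2.50, "total": 25.00},
    {"id": 3, "quantity": 4, "unit_price": 12.25, "total": 94.00},  # digits transposed
]

def find_mismatches(rows, tolerance=0.005):
    """Recompute each total and return the ids of rows that disagree."""
    bad = []
    for row in rows:
        expected = row["quantity"] * row["unit_price"]
        # Compare with a small tolerance to allow for floating-point rounding
        if not math.isclose(expected, row["total"], abs_tol=tolerance):
            bad.append(row["id"])
    return bad

print(find_mismatches(rows))  # [3]
```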

3) Confirm that main totals are the sum of subtotals.

Professor Koomey separates this out from the point above to provide extra emphasis on the importance of double-checking your work. It seems simple, but you should double- and triple-check the simple addition of subtotals to create main totals. Although it’s simple grade school math, a minor error could necessitate a total redo of your data analysis. No one wants to go back through their data and play a game of “I Spy” to spot such a simple error.

The amount of time you might waste as the result of an arithmetical error can be quite disproportionate to the amount of time it takes to simply double check your math. Pretend you’re back in your high school math class finals: double check your work like your grades and your future depend on it. No one’s going to flunk you if your analysis is flawed; the grown-up consequences are real-world financial losses for businesses.

4) Check relationships between numbers that should be related in a predictable way.

The tips above are almost insultingly common sense, although the more comfortable you get in your profession the easier it is to forget them. But along with checking basic addition and subtraction, it’s also important not to always take your data at face value.

Some datasets incorporate collections of numbers that should theoretically relate to each other in predictable ways. For example, if the E. coli present in a culture sample should reproduce exponentially over time, your data should show that pattern. If it doesn’t, either the dataset is flawed or you’ve failed to account for some limiting factor like predation or environmental carrying capacity.

For most individual data sets, you should be able to look at certain numbers and their relationships and tell intuitively whether the relationships seem plausible or whether something stands out as unusual or incorrect. If there’s a very large error in a data set, you have some serious work to do.
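The exponential growth example can be turned into a simple plausibility check. Under exponential growth, counts taken at equal time intervals should grow by a roughly constant ratio, so a ratio that strays far from the typical one deserves a closer look. This is a minimal sketch; the counts, the function names, and the 25% tolerance are assumptions made for illustration, and it expects positive counts.

```python
from statistics import median

def growth_ratios(counts):
    """Ratio of each count to the one before it."""
    return [b / a for a, b in zip(counts, counts[1:])]

def deviating_steps(counts, rel_tolerance=0.25):
    """Indices of measurements whose growth ratio strays from the median ratio."""
    ratios = growth_ratios(counts)
    typical = median(ratios)
    return [
        i + 1  # ratio i describes the step that ends at measurement i + 1
        for i, r in enumerate(ratios)
        if abs(r - typical) / typical > rel_tolerance
    ]

# Hypothetical colony counts at equal intervals; growth stalls near the end
counts = [100, 205, 398, 810, 1590, 1650, 1700]
print(deviating_steps(counts))  # [5, 6]
```

The check cannot say why the last two measurements break the pattern. It only tells you where to look, whether the cause turns out to be a data error or a real effect such as carrying capacity.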

5) Normalize numbers to make comparisons easier.

Without a constant that provides a base of comparison, comparing variables is a lot more challenging. You could mistakenly see correlations where none actually exist or miss a correlation that might otherwise be obvious. This tip is a bit less obvious and common-sense than those above, but it’s just as vital.

As an example, amounts of money are easier to compare when expressed per person or as a share of GDP. Putting figures on a common base like this lets patterns, or inconsistencies, rear their heads a bit more clearly.
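To make the per-person idea concrete, here is a small sketch with two made-up regions. The names and figures are hypothetical and chosen only to show how the ranking can flip once totals are normalized.

```python
# Hypothetical regions: total spending and population
regions = {
    "Region A": {"spending": 50_000_000, "population": 1_000_000},
    "Region B": {"spending": 12_000_000, "population": 150_000},
}

def per_capita(data):
    """Divide each region's spending by its population."""
    return {name: r["spending"] / r["population"] for name, r in data.items()}

print(per_capita(regions))  # {'Region A': 50.0, 'Region B': 80.0}
```

On raw totals, Region A spends roughly four times as much as Region B. Per person, Region B spends more.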

Let’s say 100 people took an anonymous survey about how many gummy bears they have in their pockets. The survey also asks them about important personal and demographic information: their gender, age, monthly income, and other data.

Of these 100 people, 25 had more than 20 gummy bears in their pockets. This isn’t realistic, and a pocket full of melted gummy bears sounds awful, but bear with us. Another 50 had 10-20 gummy bears, 15 had 1-9, and the final five had none. These amounts are “gummy bears per person.”

On its own, comparing incomes of these people might be pointless. But if we relate their incomes to their number of gummy bears, we can see how those two variables relate. We’ll probably find that people with higher incomes tend to have more gummy bears. This makes sense intuitively, because the more money you have the more candy you can afford.

But did you notice that the subtotals didn’t equal the main total? That was in there to test you. That entire contrived little example wasn’t to educate you about comparing variables so much as to demonstrate how easy it is to overlook really basic arithmetical mistakes.
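A check like this is easy to automate. The sketch below uses the survey figures from the example above; the function and variable names are illustrative.

```python
survey_total = 100
subtotals = {"more than 20": 25, "10-20": 50, "1-9": 15, "none": 5}

def reconcile(total, parts):
    """Return the stated total minus the sum of its parts; zero means they agree."""
    return total - sum(parts.values())

gap = reconcile(survey_total, subtotals)
print(gap)  # 5 respondents are unaccounted for
```

Running a reconciliation like this before any further analysis takes seconds and catches exactly the kind of slip that is easy to read past.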

Mitigating Human Error in Data Analysis

The importance of reducing human error cannot be stressed enough. It’s important to double check for human error as early as possible in the data analysis process. The best possible outcome of a mathematical error is someone catching the mistake and then going back to fix it. This could take hours. The worst possible outcome is that the mistake goes unnoticed, completely throwing off the conclusions that you arrive at after analyzing the data.

A seemingly simple error can create a snowball effect, leading to wildly inaccurate conclusions that ultimately guide a business’s strategy and actions in a totally wrong direction. Accuracy is of the utmost importance for data analysis, and adopting best practices in your data analysis techniques to avoid errors and verify even the simplest mathematical operations can pay off substantially.

PulaTech can help your business develop custom software applications that streamline and standardize your business’s data collection and analysis, reducing the potential for human error across the board. From simple, task-specific apps and minor customizations to enterprise-wide custom app development, PulaTech is your partner. Our professional project management makes coordinating and implementing your app development projects simple and straightforward.

Contact us today to talk about how your business can benefit from custom software solutions and put the power of Pula to work for you.