Bad Data | Data Integrity

Hi, in this blog post we are going to talk about some of the most common mistakes that can lead to bad data or a poor analysis of that data. We are also going to recommend steps you can take to ensure better data integrity.

Why is Bad Data a Bad Idea?

The only reason to collect and analyze data is to derive insights and understand performance. If the data you are using is bad, the insights and understanding are unlikely to match reality. This leads to poor decisions, bad strategies and, at best, inefficient management of your digital activities.

Bad data can arise from a number of different problems or situations, and being aware of them in advance can help you prevent them. You can also use the list below as a checklist if you are experiencing problems with inaccurate data or insights.

1. Incorrect Implementation of Tracking Code

One of the most common causes of bad data is incorrect implementation of the tracking code that enables the data to be collected. It usually takes one of the following forms.

Running a tag audit on your site is a great way to ensure that you are not suffering from the problems described below.

Missing Code

If your site is missing code from any of its pages, you will not be collecting all of the data you should be. The scale of the problem will be proportional to the quantity of affected pages and the value of those pages. This is always going to result in a loss of data.

This applies to any tracking code, whether it’s Google Analytics or AdWords conversion tracking code. If it’s missing, it’s not working!

You should keep a record of what tracking code should be on what pages and where it should be located. Run regular checks to ensure continuity in your tracking code setup.
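As a rough illustration of such a check, the sketch below compares a record of expected tags against the HTML of each page. The page paths, the tag markers (`GTM-XXXXXXX`, `AW-000000000`) and the `audit_tags` helper are all hypothetical. A real audit would fetch live pages and ideally inspect what actually fires in the browser rather than the raw HTML.

```python
# Hypothetical record of which tag markers each page should contain.
EXPECTED_TAGS = {
    "/": ["GTM-XXXXXXX"],
    "/checkout/thank-you": ["GTM-XXXXXXX", "AW-000000000"],
}


def audit_tags(pages, expected=EXPECTED_TAGS):
    """Return {path: [missing markers]} for pages that lack an expected tag.

    `pages` maps a URL path to that page's HTML source (fetched elsewhere).
    """
    problems = {}
    for path, markers in expected.items():
        html = pages.get(path, "")
        missing = [m for m in markers if m not in html]
        if missing:
            problems[path] = missing
    return problems


# Made-up HTML: the thank-you page lacks the conversion tag.
sample_pages = {
    "/": "<head><script>/* GTM-XXXXXXX */</script></head>",
    "/checkout/thank-you": "<head><script>/* GTM-XXXXXXX */</script></head>",
}
audit_result = audit_tags(sample_pages)
print(audit_result)
```

Running something like this on a schedule gives you an early warning when a template change silently drops a tag.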

Wrong Place

Implementing the code in the wrong place on the page can result in the code triggering inconsistently or not at all. If the implementation of your tracking code is managed systematically, you may find that it is incorrect on every page.

This is always going to result in a loss of data.

Multiple Sets of Tracking Code

Having multiple sets of tracking code on the same page can be OK, but only in certain circumstances. Duplicate tracking code, on the other hand, can cause problems and will most likely inflate your figures.

If different accounts are used on different pages unintentionally, the data from a single site can end up split across different Analytics accounts. Resolving this can be a manual task, replete with headaches, as you cannot merge accounts.
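Both situations can be looked for with a simple scan of page source. The sketch below is a minimal example that assumes Universal Analytics style property IDs (such as `UA-12345-1`). The sample pages and the `find_issues` helper are made up for illustration, and a repeated ID is only a prompt to investigate, since some legitimate setups mention the same ID more than once.

```python
import re
from collections import Counter

# Universal Analytics property IDs look like "UA-12345-1"; adjust the pattern
# for whatever ID format your tracking code actually uses.
PROPERTY_ID = re.compile(r"UA-\d+-\d+")


def find_issues(pages):
    """Return (duplicates per page, all distinct IDs seen across the site)."""
    counts = {path: Counter(PROPERTY_ID.findall(html)) for path, html in pages.items()}
    # The same ID more than once on a page suggests duplicated tracking code.
    duplicates = {}
    for path, ids in counts.items():
        repeated = [pid for pid, n in ids.items() if n > 1]
        if repeated:
            duplicates[path] = repeated
    # More than one distinct ID across the site suggests split accounts.
    all_ids = sorted({pid for ids in counts.values() for pid in ids})
    return duplicates, all_ids


sample_pages = {
    "/": "ga('create', 'UA-11111-1'); ga('create', 'UA-11111-1');",
    "/blog": "ga('create', 'UA-22222-1');",
}
duplicates, all_ids = find_issues(sample_pages)
print("Duplicated on page:", duplicates)
print("Distinct IDs across site:", all_ids)
```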

Bad Code

Simply having typos in your tracking code can prevent it from triggering, and also has the potential to cause a range of unintended and undesirable behavior. The most likely effect is a loss of data.

Poor Load Times

If a page takes too long to load, discrepancies can creep into the data as a result of tracking code and scripts firing late or not at all. You can also get problems when tracking between platforms, for example if AdWords records a click on an ad but Google Analytics doesn’t register the user or session.

This is a hard problem to identify, and it would take systematic testing to confirm it as the cause of inaccurate data or data discrepancies.

2. Not Comparing Like for Like

Conflating Google Analytics goal completions with AdWords conversion data is a good example of treating two things as the same when they only seem similar. This mistake arises from a lack of understanding of what metrics are and how they are defined, measured, tracked and counted. Sessions are not unique people visiting your site, clicks are not sessions, and so on.

More often than not, you cannot use metrics from different platforms in the same table or calculation. Search Console and Google Analytics (GA) will provide wildly different values for traffic. Therefore, using Search Console traffic and GA goal completions to derive a conversion rate is not going to produce anything useful.

Another example of this mistake is comparing irrelevant date ranges. Seasonality, for example, plays a significant role in your traffic, so comparing an off-season date range with a peak-season date range while ignoring that fact will lead to bad insights. Likewise, selecting a date range in which some major sitewide change occurred and comparing it to a business-as-usual month would be a mistake. You can always compare these types of things, so long as the difference is recognized.

If you want to know the difference in performance between peak and off-peak times, that is a reasonable question to answer.
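To make the date range point concrete, here is a small sketch with made-up monthly session counts for a seasonal site. The figures and the `pct_change` helper are assumptions for illustration only.

```python
def pct_change(current, baseline):
    """Percentage change from baseline to current."""
    return (current - baseline) / baseline * 100


# Made-up monthly sessions for a seasonal site.
sessions = {
    "2023-11": 40_000,  # off-season
    "2023-12": 90_000,  # peak season, last year
    "2024-11": 44_000,  # off-season
    "2024-12": 99_000,  # peak season, this year
}

# Month over month mixes an off-season month with a peak-season month.
mom = pct_change(sessions["2024-12"], sessions["2024-11"])
# Year over year compares peak season with peak season.
yoy = pct_change(sessions["2024-12"], sessions["2023-12"])

print(f"December vs November: {mom:+.0f}%")
print(f"December vs last December: {yoy:+.0f}%")
```

With these numbers the month-over-month figure is about +125% while the year-over-year figure is about +10%. Both are valid answers, but to different questions: the first measures the size of the seasonal peak, the second measures underlying growth.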

3. Incorrect Use of Data

Some data is known to be inaccurate or ‘fluffy’. In fact, almost all data is inaccurate to a degree, but we can accept something that is, say, 98% accurate. Keyword Planner data is a good example: it is notoriously inaccurate, but it is often internally consistent. As such, we can take some insights from it, for instance that one keyword is searched for a lot more than another.

Planning with any greater granularity can start to lead to inaccurate or bad insights. You need to know the limitations of the data you are working with: is it meant to provide a broad trend or a precise granular count?

Data requires context, so data presented without context, or with the context removed, often has little capacity for useful insight. Using data without context will often meet the definition of ‘incorrect use of data’. A great example of this is ‘cherry picking’, which often means choosing a date range that supports your claim while ignoring the data that doesn’t.

If a site’s traffic had dropped wholesale to 10% of its previous level six months ago, and had since recovered to 90% of what it was before the drop, it would be disingenuous to choose a date range that starts when the drop occurred and claim a huge increase in traffic. This is only a problem if the fact is ignored. Saying “since the drop in traffic six months ago, traffic has increased 800%” is very different.
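The arithmetic behind this example is worth spelling out, because the choice of baseline changes the story completely. The sketch below uses a made-up pre-drop level of 100,000 sessions. A recovery from 10% to 90% of that level is a ninefold rise, which is an 800% increase measured from the low point but a 10% decrease measured from before the drop.

```python
def pct_change(current, baseline):
    """Percentage change from baseline to current."""
    return (current - baseline) / baseline * 100


before_drop = 100_000              # made-up monthly sessions before the drop
after_drop = before_drop * 0.10    # traffic falls to 10% of the original level
today = before_drop * 0.90         # traffic recovers to 90% of the original level

vs_low_point = pct_change(today, after_drop)
vs_before_drop = pct_change(today, before_drop)

print(f"Compared with the low point: {vs_low_point:+.0f}%")
print(f"Compared with before the drop: {vs_before_drop:+.0f}%")
```

Either figure can be reported honestly, as long as the baseline is stated alongside it.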

4. Lack of Granularity

Looking at a top-level metric and trying to make granular decisions is a problem for many businesses unfamiliar with how to interrogate data properly. You often need to address channels differently, split out outliers and deal with smaller subsets of data to derive proper insights.

An example of this would be looking at a declining conversion rate and assuming a problem. This may be due to a new kind of marketing that delivers a lot of traffic but has a lower conversion rate. This could reduce the overall conversion rate of the site but be entirely intentional and desirable. A breakdown of traffic with some visualizations could help you to understand what is actually occurring.
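The effect described above can be reproduced with a few lines of arithmetic. In this sketch the channel names and figures are invented. The existing channels convert exactly as before, yet the overall rate falls once a high-volume, lower-converting channel is added.

```python
# Made-up figures: (sessions, conversions) per channel.
before = {"organic": (10_000, 300), "email": (2_000, 100)}
after = {"organic": (10_000, 300), "email": (2_000, 100), "display": (20_000, 200)}


def conversion_rate(channels):
    """Overall conversion rate (%) across all channels."""
    sessions = sum(s for s, _ in channels.values())
    conversions = sum(c for _, c in channels.values())
    return conversions / sessions * 100


overall_before = conversion_rate(before)
overall_after = conversion_rate(after)
total_conversions_before = sum(c for _, c in before.values())
total_conversions_after = sum(c for _, c in after.values())

print(f"Overall before: {overall_before:.2f}% ({total_conversions_before} conversions)")
print(f"Overall after:  {overall_after:.2f}% ({total_conversions_after} conversions)")
for name, (s, c) in after.items():
    print(f"  {name}: {c / s * 100:.2f}%")
```

The top-level rate drops from roughly 3.3% to under 2%, while total conversions go up and no individual channel has got worse. Only the per-channel breakdown shows that.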

5. Statistical Significance (Use Maths!)

Not having enough data, or making a decision with insufficient data, is the equivalent of having bad data. You should apply statistical significance calculations to data where relevant, typically when running split tests or multivariate testing (MVT).
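One common way to run such a calculation for a split test is a two-proportion z-test. The sketch below is a minimal version using only the standard library. The function name and the sample figures are assumptions, and the normal approximation it relies on is only reasonable when each variant has plenty of conversions and non-conversions.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Returns (z, p_value), using the pooled-proportion normal approximation.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Same 10% vs 15% conversion rates, at two very different sample sizes.
z_small, p_small = two_proportion_z_test(10, 100, 15, 100)
z_large, p_large = two_proportion_z_test(1_000, 10_000, 1_500, 10_000)

print(f"100 visitors per variant:    z={z_small:.2f}, p={p_small:.3f}")
print(f"10,000 visitors per variant: z={z_large:.2f}, p={p_large:.3g}")
```

The observed rates are identical in both cases, but with 100 visitors per variant the difference is well within what chance alone could produce, while with 10,000 it is not.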

The wider point is that you should apply maths to data to derive insights, whether that’s in identifying trend lines or forecasting. Using the wrong formulas or eyeballing data is as bad as having bad data and will often result in producing bad data.
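As a simple instance of applying maths rather than eyeballing, the sketch below fits a straight trend line by ordinary least squares and extends it one period ahead. The weekly figures and the `fit_trend_line` helper are made up, and a straight line deliberately ignores seasonality, so treat the projection as a rough guide only.

```python
def fit_trend_line(values):
    """Ordinary least-squares line through the points (0, v0), (1, v1), ...

    Returns (slope, intercept).
    """
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


weekly_sessions = [1000, 1040, 1010, 1090, 1120, 1100, 1180]  # made-up data
slope, intercept = fit_trend_line(weekly_sessions)
next_week = slope * len(weekly_sessions) + intercept

print(f"Trend: {slope:+.1f} sessions per week")
print(f"Straight-line projection for next week: {next_week:.0f}")
```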

6. Tracking Discrepancies (Multiple Technologies)

All platforms track data differently. As a result, you will almost always find that the numbers do not match 100%. If comparing platforms is fundamental to your business, or for any reason unavoidable, you need to understand the typical size of the difference and its standard deviation.

If you consistently see something like a 10% difference either over or under, you will need to be on the lookout for discrepancies that fall outside that range. Anything that falls outside of the natural variability should be flagged as it may indicate a problem with one of the tracking platforms.
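One way to put this into practice is to measure the usual difference over a baseline period and flag anything that falls too far from it. The sketch below is an illustration under assumptions: the daily figures are invented, `flag_discrepancies` is a hypothetical helper, and the cut-off of two standard deviations is a common convention rather than a rule.

```python
from statistics import mean, stdev


def flag_discrepancies(platform_a, platform_b, baseline_periods, threshold=2.0):
    """Flag periods whose A-vs-B difference falls outside the usual range.

    The usual range is the mean percentage difference over the first
    `baseline_periods` periods, plus or minus `threshold` standard deviations.
    """
    diffs = [(a - b) / b * 100 for a, b in zip(platform_a, platform_b)]
    baseline = diffs[:baseline_periods]
    centre, spread = mean(baseline), stdev(baseline)
    low, high = centre - threshold * spread, centre + threshold * spread
    flagged = [i for i, d in enumerate(diffs) if not low <= d <= high]
    return flagged, (low, high)


# Made-up daily counts: ad clicks vs analytics sessions.
clicks = [1100, 1090, 1110, 1105, 1095, 1100, 1300]
sessions = [1000, 1000, 1000, 1000, 1000, 1000, 1000]
flagged, usual_range = flag_discrepancies(clicks, sessions, baseline_periods=6)

print(f"Usual difference: {usual_range[0]:.1f}% to {usual_range[1]:.1f}%")
print("Days to investigate (0-based index):", flagged)
```

Here the first six days sit at around a 10% difference, so the final day's 30% gap is the only one flagged.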

You should also look for trends: is the difference growing, shrinking or staying the same? Can you correlate a growing divergence with other activity or changes?

You also need to know where to draw the line when striving for accuracy with these problems, as the numbers will never match entirely all the time. You need to know what an acceptable and useful level of accuracy is.

7. Bots & Spam

Bots and other forms of spammy traffic can skew data, and they turn up in various ways and for various reasons. One way to remove this junk from your data is to identify a pattern in the behavior of the bot or spam traffic, such as a time on page of 0.00 seconds or traffic coming from a certain source. There are some off-the-shelf solutions to this issue, such as IP blockers and other specific anti-spam tools.

If you suffer from this type of issue, it’s well worth investing in one of the many relatively cheap solutions.
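If you want to explore the pattern-based approach yourself before buying a tool, it can be sketched as a simple filter. Everything here is hypothetical: the field names, the spam source and the sample sessions. Note that zero time on page can also describe a genuine visitor who left straight away, so in practice that signal is best combined with others.

```python
# Hypothetical session records; the field names are assumptions for illustration.
sessions = [
    {"source": "google", "time_on_page": 42.0},
    {"source": "free-traffic.example", "time_on_page": 3.0},
    {"source": "(direct)", "time_on_page": 0.0},
    {"source": "newsletter", "time_on_page": 12.0},
    {"source": "google", "time_on_page": 130.5},
]

SPAM_SOURCES = {"free-traffic.example"}  # hypothetical known-spam source


def looks_like_bot(session):
    """True if the session comes from a known spam source or has zero time on page."""
    return session["source"] in SPAM_SOURCES or session["time_on_page"] == 0.0


def split_sessions(all_sessions):
    """Separate sessions into (clean, junk) using the pattern above."""
    clean = [s for s in all_sessions if not looks_like_bot(s)]
    junk = [s for s in all_sessions if looks_like_bot(s)]
    return clean, junk


clean, junk = split_sessions(sessions)
print(f"Kept {len(clean)} sessions, removed {len(junk)}")
```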

8. Failing to Spot Trends & External Factors

Failure to notice an environmental factor affecting the data essentially means that the data is bad, because you believe it to be something it is not. This could be a major problem if, for example, your SEO agency is reporting and claiming responsibility for a rise in organic traffic when in fact the rise is due to a seasonal fluctuation.

Conversely, if you start to see traffic declining and, instead of checking whether this is actually a problem, you go ahead and start panicking over a solution, you’re not going to have a good day! It may be that a TV campaign just ended and that it was propping up brand traffic in an off-peak season. That insight, once verified, could be a valuable piece of information to feed back into your marketing.

Using annotations is a good way of making notes on the data to inform people reviewing it of relevant external factors, while trend analysis can help you understand the impact of seasonality.

9. Improper Tests or Poor Hypothesis

Failure to control or account for variables is often the cause of bad data in poorly executed tests. You need to construct proper tests, based on a logical hypothesis, and minimize noise from other variables. Running a test just as a TV campaign runs, or as a busy season begins, could lead to very unclear data.

If part 1 of a test is done under different conditions to part 2, then the test is essentially rendered useless. Significantly changing the mix of traffic partway through would be a good example of how to ruin the results of a test.

You must be able to disprove a hypothesis, so if you run a test that can’t disprove it, the results cannot support the conclusion. A good hypothesis, therefore, needs to be testable, and you need the ability to disprove it with data. In some cases, you may need to do multiple tests or analyze multiple datasets to be able to draw a conclusion of any kind.

10. Poor Visualisation of Data

Using the wrong type of chart or table can be very misleading. A simple example of this is the pair of charts below, which both show the same data:

The only difference in the charts above is the starting / minimum value on the vertical axis, but the difference in the visual impact is substantial. This is just one example, but the principle applies to all charts, tables and data.
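The size of this effect is easy to quantify. In a bar chart, the drawn height of a bar is its value minus the axis minimum, so the sketch below compares how much taller one bar looks than another under two different starting values. The figures and the `visual_ratio` helper are made up for illustration.

```python
def visual_ratio(a, b, axis_min=0.0):
    """Ratio of the drawn bar heights for values a and b when the axis starts at axis_min."""
    return (b - axis_min) / (a - axis_min)


last_month, this_month = 95, 100  # made-up values, a rise of about 5%

ratio_from_zero = visual_ratio(last_month, this_month, axis_min=0)
ratio_truncated = visual_ratio(last_month, this_month, axis_min=90)

print(f"Axis starting at 0:  second bar is {ratio_from_zero:.2f}x the height of the first")
print(f"Axis starting at 90: second bar is {ratio_truncated:.2f}x the height of the first")
```

With the axis starting at zero the second bar is about 5% taller, matching the data. With the axis starting at 90 it is drawn twice as tall, even though nothing about the data has changed.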

