Here we will be talking about:

In this post, I will talk about some of the statistical biases that I learned the hard way.

Survivorship Bias

The phenomenon where only those that ‘survived’ a long process are included or excluded in an analysis, thus creating a biassed sample

While collecting the data, we did not take into account the factor of survivorship bias, which led to huge problems. This bias existed at the data source level, but looking at the data, it is tough to find out. When the model accuracy was not optimal, we were trying to find out why it wasn’t that good and finally found that it was due to different data distribution at the training level vs. the production level, which led to this problem. A lesson to learn about the data and how it is sampled This lesson holds true for most of the data bias problems.

Selection Bias

This bias happens when the selection is not random enough, ultimately resulting in a sample that is not representative of the population.

This is the mistake that I personally made during a POC. I was not sure what the impact of the bias would be, but after going ahead with some iterations, I was told to check the data distribution, which would have been validated by the business, but they found something was not right with the distribution, so we looked into the root cause of the issue and finally found that the way I was filtering the data was creating the issue, and it was conceptually entirely wrong and led to bias in the data analysis and modelling. So, now I know the hard way that selection bias plays a huge role in these kinds of models.

Different types of Selection Bias

Sampling bias

This happens when non-random sampling takes place from the given data.

Confirmation bias

A practise to choose favourable data points that confirm one’s belief.

Time interval bias

Intentionally selecting an interval from which we can support the desired conclusion.

Omitted Variable bias

This happens due to the absence of the relevant variable for the predictor. Removing relevant variables or having too many variables does affect the fit of the model.

This is the most common problem I face. I worked in multiple domains, and I am not an expert in any of them, so it is very tough to know the business logic of each and every feature and how it will have a larger effect on the project. This required huge domain knowledge to know how some decisions are intertwined with the final feature variable. While performing feature selection, it is common to have different sets of important features as per statistics compared to business understanding. This leads to omitted variable bias, and the best way to avoid this is to have a session with the business end to avoid removing important business-centric features.

Recall bias

This bias happens when some forget or partially remember a past incident, and this is very common in different types of surveys.

I personally never encountered this, but surveys should be taken with a grain of salt, and it is tough to get correct statistics from a survey until and unless it is well-designed. I worked with surveys only once, and I used that survey to get the sentiment of the sample group that completed the survey. It is tough to get an unbiased conclusion from a survey.

Reference

  • https://datatron.com/what-is-statistical-bias-and-why-is-it-so-important-in-data-science/
  • https://www.forbes.com/sites/forbestechcouncil/2022/02/23/look-for-whats-missing-how-to-avoid-survivorship-bias-in-data-science/?sh=246e586f799e