A complete guide to Bias & Ethics in Data Collection

Nov 12, 2021

Bias at the collection stage, in statistical terms, means that the data you have gathered is not representative of the group or activity about which you want to make a claim. Shortcuts and various types of errors are part of what makes us human. According to the author and psychologist Daniel Levitin (2016):

Remember, people gather statistics. People choose what to count, and how to go about counting. There are a host of errors and biases that can enter into the collection process and these can lead millions of people to draw the wrong conclusions.

White building with data has a better idea sign board — Photo by Franki Chamaki on Unsplash

Before we jump in, let us quickly look at what bias and ethics mean.

What is Ethics?

Ethics seeks to answer the following questions:

What is right or wrong?
What is good or bad?
What is justice?
What is well-being or equality?

What is Bias?

A bias is a prejudice in favor of or against one thing, person, or group compared with another.

Data collection is the most critical phase of, and the foundation for, data-driven technology. Bias often enters the collection process when time and people are in short supply. The following are some of the biases that arise during data collection and that can lead to ethical harm.

Selection Bias

Selection bias is a kind of systematic error that occurs when the data collector decides who and what is going to be studied, so the selection of participants is not randomized. Let us say we want to assess a program for improving the health of employees who work from home. Those who have signed up may be different from those who have not. Perhaps the people who signed up are more health-conscious, and that is why they signed up.

If this were the case, it would not be fair to conclude that the program was effective. Self-selection alone could account for more of the participants' health outcomes than the program itself.
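The effect of self-selection can be sketched with a small simulation. Everything in it is an assumption made for illustration: the latent "health-consciousness" score, the sign-up threshold of 0.5, and a program whose true effect is set to zero. It compares the health gap between participants and non-participants under self-selection and under random assignment.

```python
import random
import statistics


def simulate(n=10_000, seed=0):
    """Return (self-selected gap, randomized gap) for a program with zero true effect."""
    rng = random.Random(seed)

    # Hypothetical latent trait: how health-conscious each employee is.
    consciousness = [rng.gauss(0, 1) for _ in range(n)]
    # Health depends only on the trait plus noise; the program adds nothing.
    health = [c + rng.gauss(0, 1) for c in consciousness]

    # Self-selection: only the more health-conscious employees sign up.
    signed_up = [c > 0.5 for c in consciousness]
    # Random assignment: a coin flip decides who joins.
    assigned = [rng.random() < 0.5 for _ in range(n)]

    def gap(flags):
        joined = [h for h, f in zip(health, flags) if f]
        others = [h for h, f in zip(health, flags) if not f]
        return statistics.mean(joined) - statistics.mean(others)

    return gap(signed_up), gap(assigned)


self_selected_gap, randomized_gap = simulate()
print(f"self-selected gap: {self_selected_gap:.2f}")
print(f"randomized gap:    {randomized_gap:.2f}")
```

Under these assumptions the self-selected group looks clearly healthier even though the program does nothing, while the randomized comparison stays close to zero.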

How to manage these ethical harms in Selection Bias?

Selection bias can be minimized in the following ways:

Include as many samples as possible in the study
Conduct an experimental study
Draw from a sample that is not self-selecting

Sampling Bias

Sampling bias results from a failure to properly randomize the population sample, and it often happens unintentionally. For example, imagine there are 30 people in a classroom and you ask whether they prefer Maths or Physics. If you surveyed only the boys and concluded that the majority of students like Maths, you would have demonstrated sampling bias.
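The classroom example can be worked through with a few lines of Python. The split of boys and girls and their preferences below are invented for illustration; only the class size of 30 comes from the example above.

```python
from collections import Counter

# Hypothetical classroom of 30 students; the counts are invented for illustration.
classroom = (
    [("boy", "Maths")] * 10
    + [("boy", "Physics")] * 4
    + [("girl", "Maths")] * 4
    + [("girl", "Physics")] * 12
)


def share_preferring(students, subject):
    """Fraction of the given students who prefer `subject`."""
    counts = Counter(preference for _, preference in students)
    return counts[subject] / len(students)


# Biased sample: only the boys are surveyed.
boys_only = [student for student in classroom if student[0] == "boy"]

biased_share = share_preferring(boys_only, "Maths")
true_share = share_preferring(classroom, "Maths")

print(f"Maths share among boys only: {biased_share:.0%}")
print(f"Maths share in whole class:  {true_share:.0%}")
```

With these made-up numbers, the boys-only survey suggests a clear majority for Maths, while fewer than half of the whole class actually prefers it.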

There are many types of sampling bias, including the following:

Undercoverage: This common type of sampling bias happens when some members of the population are not represented, or are poorly represented, in the sample
Non-Response: Also referred to as participation bias, this happens when selected participants are unable or unwilling to take part in the survey
Pre-Screening: This happens when the selection process deployed in a study results in a sample that is a poor representation of the population

Sampling bias can be managed with the following measures to avoid ethical harms:

Avoid Convenience Sampling
Follow up with non-responders
Clearly define the target audience
Oversample under-represented groups
Make the survey accessible and simple
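As one possible alternative to convenience sampling, the sketch below draws a stratified random sample: the target audience is split into groups and the same fraction is drawn at random from each. The function name `stratified_sample`, the "office" and "field" groups, and the group sizes are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(population, key, fraction, seed=0):
    """Draw the same fraction at random from every stratum of the population."""
    rng = random.Random(seed)

    # Group people by stratum (for example, by department or location).
    strata = defaultdict(list)
    for person in population:
        strata[key(person)].append(person)

    sample = []
    for members in strata.values():
        # Take at least one person so that no stratum is left out entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical target audience: 900 office staff and 100 field staff.
population = [{"id": i, "group": "office"} for i in range(900)]
population += [{"id": i, "group": "field"} for i in range(900, 1000)]

sample = stratified_sample(population, key=lambda p: p["group"], fraction=0.1)
field_count = sum(1 for p in sample if p["group"] == "field")
print(len(sample), field_count)
```

To oversample a small group, a larger fraction could be used for that stratum, with the results weighted back to the group's real share of the population during analysis.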

Misclassification Bias

Misclassification bias occurs when data points are assigned to incorrect categories. For example, categorizing a smoker as a non-smoker.

There are two types of misclassification bias:

Non-differential: It occurs when the degree of misclassification of exposure status is equal across all groups being compared

Differential: It occurs when the degree of misclassification of exposure status differs between those with and those without the outcome being studied
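The difference between the two types can be illustrated with a small worked example. The counts are invented, the 20% error rate is an assumption, and the odds ratio is used here only as one convenient way to summarize the association between smoking and an outcome.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds ratio of a 2x2 table of exposure (smoking) by outcome."""
    return (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)


def misclassify(exposed, unexposed, rate):
    """Record a share `rate` of true smokers as non-smokers."""
    moved = exposed * rate
    return exposed - moved, unexposed + moved


# Invented true counts: (smokers, non-smokers) among people with and without the outcome.
cases = (200, 300)
controls = (100, 400)

true_or = odds_ratio(*cases, *controls)

# Non-differential: the same 20% error rate applies in both groups.
non_diff_or = odds_ratio(*misclassify(*cases, 0.2), *misclassify(*controls, 0.2))

# Differential: only the controls under-report smoking.
diff_or = odds_ratio(*cases, *misclassify(*controls, 0.2))

print(f"true: {true_or:.2f}")
print(f"non-differential: {non_diff_or:.2f}")
print(f"differential: {diff_or:.2f}")
```

In this made-up table, the non-differential error pulls the odds ratio toward 1 and weakens the apparent association, while the differential error inflates it. With other error patterns, differential misclassification can distort the estimate in either direction.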

People involved in data collection should have a proper mitigation plan for handling bias and should make it part of the collection process, so that any ethical harms are reduced. If you would like to know more about how sampling bias might affect your business, visit Payoda.

Authored by: Suri Parathasarathy