
What are biases in Image Datasets in Computer Vision?

This article introduces the main theories of bias that arise during data collection, then explains how those biases carry over into the way a dataset is subdivided, annotated, and interpreted. It closes with the ways AI systems built on such data can lead to unjust outcomes.

Introduction

Let’s use a picture of lions as an example of what is called Prototype theory, in which we categorize stimuli in order to condense them cognitively and behaviourally, with the ultimate goal of reducing them to smaller, more comprehensible groups. We see a picture of lions and start listing their characteristics one by one, based on their appearance and their background. The theory recognizes that prototypical notions about an object arise from pre-existing stored norms about the properties of that object and the category it falls into. For example, when we say lions are brown, we are treating brown as prototypical for the animal, even though white lions exist too; our brain resorts to the prototypical image whenever someone mentions a lion.

The first example we use to introduce the concept of bias is that of doctors and surgeons not operating on their own families, presumably because personal attachment clouds their professional medical judgment and would put their loved ones’ lives at undue risk. A study by Wapman and Belle of Boston University indicates that a subject bias is also at play here: upon reading this scenario, most test subjects simply assume the doctor in question is male, including women and self-identified feminists.

Human Reporting Bias

This bias refers to the fact that the frequency with which people mention properties, actions, or outcomes in text is not an accurate reflection of how often they occur in the real world, nor a sound basis for attributing traits to individuals or sorting them into a particular class of people.
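
To make this concrete, here is a minimal Python sketch (the captions and words are entirely made up for illustration) showing how word counts in written captions can diverge from real-world frequencies:

```python
# A minimal sketch of reporting bias: the frequency of words in text people
# write is not the frequency of the underlying events. Captions are hypothetical.
from collections import Counter

captions = [
    "a man wearing a straw hat",
    "a man on a beach on a sunny day",
    "dog attacks jogger in the park",
    "dog attacks child near school",
    "dog sleeping on the sofa",
]

word_counts = Counter(word for caption in captions for word in caption.split())
print(word_counts["attacks"], word_counts["sleeping"])
# Dogs sleep far more often than they attack, yet "attacks" dominates the text,
# because unremarkable events rarely get written down.
```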

To uncover the reporting biases that exist in data, the first step is to look at how training data is collected and annotated, since this is where many biases and errors enter. The biases in the data include:

  1. Reporting bias, discussed in the previous paragraph: it is not the facts themselves but how the facts are reported that introduces bias. If a human annotator is solving an image captioning problem, the same image of a man in a straw hat can be captioned as “a man is wearing a straw hat,” “a man on a beach on a sunny day,” “hats provide shade to men,” and so on.
  2. Selection bias, which occurs when the way a sample is selected means it does not fairly represent a random draw from the population (see the sketch after this list).
  3. Out-group homogeneity bias, wherein individuals perceive out-group members as more alike than in-group members when comparing traits and characteristics. For example, a black-furred dog may get grouped with a black-furred cat rather than with other dogs.
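
Selection bias in particular is easy to probe with a few lines of code. The sketch below (class names and counts are hypothetical) compares the label distribution of a collected sample against a reference distribution; a large gap hints that the sampling procedure, not the real world, is shaping the data:

```python
# A minimal sketch of checking whether a sampled training set mirrors a
# reference distribution -- a quick way to spot selection bias before training.
from collections import Counter

# Hypothetical labels, for illustration only
reference_labels = ["lion"] * 800 + ["white_lion"] * 200   # the wider population
sampled_labels = ["lion"] * 480 + ["white_lion"] * 20      # what we actually collected

def label_shares(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

reference = label_shares(reference_labels)
sample = label_shares(sampled_labels)

for label in reference:
    print(f"{label}: reference {reference[label]:.2%}, sample {sample.get(label, 0):.2%}")
# A large gap between the two columns suggests the selection procedure,
# not the real world, is deciding what the model will learn.
```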

Biases in Interpretation

Let us now look at another category of bias that occurs while working with image datasets: the biases introduced by the way humans interpret images.

  1. Confirmation Bias is the propensity of an individual to seek or recollect information that affirms one’s pre-existing beliefs.
  2. Overgeneralization, that is, reaching an outcome simply on the basis of information that is too broad and not as specific as it should be.
  3. Correlation fallacy, in which correlation and causation are confused: something that is merely a correlation can be mistaken for a cause or reason (a small numeric sketch after this list illustrates this).
  4. Automation Bias is the idea that humans are more inclined to believe the suggestions of an automated decision-making system rather than those that do not come from automation. 
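
As a quick illustration of the correlation fallacy, the sketch below (with made-up quantities) generates two series that both follow the same upward trend; their correlation comes out close to 1 even though neither causes the other:

```python
# A minimal sketch of the correlation fallacy: two series can be strongly
# correlated without one causing the other (here both simply trend upward).
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(24)

# Hypothetical quantities, both driven by the same upward trend
ice_cream_sales = 100 + 5 * months + rng.normal(0, 5, size=24)
sunglasses_sold = 50 + 3 * months + rng.normal(0, 5, size=24)

r = np.corrcoef(ice_cream_sales, sunglasses_sold)[0, 1]
print(f"correlation: {r:.2f}")  # close to 1, yet neither causes the other
```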

Biased Data Representation 

Biased data representation means that even when one has ample data for every possible group, some of those groups may be depicted less positively than others. Biased labels are annotations in the dataset that reflect the personal views of the author or annotator of that data.
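
One simple way to surface such representation issues is to disaggregate the labels by group. The sketch below uses hypothetical group names, annotations, and an assumed "positive label" taxonomy to show how unevenly positive depictions can be distributed:

```python
# A minimal sketch (hypothetical annotations) of disaggregating labels by group
# to see whether some groups are depicted less positively than others.
from collections import defaultdict

# Hypothetical (group, label) pairs from an annotated image dataset
annotations = [
    ("group_a", "smiling"), ("group_a", "professional"), ("group_a", "smiling"),
    ("group_b", "suspicious"), ("group_b", "smiling"), ("group_b", "angry"),
]
positive_labels = {"smiling", "professional"}  # assumed label taxonomy

stats = defaultdict(lambda: [0, 0])  # group -> [positive count, total count]
for group, label in annotations:
    stats[group][1] += 1
    stats[group][0] += label in positive_labels

for group, (pos, total) in stats.items():
    print(f"{group}: {pos}/{total} positive labels ({pos / total:.0%})")
```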

Once the data collection and annotation step is in place, the second step is training the model, the third is media filtration, ranking, aggregation, and generation, and the fourth and final step is the output that people see. There is a feedback loop of human bias between the first and last of these stages: the outcomes people see feed back into future training data.

Bias Laundering

Bias Network Effect, or Bias Laundering, occurs when ML learns from human data that already contains biases and ends up perpetuating them, creating this bias network effect. Biases are not always bad; they can also be good or neutral. In statistics and ML, for instance, the bias of an estimator is simply the difference between the estimator's expected prediction and the true value.
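
Estimator bias is the easiest of these to see in code. In the sketch below (a standard textbook example, not specific to image datasets), the naive variance estimator that divides by n is biased low, while dividing by n - 1 is unbiased:

```python
# A minimal sketch of estimator bias in the statistical sense: the naive
# variance estimator (dividing by n) underestimates the true variance,
# while Bessel's correction (dividing by n - 1) removes that bias.
import numpy as np

rng = np.random.default_rng(42)
true_variance = 4.0

biased, unbiased = [], []
for _ in range(10_000):
    sample = rng.normal(0, np.sqrt(true_variance), size=5)
    biased.append(np.var(sample))            # divides by n
    unbiased.append(np.var(sample, ddof=1))  # divides by n - 1

print(f"true variance: {true_variance}")
print(f"mean of biased estimator:   {np.mean(biased):.2f}")    # noticeably below 4
print(f"mean of unbiased estimator: {np.mean(unbiased):.2f}")  # close to 4
```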

Some More Biases

Cognitive biases include confirmation, recency, and optimism bias. Algorithmic bias occurs when an AI system or algorithm introduces biases that lead to unjust and unfair treatment of individuals on the basis of race, sexual orientation, income, gender, or a plethora of other attributes that should not be taken into consideration, and may result in discrimination against such groups.

Example: Criminal Activity Prediction

Policing algorithms have a tendency to identify probable crime hot spots based on where crime was reported rather than where it actually occurred. They predict the future from the past. So how do biases play a role here? When subjects are sentenced according to a risk rating that tracks their skin colour rather than their crimes, and decision-makers defer to that rating, it is a clear depiction of automation bias.
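
The feedback loop mentioned earlier is what makes this especially dangerous. The toy simulation below (district names, patrol counts, and rates are invented for illustration) shows how sending patrols wherever recorded reports are highest inflates that district's counts even when the true crime rates are identical:

```python
# A minimal sketch (toy numbers) of a predictive-policing feedback loop:
# patrols are sent where past reports are highest, which generates more
# reports there, which attracts even more patrols in the next round.
observed_reports = {"district_a": 10, "district_b": 10}    # equal starting counts
true_crime_rate = {"district_a": 0.5, "district_b": 0.5}   # identical true rates

for step in range(5):
    # The algorithm sends most patrols to the district with the most recorded reports
    target = max(observed_reports, key=observed_reports.get)
    for district in observed_reports:
        patrols = 3 if district == target else 1
        # More patrols mean more of the same underlying crime gets recorded
        observed_reports[district] += int(patrols * true_crime_rate[district] * 10)
    print(step, observed_reports)
# Even with identical true crime rates, the targeted district's count runs away.
```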

Conclusion

In conclusion, these instances show that AI can unintentionally lead to unjust outcomes for a myriad of reasons, such as a lack of insight into biased sources of data, feedback loops, the absence of disaggregated evaluation, and human biases perpetuated during the interpretation of results.

Read more about learning paths at codedamn here.

Happy Learning!

