The Deadly Data Science Sin of Confirmation Bias

Confirmation bias occurs when people actively search for and favor information or evidence that confirms their preconceptions or hypotheses while ignoring or discounting adverse or mitigating evidence. It is a type of cognitive bias (a systematic pattern of deviation in judgment that leads to perceptual distortion, inaccurate judgment, or illogical interpretation) and represents an error of inductive inference toward confirmation of the hypothesis under study.

Data scientists exhibit confirmation bias when they actively seek out and assign more weight to evidence that confirms their hypothesis, and ignore or underweight evidence that could disconfirm it. This is a type of selection bias in collecting evidence.

Note that confirmation biases are not limited to the collection of evidence: even if two data scientists have the same evidence, their respective interpretations may be biased. In my experience, many data scientists exhibit a hidden yet deadly form of confirmation bias when they interpret ambiguous evidence as supporting their existing position. This is difficult and sometimes impossible to detect, yet it occurs frequently.

A confirmation bias study (see "Neural Bases of Motivated Reasoning: An fMRI Study of Emotional Constraints on Partisan Political Judgment in the 2004 U.S. Presidential Election" in the Journal of Cognitive Neuroscience) involved participants in the 2004 US presidential election who had strong prior feelings about the candidates. The study concluded that prior beliefs about the candidates created strong bias, leading to inaccurate judgment and illogical interpretation of contradictory statements by the respective candidates. This suggests confirmation bias is inherent in human reasoning.

Dr. John Ioannidis's 2005 essay "Why Most Published Research Findings Are False," together with his companion 2005 JAMA study of highly cited clinical research, provides strong evidence of confirmation bias among professional scientists. Ioannidis analyzed 49 of the most highly regarded research findings in medicine from the previous 13 years. He compared the "45 studies that claimed to have uncovered effective interventions with data from subsequent studies with larger sample sizes: 7 (16%) of the studies were contradicted, 7 (16%) the effects were smaller than in the initial study and 31 (68%) of the studies remained either unchallenged or the findings could not be replicated."

The evidence suggests confirmation bias is rampant and out of control in both the hard and soft sciences. Many academic and research scientists run thousands of computer simulations in which all fail to confirm or verify the hypothesis, then tweak the data, assumptions or models until confirmatory evidence appears to confirm the hypothesis. They proceed to publish the one successful result without mentioning the failures! This is unethical, may be fraudulent, and certainly produces flawed science in which a significant majority of results cannot be replicated. The resulting loss of confidence in science and its credibility among the public and policy makers has serious consequences for our future.
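
This failure mode is easy to demonstrate. The sketch below runs many experiments on pure noise and shows that publishing only the runs that clear p < 0.05 manufactures "confirmatory" evidence from randomness alone; the sample sizes and counts are illustrative values, not drawn from any real study.

```python
# Simulate the "run until it confirms" pattern: 1,000 experiments in which
# NO real effect exists, counting how many look significant by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)   # fixed seed so the run is reproducible
n_experiments = 1_000
sample_size = 30

significant = 0
for _ in range(n_experiments):
    # Both groups come from the SAME distribution: any "effect" is noise.
    a = rng.normal(loc=0.0, scale=1.0, size=sample_size)
    b = rng.normal(loc=0.0, scale=1.0, size=sample_size)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        significant += 1

print(f"{significant} of {n_experiments} null experiments appear significant")
# Roughly 5% (about 50 runs) pass the bar. Publish only those, hide the
# rest, and you have "evidence" for an effect that does not exist.
```
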
The danger for professional data science practitioners is providing clients and employers with flawed data science results that lead to bad business and policy decisions. We must learn from the experience of academic and research scientists and proactively avoid confirmation bias, or data science risks a similar loss of credibility.

It is critical for data scientists to create checks and balances to guard against confirmation bias. This means actively seeking both confirmatory and contradictory evidence and using scientific methods to weigh the evidence fairly. It also means full disclosure of all data, evidence, methods and failures.
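
One concrete check of this kind is to split the data before any analysis and require every exploratory finding to survive an untouched confirmation set. The sketch below assumes a toy two-column dataset; the variable names and the 50/50 split are illustrative choices, not a prescribed standard.

```python
# Hold out a confirmation set so exploratory findings must be re-tested
# on data that played no part in forming the hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.normal(size=(2_000, 2))   # toy dataset: [feature, outcome]

# 1. Split once, up front, before looking at anything.
explore, confirm = data[:1_000], data[1_000:]

# 2. Explore freely on the first half (this is where bias creeps in).
r_explore, p_explore = stats.pearsonr(explore[:, 0], explore[:, 1])

# 3. A finding only counts if it also holds on the untouched half.
r_confirm, p_confirm = stats.pearsonr(confirm[:, 0], confirm[:, 1])

print(f"explore: r={r_explore:+.3f}, p={p_explore:.3f}")
print(f"confirm: r={r_confirm:+.3f}, p={p_confirm:.3f}")
```

Both numbers, including a failure to confirm, belong in the report to the client: disclosing disconfirming evidence is part of the process, not optional.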

The Data Science Code of Professional Conduct of the Data Science Association provides ethical guidelines to help the data science practitioner. Rule 8 - Data Science Evidence, Quality of Data and Quality of Evidence - states in relevant part:

(a) A data scientist shall inform the client of all data science results and material facts known to the data scientist that will enable the client to make informed decisions, whether or not the data science evidence is adverse.

(b) A data scientist shall rate the quality of data and disclose such rating to the client to enable the client to make informed decisions. The data scientist understands that bad or uncertain data quality may compromise data science professional practice and may communicate a false reality or promote an illusion of understanding. The data scientist shall take reasonable measures to protect the client from relying on and making decisions based on bad or uncertain data quality.

(c) A data scientist shall rate the quality of evidence and disclose such rating to the client to enable the client to make informed decisions. The data scientist understands that evidence may be weak, strong or uncertain and shall take reasonable measures to protect the client from relying on and making decisions based on weak or uncertain evidence.

(f) A data scientist shall not knowingly:

(1) fail to use scientific methods in performing data science;

(2) fail to rank the quality of evidence in a reasonable and understandable manner for the client;

(3) claim weak or uncertain evidence is strong evidence;

(4) misuse weak or uncertain evidence to communicate a false reality or promote an illusion of understanding;

(5) fail to rank the quality of data in a reasonable and understandable manner for the client;

(6) claim bad or uncertain data quality is good data quality;

(7) misuse bad or uncertain data quality to communicate a false reality or promote an illusion of understanding;

(8) fail to disclose any and all data science results or engage in cherry-picking;

(9) fail to attempt to replicate data science results;

(10) fail to disclose that data science results could not be replicated;

(11) misuse data science results to communicate a false reality or promote an illusion of understanding;

(12) fail to disclose failed experiments or disconfirming evidence known to the data scientist to be directly adverse to the position of the client;

(13) offer evidence that the data scientist knows to be false. If a data scientist questions the quality of data or evidence the data scientist must disclose this to the client. If a data scientist has offered material evidence and the data scientist comes to know of its falsity, the data scientist shall take reasonable remedial measures, including disclosure to the client. A data scientist may disclose and label evidence the data scientist reasonably believes is false;

(14) cherry-pick data and data science evidence.

This is wise counsel for the dedicated professional data scientist, and absolutely necessary to maintain credibility and confidence with both clients and the public.
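
As a practical complement to rules (8), (12) and (14), here is a minimal sketch of record keeping that resists cherry-picking: log every run the moment it finishes and generate client reports from the full log. The ExperimentLog class and its names are hypothetical illustrations, not part of the Code or any standard library.

```python
# Minimal experiment registry: every run is recorded, and the report
# always includes failed and disconfirming runs alongside successes.
from dataclasses import dataclass, field


@dataclass
class ExperimentLog:
    runs: list = field(default_factory=list)

    def record(self, name: str, metric: float, supports_hypothesis: bool) -> None:
        # Append the moment the experiment finishes -- no retroactive edits.
        self.runs.append((name, metric, supports_hypothesis))

    def full_report(self) -> str:
        # The report is generated from the complete log, so cherry-picking
        # would require deleting the log itself.
        lines = [f"{n}: metric={m:.3f}, supports hypothesis: {s}"
                 for n, m, s in self.runs]
        lines.append(f"total runs disclosed: {len(self.runs)}")
        return "\n".join(lines)


log = ExperimentLog()
log.record("model_v1", 0.52, False)   # failed attempt, still disclosed
log.record("model_v2", 0.71, True)
print(log.full_report())
```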