The Deadly Data Science Sin of Confirmation Bias
Confirmation bias occurs when people actively search for and favor information or evidence that confirms their preconceptions or hypotheses while ignoring or discounting adverse or mitigating evidence. It is a type of cognitive bias (a systematic pattern of deviation in judgment that leads to perceptual distortion, inaccurate judgment, or illogical interpretation) and represents an error of inductive inference toward confirmation of the hypothesis under study.
Data scientists exhibit confirmation bias when they actively seek out and assign more weight to evidence that confirms their hypothesis while ignoring or underweighting evidence that could disconfirm it. This is a type of selection bias in collecting evidence.
Note that confirmation biases are not limited to the collection of evidence: even if two data scientists have the same evidence, their respective interpretations may be biased. In my experience, many data scientists exhibit a hidden yet deadly form of confirmation bias when they interpret ambiguous evidence as supporting their existing position. This is difficult, and sometimes impossible, to detect, yet it occurs frequently.
A confirmation bias study (see "Neural Bases of Motivated Reasoning: An fMRI Study of Emotional Constraints on Partisan Political Judgment in the 2004 U.S. Presidential Election" in the Journal of Cognitive Neuroscience, MIT Press) involved participants in the 2004 US presidential election who had strong prior feelings about the candidates. The study concluded that prior beliefs about the candidates created a strong bias, leading to inaccurate judgment and illogical interpretation of contradictory statements made by the respective candidates. This suggests that confirmation bias is inherent in human reasoning.
The evidence suggests confirmation bias is rampant and out of control in both the hard and soft sciences. Many academic or research scientists run thousands of computer simulations in which all fail to confirm or verify the hypothesis. Then they tweak the data, assumptions, or models until confirmatory evidence appears to confirm the hypothesis. They proceed to publish the one successful result without mentioning the failures! This is unethical, may be fraudulent, and certainly produces flawed science in which a significant majority of results cannot be replicated. It has created a loss of confidence in, and credibility for, science among the public and policy makers, with serious consequences for our future.
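To make the danger concrete, the short simulation below is a minimal sketch (not from any study cited here; the run counts and sample sizes are illustrative) of what selective reporting does. Every experiment is run on pure noise, so the hypothesis is false by construction, yet reporting only the best-looking run makes it appear confirmed.

```python
# Minimal sketch: many experiments on pure noise, then "publish" only the best one.
# All parameters below are illustrative assumptions, not values from the article.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 1000   # hypothetical number of tweaked/repeated runs
n_samples = 30         # observations per group in each run

p_values = []
for _ in range(n_experiments):
    # Two groups drawn from the SAME distribution: any apparent effect is noise.
    a = rng.normal(0.0, 1.0, n_samples)
    b = rng.normal(0.0, 1.0, n_samples)
    _, p = stats.ttest_ind(a, b)
    p_values.append(p)

p_values = np.array(p_values)
print(f"Runs 'significant' at p < 0.05 despite no real effect: {(p_values < 0.05).sum()}")
print(f"Best (cherry-picked) p-value: {p_values.min():.4f}")
```

At a 0.05 threshold, roughly five percent of these noise-only runs will look significant by chance, so a scientist who discloses only the single best run is presenting noise as confirmation.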
It is critical for data scientists to create checks-and-balances processes to guard against confirmation bias. This means actively seeking both confirmatory and contradictory evidence and using scientific methods to weigh the evidence fairly. It also means full disclosure of all data, evidence, methods, and failures.
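One possible check-and-balance is sketched below. The names (ExperimentRecord, log_run, disclose) are hypothetical and not part of any standard library or of the Code discussed next; the idea is simply an append-only log of every run, adverse or favorable, summarized with a multiple-testing adjustment before any result is presented as evidence.

```python
# Minimal sketch of a disclosure log; all names here are illustrative assumptions.
import json
from dataclasses import dataclass, asdict

@dataclass
class ExperimentRecord:
    run_id: int
    hypothesis: str
    p_value: float
    supports_hypothesis: bool  # recorded whether or not the result is adverse

def log_run(record: ExperimentRecord, path: str = "experiment_log.jsonl") -> None:
    """Append every run - confirmatory or disconfirmatory - so none can be silently dropped."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

def disclose(records: list, alpha: float = 0.05) -> None:
    """Summarize the full record, applying a Bonferroni adjustment across all runs."""
    adjusted = alpha / max(len(records), 1)
    adverse = sum(not r.supports_hypothesis for r in records)
    significant = [r for r in records if r.p_value < adjusted]
    print(f"Total runs: {len(records)}, adverse runs: {adverse}")
    print(f"Runs significant at Bonferroni-adjusted alpha = {adjusted:.5f}: {len(significant)}")
```

Because the log is append-only and is summarized in full, cherry-picking a single favorable run becomes visible to the client rather than invisible.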
The Data Science Code of Professional Conduct of the Data Science Association provides ethical guidelines to help the data science practitioner. Rule 8 - Data Science Evidence, Quality of Data and Quality of Evidence - states in relevant part:
(a) A data scientist shall inform the client of all data science results and material facts known to the data scientist that will enable the client to make informed decisions, whether or not the data science evidence are adverse.
(b) A data scientist shall rate the quality of data and disclose such rating to client to enable client to make informed decisions. The data scientist understands that bad or uncertain data quality may compromise data science professional practice and may communicate a false reality or promote an illusion of understanding. The data scientist shall take reasonable measures to protect the client from relying and making decisions based on bad or uncertain data quality.
(c) A data scientist shall rate the quality of evidence and disclose such rating to client to enable client to make informed decisions. The data scientist understands that evidence may be weak or strong or uncertain and shall take reasonable measures to protect the client from relying and making decisions based on weak or uncertain evidence.
(f) A data scientist shall not knowingly:
(1) fail to use scientific methods in performing data science;
(2) fail to rank the quality of evidence in a reasonable and understandable manner for the client;
(3) claim weak or uncertain evidence is strong evidence;
(4) misuse weak or uncertain evidence to communicate a false reality or promote an illusion of understanding;
(5) fail to rank the quality of data in a reasonable and understandable manner for the client;
(6) claim bad or uncertain data quality is good data quality;
(7) misuse bad or uncertain data quality to communicate a false reality or promote an illusion of understanding;
(8) fail to disclose any and all data science results or engage in cherry-picking;
(9) fail to attempt to replicate data science results;
(10) fail to disclose that data science results could not be replicated;
(11) misuse data science results to communicate a false reality or promote an illusion of understanding;
(12) fail to disclose failed experiments or disconfirming evidence known to the data scientist to be directly adverse to the position of the client;
(13) offer evidence that the data scientist knows to be false. If a data scientist questions the quality of data or evidence the data scientist must disclose this to the client. If a data scientist has offered material evidence and the data scientist comes to know of its falsity, the data scientist shall take reasonable remedial measures, including disclosure to the client. A data scientist may disclose and label evidence the data scientist reasonably believes is false;
(14) cherry-pick data and data science evidence.
This is wise counsel for the dedicated professional data scientist, and it is absolutely necessary for maintaining credibility and confidence with both clients and the public.