Data Scientists Sometimes Fool Themselves
The easiest person in the world to fool is yourself. Data scientists sometimes fool themselves - in matters trivial and important. Thus, I suggest we acknowledge real or subconscious biases in ourselves, the data, the analysis and groupthink. It is prudent for data science teams to have both internal and external checks and balances to expose potential biases and better understand objective reality. Here are a few ways data scientists sometimes fool themselves:
Confirmation bias: tendency to favor data that confirms beliefs or hypotheses.
Naive rationalism bias: thinking that the reasons for things are, by default, accessible to you.
Funding - agency bias: intentional or unconscious skewing of data, assumptions and interpretations to favor the interests of the party that financially supports the data science work.
Data selection bias: skewing the selection of data sources toward those most available, convenient and cost-effective, in contrast to those most valid and relevant for the specific study. Data scientists face budget, data-source and time limits - and thus may introduce unconscious bias in which data sets they are able to select and which they exclude.
Cherry picking bias: pointing to individual cases or data that seem to confirm a particular position, while ignoring a significant portion of related cases or data that may contradict that position.
Cognitive bias: skewing decisions based on pre-existing cognitive and heuristic factors (e.g., intuition) rather than on data and evidence. Biases in judgment or decision-making can also result from motivation, such as when beliefs are distorted by wishful thinking. Some biases have a variety of cognitive ("cold") or motivational ("hot") explanations.
Omitted variable bias: appears in estimates of parameters in a regression analysis when the assumed specification is incorrect, in that it omits an independent variable that should be in the model.
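A toy simulation makes the omitted variable problem concrete. In the sketch below (all coefficients invented for illustration), the true model depends on two predictors, x and a correlated variable z; regressing y on x alone absorbs z's effect into the x coefficient, inflating it from its true value of 2 toward 2 + 3 × 0.5 = 3.5:

```python
import random

random.seed(0)

# True model: y = 2*x + 3*z + noise, where z is correlated with x (z = 0.5*x + u).
# Omitting z biases the estimated slope on x toward 2 + 3*0.5 = 3.5.
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
z = [0.5 * xi + random.gauss(0, 1) for xi in x]
y = [2 * xi + 3 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

def ols_slope(xs, ys):
    """Simple-regression slope estimate: cov(x, y) / var(x)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    return cov / var

slope = ols_slope(x, y)
print(f"slope on x with z omitted: {slope:.2f}")  # near 3.5, not the true 2
```

No amount of additional data corrects this: the bias comes from the model specification, not the sample size.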
Sampling bias: systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others and resulting in a biased sample. Skewing the sampling of data sets toward the subgroups most relevant to the initial scope of the data science project makes it unlikely that you will uncover meaningful correlations that apply to other segments.
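To make the sampling problem concrete, here is a minimal sketch (all numbers invented for illustration) comparing a convenience sample drawn from only one side of a population against a proper random sample. The biased sample badly overestimates the population mean:

```python
import random

random.seed(1)

# Invented population: a score with mean 100 and standard deviation 15.
population = [random.gauss(100, 15) for _ in range(100000)]

# Biased sample: only the units that are "easy to reach" -- here, those scoring above 100.
biased_sample = [v for v in population if v > 100][:500]

# Random sample: every member of the population is equally likely to be included.
random_sample = random.sample(population, 500)

pop_mean = sum(population) / len(population)
biased_mean = sum(biased_sample) / len(biased_sample)
random_mean = sum(random_sample) / len(random_sample)
print(f"population {pop_mean:.1f}, biased sample {biased_mean:.1f}, "
      f"random sample {random_mean:.1f}")
```

The random sample's mean lands close to the true population mean, while the convenience sample's mean is pulled far upward - no amount of extra data from the same biased source fixes this.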
Data dredging bias: using regression or mining techniques to find correlations in small or selected samples that may not be statistically significant in, or generalize to, the wider population.
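One way to see data dredging concretely: test enough pure-noise features against a target and some will look "significant" purely by chance. A minimal sketch in plain Python (all data here is invented noise; the 0.361 cutoff approximates the two-sided p < 0.05 critical value for Pearson's r at n = 30):

```python
import random

random.seed(42)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

n_samples, n_features = 30, 200
target = [random.gauss(0, 1) for _ in range(n_samples)]

# Every feature is pure noise -- no real relationship to the target exists.
features = [[random.gauss(0, 1) for _ in range(n_samples)]
            for _ in range(n_features)]

# |r| > 0.361 roughly corresponds to p < 0.05 (two-sided) at n = 30,
# so we expect about 5% of the noise features to clear it by chance.
spurious = sum(1 for f in features if abs(pearson_r(f, target)) > 0.361)
print(f"'significant' correlations found in pure noise: {spurious}")
```

Dredge 200 hypotheses at a 5% significance level and roughly ten spurious "discoveries" are expected even when nothing real is there - which is why corrections for multiple comparisons and out-of-sample validation matter.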
Projection bias: tendency to assume that most folks think just like us, though there may be no justification for it - assuming that a consensus exists on matters when there may be none, or the exaggerated confidence one has when predicting the winner of an election or sports match.
Modeling bias: skewing models by starting with a biased set of project assumptions that drive selection of the wrong variables, the wrong data, the wrong algorithms and the wrong metrics of fitness. This includes overfitting models to past data without regard for predictive lift, and failing to score and iterate models in a timely fashion with fresh observational data.
Reporting bias: skewing availability of data, such that observations of a certain kind may be more likely to be reported and consequently used in research.
Data snooping bias: misuse of data mining techniques.
Exclusion bias: systematic exclusion of certain cases, groups or variables from the analysis.
Ingroup bias: tendency to favor one's own group - causes us to overestimate the abilities and values of our immediate group at the expense of others we don't really know.
Observation selection bias: data is filtered not only by study design and measurement, but by the necessary precondition that there has to be someone doing a study. In situations where the existence of the observer or the study is correlated with the data, observation selection effects occur, and anthropic reasoning is required.
Agency problem bias: moral hazard and conflicts of interest may arise in any relationship where one party is expected to act in another's best interests. The problem is that the agent who is supposed to make the decisions that would best serve the principal is naturally motivated by self-interest, and the agent's own best interests may differ from the principal's. The two parties have different interests and asymmetric information (the agent having more), such that the principal cannot directly ensure that the agent is always acting in the principal's best interests, particularly when activities that are useful to the principal are costly to the agent, and where elements of what the agent does are costly for the principal to observe. Agents may hide risks and structure relationships so that when they are right they collect large benefits, and when they are wrong others pay the price.