Hypothesis Formation

As something of a sequel to my earlier post on causality: where do hypotheses come from?

The ideal hypothesis:

  • Has a basis in a reasonable model (engineering, physical, economic, etc.).
  • Is as simple as possible in terms of the number of variables, i.e., Occam's Razor has been applied.
  • Either has been vetted against a number of other hypotheses and selected as the most reasonable, or will be tested along with other reasonable hypotheses.
  • Will be tested by the gold standard, the randomized controlled experiment.
  • Is actionable.

Real life is not ideal, so below I discuss compromises and trade-offs involved in hypothesis formation.

Basis in a Model

As discussed in my causality blog entry, the only way to assign causality is to develop a rational model of how things really work; causality cannot simply be read off the output of some multivariate correlation done in R. The best hypotheses are rooted in causation, though it is of course possible to hypothesize any conjecture at all, including one drawn from statistical correlations discovered during data exploration. Discovery from data as a source of hypotheses is better than pulling them from thin air, but hypothesizing from a model is best of all. Hypothesizing from data is called induction and hypothesizing from a model is called deduction.


Occam's Razor

The fewer the variables, the stronger and more robust the hypothesis, meaning the more likely it is to hold up under a variety of conditions. E.g., suppose we induce a hypothesis from data exploration that teenage girls who use Twitter like Justin Bieber. A stronger hypothesis (if it turns out to be true) would drop the Twitter condition, not only because that broadens the potential market for Justin Bieber products, but also because the hypothesis becomes more resilient in varied circumstances, such as a time when (assuming some sort of unlikely calamity befalls Twitter) Twitter is no longer popular and something else takes its place.
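To make the Twitter example concrete, here is a minimal Python sketch. The Person records, the condition functions, and the covered helper are all made up for illustration (none of this is real data). It only shows the structural point: dropping a condition can never shrink the group a hypothesis applies to. Whether the broader hypothesis is actually true still has to be tested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Hypothetical record; the fields are assumptions for this example only.
    age: int
    gender: str
    uses_twitter: bool


def is_teen(p: Person) -> bool:
    return 13 <= p.age <= 19


def is_girl(p: Person) -> bool:
    return p.gender == "F"


def on_twitter(p: Person) -> bool:
    return p.uses_twitter


def covered(people, conditions):
    """Return the people who satisfy every condition of a hypothesis."""
    return [p for p in people if all(cond(p) for cond in conditions)]


# Toy population, invented purely for illustration.
people = [
    Person(15, "F", True),
    Person(16, "F", False),
    Person(17, "F", False),
    Person(30, "F", True),
    Person(15, "M", True),
]

narrow = covered(people, [is_teen, is_girl, on_twitter])  # three conditions
broad = covered(people, [is_teen, is_girl])               # Twitter condition dropped

print(len(narrow), "people covered by the narrow hypothesis")
print(len(broad), "people covered by the broad hypothesis")
```

Everyone covered by the narrow hypothesis is also covered by the broad one, so the broad version survives a change in who uses Twitter.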

Vetted Against Competing Hypotheses

When forming a hypothesis, it is important to brainstorm as many different plausible hypotheses as possible, from a variety of sources:

  • As with conventional brainstorming, ask fellow team members and associates for their creative hypotheses.
  • Formulate as complete a model as possible, and from that model identify explanations. E.g., when modeling a consumer:
    • What is the consumer's budget?
    • What is the pay schedule of the consumer?
    • Are there upcoming holidays that would either enhance purchases (in anticipation) or hinder them (due to store or bank closures)?
    • What products complement the products the consumer already owns?
    • What products would enhance the social standing of the consumer?
    • Does the consumer carry credit cards that are accepted?
    • Is the consumer a student?
    The model doesn't have to be complete and fully accurate -- just enough to spark brainstorming. I.e., it's not necessary to create a Bayesian Belief Network or Root-Cause Analysis Fishbone just to hypothesize.
  • Identify leading hypotheses and test them. This is easier said than done. "Identifying" is a nice way of saying "going with a hunch," because the alternative, testing, is very expensive if done by the gold standard, the controlled randomized experiment.

And by so identifying a leading hypothesis, one becomes subject to the cherry-picking I discussed in the Panel on "Resolved: Traditional Statistics is Dead". It's nice to pick the best hypothesis from a bunch, but to ensure you don't stumble into a spurious correlation, it is ideally necessary to test all similar hypotheses. In the example proffered in the forum, there turned out to be a correlation between Super Bowls and presidential elections. Aside from the obvious modeling deficiencies, my response was to ask whether correlations between MLB pennants or NHL cups and presidential elections had also been tested. However, the alternative to picking good hypotheses is to leave it to chance, which is not productive. So pick good hypotheses, but beware of spurious correlations, especially if your hypothesis came from induction from the data rather than deduction from a model.
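The danger of testing only the one hypothesis that happened to look good can be sketched with a small simulation. This is my own illustration, not something from the panel: every "predictor" below is a coin flip with no relationship to the "outcome" (think of a season's championship winner versus an election result, both reduced to 0 or 1). The function names, the exact binomial test, and the Bonferroni correction are assumptions I chose for the sketch; other corrections for multiple testing would work too.

```python
import random
from math import comb


def binomial_two_sided_p(matches: int, n: int) -> float:
    """Exact two-sided p-value for seeing `matches` agreements in n trials
    when each trial agrees with probability 1/2 (i.e., no real relationship)."""
    dev = abs(matches - n / 2)
    # Count all outcomes at least as far from n/2 as the observed one.
    tail = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= dev)
    return tail / 2 ** n


def spurious_hits(n_hypotheses: int, n_obs: int, alpha: float, seed: int = 0):
    """Test many unrelated predictors against one outcome.

    Returns (hits at the naive threshold, hits after Bonferroni correction).
    """
    rng = random.Random(seed)
    outcome = [rng.randint(0, 1) for _ in range(n_obs)]
    p_values = []
    for _ in range(n_hypotheses):
        predictor = [rng.randint(0, 1) for _ in range(n_obs)]  # pure noise
        matches = sum(p == o for p, o in zip(predictor, outcome))
        p_values.append(binomial_two_sided_p(matches, n_obs))
    naive = sum(p < alpha for p in p_values)
    # Bonferroni: divide the threshold by the number of hypotheses tested.
    corrected = sum(p < alpha / n_hypotheses for p in p_values)
    return naive, corrected


naive, corrected = spurious_hits(n_hypotheses=200, n_obs=20, alpha=0.05)
print("significant at the naive threshold:", naive)
print("significant after Bonferroni:", corrected)
```

With enough unrelated predictors, some will usually clear the naive threshold by chance alone, which is exactly the cherry-picking trap. Counting every hypothesis that was tried, and tightening the threshold accordingly, is one way to guard against it.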

Controlled Randomized Experiment

Controlled randomized experiments are the gold standard, but they are expensive and time-consuming. It is much more convenient and quicker to find and test correlations in existing data sets, but such correlations are fraught with problems: a population that is not randomized across the independent variables of the new hypothesis, limited data for train vs. test that effectively leads to test data becoming training data, experimental conditions that differ from those of interest, etc.

But from a practical standpoint, "quasi-experiments" (experiments from an existing data set) are the general rule encountered in practice and "experiments" are, realistically, the exception. Compensating for the shortcomings of quasi-experiments will be the subject of a future article.
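For contrast, the step that an existing data set cannot give you, random assignment, is short to write down. The sketch below is a minimal illustration under my own assumptions: the randomize function and the customer IDs are hypothetical, and a real experiment would also need to handle things like sample size and stratification.

```python
import random


def randomize(units, seed=None):
    """Randomly split units into a treatment group and a control group.

    Because assignment depends only on the shuffle, it cannot be correlated
    with any property of the units, known or unknown.
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical customer IDs, invented for this example.
customer_ids = range(1000, 1100)
treatment, control = randomize(customer_ids, seed=42)
print(len(treatment), "customers in treatment,", len(control), "in control")
```

In a quasi-experiment, by contrast, who ended up "treated" was decided by whatever process generated the historical data, so the two groups may differ in ways that have nothing to do with the hypothesis.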


Actionable

You can have the most interesting, perhaps even insightful, hypothesis, but if there is no reasonable course of action to take once it is proven, it's a waste of time to prove it.


Conclusion

Good hypothesis formation:

  • Avoids wasting time testing bad hypotheses.
  • Saves time that can be redirected toward testing the best hypotheses, including testing hypotheses adjacent to the leading hypotheses to avoid spurious correlations.
  • Results in more resilient, more actionable insights.