Data Veracity vs. Data Quality

The terms "data veracity" and "data quality" are often confused.

Data veracity is sometimes thought of as uncertain or imprecise data, yet it may be more precisely defined in terms of false or inaccurate data. The data may be intentionally, negligently, or mistakenly falsified. Data veracity may be distinguished from data quality, which is usually defined as the reliability and application efficiency of data, and which is sometimes used to describe incomplete, uncertain, or imprecise data.

The unfortunate reality is that in most data analytics projects, half or more of the time is spent on "data preparation" processes (e.g., removing duplicates, fixing partial entries, eliminating null/blank entries, concatenating data, collapsing or splitting columns, aggregating results into buckets, etc.). I suggest this is a "data quality" issue, in contrast to false or inaccurate data, which is a "data veracity" issue.
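
To make a few of these preparation steps concrete, here is a minimal Python sketch using only the standard library. The records, the field names ("name", "age"), and the helper functions prepare and bucket_by_decade are all hypothetical, invented for illustration; real projects would typically use a dedicated data library and far messier input.

```python
from collections import Counter

# Hypothetical raw records; every name and value here is invented.
raw_rows = [
    {"name": "Ann Lee", "age": "36"},
    {"name": "Ann Lee", "age": "36"},  # exact duplicate
    {"name": "Bob Ray", "age": ""},    # blank entry
    {"name": "Cy Day", "age": "85"},
    {"name": "Di Fox", "age": None},   # null entry
    {"name": "Ed Kim", "age": "54"},
]


def prepare(rows):
    """Remove duplicates and null/blank entries, then split the name column."""
    seen = set()
    cleaned = []
    for row in rows:
        key = (row["name"], row["age"])
        if key in seen:
            continue  # removing duplicates
        seen.add(key)
        if row["age"] in (None, ""):
            continue  # eliminating null/blank entries
        first, last = row["name"].split(" ", 1)  # splitting a column
        cleaned.append({"first": first, "last": last, "age": int(row["age"])})
    return cleaned


def bucket_by_decade(rows):
    """Aggregate ages into ten-year buckets such as '30-39'."""
    counts = Counter()
    for row in rows:
        low = (row["age"] // 10) * 10
        counts[f"{low}-{low + 9}"] += 1
    return dict(counts)


cleaned_rows = prepare(raw_rows)
age_buckets = bucket_by_decade(cleaned_rows)
```

Note that every step here repairs the form of the data. None of them can tell whether a stated age is actually true.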

Data veracity is a serious issue that supersedes data quality issues: if the data is objectively false, then any analytical results are meaningless and unreliable regardless of data quality. Moreover, data falsity creates an illusion of reality that may lead to bad decisions and fraud, sometimes with civil liability or even criminal consequences.
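
The point can be illustrated with a small Python sketch in which a data set passes every quality-style check and is still false. All figures are invented, and the sketch assumes that an independent, trusted reference (here called audited) is available for comparison; that is an assumption made for illustration, and in practice it is often the hard part.

```python
# Hypothetical reported figures and an independent reference; all values invented.
reported = {"store_a": 120, "store_b": 95, "store_c": 400}
audited = {"store_a": 120, "store_b": 95, "store_c": 140}


def passes_quality_checks(records):
    """Quality-style check: every value is present and a non-negative integer."""
    return all(isinstance(v, int) and v >= 0 for v in records.values())


def veracity_mismatches(records, reference):
    """Return the keys whose reported value disagrees with the reference."""
    return sorted(k for k, v in records.items() if reference.get(k) != v)


quality_ok = passes_quality_checks(reported)         # True: the data looks clean
mismatches = veracity_mismatches(reported, audited)  # ['store_c']
reported_total = sum(reported.values())              # 615
audited_total = sum(audited.values())                # 355
```

The reported data is complete and well formed, yet its total is far above the audited total, so any analysis built on it inherits the falsehood.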