Introduction to Data Lineage

Sophisticated modern businesses such as banks and insurers are data-rich. Data is fundamental to their effectiveness and efficiency.

However, data is relevant beyond the business processes that create it. Many classes of data are essential outside their main business purpose, whether for internal reporting and analysis, for use by other applications or for exchange with third parties. Examples include producing consolidated reporting from distributed sales applications, feeding a general ledger and producing regulatory reports.

Data is copied from application and data silos into reporting and data integration solutions such as data warehouses and data marts. Increasingly, external data is integrated with internal data. In financial services, instrument data is purchased and integrated before onward distribution to internal systems for trading and analysis. In retail, credit risk data is consumed and used for customer sales and profiling.

All this data movement requires convoluted networks of data extraction, transformation and loading to achieve the desired business outcomes. Many millions of individual data items are processed and moved every day. There are often huge legacy IT estates that support numerous business requirements, in what is sometimes referred to as the 'integration hairball'. The processes and IT systems that join together silos of disparate data are often incompatible and poorly documented. Together, these factors mean that some data will end up inaccurate or misleading, and the business processes and decisions that depend on it will lose effectiveness.

Data lineage is the process of understanding, documenting and visualising this data as it moves from origination to consumption. It includes tracking data upstream from its end point to confirm that the data is accurate and consistent. It covers the path from origin to destination, followed forwards or backwards, starting from any point along that path.
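One common way to picture this is as a directed graph in which each system or data set is a node and each data movement is an edge. The sketch below is illustrative only: the system names and the functions upstream and downstream are hypothetical, and real lineage tooling usually works at the level of individual fields and transformations.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source, target) means data flows source -> target.
EDGES = [
    ("sales_app_emea", "sales_warehouse"),
    ("sales_app_apac", "sales_warehouse"),
    ("sales_warehouse", "finance_mart"),
    ("finance_mart", "general_ledger_feed"),
    ("finance_mart", "regulatory_report"),
]


def build_index(edges):
    """Index the edges in both directions so the path can be walked either way."""
    forward = defaultdict(set)
    backward = defaultdict(set)
    for source, target in edges:
        forward[source].add(target)
        backward[target].add(source)
    return forward, backward


def reachable(start, neighbours):
    """Breadth-first walk returning every node reachable from start (excluding start)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


FORWARD, BACKWARD = build_index(EDGES)


def upstream(node):
    """Everything the given node ultimately depends on (towards origination)."""
    return reachable(node, BACKWARD)


def downstream(node):
    """Everything that ultimately consumes the given node (towards consumption)."""
    return reachable(node, FORWARD)


if __name__ == "__main__":
    print(sorted(upstream("regulatory_report")))
    print(sorted(downstream("sales_warehouse")))
```

Walking the same graph in either direction, from any starting node, is what lets a single lineage model answer both "where did this come from?" and "what depends on this?".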

Data lineage is used to help govern and control data: to confirm that it comes from a reliable source, is transformed appropriately and is loaded correctly to its designated location. It has great importance in a business environment where key decisions rely on accurate information. Without appropriate technology and processes in place, tracking data can be virtually impossible or, at the very least, a costly and time-consuming endeavour.

The main use cases where data lineage is an essential tool are analysing data errors, analysing the impact on downstream consumers of changes to data structures or systems, and reporting data provenance to regulators. These use cases help to explain:

Error resolution – a business analyst is trying to make sense of an unfamiliar metric in a generated BI report. The analyst reports the problem to IT support or the help desk, and an IT resource looks over the source code or specifications to work out where the information came from and what transformations it has gone through. It can take days to solve this problem, time that could have been spent more efficiently with appropriate tooling.

Impact analysis – business data requirements change frequently, and the IT systems that deliver the data are in a constant cycle of development, testing and release. The capability to analyse and visualise data lineage permits greater control and governance of the change cycle.

Regulatory reporting – the financial crisis brought in a wide range of new regulations intended to identify trouble early and to help financial institutions become better at managing risk. Regulators started highlighting the importance of financial institutions being able to validate the accuracy of compliance reports. This has heightened the importance of data lineage: regulators are demanding transparency and mandating that data lineage is documented and reported. The enforcement of data lineage is an important milestone in this industry because, historically, it was more important to produce reports on time than to demonstrate whether the data used for those reports was accurate and consistent. Modern data tools can automate much of this workload, which should improve the data lifecycle, reduce human error and free up funds set aside for compliance breaches so that they can be invested in more lucrative ventures.
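The error resolution and regulatory reporting cases above both come down to the same question: where did a reported figure come from, and what was done to it along the way? The following is a minimal sketch of how recorded lineage could answer that, assuming field-level lineage has already been captured. The field names, the transformation descriptions and the trace_back function are all hypothetical and chosen only for illustration.

```python
# Hypothetical field-level lineage: each target field maps to the
# (source field, transformation applied) pairs that feed it.
LINEAGE = {
    "bi_report.net_revenue": [
        ("finance_mart.revenue", "subtract refunds"),
    ],
    "finance_mart.revenue": [
        ("sales_warehouse.order_total", "sum by month"),
    ],
    "sales_warehouse.order_total": [
        ("sales_app.order_lines", "quantity * unit_price, converted to GBP"),
    ],
}


def trace_back(field, lineage, depth=0, seen=None):
    """Return indented lines describing how a field was derived, hop by hop."""
    seen = set() if seen is None else seen
    if field in seen:
        # Guard against cycles in badly recorded lineage.
        return []
    seen.add(field)

    lines = []
    for source, transformation in lineage.get(field, []):
        lines.append(f"{'  ' * depth}{field} <- {source} [{transformation}]")
        lines.extend(trace_back(source, lineage, depth + 1, seen))
    return lines


if __name__ == "__main__":
    for line in trace_back("bi_report.net_revenue", LINEAGE):
        print(line)
```

With lineage recorded in this way, the question that might otherwise take days of reading source code and specifications becomes a lookup, and the same trace can serve as provenance evidence for a reported figure.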