Want RoI from Big Data? Focus on Data Quality

Poor-quality data can reduce, if not completely kill, the RoI from data-harnessing technologies like Big Data.

Preeti Gaur

The changing business environment and business users' unquenchable thirst for more information to make sound decisions are among the reasons for the explosive growth of data-handling technologies. "Big Data" is the latest and most disruptive in that genre. As Big Data exploration moves from sandboxing and PoCs to mainstream implementation, focus will eventually shift to the less-talked-about sibling of this group: Data Quality.

Data Quality in traditional BI applications

Organizations face a perennial challenge with data quality. Poor data quality reduces, if not decimates, the RoI of data-harnessing technologies. According to several widely cited studies, poor data quality costs hundreds of billions of dollars and is also one of the leading causes of failure for DW/BI projects.

In our experience, heavily regulated industries such as healthcare and financial services have spent considerable time and effort avoiding penalties and lawsuits over incorrect data and the reporting built on it. If incorrect regulatory reporting has such an impact, one can imagine the impact of business decisions based on poor-quality data.

The Big Data quality problem

Big Data adds new dimensions (the three V's, and more as the definition crystallizes) to the data-quality problems that organizations are already grappling with.

Volume - With the increase in volume and the rush to store data, the same quality policies tend to be applied unchanged. The variety of sources from which the data arrives poses a further challenge, as the source systems may not enforce a similar level of data-quality rules.

Variety of data - Data ranges from structured to semi-structured to unstructured, coming from sources as different as social media feeds and machine-generated data. This raises the unique issue of identifying and applying data-quality rules (illustrated in the sketch after this list).

Velocity - The speed at which data is generated and stored increases the difficulty of identifying and implementing data-quality principles.
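
To make the variety problem concrete, here is a minimal sketch (field names are hypothetical): a single validity rule is applied uniformly to records arriving from a structured CSV feed and a semi-structured JSON feed. The rule itself never changes; only the adapter that turns each source into records does.

```python
# A single data-quality rule applied across varied source formats.
import csv
import io
import json
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def valid_email(record: dict) -> bool:
    """The rule itself is format-agnostic: it only sees a dict."""
    return bool(EMAIL_RE.match(record.get("email", "")))

def records_from_csv(text: str):
    """Structured source: each CSV row becomes a record."""
    yield from csv.DictReader(io.StringIO(text))

def records_from_json_lines(text: str):
    """Semi-structured source: each JSON line becomes a record."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)

csv_feed = "id,email\n1,a@example.com\n2,not-an-email\n"
json_feed = '{"id": 3, "email": "b@example.com"}\n{"id": 4}\n'

for source in (records_from_csv(csv_feed), records_from_json_lines(json_feed)):
    for rec in source:
        print(rec.get("id"), "valid" if valid_email(rec) else "INVALID")
```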

C-level executives and business leaders recognize the issue of data quality, but get cold feet about investing in data-quality initiatives owing to the high investment required and the seemingly intangible RoI these initiatives generate. Yet more than 60 percent of IT leaders admit to the poor quality of their organizations' data.

Many Big Data vendors that provide cloud-based services have prominently flagged quality issues in the data they receive for analytics. Our own experience is no different.

Considerations to Address the Data Quality Problem

Data quality can be addressed by either proactive or reactive means.

Reactive data-quality management means addressing a data-quality issue once it is identified, i.e., poor-quality (incorrect, incomplete, or junk) data detected either during the ETL process or during the analytics and reporting phase. The business or data-quality team then analyzes the data and applies corrective measures across the entire data set. This can work when the volume of data is low, generally over portions of a traditional data warehouse. It is unsuitable, and practically impossible, for the humongous data of Big Data with its other Vs (Velocity and Variety).
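
As a simplified illustration of the reactive pattern, the sketch below (record fields are hypothetical) catches invalid records during an ETL load and sets them aside for later correction. It is exactly this analyze-and-correct step that stops scaling as volume and velocity grow.

```python
# Reactive data quality in an ETL step: detect bad rows at load time,
# quarantine them, and leave correction to the data-quality team.
def is_clean(record: dict) -> bool:
    # Reject incomplete or junk records: missing customer key, negative amount.
    return record.get("customer_id") is not None and record.get("amount", -1) >= 0

def etl_load(records):
    clean, quarantine = [], []
    for rec in records:
        (clean if is_clean(rec) else quarantine).append(rec)
    # In a real pipeline these would be warehouse and quarantine tables.
    return clean, quarantine

batch = [
    {"customer_id": 101, "amount": 250.0},
    {"customer_id": None, "amount": 99.0},   # incomplete: no customer key
    {"customer_id": 102, "amount": -5.0},    # junk: negative amount
]
clean, quarantine = etl_load(batch)
print(len(clean), "loaded;", len(quarantine), "quarantined for correction")
```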

Proactive measures

Constitute a data governance council with clear roles and responsibilities. This is critical, as implementing data-quality rules after the initial load is a herculean task, and identifying bad data once data from different sources has been mixed may prove expensive, if not impossible.

The council should have fair representation from business (the real owners of the data) and IT, with active involvement from the business side.

The council should develop and automate a process for feeding data-quality findings back to the sources

Establish data quality benchmarks

Devise measurable data quality metrics (a minimal sketch follows this list)

Establish a continuous compliance monitoring framework

Establish a sound metadata framework
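
The sketch below ties several of these items together, assuming hypothetical fields and benchmark thresholds: two common metrics (completeness and validity) are computed per batch and compared against benchmarks, which is the raw material a continuous compliance monitor would consume.

```python
# Measurable data-quality metrics checked against benchmarks per batch.
BENCHMARKS = {"completeness": 0.98, "validity": 0.95}  # assumed targets

def completeness(records, field):
    """Share of records where the field is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def validity(records, field, predicate):
    """Share of records whose field passes the business rule."""
    ok = sum(1 for r in records if predicate(r.get(field)))
    return ok / len(records)

batch = [
    {"customer_id": 101, "age": 34},
    {"customer_id": 102, "age": -2},   # invalid age
    {"customer_id": None, "age": 51},  # incomplete key
]
metrics = {
    "completeness": completeness(batch, "customer_id"),
    "validity": validity(batch, "age", lambda a: isinstance(a, int) and 0 <= a <= 120),
}
for name, value in metrics.items():
    status = "OK" if value >= BENCHMARKS[name] else "BREACH"
    print(f"{name}: {value:.2f} (target {BENCHMARKS[name]}) -> {status}")
```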

Data-quality rules should ensure that the variety of incoming data is profiled, and rule implementation should be automated. Manual intervention should be avoided to the maximum possible extent for Big Data; it may work in a traditional warehouse, where data volumes are comparatively manageable.

Data governance should ensure that data scraped from social media or any other source does not infringe privacy and confidentiality requirements, lest this attract penalties and require huge investment to mask the data after it is loaded.

Automate data cleansing and augmentation (see the sketch below).
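
As one illustration of this automation, the sketch below (field names hypothetical) masks governed PII fields and applies a simple cleansing rule at ingest time, before the data lands in the lake; masking at this point is far cheaper than re-masking data already loaded.

```python
# Automated masking and cleansing applied at ingest, before load.
import hashlib

PII_FIELDS = {"email", "phone"}  # assumed to be governed as confidential

def mask(value: str) -> str:
    """One-way hash: the value can still be joined on, but not read."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]

def ingest(record: dict) -> dict:
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()           # cleansing: drop stray whitespace
        if key in PII_FIELDS and value:
            value = mask(str(value))        # masking: never load raw PII
        cleaned[key] = value
    return cleaned

raw = {"id": 7, "email": " user@example.com ", "comment": "great product "}
print(ingest(raw))
```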

Conclusion

Data quality is a process, an attitude, and a paradigm shift in focus, rather than a tool.

By following the data-quality postulates and the key considerations stated above, organizations can maximize the RoI on a Big Data implementation. With many organizations starting or planning Big Data solutions, now is the right time to ingrain data quality as a part of the implementation. Neglecting data quality may reduce the RoI, if not completely decimate the advantages, of a Big Data solution.
