Organizations of all sizes and across industries are deluged with data spread
across legacy systems, local databases, spreadsheets and other data sources.
They are constantly seeking ways to build an enterprise-wide data warehouse
that delivers actionable, timely and reliable business information.
Most organizations build a comprehensive information management infrastructure
using either a top-down or a bottom-up development approach. In the top-down
approach, an enterprise data warehouse (EDW) is built in an iterative manner,
business area by business area, and the underlying dependent data marts are
created as required from the EDW contents. In the bottom-up approach,
independent data marts are created with a view to integrating them into an
enterprise data warehouse at some point in the future.
Either way, a major blind spot in data warehousing project delivery quality is
testing. Unlike custom software applications or product implementations, there
tends to be minimal user involvement in testing the data warehouse and its
downstream data marts, since the testing processes are technical and follow the
same iterative cycle as the data warehouse development.
Data warehouse testing
The importance of specialized data warehouse testing cannot be overstated. The
key success factor is user confidence and consistent data quality, which leads
to a high degree of user adoption. The ROI of information assets can be
determined only when the data is put to use to derive business insight.
[Figure: Various data transformation layers in a typical DWH implementation,
and the testing process flow and coverage.]
Testing of data warehouse and business intelligence applications involves
significant programming work: there are limited front-end screens, but mostly
back-end processes that work on data sets. Further, unlike the complicated
features and functions of application software, a DWH project can be considered
a simple sequence of data transformation, change and aggregation through a
chain of processes. But this simple sequence of data movement leads to
complications in testing. For every transformation of a data set, testing must
ensure that the transformation is right by encoding the transformation logic
into test scripts. With no front-end screens, most test scripts have to be
created as back-end scripts (say, SQL queries). Thus, DWH testing is more
intensive and more programmatic than regular application testing, and requires
extensive domain knowledge and an understanding of DWH concepts to create test
scripts. There is no readily available user interface to visually inspect and
validate.
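As an illustration of what such a back-end test script can look like, here is a
minimal sketch in Python using the standard-library sqlite3 module. The table
and column names (src_customer, dwh_customer) are illustrative assumptions, not
taken from any particular implementation:

```python
import sqlite3

# Illustrative in-memory database standing in for the source and target schemas.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_customer (id INTEGER, name TEXT);
    CREATE TABLE dwh_customer (id INTEGER, name TEXT);
    INSERT INTO src_customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO dwh_customer VALUES (1, 'Acme'), (2, 'Globex');
""")

def row_count(table):
    """Return the number of rows in a table -- the simplest back-end check."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

# The "test script" is an assertion over two queries, not a UI walkthrough.
assert row_count("src_customer") == row_count("dwh_customer")
print("source/target counts match")
```

The point is that the test artifact is a query plus an assertion; in practice
such scripts are parameterized over many tables and run in bulk.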
Testing process and methodology
Data warehouse testing starts with identifying all the data sources that go
into building the data warehouse, the downstream applications such as
subject-specific data marts and reports, and other dependencies. Each data
movement layer is defined in terms of its source data set, target data set, and
data transformation and business logic. Testing focuses on verification and
validation of correct and complete data movement from one data component or
layer to the next, and on successful generation of any downstream deliverables
such as reports.
Typically, testing can be differentiated into three distinct components: data
movement from files or other source systems to an operational data store or
staging area; further data transformation and application of business logic in
the data warehouse; and downstream data movement. Downstream data dependencies
include subject-specific data marts, where further aggregation and integration
of business logic take place, and reports. The key scenarios that should be
tested are:
Data completeness: Data completeness checks involve sub-scenarios such as:
* Comparing record counts between source data, data loaded to the warehouse
and rejected records.
* Comparing unique values of key fields between source data and data loaded
to the warehouse. Typically this can be done for entities like Customer,
Account, Transaction and Product.
* Value range testing: Testing the range and value distributions of fields in
a data set. This can be used during testing and in production to compare source
and target data sets and flag data anomalies from source systems that may be
missed even when the data movement itself is correct.
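The three completeness sub-scenarios above can be sketched as back-end checks.
This Python/sqlite3 example is a hypothetical illustration; the tables
(src_txn, dwh_txn, rejected_txn) and the amount-range rule are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_txn (txn_id INTEGER, amount REAL);
    CREATE TABLE dwh_txn (txn_id INTEGER, amount REAL);
    CREATE TABLE rejected_txn (txn_id INTEGER, reason TEXT);
    INSERT INTO src_txn VALUES (1, 10.0), (2, 25.5), (3, -999.0);
    INSERT INTO dwh_txn VALUES (1, 10.0), (2, 25.5);
    INSERT INTO rejected_txn VALUES (3, 'amount out of range');
""")

def scalar(sql):
    return conn.execute(sql).fetchone()[0]

# 1. Record counts: loaded plus rejected must account for every source record.
src = scalar("SELECT COUNT(*) FROM src_txn")
loaded = scalar("SELECT COUNT(*) FROM dwh_txn")
rejected = scalar("SELECT COUNT(*) FROM rejected_txn")
assert src == loaded + rejected

# 2. Unique key values: every loaded key must exist in the source.
missing = scalar("""
    SELECT COUNT(*) FROM dwh_txn
    WHERE txn_id NOT IN (SELECT txn_id FROM src_txn)
""")
assert missing == 0

# 3. Value range: flag target values outside an assumed business range.
out_of_range = scalar("SELECT COUNT(*) FROM dwh_txn WHERE amount < 0")
assert out_of_range == 0
print("completeness checks passed")
```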
Data correctness: Data correctness checks for inaccurate data coming from
disparate source systems, different definitions of operational data elements and
loss of data during the ETL (Extraction, Transformation and Loading) process.
Data transformation: Testing involves checks to ensure that the data has been
correctly transformed as per business rules and that the parent child
relationships are correctly defined.
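A data transformation check can be sketched the same way. Here both the
transformation rule (source dollar amounts stored as integer cents in the fact
table) and the parent-child constraint (every fact row must reference an
existing dimension row) are assumptions chosen for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_order (order_id INTEGER, amount_usd REAL);
    CREATE TABLE dim_customer (cust_id INTEGER PRIMARY KEY);
    CREATE TABLE fact_order (order_id INTEGER, cust_id INTEGER,
                             amount_cents INTEGER);
    INSERT INTO src_order VALUES (1, 12.50), (2, 3.00);
    INSERT INTO dim_customer VALUES (100);
    INSERT INTO fact_order VALUES (1, 100, 1250), (2, 100, 300);
""")

# Transformation rule under test (assumed): dollars in source -> cents in fact.
mismatches = conn.execute("""
    SELECT COUNT(*) FROM src_order s JOIN fact_order f USING (order_id)
    WHERE CAST(ROUND(s.amount_usd * 100) AS INTEGER) <> f.amount_cents
""").fetchone()[0]
assert mismatches == 0

# Parent-child check: every fact row must point at an existing dimension row.
orphans = conn.execute("""
    SELECT COUNT(*) FROM fact_order f
    LEFT JOIN dim_customer d ON f.cust_id = d.cust_id
    WHERE d.cust_id IS NULL
""").fetchone()[0]
assert orphans == 0
print("transformation and parent-child checks passed")
```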
Data quality: Data quality checks focus on how the ETL system handles data
rejection, substitution, correction and notification without modifying data.
Business logic checks: Testing validates that all the business logic built into
the ETL and the reports is correct as per the defined business specifications.
At the report level, this involves testing all the measures with data that
mirrors production as closely as possible, so that all the business conditions
are exercised. Testing also covers report-level aggregations such as averages,
percentages and subtotals, as well as graphical representations.
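One way to test report-level aggregations is to recompute each aggregate
directly from the fact data and diff it against the report's pre-aggregated
summary. The schema below (fact_sales, report_summary) is an assumption for
illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (region TEXT, amount REAL);
    -- report_summary stands in for the aggregated table a report reads from.
    CREATE TABLE report_summary (region TEXT, total REAL, avg_amount REAL);
    INSERT INTO fact_sales VALUES ('N', 100), ('N', 300), ('S', 50);
    INSERT INTO report_summary VALUES ('N', 400, 200), ('S', 50, 50);
""")

# Recompute the aggregates straight from the fact table and diff them against
# the report's stored summary; any row returned is a defect.
diffs = conn.execute("""
    SELECT r.region FROM report_summary r
    JOIN (SELECT region, SUM(amount) AS total, AVG(amount) AS avg_amount
          FROM fact_sales GROUP BY region) f
      ON r.region = f.region
    WHERE r.total <> f.total OR r.avg_amount <> f.avg_amount
""").fetchall()
assert diffs == []
print("report aggregations match the fact data")
```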
Downstream data reconciliation: This involves reconciling data from the source
systems to the final downstream systems, across the various intermediate
transformation layers.
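Such end-to-end reconciliation can be sketched as a control-total comparison
across layers. The four-layer chain below (src, staging, dwh, mart) and the
reconciled measure are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (amount REAL);
    CREATE TABLE staging (amount REAL);
    CREATE TABLE dwh (amount REAL);
    CREATE TABLE mart (amount REAL);
    INSERT INTO src VALUES (10), (20), (30);
    INSERT INTO staging VALUES (10), (20), (30);
    INSERT INTO dwh VALUES (10), (20), (30);
    INSERT INTO mart VALUES (10), (20), (30);
""")

# Reconcile one control total (the sum of a key measure) across every layer;
# all totals must agree for the chain to reconcile end to end.
layers = ["src", "staging", "dwh", "mart"]
totals = {t: conn.execute(f"SELECT SUM(amount) FROM {t}").fetchone()[0]
          for t in layers}
assert len(set(totals.values())) == 1, f"layers disagree: {totals}"
print("control totals reconcile end to end:", totals)
```

Real reconciliations usually track several control totals (row counts, sums,
hash totals) per layer, but the structure is the same.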
Regression testing: Given the iterative methodology of building a data
warehouse, regression testing is of utmost importance. The focus is on ensuring
that every incremental release does not break the existing system in production
and that the new functionality behaves as per specifications.
Test scripts are executed as a batch, and the results are generated as an
output file or report that details source-to-target counts and the other test
scenarios. The batch also accounts for the execution order of dependent
scripts. Data warehouse testing needs a specialized skill set: a combination of
domain knowledge, testing experience and the technical knowledge required to
understand the data warehouse design and the overall IM infrastructure. The
testing strategy must start at the design stage itself, and key challenges in
terms of data and infrastructure availability and access to the test
environment must be identified at the initial design stages.
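A minimal sketch of such a batch runner, assuming Python 3.9+ for the
standard-library graphlib module: each check declares its dependencies, the
runner executes the checks in dependency order, and the collected result lines
stand in for the output report. All check names and stubbed results are
hypothetical:

```python
from graphlib import TopologicalSorter

results = {}

def check_counts():
    return ("counts", "PASS", "source=100 target=100")  # stubbed result

def check_keys():
    return ("keys", "PASS", "0 missing keys")           # stubbed result

# Each entry: check name -> (function, names of checks it depends on).
checks = {"counts": (check_counts, []),
          "keys": (check_keys, ["counts"])}  # keys runs after counts

order = TopologicalSorter({name: deps for name, (_, deps) in checks.items()})
for name in order.static_order():
    fn, _ = checks[name]
    results[name] = fn()

# The "output report" is just the collected result lines.
for name, status, detail in results.values():
    print(f"{name}: {status} ({detail})")
```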
NK Subramaniyam - Executive Director & J Srivastava - Head Products &
Alliances, Saksoft