In the first part of this series, we talked about the benefits, market
proposition, some technologies involved and the tools that are available in the
market for business intelligence.
This time, we will be taking you into the world of data warehousing and its
structure definitions, various terms and entities associated with it. We will
talk about the data warehouse components and characteristics. We will list all
the challenges that are common while defining data warehouses. We will give you
a hands-on experience of our warehouse implementation.
Why do you need a data warehouse? This is the most common query you face
while convincing others to adopt the technology and implement structured data
system in their organization. The volume of data that one has to deal with in an
organization comprises thousands of accounts, product lists, transaction
details, and inventory and customer requests present in fragmented format (as
databases, files and spreadsheets). Each of these databases, files and sheets
has to be updated on a daily basis in a heterogenous environment of data
sourcing and collection and uploaded to stay active, which is not easy. For
those who implement data warehouses, this data would run into terabytes and
across different locations. So how do you extract any specific bit of data that
you want and within minutes? For this, specific architectures have to be laid
down to ensure data flow of your choice in a warehouse.
A properly configured warehouse has different as well as unique databases
integrated together. So if you want to build a data warehouse, you need a team
of dedicated IT professionals with good understanding of your business so that
they can correlate different data and implement data structures in the most
suitable manner for their organization. This involves categorizing databases
that house data from each department spread across locations and link them
accurately across structure in defined business process. And it's not the end
of exercise once the warehouse is implemented and has gone online. Since a
warehouse is not a static application which once installed or configured will
work painlessly, the team has to modify its properties quite often with every
change in the market conditions and company's management decisions.
Warehouse architectures
In many cases, a data warehouse contains the living history of the organization
while they usually contain historical data, often collected from different
sources such as OLTP (On-Line Transaction Processing) systems, legacy systems,
spreadsheets or even text files. A data warehouse combines this data, cleanses
it for accuracy and consistency, and organizes it for ease and efficiency of
querying. Some definitions of a data
warehouse include several elements such as a data preparation area, the
cleansing process, the database that holds the data warehouse data, and the
tools that organize and present the data to client applications. Other
definitions restrict the data warehouse to the database that contains the data
warehouse data. In large data warehousing applications, data is often segmented
into specialized components, called data marts that address individual
components of the organization. Some definitions consider data marts to be part
of the data warehouse; others consider them to be separate entities. The
intended meaning of the term data warehouse is usually clear from the context in
which it is used.
Critical solutions such as warehouses require a careful analysis of what is
needed and how a subsequent warehousing solution has to be implemented. While
going for a warehouse solution the choice of architecture matters a lot, as it
is to be based on factors such as organization's
infrastructure, business requirements and desired system structure. What
architecture one goes for is also affected by the location of the data marts and
the location of the central control. For example, an organization might go for
an architecture that allows data to be present, distributed and
controlled centrally or even independently. You can have architectures where the
data marts are independent or inter-connected or go for a warehouse that spans
through your organization. The scope here refers to the usage and not physical
location.
Architecture components
The architecture of the warehouse is made up of a set of components with clearly
defined operational and structural boundaries. At the base is what is termed as
Operational Data Sources or Data Sources which communicate with the layer above
them, namely Data Access Layer. This layer serves as an interface between the
data sources and the Data Staging Area. It is responsible for 'cleansing'
the data, ie bringing all the data from different sources to a universal format
before it is passed on to further layers in the warehouse. After this layer
comes the Data Warehouse Layer which is the organization's warehouse. Layers
above this are typically another Data Access Layer for acting as an interface to
processes and applications that are conceptually placed in the top most layer of
the architecture called the Information Access Layer or the User Analysis Layer.
These layers can be considered to be a part of the larger Application
Messaging Layer. This layer is responsible for transfer of information within
itself, ie between the components that make up this layer and also between its
component layers and external data sources like the Metadata.
Implementation
Implementing a data warehouse begins with a thorough phase of analysis of your
business, its requirements and environment, and analyzing shortcomings in the
current BI. Repeating processes iteratively leads to a better understanding of
the overall state of business and tells you where and how much the problem is.
So you need to gather business documents and analyze them to formulate the
decisions aforesaid.
After this, you analyze other requirements which affect running business
processes in the organization. The activities that constitute this phase include
end-user requirement analysis . This revolves around the data needs of a
day-to-day information worker. The functional requirements are evaluated next;
this answers questions such as what new information analyses are needed and what
pitfalls of the current information analysis
systems are, if they exist. The information infrastructure is also evaluated to
find out whether the current infrastructure suffices for the implementation or
an expansion is required.
DW with SQL Server 2000 |
Let's see one sample DW implementation as a step by step procedure. We used MS SQL Server 2000 on Windows 2003 Enterprise Server SP1.
|
After all the analyses, comes the warehouse modeling phase. This involves
modeling the warehouse at both logical and physical levels. This modeling is
quite different from the modeling of a relational database. For instance, in
relational database modeling a direct end-user interaction is never a priority
whereas here the modeling is done so that the warehouse is end-user centric.
Unlike a database, the modeling here is done keeping in mind the fact that the
warehouse is being designed for data analysis and not just data storage in a
qualified pre-defined manner.
While these issues are addressed in the conceptual modeling, the physical
modeling looks into aspects such as how the implementation will proceed.
So it is not that hard to configure a warehouse. All it takes is an
initiative to implement the process and take that first step. After that, DTS
takes care of most of the things and you can get structured answers to your
queries.
Anadi Misra and Anubhav Verma