It is quite important many a times to figure out what your database holds. It
is of course easy enough to find out something like this by using normal SQL
queries. However, sometimes the level of detail that SQL queries gives out is
not really adequate and you might want to use better techniques to get
information about your data.
This is where the upcoming version of SQL Server has something new for you.
In the SQL Server Integration Services, you get a new ability to profile the
data in tables that are stored in a database. This is done by a new task called
the 'Data Profile Task.'
SQL Server Integration Services (SSIS) can be used for a number of things.
Primarily this is a tool that lets you create data-related workflows (called
Control Flow) or data warehouse Extract-Transform-Load (ETL) processes (called
Data Flow). The Data Profile Task is a new task in SSIS 2008 that lets you do
data profiling of your tables.
To use this, open up the SQL Server Business Intelligence Development Studio
on the machine. Create a new Integration Services Project from the File | New |
Project window. Once the project is ready, you will see the SSIS Design surface.
The default design surface you get is the Control Flow Designer -which also
happens to exactly the one that you need.
Direct Hit! |
Applies To: DBA USP: Profile the data in your database Primary Link: microsoft.com\sqlserver Keyword:SQL Server 2008 |
Open up the BIDS Toolbox and look for the Data Profile Task in it. When you
find it, drag and drop it into the Control Flow Designer Surface. Once the task
is available on the surface, right click the bottom part of the screen inside
the 'Connection Managers' section and select 'New ADO.NET Connection.' In the
dialog box that opens, give the parameters that will let you connect to your
database.
You can now right click the Data Profile Task and select the 'Edit' option to
get into the detailed configuration screen. The first screen allows you to
select how the results of the profiling tests will be stored-into a variable or
into a file. Normally, you will select a file. Next, select the Destination
option and create a new File connection. Select Create file and give the path
and name for a new XML file, for instance: c:\Temp\DataProfile.xml. In the first
screen, there is also a button that allows you to create a Quick Profile. Click
it to open a simple dialog box.
Select the connection to the database and then the table you wish to profile.
You can then select a number of statistics and tests that you can run on the
selected table. A description of each of the test types is given further ahead
in this article. You can use the Quick Profile window to add profiles for
multiple tables.
The Quick Profile allows you to choose the database, table, and the profiling tests you want to run |
Once you have added the ones you want, you can then go ahead and select the
Profile Requests section in the main dialog box. This lets you configure the
individual requests for profiling.
Although all requests will work fine with defaults, you might want to
customize them so that you get meaningful results as well as faster performance
since unnecessary profile requests need not get handled. Once ready you can run
the task by right-clicking the Package.dtsx in Solution Explorer. On completion
you will get a new file created as specified.
To view the results (stored in the XML file), you need to use the new Data
Profile Viewer tool. Go to Start | Run and type in DataProfileViewer and press
Enter. This will open the DPV tool.
Click the Open menu and select the new XML file that the SSIS task created.
You will now be able to view the details of the profiling.
You can drill down to each table you chose and look at the different
statistics and results for tests on each column that you wanted. For instance,
you can check which columns qualify as key columns, you can check the regular
expression pattern of each column and more. The following is a list of tests you
can run:
Candidate Key Profile: Checks each column to see if it can uniquely
identify each row in the table as in a primary key. If yes, then it is listed as
a candidate key column.
After customizing the Pattern Profile we can customize the settings for Functional Dependency Profile |
The results of Pattern Profile test show a number of regular expressions detected and the results of the Value Distribution test are shown in the viewer |
Column Null Ration Profile: Checks the number of null values in a
column and gives you a percentage of null to non-null values in that column.
Column Statistics Profile: Gives you simple stats about the values in
a column, including minimum, maximum, average, and standard deviation. Works on
numeric and date columns.
Column Length Distribution Profile: Checks the different lengths of
values in a column, including and excluding leading and trailing whitespaces.
Column Value Distribution Profile: Shows a list of the different
values that a column has and their number and percentages in the column.
Column Pattern Profile: Tries to detect a regular expression that
defines what is the pattern to the value in a column.
Functional Dependency Profile: Checks the value of one column with any
other column to detect any functional dependencies. For instance, on finding a
10 digit number in the phone column, a functional dependency can be detected for
the country column to store a value for India.
Based on these statistics you can decide to clean up the data as well as
write business rules or validations to ensure that the data follows the rules
you set up. The Data Profile Task lets you quickly check the pattern and stats
of your tables and use them effectively.