Advertisment

Profile Your Database Using SQL Server 2008

author-image
PCQ Bureau
New Update

It is quite important many a times to figure out what your database holds. It

is of course easy enough to find out something like this by using normal SQL

queries. However, sometimes the level of detail that SQL queries gives out is

not really adequate and you might want to use better techniques to get

information about your data.

Advertisment

This is where the upcoming version of SQL Server has something new for you.

In the SQL Server Integration Services, you get a new ability to profile the

data in tables that are stored in a database. This is done by a new task called

the 'Data Profile Task.'

SQL Server Integration Services (SSIS) can be used for a number of things.

Primarily this is a tool that lets you create data-related workflows (called

Control Flow) or data warehouse Extract-Transform-Load (ETL) processes (called

Data Flow). The Data Profile Task is a new task in SSIS 2008 that lets you do

data profiling of your tables.

To use this, open up the SQL Server Business Intelligence Development Studio

on the machine. Create a new Integration Services Project from the File | New |

Project window. Once the project is ready, you will see the SSIS Design surface.

The default design surface you get is the Control Flow Designer -which also

happens to exactly the one that you need.

Advertisment

Direct Hit!

Applies To:

DBA



USP: Profile the data in your database


Primary Link: microsoft.com\sqlserver


Keyword:
SQL Server 2008

Open up the BIDS Toolbox and look for the Data Profile Task in it. When you

find it, drag and drop it into the Control Flow Designer Surface. Once the task

is available on the surface, right click the bottom part of the screen inside

the 'Connection Managers' section and select 'New ADO.NET Connection.' In the

dialog box that opens, give the parameters that will let you connect to your

database.

You can now right click the Data Profile Task and select the 'Edit' option to

get into the detailed configuration screen. The first screen allows you to

select how the results of the profiling tests will be stored-into a variable or

into a file. Normally, you will select a file. Next, select the Destination

option and create a new File connection. Select Create file and give the path

and name for a new XML file, for instance: c:\Temp\DataProfile.xml. In the first

screen, there is also a button that allows you to create a Quick Profile. Click

it to open a simple dialog box.

Advertisment

Select the connection to the database and then the table you wish to profile.

You can then select a number of statistics and tests that you can run on the

selected table. A description of each of the test types is given further ahead

in this article. You can use the Quick Profile window to add profiles for

multiple tables.

The Quick Profile allows you to

choose the database, table, and the profiling tests you want to run

Once you have added the ones you want, you can then go ahead and select the

Profile Requests section in the main dialog box. This lets you configure the

individual requests for profiling.

Advertisment

Although all requests will work fine with defaults, you might want to

customize them so that you get meaningful results as well as faster performance

since unnecessary profile requests need not get handled. Once ready you can run

the task by right-clicking the Package.dtsx in Solution Explorer. On completion

you will get a new file created as specified.

To view the results (stored in the XML file), you need to use the new Data

Profile Viewer tool. Go to Start | Run and type in DataProfileViewer and press

Enter. This will open the DPV tool.

Click the Open menu and select the new XML file that the SSIS task created.

You will now be able to view the details of the profiling.

Advertisment

You can drill down to each table you chose and look at the different

statistics and results for tests on each column that you wanted. For instance,

you can check which columns qualify as key columns, you can check the regular

expression pattern of each column and more. The following is a list of tests you

can run:

Candidate Key Profile: Checks each column to see if it can uniquely

identify each row in the table as in a primary key. If yes, then it is listed as

a candidate key column.

After customizing

the Pattern Profile we can customize the settings for Functional Dependency

Profile
The results of

Pattern Profile test show a number of regular expressions detected and the

results of the Value Distribution test are shown in the viewer
Advertisment

Column Null Ration Profile: Checks the number of null values in a

column and gives you a percentage of null to non-null values in that column.

Column Statistics Profile: Gives you simple stats about the values in

a column, including minimum, maximum, average, and standard deviation. Works on

numeric and date columns.

Column Length Distribution Profile: Checks the different lengths of

values in a column, including and excluding leading and trailing whitespaces.

Advertisment

Column Value Distribution Profile: Shows a list of the different

values that a column has and their number and percentages in the column.

Column Pattern Profile: Tries to detect a regular expression that

defines what is the pattern to the value in a column.

Functional Dependency Profile: Checks the value of one column with any

other column to detect any functional dependencies. For instance, on finding a

10 digit number in the phone column, a functional dependency can be detected for

the country column to store a value for India.

Based on these statistics you can decide to clean up the data as well as

write business rules or validations to ensure that the data follows the rules

you set up. The Data Profile Task lets you quickly check the pattern and stats

of your tables and use them effectively.

Advertisment