Archive for June, 2013

DQS Composite Domains

June 16, 2013

Update:

I felt a little ambitious and actually recorded a series of 3 videos at MSBI Academy which cover all the content of this post. The videos also give a little more background information about DQS for those who have not used it yet.

1. Creating a new Data Quality Services (DQS) Domain

2. Adding Business Rules to a Data Quality Services (DQS) Domain

3. Enriching data using a Data Quality Services Project

Introduction:

Data Quality Services (DQS) is a great addition to SQL Server 2012. There were many times in the past where I had wished there was a tool which allowed Data Stewards to clean, validate, and perform matching or de-duplication on the data before it went into the Data Warehouse. In my opinion there are only two kinds of companies: those that don’t recognize that they have data quality issues and those that realize that they have data quality issues! Unfortunately, most organizations don’t find out that they have data quality issues until they attempt a data warehouse project and then those issues become all too apparent.

I am not going to go into all the functional details of DQS. (See above link for great training materials on the DQS team website) However, I want to focus on one of the most important features of DQS that seems to be misunderstood.

The ability to create Composite Domains for data correction or validation is one of the most critical features of DQS and would be required in virtually any DQS solution. However, creating Composite Domain rules in the DQS client UI can be a little tricky if you don’t know what to look for. (At least it was for me, hence this post.)

The following example uses the AdventureWorks Sample that is available on the DQS Team website. I suggest that you watch their training videos so that you have a good understanding of what I will describe in the next section.

Scenario:

You want to create Composite Domain rules that populate Title based on the values in Marital Status and Gender such that:

when Gender = ‘M’ (male)

– then Title = ‘Mr.’, or

when Marital Status = ‘S’ (single) and Gender = ‘F’ (female)

– then Title = ‘Ms.’, or

when Marital Status = ‘M’ (married) and Gender = ‘F’ (female)

– then Title = ‘Mrs.’
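
For readers who think in SQL, the intended enrichment is equivalent to a CASE expression like the sketch below. This is illustrative only; in DQS the logic lives in Composite Domain rules, not T-SQL, and the table and column names here are placeholders.

-- illustrative only: dbo.Customer and its columns are hypothetical
SELECT
    MaritalStatus,
    Gender,
    CASE
        WHEN Gender = 'M'                         THEN 'Mr.'   -- male
        WHEN MaritalStatus = 'S' AND Gender = 'F' THEN 'Ms.'   -- single female
        WHEN MaritalStatus = 'M' AND Gender = 'F' THEN 'Mrs.'  -- married female
    END AS Title
FROM dbo.Customer;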

[Screenshot: the desired Title values by Marital Status and Gender]

To accomplish this, first create all the individual domains and then create the ‘Title Enrichment’ Composite Domain. Next, the Title Enrichment Composite Domain needs a few rules to correctly populate the Title domain/column.

For the Single Female Title rule, the first part of the rule, the test condition, needs to be created. This is the “IF” of the “IF…THEN…” syntax.

The following example shows: if Marital Status = ‘S’ then Title = ‘Ms.’

[Screenshot: the rule with the condition “Marital Status Value is equal to S” and the THEN clause setting Title to ‘Ms.’]

Everything looks good at this point. Now all that is needed is to add an additional condition to the ‘IF’ to test when Gender = ‘F’. However, when you click on the “Adds new condition to the selected clause” button, a new clause is created for the Marital Status domain.

[Screenshot: clicking the button adds another clause for the Marital Status domain]

This is not what is needed because the Gender domain needs to be used. It would appear that DQS is missing the very important ability to test multiple domains in the same Composite Domain Rule.

The key to creating this type of Composite Domain Rule is to ensure that no part of the current condition (“Marital Status Value is equal to S”) is in focus when you click on the “Add new condition to the selected clause”. You will know when the current condition is in focus because the vertical line will be solid blue or a blue box will appear around a part of the condition. See below:

[Screenshot: the current condition in focus, indicated by the solid blue line/blue box]

What is required is to: 1. click away from the current condition, and only then 2. click on the “Add new condition to the selected clause”.

[Screenshot: the rule editor with no part of the condition in focus]

If you do that, then you can create the required condition using the Gender domain:

[Screenshot: the completed rule testing both the Marital Status and Gender domains]

Now when a project is run with this Composite Domain, the data in the Title column will have the correct values:

[Screenshot: project results with the Title column correctly populated]
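
If the cleansed output is exported to SQL Server, a quick sanity check along these lines confirms that every Marital Status/Gender combination maps to the expected Title. The output table name below is hypothetical.

-- dbo.CustomerCleansed is a hypothetical table holding the exported DQS project output
SELECT MaritalStatus, Gender, Title, COUNT(*) AS RecordCount
FROM dbo.CustomerCleansed
GROUP BY MaritalStatus, Gender, Title
ORDER BY MaritalStatus, Gender, Title;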


BlueGranite: DW and BI Resources

June 9, 2013


BlueGranite, a company that I do a lot of work for and which has a great team of BI and DW professionals, is creating great BLOG posts that will be useful for those who are not in-the-weeds developers, such as myself.


Another great resource is Rob Kerr’s (SQL MVP) MSBI Academy, which has some good short videos on a great variety of Microsoft BI topics. I like his videos because they get to the point and have much better sound quality. No more having to sit through long demos that ramble on and at times are very hard to understand. (I have been known to do that from time to time.)

To be completely transparent I will be creating some content for these sites as well. You might even see content there before it appears in my BLOG.

Example:

My SQL Performance Tuning Test Results for Data Warehouse Architectures post compiles over a year’s worth of SQLIO work with a variety of SQL Server Data Warehouse hardware architectures.

I also like Rob’s Big Data Series.

And in case you missed TechEd North America 2013 here are my top picks:

Any Data, Any Size: Architecting Breakthrough Performance with Mission Critical Solutions Built on HP AppSystem for Parallel Data Warehouse

Do You Have Big Data? (Most Likely!)

Large-Scale Data Warehousing and Big Data with Microsoft SQL Server Parallel Data Warehouse V2

Polybase: Hadoop Integration in SQL Server PDW V2

What’s New for Columnstore Indexes and Batch Mode Processing

Performance Tuning SQL Server Database: Extended Events

June 2, 2013

Extended Events are not new to SQL Server 2012, but they have been greatly enhanced from when they were first introduced in SQL Server 2008. As is my style, I am not going to go into a long, protracted discussion of Extended Events. That is what MSDN is for, and they do a good job of covering all the details there.

What I am going to focus on is how Extended Events can be used in data warehousing. To begin with, Extended Events (XEvents) can be incredibly valuable for data warehouse scenarios because they can measure the throughput (MB/sec) of each query executed against the data warehouse server. Knowing the throughput requirements (MB/sec) of the data warehouse queries is absolutely critical for architecting a data warehouse server that performs as expected! See: Data Warehouse Storage Throughput (MB/sec) for more information.

If, for example, you know that a data warehouse query returns 2,000 MB and that the users want the query to complete in 10 seconds, and assuming that the query takes 5 seconds to aggregate on the CPU, that leaves 5 seconds for I/O, so the minimum required throughput is 400 MB/sec:

Example:

Query MB / seconds available for I/O = required throughput

2,000 MB / 5 sec = 400 MB/sec

Note: The above does not account for concurrency (the number of users and other queries running at the same time on the server), which will have to be accounted for as well.
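
As a rough illustration (the numbers here are made up): if 5 similar queries run concurrently during the same 5-second I/O window, the storage would need to sustain roughly 5 x 400 MB/sec = 2,000 MB/sec to keep all of them on target.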

You can use IO statistics to measure a query’s MB as explained here: Page Compression, but that is tedious to do for a lot of queries and doesn’t give a good picture of the other queries running on the server. Enter Extended Events: create a session and you get all of that information, plus much more.

Example:

1. Create an Extended Events session with the following actions:

CREATE EVENT SESSION QueryIOStats2012
ON SERVER
ADD EVENT sqlserver.sql_statement_completed
  (ACTION (sqlserver.database_id,
           sqlserver.sql_text,
           sqlserver.nt_username,
           sqlserver.session_id,
           sqlserver.query_hash,
           sqlserver.query_plan_hash)
  )
ADD TARGET package0.asynchronous_file_target (SET filename = N'c:\temp\QueryIOStats2012.xel')
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);
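
The CREATE statement only defines the session; assuming you keep the session name above, starting and stopping the collection is a plain ALTER EVENT SESSION call:

-- start collecting events
ALTER EVENT SESSION QueryIOStats2012 ON SERVER STATE = START;

-- ... let the workload run for the period you want to measure ...

-- stop collecting events
ALTER EVENT SESSION QueryIOStats2012 ON SERVER STATE = STOP;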

 

Note: I chose to save the results to a file (.xel) instead of keeping them in memory because they can get very large, depending on how active your server is.

2. You then let the session run for the period of time that you want to collect query data for, and with a little bit of knowledge about querying XML you have access to a wealth of information.

select
    a.LogTime
    ,DATEPART(DAY, a.LogTime)  as Day
    ,DATEPART(hour, a.LogTime) as Hour
    ,a.Duration / 1000000.00   as DurationSec
    ,a.CPUTime / 1000000.00    as CPUSec
    ,a.PhysicalReads
    ,a.LogicalReads
    ,a.Writes
    ,a.PhysicalReads * 8 / 1024.00 as PhysicalReadsMB
    ,a.LogicalReads * 8 / 1024.00  as LogicalReadsMB
    ,a.Writes * 8 / 1024.00        as WritesMB
from (
    SELECT
        EventLog.value('(/event[@name="sql_statement_completed"]/@timestamp)[1]',
                       'datetime') AS LogTime,
        EventLog.value('(/event/data[@name="duration"]/value)[1]',
                       'int') AS Duration,
        EventLog.value('(/event/data[@name="cpu_time"]/value)[1]',
                       'int') AS CPUTime,
        EventLog.value('(/event/data[@name="physical_reads"]/value)[1]',
                       'int') AS PhysicalReads,
        EventLog.value('(/event/data[@name="logical_reads"]/value)[1]',
                       'int') AS LogicalReads,
        EventLog.value('(/event/data[@name="writes"]/value)[1]',
                       'int') AS Writes,
        EventLog.value('(/event/action[@name="sql_text"]/value)[1]',
                       'varchar(max)') AS SQLText,
        EventLog.value('(/event/action[@name="session_id"]/value)[1]',
                       'varchar(max)') AS Session_ID,
        EventLog.value('(/event/action[@name="nt_username"]/value)[1]',
                       'varchar(max)') AS NTUsername,
        EventLog.value('(/event/action[@name="query_hash"]/value)[1]',
                       'varchar(max)') AS QueryHash,
        EventLog.value('(/event/action[@name="query_plan_hash"]/value)[1]',
                       'varchar(max)') AS QueryPlanHash
    FROM (
        SELECT CAST(event_data AS XML) AS EventLog
        FROM sys.fn_xe_file_target_read_file
             ('c:\temp\QueryIOStats2012*.xel', null, NULL, NULL)
    ) as LogRecords
) as a
order by a.LogTime
;

 

<rant> yeah I know querying xml makes my brain hurt too, but it is so worth it !!!</rant>

 

Results:

[Screenshot: query results showing per-statement duration, CPU time, reads, writes, and MB]

With these results you can now analyze the query patterns aggregated by a certain time interval (hours by day) and compare that to the actual hardware IO limits as measured by SQLIO. You can now answer difficult questions like the following with confidence:

How will the data warehouse server perform if I add X amount of data?

What are the largest queries and peak query times?

Will my data warehouse perform well if I add X amount of users?   

Is my ETL taking too long because of lack of server resources or slow ETL logic?

There is so much more that can be learned from the above Extended Events XML query, such as analyzing the query hash and query plan hash columns, but I hope that this gives you an idea of what is possible.
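
For example, here is a minimal sketch, assuming the inner SELECT from the query above (which includes LogTime, Duration, PhysicalReads, and QueryHash) has been saved into a hypothetical table named dbo.QueryIOStats, that rolls the workload up by hour and query hash to find the heaviest recurring queries and the peak read volume:

-- dbo.QueryIOStats is a hypothetical table holding the parsed event rows
SELECT
    DATEPART(DAY,  LogTime)          AS [Day],
    DATEPART(HOUR, LogTime)          AS [Hour],
    QueryHash,
    COUNT(*)                         AS Executions,
    SUM(PhysicalReads * 8 / 1024.00) AS TotalPhysicalReadsMB,  -- 8 KB pages to MB
    SUM(Duration / 1000000.00)       AS TotalDurationSec       -- microseconds to seconds
FROM dbo.QueryIOStats
GROUP BY DATEPART(DAY, LogTime), DATEPART(HOUR, LogTime), QueryHash
ORDER BY TotalPhysicalReadsMB DESC;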

Script