Data Science and SQL

August 28, 2015

No matter which buzzwords you use (Knowledge Discovery, Data Mining, Predictive Analytics, Advanced Analytics, Machine Learning, etc.), the once obscure discipline of Data Science has become a red-hot topic in recent years. There are now many software solutions offered by an ever-increasing number of vendors.

Within the last year Microsoft has made, in my estimation, a substantial contribution to the data science ecosystem with its cloud-based Azure Machine Learning. AzureML is impressive in both its depth and breadth (Azure ML Gallery). However, I am still often met with disbelief when I mention that SQL Server has natively supported some of these capabilities for over a decade now (SQL Server 2005 Data Mining).

With the rise in popularity of Data Science, more companies than ever are looking for people with experience in languages like “R” and Python. Data Science is now also being caught up in the “Big Data” buzz. I frequently hear “Big Data” solutions like Hadoop touted as the end-all, be-all of data science. Don’t get me wrong: Hadoop can play an instrumental, and in some cases critical, role in developing a predictive model (labels and features), but these models typically don’t magically materialize out of HDFS any more than they do out of NTFS.

The harsh reality is that creating a good predictive model usually requires a lot of work (and some amount of luck). A model’s feature and label data typically have to be integrated from many disparate sources, cleaned, and then conformed to specific categorical/discrete values. Similar to data warehouse ETL development, feature development can easily consume more than half of the development effort of the entire solution. Failure to devote the appropriate resources to either feature development (or ETL) often results in failure. “Garbage in, garbage out” still applies.

Many people also fail to recognize the value that SQL, or Relational Database Management Systems in general, can provide to the feature development process. Data Warehouse and Business Intelligence professionals have been integrating, cleaning, and conforming data with SQL for decades, and they can provide valuable context to business data as well. The following is an example in which I took an existing AzureML experiment and replaced the feature development code section (originally written in Python) with SQL. Not only did SQL greatly reduce the feature data processing time from over 2 hours to less than 1 hour, but I was also able to easily and quickly develop almost double the number of features (from 30 to 70+).

The additional features increased the experiment’s accuracy from 0.873 to 0.910.

stats

Note that this experiment is based on a 2015 KDD competition. In other words, this is real data from a real-world solution. Actual data scientists develop solutions like this one all the time. This is not carefully crafted demo data.

Below is a high level outline of how I modified the experiment:

  1. Identified the feature development code (Python) and downloaded its output for comparison with the SQL-based feature output to ensure that both matched.
  2. Loaded the source CSV data into SQL Server (enrollment_test, enrollment_train, log_test, log_train, object, truth_train).
  3. Wrote SQL to create the 70+ features ( KDD2015FeaturesTrain.sql , KDD2015FeaturesTest.sql ); a small sketch of the kind of query involved follows below.
  4. Uploaded the SQL feature output (both training and test datasets) to AzureML and used the datasets as sources in the experiment.
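The sketch below gives a flavor of step 3. It is a minimal illustration only; the column names (enrollment_id, time, event) are assumptions about the KDD log_train table, not a copy of the actual KDD2015FeaturesTrain.sql script.

-- Hypothetical per-enrollment activity features (column names are assumptions)
SELECT  l.enrollment_id
       ,COUNT(*)                                               AS total_events
       ,COUNT(DISTINCT CAST(l.[time] AS date))                 AS active_days
       ,SUM(CASE WHEN l.[event] = 'video'   THEN 1 ELSE 0 END) AS video_events
       ,SUM(CASE WHEN l.[event] = 'problem' THEN 1 ELSE 0 END) AS problem_events
FROM    dbo.log_train AS l
GROUP BY l.enrollment_id;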

Modified Experiment:

sql experiment

Lift Chart:

lift chart

This real-world example demonstrates that SQL (RDBMS) and Data Warehouse/Business Intelligence professionals can make significant contributions to predictive modeling and data science, especially in the area of feature development.

SQL Server 2016

May 13, 2015

Last week SQL Server 2016 was announced on the SQL Server BLOG. While there are several new features mentioned on the SQL Server Fact Sheet, I would like to focus on just those that affect Data Warehousing and those that indicate a new direction for SQL Server towards “Advanced Analytics”.

Many of these features are covered and demonstrated in the Ignite (TechEd) video New Features SQL Server 2016.

Real-Time Operational Analytics & In-Memory OLTP

columnstore

As implied by the name, Non-Clustered Columnstore Indexes have now been added to In-Memory OLTP (a.k.a. Hekaton). The Ignite video mentions that the addition of the report queries (Non-Clustered Columnstore Indexes) to the In-Memory table adds little to “no cost to the system”. While this may seem like a far-fetched claim to those who have actually had to performance tune an OLTP application that is getting bogged down by reporting queries, it really is not all that unbelievable.

The reason why reads (report queries) do not block OLTP application DML operations (Updates, Deletes, and Inserts), and vice versa, is that the In-Memory OLTP structure does not use memory latches to maintain data consistency.

In-Memory OLTP (In-Memory Optimization) – “The In-Memory OLTP engine is designed for extremely high session concurrency for OLTP type of transactions driven from a highly scaled-out middle-tier. To achieve this, it uses latch-free data structures and optimistic, multi-version concurrency control. The result is predictable, sub-millisecond low latency and high throughput with linear scaling for database transactions.”

This means that an application can support near real-time operational reporting without impacting the application’s performance. In other words, you *may* not have to build an ODS with, say, Replication/AlwaysOn Availability Groups feeding data to the ODS.

*** Provided that it fits in memory
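As a rough illustration only (a minimal sketch with a hypothetical dbo.SalesOrders table, not code from the Ignite demo), SQL Server 2016 lets you declare a columnstore index directly on a memory-optimized table, so reporting queries can run in batch mode against the live OLTP data:

-- Hypothetical memory-optimized OLTP table with an inline columnstore index
CREATE TABLE dbo.SalesOrders
(
    OrderID    INT IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
    OrderDate  DATE  NOT NULL,
    CustomerID INT   NOT NULL,
    Amount     MONEY NOT NULL,
    INDEX CCI_SalesOrders CLUSTERED COLUMNSTORE
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

-- A reporting query can now aggregate the live table directly
SELECT OrderDate, SUM(Amount) AS TotalSales
FROM dbo.SalesOrders
GROUP BY OrderDate;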

However, this does not mean that you can get away with not building a data warehouse. You will still need a data warehouse to do all the things that a data warehouse does so well, i.e. consolidate, clean, and conform data from multiple source systems. Most data warehouses are far too large to fit into memory.

Built-in Advanced Analytics

advanced analytics

Basically, “Advanced Analytics” is SQL Server integrated with “R” packages. I think that SQL Server integrated with “R” is a great step (I have been using R ever since I was introduced to it in my college stat classes), and there are a lot more things that can be done with R; just browse CRAN to get an idea.

I hope that BI professionals don’t lose sight of all that is possible with good old SQL.

For example, in Itzik’s latest book T-SQL Querying (Developer Reference) there is an entire chapter devoted to BI. Just check out the topics covered:

Frequencies, Descriptive statistics for continuous variables, chi-squared, analysis of variance, moving averages and entropy, histograms etc.
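As one small example of the kind of technique covered there, the window-function sketch below computes a trailing 7-day moving average (the dbo.DailySales table and its columns are hypothetical):

-- Trailing 7-day moving average with a window frame (hypothetical table)
SELECT  SalesDate
       ,Amount
       ,AVG(Amount) OVER (ORDER BY SalesDate
                          ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS MovingAvg7
FROM    dbo.DailySales;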

Polybase

SQL Server and Hadoop integration. After working with Polybase in PDW/APS, I am happy that it is now making its way into SQL Server SMP. With Polybase, a Data Warehouse can now use Hadoop (HDFS) as a data source and/or a data archive.
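To give a rough idea of what this looks like (a minimal sketch; the Hadoop location, file layout, and table definition are all made up for illustration), PolyBase exposes HDFS data through external tables that can be queried, or joined to regular tables, with plain T-SQL:

-- Hypothetical Hadoop data source, file format, and external archive table
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.FactSalesArchive
(
    SalesDate  DATE,
    ProductID  INT,
    Quantity   INT,
    Amount     MONEY
)
WITH (LOCATION = '/archive/sales/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = CsvFormat);

-- Queried just like any other table
SELECT SalesDate, SUM(Amount) AS TotalSales
FROM dbo.FactSalesArchive
GROUP BY SalesDate;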

Azure SQL Data Warehouse

APS

James Serra on his BLOG provides some valuable information about Azure SQL Data Warehouse which is basically PDW/APS in the cloud.

One of the most interesting things about Azure SQL Data Warehouse is that it can be integrated with Hadoop and Azure Machine Learning/“R” to produce some very interesting solutions. See: Azure SQL Data Warehouse: Deep Dive

Data Loading performance considerations with Clustered Columnstore indexes

March 16, 2015

The SQL CAT team recently released an excellent article called “Data Loading performance considerations with Clustered Columnstore indexes“. It is useful even for those who, like me, have been working with Clustered Columnstore Indexes for several years now. The Overview of Loading Data into Columnstore Indexes section mentions that an index REORG can affect batch size. In my experience, inconsistent performance can sometimes be caused by a batch being too small, i.e. less than the 102,400 row minimum, so the tuple mover may not be triggered in time, leaving some data operations in row mode instead of the more efficient batch mode. For small to medium size fact tables that may not always exceed the 102,400 row minimum, it may be necessary to execute an index REORG, which will immediately trigger the tuple mover after the normal data loading process is complete. The disadvantage of this approach is that the batch sizes over time may not be optimal because they were forced to be smaller than the 102,400 row minimum. If performance starts to suffer, a periodic index REBUILD routine will rebuild the index/table with batches closer to the 1 million rows/batch goal.
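A minimal sketch of the routine described above (the index and table names are hypothetical): REORGANIZE after the load to trigger the tuple mover, check rowgroup sizes, and REBUILD periodically if too many rowgroups end up well under the roughly 1 million row target.

-- Trigger the tuple mover for closed delta rowgroups after the load completes
ALTER INDEX CCI_FactSales ON dbo.FactSales REORGANIZE;

-- Inspect rowgroup sizes to see how far they fall below the ~1M rows/rowgroup goal
SELECT row_group_id, state_description, total_rows, deleted_rows
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID('dbo.FactSales');

-- If performance suffers from many undersized rowgroups, rebuild to consolidate them
ALTER INDEX CCI_FactSales ON dbo.FactSales REBUILD;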

Additionally, I still, from time to time, see a lot of confusion about the distinction between an Operational Data Store (ODS) and a Data Warehouse. James Serra recently wrote an excellent BLOG post about Operational Data Stores that captures their most important functions. In my opinion, one of the easiest and most effective ways to create an ODS is to simply set up an asynchronous SQL Server AlwaysOn Availability Group secondary. Several of the Availability Group databases could form the core of an ODS, but they would most likely be a poor substitute for a Data Warehouse.
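For what it’s worth, a heavily simplified sketch of that setup (server names and endpoints are made up, and the prerequisite endpoints, logins/certificates, and secondary-side restore steps are omitted):

-- Asynchronous, readable secondary that can serve as a lightweight ODS
CREATE AVAILABILITY GROUP AG_ODS
FOR DATABASE SalesDB
REPLICA ON
    'OLTPNODE' WITH (ENDPOINT_URL = 'TCP://oltpnode.corp.local:5022',
                     AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                     FAILOVER_MODE = AUTOMATIC),
    'ODSNODE'  WITH (ENDPOINT_URL = 'TCP://odsnode.corp.local:5022',
                     AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
                     FAILOVER_MODE = MANUAL,
                     SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));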

Mastering SQL Server 2014 Data Mining

January 24, 2015

DM_Book

I recently had the good fortune to collaborate with Amarpreet Bassan and Debarchan Sarkar from Microsoft in writing Mastering SQL Server 2014 Data Mining. This book is unique in that it covers all the major Microsoft tools for Data Warehousing, Data Mining, and Machine Learning. Each one of these tools is a major subject area, and many books could be written about each one. For example, just in the area of Data Warehousing there are multiple books devoted to each subcategory (the Microsoft BI Stack), i.e. SSIS (ETL), Dimensional Modeling (The Kimball Group), SSAS (OLAP), SSRS (Reporting), MDS (Master Data Management), etc.

Most books devoted to these topics aim to guide the beginner step-by-step through the process of building solutions with each of these tools. However, this book devotes only one chapter to these topics. In other words, this book is not, for the most part, aimed at the absolute DW beginner. The table of contents is a good guide to the topics that the book chooses to focus on.

Note: I would highly recommend getting the digital version of the book as it has full color graphics which the print version does not. I would also highly recommend downloading the book’s code since this is not a step-by-step book and it would be very time consuming to reproduce every example from scratch.

Chapters:

  1. Identifying, Staging, and Understanding Data
  2. Data Model Preparation and Deployment
  3. Tools of the Trade
  4. Preparing the Data
  5. Classification Models
  6. Segmentation and Association Models
  7. Sequence and Regression Models
  8. Data Mining Using Excel and Big Data
  9. Tuning the Models
  10. Troubleshooting

 

Below is a summary of the things from each chapter that I think are interesting and/or unique.

Chapter 1 – Identifying, Staging, and Understanding Data

I like that several AdventureWorks-based dimensional data loading scripts are provided in order to give the reader a taste of what a real-world ETL process might look like. I wish I had had access to something like this over a decade ago when I was just starting with DW.

Chapter 2 – Data Model Preparation and Deployment

This chapter makes references to several excellent outside sources that provide much more in-depth treatment of the topics discussed. Example: http://www.kdnuggets.com/

Chapter 3 – Tools of the Trade

This is a brief introduction to the BI Stack; if you are not already familiar with it, I would highly recommend picking up a more beginner-oriented book. I do like that data warehousing is included in a data mining book, because in the real world I see too much separation between traditional IT/DW and Data Science/Data Mining/Statistics professionals. I believe that having both work together will only improve the final result.

Chapter 4 – Preparing the Data

I think that it is great that this chapter dives into the details of extracting data from Oracle and IBM. In the real world, most data warehouses or data mining solutions do not extract all of their data from just one source (SQL Server).

Chapters 5 – 7

These chapters provide a good overview of the various SQL Server Data Mining algorithms. I really like that they also include techniques for tuning the algorithms in order to improve their accuracy.

Chapter 8 – Data Mining Using Excel and Big Data

While the Data Mining plugin for Excel has been around for a number of years now, this chapter also provides an in-depth discussion of the newer data mining tools, namely HDInsight and Azure Machine Learning. I would also highly recommend Predictive Analytics with Microsoft Azure Machine Learning: Build and Deploy Actionable Solutions in Minutes for those who are just getting started with AzureML.

Chapter 9 – Tuning the Models

This is perhaps my favorite chapter because it uses real-world data. No AdventureWorks in sight!!! Plus, the dataset (the Housing Affordability Data System (HADS)) is also very interesting because I am looking at purchasing a house in the near future. I have spent many hours with the dataset making additional data mining discoveries beyond those covered in the book.


APS AU 3 Documentation

December 15, 2014

Microsoft just released the APS AU 3 Documentation. This is the first time that Microsoft has shared detailed information about APS/PDW with the public. In the good old days (like when I was working on PDW v1 implementations) this documentation was not shared with the public, hence the absence of PDW/APS information on my BLOG even though I have been working with it for several years.

The documentation serves not only as the “bible” for those who already have an APS/PDW but also as an excellent introduction to Massively Parallel Processing (MPP) systems. See the example below:

APS

The willingness of Microsoft to release this information signals that the product is maturing.

SQL Server 2014 Fast Track

November 25, 2014

Over the past several months the major hardware vendors have been hard at work developing detailed documentation around how to build an optimal data warehouse using SQL Server 2014. My friend and MSFT APS (PDW) TSP James Serra (you get a prize for knowing what all that means) has already blogged about it here. But I would like to add a few of my own observations as well.

HP:

HP DL380 Gen8 and HP LE PCIe Workload Accelerator 28TB/45 TB Data Warehouse Fast Track Reference Architecture

HP DL580 Gen 8 and HP LE PCIe Workload Accelerator 90TB Microsoft SQL Server Data Warehouse Fast Track Reference Architecture

Generally speaking, I have been impressed over the last few years with HP’s dedication in the DW space. This is based on having worked closely with them in the past with FastTrack and PDW/APS. I like that they are making use of JBOD storage, which is also used in the APS (a.k.a. PDW).

 

HP sql server 2014

 

JBOD is a much more efficient approach; it still performs the same as the classic storage array/enclosure with storage controllers/processors, but usually at a lower cost.

I am also happy to see that they stayed with the strategy of spreading database files from multiple file groups over the LUNs/mount points.

hp sql server 2014 2

This is the same basic setup that I cover in my DW videos: Data Warehouse 5 Parallelism.
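A stripped-down sketch of that file layout (the paths and sizes are made up): one data file per LUN/mount point in the filegroup, so large sequential scans are striped across all of the storage.

-- Hypothetical DW database with its data filegroup spread across four mount points
CREATE DATABASE EDW
ON PRIMARY
    (NAME = edw_sys,   FILENAME = 'C:\SQLSYS\edw_sys.mdf',      SIZE = 1GB),
FILEGROUP FG_DATA
    (NAME = edw_data1, FILENAME = 'C:\MP01\DATA\edw_data1.ndf', SIZE = 100GB),
    (NAME = edw_data2, FILENAME = 'C:\MP02\DATA\edw_data2.ndf', SIZE = 100GB),
    (NAME = edw_data3, FILENAME = 'C:\MP03\DATA\edw_data3.ndf', SIZE = 100GB),
    (NAME = edw_data4, FILENAME = 'C:\MP04\DATA\edw_data4.ndf', SIZE = 100GB)
LOG ON
    (NAME = edw_log,   FILENAME = 'C:\SQLLOG\edw_log.ldf',      SIZE = 50GB);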

I was much less impressed with the EMC FTDW solution; it smells of EMC trying to force fit its VNX 5600 storage platform into the DW world. Most of the document feels like a VNX OLTP sales pitch that stresses things that are not major considerations for a DW solution.

Example:

p.9 “It allows the storage system to provide flash-drive class performance to data with a high locality of reference, which increases IOPS without placing all the data onto flash drives”. This sounds great for an OLTP solution (random reads/writes) but not for DW (sequential reads/writes).

“Multicore Fully Automated Storage Tiering (FAST) Cache reduces the load on the hard drive (HDD) by absorbing I/O bursts from applications…” Again, this sounds good for OLTP but is not applicable for DW. How do I/O bursts work for a DW? How would auto-tiering handle large analytical/aggregate queries, e.g. last year’s sales aggregated by quarter? Where is the I/O hot spot there?

However, if EMC is your corporate standard, you may have little choice in the matter, and so choosing their FastTrack solution may be your best option.

In summary

Not much has changed in the SQL Server 2014 FastTrack, other than that Clustered Columnstore Indexes are now updatable and give world-class performance. I think that JBOD, much to some SAN vendors’ dismay, is the future of at least DW storage, if not OLTP storage as well.

Tabular and NUMA

July 30, 2014

Recently a great white paper entitled “An Analysis Services Case Study: Using Tabular Models in a Large-scale Commercial Solution” was released, which provides a lot of information that is helpful when one is faced with evaluating the various Microsoft BI tools (SSAS Tabular, SSAS Multidimensional, and ColumnStore Indexes).

I will not reproduce the whole white paper on my BLOG but below are some quotes from the white paper that I found particularly interesting:

ColumnStore Indexes:

“Unpredictable Query speed. In SQL 2012, specific query plans can really take advantage of columnstores (which are an absolute need for the volume of data of inContact). However, as soon as the query became complex (for example, mixing SUM with DISTINCT COUNT in the same query) SQL Server stopped using the columnstore and reverted to row mode, leading to poor performance. It looked nearly impossible to produce a general-purpose query engine that gave enough freedom to users in expressing their query while retaining efficiency.”

While I too had problems with SQL Server 2012 Non-Clustered ColumnStore Indexes and the query optimizer picking the much slower “row mode” instead of “batch mode”, after using SQL Server 2014 Clustered ColumnStore Indexes for almost a year, I have not seen any major issues related to “batch mode”. As a matter of fact, I have even observed about a 15% – 30% performance improvement of Clustered ColumnStore Indexes over the non-clustered variety.

The section that was by far the most surprising to me was the one entitled “The complexity and ubiquity of NUMA”, as it reminded me of something that I ran across a few years ago when I was testing the then brand new SSAS Tabular on NUMA servers.

As the article clearly states:

“Most of the servers available on the market with enough expansion to handle terabytes of RAM are based on the Non Uniform Memory Access (NUMA) architecture. This is a big issue because Tabular is not NUMA aware.”

In other words, any enterprise-class hardware will make use of NUMA nodes, so putting a non-NUMA-aware app on this kind of hardware would be the wrong approach. What then follows in the article is a good explanation of what NUMA is and why it is important. The final two paragraphs really drove home for me why SSAS Tabular, without a significant workaround, is not enterprise worthy:

“As you can see, programmers need to optimize the software to run efficiently on NUMA nodes. You can’t simply use any software on NUMA and expect good performance. The problems of memory access speeds are compounded when the software relies heavily on RAM.

Unfortunately, not only is Tabular not NUMA aware, it also uses RAM as primary storage. Thus, the marriage between Tabular and NUMA is a not an easy one.”

While this is a bit of an understatement, it is now easy to find out whether your server has multiple NUMA nodes by simply opening Task Manager to the CPU tab. If your server has multiple NUMA nodes, you will see CPU usage for each node. See the example below of an 8 NUMA node server:

numa nodes
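You can also check from T-SQL; a quick sketch using the sys.dm_os_nodes DMV (each ONLINE row is a NUMA node visible to SQL Server):

-- One row per NUMA node that SQL Server can see (excluding the hidden DAC node)
SELECT node_id, node_state_desc, memory_node_id, online_scheduler_count, active_worker_count
FROM sys.dm_os_nodes
WHERE node_state_desc <> 'ONLINE DAC';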

 

The solution to the Tabular NUMA node problem is (assuming that it is possible) to break the data up into separate distributions according to the number of NUMA nodes on the server and then create an equal number of VMs, each with an instance of SSAS Tabular. See below:

“· Use Hyper-V VMs to set node affinity. Tabular might not be NUMA aware, but it runs great if you install a tabular instance on VM that is affinitized to a particular node. This approach has an added benefit; you don’t need to adjust memory configuration settings.

· Distinct count performs better in a Tabular implementation. This is a well-known fact about Tabular, but if you are new to the technology, it’s worth remembering. This is especially true if the alternative you are evaluating is SSAS Multidimensional.”

Columnstore: Gooooooal!!!

June 22, 2014

In the spirit of the World Cup, I am pleased to announce that the current leader of the TPC-H Performance Results is… Microsoft SQL Server 2014 for the 1 TB, 3 TB, and 10 TB Data Warehouse.

How did SQL Server manage this virtual sweep over all of its competitors? Well, you can download all the code and configurations that were used by each winning system. For example, the 10 TB system’s files are located here: TPC-H Result Highlights.

Of course, a lot of hardware and software configuration went into each of these systems, but the thing that pushed SQL Server above the crowd was the Clustered Columnstore Index. See the code below for the “LINEITEM” table, which in the TPC-H standard is similar to a fact table. (Unfortunately, TPC has no true official star schema standard.)

-- Create Clustered ColumnStore Index on LINEITEM on all columns
print 'Start COLUMNSTORE L_SHIPDATE_IDX'
SELECT SYSDATETIME() AS 'SYSDATETIME'

create clustered columnstore index L_SHIPDATE_IDX on LINEITEM
with (drop_existing = on, maxdop = 8)
on DATESCHEME(L_SHIPDATE)

print 'Finish COLUMNSTORE L_SHIPDATE_IDX'
SELECT SYSDATETIME() AS 'SYSDATETIME'

There is a lot that can be learned from the documentation of these systems (Joe Chang has an excellent post, “TPC-H Benchmarks on SQL Server 2014 with Columnstore”), but there might also be a few hidden surprises.

1. Hard Disk Drives are not dead

Contrary to popular belief, Solid State Drives will not solve your performance/scalability issues, especially for scan-centric Data Warehouse workloads (don’t let any hardware vendor tell you differently). There is a reason why the vast majority of the FastTrack Data Warehouse appliances use HDDs. Therefore, it should come as no surprise that the three award-winning systems in the TPC-H standard all use good old spinning disks.

Storage Subsystem Disk Drives (528 total): Qty 56: 146GB 6G SAS 15K; Qty 472: 300GB 6G SAS 15K.

 

While that is a hell of a lot of disks, keep in mind that these systems are also graded on cost and energy consumption, and even with those criteria HDDs still win for DW workloads.

2. Table Partitioning

“Horizontal partitioning is used on LINEITEM and ORDERS tables and the partitioning columns are L_SHIPDATE and O_ORDERDATE. The partition granularity is by week.”

Even with the substantial benefit of Clustered Columnstore segment elimination, careful table partitioning can still have a significant effect on performance. Keep in mind that correct table partitioning requires in-depth knowledge of query predicate values. With TPC-H this is easy to do since there are only a few dozen queries; in the real world it will require a lot more work.
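A trimmed-down sketch of what that weekly partitioning might look like (the function name, boundary values, and filegroup mapping are simplified for illustration; the actual TPC-H kit scripts spell out every week and spread partitions across filegroups):

-- Weekly partition function/scheme for LINEITEM on L_SHIPDATE (boundaries abbreviated)
CREATE PARTITION FUNCTION pf_ShipDateWeekly (date)
AS RANGE RIGHT FOR VALUES ('1992-01-06', '1992-01-13', '1992-01-20' /* ...one per week... */);

CREATE PARTITION SCHEME DATESCHEME
AS PARTITION pf_ShipDateWeekly ALL TO ([PRIMARY]);

-- The clustered columnstore index shown earlier is then created ON DATESCHEME(L_SHIPDATE),
-- so partition elimination works alongside columnstore segment elimination.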

Total Row with Grouping Sets

May 26, 2014

In a previous post I mentioned that under certain situations (CPU-intensive calculations like log() and exp()) Columnstore Indexes can outperform SSAS MOLAP.

However, creating result sets in T-SQL may be a little more complicated than with MDX. In MDX, for example, dimension attributes can have an “All” member. This “All” member can be used to create result sets that aggregate fact data by a dimension attribute and then have a “Total” or “All” row appended to the bottom. This “All” row represents the aggregate of all the members from the dimension(s).

In the simple example below I have aggregated the LINEITEM fact table from the TPC-H standard by SUPPKEY and also appended a “Total” or “All” member row.

SUPPKEY   Quantity
630       16026
7697      15288
8776      16089
3751      15406
Total     62809

However, getting the Total row to appear at the bottom is not as straightforward as it is with MDX. There are several ways to produce the Total grouping/row in T-SQL, but I would also like to make the best, or most efficient, use of the Columnstore Index that is on the table.

The options:

1.   Common Table Expression

with CTE (L_SUPPKEY, Quantity) as
(
       select [L_SUPPKEY]
             ,sum([L_QUANTITY]) as Quantity
       from [dbo].[LINEITEMs1]
       where [L_SUPPKEY] in (3751,8776,630,7697)
       group by [L_SUPPKEY]
)
select cast(L_SUPPKEY as char) as SUPPKEY, Quantity
from CTE
union all
select 'Total', sum(Quantity)
from CTE

While this might seem like a good option, if you check out the execution plan you see that the query actually scans the fact table 2 TIMES, which will at least double the CPU time.

table scan

2. Table Variables

declare @tbl as table (SUPPKEY char(20), Quantity int)

insert into @tbl
select [L_SUPPKEY], sum([L_QUANTITY]) as Quantity
from [dbo].[LINEITEMs1]
where [L_SUPPKEY] in (3751,8776,630,7697)
group by [L_SUPPKEY]

select SUPPKEY, Quantity
from @tbl
union all
select 'Total', sum(Quantity)
from @tbl

Even though the result set is small, this is a terrible option because inserting into a table variable forces the query into “row mode”, which is much slower than “batch mode”.

row mode

3. Temporary Tables

select cast([L_SUPPKEY] as char) as SUPPKEY, sum([L_QUANTITY]) as Quantity
into #temp
from [dbo].[LINEITEMs1]
where [L_SUPPKEY] in (3751,8776,630,7697)
group by [L_SUPPKEY]

select SUPPKEY, Quantity
from #temp
union all
select 'Total', sum(Quantity)
from #temp

While this is a pretty good option, it seems a little excessive to write the small result set all the way back down to disk (TempDB) just to pull it back again for the Total row calculation. That is an unnecessary use of IO. Why go all the way back to disk when it is not required?

4. Grouping Sets 

select isnull(cast([L_SUPPKEY] as char), 'Total') as SUPPKEY
      ,sum([L_QUANTITY]) as Quantity
from [dbo].[LINEITEMs1]
where [L_SUPPKEY] in (3751,8776,630,7697)
group by grouping sets (([L_SUPPKEY]), ())

This is the best option because it has none of the limitations of the above options.
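As a side note (my own addition, not part of the original comparison): for a single grouping column, ROLLUP is just shorthand for the same pair of grouping sets, so the query below produces the same result.

-- ROLLUP over one column expands to grouping sets (([L_SUPPKEY]), ())
select isnull(cast([L_SUPPKEY] as char), 'Total') as SUPPKEY
      ,sum([L_QUANTITY]) as Quantity
from [dbo].[LINEITEMs1]
where [L_SUPPKEY] in (3751,8776,630,7697)
group by rollup([L_SUPPKEY])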

SQL Saturday Houston

April 29, 2014

If you live in the Houston area, I will be speaking at SQL Saturday Houston #308 on May 10th.

SQLSAT308_SPEAKING

I will be giving the following 2 presentations:

Data Warehouse ETL

“OLTP versus DW ETL Strategies (Availability Groups/Mirroring, Change Data Capture, SSIS, FlatFiles). SSIS Synchronous & Asynchronous transforms. Demo SSIS performance. ETL versus ELT. Demo FileTable.”

Measuring Data Warehouse Performance

“DW Hardware Architecture Comparisons (SAN, DAS). Throughput (MB/sec) versus IOPS. Demo SQLIO – block size, sequential & random reads/writes, IO Requests. Query IO/Time Statistics. Demo Columstore Index Segment Elimination. Demo Extended Events.”

Of course, many of these topics will be familiar to you if you have read much of my BLOG or checked out my YouTube Channel. While I do like presenting and meeting new people, I recorded the videos for those who can’t make it to my presentations.

If you can make it to SQL Saturday, don’t hesitate to say hi and bring your hardest DW questions!

