Azure and Open Source

November 2, 2015

It may come as a surprise to some that Microsoft has been working with open source solutions like Linux for a number of years now. Azure is a good example of how Microsoft incorporates open source into its offerings.

  • Microsoft showcases the Azure Cloud Switch – Azure Cloud Switch (ACS) is a Linux-based, open source switch software stack built on the Switch Abstraction Interface (SAI), and it allows Microsoft data centers to more efficiently manage the networking behind the many applications running on Azure.

Azure Cloud Switch

  • Microsoft Extends Commitment to Open Compute Project – Want to know the exact hardware that  Azure and Facebook are running in their Data Centers? Take a look here:

azure hardware


AzureCon is a great way to get up to speed with all the things that are going on with Azure. Below are some of my favorite sessions:

  • Fundamentals of Revolution R Enterprise – I have been using R for a while now, and while it has a great collection of packages, it also has some serious limitations. To begin with, base R works only in-memory (no disk storage is available during data processing) and it is single-threaded (there are workarounds, but they take a lot of work to get right). Revolution R Enterprise is multi-threaded and has an on-disk representation (the .xdf file). It is interesting to note that the .xdf data file is columnar.
  • Data Warehousing In Cortana Analytics – SQL Data Warehouse (SQL Data Warehouse documentation) is the Azure version of PDW (Parallel Data Warehouse), and both share the same code base (APS Documentation).
  • Deep Neural Networks – The Neural Network modules in Azure Machine Learning have a script input for defining the topology. The scripting language is called Net#. With Net# you can control the number of layers and how they are connected, including transformations like convolutions. This means that these settings can easily be shared across different models without having to build each model from scratch.
  • Data Science with Microsoft Azure and R – I recently completed the six-week class (edX Data Science and Machine Learning Essentials), which is a great introduction to Azure Machine Learning for Data Scientists.

Stories and Data

October 2, 2015

Which of the following was the bloodiest century in all of human history?

A. 8th

B. 13th

C. 17th

D. 20th

The answer is obviously the 20th, but where does that answer come from? In short, it comes from the stories that we tell ourselves about the 20th century. Without much effort virtually anyone can come up with a list of many of the atrocities of that blood-soaked century. Some of the most common examples from the 20th century: the brutal invasions of Hitler, Mussolini, Stalin, Pol Pot, Mao Zedong, and Imperial Japan; Stalin's purges; the Gulag; the Holocaust; and two atomic bombings, just to name a few. All this and I have not even mentioned World War 1, to say nothing of the more recent wars in Korea and Vietnam.

The unprecedented level of killing in the 20th century was enabled by many technological advances: highly industrialized and mechanized warfare, for example the industrial production of mass quantities of guns (like the AK-47), ammunition, tanks, bombs, land mines, planes, long-range artillery, and warships; the widespread use of machine guns and incendiary bombs; and weapons of mass destruction (nuclear, biological, and chemical). The data appear to tell the same story as well. For example, atrocitologist Matthew White, in his book The 100 Deadliest Episodes in Human History, meticulously enumerates the 100 worst things that humans have ever done to each other. When these atrocities are arranged by death toll in descending order, it is not hard to predict which century comes out on top.

Top 30 Atrocities (20th Century in red)


Furthermore, consider the apparent rise of violence in TV programs, movies, and video games; the War on Terrorism (the longest war in US history at over 10 years); recent mass shootings (Columbine High School (1999), Virginia Tech (2007), the Aurora movie theater (2012), Sandy Hook Elementary (2012)); the never-ending violence on the nightly news; and many other recent events. It would seem only logical to conclude that violence will continue to rise until our modern society either falls into a Mad Max style post-apocalyptic world (a repeat of the "Dark Ages") or maybe some cosmic (divine?) being will take pity on us and save the world from itself (or save just the "chosen" ones?).

Both the data and the stories seem to be pointing in the same direction.

Or do they? Are the data telling stories, or are our stories causing us to misunderstand the data? Take a look again at the list of Top 30 Atrocities. While number 1 (World War 2) is from the 20th century, out of the top 10 atrocities only 4 are from the 20th century. That should give you pause. Things might not be as simple as they seem. Additionally, does it make sense to compare the top 2 atrocities (World War 2 and Genghis Khan) in absolute terms when they were so far apart in time (more than 500 years apart)? Wasn't the world population a lot lower in the 13th century than in the 20th? The 40 million deaths in the 13th century could represent a larger portion of the world population than the 65 million during World War 2 in the 20th century.

Calculating each death toll as a percentage of the world population at the time tells a very different story.

Top 30 Atrocities Scaled by World Population (20th Century in red)

Top 30 Atrocities by World Population

I used the UN and HYDE world population estimates with spline interpolation to scale each atrocity by the world population at its midpoint (middle year). See Our World in Data for more information.
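For anyone who wants to reproduce the scaling, the calculation itself is straightforward once the interpolated population estimates are in a table. Below is a minimal T-SQL sketch; the table and column names (dbo.Atrocity, dbo.WorldPopulation) are hypothetical.

-- Hypothetical sketch: scale each atrocity's death toll by the world population
-- at its midpoint year, assuming the interpolated UN/HYDE estimates are already
-- loaded into dbo.WorldPopulation ([Year], Population).
SELECT
    a.AtrocityName,
    a.DeathToll,
    a.MidpointYear,
    CAST(a.DeathToll AS FLOAT) / p.Population * 100.0 AS PctOfWorldPopulation
FROM dbo.Atrocity AS a
JOIN dbo.WorldPopulation AS p
    ON p.[Year] = a.MidpointYear
ORDER BY PctOfWorldPopulation DESC;

Roughly 40 million deaths attributed to Genghis Khan against a 13th century world population of around 400 million is where the "about 10%" figure below comes from.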

Now, with atrocities scaled by their contemporary world population, only one 20th century atrocity (World War 2) makes it into the top 10, and 5 of the top 10 are from China or South Asia. A few questions to consider: Did Genghis Khan really kill about 10% of the world's population? Were the Chinese really that much more murderous than 20th century Europeans and Asians (Hitler, Stalin, Mao Zedong)?

One of the issues here is data collection. Such a simple thing as literacy, the ability to record any event whether it be good or bad, can have a dramatic impact on our ability to enumerate death tolls. Relatively high levels of literacy are a modern phenomenon, and for that reason many of the major and minor atrocities of the past may simply have gone under-recorded or, most likely, completely unrecorded. Fortunately, or unfortunately depending on how you look at it, China is generally known to have had unusually high literacy rates throughout much of its history, unlike Europe for example.

Another problem is that of exaggeration. Typically it is the winner who gets to write history, and winners tend to exaggerate the number of enemies that were slain. In many cases a story about how the severely outnumbered heroes miraculously defeated the numerous enemy horde will be more appealing than the reality, which might not get recorded at all.

The above examples are just a couple of the many issues that historians and analysts alike have to deal with and compensate for in their daily work. Professional historians employ rigorous methodologies, and their results are highly scrutinized and, in many cases, highly criticized as well.

Over the years I have become wary of simple and obvious answers to complex and far-reaching questions.

Thinking critically is hard to do, and there are many cognitive traps (Thinking, Fast and Slow) that anyone, no matter how intelligent, can fall into. As a Business Intelligence (now renamed Advanced Analytics) professional, I have to continually scrutinize processes such as data collection and data processing, in addition to the reporting logic within the solutions that I build. This also includes thinking critically about, and trying to compensate for, both organizational biases and my own personal biases.

FYI, if you really want to know whether or not the world is getting more violent, I suggest reading The Better Angels of Our Nature. There are no simple answers, but the historical trends may surprise you. However, as any good financial advisor will tell you, "Past performance is not a guarantee of future results."

In-Memory Technologies Comparison

September 8, 2015

If you have been to a database vendor conference within the last few years, you have no doubt heard a lot about "In-Memory Technologies". These technologies are hailed as the solution to all of your OLTP and Data Warehouse performance problems: just buy their (expensive) hardware/software appliances and performance will no longer be an issue. While it is true that having a database engine access data from memory is generally faster than accessing it from disk, not all in-memory technologies are created equal. Each one has its own unique strengths and weaknesses.

I am sometimes asked how other competing technologies compare to SQL Server (either In-Memory OLTP or Columnstore/Batch Mode). Admittedly, my knowledge of those competing technologies is relatively shallow compared to my knowledge of SQL Server, and it is sometimes very difficult to get past the marketing hype of an in-memory technology to figure out what it does well and what it does not do well.

Fortunately, Microsoft has recently released an excellent white paper that can be invaluable for those who are trying to evaluate the merits of the major in-memory technologies.

SQL Server In-Memory OLTP and Columnstore Feature Comparison

SQL Server Columnstore vs. Oracle

SQLServerCS vs Oracle

SQL Server In-Memory OLTP vs. Oracle Times Ten

SQLServerOLTP vs Oracle

SQL Server vs. SAP Hana

SQLServer vs SAP Hana

SQL Server vs IBM DB2

SQLServer vs IBM DB2

Data Science and SQL

August 28, 2015

No matter which buzz words you use (Knowledge Discovery, Data Mining, Predictive Analytics, Advanced Analytics, Machine Learning etc.) the once obscure discipline of Data Science has become a red hot topic in recent years. There are now many software solutions offered by an ever increasing number of vendors.

Within the last year Microsoft has made, in my estimation, a substantial contribution to the data science ecosystem with its cloud-based Azure Machine Learning. AzureML is impressive in both its depth and breadth (Azure ML Gallery). However, I am still often met with disbelief when I mention that SQL Server has natively supported some of these capabilities for over a decade now (SQL Server 2005 Data Mining).

With the rise in popularity of Data Science, more companies than ever are looking for people with experience in languages like "R" and Python. Data Science is now also being caught up in the "Big Data" buzz. I frequently hear "Big Data" solutions like Hadoop touted as the be-all and end-all of data science. Don't get me wrong: Hadoop can play an instrumental, and in some cases critical, role in developing a predictive model (labels and features), but these models typically don't magically materialize out of HDFS any more than they do out of NTFS.

The harsh reality is that creating a good predictive model usually requires a lot of work (and some amount of luck). A model's feature and label data typically have to be integrated from many disparate sources, cleaned, and then conformed to specific categorical/discrete values. Similar to data warehouse ETL development, feature development can easily consume more than half of the development effort of the entire solution. Failure to devote the appropriate resources to either feature development (or ETL) often results in failure. "Garbage in, garbage out" still applies.

Many people also fail to recognize the value that SQL, or Relational Database Management Systems in general, can provide to the feature development process. Data Warehouse and Business Intelligence professionals have been integrating, cleaning, and conforming data with SQL for decades and can provide valuable context to business data as well. The following is an example in which I took an existing AzureML experiment and replaced the feature development code section (originally written in Python) with SQL. Not only did SQL greatly reduce the feature data processing time from over 2 hours to less than 1 hour, I was also able to easily and quickly develop almost double the number of features (from 30 to 70+).

The additional features increased the experiment's accuracy from 0.873 to 0.910.


Note that this experiment is based on a 2015 KDD competition. In other words, this is real data from a real-world problem; actual data scientists develop solutions like this one all the time. This is not carefully crafted demo data.

Below is a high level outline of how I modified the experiment:

  1. Identified the feature development code (Python) and downloaded its output so I could compare it against the SQL-based feature output and ensure that both matched.
  2. Loaded the source CSV data into SQL Server (enrollment_test, enrollment_train, log_test, log_train, object, truth_train).
  3. Wrote SQL to create the 70+ features ( KDD2015FeaturesTrain.sql , KDD2015FeaturesTest.sql ); a simplified sketch follows this list.
  4. Uploaded the SQL feature output (both training and test datasets) to AzureML and used the datasets as sources in the experiment.
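To give a flavor of what the SQL feature queries look like, here is a much-simplified sketch of the kind of per-enrollment aggregation involved. The column names follow the public KDD Cup 2015 log schema and are only an approximation of what the actual KDD2015FeaturesTrain.sql script does.

-- Simplified, approximate sketch of per-enrollment feature aggregation.
SELECT
    e.enrollment_id,
    COUNT(*)                                               AS total_events,
    COUNT(DISTINCT CAST(l.[time] AS DATE))                 AS active_days,
    SUM(CASE WHEN l.[event] = 'video'   THEN 1 ELSE 0 END) AS video_events,
    SUM(CASE WHEN l.[event] = 'problem' THEN 1 ELSE 0 END) AS problem_events,
    DATEDIFF(DAY, MIN(l.[time]), MAX(l.[time]))            AS days_first_to_last
FROM dbo.enrollment_train AS e
JOIN dbo.log_train        AS l
    ON l.enrollment_id = e.enrollment_id
GROUP BY e.enrollment_id;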

Modified Experiment:

sql experiment

Lift Chart:

lift chart

This real-world example demonstrates that SQL (RDBMS) and Data Warehouse/Business Intelligence professionals can make significant contributions to predictive modeling and data science, especially in the area of feature development.

SQL Server 2016

May 13, 2015

Last week SQL Server 2016 was announced on the SQL Server BLOG. While there are several new features mentioned on the SQL Server Fact Sheet, I would like to focus on just those that affect Data Warehousing and those that indicate a new direction for SQL Server towards "Advanced Analytics".

Many of these features are covered and demonstrated in the Ignite (TechEd) video New Features SQL Server 2016.

Real-Time Operational Analytics & In-Memory OLTP


As implied by the name, Non-Clustered Columnstore Indexes have now been added to In-Memory OLTP (a.k.a. Hekaton). The Ignite video mentions that adding the reporting queries (served by the columnstore index) to the In-Memory table adds little to "no cost to the system". While this may seem like a far-fetched claim to those who have actually had to performance-tune an OLTP application that is getting bogged down by reporting queries, it really is not all that unbelievable.

The reason why reads (reporting queries) do not block OLTP application DML operations (updates, deletes, and inserts), and vice versa, is that the In-Memory OLTP structure does not use latches to maintain data consistency.

In-Memory OLTP (In-Memory Optimization) – “The In-Memory OLTP engine is designed for extremely high session concurrency for OLTP type of transactions driven from a highly scaled-out middle-tier. To achieve this, it uses latch-free data structures and optimistic, multi-version concurrency control. The result is predictable, sub-millisecond low latency and high throughput with linear scaling for database transactions.”

This means that an application can support near real-time operational reporting without impacting the application's performance. In other words, you *may* not have to build an ODS with, say, Replication or AlwaysOn Availability Groups feeding data to the ODS.

*** Provided that it fits in memory
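To make the idea concrete, here is a minimal sketch of a SQL Server 2016 memory-optimized table with a columnstore index declared alongside the OLTP index. It assumes the database already has a memory-optimized filegroup and the names are made up; note that on memory-optimized tables the analytics index is declared as a clustered columnstore, while disk-based tables get an updatable nonclustered columnstore instead.

-- Hypothetical memory-optimized table that also supports analytic queries.
CREATE TABLE dbo.SalesOrders
(
    OrderID    INT IDENTITY NOT NULL PRIMARY KEY NONCLUSTERED,
    CustomerID INT          NOT NULL,
    OrderDate  DATETIME2    NOT NULL,
    Amount     MONEY        NOT NULL,
    INDEX ix_SalesOrders_cs CLUSTERED COLUMNSTORE  -- serves the reporting queries
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);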

However, this does not mean that you can get away with not building a data warehouse. You will still need a data warehouse to do all the things that a data warehouse does so well, i.e. consolidate, clean, and conform data from multiple source systems. Most data warehouses are far too large to fit into memory.

Built-in Advanced Analytics

advanced analytics

Basically, "Advanced Analytics" is SQL Server integrated with "R" packages. While I think that SQL Server integrated with "R" is a great step (I have been using R ever since I was introduced to it in my college stats classes), there are a lot more things that can be done with R. Just browse CRAN to get an idea.
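For those curious what the integration looks like, the new sp_execute_external_script procedure lets you push a T-SQL result set into an R script and get a rowset back. A minimal sketch (the table name is hypothetical, and the feature has to be enabled with sp_configure 'external scripts enabled' first):

-- Run a tiny R script against the rows returned by the input query.
EXEC sp_execute_external_script
    @language     = N'R',
    @script       = N'OutputDataSet <- data.frame(MeanSales = mean(InputDataSet$SalesAmount));',
    @input_data_1 = N'SELECT SalesAmount FROM dbo.FactInternetSales'
WITH RESULT SETS ((MeanSales FLOAT));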

I hope that BI professionals don't lose sight of all that is possible with good old SQL.

For example, in Itzik's latest book, T-SQL Querying (Developer Reference), there is an entire chapter devoted to BI. Just check out the topics covered:

Frequencies, Descriptive statistics for continuous variables, chi-squared, analysis of variance, moving averages and entropy, histograms etc.
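As a small illustration of the window-function style behind several of those topics, here is a 7-day moving average sketched against a hypothetical daily sales table:

-- 7-day moving average using a window frame (table and columns are made up).
SELECT
    OrderDate,
    DailySales,
    AVG(DailySales) OVER (
        ORDER BY OrderDate
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS MovingAvg7
FROM dbo.DailySalesSummary
ORDER BY OrderDate;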


SQL Server and Hadoop integration: after working with PolyBase in PDW/APS, I am happy that it is now making its way into SQL Server SMP. With PolyBase, a data warehouse can now use Hadoop (HDFS) as a data source and/or a data archive.
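For anyone who has not seen PolyBase before, the pattern is to define an external data source and file format and then create an external table over the HDFS directory; afterwards the table can be queried, or joined to local tables, like any other. The sketch below is illustrative only: the addresses, paths, and names are made up and the exact options vary by release.

-- Point SQL Server at a Hadoop cluster (names and locations are hypothetical).
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.0.0.1:8020');

CREATE EXTERNAL FILE FORMAT PipeDelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- Archived web logs sitting in HDFS, exposed as a regular-looking table.
CREATE EXTERNAL TABLE dbo.WebLogArchive
(
    LogDate    DATETIME2     NOT NULL,
    Url        NVARCHAR(400) NOT NULL,
    DurationMs INT           NOT NULL
)
WITH (LOCATION = '/archive/weblogs/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = PipeDelimited);

SELECT TOP (100) * FROM dbo.WebLogArchive;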

Azure SQL Data Warehouse


James Serra on his BLOG provides some valuable information about Azure SQL Data Warehouse which is basically PDW/APS in the cloud.

One of the most interesting things about Azure SQL Data Warehouse is that it can be integrated with Hadoop and Azure Machine Learning/"R" to produce some very interesting solutions. See: Azure SQL Data Warehouse: Deep Dive
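To give a sense of the PDW heritage, in Azure SQL Data Warehouse (as in APS) you decide how each table is spread across the compute nodes when you create it. An illustrative sketch with made-up names:

-- Hash-distribute a fact table and store it as a clustered columnstore.
CREATE TABLE dbo.FactSales
(
    SaleDateKey INT   NOT NULL,
    ProductKey  INT   NOT NULL,
    CustomerKey INT   NOT NULL,
    SalesAmount MONEY NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey),
      CLUSTERED COLUMNSTORE INDEX);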

Data Loading performance considerations with Clustered Columnstore indexes

March 16, 2015

The SQL CAT team recently released an excellent article called "Data Loading performance considerations with Clustered Columnstore indexes". It is useful even for those who, like me, have been working with Clustered Columnstore Indexes for several years now. The "Overview of Loading Data into Columnstore Indexes" section mentions that an index REORG can affect batch size. In my experience, inconsistent performance can sometimes be caused by a batch being too small, i.e. less than the 102,400-row minimum, so the tuple mover may not be triggered in time, leaving data in the slower row-oriented delta store instead of in compressed rowgroups that can be processed in the more efficient batch mode. For small to medium-sized fact tables that may not always exceed the 102,400-row minimum, it may be necessary to execute an index REORG after the normal data loading process completes, which triggers the tuple mover immediately. The disadvantage of this approach is that the rowgroup sizes may not be optimal over time because they were forced to be smaller than they could have been. If performance starts to suffer, a periodic index REBUILD routine will rebuild the index/table with rowgroups closer to the 1 million rows-per-rowgroup goal.
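For reference, the two maintenance operations mentioned above look like this (the index and table names are made up); the REORGANIZE nudges the tuple mover to compress eligible delta rowgroups right away, while the heavier REBUILD recreates the rowgroups closer to their target size.

-- After a small trickle load: compress eligible delta rowgroups immediately.
ALTER INDEX cci_FactSales ON dbo.FactSales REORGANIZE;

-- Periodically (e.g. monthly), rebuild to get well-sized rowgroups back.
ALTER INDEX cci_FactSales ON dbo.FactSales REBUILD;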

Additionally, I still, from time to time, see a lot of confusion about the distinction between an Operational Data Store (ODS) and a Data Warehouse. James Serra recently wrote an excellent BLOG post about Operational Data Stores that captures their most important functions. In my opinion, one of the easiest and most effective ways to create an ODS is to simply set up an asynchronous SQL Server AlwaysOn Availability Group secondary. Several of the Availability Group databases could form the core of an ODS, but they would most likely be a poor substitute for a Data Warehouse.
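As a rough sketch of that approach, the availability group is created with the secondary set to asynchronous commit and opened for read-only connections; the server, endpoint, and database names below are of course made up, and the usual prerequisites (WSFC, endpoints, database seeding) still apply.

-- Asynchronous, readable secondary acting as a lightweight ODS (illustrative only).
CREATE AVAILABILITY GROUP ODS_AG
FOR DATABASE [SalesDB], [InventoryDB]
REPLICA ON
    N'PRIMARYSRV' WITH (
        ENDPOINT_URL      = N'TCP://PRIMARYSRV.corp.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC),
    N'ODSSRV' WITH (
        ENDPOINT_URL      = N'TCP://ODSSRV.corp.local:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = MANUAL,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));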

Mastering SQL Server 2014 Data Mining

January 24, 2015


I recently had the good fortune to collaborate with Amarpreet Bassan and Debarchan Sarkar from Microsoft in writing Mastering SQL Server 2014 Data Mining. This book is unique in that it covers all the major Microsoft tools for Data Warehousing, Data Mining, and Machine Learning. Each of these tools is a major subject area, and many books could be written about each one. For example, just in the area of Data Warehousing there are multiple books devoted to each subcategory (the Microsoft BI Stack), i.e. SSIS (ETL), Dimensional Modeling (The Kimball Group), SSAS (OLAP), SSRS (Reporting), MDS (Master Data Management), etc.

Most books devoted to these topics aim to guide the beginner step by step through the process of building solutions with each of these tools. This book, however, devotes only one chapter to each of those topics. In other words, this book is not, for the most part, aimed at the absolute DW beginner. The table of contents is a good guide to the topics that the book chooses to focus on.

Note: I would highly recommend getting the digital version of the book, as it has full-color graphics, which the print version does not. I would also highly recommend downloading the book's code, since this is not a step-by-step book and it would be very time consuming to reproduce every example from scratch.


  1. Identifying, Staging, and Understanding Data
  2. Data Model Preparation and Deployment
  3. Tools of the Trade
  4. Preparing the Data
  5. Classification Models
  6. Segmentation and Association Models
  7. Sequence and Regression Models
  8. Data Mining Using Excel and Big Data
  9. Tuning the Models
  10. Troubleshooting


Below is a summary of the things from each chapter that I think are interesting and/or unique.

Chapter 1 – Identifying, Staging, and Understanding Data

I like that several AdventureWorks-based dimensional data loading scripts are provided in order to give the reader a taste of what a real-world ETL process might look like. I wish I had had access to something like this over a decade ago when I was just starting with DW.

Chapter 2 – Data Model Preparation and Deployment

This chapter makes reference to several excellent outside sources that provide a much more in-depth treatment of the topics discussed.

Chapter 3 – Tools of the Trade

This chapter is a brief introduction to the BI Stack; if you are not already familiar with it, I would highly recommend picking up a more beginner-oriented book. I do like that data warehousing is included in a data mining book, because in the real world I see too much separation between traditional IT/DW professionals and Data Science/Data Mining/Statistics professionals. I believe that having both work together will only improve the final result.

Chapter 4 – Preparing the Data

I think it is great that this chapter dives into the details of extracting data from Oracle and IBM. In the real world, most data warehouses or data mining solutions do not extract all of their data from just one source (SQL Server).

Chapters 5 – 7

These chapters provide a good overview of the various SQL Server Data Mining algorithms. I really like that they also include techniques for tuning the algorithms in order to improve their accuracy.

Chapter 8 – Data Mining Using Excel and Big Data

While the Data Mining plug-in for Excel has been around for a number of years now, this chapter also provides an in-depth discussion of the newer data mining tools, namely HDInsight and Azure Machine Learning. I would also highly recommend Predictive Analytics with Microsoft Azure Machine Learning: Build and Deploy Actionable Solutions in Minutes for those who are just getting started with AzureML.

Chapter 9 – Tuning the Models

This is perhaps my favorite chapter because it uses real-world data. No AdventureWorks in sight!!! Plus, the dataset (Housing Affordability Data System (HADS)) is also very interesting because I am looking at purchasing a house in the near future. I have spent many hours with the dataset making additional data mining discoveries beyond those covered in the book.




APS AU 3 Documentation

December 15, 2014

Microsoft just released the APS AU 3 Documentation. This is the first time that Microsoft has shared detailed information about APS/PDW with the public. In the good old days (like when I was working on PDW v1 implementations) this documentation was not shared publicly, hence the absence of PDW/APS information on my BLOG even though I have been working with it for several years.

The documentation serves not only as the "bible" for those who already have an APS/PDW, but also as an excellent introduction to Massively Parallel Processing (MPP) systems. See the example below:


The willingness of Microsoft to release this information signals that the product is maturing.

SQL Server 2014 Fast Track

November 25, 2014

Over the past several months the major hardware vendors have been hard at work developing detailed documentation on how to build an optimal data warehouse using SQL Server 2014. My friend and MSFT APS (PDW) TSP James Serra (you get a prize for knowing what all that means) has already blogged about it here, but I would like to add a few of my own observations as well.


HP DL380 Gen8 and HP LE PCIe Workload Accelerator 28TB/45 TB Data Warehouse Fast Track Reference Architecture

HP DL580 Gen 8 and HP LE PCIe Workload Accelerator 90TB Microsoft SQL Server Data Warehouse Fast Track Reference Architecture

Generally speaking, I have been impressed over the last few years with HP's dedication in the DW space. This is based on having worked closely with them in the past on Fast Track and PDW/APS. I like that they are making use of JBOD storage, which is also used in the APS (a.k.a. PDW).


HP sql server 2014


JBOD is a much more efficient approach that still performs about the same as the classic storage array/enclosure with storage controllers/processors, but usually at a lower cost.

I am also happy to see that they stayed with the strategy of spreading database files from multiple filegroups over the LUNs/mount points.

hp sql server 2014 2

This is the same basic setup that I cover in my DW videos: Data Warehouse 5 Parallelism.

I was much less impressed with the EMC FTDW solution; it smells of EMC trying to force-fit its VNX 5600 storage platform into the DW world. Most of the document feels like a VNX OLTP sales pitch that stresses things that are not major considerations for a DW solution.


p.9 “It allows the storage system to provide flash-drive class performance to data with a high locality of reference, which increases IOPS without placing all the data onto flash drives”. This sounds great for an OLTP solution (random reads/writes) but not for DW (sequential reads/writes).

“Multicore Fully Automated Storage Tiering (FAST) Cache reduces the load on the hard drive (HDD) by absorbing I/O bursts forms applications…” Again, this sounds good for OLTP but is not applicable to DW. How do I/O bursts apply to a DW? How would auto-tiering handle large analytical/aggregate queries, i.e. last year's sales aggregated by quarter? Where is the I/O hot spot there?

However, if EMC is your corporate standard, you may have little choice in the matter, and choosing their Fast Track solution may be your best option.

In summary

Not much has changed in SQL Server 2014 Fast Track, other than that Clustered ColumnStore Indexes are now updatable and give world-class performance. I think that JBOD, much to some SAN vendors' dismay, is the future of at least DW storage, if not OLTP storage as well.

Tabular and NUMA

July 30, 2014

Recently a great white paper entitled "An Analysis Services Case Study: Using Tabular Models in a Large-scale Commercial Solution" was released. It provides a lot of information that is helpful when one is faced with evaluating the various Microsoft BI tools (SSAS Tabular, SSAS Multidimensional, and ColumnStore Indexes).

I will not reproduce the whole white paper on my BLOG but below are some quotes from the white paper that I found particularly interesting:

ColumnStore Indexes:

“Unpredictable Query speed. In SQL 2012, specific query plans can really take advantage of columnstores (which are an absolute need for the volume of data of inContact). However, as soon as the query became complex (for example, mixing SUM with DISTINCT COUNT in the same query) SQL Server stopped using the columnstore and reverted to row mode, leading to poor performance. It looked nearly impossible to produce a general-purpose query engine that gave enough freedom to users in expressing their query while retaining efficiency.”

While I too had problems with SQL Server 2012 Non-Clustered ColumnStore Indexes and the query optimizer picking the much slower “row mode” instead of “batch mode”, after using SQL Server 2014 Clustered ColumnStore Indexes for almost a year, I have not seen any major issues related to “batch mode”. As a matter of fact, I have even observed about a 15% – 30% performance improvement of Clustered ColumnStore Indexes over the non-clustered variety.
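For context, the clustered variety replaces the table's storage entirely (and is updatable), whereas the 2012 nonclustered columnstore was an additional, read-only structure on top of a rowstore table. Creating one is a one-liner; the names below are made up.

-- Convert an existing rowstore fact table into a clustered columnstore.
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactCallDetail ON dbo.FactCallDetail;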

The section that was by far the most surprising to me was the one entitled "The complexity and ubiquity of NUMA", as it reminded me of something that I ran across a few years ago when I was testing the then brand-new SSAS Tabular on NUMA servers.

As the article clearly states:

“Most of the servers available on the market with enough expansion to handle terabytes of RAM are based on the Non Uniform Memory Access (NUMA) architecture. This is a big issue because Tabular is not NUMA aware.”

In other words, virtually any enterprise-class server will have multiple NUMA nodes, so running a non-NUMA-aware application on this kind of hardware would be the wrong approach. What follows in the article is a good explanation of what NUMA is and why it is important. The final two paragraphs really hit home for me why SSAS Tabular, without a significant workaround, is not enterprise-ready:

“As you can see, programmers need to optimize the software to run efficiently on NUMA nodes. You can’t simply use any software on NUMA and expect good performance. The problems of memory access speeds are compounded when the software relies heavily on RAM.

Unfortunately, not only is Tabular not NUMA aware, it also uses RAM as primary storage. Thus, the marriage between Tabular and NUMA is a not an easy one.”

While this is a bit of an understatement, it is easy to find out if your server has multiple NUMA nodes by simply opening Task Manager to the CPU tab. If your server has multiple NUMA nodes, you will see CPU usage for each node. See the example below of an 8 NUMA node server:

numa nodes


The solution to the Tabular NUMA node problem is (assuming that it is possible) to break the data up into separate distributions according to the number of NUMA nodes on the server and then create an equal number of VMs, each with an instance of SSAS Tabular. See below:

“· Use Hyper-V VMs to set node affinity. Tabular might not be NUMA aware, but it runs great if you install a tabular instance on VM that is affinitized to a particular node. This approach has an added benefit; you don’t need to adjust memory configuration settings.

· Distinct count performs better in a Tabular implementation. This is a well-known fact about Tabular, but if you are new to the technology, it’s worth remembering. This is especially true if the alternative you are evaluating is SSAS Multidimensional.”

