Posts Tagged ‘Data Mining’

Data Science and SQL

August 28, 2015

No matter which buzz words you use (Knowledge Discovery, Data Mining, Predictive Analytics, Advanced Analytics, Machine Learning etc.) the once obscure discipline of Data Science has become a red hot topic in recent years. There are now many software solutions offered by an ever increasing number of vendors.

Within the last year Microsoft has made, in my estimation, a substantially contribution to the data science ecosystem with its cloud based Azure Machine Learning. AzureML is impressive in both its depth and breadth (Azure ML Gallery). However,  I am still often met with disbelief when I mention that SQL Server has natively support some of these capabilities for over a decade now (SQL Server 2005 Data Mining).

With the raise in popularity of Data Science, more and more companies than ever are looking for people with experience with languages like “R” and Python. Data Science is now also being caught up in the “Big Data” buzz as well. I frequently hear that “Big Data” solutions like Hadoop are being touted as the end all be all of data science. Don’t get me wrong Hadoop can play an instrumental and in some cases a critical role in developing a predictive model (labels and features) but these models typically don’t magically materialize out of HDFS any more than they do out of NTFS.

The harsh reality is that creating a good predictive model usually requires a lot of work (and some amount of luck).  A model’s feature and label data typically have to be integrated from many desperate sources, cleaned, and then be conformed to specific categorical/discrete values. Similar to data warehouse ETL development, feature development can easily consume more than half of the development effort of the entire solution.   Failure to devote the appropriate resources to either feature development (or ETL) often results in failure.  “Garbage in, Garbage out” still applies.

Many people also fail to recognize the value that SQL, or Relation Database Management Systems in general, can provide to the feature development process. Data Warehouse and Business Intelligence professionals have been integrating, cleaning, and conforming data with SQL for decades and can provide valuable context to business data as well. The following is an example in which I took an existing AzureML experiment and replaced the feature development code section (originally written in python) with SQL. Not only did SQL greatly reduce the feature data processing time from over 2 hours to less than 1 hour. I was also able to easily and quickly develop almost double the number of features (from 30 to 70+).

The additional features increase the experiment’s accuracy from 0.873 to 0.910.


Note this experiment is based on a 2015 KDD competition. In other words this is real data from a real world solution. Actual Data Scientists develop solutions like this one all the time. This is not carefully crafted demo data.

Below is a high level outline of how I modified the experiment:

  1. Identified the feature development code (python) and downloaded its output for comparison the SQL based feature output to ensure that both matched.
  2. Downloaded the source CSV data into SQL Server (enrollment_test, enrollment_train, log_test,log_train, object, truth_train )
  3. Wrote SQL to create the 70+ features. ( KDD2015FeaturesTrain.sql , KDD2015FeaturesTest.sql )
  4. Uploaded SQL feature output (both training and test datasets) to AzureML and used the datasets as sources in the experiment.

Modified Experiment:

sql experiment

Lift Chart:

lift chart

This real world example demonstrates that SQL (RDMS) and Data Warehouse/Business Intelligence Professionals can make significant contributions to predictive modeling and data science, especially in the area of feature development.


Data Mining

December 14, 2011

Data Mining Data mining techniques can be used in virtually all business applications, answering various types of businesses questions. In truth, given the software available today, all you need is the motivation and the know-how. In general, data mining can be applied whenever something could be known, but is not. The following examples describe some scenarios:

Recommendation generation— What products or services should you offer to your customers? Generating recommendations is an important
business challenge for retailers and service providers. Customers who  are provided appropriate and timely recommendations are likely to be
more valuable (because they purchase more) and more loyal (because  they feel a stronger relationship to the vendor). For example, if you go to online stores such as or to purchase an item, you are provided with recommendations about other items you may be interested in. These recommendations are derived from using data mining to analyze purchase behavior of all of the retailer’s customers, and applying the derived rules to your personal information.

Anomaly detection — How do you know whether your data is ‘‘good’’ or not? Data mining can analyze your data and pick out those items that don’t fit with the rest. Credit card companies use data mining–driven anomaly detection to determine if a particular transaction is valid. If the data mining system flags the transaction as anomalous, you get a call to see if it was really you who used your card. Insurance companies also use anomaly detection to determine if claims are fraudulent. Because these companies process thousands of claims a day, it is impossible to investigate each case, and data mining can identify which claims are likely to be false. Anomaly detection can even be used to validate data entry—checking to see if the data entered is correct at the point of entry.

Churn analysis — Which customers are most likely to switch to a competitor? The telecom, banking, and insurance industries face severe competition. On average, obtaining a single new mobile phone subscriber costs more than $200. Every business would like to retain as many customers as possible. Churn analysis can help marketing managers identify the customers who are likely to leave and why, and as a result, they can improve customer relations and retain customers.

Risk management—Should a loan be approved for a particular customer? Since the subprime mortgage meltdown, this is the single most common question in banking. Data mining techniques are used to determine the risk of a loan application, helping the loan officer make appropriate decisions on the cost and validity of each application.

Customer segmentation —How do you think of your customers? Are your customers the indescribable masses, or can you learn more about  your customers to have a more intimate and appropriate discussion with them. Customer segmentation determines the behavioral and descriptive profiles for your customers. These profiles are then used to provide personalized marketing programs and strategies that are appropriate for each group.

Targeted ads — Web retailers or portal sites like to personalize their  content for their Web customers. Using navigation or online purchase
patterns, these sites can use data mining solutions to display targeted advertisements to their Web navigators.

Forecasting — How many cases of wine will you sell next week in this  store? What will the inventory level be in one month? Data mining forecasting techniques can be used to answer these types of time-related questions.


Data Mining with Microsoft SQL Server 2008: Provides in-depth knowledge about Data Mining with SSAS. Covers the Algorithms, DMX and SSIS Data Mining solutions in detail. This is an absolute must for learning how to develop a Data Mining solution. It has tons of great examples.

Predixion Insight: Jamie MacLennan, Director of SSAS at Microsoft, left Microsoft and started Predixion which offers the best Data Mining SSAS based product on the market. The excel demo comes with great examples as well.

Data Mining Techniques: Provides good examples of how to develop a strategy for using Data Mining to solve business problems.

Data Mining Webcasts

Automation of Data Mining Using Integration Services: This article is a walkthrough that illustrates how to build multiple related data models by using the tools that are provided with Microsoft SQL Server Integration Services. In this walkthrough, you will learn how to automatically build and process multiple data mining models based on a single mining structure, how to create predictions from all related models, and how to save the results to a relational database for further analysis. Finally, you view and compare the predictions, historical trends, and model statistics in SQL Server Reporting Services reports.