No matter which buzz words you use (Knowledge Discovery, Data Mining, Predictive Analytics, Advanced Analytics, Machine Learning etc.) the once obscure discipline of Data Science has become a red hot topic in recent years. There are now many software solutions offered by an ever increasing number of vendors.
Within the last year Microsoft has made, in my estimation, a substantially contribution to the data science ecosystem with its cloud based Azure Machine Learning. AzureML is impressive in both its depth and breadth (Azure ML Gallery). However, I am still often met with disbelief when I mention that SQL Server has natively support some of these capabilities for over a decade now (SQL Server 2005 Data Mining).
With the raise in popularity of Data Science, more and more companies than ever are looking for people with experience with languages like “R” and Python. Data Science is now also being caught up in the “Big Data” buzz as well. I frequently hear that “Big Data” solutions like Hadoop are being touted as the end all be all of data science. Don’t get me wrong Hadoop can play an instrumental and in some cases a critical role in developing a predictive model (labels and features) but these models typically don’t magically materialize out of HDFS any more than they do out of NTFS.
The harsh reality is that creating a good predictive model usually requires a lot of work (and some amount of luck). A model’s feature and label data typically have to be integrated from many desperate sources, cleaned, and then be conformed to specific categorical/discrete values. Similar to data warehouse ETL development, feature development can easily consume more than half of the development effort of the entire solution. Failure to devote the appropriate resources to either feature development (or ETL) often results in failure. “Garbage in, Garbage out” still applies.
Many people also fail to recognize the value that SQL, or Relation Database Management Systems in general, can provide to the feature development process. Data Warehouse and Business Intelligence professionals have been integrating, cleaning, and conforming data with SQL for decades and can provide valuable context to business data as well. The following is an example in which I took an existing AzureML experiment and replaced the feature development code section (originally written in python) with SQL. Not only did SQL greatly reduce the feature data processing time from over 2 hours to less than 1 hour. I was also able to easily and quickly develop almost double the number of features (from 30 to 70+).
The additional features increase the experiment’s accuracy from 0.873 to 0.910.
Note this experiment is based on a 2015 KDD competition. In other words this is real data from a real world solution. Actual Data Scientists develop solutions like this one all the time. This is not carefully crafted demo data.
Below is a high level outline of how I modified the experiment:
- Identified the feature development code (python) and downloaded its output for comparison the SQL based feature output to ensure that both matched.
- Downloaded the source CSV data into SQL Server (enrollment_test, enrollment_train, log_test,log_train, object, truth_train )
- Wrote SQL to create the 70+ features. ( KDD2015FeaturesTrain.sql , KDD2015FeaturesTest.sql )
- Uploaded SQL feature output (both training and test datasets) to AzureML and used the datasets as sources in the experiment.
This real world example demonstrates that SQL (RDMS) and Data Warehouse/Business Intelligence Professionals can make significant contributions to predictive modeling and data science, especially in the area of feature development.