SQL for Data Science

Data scientists need to be comfortable working with relational databases, especially SQL. If you want to play with data, you need some SQL.

If you are a data scientist or an aspiring data professional, you need to be comfortable working with relational databases. Knowledge of SQL is incredibly important for big data professionals and one would definitely need it in his or her toolbox. Almost all the data scientists spend the majority of their time writing SQL and related scripts. The 2016 Data Science report from CrowdFlower also mentions SQL, which heads the Top 10 In-Demand Data Science Skills this year.

Image title

What Is SQL?

It is the primary language that directly communicates with data and can be defined as a special-purpose programming language for managing data held in relational database management systems.

Why SQL for Data Scientists?

Almost all structured data is stored in such databases, so if you want to play with data, you need some SQL.
If you work in NoSQL databases, chances are there you can find familiar SQL syntax for working with data in those databases as well. For example, a NoSQL database Cassandra introduced Cassandra Query Language (CQL), which layers a familiar SQL language syntax.
If you are working on Hadoop, Apache Hive provides a mechanism to extend structure onto the data in Hadoop. You can query or manipulate that data using an SQL-like language called HiveQL.
If you work with big data processing tools such as Apache Spark, you will want to do as much of the data preparation and wrangling utilizing SQL. SparkSQL is widely used in the framework to prepare the data and create data frames to be used by the corresponding ML libraries.
If you use open source R, you can also use SQL libraries to work on the data. For example, the sqldf library in R supports running SQL statements on data frames.
If you are doing some analytics tasks over data stored in Oracle DB or SQL Server, no programming language is better than using SQL.

So, learning SQL is not bad, and is beneficial while working on a data science pipeline. Today, we discuss some of the SQL functions that can be utilized over Oracle Database. However, if you are not familiar with Oracle and are comfortable with some other database from a different vendor, the tasks and syntax remain essentially similar across all of them. The SQL functions help data scientists and developers to perform various useful tasks using SQL that were previously confined to procedural languages. Data wrangling and shaping up the data becomes much easier using these functions. Operations on data are faster, as these functions sit close to the database and can be embedded into the data mining workflows to automate various tasks.

Here is a list of analytic tasks that can be performed on data using SQL.

Data Aggregations

Aggregation functions are very useful for understanding the data and present its summarized picture. The widely used aggregate functions are min, max, average, first_value, and last_value. You can find the complete list from the Oracle Database SQL Language Reference.

Ranking Functions

Ranking functions are useful to rank values in a data set and doing a Top-N analysis. For example, ranking functions can be used to rank employees within a department. There are two variant of rank functions in Oracle- rank and dense_rank. A detailed discussion is available on this page.

Bucketing the Data

Sometimes, we need to discrete the data for producing better predictions or results. For example, we can bucket Customer’s ages into four distinct groups to analyze their common traits across each age group. SQL bucketing functions come in handy here. WIDTH_BUCKET and NTILE are two widely used bucketing functions to transform continuous data to a discrete form.

Statistical Functions

Most databases, including Oracle, have provided many in-database statistical functions that can be used in a SQL query directly. Functions like statistical aggregates, hypothesis tests, and distribution fitting functions can be directly applied on data by writing a simple SQL function. A good presentation on Statistical functions is available on Oracle’s Technology Network forum.

Windowing Functions

Windowing functions are very useful for any aggregate calculations that involve a range of values or a group of rows. For example, these functions come handy in case of operations over time series data which includes calculations over a fixed or variable window time period. Also, these are useful in calculations of moving averages or running totals which require reference to one/more previous or following rows.

To learn more about analytical SQL functions, you can also refer my upcoming book on Data Science using Oracle Data Miner and Oracle R Enterprise published by Apress. However, before learning analytical SQL functions, one needs to discover SQL. Learning SQL is easy and you can get started using some of the materials provided below.

To end, I couldn’t stop sharing a great list on Six Handy SQL Features for Data Scientists that is adapted from a list on the Yhat’s blog: 7 handy SQL features for data scientists. The discussion is on Postgres database, but the learnings from this article can be implemented in any database.

Generate queries from a query: Basic string concatenation makes it easy to generate queries in mass that use data in a database to fetch data found in another system.
Handle dates: Dates are always tricky. Most of the times dates are never seen the way we need them. Excellent date functions exist in SQL to meet all your formatting and type conversion needs.
Text mining: We can take the advantage of SQL’s built-in string functions such as REGEX and SUBSTR before turning to a scripting language.
Load data into your database if it is in a CSV or any text files.
Generate sequences: Sequences are useful to enumerate over tables and prevents us from having to write for loops in the SQL code.

The opportunities for professionals with big data and data science skills is going to be more in 2017. The knowledge of SQL is necessary to stay ahead of the curve.

Categories: Machine Learning

SQL for Data Science