
Python and SQL: Better Together

Python and SQL are complementary - we should focus on how best to integrate them rather than try to replace them!

By Alex Monahan
2021-08-15 (Yes, this is the one date format to rule them all)
LinkedIn | Twitter: @__alexmonahan__
The views I express are my own and not my employer’s.

Welcome to my first blog post! I welcome any and all feedback over on Twitter - I’d love to learn from all of you!

There has been some spirited debate over SQL on Tech Twitter in the last few weeks. I am a huge fan of open discussions like this, and I consistently learn something when reading multiple viewpoints. It’s my hope to contribute to a friendly and productive dialog with fellow data folks! Let’s focus on a few positives from each perspective.

Here, Jamie Brandon makes some excellent points about SQL’s weaknesses. The points I agree with the most are related to SQL’s incompressibility. The way I encounter this most frequently is that it is difficult to execute queries on a dynamic list of columns purely in SQL. I also wish that it were easier to modularize code into functions in a more powerful way, although there are some ways of doing so. Imagine a repository of SQL helper functions you could import! If only it were possible. If it is, please let me know on Twitter!
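
As a taste of what mixing the two languages buys you (and a preview of the rest of this post), here is a minimal sketch of handling a dynamic column list by letting Python build the SQL text. The table and column names here are hypothetical:

# A minimal sketch: build the column list in Python, then format it into the query text.
# The metric names and the taxis table are hypothetical placeholders.
metrics = ['fare_amount', 'tip_amount', 'total_amount']
select_list = '\n    ,'.join(f'avg({col}) as avg_{col}' for col in metrics)

query = f"""
SELECT
    passenger_count
    ,{select_list}
FROM taxis
GROUP BY passenger_count
"""
print(query)  # hand this string to whichever database driver you use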

Pedram Navid responded to those points in an indirect way that I also found to be very impactful and thought-provoking. Pedram focused on the organizational impacts of choosing to move away from SQL. I agree with Pedram that SQL is a tremendous data democratization tool and that it is important that SQL folks and other programming language folks work as a team. He also makes the case that SQL is often good enough, and I would go a step further and say there are many cases where SQL is a very expressive way to request and manipulate data! The cases where SQL is most useful are very accessible and can really empower people for whom data is just a portion of their job. SQL is the easiest superpower to learn as a data person!

Pedram also cites dbt as a powerful way to address some of SQL’s rough edges. I would like to take that line of thinking a step further here: How can we mix and match Python and SQL to get the best of both?

Why use SQL?

Before mixing and matching, why would we want to use SQL in the first place? While I agree with Jamie that it is imperfect, it has many redeeming qualities! Toss any others I’ve forgotten on Twitter!

  1. SQL is very widely used
  2. The use of SQL is expanding
  3. SQL is easy to get started with
    • While I don’t necessarily have sources to cite here, I have led multiple SQL training courses that can take domain experts from 0 to introductory SQL in 8 hours. While I have not done the same for other languages, I feel like it would be difficult to be productive that quickly!
  4. However, it’s hard to outgrow the need for SQL
    • Even the majority of data scientists use SQL “Sometimes” or more - placing it as the number 2 language in Anaconda’s State of Data Science 2021.
    • This also makes SQL great for your career!
  5. SQL’s declarative nature removes the need to understand database internals
    • The deliberate separation between the user’s request and the specific algorithms used by the database is an excellent abstraction layer 99%+ of the time. You can write years of productive SQL queries before learning the difference between a hash join and a sort-merge join! And even then, the database will usually choose correctly on your behalf (see the small sketch just after this list).
  6. Chances are good you need to use SQL to pull your data initially anyway
    • If you already need to know some SQL to access your company’s valuable info, why not maximize your effectiveness with it?
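
To make point 5 concrete, here is a tiny sketch using DuckDB (introduced below). The SQL only states what result we want; EXPLAIN lets us peek at the plan, including the join algorithm the engine picked on our behalf. The tables and columns here are made up:

import duckdb

con = duckdb.connect()  # in-memory database
con.execute("CREATE TABLE trips(vendor_id INTEGER, fare DOUBLE)")
con.execute("CREATE TABLE vendors(vendor_id INTEGER, vendor_name VARCHAR)")

# We never specify *how* to join - the optimizer chooses the algorithm for us.
plan = con.execute("""
    EXPLAIN
    SELECT v.vendor_name, avg(t.fare) as avg_fare
    FROM trips t
    JOIN vendors v ON t.vendor_id = v.vendor_id
    GROUP BY v.vendor_name
""").fetchall()

for key, value in plan:
    print(key)
    print(value)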

Tools to use when combining SQL and Python

Asterisks indicate libraries I have not used yet, but that I am excited to try!

I plan to explore more of these in upcoming posts!



DuckDB - One powerful way to mix Python and SQL

DuckDB is best summarized as the SQLite of analytics. In under 10ms, you can spin up your own in-process database that is 20x faster than SQLite, and in most cases faster than Pandas!

Besides the speed, why do I love DuckDB?

  1. It works seamlessly with Pandas
    • You can query a DataFrame without needing to insert it into the DB, and you can return results as DF’s as well. This is both simple and very fast since it is in the same process as Python. (See the short sketch just after this list.)
  2. Setup is easy
    • Just pip install duckdb and you’re all set.
  3. DuckDB supports almost all of PostgreSQL’s syntax, but also smooths rough edges
    • When I first tested out DuckDB, it could handle everything I threw at it: Recursive CTE’s, Window functions, arbitrary subqueries, and more. Even lateral joins are supported! Since then, it has only improved by adding Regex, statistical functions, and more!
    • As an example of smoothing rough edges, Postgres is notoriously picky about capitalization, but DuckDB is not case sensitive. While most function names come from Postgres, many equivalent function names from other DB’s can also be used.
  4. MIT licensed

  5. Parquet, csv, and Apache Arrow structures can also be queried by DuckDB
    • Interoperability with Arrow and Parquet in particular expand the ecosystem that DuckDB can interact with.
  6. DuckDB continues to dramatically improve, and the developers are fantastic!
    • Truthfully, this should be item 0! I’ve had multiple (sometimes tricky) bugs be squashed in a matter of days, and many of my feature requests have been added. The entire team is excellent!
  7. Persistence comes for free, but is optional! This allows DuckDB to work on larger-than-memory data.
    • If you are building a data pipeline, it can be super useful to see all of the intermediate steps. Plus, I believe that DuckDB databases are going to become a key multi-table data storage structure, just as SQLite is today. The developers are in the midst of adding some powerful compression to DuckDB’s storage engine, so I see significant potential here.
  8. Did I mention it’s fast?
    • DuckDB is multi-threaded, so you can utilize all your CPU cores without any work on your end - no need to partition your data or anything!
  9. DuckDB has a Relational API that is targeting Pandas compatibility
    • While I am admittedly a SQL fan, having a relational API can be very helpful to add in some of the dynamism and flexibility of Pandas.
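
To make items 1, 5, 7, and 9 concrete before the full workflow below, here is a minimal sketch of DuckDB’s Python API. The file name and column names are illustrative:

import duckdb
import pandas as pd

df = pd.DataFrame({'passenger_count': [1, 2, 1], 'fare_amount': [7.5, 12.0, 9.25]})

# Item 1: query a DataFrame without inserting it, and get a DataFrame back.
con = duckdb.connect()            # in-memory; pass a file path instead for a persistent DB (item 7)
con.register('taxi_sample', df)   # exposes the DataFrame as a view - no copy or insert needed
result_df = con.execute("""
    SELECT passenger_count, avg(fare_amount) as avg_fare
    FROM taxi_sample
    GROUP BY passenger_count
""").df()
print(result_df)

# Item 5: Parquet files can be queried directly (hypothetical file name):
# con.execute("SELECT count(*) FROM 'yellow_tripdata_2016-01.parquet'").df()

# Item 9: the Relational API expresses similar logic without writing SQL text.
rel_df = con.from_df(df).filter('fare_amount > 8').project('passenger_count, fare_amount').df()
print(rel_df)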

An example workflow with DuckDB

#I use Anaconda, so Pandas and SQLAlchemy are already installed. Otherwise, pip install those to start with
# !pip install pandas
# !pip install sqlalchemy

!pip install duckdb==0.2.8

#This is a SQLalchemy driver for DuckDB. It powers the ipython-sql library below 
#Thank you to the core developer of duckdb_engine, Elliana May! 
#She rapidly added a feature and squashed a bug so that it works better with ipython-sql!
!pip install duckdb_engine==0.1.8rc4 

#This allows for the %%sql magic in Jupyter to build SQL language cells!
!pip install ipython-sql
import duckdb
import pandas as pd
import sqlalchemy
#no need to import duckdb_engine - SQLAlchemy will auto-detect the driver needed based on your connection string!

Let’s play with a moderately sized dataset: a 1.6 GB csv

This analysis is inspired by this DuckDB intro article by Uwe Korn, and the dataset can be downloaded here. Note that DuckDB can handle much larger datasets - this is only an example!

First we load the ipython-sql extension and default our output to be in Pandas DF format. We will also simplify what is printed after each SQL statement.
Next, we connect to an in-memory DuckDB and set it to use all our available CPU horsepower!

%load_ext sql
%config SqlMagic.autopandas=True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

%sql duckdb:///:memory:
%sql pragma threads=16 

Check out this super clean syntax for directly querying a CSV file! As a note, we could have also used pandas.read_csv and then queried the resulting DataFrame. DuckDB’s csv reader allows us to skip a step!

%%sql
create table taxis as
SELECT * FROM 'yellow_tripdata_2016-01.csv';
Count
0 10906858

Our csv import took 15 seconds: 40% faster than the 25 seconds Pandas required!

(the csv was loaded into RAM prior to timing so it may take an extra few seconds in both cases if you weren’t already working with that file)
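
For reference, the pandas.read_csv alternative mentioned above would look roughly like this - a sketch using DuckDB’s native Python API rather than the SQL magic; the direct csv scan above skips the intermediate DataFrame:

import duckdb
import pandas as pd

# Load with Pandas first, then let DuckDB query/copy the resulting DataFrame.
taxis_df = pd.read_csv('yellow_tripdata_2016-01.csv')

con = duckdb.connect()                 # separate in-memory connection for this sketch
con.register('taxis_df', taxis_df)     # expose the DataFrame as a view
con.execute("CREATE TABLE taxis AS SELECT * FROM taxis_df")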

Now let’s take a quick sample of our dataset and load it into a Pandas DF.

%%sql
SELECT
    *
FROM taxis
USING SAMPLE 5
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2016-01-29 17:59:21 2016-01-29 18:17:50 2 2.36 -73.990707 40.756535 1 N -73.991417 40.735207 1 13.0 1.0 0.5 2.96 0.0 0.3 17.76
1 2 2016-01-06 20:46:10 2016-01-06 20:49:45 1 0.66 -73.983322 40.750511 1 N -73.994217 40.755001 1 4.5 0.5 0.5 1.00 0.0 0.3 6.80
2 1 2016-01-09 13:23:12 2016-01-09 13:29:04 1 0.50 -73.994179 40.751106 1 N -73.985992 40.750496 2 5.5 0.0 0.5 0.00 0.0 0.3 6.30
3 2 2016-01-28 09:57:02 2016-01-28 10:04:10 2 3.22 -73.989326 40.742462 1 N -74.002136 40.709290 1 11.0 0.0 0.5 1.50 0.0 0.3 13.30
4 2 2016-01-18 20:45:15 2016-01-18 21:06:36 1 13.80 -74.012917 40.706169 1 N -73.939781 40.852749 1 37.5 0.5 0.5 3.80 0.0 0.3 42.60


As you can see below, DuckDB did a great job auto-detecting our column types. One more traditional DB hassle eliminated!

%sql describe taxis
Field Type Null Key Default Extra
0 VendorID INTEGER YES None None None
1 tpep_pickup_datetime TIMESTAMP YES None None None
2 tpep_dropoff_datetime TIMESTAMP YES None None None
3 passenger_count INTEGER YES None None None
4 trip_distance DOUBLE YES None None None
5 pickup_longitude DOUBLE YES None None None
6 pickup_latitude DOUBLE YES None None None
7 RatecodeID INTEGER YES None None None
8 store_and_fwd_flag VARCHAR YES None None None
9 dropoff_longitude DOUBLE YES None None None
10 dropoff_latitude DOUBLE YES None None None
11 payment_type INTEGER YES None None None
12 fare_amount DOUBLE YES None None None
13 extra DOUBLE YES None None None
14 mta_tax DOUBLE YES None None None
15 tip_amount DOUBLE YES None None None
16 tolls_amount DOUBLE YES None None None
17 improvement_surcharge DOUBLE YES None None None
18 total_amount DOUBLE YES None None None

Now let’s do some analysis!

First, let’s see how we can pass in a parameter through ipython-sql to better understand how VendorId 1 is behaving. Just for fun, we are going to pass in the entire WHERE clause. Why? It shows that you can build up dynamic SQL statements using Python! It’s not quite as easy to use as a templating language, but just as flexible.

my_where_clause = """
    WHERE
        vendorid = 1
"""
%%sql
SELECT
    vendorid
    ,passenger_count
    ,count(*) as count
    ,avg(fare_amount/total_amount) as average_fare_percentage
    ,avg(trip_distance) as average_distance
FROM taxis
{my_where_clause}
GROUP BY
    vendorid
    ,passenger_count
VendorID passenger_count count average_fare_percentage average_distance
0 1 0 303 0.480146 3.217492
1 1 1 4129971 0.787743 6.362515
2 1 2 689305 0.796871 5.674639
3 1 3 164253 0.804641 3.126398
4 1 4 83817 0.819123 32.439788
5 1 5 2360 0.806479 3.923771
6 1 6 1407 0.788484 3.875124
7 1 7 3 0.755823 4.800000
8 1 9 10 0.785628 3.710000

Next we’ll aggregate our data in DuckDB (which is exceptionally fast for GROUP BY queries), and then pivot the result with Pandas. Pandas is a great fit for pivoting because the column names are not known ahead of time, which would require writing some dynamic SQL. However, DuckDB has a pivot_table function on the roadmap for its Python Relational (read: DataFrame-like) API! This will allow us to pivot larger-than-memory data and use multiple CPU cores for that pivoting operation.

Note: The more complex the query, the better DuckDB performs relative to Pandas! This is because it can do more work in fewer passes through the dataset, and because it uses multiple CPU cores.

%%sql aggregated_df <<
SELECT
    --Aggregate up to a weekly level. lpad makes sure week numbers always have 2 digits (Ex: '02' instead of '2')
    date_part('year',tpep_pickup_datetime) || 
        lpad(cast(date_part('week',tpep_pickup_datetime) as varchar),2,'0') as yyyyww
    ,passenger_count
    ,count(*) as count
    ,min(total_amount) as min_amount
    ,quantile_cont(total_amount,0.1) as _10th_percentile
    ,quantile_cont(total_amount,0.25) as _25th_percentile
    ,quantile_cont(total_amount,0.5) as _50th_percentile
    ,avg(total_amount) as avg_amount
    ,quantile_cont(total_amount,0.75) as _75th_percentile
    ,quantile_cont(total_amount,0.9) as _90th_percentile
    ,max(total_amount) as max_amount
    ,stddev_pop(total_amount) as std_amount
FROM taxis
GROUP BY
    date_part('year',tpep_pickup_datetime) || 
        lpad(cast(date_part('week',tpep_pickup_datetime) as varchar),2,'0')
    ,passenger_count
ORDER BY
    yyyyww
    ,passenger_count
Returning data to local variable aggregated_df
aggregated_df
yyyyww passenger_count count min_amount _10th_percentile _25th_percentile _50th_percentile avg_amount _75th_percentile _90th_percentile max_amount std_amount
0 201601 0 129 -160.46 3.000000e-01 4.8000 10.560 21.901163 22.5600 70.019999 278.30 45.119609
1 201601 1 1814315 -227.10 6.350000e+00 8.1800 11.160 14.909425 16.3000 26.300000 3045.34 12.835089
2 201601 2 361892 -100.80 6.800000e+00 8.3000 11.620 15.835135 17.1600 29.300000 1297.75 14.113061
3 201601 3 100336 -80.80 6.800000e+00 8.3000 11.300 15.394737 16.6200 27.300000 1297.75 14.566035
4 201601 4 47683 -120.30 6.800000e+00 8.3000 11.300 15.561985 16.8000 28.300000 889.30 13.701526
5 201601 5 138636 -52.80 6.360000e+00 8.3000 11.300 15.263942 16.6200 27.960000 303.84 12.353158
6 201601 6 85993 -52.80 6.360000e+00 8.1900 11.160 14.889752 16.3000 26.300000 170.50 11.906014
7 201601 7 6 70.80 7.330000e+01 76.3675 83.685 82.648333 90.5450 90.960000 90.96 8.069036
8 201601 8 4 8.30 8.690000e+00 9.2750 9.805 36.630000 37.1600 86.029992 118.61 47.335385
9 201601 9 1 11.80 1.180000e+01 11.8000 11.800 11.800000 11.8000 11.800000 11.80 0.000000
10 201602 0 137 -10.10 3.000000e-01 6.3500 13.390 24.970073 35.3000 70.513999 148.01 30.159674
11 201602 1 1911279 -150.30 6.360000e+00 8.3000 11.300 15.160333 16.5500 27.300000 1297.75 13.008159
12 201602 2 383117 -200.80 6.800000e+00 8.3000 11.750 15.868765 17.1600 29.300000 1154.84 14.086363
13 201602 3 107589 -52.80 6.800000e+00 8.3000 11.300 15.317193 16.8000 27.300000 472.94 12.881864
14 201602 4 50919 -958.40 6.800000e+00 8.3000 11.300 15.412864 16.8000 27.950000 958.40 14.208739
15 201602 5 147897 -54.80 6.360000e+00 8.3000 11.300 15.380488 16.8000 28.340000 240.80 12.517914
16 201602 6 91826 -52.80 6.360000e+00 8.3000 11.300 15.058432 16.5600 27.300000 200.00 12.108098
17 201602 7 2 8.10 8.365000e+00 8.7625 9.425 9.425000 10.0875 10.485000 10.75 1.325000
18 201602 8 6 5.80 7.050000e+00 8.7250 10.100 36.156667 63.1500 91.319997 101.84 39.502676
19 201602 9 6 -9.60 1.430511e-07 9.7250 11.360 36.283333 57.2825 97.489994 122.81 46.255662
20 201603 0 110 0.00 3.000000e-01 0.9275 11.460 30.601818 53.1750 90.259985 282.30 42.165426
21 201603 1 1526446 -200.80 6.800000e+00 8.5000 11.760 15.951602 17.3000 29.000000 8008.80 16.090206
22 201603 2 281137 -75.30 6.800000e+00 8.7600 12.250 16.938134 18.3000 32.800000 725.30 15.298718
23 201603 3 76414 -75.84 6.800000e+00 8.7600 11.800 16.356485 17.7600 29.800000 1137.85 15.731051
24 201603 4 35166 -73.80 6.800000e+00 8.7600 11.800 16.261706 17.7600 30.800000 459.79 14.083521
25 201603 5 117191 -52.80 6.800000e+00 8.7600 11.800 16.123766 17.7600 30.300000 271.30 13.284821
26 201603 6 70617 -52.80 6.800000e+00 8.5000 11.760 15.869602 17.3000 29.300000 326.84 12.932498
27 201603 7 5 7.60 7.788000e+00 8.0700 8.100 29.332000 46.5900 64.415997 76.30 27.853121
28 201603 8 5 8.80 9.120000e+00 9.6000 80.800 58.760000 95.3000 97.700000 99.30 40.931973
29 201603 9 3 10.10 1.164000e+01 13.9500 17.800 42.233333 58.3000 82.599996 98.80 40.122008
30 201604 0 80 -0.80 3.000000e-01 1.1750 17.945 33.452375 62.7625 80.300000 132.80 33.498167
31 201604 1 1622846 -440.34 6.800000e+00 8.7600 12.300 16.211807 18.3000 29.160000 111271.65 88.344714
32 201604 2 323779 -234.80 6.950000e+00 8.8000 12.360 16.718532 18.8000 30.950000 795.84 13.931410
33 201604 3 89060 -52.80 6.850000e+00 8.8000 12.300 16.218767 18.3000 29.160000 724.82 13.392155
34 201604 4 41461 -52.80 6.890000e+00 8.8000 12.300 16.079813 18.3000 28.800000 249.35 12.625694
35 201604 5 126134 -52.80 6.800000e+00 8.8000 12.300 16.341509 18.3600 30.290000 259.80 12.791541
36 201604 6 77699 -52.80 6.800000e+00 8.8000 12.300 16.129398 18.3000 29.750000 177.46 12.564739
37 201604 7 5 8.10 9.460000e+00 11.5000 18.170 37.074000 70.0000 74.559999 77.60 30.256773
38 201604 8 5 10.38 1.046800e+01 10.6000 80.800 60.596000 98.8400 100.952000 102.36 41.560278
39 201604 9 5 10.80 4.360000e+01 92.8000 97.800 82.100000 100.3000 105.399999 108.80 36.024436
40 201653 0 64 -7.80 3.000000e-01 1.1425 11.775 24.791719 34.7250 73.869993 107.30 29.413893
41 201653 1 852098 -300.80 6.300000e+00 7.8000 10.800 15.229454 16.5600 29.120000 1003.38 13.651999
42 201653 2 212052 -258.20 6.360000e+00 8.3000 11.300 16.580428 17.7600 34.240000 1004.94 15.547422
43 201653 3 63032 -75.30 6.360000e+00 8.3000 11.300 16.141369 17.3000 31.340000 998.30 15.105769
44 201653 4 35412 -100.30 6.800000e+00 8.3000 11.400 16.463965 17.3000 32.195966 656.15 15.957624
45 201653 5 71221 -21.30 6.300000e+00 8.1600 11.160 15.816108 17.1600 32.300000 469.30 13.725301
46 201653 6 43020 -52.80 6.300000e+00 7.8800 11.000 15.358045 16.6400 30.300000 148.34 13.060431
47 201653 7 4 8.20 8.665000e+00 9.3625 40.025 45.147500 75.8100 85.727998 92.34 37.006354
48 201653 8 6 9.60 4.495000e+01 80.4250 80.800 77.088333 84.5800 105.514995 125.19 34.114769
49 201653 9 8 5.80 7.025000e+00 7.7375 8.400 10.331250 12.0125 15.004999 20.15 4.291994


pivoted_df = pd.pivot_table(aggregated_df,
               values=['_50th_percentile'],
               index=['yyyyww'],
               columns=['passenger_count'],
               aggfunc='max'
            )
pivoted_df
_50th_percentile
passenger_count 0 1 2 3 4 5 6 7 8 9
yyyyww
201601 10.560 11.16 11.62 11.3 11.3 11.30 11.16 83.685 9.805 11.80
201602 13.390 11.30 11.75 11.3 11.3 11.30 11.30 9.425 10.100 11.36
201603 11.460 11.76 12.25 11.8 11.8 11.80 11.76 8.100 80.800 17.80
201604 17.945 12.30 12.36 12.3 12.3 12.30 12.30 18.170 80.800 97.80
201653 11.775 10.80 11.30 11.3 11.4 11.16 11.00 40.025 80.800 8.40

Let’s summarize what we have accomplished!

To go from Python to SQL, we have used Python to generate SQL dynamically and have embedded our SQL code nicely alongside Python in our Jupyter Notebook programming environment. We also discussed using DuckDB to read a Pandas DF directly. When moving from SQL to Python, DuckDB can quickly and easily convert SQL results back into Pandas DF’s. Now we can mix and match Python and SQL to our heart’s content!

Thank you for reading! Please hop on Twitter and let me know if any of this was helpful or what I can cover next!