In any data science or machine learning task, a model’s performance depends on the quality of the data it is built on. Python’s Pandas library and SQL are among the most powerful tools for extracting and manipulating data efficiently. By combining the two, data analysts can perform complex analyses even on large datasets. In this article, we’ll explore how you can combine Python Pandas with SQL to enhance the quality of your data analysis.
Before using Pandas and SQL together, let’s first understand what Pandas and SQL are capable of doing on their own.
Pandas is a software library for the Python programming language, used for data manipulation and analysis. It offers operations for manipulating tables, data structures, and time-series data.
Learn More: The Ultimate Guide to Pandas For Data Science!
SQL stands for Structured Query Language, and it is used for extracting, managing, and manipulating data in relational databases. It is useful for handling structured data by incorporating relations among entities and variables, and it allows for inserting, updating, deleting, and managing the data stored in tables.
Learn More: SQL For Data Science: A Beginner’s Guide!
Using Pandas and SQL together makes code more readable and, in certain cases, easier to implement. This is especially true for complex workflows, where SQL queries are often much clearer and easier to read than the equivalent Pandas code. Moreover, most relational data originates in databases, and SQL is one of the main tools for working with relational data. This is one of the main reasons why working professionals like data analysts and data scientists prefer to integrate the two.
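To see the difference, consider the same grouped aggregation written both ways. The snippet below is a minimal sketch using a small, hypothetical cars DataFrame invented purely for illustration; the SQL is shown here only as a query string, and we’ll see how to actually execute such strings in a moment.
import pandas as pd
# A tiny, made-up dataset (for illustration only)
cars = pd.DataFrame({
    'brand': ['Ford', 'Ford', 'BMW', 'BMW'],
    'year': [2016, 2018, 2017, 2014],
    'price': [20000, 25000, 40000, 30000]
})
# Pandas: a chain of filtering, grouping, and sorting calls
avg_pd = (cars[cars['year'] > 2015]
          .groupby('brand')['price'].mean()
          .sort_values(ascending=False))
# The same logic in SQL reads almost like plain English
query = """
SELECT brand, AVG(price) AS avg_price
FROM cars
WHERE year > 2015
GROUP BY brand
ORDER BY avg_price DESC
"""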
To combine SQL queries with Pandas, one needs a common bridge between these two. This is where ‘pandasql’ comes into the picture. Pandasql allows you to run SQL queries directly within Pandas. This way, we can seamlessly use the SQL syntax without leaving the dynamic Pandas environment.
The first step to using Pandas and SQL together is to install pandasql into our environment.
pip install pandasql
Once the installation is complete, we can import pandasql into our code and use it to execute SQL queries on a Pandas DataFrame.
import pandas as pd
import pandasql as psql
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# SQL query to select all data
query = "SELECT * FROM df"
result = psql.sqldf(query, locals())
result
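Running this should return the whole table as a new Pandas DataFrame:
"""
Output:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
"""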
Let’s break down the code. We first build a small DataFrame named df and write an ordinary SQL SELECT statement against it. The call psql.sqldf(query, locals()) executes the query by copying the DataFrame into an in-memory SQLite database; passing locals() lets pandasql find df among the local variables. The result comes back as a new Pandas DataFrame.
Now it’s time to perform the data analysis using pandasql. The section below shows a few examples of how one can enhance data analysis by combining Pandas and SQL. First, we import the required libraries and load the dataset.
# Required libraries
import pandas as pd
import pandasql as ps
import plotly.express as px
import ipywidgets as widgets
# Load the dataset
car_data = pd.read_csv("cars_datasets.csv")
car_data.head()
Let’s break down the code. Along with Pandas and pandasql, we import Plotly Express for plotting and ipywidgets for the interactive controls used later in this article. We then load the dataset from the CSV file into a DataFrame named car_data and preview its first few rows with head().
In this step, we’ll get familiar with the data by exploring the column names, the data types of the features, and whether the data contains any null values.
# Display column names
column_names = car_data.columns
column_names
"""
Output:
Index(['Unnamed: 0', 'price', 'brand', 'model', 'year', 'title_status',
'mileage', 'color', 'vin', 'lot', 'state', 'country', 'condition'],
dtype='object')
""”
# Display dataset info
car_data.info()
"""
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2499 entries, 0 to 2498
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 2499 non-null int64
1 price 2499 non-null int64
2 brand 2499 non-null object
3 model 2499 non-null object
4 year 2499 non-null int64
5 title_status 2499 non-null object
6 mileage 2499 non-null float64
7 color 2499 non-null object
8 vin 2499 non-null object
9 lot 2499 non-null int64
10 state 2499 non-null object
11 country 2499 non-null object
12 condition 2499 non-null object
dtypes: float64(1), int64(4), object(8)
memory usage: 253.9+ KB
"""
# Check for null values
car_data.isnull().sum()
"""Output:
Unnamed: 0 0
price 0
brand 0
model 0
year 0
title_status 0
mileage 0
color 0
vin 0
lot 0
state 0
country 0
condition 0
dtype: int64
"""
Now that the dataset is loaded into the workflow, let’s use pandasql to analyze it by running a few queries of our own.
Let’s first find the top 10 most expensive cars from the entire dataset.
# Define a small helper once so every query below can reuse it
def q(query):
    return ps.sqldf(query, {'car_data': car_data})
q("""
SELECT brand, model, year, price
FROM car_data
ORDER BY price DESC
LIMIT 10
""")
Let’s break down the code. The helper q() passes a query string to ps.sqldf() together with a dictionary that maps the table name car_data to our DataFrame. The query selects the brand, model, year, and price columns, sorts the rows by price in descending order, and keeps only the first 10 rows, giving us the ten most expensive listings.
Here we’ll find the average price of cars for each brand.
q("""
SELECT brand, ROUND(AVG(price), 2) AS avg_price
FROM car_data
GROUP BY brand
ORDER BY avg_price DESC""")
Let’s break down the code. GROUP BY brand splits the listings by brand, AVG(price) computes the mean price within each group, ROUND(..., 2) trims the result to two decimal places, and ORDER BY avg_price DESC lists the most expensive brands first.
Let’s make a list of the cars manufactured after 2015.
q("""
SELECT *
FROM car_data
WHERE year > 2015
ORDER BY year DESC
""")
Let’s break down the code. The WHERE year > 2015 clause keeps only the cars manufactured after 2015, and ORDER BY year DESC shows the newest models first.
Now let’s find the five brands with the most cars listed in the dataset.
q("""
SELECT brand, COUNT(*) as total_listed
FROM car_data
GROUP BY brand
ORDER BY total_listed DESC
LIMIT 5
""")
Let’s break down the code. COUNT(*) tallies the number of listings per brand, and LIMIT 5 keeps only the five brands with the most cars listed.
Let’s see how we can group the cars based on their condition. Here, the condition column indicates the status of the listing, such as how long ago it was added or how much time is left on it. Based on that, we can categorize the cars and compare their average prices.
q("""
SELECT condition, ROUND(AVG(price), 2) AS avg_price, COUNT(*) as listings
FROM car_data
GROUP BY condition
ORDER BY avg_price DESC
""")
Let’s break down the code. Grouping by condition splits the listings by their listing status; for each group we compute the average price and the number of listings, then sort the groups from the most to the least expensive.
Here we’ll find the average mileage and the average price of the cars for each brand.
q("""
SELECT brand,
ROUND(AVG(mileage), 2) AS avg_mileage,
ROUND(AVG(price), 2) AS avg_price,
COUNT(*) AS total_listings
FROM car_data
GROUP BY brand
ORDER BY avg_price DESC
LIMIT 10
""")
Let’s break down the code. For each brand we compute the average mileage, the average price, and the total number of listings, then sort the brands by average price and keep the top 10.
Now let’s rank the top brands by their price-to-mileage ratio, i.e., the average price per mile of the cars for each brand.
q("""
SELECT brand,
ROUND(AVG(price/mileage), 4) AS price_per_mile,
COUNT(*) AS total
FROM car_data
WHERE mileage > 0
GROUP BY brand
ORDER BY price_per_mile DESC
LIMIT 10
""")
Let’s break down the code. AVG(price/mileage) computes the average price per mile within each brand, and the WHERE mileage > 0 filter guards against division by zero. The ratio is rounded to four decimal places, and the ten brands with the highest price per mile are listed first.
Finally, let’s build an interactive chart. Here we’ll plot the average price of cars for each brand in a selected state, with a dropdown to switch between states.
# Dropdown widget listing every state present in the dataset
state_dropdown = widgets.Dropdown(
    options=car_data['state'].unique().tolist(),
    value=car_data['state'].unique()[0],
    description='Select State:',
    layout=widgets.Layout(width='50%')
)

def plot_avg_price_state(state_selected):
    # Average price per brand for the chosen state
    query = f"""
    SELECT brand, AVG(price) AS avg_price
    FROM car_data
    WHERE state = '{state_selected}'
    GROUP BY brand
    ORDER BY avg_price DESC
    """
    result = q(query)
    # Bar chart of average price by brand
    fig = px.bar(result, x='brand', y='avg_price', color='brand',
                 title=f"Average Car Price in {state_selected}")
    fig.show()

# Re-run the plot whenever the dropdown selection changes
widgets.interact(plot_avg_price_state, state_selected=state_dropdown)
Let’s break down the code. The Dropdown widget holds the list of unique states in the dataset. widgets.interact() calls plot_avg_price_state() every time the selection changes; the function builds the SQL query for the chosen state with an f-string, runs it through our q() helper, and renders the result as a Plotly bar chart of average price by brand.
For the notebook and the dataset used here, please visit this link.
Although pandasql is a convenient way to run SQL queries with Pandas and offers many useful features, it also has some limitations. Under the hood, sqldf copies every referenced DataFrame into an in-memory SQLite database before running a query, so it can be slow and memory-hungry on large datasets. It also supports only SQLite’s SQL dialect, and every query returns a new DataFrame rather than modifying the original in place. In practice, pandasql shines for quick, readable exploration by SQL-fluent users, while native Pandas (or a proper database) is the better choice for very large data or performance-critical work.
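As a quick illustration of the trade-off, here is the same per-brand aggregation computed both ways, reusing the car_data DataFrame and the q() helper from above. The native Pandas version avoids the SQLite round trip and is usually much faster on large DataFrames:
# pandasql: readable, but copies car_data into SQLite on every call
avg_sql = q("""
SELECT brand, AVG(price) AS avg_price
FROM car_data
GROUP BY brand
""")

# Native Pandas: the same aggregation, computed directly in memory
avg_pd = (car_data.groupby('brand', as_index=False)['price']
          .mean()
          .rename(columns={'price': 'avg_price'}))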
Using Pandas and SQL together can significantly improve your data analysis workflow. By leveraging pandasql, you can seamlessly run SQL queries on Pandas DataFrames, which is especially helpful for those who are comfortable with SQL but want to work in a Python environment. This integration combines the flexibility of both tools and opens up new possibilities for data manipulation and analysis, enhancing your ability to tackle a wide range of data challenges. However, it’s important to keep pandasql’s limitations in mind and to reach for other approaches when dealing with large or complex datasets.