Data Analysis with TensorFlow in PostgreSQL

Dave Page
12 May 2021

Dave Page
● EDB (CTO Office)
○ VP & Chief Architect, Database Infrastructure
● PostgreSQL
○ Core Team
○ pgAdmin Lead Developer

2021 Copyright © EnterpriseDB Corporation All Rights Reserved
In this talk...
● What are PostgreSQL, pl/python3 and TensorFlow?
● Why would I use them together?
● Examples of analysis types.
● Calling TensorFlow from PostgreSQL.
● Preparing data.
● Designing a network.
● Training a model.
● Performing analysis.

What is PostgreSQL?
50,000 foot overview
● Relational, SQL based database.
● Fully enterprise ready; increasingly replacing Oracle, SQL Server, DB2 and more.
● Used in pretty much every sector: government, law enforcement, financial, healthcare…
● Possibly the most SQL Standard compliant database there is.
● Highly extensible:
○ Plugin extension modules.
○ Plugin procedural languages (e.g. Python, Perl, R, Java, v8).
○ Low level code hooks.

What is pl/python3?
● Procedural language for PostgreSQL.
● Write stored procedures, functions and anonymous blocks within your database.
● Supports Python 3:
○ Don’t try to use pl/python, which uses the now-obsolete Python 2!
● The vast Python ecosystem of libraries may be used.
● Combines the power of Python with PostgreSQL.

What is TensorFlow?
● Open Source Machine Learning library.
● Originated from the Google Brain team.
● Extremely powerful and flexible.
● Supports a variety of languages:
○ Python
○ C/C++
○ R
○ JavaScript
○ …
● Library of pre-built models and datasets.
● Supports distributed learning.

Why?
Not just for fun
● Our data is already in the database.
● We can easily use the power of SQL to choose and format data for analysis:
○ SQL is designed for working with datasets:
■ datum ~= scalar
■ tuple ~= vector
■ array/set ~= matrix/tensor
○ SELECT … FROM … WHERE …
○ Mathematical functions & operators: sqrt(), log(), power(), mod(), round()...
○ Aggregates and Window Functions, Common Table Expressions.
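As a rough illustration of the datum/tuple/set mapping above, the sketch below turns a query result into a feature matrix and an output vector. It assumes the rows arrive as a list of dictionaries keyed by column name, which is how pl/python3 exposes query results. The `rows` literal (with made-up values), the column choice and the `rows_to_matrix` helper are all hypothetical, not code from the talk.

```python
# Hypothetical stand-in for the rows returned by a query such as
#   plpy.execute("SELECT rm, lstat, medv FROM housing")
# The table name and the values are made up for illustration.
rows = [
    {"rm": 6.0, "lstat": 5.0, "medv": 24.0},
    {"rm": 6.5, "lstat": 9.0, "medv": 21.5},
    {"rm": 7.0, "lstat": 4.0, "medv": 34.5},
]


def rows_to_matrix(rows, feature_columns, output_column):
    """Split row dictionaries into a feature matrix and an output vector."""
    features = [[row[name] for name in feature_columns] for row in rows]
    outputs = [row[output_column] for row in rows]
    return features, outputs


features, outputs = rows_to_matrix(rows, ["rm", "lstat"], "medv")
print(features)  # one vector per row (tuple), stacked into a matrix (set)
print(outputs)   # one scalar (datum) per row
```

In real use the SELECT would do the filtering and formatting, so that the Python side only has to reshape the result.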

Regression analysis
● Model relationships between input values (features) and outputs.
● Analyse new or hypothetical inputs and predict outputs.
● For example, house prices:
○ Inputs:
■ Number of bedrooms
■ Property type (detached, semi, flat etc.)
■ Property condition
■ Proximity to the beach
■ Proximity to major roads or a rail link to the city
■ Council tax cost
■ Number of nearby pubs serving CAMRA recommended beer
○ Output:
■ The price of the house

Time series analysis
● Analyse time series data and make predictions.
● More powerful than linear analysis, predicting:
○ Linear trends (upwards or downwards)
○ Seasonal variability, e.g.
■ Summer is busier than winter.
■ Friday and Saturday night account for 60% of trade.
■ January is always the slowest month.
■ Multiple seasonalities can be predicted together.
○ Noise is inherently smoothed out, unless it overshadows trends and seasonal variations.
● Useful for multiple purposes:
○ Capacity management of application deployments.
○ Sales predictions.
○ Stock management.

Other types of analysis
Not covered in this talk!
● Text prediction/generation.
● Text classification.
● Image classification.
● Object detection.
● Audio analysis.
● Speech recognition.
● The list goes on!

Setting up pl/python3
● Install PostgreSQL:
○ If using EDB installers, use StackBuilder to install the LanguagePack.
○ On Linux, install the pl/python3 package, e.g. on Debian/Ubuntu: postgresql-plpython3-13.
● Run psql or pgAdmin, and execute:
○ CREATE EXTENSION plpython3u;

Setting up the Python environment
● Any Python libraries that will be used need to be added to the Python environment, using pip or the
OS package manager:
○ On Linux, using the system Python:
■ sudo pip3 install <package 1> …
○ On macOS, using the EDB LanguagePack:
■ sudo /Library/edb/languagepack/v1/Python-3.7/bin/pip install <package 1> …
○ On Windows, using the EDB LanguagePack (as Administrator):
■ C:\edb\languagepack\v1\Python-3.7\bin\pip install <package 1> …
● Recommended starter packages:
○ tensorflow
○ numpy (will be installed automatically as a dependency of tensorflow)
○ pandas
○ matplotlib
○ seaborn

A brief introduction to pl/python3
A.K.A. Making sure it all works
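A minimal sketch of the kind of smoke test this slide implies: a snippet that could serve as the body of a pl/python3 function and reports which Python interpreter the database is using. Inside PostgreSQL the `plpy` module is supplied by the language. The fallback class below is a hypothetical stand-in so the same code also runs from a normal Python prompt, and `environment_report` is an illustrative name rather than anything from the talk.

```python
# Hypothetical body for a function created with something like:
#   CREATE FUNCTION py_check() RETURNS text LANGUAGE plpython3u AS $$ ... $$;
import sys

try:
    import plpy  # supplied by PostgreSQL when running inside pl/python3
except ImportError:
    class plpy:  # minimal stand-in so the sketch also runs outside the database
        @staticmethod
        def notice(message):
            print("NOTICE: " + message)


def environment_report():
    """Return a one-line description of the Python interpreter in use."""
    return "Python {}.{}.{}".format(*sys.version_info[:3])


report = environment_report()
plpy.notice(report)
```

The same pattern can be used to confirm that a library such as tensorflow imports cleanly, by importing it and reporting its version string in the notice.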

Preparing the data
● Cleanup:
○ Goal: maximise the accuracy of the model.
○ Method: eliminate data that might skew results.
○ Requires: analysis and understanding of existing data.
○ Applies mostly to regression analysis where we're trying to model a relationship, rather than time series.
● Multiple data sets:
○ Training data is used to teach the model.
○ Validation data is used during training to validate what has been learnt.
○ Test data is optionally used to test the model.
○ Training vs. validation data is typically randomly selected for regression analysis.
○ Training vs. validation data is typically sequential for time series analysis.
○ Ratio of training to validation (and test) data is usually skewed towards training, e.g. 3:1 or 4:1.

Correlations
Analysis
● Some features have stronger correlations to the output than others.
● We can exclude uncorrelated or loosely correlated features to simplify the neural network (model)
and increase accuracy.
NOTICE: Correlation data:
             crim        zn     indus      chas       nox        rm       age       dis       rad       tax   ptratio         b     lstat      medv
crim      1.000000 -0.200469  0.406583 -0.055892  0.420972 -0.219247  0.352734 -0.379670  0.625505  0.582764  0.289946 -0.385064  0.455621 -0.388305
zn       -0.200469  1.000000 -0.533828 -0.042697 -0.516604  0.311991 -0.569537  0.664408 -0.311948 -0.314563 -0.391679  0.175520 -0.412995  0.360445
indus     0.406583 -0.533828  1.000000  0.062938  0.763651 -0.391676  0.644779 -0.708027  0.595129  0.720760  0.383248 -0.356977  0.603800 -0.483725
chas     -0.055892 -0.042697  0.062938  1.000000  0.091203  0.091251  0.086518 -0.099176 -0.007368 -0.035587 -0.121515  0.048788 -0.053929  0.175260
nox       0.420972 -0.516604  0.763651  0.091203  1.000000 -0.302188  0.731470 -0.769230  0.611441  0.668023  0.188933 -0.380051  0.590879 -0.427321
rm       -0.219247  0.311991 -0.391676  0.091251 -0.302188  1.000000 -0.240265  0.205246 -0.209847 -0.292048 -0.355501  0.128069 -0.613808  0.695360
age       0.352734 -0.569537  0.644779  0.086518  0.731470 -0.240265  1.000000 -0.747881  0.456022  0.506456  0.261515 -0.273534  0.602339 -0.376955
dis      -0.379670  0.664408 -0.708027 -0.099176 -0.769230  0.205246 -0.747881  1.000000 -0.494588 -0.534432 -0.232471  0.291512 -0.496996  0.249929
rad       0.625505 -0.311948  0.595129 -0.007368  0.611441 -0.209847  0.456022 -0.494588  1.000000  0.910228  0.464741 -0.444413  0.488676 -0.381626
tax       0.582764 -0.314563  0.720760 -0.035587  0.668023 -0.292048  0.506456 -0.534432  0.910228  1.000000  0.460853 -0.441808  0.543993 -0.468536
ptratio   0.289946 -0.391679  0.383248 -0.121515  0.188933 -0.355501  0.261515 -0.232471  0.464741  0.460853  1.000000 -0.177383  0.374044 -0.507787
b        -0.385064  0.175520 -0.356977  0.048788 -0.380051  0.128069 -0.273534  0.291512 -0.444413 -0.441808 -0.177383  1.000000 -0.366087  0.333461
lstat     0.455621 -0.412995  0.603800 -0.053929  0.590879 -0.613808  0.602339 -0.496996  0.488676  0.543993  0.374044 -0.366087  1.000000 -0.737663
medv     -0.388305  0.360445 -0.483725  0.175260 -0.427321  0.695360 -0.376955  0.249929 -0.381626 -0.468536 -0.507787  0.333461 -0.737663  1.000000
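One way to act on the matrix above is to keep only the features whose correlation with the output column (medv) is strong enough in either direction. The sketch below copies the medv row from the matrix; the 0.4 threshold and the `select_features` helper are assumptions chosen purely for illustration, not values from the talk. With pandas, a matrix like this is what `DataFrame.corr()` returns.

```python
# Correlation of each input feature with the output (medv), copied from the
# medv row of the correlation matrix.
medv_correlation = {
    "crim": -0.388305, "zn": 0.360445, "indus": -0.483725, "chas": 0.175260,
    "nox": -0.427321, "rm": 0.695360, "age": -0.376955, "dis": 0.249929,
    "rad": -0.381626, "tax": -0.468536, "ptratio": -0.507787, "b": 0.333461,
    "lstat": -0.737663,
}

THRESHOLD = 0.4  # assumed cut-off, chosen only for illustration


def select_features(correlations, threshold):
    """Keep features whose absolute correlation meets the threshold."""
    return [name for name, value in correlations.items()
            if abs(value) >= threshold]


selected = select_features(medv_correlation, THRESHOLD)
print(selected)
```

The absolute value matters here: lstat has the strongest relationship with medv even though its correlation is negative.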

Eliminating outliers
Analysis
● Outlier values in the training/validation data can make it harder to build an accurate model.
● Analyse the input features and automatically remove rows with outliers using a rule based on the
interquartile range (IQR), i.e. values more than 1.5 × IQR below the first quartile or above the third quartile:
NOTICE: Outliers detected using IQR:
row crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
0 False False False False False False False False False False False False False False
...
18 False False False False False False False False False False False True False False
...

Eliminating outliers
Example code
# Outlier detection
# Note: 'data' is a Pandas dataframe containing our raw data
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
# True wherever a value lies more than 1.5 * IQR outside the quartiles
outliers = (data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))
plpy.notice('Outliers detected using IQR:\n{}\n'.format(outliers))
# Outlier removal: drop every row that contains at least one outlier
plpy.notice('Removing outliers...')
data = data[~outliers.any(axis=1)]
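To see what the IQR rule does to a single column, here is a small worked sketch using only the standard library. The `values` list is made up, and `iqr_bounds` is a hypothetical helper. The "inclusive" quantile method is used because it should correspond to the linear interpolation that pandas applies by default, but treat that equivalence as an assumption to verify against your own data.

```python
from statistics import quantiles


def iqr_bounds(values):
    """Return the (lower, upper) limits outside which a value is an outlier."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up column with one extreme value
lower, upper = iqr_bounds(values)
outliers = [v for v in values if v < lower or v > upper]
print(lower, upper, outliers)
```

Here Q1 is 3.25 and Q3 is 7.75, so the IQR is 4.5 and only values below -3.5 or above 14.5 are flagged.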

Visualisation
Everyone likes a pretty picture

Creating data sets
Example code
# Figure out how many rows to use for training, validation and test
test_rows = int((actual_rows/100) * test_pct)
validation_rows = int((actual_rows/100) * validation_pct)
training_rows = actual_rows - test_rows - validation_rows
# Split the data into input and output dataframes (the last column is the output)
input = data[columns[:-1]]
output = data[columns[-1:]]
# Split the input and output into training, validation and test sets
training_input = input[:training_rows]
training_output = output[:training_rows]
validation_input = input[training_rows:training_rows+validation_rows]
validation_output = output[training_rows:training_rows+validation_rows]
test_input = input[training_rows+validation_rows:]
test_output = output[training_rows+validation_rows:]
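The slicing above takes rows in their stored order, which suits time series data. For regression the rows are typically selected at random, so they would need shuffling first. A minimal sketch of that step follows, using plain lists and the same percentage arithmetic. The `split_rows` helper, its parameters and the fixed seed are assumptions for illustration. With a pandas dataframe, `data.sample(frac=1)` is one way to shuffle before slicing.

```python
import random


def split_rows(rows, validation_pct, test_pct, shuffle=True, seed=42):
    """Split rows into training, validation and test lists."""
    rows = list(rows)
    if shuffle:
        # Random selection suits regression; pass shuffle=False for time series
        random.Random(seed).shuffle(rows)
    total = len(rows)
    test_count = int((total / 100) * test_pct)
    validation_count = int((total / 100) * validation_pct)
    training_count = total - test_count - validation_count
    training = rows[:training_count]
    validation = rows[training_count:training_count + validation_count]
    test = rows[training_count + validation_count:]
    return training, validation, test


rows = list(range(100))  # stand-in for 100 rows of data
training, validation, test = split_rows(rows, validation_pct=20, test_pct=10)
print(len(training), len(validation), len(test))
```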

Designing a model
● A model is an interconnected layered network of known mathematical functions with trainable
parameters (or filters); a.k.a. a neural network.
● Different model architectures are suited to different types of task:
○ Regression might use a simple network with multiple layers:
■ The number of input filters matches the number of input features.
■ Inner layers can be constructed as desired for best results; often based on trial and error and experience.
■ The number of output filters matches the number of outputs.
■ Layers are dense; an activation function allows modelling of non-linear functions.
○ The WaveNet architecture is well suited to time series analysis, despite being originally designed for audio
analysis:
■ A single filter on the input layer.
■ Multiple layers of filters with increasing dilation to detect seasonal patterns, e.g. 2, 4, 8, 16, 32.
■ A single filter on the output layer.
■ Layers are convolutional; all filters in one layer connect to all filters in the next.
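The effect of increasing dilation can be checked with a little arithmetic: each stacked causal convolution widens the span of past time steps that one output can see by (kernel size - 1) × dilation rate. The sketch below assumes a kernel size of 2 and dilation rates of 1, 2, 4, 8, 16 and 32; `receptive_field` is a hypothetical helper written for this illustration.

```python
def receptive_field(kernel_size, dilation_rates):
    """Number of input time steps that influence a single output step."""
    field = 1
    for rate in dilation_rates:
        field += (kernel_size - 1) * rate
    return field


steps = receptive_field(kernel_size=2, dilation_rates=(1, 2, 4, 8, 16, 32))
print(steps)
```

With these settings each output step depends on the previous 64 input steps, so a seasonal pattern longer than that would need more layers or larger dilation rates.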

Creating the model
Regression analysis
import tensorflow as tf

# Define the model
# 2 layers of 13 filters for the input features, and one layer of one filter for the output
# (input_shape matches the 13 input feature columns)
l1 = tf.keras.layers.Dense(units=13, input_shape=(13,), activation='relu')
l2 = tf.keras.layers.Dense(units=13, activation='relu')
l3 = tf.keras.layers.Dense(units=1)
model = tf.keras.Sequential([l1, l2, l3])
# Compile it
model.compile(loss=tf.keras.losses.MeanSquaredError(),
              optimizer='adam')
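To get a feel for the size of this network, the trainable parameters can be counted by hand: a dense layer has one weight per input/unit pair plus one bias per unit. The sketch assumes 13 input features (the columns other than medv in the correlation matrix) feeding the 13-13-1 layout; `dense_parameters` is a hypothetical helper.

```python
def dense_parameters(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


# 13 input features, two hidden layers of 13 units, one output unit
layer_sizes = [13, 13, 13, 1]
per_layer = [dense_parameters(a, b)
             for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

That is 182 + 182 + 14 = 378 trainable parameters, which is why a model like this trains quickly even without a GPU.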

Creating the model
Time series analysis
from tensorflow import keras

# Define the model
model = keras.models.Sequential()
# Input layer
model.add(keras.layers.InputLayer(input_shape=[None, 1]))
# Add multiple 1D convolutional layers with increasing dilation rates to
# allow each layer to detect patterns over longer time periods
for dilation_rate in (1, 2, 4, 8, 16, 32):
    model.add(keras.layers.Conv1D(filters=32, kernel_size=2, strides=1,
                                  dilation_rate=dilation_rate, padding="causal",
                                  activation="relu"))
# Add one output layer, with 1 filter to give us one output per time step
model.add(keras.layers.Conv1D(filters=1, kernel_size=1))
# Create a learning optimiser and compile the model
optimizer = keras.optimizers.Adam(learning_rate=3e-4)
model.compile(loss=keras.losses.Huber(), optimizer=optimizer, metrics=["mae"])

Training the model
● Training is repeated multiple times (or epochs), hopefully improving each time:
○ The training data set is used for learning.
○ The validation data set is used to validate results during training.
○ The test data is optionally used to test the model after training.
● We monitor a metric to assess how well the network is learning:
○ For regression, I've had success with Mean Squared Error (which I monitor as Root Mean Squared Error).
○ For time series, Huber loss works well (it's less sensitive to outliers than MSE).
● A callback is used to checkpoint (save) the model each time we see a better accuracy than any
previous epoch.
● With regression analysis, we use an 'early stopping' callback to exit the training epoch loop when
no further significant improvement is made, to prevent the network learning the training data
rather than the mathematical relationship.
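The difference between the two loss metrics mentioned above can be shown with a few numbers. The sketch below implements both by hand on made-up prediction errors, one of which is an outlier. The Huber threshold `delta=1.0` matches the Keras default; the helper names and the error values are hypothetical.

```python
from math import sqrt


def mean_squared_error(errors):
    """Average of the squared errors."""
    return sum(e * e for e in errors) / len(errors)


def huber_loss(errors, delta=1.0):
    """Quadratic for small errors, linear once |error| exceeds delta."""
    total = 0.0
    for e in errors:
        if abs(e) <= delta:
            total += 0.5 * e * e
        else:
            total += delta * (abs(e) - 0.5 * delta)
    return total / len(errors)


errors = [0.5, -0.5, 10.0]  # made-up prediction errors; the last is an outlier
mse = mean_squared_error(errors)
rmse = sqrt(mse)
huber = huber_loss(errors)
print(mse, rmse, huber)
```

The single outlier pushes MSE to 33.5 but Huber loss only to 3.25, because Huber grows linearly rather than quadratically for large errors.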

Training the model
Regression analysis
from math import sqrt
from tensorflow.keras.callbacks import EarlyStopping, LambdaCallback, ModelCheckpoint

# Save a checkpoint each time our loss metric improves.
checkpoint = ModelCheckpoint("checkpoint.h5", save_best_only=True)
# Use early stopping
early_stopping = EarlyStopping(patience=50)
# Display output. This would go to stdout automatically if we weren't using pl/python
# Note: max_z is defined elsewhere; it is used to express the RMSE as a percentage
logger = LambdaCallback(
    on_epoch_end=lambda epoch, logs: plpy.notice(
        'epoch: {}, training RMSE: {} ({}%), validation RMSE: {} ({}%)'.format(
            epoch,
            sqrt(logs['loss']), round(100 / max_z * sqrt(logs['loss']), 5),
            sqrt(logs['val_loss']), round(100 / max_z * sqrt(logs['val_loss']), 5))))
# Train it!
history = model.fit(training_input, training_output,
                    validation_data=(validation_input, validation_output),
                    epochs=epochs, verbose=False, batch_size=50,
                    callbacks=[logger, checkpoint, early_stopping])
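The interaction between checkpointing and early stopping can be simulated without TensorFlow. The sketch below is a simplified model of the behaviour, not the Keras implementation: it remembers the epoch with the best validation loss (where a checkpoint would be saved) and stops once `patience` epochs pass without improvement. The loss values are made up and `simulate_training` is a hypothetical helper.

```python
def simulate_training(validation_losses, patience):
    """Return (stop_epoch, best_epoch); stop_epoch is None if never stopped."""
    best_loss = float("inf")
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            # Improvement: this is where a checkpoint would be saved
            best_loss = loss
            best_epoch = epoch
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return None, best_epoch


losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 3.05]  # made-up validation losses
stop_epoch, best_epoch = simulate_training(losses, patience=3)
print(stop_epoch, best_epoch)
```

Training stops at epoch 5, but the model worth keeping is the one checkpointed at epoch 2, which is why the saved checkpoint is loaded later rather than the final in-memory model.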

Training the model
Time series analysis
from math import sqrt
from tensorflow import keras

# Save checkpoints when we get the best model
model_checkpoint = keras.callbacks.ModelCheckpoint("checkpoint.h5", save_best_only=True)
# Use early stopping to prevent over fitting
early_stopping = keras.callbacks.EarlyStopping(patience=50)
# Display output. This would go to stdout automatically if we weren't using pl/python
# Note: this model is compiled with Huber loss, so the square root of the loss
# logged below is not a true RMSE; max_z is defined elsewhere
logger = keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: plpy.notice(
        'epoch: {}, training RMSE: {} ({}%), validation RMSE: {} ({}%)'.format(
            epoch,
            sqrt(logs['loss']), round(100 / max_z * sqrt(logs['loss']), 5),
            sqrt(logs['val_loss']), round(100 / max_z * sqrt(logs['val_loss']), 5))))
# Train it!
history = model.fit(train_set, epochs=100,
                    validation_data=valid_set,
                    callbacks=[early_stopping, logger, model_checkpoint])

Use once vs. use many
● Each model is trained with a specific data set.
● With regression analysis, we can re-use a model with any input features to predict an output:
○ In practice this means we might use the model repeatedly over time to model different inputs.
● With time series analysis, a model is tied to the series and time period it was trained on, so it is much less reusable:
○ In practice, this means we might only use a model once when performing time series analysis.
● Models can be 're-trained' as new data becomes available:
○ If the data distribution has changed, the model might degrade.
○ It may be preferable to re-train from scratch.
● For complex problems, it may be useful to start with a suitable pre-trained generic model, and
continue training with specific data:
○ This is known as transfer learning.

Using the model
Regression analysis
CREATE OR REPLACE FUNCTION public.rg_analysis(
    input_values double precision[],
    model_path text)
    RETURNS double precision[]
    LANGUAGE 'plpython3u'
AS $BODY$
import numpy as np
import tensorflow as tf

# Reset everything
tf.keras.backend.clear_session()
tf.random.set_seed(42)

# Load the model from the path passed in by the caller
model = tf.keras.models.load_model(model_path)

# Are we dealing with a single prediction, or a list of them?
if not any(isinstance(sub, list) for sub in input_values):
    data = [input_values]
else:
    data = input_values

# Make the prediction(s); the result has one row per input and one output column
result = model.predict(np.array(data, dtype=float))
# Flatten to a plain list of Python floats for the double precision[] return type
result = [float(item) for elem in result for item in elem]
return result
$BODY$;

Using the model
Time series analysis
import matplotlib.pyplot as plt
import numpy as np
from tensorflow import keras

# Note: model_forecast() and plot_series() are helper functions defined elsewhere
# Load the best model from the last checkpoint
model = keras.models.load_model("checkpoint.h5")
cnn_forecast = model_forecast(model,
                              series[..., np.newaxis],
                              window_size)
# Keep the last time step of each window that predicts a validation sample
cnn_forecast = cnn_forecast[train_samples - window_size:-1, -1, 0]
plt.figure(figsize=(10, 6))
plot_series(dates,
            np.concatenate([series[:train_samples],
                            np.full(valid_samples, None, dtype=float)]),
            label="Training Data")
plot_series(dates,
            np.concatenate([np.full(train_samples, None, dtype=float),
                            series[train_samples:]]),
            label="Validation Data")
plot_series(dates,
            np.concatenate([np.full(train_samples, None, dtype=float),
                            cnn_forecast]),
            label="Forecast Data")
plt.savefig('ts_analysis.png')
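The slice applied to `cnn_forecast` is easier to follow with the index arithmetic written out. The sketch below assumes that `model_forecast` slides a window over the series one step at a time and returns one row per window, with the last output of each window read as the prediction for the next position. That helper is not shown in the talk, so this is a guess at its behaviour, and the sizes used are made up.

```python
def window_starts(series_length, window_size):
    """Start index of every full window when sliding one step at a time."""
    return list(range(series_length - window_size + 1))


series_length = 100  # made-up sizes for illustration
train_samples = 80
window_size = 20
valid_samples = series_length - train_samples

starts = window_starts(series_length, window_size)
# Window i covers positions i .. i + window_size - 1; its last output is
# treated as the prediction for position i + window_size.
selected = starts[train_samples - window_size:-1]
predicted_positions = [start + window_size for start in selected]
print(predicted_positions[0], predicted_positions[-1], len(predicted_positions))
```

Under these assumptions the slice keeps exactly one forecast for each validation sample, which is what allows it to be concatenated after `train_samples` placeholder values and plotted against the same dates.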

Summary
In this talk:
● We introduced PostgreSQL, TensorFlow and pl/python3.
● Discussed why we might use them together.
● Introduced two (of many) types of analysis we can perform:
○ Regression.
○ Time Series.
● Showed how we can call TensorFlow from PostgreSQL using pl/python3.
● Walked through the main steps of performing an analysis, considering regression and time series
problems:
○ Preparing the data.
○ Creating a model.
○ Training the model.
○ Using the model.

Questions and resources
Questions?
● EDB blog, includes posts on machine learning and other topics:
○ https://www.enterprisedb.com/dave-page
● Experimental code from my ML/AI journey:
○ https://github.com/dpage/ml-experiments
● Other resources:
○ https://www.postgresql.org
○ https://www.tensorflow.org
○ https://www.postgresql.org/docs/current/plpython.html
○ https://pandas.pydata.org
○ https://numpy.org
○ https://matplotlib.org
○ https://seaborn.pydata.org
