Database Design
most common pitfalls
€ whoami
● Federico Razzoli
● Freelance consultant
● Working with databases since 2000
hello@federico-razzoli.com
federico-razzoli.com
● I worked as a consultant for Percona and Ibuildings
(mainly MySQL and MariaDB)
● I worked as a DBA for fast-growing companies like
Catawiki, HumanState, TransferWise
Agenda
We will talk about…
● The most common design bad practices
● Information that is not easy to represent
● Relational model: why?
● Keys and indexes
● Data types
● Abusing NULL
● Hierarchies (trees)
● Lists
● Inheritance & polymorphism
● Heterogeneous rows
● Misc
Criteria
● Queries should be fast
● Data structures should be reasonably simple
● Design must be reasonably extendable
Why Relational?
Specific Use Cases
● Some databases are designed for specific use cases
● In those cases, they may work much better than generic technologies
● Using them when not necessary may lead to maintaining too many technologies
● A technology should only be introduced if our company has:
○ Skills
○ Knowledge necessary for troubleshooting
○ Backups
○ High Availability
○ ...
Relational is flexible
With the relational model we:
● Are sure that data is written correctly (transactions)
● Can make sure that data is valid (schema, integrity constraints)
● Design tables with access patterns in mind
● To run a query we initially didn’t consider, most of the time we can just add
an index
Flexibility example
CREATE TABLE user (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
surname VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL UNIQUE
);
SELECT * FROM user WHERE id = 24;
SELECT name, surname FROM user
WHERE email = 'picard@starfleet.earth';
CREATE INDEX idx_surname_name ON user (surname, name);
SELECT name, surname FROM user
WHERE surname LIKE 'B%'
ORDER BY surname, name;
When Relational is not a good fit
● Heterogeneous data (product catalogue)
● Searchable text
● Graphs
● …
However, for simple use cases relational databases include non-relational
features, like:
● JSON type and functions
● Arrays (PostgreSQL)
● Fulltext indexes
● ...
Keys and Indexes
Primary Key
● Column or set of columns that identifies each row (unique, not null)
● Usually you want to create an artificial column for this:
○ id
○ or uuid
Poor Primary Keys
● No primary key!
○ In MySQL this causes many performance problems
○ CDC applications need a way to identify each row
● Wrong columns
○ email
■ An email can change over time
■ An email address can be assigned to another person
■ The primary key is PII (personally identifiable information)!
○ name (eg: city name, product name…)
■ Quite long, especially if it must be UTF-8
■ Certain names can change over time
○ timestamp
■ Multiple rows could be created at the same timestamp!
■ Long
○ ...
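A minimal sketch of the alternative, assuming MySQL and a hypothetical customer table (not from the slides): the artificial id is the primary key, while email stays a regular column protected by a UNIQUE index, so it can change without touching anything that refers to the row.

```sql
-- Hypothetical table, for illustration only
CREATE TABLE customer (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(100) NOT NULL,
    UNIQUE INDEX unq_email (email)
);

-- The email changes; the id that other tables refer to stays the same
UPDATE customer SET email = 'new.address@example.com' WHERE id = 24;
```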
UNIQUE
● An index whose values are distinct, or NULL
● Could theoretically be a primary key, but it’s not
Poor UNIQUE keys
● Columns whose values will always be distinct, no matter if there is an index or
not
○ Enforcing uniqueness implies extra reads, possibly on disk
● Columns that could have duplicates, but they’re unlikely
○ timestamp
○ (last_name, first_name)
Foreign Keys
● References to another table (user.city_id -> city.id)
● In most cases they are bad for performance
● They create problems for operations (ALTER TABLE)
● In MySQL they are not compatible with some other features
○ They don’t activate triggers
○ Table partitioning
○ Tables not using InnoDB
○ Many bugs
Indexing Bad Practices
● Indexing all columns: it won’t work
● Multi-column indexes in random order
● Indexing columns with few distinct values (eg, boolean)
○ Unless you know what you’re doing
● Indexes contained in other indexes:
idx1 (email), idx2 (email, last_name)
idx (email, id)
UNIQUE unq1 (email), INDEX idx1 (email)
● Non-descriptive index names (like the ones above)
Looking at an index name (EXPLAIN),
I should know which columns it contains
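The "contained index" case can be sketched as follows, assuming MySQL and a hypothetical customer table with email and last_name columns. An index on (email, last_name) can also serve lookups by email alone, so a separate index on (email) would be redundant. The index name spells out the columns it contains.

```sql
-- Redundant, because it is a left prefix of the index below:
-- ALTER TABLE customer ADD INDEX idx_email (email);
ALTER TABLE customer ADD INDEX idx_email_last_name (email, last_name);

-- Both queries can use idx_email_last_name
SELECT id FROM customer WHERE email = 'picard@starfleet.earth';
SELECT id FROM customer
WHERE email = 'picard@starfleet.earth' AND last_name = 'Picard';

-- EXPLAIN shows which index the optimizer picked
EXPLAIN SELECT id FROM customer WHERE email = 'picard@starfleet.earth';
```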
Quick hints
● Learn how indexes work
○ Google: Federico Razzoli indexes bad practices
● Use pt-duplicate-key-checker, from Percona Toolkit
Data Types
Integer Types
● Don’t use bigger types than necessary
● ...but don’t overoptimise when you are not 100% sure. You’ll hardly see a
benefit using TINYINT instead of SMALLINT
● MySQL UNSIGNED is good: the column’s maximum value is doubled
● I discourage the use of exotic MySQL syntax like:
○ MEDIUMINT: non-standard, and 3-byte variables don’t exist in nature
○ INT(length)
○ ZEROFILL
Real Numbers
● FLOAT and DOUBLE are fast when aggregating many values
● But they are subject to approximation. Don’t use them for prices, etc
● Instead you can use:
○ DECIMAL
○ INT - Multiply a number by 100, for example
● DECIMAL is slower if heavy arithmetic is performed on many values
● But storing a transformed value (price*100) can lead to
misunderstandings and bugs
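A small sketch of the approximation problem, assuming MySQL, where a literal such as 0.1e0 is a DOUBLE and a literal such as 0.1 is an exact DECIMAL. The first comparison is expected to be false and the second true. The product_price table is hypothetical.

```sql
SELECT 0.1e0 + 0.2e0 = 0.3e0 AS double_equal,   -- approximate arithmetic
       0.1 + 0.2 = 0.3       AS decimal_equal;  -- exact arithmetic

-- Hypothetical table storing prices exactly
CREATE TABLE product_price (
    product_id INT UNSIGNED PRIMARY KEY,
    price DECIMAL(10, 2) NOT NULL
);
```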
Text Values
● Be sure that VARCHAR columns have adequate size for your data
● In PostgreSQL there is no difference between VARCHAR and TEXT, except
that for VARCHAR you specify a max size
● In MySQL TEXT and BLOB columns are stored separately
○ Less data read if you often don’t read those columns
○ More read operations if you always use SELECT *
● CHAR is only good for small fixed-size data. The space saving is tiny.
Temporal Types
● TIMESTAMP and DATETIME are mostly interchangeable
● MySQL YEAR is weird. The meaning of 2-digit values changes over time. Use
SMALLINT instead.
● MySQL TIME looks weird and useless, but not if you consider it as an
interval (range: -838:59:59 .. 838:59:59)
● PostgreSQL has a proper INTERVAL type, which is surely better
● PostgreSQL lets you specify a timezone with each value you write (TIMESTAMP
WITH TIME ZONE); the value is converted to UTC for storage
○ Timezones depend on politics, economy and religion. Offsets can differ by
as little as 15 mins. Timezones are created, abolished, and changed. In one
case a timezone was changed by skipping a whole calendar day.
○ Never deal with timezones yourself: no one in history has ever succeeded.
Store all dates as UTC, and use an external library for conversion.
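A sketch of the "store UTC" advice, assuming MySQL and a hypothetical event table. Conversion for display is better left to the application, but the last query shows what it looks like if it has to happen in SQL.

```sql
-- Hypothetical table; created_at is always written in UTC
CREATE TABLE event (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    created_at DATETIME NOT NULL
);

INSERT INTO event (created_at) VALUES (UTC_TIMESTAMP());

-- CONVERT_TZ() accepts numeric offsets out of the box;
-- named zones require the MySQL time zone tables to be loaded
SELECT id, CONVERT_TZ(created_at, '+00:00', '+01:00') AS created_at_local
FROM event;
```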
ENUM, SET
● Weird MySQL types that include a list of allowed string values
● With ENUM, exactly one value from the list is allowed
● With SET, any number of values from the list are allowed
● '' is always allowed, because.
● Specifying the value by index is allowed, so the number 1 could match '0'
● Adding, dropping and changing values requires an ALTER TABLE
○ And possibly a locking table rebuild
Instead of ENUM
CREATE TABLE account (
state ENUM('active', 'suspended') NOT NULL,
...
);
Instead of ENUM
CREATE TABLE account (
state_id INT UNSIGNED NOT NULL,
...
);
CREATE TABLE state (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
state VARCHAR(100) NOT NULL UNIQUE
);
INSERT INTO state (state) VALUES ('active'), ('suspended');
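With a lookup table, adding an allowed value is a plain INSERT rather than an ALTER TABLE. A short sketch, where 'closed' and its id are made-up examples:

```sql
-- New allowed value: no schema change needed
INSERT INTO state (id, state) VALUES (3, 'closed');

-- Reading the state name back requires a JOIN
SELECT a.state_id, s.state
FROM account a
JOIN state s ON s.id = a.state_id
WHERE s.state = 'active';
```

The price of this design is the extra JOIN whenever the name is needed.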
Abusing NULL
NULL anomalies
mysql> SELECT
NULL = 1 AS a,
NULL <> 1 AS b,
NULL IS NULL AS c,
1 IS NOT NULL AS d;
+------+------+---+---+
| a | b | c | d |
+------+------+---+---+
| NULL | NULL | 1 | 1 |
+------+------+---+---+
-- This returns TRUE in MySQL:
NULL <=> NULL AND 1 <=> 1
Problematic queries
These queries will not return rows where year or approved is NULL
● WHERE year != 1994
● WHERE NOT (year = 1994)
● WHERE year > 2000
● WHERE NOT (year > 2000)
● WHERE approved != TRUE
● WHERE NOT approved
And this returns NULL for rows where year is NULL:
SELECT CONCAT(year, ' years old') FROM user ...
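If the NULLs cannot be avoided, the conditions have to handle them explicitly. A sketch of possible rewrites, assuming MySQL and the column names used above:

```sql
-- Rows where year is NULL are included explicitly
SELECT id FROM user WHERE year <> 1994 OR year IS NULL;

-- MySQL/MariaDB only: the NULL-safe operator never returns NULL
SELECT id FROM user WHERE NOT (year <=> 1994);

-- "Not approved", including rows where approved is NULL
SELECT id FROM user WHERE approved IS NOT TRUE;

-- CONCAT() returns NULL if any argument is NULL; COALESCE() supplies a fallback
SELECT COALESCE(CONCAT(year, ' years old'), 'unknown') FROM user;
```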
Bad Reasons for NULL
● Because columns are NULLable by default
● To indicate that a value doesn’t exist
○ Use a special value instead: '' or -1 or 0 or …
○ But this is not always a bad reason: UNIQUE allows multiple NULLs
● Using your tables as spreadsheets
Spreadsheet Example
CREATE TABLE user (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
-- if a user may have multiple URL’s, let’s move them
-- to a separate table:
-- url { id, user_id, url }
url_1 VARCHAR(100),
url_2 VARCHAR(100),
url_3 VARCHAR(100),
url_4 VARCHAR(100),
url_5 VARCHAR(100)
);
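The separate table mentioned in the comment could look like this. It is a sketch: the index name and the NOT NULL constraints are assumptions.

```sql
-- One row per URL, any number of URLs per user
CREATE TABLE url (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id INT UNSIGNED NOT NULL,
    url VARCHAR(100) NOT NULL,
    INDEX idx_user_id (user_id)
);

-- All URLs of one user
SELECT url FROM url WHERE user_id = 24;
```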
Spreadsheet Example
CREATE TABLE user (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
-- if we may have users bank data or not,
-- let’s move them to another table:
-- bank { user_id, account_no, account_holder, ... }
bank_account_no VARCHAR(50),
bank_account_holder VARCHAR(100),
bank_iban VARCHAR(100),
bank_swift_code VARCHAR(11)
);
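A sketch of the bank table mentioned in the comment. The NOT NULL constraints are assumptions. A row exists only for users who provided bank data, so no column needs to be NULLable.

```sql
CREATE TABLE bank (
    user_id INT UNSIGNED PRIMARY KEY,
    account_no VARCHAR(50) NOT NULL,
    account_holder VARCHAR(100) NOT NULL,
    iban VARCHAR(100) NOT NULL,
    swift_code VARCHAR(11) NOT NULL
);

-- Users with their bank data, if any
SELECT u.id, u.email, b.iban
FROM user u
LEFT JOIN bank b ON b.user_id = u.id;
```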
Hierarchies
Category Hierarchies
Antipattern: column-per-level
TABLE product (id, category_name, subcategory_name, name, price, ...)
-----
TABLE category (id, name)
TABLE product (id, category_id, subcategory_id, name, price, ...)
Possible problems:
● To add or delete a level, we need to add or drop a column
● A subcategory can be erroneously linked to multiple categories
● A category can be erroneously used as subcategory, and vice versa
Category Hierarchies
A better way:
TABLE category (id, parent_id, name)
TABLE product (id, category_id, name, price, ...)
Possible problems:
● Circular dependencies (must be prevented at application level)
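With the parent_id design, a whole subtree can be read with a recursive query. A sketch, assuming MySQL 8.0+ or MariaDB 10.2+ and a made-up root id of 1:

```sql
-- All descendants of category 1, with their depth
WITH RECURSIVE subtree AS (
    SELECT id, parent_id, name, 0 AS depth
    FROM category
    WHERE id = 1
    UNION ALL
    SELECT c.id, c.parent_id, c.name, s.depth + 1
    FROM category c
    JOIN subtree s ON c.parent_id = s.id
)
SELECT id, name, depth FROM subtree ORDER BY depth, name;
```

If the data contains a circular dependency, a query like this does not terminate on its own and is stopped by the server's recursion limit.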
Category Networks
What if every category can have multiple parents?
Antipattern:
TABLE category (id, parent_id1, parent_id2, name)
Category Graphs
If every category can have multiple parents, correct pattern:
TABLE category (id, name)
TABLE category_relationship (parent_id, child_id)
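A possible definition of the relationship table, as a sketch. The composite primary key and the index name are assumptions, and 42 is a made-up id.

```sql
CREATE TABLE category_relationship (
    parent_id INT UNSIGNED NOT NULL,
    child_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (parent_id, child_id),
    INDEX idx_child_id (child_id)
);

-- All direct parents of category 42
SELECT c.id, c.name
FROM category_relationship r
JOIN category c ON c.id = r.parent_id
WHERE r.child_id = 42;
```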
Antipattern: Parent List
Storing the whole list of ancestors in a single column:
TABLE category (id, name, parent_list)
INSERT INTO category (parent_list, name) VALUES
('sports/football/wear', 'football shoes');
● This antipattern is sometimes used because it simplifies certain aspects
● But it overcomplicates other aspects
● Also, until recently MySQL and MariaDB did not support recursive queries,
but now they do
Storing Lists
Tags Column
● Suppose you want to store user-typed tags for posts
● You may be tempted to:
CREATE TABLE post (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
tags VARCHAR(200)
);
INSERT INTO post (tags, ... ) VALUES ('sunday,venus', ... );
Tags Column
● But what about this query?
SELECT id FROM post WHERE tags LIKE '%sun%';
● Mmm, maybe this is better:
INSERT INTO post (tags, ... ) VALUES (',events,diversity,', ... );
SELECT id FROM post WHERE tags LIKE '%,sun,%';
However, this query cannot take advantage of indexes
Tag Table
CREATE TABLE post (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
...
);
CREATE TABLE tag (
post_id INT UNSIGNED,
tag VARCHAR(50),
PRIMARY KEY (post_id, tag),
INDEX (tag)
);
It works.
Queries will be able to use indexes.
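Two typical queries against the tag table, as a sketch:

```sql
-- Posts tagged exactly 'sun' (no match on 'sunday'); can use the index on tag
SELECT post_id FROM tag WHERE tag = 'sun';

-- Posts that have both tags; the primary key rules out duplicate rows,
-- so counting rows per post is enough
SELECT post_id
FROM tag
WHERE tag IN ('events', 'diversity')
GROUP BY post_id
HAVING COUNT(*) = 2;
```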
Tag Array
-- PostgreSQL
CREATE TABLE post (
id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
tags TEXT[]
);
CREATE INDEX idx_tags ON post USING GIN (tags);
-- MySQL 8.0.17+ (multi-valued index)
CREATE TABLE post (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
tags JSON DEFAULT (JSON_ARRAY()),
INDEX idx_tags ((CAST(tags->'$' AS CHAR(50) ARRAY)))
);
-- MariaDB can store JSON arrays,
-- but since it cannot index them this solution is not viable
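Searching for a tag then looks roughly like this. It is a sketch: whether an index is used depends on how the column is indexed and on the query matching the indexed expression.

```sql
-- PostgreSQL: rows whose tags array contains 'sun'
SELECT id FROM post WHERE tags @> ARRAY['sun'];

-- MySQL 8.0.17+: MEMBER OF tests membership in a JSON array
SELECT id FROM post WHERE 'sun' MEMBER OF (tags->'$');
```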
Inheritance and Polymorphism
Not So Different Entities
● Your DB has users, landlords and tenants
● Separate entities with different info
● But sometimes you treat them as one thing
● What to do?
Inheritance
● In the simplest case, they are just subclasses
● For example, landlords and tenants could be types of users
● Common properties are in the parent class
-- relational way to represent it:
TABLE user (id, first_name, last_name, email)
TABLE landlord (id, user_id, vat_number)
TABLE tenant (id, user_id, landlord_id)
PostgreSQL lets you do this in a more object-oriented way, with Table Inheritance
Different Entities
● But sometimes it’s better to consider them different entities
● Antipattern: Union View
CREATE VIEW everyone AS
(SELECT id, first_name, last_name FROM landlord)
UNION
(SELECT id, first_name, last_name FROM tenant)
;
This makes some queries less verbose, at the cost of making them
potentially very slow
Uniqueness Across Tables /1
● But maybe both landlords and tenants have emails,
and we want to make sure they are UNIQUE
● Question: is there a practical reason?
Uniqueness Across Tables /2
● If it is necessary, you’re thinking about the problem in the wrong way
● If emails need to be unique, they are an entity of their own, so you’ll
guarantee uniqueness on a single table
TABLE landlord (id, first_name, last_name, vat_number)
TABLE tenant (id, first_name, last_name, landlord_id)
TABLE email (id, email UNIQUE, landlord_id, tenant_id)
Bloody hell! The solution initially looks great, but linking emails to landlords or
tenants in that way is horrific!
Uniqueness Across Tables /2bis
Why?
● Cannot build foreign keys (I don’t recommend it, but…)
● If in the future we want to link emails to suppliers, employees, etc, we’ll need
to add columns to the table
Uniqueness Across Tables /3
Even if we keep the landlord and tenant tables separate,
we can create a superset called person.
We decided it’s not a parent class, so it can just have an id column.
Every landlord, tenant and email is linked to a person.
TABLE landlord (id, person_id, first_name, last_name, vat_number)
TABLE tenant (id, person_id, first_name, last_name, landlord_id)
TABLE person (id)
TABLE email (id, person_id, email UNIQUE)
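A sketch of the two new tables and of a lookup by email, assuming MySQL. Column sizes and index names are assumptions.

```sql
CREATE TABLE person (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY
);
CREATE TABLE email (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    person_id INT UNSIGNED NOT NULL,
    email VARCHAR(100) NOT NULL,
    UNIQUE INDEX unq_email (email),
    INDEX idx_person_id (person_id)
);

-- Who owns this address? It may be a landlord, a tenant, or both.
SELECT e.person_id, l.id AS landlord_id, t.id AS tenant_id
FROM email e
LEFT JOIN landlord l ON l.person_id = e.person_id
LEFT JOIN tenant t ON t.person_id = e.person_id
WHERE e.email = 'picard@starfleet.earth';
```

Linking emails to a new kind of person (suppliers, employees) then needs no change to the email table.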
Heterogeneous Rows
Catalog of Products
Imagine we have a catalogue of products where:
● Every product has certain common characteristics
● It’s important to be able to run queries on all products
○ SELECT id FROM p WHERE qty = 0;
○ SELECT MAX(price) FROM p GROUP BY vendor;
● Each product type also has a unique set of characteristics
Antipattern: Spreadsheet Table
● Keep all products in the same table
● Add a column for every characteristic that applies to at least one product
● Where a column doesn’t make sense, set to NULL
Problems:
● Too many columns and indexes
○ Generally bad for query performance, especially INSERTs
○ Generally bad for operations (repair, backup, restore, ALTER TABLE…)
● Adding/removing a product type means adding/removing a set of columns
○ But in practice columns will hardly be removed and will remain unused
● NULL means both “no value for this product” and “doesn’t apply to this type of
product”, leading to endless confusion
Antipattern: Table per Type
● Store products of different types in different tables
Problems:
● Metadata become data
○ How to get the list of product types?
● Some queries become overcomplicated
○ Get the id’s of out of stock products
○ Most expensive product for each vendor
Hybrid
● A single table for characteristics common to all product types
● A separate table per product type, for non-common characteristics
Problems:
● Many JOINs
● Adding/removing product types means adding/removing tables
Semi-Structured Data
● A single table for all products
● A regular column for each column common to all product types
● A semi-structured column for all type-specific characteristics
○ JSON, HStore…
○ Not arrays
○ Not CSV
● Proper indexes on semi-structured data (depending on your technology)
Problems:
● Still a big table
● Queries on semi-structured data may be complicated and not supported by
ORMs
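One way to index a type-specific characteristic is a generated column over the JSON document. A sketch, assuming MySQL 5.7+ and a hypothetical product table. The attributes and material names and the sample row are illustrative.

```sql
-- Common characteristics are regular columns, the rest is JSON
CREATE TABLE product (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    qty INT UNSIGNED NOT NULL,
    attributes JSON NOT NULL,
    -- Generated column extracted from the JSON document, so it can be indexed
    material VARCHAR(50) AS (attributes->>'$.material') VIRTUAL,
    INDEX idx_material (material)
);

INSERT INTO product (name, price, qty, attributes)
VALUES ('Bed', 299.00, 3, '{"material": "wood", "width_cm": 160}');

-- Can use idx_material
SELECT id, name FROM product WHERE material = 'wood';
```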
Antipattern: Entity,Attribute,Value
TABLE entity (id, name)
TABLE attribute (id, entity_id, name)
TABLE value (id, attribute_id, value)
● Each product type is an entity
● Each type’s characteristics are stored in attribute
● Each product is a set of values
Example:
Entity { id: 24, name: "Bed" }
Attribute [ { id: 123, entity_id: 24, name: "material" }, ... ]
Value [ { id: 999, attribute_id: 123, value: "wood" } ]
Antipattern: Entity,Attribute,Value
Problems:
● We JOIN 3 tables every time we want to get a single value!
● All values must be treated as texts
○ Unless we create multiple value tables: int_value, text_value...
○ Which means, even more JOINs
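For reference, reading a single value with the three tables above looks like this. It is a sketch; the value table and column are quoted with backticks because VALUE is a keyword in MySQL.

```sql
-- Find the value of the "material" attribute of the "Bed" entity
SELECT v.`value`
FROM entity e
JOIN attribute a ON a.entity_id = e.id
JOIN `value` v ON v.attribute_id = a.id
WHERE e.name = 'Bed'
  AND a.name = 'material';
```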
Misc Antipatterns
Names Beyond Comprehension
● I saw the following table names in production:
○ marco2015 # Marco was the table’s creator
○ jan2015 # jan was the month
○ tmp_tmp_tmp_fix
○ tmp_fix_fix_fix # Because symmetry is cool
I forgot many other examples because...
“Ultimate horror often paralyses memory in a merciful way.”
― H.P. Lovecraft
Data in Metadata
● Include data in table names
○ invoice_2020, invoice_2019, invoice_2018…
● Use a year column instead
● If the table is too big, there are other ways to contain the problem
(partitioning)
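A sketch of a single invoice table partitioned by year, assuming MySQL or MariaDB. Column names other than year, and the partition names, are assumptions.

```sql
CREATE TABLE invoice (
    id INT UNSIGNED AUTO_INCREMENT,
    year SMALLINT UNSIGNED NOT NULL,
    amount DECIMAL(10, 2) NOT NULL,
    -- the partitioning column must be part of every unique key
    PRIMARY KEY (id, year)
)
PARTITION BY RANGE (year) (
    PARTITION p2018 VALUES LESS THAN (2019),
    PARTITION p2019 VALUES LESS THAN (2020),
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Only the matching partition needs to be read
SELECT SUM(amount) FROM invoice WHERE year = 2019;
```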
Bad Names in General
● A name should tell everyone what a table or column is
○ Even new hires!
○ Even you… 5 years from now!
● Otherwise people have to look at other documentation sources
○ ….which typically don’t exist
● Names should follow a standard across all company databases
○ singular/plural, long/short names, ...
● So people don’t have to check exactly what a table or column is called
Thank you for listening!
federico-razzoli.com/services
Telegram channel:
open_source_databases

More Related Content

PDF
Object Based Databases
PDF
Relational database management system
PPTX
Data Retrival
PPTX
Sql Basics And Advanced
PDF
How MySQL can boost (or kill) your application
PDF
MariaDB workshop
PDF
query optimization
PDF
Exalead managing terrabytes
Object Based Databases
Relational database management system
Data Retrival
Sql Basics And Advanced
How MySQL can boost (or kill) your application
MariaDB workshop
query optimization
Exalead managing terrabytes

Similar to Database Design most common pitfalls (20)

ODP
MySQL Performance Optimization
PPTX
7 Database Mistakes YOU Are Making -- Linuxfest Northwest 2019
PPTX
Indexes: The Second Pillar of Database Wisdom
PDF
Introduction to Databases - query optimizations for MySQL
PDF
Scaling MySQL Strategies for Developers
KEY
10x improvement-mysql-100419105218-phpapp02
KEY
10x Performance Improvements
PPTX
Boosting MySQL (for starters)
PPTX
"MySQL Boosting - DB Best Practices & Optimization" by José Luis Martínez - C...
PDF
Zurich2007 MySQL Query Optimization
PDF
Zurich2007 MySQL Query Optimization
PDF
How MySQL can boost (or kill) your application v2
PDF
MySQL 8.0: not only good, it’s GREAT! - PHP UK 2019
PDF
Relational Database Design Bootcamp
PPTX
Why Your Database Queries Stink -SeaGl.org November 11th, 2016
PDF
Performance Tuning Best Practices
KEY
PostgreSQL
PPTX
SQL Server 2012 Best Practices
PDF
Beyond php - it's not (just) about the code
PPT
sql-basic.ppt
MySQL Performance Optimization
7 Database Mistakes YOU Are Making -- Linuxfest Northwest 2019
Indexes: The Second Pillar of Database Wisdom
Introduction to Databases - query optimizations for MySQL
Scaling MySQL Strategies for Developers
10x improvement-mysql-100419105218-phpapp02
10x Performance Improvements
Boosting MySQL (for starters)
"MySQL Boosting - DB Best Practices & Optimization" by José Luis Martínez - C...
Zurich2007 MySQL Query Optimization
Zurich2007 MySQL Query Optimization
How MySQL can boost (or kill) your application v2
MySQL 8.0: not only good, it’s GREAT! - PHP UK 2019
Relational Database Design Bootcamp
Why Your Database Queries Stink -SeaGl.org November 11th, 2016
Performance Tuning Best Practices
PostgreSQL
SQL Server 2012 Best Practices
Beyond php - it's not (just) about the code
sql-basic.ppt
Ad

More from Federico Razzoli (20)

PDF
MariaDB Data Protection: Backup Strategies for the Real World
PDF
MariaDB/MySQL_: Developing Scalable Applications
PDF
Webinar: Designing a schema for a Data Warehouse
PDF
High-level architecture of a complete MariaDB deployment
PDF
Webinar - Unleash AI power with MySQL and MindsDB
PDF
MariaDB Security Best Practices
PDF
A first look at MariaDB 11.x features and ideas on how to use them
PDF
MariaDB stored procedures and why they should be improved
PDF
Webinar - MariaDB Temporal Tables: a demonstration
PDF
Webinar - Key Reasons to Upgrade to MySQL 8.0 or MariaDB 10.11
PDF
MariaDB 10.11 key features overview for DBAs
PDF
Recent MariaDB features to learn for a happy life
PDF
Advanced MariaDB features that developers love.pdf
PDF
Automate MariaDB Galera clusters deployments with Ansible
PDF
Creating Vagrant development machines with MariaDB
PDF
MariaDB, MySQL and Ansible: automating database infrastructures
PDF
Playing with the CONNECT storage engine
PDF
MariaDB Temporal Tables
PDF
MySQL and MariaDB Backups
PDF
JSON in MySQL and MariaDB Databases
MariaDB Data Protection: Backup Strategies for the Real World
MariaDB/MySQL_: Developing Scalable Applications
Webinar: Designing a schema for a Data Warehouse
High-level architecture of a complete MariaDB deployment
Webinar - Unleash AI power with MySQL and MindsDB
MariaDB Security Best Practices
A first look at MariaDB 11.x features and ideas on how to use them
MariaDB stored procedures and why they should be improved
Webinar - MariaDB Temporal Tables: a demonstration
Webinar - Key Reasons to Upgrade to MySQL 8.0 or MariaDB 10.11
MariaDB 10.11 key features overview for DBAs
Recent MariaDB features to learn for a happy life
Advanced MariaDB features that developers love.pdf
Automate MariaDB Galera clusters deployments with Ansible
Creating Vagrant development machines with MariaDB
MariaDB, MySQL and Ansible: automating database infrastructures
Playing with the CONNECT storage engine
MariaDB Temporal Tables
MySQL and MariaDB Backups
JSON in MySQL and MariaDB Databases
Ad

Recently uploaded (20)

PPTX
Why Generative AI is the Future of Content, Code & Creativity?
PDF
Product Update: Alluxio AI 3.7 Now with Sub-Millisecond Latency
PDF
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
PPTX
Transform Your Business with a Software ERP System
PDF
PTS Company Brochure 2025 (1).pdf.......
PPTX
Odoo POS Development Services by CandidRoot Solutions
PDF
Design an Analysis of Algorithms II-SECS-1021-03
PDF
Cost to Outsource Software Development in 2025
PDF
wealthsignaloriginal-com-DS-text-... (1).pdf
PDF
Nekopoi APK 2025 free lastest update
PDF
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
PDF
Navsoft: AI-Powered Business Solutions & Custom Software Development
PPTX
history of c programming in notes for students .pptx
PDF
iTop VPN Free 5.6.0.5262 Crack latest version 2025
PPTX
assetexplorer- product-overview - presentation
PPTX
Introduction to Artificial Intelligence
PDF
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
PDF
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
PDF
Digital Strategies for Manufacturing Companies
PPTX
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
Why Generative AI is the Future of Content, Code & Creativity?
Product Update: Alluxio AI 3.7 Now with Sub-Millisecond Latency
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
Transform Your Business with a Software ERP System
PTS Company Brochure 2025 (1).pdf.......
Odoo POS Development Services by CandidRoot Solutions
Design an Analysis of Algorithms II-SECS-1021-03
Cost to Outsource Software Development in 2025
wealthsignaloriginal-com-DS-text-... (1).pdf
Nekopoi APK 2025 free lastest update
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
Navsoft: AI-Powered Business Solutions & Custom Software Development
history of c programming in notes for students .pptx
iTop VPN Free 5.6.0.5262 Crack latest version 2025
assetexplorer- product-overview - presentation
Introduction to Artificial Intelligence
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
Digital Strategies for Manufacturing Companies
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...

Database Design most common pitfalls

  • 2. € whoami ● Federico Razzoli ● Freelance consultant ● Working with databases since 2000 [email protected] federico-razzoli.com ● I worked as a consultant for Percona and Ibuildings (mainly MySQL and MariaDB) ● I worked as a DBA for fast-growing companies like Catawiki, HumanState, TransferWise
  • 3. Agenda We will talk about… ● The most common design bad practices ● Information that is not easy to represent ● Relational model: why? ● Keys and indexes ● Data types ● Abusing NULL ● Hierarchies (trees) ● Lists ● Inheritance & polymorphism ● Heterogeneous rows ● Misc
  • 5. Criteria ● Queries should be fast ● Data structures should be reasonably simple ● Design must be reasonably extendable
  • 7. Specific Use Cases ● Some databases are designed for specific use cases ● In those cases, they may work much better than generic technologies ● Using them when not necessary may lead to use many technologies ● A technology should only be introduced if our company has: ○ Skills ○ Knowledge necessary for troubleshooting ○ Backups ○ High Availability ○ ...
  • 8. Relational is flexible With the relational model we: ● Are sure that data is written correctly (transactions) ● Can make sure that data is valid (schema, integrity constraints) ● Design tables with access patterns in mind ● To run a query we initially didn’t consider, most of the times we can just add an index
  • 9. Flexibility example CREATE TABLE user ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100) NOT NULL, surname VARCHAR(100) NOT NULL, email VARCHAR(100) NOT NULL UNIQUE ); SELECT * FROM user WHERE id = 24; SELECT name, surname FROM user WHERE email = '[email protected]'; CREATE INDEX idx_surname_name ON user (surname, name); SELECT name, surname FROM user WHERE surname LIKE 'B%' ORDER BY surname, name;
  • 10. When Relational is not a good fit ● Heterogeneous data (product catalogue) ● Searchable text ● Graphs ● … However, for simple use cases relational databases include non-relational features, like: ● JSON type and functions ● Arrays (PostgreSQL) ● Fulltext indexes ● ...
  • 12. Primary Key ● Column or set of columns that identifies each row (unique, not null) ● Usually you want to create an artificial column for this: ○ id ○ or uuid
  • 13. Poor Primary Keys ● No primary key! ○ In MySQL this causes many performance problems ○ CDC applications need a way to identify each row ● Wrong columns ○ email ■ An email can change over time ■ An email address can be assigned to another person ■ The primary key is a PII! ○ name (eg: city name, product name…) ■ Quite long, especially if it must be UTF-8 ■ Certain names can change over time ○ timestamp ■ Multiple rows could be created at the same timestamp! ■ Long ○ ...
  • 14. UNIQUE ● An index whose values are distinct, or NULL ● Could theoretically be a primary key, but it’s not
  • 15. Poor UNIQUE keys ● Columns whose values will always be distinct, no matter if there is an index or not ○ Enforcing unicity implies extra reads, possibly on disk ● Columns that could have duplicates, but they’re unlikely ○ timestamp ○ (last_name, first_name)
  • 16. Foreign Keys ● References to another table (user.city_id -> city.id) ● In most cases they are bad for performance ● They create problems for operations (ALTER TABLE) ● In MySQL they are not compatible with some other features ○ They don’t activate triggers ○ Table partitioning ○ Tables not using InnoDB ○ Many bugs
  • 17. Indexing Bad Practices ● Indexing all columns: it won’t work ● Multi-columns indexes in random order ● Indexing columns with few distinct values (eg, boolean) ○ Unless you know what you’re doing ● Indexes contained in other indexes: idx1 (email), idx2 (email, last_name) idx (email, id) UNIQUE unq1 (email), INDEX idx1 (email) ● Non-descriptive index names (like the ones above) Looking at an index name (EXPLAIN), I should know which columns it contains
  • 18. Quick hints ● Learn how indexes work ○ Google: Federico Razzoli indexes bad practices ● Use pt-duplicate-key-checker, from Percona Toolkit
  • 20. Integer Types ● Don’t use bigger types than necessary ● ...but don’t overoptimise when you are not 100% sure. You’ll hardly see a benefit using TINYINT instead of SMALLINT ● MySQL UNSIGNED is good, column’s max is double ● I discourage the use of exotic MySQL syntax like: ○ MEDIUMINT: non-standard, and 3-bytes variables don’t exist in nature ○ INT(length) ○ ZEROFILL
  • 21. Real Numbers ● FLOAT and DOUBLE are fast when aggregating many values ● But they are subject to approximation. Don’t use them for prices, etc ● Instead you can use: ○ DECIMAL ○ INT - Multiply a number by 100, for example ○ DECIMAL is slower if heavy arithmetics is performed on many values ○ But storing a transformed value (price*100) can lead to misunderstandings and bugs
  • 22. Text Values ● Be sure that VARCHAR columns have adequate size for your data ● In PostgreSQL there is no difference between VARCHAR and TEXT, except that for VARCHAR you specify a max size ● In MySQL TEXT and BLOB columns are stored separately ○ Less data read if you often don’t read those columns ○ More read operations if you always use SELECT * ● CHAR is only good for small fixed-size data. The space saving is tiny.
  • 23. Temporal Types ● TIMESTAMP and DATETIME are mostly interchangeable ● MySQL YEAR is weird. 2-digit values meaning changes over time. Use SMALLINT inxtead. ● MySQL TIME is apparently weird and useless. But not if you consider it as an interval. (range: -838:59:59 .. 838:59:59) ● PostgreSQL has a proper INTERVAL type, which is surely better ● PostgreSQL allows to specify a timezone for each value (TIMESTAMP WITH TIMEZONE) ○ Timezones depend on policy, economy and religion. They may vary by 15 mins. Timezones are created, dismissed, and changed. In one case a timezone was changed by skipping a whole calendar day. ○ Never deal with timezones yourself, no one ever succeeded in history. Store all dates as UTC, use an external library for conversion.
  • 24. ENUM, SET ● MySQL weird types that include a list of allowed string values ● With ENUM, any number of values from the list are allowed ● With SET, exactly one value from the list is allowed ● '' is always allowed, because. ● Specifying the value by index is allowed, so 0 could match '1' ● Adding, dropping and changing values requires an ALTER TABLE ○ And possibly a locking table rebuild
  • 25. Instead of ENUM CREATE TABLE account ( state ENUM('active', 'suspended') NOT NULL, ... )
  • 26. Instead of ENUM CREATE TABLE account ( state_id INT UNSIGNED NOT NULL, ... ) CREATE TABLE state ( id INT UNSIGNED PRIMARY KEY, state VARCHAR(100) NOT NULL UNIQUE ) INSERT INTO state (state) VALUES ('active'), ('suspended');
  • 28. NULL anomalies mysql> SELECT NULL = 1 AS a, NULL <> 1 AS b, NULL IS NULL AS c, 1 IS NOT NULL AS d; +------+------+---+---+ | a | b | c | d | +------+------+---+---+ | NULL | NULL | 1 | 1 | +------+------+---+---+ -- This returns TRUE in MySQL: NULL <=> NULL AND 1 <=> 1
  • 29. Problematic queries These queries will not return rows with age = NULL or approved = NULL ● WHERE year != 1994 ● WHERE NOT (year = 1994) ● WHERE year > 2000 ● WHERE NOT (year > 2000) ● WHERE approved != TRUE ● WHERE NOT approved And: SELECT CONCAT(year, ' years old') FROM user ...
  • 30. Bad Reasons for NULL ● Because columns are NULLable by default ● To indicate that a value doesn’t exist ○ Use a special value instead: '' or -1 or 0 or … ○ But this is not always a bad reason: UNIQUE allows multiple NULLs ● Using your tables as spreadsheets
  • 31. Spreadsheet Example CREATE TABLE user ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, first_name VARCHAR(100) NOT NULL, last_name VARCHAR(100) NOT NULL, email VARCHAR(100) NOT NULL, -- if a user may have multiple URL’s, let’s move them -- to a separate table: -- url { id, user_id, url } url_1 VARCHAR(100), url_2 VARCHAR(100), url_3 VARCHAR(100), url_4 VARCHAR(100), url_5 VARCHAR(100) );
  • 32. Spreadsheet Example CREATE TABLE user ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, first_name VARCHAR(100) NOT NULL, last_name VARCHAR(100) NOT NULL, email VARCHAR(100) NOT NULL, -- if we may have users bank data or not, -- let’s move them to another table: -- bank { user_id, account_no, account_holder, ... } bank_account_no VARCHAR(50), bank_account_holder VARCHAR(100), bank_iban VARCHAR(100), bank_swift_code VARCHAR(5) );
  • 34. Category Hierarchies Antipattern: column-per-level TABLE product (id, category_name, subcategory_name, name, price, ..) ----- TABLE category (id, name) TABLE product (id, category_id, subcategory_id, name, price, ...) Possible problems: ● To add or delete a level, we need to add or drop a column ● A subcategory can be erroneously linked to multiple categories ● A category can be erroneously used as subcategory, and vice versa
  • 35. Category Hierarchies A better way: TABLE category (id, parent_id, name) TABLE product (id, category_id, name, price, ...) Possible problems: ● Circular dependencies (must be prevented at application level)
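With the parent_id design, a whole subtree can be read with one recursive query. The sketch below runs on SQLite through Python's sqlite3 module; the category names are sample data, and the same `WITH RECURSIVE` shape should carry over to recent MySQL and MariaDB versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE category (
           id INTEGER PRIMARY KEY,
           parent_id INTEGER REFERENCES category (id),
           name TEXT NOT NULL
       )"""
)
conn.executemany(
    "INSERT INTO category (id, parent_id, name) VALUES (?, ?, ?)",
    [(1, None, "sports"), (2, 1, "football"), (3, 2, "wear"), (4, None, "music")],
)

# Walk down from one root. A cycle in parent_id would make this recursion
# run forever, which is why cycles have to be prevented when writing.
subtree_sql = """
WITH RECURSIVE subtree (id, name, depth) AS (
    SELECT id, name, 0 FROM category WHERE id = ?
    UNION ALL
    SELECT c.id, c.name, s.depth + 1
    FROM category c JOIN subtree s ON c.parent_id = s.id
)
SELECT name, depth FROM subtree ORDER BY depth
"""
subtree = conn.execute(subtree_sql, (1,)).fetchall()
print(subtree)
```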
  • 36. Category Networks What if every category can have multiple parents? Antipattern: TABLE category (id, parent_id1, parent_id2, name)
  • 37. Category Graphs If every category can have multiple parents, correct pattern: TABLE category (id, name) TABLE category_relationship (parent_id, child_id)
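A possible way to query such a graph is sketched below, again on SQLite via Python's sqlite3 module with made-up sample categories. The recursive query uses `UNION` rather than `UNION ALL`, so a category reached through two different paths is returned only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE category_relationship (
        parent_id INTEGER NOT NULL REFERENCES category (id),
        child_id  INTEGER NOT NULL REFERENCES category (id),
        PRIMARY KEY (parent_id, child_id)
    );
    """
)
conn.executemany(
    "INSERT INTO category (id, name) VALUES (?, ?)",
    [(1, "sports"), (2, "football"), (3, "fashion"), (4, "football shoes")],
)
# "football shoes" has two parents: football and fashion.
conn.executemany(
    "INSERT INTO category_relationship (parent_id, child_id) VALUES (?, ?)",
    [(1, 2), (2, 4), (3, 4)],
)

ancestors_sql = """
WITH RECURSIVE ancestor (id) AS (
    SELECT parent_id FROM category_relationship WHERE child_id = ?
    UNION
    SELECT r.parent_id
    FROM category_relationship r JOIN ancestor a ON r.child_id = a.id
)
SELECT c.name FROM ancestor a JOIN category c ON c.id = a.id ORDER BY c.name
"""
ancestors = [r[0] for r in conn.execute(ancestors_sql, (4,))]
print(ancestors)
```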
  • 38. Antipattern: Parent List
    Storing the whole list of ancestors in a column:
    TABLE category (id, name, parent_list)
    INSERT INTO category (parent_list, name) VALUES ('sports/football/wear', 'football shoes');
    ● This antipattern is sometimes used because it simplifies certain aspects
    ● But it overcomplicates other aspects
    ● Also, until recently MySQL and MariaDB did not support recursive queries, but now they do
  • 40. Tags Column
    ● Suppose you want to store user-typed tags for posts
    ● You may be tempted to:
    CREATE TABLE post (
      id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
      tags VARCHAR(200)
    );
    INSERT INTO post (tags, ... ) VALUES ('sunday,venus', ... );
  • 41. Tags Column
    ● But what about this query? It also matches posts tagged 'sunday':
    SELECT id FROM post WHERE tags LIKE '%sun%';
    ● Mmm, maybe this is better:
    INSERT INTO post (tags, ... ) VALUES (',events,diversity,', ... );
    SELECT id FROM post WHERE tags LIKE '%,sun,%';
    However, this query cannot take advantage of indexes
  • 42. Tag Table CREATE TABLE post ( id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, ... ); CREATE TABLE tag ( post_id INT UNSIGNED, tag VARCHAR(50), PRIMARY KEY (post_id, tag), INDEX (tag) ); It works. Queries will be able to use indexes.
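The difference between the two designs can be seen side by side in the sketch below, which uses SQLite through Python's sqlite3 module. The `post_csv` table and the sample tags are hypothetical and exist only to contrast the comma-separated column with the tag table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE post (id INTEGER PRIMARY KEY);
    CREATE TABLE tag (
        post_id INTEGER NOT NULL REFERENCES post (id),
        tag     TEXT NOT NULL,
        PRIMARY KEY (post_id, tag)
    );
    CREATE INDEX idx_tag ON tag (tag);

    -- Hypothetical comma-separated variant, for comparison only.
    CREATE TABLE post_csv (id INTEGER PRIMARY KEY, tags TEXT);
    """
)
conn.executemany("INSERT INTO post (id) VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO tag (post_id, tag) VALUES (?, ?)",
    [(1, "sunday"), (1, "venus"), (2, "sun"), (2, "events")],
)
conn.executemany(
    "INSERT INTO post_csv (id, tags) VALUES (?, ?)",
    [(1, "sunday,venus"), (2, "sun,events")],
)

# Equality on the tag table is exact and can be served by idx_tag.
exact = [r[0] for r in conn.execute(
    "SELECT post_id FROM tag WHERE tag = 'sun' ORDER BY post_id")]

# The substring search on the CSV column also matches 'sunday'.
fuzzy = [r[0] for r in conn.execute(
    "SELECT id FROM post_csv WHERE tags LIKE '%sun%' ORDER BY id")]

print(exact, fuzzy)
```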
  • 43. Tag Array
    -- PostgreSQL
    CREATE TABLE post (
      id SERIAL PRIMARY KEY,
      tags TEXT[]
    );
    CREATE INDEX idx_tags ON post USING GIN (tags);
    -- MySQL 8.0.17+ (multi-valued index over the JSON array)
    CREATE TABLE post (
      id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
      tags JSON DEFAULT (JSON_ARRAY()),
      INDEX idx_tags ((CAST(tags AS CHAR(50) ARRAY)))
    );
    -- MariaDB can store JSON arrays,
    -- but since it cannot index them this solution is not viable
  • 45. Not So Different Entities ● Your DB has users, landlords and tenants ● Separate entities with different info ● But sometimes you treat them as one thing ● What to do?
  • 46. Inheritance
    ● In the simplest case, they are just subclasses
    ● For example, landlords and tenants could be types of users
    ● Common properties are in the parent class
    -- relational way to represent it:
    TABLE user (id, first_name, last_name, email)
    TABLE landlord (id, user_id, vat_number)
    TABLE tenant (id, user_id, landlord_id)
    PostgreSQL allows doing this in a more object-oriented way, with Table Inheritance
  • 47. Different Entities ● But sometimes it’s better to consider them different entities ● Antipattern: Union View CREATE VIEW everyone AS (SELECT id, first_name, last_name FROM landlord) UNION (SELECT id, first_name, last_name FROM tenant) ; This makes some queries less verbose, at the cost of making them potentially very slow
  • 48. Unicity Across Tables /1 ● But maybe both landlords and tenants have emails, and we want to make sure they are UNIQUE ● Question: is there a practical reason?
  • 49. Unicity Across Tables /2
    ● If it is necessary, you’re thinking about the problem the wrong way
    ● If emails need to be unique, they are a whole entity, so you’ll guarantee unicity on a single table
    TABLE landlord (id, first_name, last_name, vat_number)
    TABLE tenant (id, first_name, last_name, landlord_id)
    TABLE email (id, email UNIQUE, landlord_id, tenant_id)
    Bloody hell! The solution initially looks great, but linking emails to landlords or tenants in that way is horrific!
  • 50. Unicity Across Tables /2bis Why? ● Cannot build foreign keys (I don’t recommend it, but…) ● If in the future we want to link emails to suppliers, employees, etc, we’ll need to add columns to the table
  • 51. Unicity Across Tables /3
    Even if we keep the landlord and tenant tables separate, we can create a superset called person. We decided it’s not a parent class, so it can just have an id column. Every landlord, tenant and email is linked to a person.
    TABLE landlord (id, person_id, first_name, last_name, vat_number)
    TABLE tenant (id, person_id, first_name, last_name, landlord_id)
    TABLE person (id)
    TABLE email (id, person_id, email UNIQUE)
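A minimal sketch of the person superset, using SQLite through Python's sqlite3 module. Column types, the NOT NULL and UNIQUE choices on `person_id`, and the sample people are assumptions added for the example; the slide only fixes the table and column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY);
    CREATE TABLE landlord (
        id INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL UNIQUE REFERENCES person (id),
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        vat_number TEXT
    );
    CREATE TABLE tenant (
        id INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL UNIQUE REFERENCES person (id),
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        landlord_id INTEGER REFERENCES landlord (id)
    );
    CREATE TABLE email (
        id INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL REFERENCES person (id),
        email TEXT NOT NULL UNIQUE
    );
    """
)
conn.executemany("INSERT INTO person (id) VALUES (?)", [(1,), (2,)])
conn.execute(
    "INSERT INTO landlord (id, person_id, first_name, last_name) "
    "VALUES (1, 1, 'Jean-Luc', 'Picard')"
)
conn.execute(
    "INSERT INTO tenant (id, person_id, first_name, last_name, landlord_id) "
    "VALUES (1, 2, 'William', 'Riker', 1)"
)

# The landlord registers an address first...
conn.execute(
    "INSERT INTO email (person_id, email) VALUES (1, 'picard@starfleet.earth')"
)
# ...so the tenant cannot reuse it: one UNIQUE index covers both entities.
try:
    conn.execute(
        "INSERT INTO email (person_id, email) VALUES (2, 'picard@starfleet.earth')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

Linking emails to suppliers or employees later only requires giving those tables a person_id; the email table stays unchanged.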
  • 53. Catalog of Products Imagine we have a catalogue of products where: ● Every product has certain common characteristics ● It’s important to be able to run queries on all products ○ SELECT id FROM p WHERE qty = 0; ○ SELECT MAX(price) FROM p GROUP BY vendor; ● Each product type also has a unique set of characteristics
  • 54. Antipattern: Spreadsheet Table ● Keep all products in the same table ● Add a column for every characteristic that applies to at least one product ● Where a column doesn’t make sense, set to NULL Problems: ● Too many columns and indexes ○ Generally bad for query performance, especially INSERTs ○ Generally bad for operations (repair, backup, restore, ALTER TABLE…) ● Adding/removing a product type means adding/removing a set of columns ○ But in practice columns will hardly be removed and will remain unused ● NULL means both “no value for this product” and “doesn’t apply to this type of products”, leading to endless confusion
  • 55. Antipattern: Table per Type ● Store products of different types in different tables Problems: ● Metadata become data ○ How to get the list of product types? ● Some queries become overcomplicated ○ Get the id’s of out of stock products ○ Most expensive product for each vendor
  • 56. Hybrid ● A single table for characteristics common to all product types ● A separate table per product type, for non-common characteristics Problems: ● Many JOINs ● Adding/removing product types means adding/removing tables
  • 57. Semi-Structured Data ● A single table for all products ● A regular column for each characteristic common to all product types ● A semi-structured column for all type-specific characteristics ○ JSON, HStore… ○ Not arrays ○ Not CSV ● Proper indexes on semi-structured data (depending on your technology) Problems: ● Still a big table ● Queries on semi-structured data may be complicated and not supported by ORMs
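The idea can be sketched with SQLite through Python's sqlite3 module, assuming a SQLite build that includes the JSON functions (most current builds do). The `attributes` column name, the index name and the sample products are hypothetical; MySQL, MariaDB and PostgreSQL each have their own JSON functions and indexing options.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE product (
           id INTEGER PRIMARY KEY,
           name TEXT NOT NULL,
           vendor TEXT NOT NULL,
           price REAL NOT NULL,
           qty INTEGER NOT NULL,
           attributes TEXT NOT NULL DEFAULT '{}'  -- type-specific data as JSON
       )"""
)
conn.executemany(
    "INSERT INTO product (id, name, vendor, price, qty, attributes) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Bed", "Acme", 300.0, 0, json.dumps({"material": "wood"})),
        (2, "Lamp", "Acme", 40.0, 5, json.dumps({"watts": 9})),
        (3, "Desk", "Initech", 150.0, 0, json.dumps({"material": "steel"})),
    ],
)

# Queries on common characteristics stay plain SQL over all products.
out_of_stock = [r[0] for r in conn.execute(
    "SELECT id FROM product WHERE qty = 0 ORDER BY id")]

# An expression index on one JSON key; the query repeats the same expression.
conn.execute(
    "CREATE INDEX idx_material "
    "ON product (json_extract(attributes, '$.material'))"
)
wooden = [r[0] for r in conn.execute(
    "SELECT name FROM product "
    "WHERE json_extract(attributes, '$.material') = 'wood'")]

print(out_of_stock, wooden)
```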
  • 58. Antipattern: Entity,Attribute,Value
    TABLE entity (id, name)
    TABLE attribute (id, entity_id, name)
    TABLE value (id, attribute_id, value)
    ● Each product type is an entity
    ● Each type’s characteristics are stored in attribute
    ● Each product is a set of values
    Example:
    Entity { id: 24, name: "Bed" }
    Attribute [ { id: 123, entity_id: 24, name: "material" }, ... ]
    Value [ { id: 999, attribute_id: 123, value: "wood" } ]
  • 59. Antipattern: Entity,Attribute,Value Problems: ● We JOIN 3 tables every time we want to get a single value! ● All values must be treated as texts ○ Unless we create multiple value tables: int_value, text_value... ○ Which means, even more JOINs
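Both problems show up even in a tiny example. The sketch below uses SQLite through Python's sqlite3 module and the ids from the previous slide; the `product_id` column and the `width_cm` attribute are assumptions added so that a value can be tied to one product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE attribute (
        id INTEGER PRIMARY KEY,
        entity_id INTEGER NOT NULL REFERENCES entity (id),
        name TEXT NOT NULL
    );
    CREATE TABLE value (
        id INTEGER PRIMARY KEY,
        attribute_id INTEGER NOT NULL REFERENCES attribute (id),
        product_id INTEGER NOT NULL,  -- hypothetical: which product it describes
        value TEXT NOT NULL
    );
    INSERT INTO entity VALUES (24, 'Bed');
    INSERT INTO attribute VALUES (123, 24, 'material'), (124, 24, 'width_cm');
    INSERT INTO value VALUES (999, 123, 1, 'wood'), (1000, 124, 1, '160');
    """
)

# Three tables are joined just to read one characteristic of one product.
single_value_sql = """
SELECT v.value, typeof(v.value)
FROM entity e
JOIN attribute a ON a.entity_id = e.id
JOIN value v ON v.attribute_id = a.id
WHERE e.name = 'Bed' AND a.name = ? AND v.product_id = 1
"""
material, _ = conn.execute(single_value_sql, ("material",)).fetchone()

# The width is numeric in meaning but comes back as text and needs a cast.
width, width_type = conn.execute(single_value_sql, ("width_cm",)).fetchone()

print(material, width, width_type)
```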
  • 61. Names Beyond Comprehension ● I saw the following table names in production: ○ marco2015 # Marco was the table’s creator ○ jan2015 # jan was the month ○ tmp_tmp_tmp_fix ○ tmp_fix_fix_fix # Because symmetry is cool I forgot many other examples because... “Ultimate horror often paralyses memory in a merciful way.” ― H.P. Lovecraft
  • 62. Data in Metadata ● Antipattern: including data in table names ○ invoice_2020, invoice_2019, invoice_2018… ● Use a year column instead ● If the table is too big, there are other ways to contain the problem (partitioning)
  • 63. Bad Names in General ● A name should tell everyone what a table or column is ○ Even to new hires! ○ Even to you… in 5 years from now! ● Otherwise people have to look at other documentation sources ○ ….which typically don’t exist ● Names should follow a standard across all company databases ○ singular/plural, long/short names, ... ● So people don’t have to check what a table or column is called exactly
  • 64. Thank you for listening! federico-razzoli.com/services Telegram channel: open_source_databases