Database Design most common pitfalls


€ whoami
● Federico Razzoli
● Freelance consultant
● Working with databases since 2000
hello@federico-razzoli.com
federico-razzoli.com
● I worked as a consultant for Percona and Ibuildings
(mainly MySQL and MariaDB)
● I worked as a DBA for fast-growing companies like
Catawiki, HumanState, TransferWise

Agenda
We will talk about…
● The most common design bad practices
● Information that is not easy to represent
● Relational model: why?
● Keys and indexes
● Data types
● Abusing NULL
● Hierarchies (trees)
● Lists
● Inheritance & polymorphism
● Heterogeneous rows
● Misc

Criteria
● Queries should be fast
● Data structures should be reasonably simple
● Design must be reasonably extendable

Specific Use Cases
● Some databases are designed for specific use cases
● In those cases, they may work much better than generic technologies
● Using them when not necessary may lead to using too many technologies
● A technology should only be introduced if our company has:
○ Skills
○ Knowledge necessary for troubleshooting
○ Backups
○ High Availability
○ ...

Relational is flexible
With the relational model we:
● Are sure that data is written correctly (transactions)
● Can make sure that data is valid (schema, integrity constraints)
● Design tables with access patterns in mind
● Can usually run a query we initially didn’t consider just by adding
an index

Flexibility example
CREATE TABLE user (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
surname VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL UNIQUE
);
SELECT * FROM user WHERE id = 24;
SELECT name, surname FROM user
WHERE email = 'picard@starfleet.earth';
CREATE INDEX idx_surname_name ON user (surname, name);
SELECT name, surname FROM user
WHERE surname LIKE 'B%'
ORDER BY surname, name;

When Relational is not a good fit
● Heterogeneous data (product catalogue)
● Searchable text
● Graphs
● …
However, for simple use cases relational databases include non-relational
features, like:
● JSON type and functions
● Arrays (PostgreSQL)
● Fulltext indexes
● ...

Primary Key
● Column or set of columns that identifies each row (unique, not null)
● Usually you want to create an artificial column for this:
○ id
○ or uuid

Poor Primary Keys
● No primary key!
○ In MySQL this causes many performance problems
○ CDC applications need a way to identify each row
● Wrong columns
○ email
■ An email can change over time
■ An email address can be assigned to another person
■ The primary key is PII (personally identifiable information)!
○ name (eg: city name, product name…)
■ Quite long, especially if it must be UTF-8
■ Certain names can change over time
○ timestamp
■ Multiple rows could be created at the same timestamp!
■ Long
○ ...

UNIQUE
● An index whose values are distinct, or NULL
● Could theoretically be a primary key, but it’s not

Poor UNIQUE keys
● Columns whose values will always be distinct, no matter if there is an index or
not
○ Enforcing uniqueness implies extra reads, possibly from disk
● Columns that could have duplicates, but they’re unlikely
○ timestamp
○ (last_name, first_name)

Foreign Keys
● References to another table (user.city_id -> city.id)
● In most cases they are bad for performance
● They create problems for operations (ALTER TABLE)
● In MySQL they don’t play well with some other features
○ Cascading actions (ON DELETE / ON UPDATE) don’t activate triggers
○ Table partitioning
○ Tables not using InnoDB
○ Many bugs

Indexing Bad Practices
● Indexing all columns: it won’t work
● Multi-column indexes with columns in random order
● Indexing columns with few distinct values (eg, boolean)
○ Unless you know what you’re doing
● Indexes contained in other indexes:
idx1 (email), idx2 (email, last_name)
idx (email, id)
UNIQUE unq1 (email), INDEX idx1 (email)
● Non-descriptive index names (like the ones above)
Looking at an index name (EXPLAIN),
I should know which columns it contains
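
Why column order matters can be sketched with a small experiment. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module instead of MySQL, so the plan text differs from MySQL's EXPLAIN, and the table is a simplified copy of the user table from the flexibility example.

```python
import sqlite3

# SQLite stands in for MySQL here; plan output differs between engines.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        surname TEXT NOT NULL,
        email   TEXT NOT NULL
    );
    -- Descriptive name: it lists the indexed columns, in order.
    CREATE INDEX idx_surname_name ON user (surname, name);
""")


def plan(sql):
    """Return the query plan details as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Filter on the leftmost indexed column: the index can be used.
plan_surname = plan("SELECT * FROM user WHERE surname = 'Picard'")
# Filter only on the second column: the index cannot narrow the search.
plan_name = plan("SELECT * FROM user WHERE name = 'Jean-Luc'")

print(plan_surname)
print(plan_name)
```

The first plan mentions idx_surname_name, the second falls back to scanning the table. The index name alone tells a reader which columns are involved.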

Quick hints
● Learn how indexes work
○ Google: Federico Razzoli indexes bad practices
● Use pt-duplicate-key-checker, from Percona Toolkit

Integer Types
● Don’t use bigger types than necessary
● ...but don’t overoptimise when you are not 100% sure. You’ll hardly see a
benefit using TINYINT instead of SMALLINT
● MySQL UNSIGNED is good: the column’s maximum value doubles
● I discourage the use of exotic MySQL syntax like:
○ MEDIUMINT: non-standard, and 3-byte variables don’t exist in nature
○ INT(length)
○ ZEROFILL

Real Numbers
● FLOAT and DOUBLE are fast when aggregating many values
● But they are subject to approximation. Don’t use them for prices, etc
● Instead you can use:
○ DECIMAL
○ INT - Multiply a number by 100, for example
● DECIMAL is slower if heavy arithmetic is performed on many values
● But storing a transformed value (price*100) can lead to
misunderstandings and bugs
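
The approximation problem is easy to reproduce outside the database. A minimal sketch in Python, where float behaves like a DOUBLE column and decimal.Decimal like a DECIMAL column (the amounts are made up for illustration):

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2

# Decimal keeps exact base-10 digits, like a DECIMAL column.
decimal_total = Decimal("0.10") + Decimal("0.20")

# Integer cents are exact too, but every reader of the data
# must know that the stored value is the price multiplied by 100.
cents_total = 10 + 20

print(float_total == 0.3)                # False
print(decimal_total == Decimal("0.30"))  # True
print(cents_total)                       # 30
```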

Text Values
● Be sure that VARCHAR columns have adequate size for your data
● In PostgreSQL there is no difference between VARCHAR and TEXT, except
that for VARCHAR you specify a max size
● In MySQL TEXT and BLOB columns are stored separately
○ Less data read if you often don’t read those columns
○ More read operations if you always use SELECT *
● CHAR is only good for small fixed-size data. The space saving is tiny.

Temporal Types
● TIMESTAMP and DATETIME are mostly interchangeable
● MySQL YEAR is weird. The meaning of 2-digit values changes over time. Use
SMALLINT instead.
● MySQL TIME is apparently weird and useless. But not if you consider it as an
interval. (range: -838:59:59 .. 838:59:59)
● PostgreSQL has a proper INTERVAL type, which is surely better
● PostgreSQL lets you specify a time zone with each value (TIMESTAMP WITH
TIME ZONE)
○ Timezones depend on policy, economy and religion. They may vary by 15
mins. Timezones are created, dismissed, and changed. In one case a
timezone was changed by skipping a whole calendar day.
○ Never deal with timezones yourself, no one ever succeeded in history.
Store all dates as UTC, use an external library for conversion.

ENUM, SET
● MySQL weird types that include a list of allowed string values
● With ENUM, exactly one value from the list is allowed
● With SET, any number of values from the list are allowed
● '' is always allowed, because.
● Specifying the value by index is allowed, so in ENUM('0', '1') the number 1 matches '0'
● Adding, dropping and changing values requires an ALTER TABLE
○ And possibly a locking table rebuild

Instead of ENUM
CREATE TABLE account (
state ENUM('active', 'suspended') NOT NULL,
...
);

Instead of ENUM
CREATE TABLE account (
state_id INT UNSIGNED NOT NULL,
...
);
CREATE TABLE state (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
state VARCHAR(100) NOT NULL UNIQUE
);
INSERT INTO state (state) VALUES ('active'), ('suspended');

NULL anomalies
mysql> SELECT
NULL = 1 AS a,
NULL <> 1 AS b,
NULL IS NULL AS c,
1 IS NOT NULL AS d;
+------+------+---+---+
| a | b | c | d |
+------+------+---+---+
| NULL | NULL | 1 | 1 |
+------+------+---+---+
-- This returns TRUE in MySQL:
NULL <=> NULL AND 1 <=> 1

Problematic queries
These queries will not return rows where year or approved is NULL:
● WHERE year != 1994
● WHERE NOT (year = 1994)
● WHERE year > 2000
● WHERE NOT (year > 2000)
● WHERE approved != TRUE
● WHERE NOT approved
And this returns NULL for rows where age is NULL:
SELECT CONCAT(age, ' years old') FROM user ...
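
The effect on WHERE clauses can be checked with three rows. A minimal sketch using Python's sqlite3 module (SQLite follows the same three-valued logic for these comparisons; the table and its data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, year INTEGER)")
conn.executemany(
    "INSERT INTO user (year) VALUES (?)",
    [(1994,), (2001,), (None,)],  # None is stored as NULL
)


def count(where):
    """Count the rows matching the given WHERE condition."""
    sql = "SELECT COUNT(*) FROM user WHERE " + where
    return conn.execute(sql).fetchone()[0]


# Both forms silently skip the row where year is NULL.
naive = count("year != 1994")
negated = count("NOT (year = 1994)")
# The NULL row is only returned when asked for explicitly.
explicit = count("year != 1994 OR year IS NULL")

print(naive, negated, explicit)  # 1 1 2
```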

Bad Reasons for NULL
● Because columns are NULLable by default
● To indicate that a value doesn’t exist
○ Use a special value instead: '' or -1 or 0 or …
○ But this is not always a bad reason: UNIQUE allows multiple NULLs
● Using your tables as spreadsheets

Spreadsheet Example
CREATE TABLE user (
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
-- if a user may have multiple URLs, let’s move them
-- to a separate table:
-- url { id, user_id, url }
url_1 VARCHAR(100),
url_2 VARCHAR(100),
url_3 VARCHAR(100),
url_4 VARCHAR(100),
url_5 VARCHAR(100)
);
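
The separate url table suggested in the comment could look like this. It is a sketch in Python with sqlite3, so the types are simplified, and the index name and sample rows are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (
        id         INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL,
        email      TEXT NOT NULL
    );
    -- One row per URL: no NULLs, and no limit of 5 URLs per user.
    CREATE TABLE url (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        url     TEXT NOT NULL
    );
    CREATE INDEX idx_user_id ON url (user_id);

    INSERT INTO user (id, first_name, last_name, email) VALUES
        (1, 'Jean-Luc', 'Picard', 'picard@starfleet.earth'),
        (2, 'William', 'Riker', 'riker@starfleet.earth');
    INSERT INTO url (user_id, url) VALUES
        (1, 'https://example.com/a'),
        (1, 'https://example.com/b'),
        (1, 'https://example.com/c');
""")

# Users without URLs simply have no rows in url.
urls_per_user = conn.execute("""
    SELECT u.id, COUNT(url.id)
    FROM user u
    LEFT JOIN url ON url.user_id = u.id
    GROUP BY u.id
    ORDER BY u.id
""").fetchall()

print(urls_per_user)  # [(1, 3), (2, 0)]
```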

Spreadsheet Example
CREATE TABLE user (
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(100) NOT NULL,
-- if we may have users’ bank data or not,
-- let’s move them to another table:
-- bank { user_id, account_no, account_holder, ... }
bank_account_no VARCHAR(50),
bank_account_holder VARCHAR(100),
bank_iban VARCHAR(100),
bank_swift_code VARCHAR(11)
);

Category Hierarchies
Antipattern: column-per-level
TABLE product (id, category_name, subcategory_name, name, price, ..)
-----
TABLE category (id, name)
TABLE product (id, category_id, subcategory_id, name, price, ...)
Possible problems:
● To add or delete a level, we need to add or drop a column
● A subcategory can be erroneously linked to multiple categories
● A category can be erroneously used as subcategory, and vice versa

Category Hierarchies
A better way:
TABLE category (id, parent_id, name)
TABLE product (id, category_id, name, price, ...)
Possible problems:
● Circular dependencies (must be prevented at application level)
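
With the parent_id design, a whole subtree can be read with one recursive query. A minimal sketch in Python with sqlite3 (the same WITH RECURSIVE syntax exists in current MySQL, MariaDB and PostgreSQL); the sample categories are made up, and using NULL as the parent of a root category is an assumption of this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER,          -- NULL marks a root category
        name      TEXT NOT NULL
    );
    INSERT INTO category (id, parent_id, name) VALUES
        (1, NULL, 'sports'),
        (2, 1,    'football'),
        (3, 2,    'wear'),
        (4, NULL, 'books');
""")


def descendants(root_id):
    """Return the names in the subtree rooted at root_id, top level first."""
    rows = conn.execute("""
        WITH RECURSIVE subtree (id, name, depth) AS (
            SELECT id, name, 0 FROM category WHERE id = ?
            UNION ALL
            SELECT c.id, c.name, s.depth + 1
            FROM category c
            JOIN subtree s ON c.parent_id = s.id
        )
        SELECT name FROM subtree ORDER BY depth
    """, (root_id,)).fetchall()
    return [row[0] for row in rows]


sports_tree = descendants(1)
print(sports_tree)  # ['sports', 'football', 'wear']
```

A circular dependency would make this query loop until the engine stops it, which is one more reason to prevent cycles when writing.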

Category Networks
What if every category can have multiple parents?
Antipattern:
TABLE category (id, parent_id1, parent_id2, name)

Category Graphs
If every category can have multiple parents, correct pattern:
TABLE category (id, name)
TABLE category_relationship (parent_id, child_id)
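
A sketch of the relationship table in use, written in Python with sqlite3. The composite primary key, the extra index and the sample rows are assumptions added for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE category_relationship (
        parent_id INTEGER NOT NULL,
        child_id  INTEGER NOT NULL,
        PRIMARY KEY (parent_id, child_id)
    );
    -- Lets us look up parents by child as well.
    CREATE INDEX idx_child_id ON category_relationship (child_id);

    INSERT INTO category (id, name) VALUES
        (1, 'sports'), (2, 'fashion'), (3, 'football shoes');
    -- 'football shoes' has two parents: one row per relationship.
    INSERT INTO category_relationship (parent_id, child_id) VALUES
        (1, 3), (2, 3);
""")

parents = [row[0] for row in conn.execute("""
    SELECT p.name
    FROM category_relationship r
    JOIN category p ON p.id = r.parent_id
    WHERE r.child_id = ?
    ORDER BY p.name
""", (3,))]

print(parents)  # ['fashion', 'sports']
```

Adding a third parent is one more row, not one more column.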

Antipattern: Parent List
Storing the list of ancestors in a single column:
TABLE category (id, name, parent_list)
INSERT INTO category (parent_list, name) VALUES
('sports/football/wear', 'football shoes');
● This antipattern is sometimes used because it simplifies certain aspects
● But it overcomplicates other aspects
● Also, until recently MySQL and MariaDB did not support recursive queries,
but now they do

Tags Column
● Suppose you want to store user-typed tags for posts
● You may be tempted to:
CREATE TABLE post (
tags VARCHAR(200)
);
INSERT INTO post (tags, ... ) VALUES ('sunday,venus', ... );

Tags Column
● But what about this query?
SELECT id FROM post WHERE tags LIKE '%sun%';
● Mmm, maybe this is better:
INSERT INTO post (tags, ... ) VALUES (',events,diversity,', ... );
SELECT id FROM post WHERE tags LIKE '%,sun,%';
However, this query cannot take advantage of indexes

Tag Table
CREATE TABLE post (
...
);
CREATE TABLE tag (
post_id INT UNSIGNED,
tag VARCHAR(50),
PRIMARY KEY (post_id, tag),
INDEX (tag)
);
It works.
Queries will be able to use indexes.
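
A sketch of the tag table answering the earlier "sun" query, in Python with sqlite3. The title column, the index name and the sample rows are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE tag (
        post_id INTEGER NOT NULL,
        tag     TEXT NOT NULL,
        PRIMARY KEY (post_id, tag)
    );
    CREATE INDEX idx_tag ON tag (tag);

    INSERT INTO post (id, title) VALUES (1, 'Weekend'), (2, 'Astronomy');
    INSERT INTO tag (post_id, tag) VALUES
        (1, 'sunday'), (1, 'venus'),
        (2, 'sun'),    (2, 'venus');
""")


def posts_with_tag(tag):
    """Return the ids of the posts carrying exactly this tag."""
    rows = conn.execute(
        "SELECT post_id FROM tag WHERE tag = ? ORDER BY post_id", (tag,)
    ).fetchall()
    return [row[0] for row in rows]


sun_posts = posts_with_tag("sun")
venus_posts = posts_with_tag("venus")

print(sun_posts)    # [2]
print(venus_posts)  # [1, 2]
```

The equality comparison no longer matches 'sunday' when searching for 'sun', and it can be served by the index on tag.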

Tag Array
-- PostgreSQL
CREATE TABLE post (
tags TEXT[]
);
CREATE INDEX idx_tags on post USING GIN (tags);
-- MySQL 8.0.17+ (multi-valued index on the array elements)
CREATE TABLE post (
tags JSON DEFAULT (JSON_ARRAY()),
INDEX idx_tags ((CAST(tags->'$' AS CHAR(50) ARRAY)))
);
-- MariaDB can store JSON arrays,
-- but since it cannot index them this solution is not viable

Not So Different Entities
● Your DB has users, landlords and tenants
● Separate entities with different info
● But sometimes you treat them as one thing
● What to do?

Inheritance
● In the simplest case, they are just subclasses
● For example, landlords and tenants could be types of users
● Common properties are in the parent class
-- relational way to represent it:
TABLE user (id, first_name, last_name, email)
TABLE landlord (id, user_id, vat_number)
TABLE tenant (id, user_id, landlord_id)
PostgreSQL lets you do this in a more object-oriented way, with Table Inheritance

Different Entities
● But sometimes it’s better to consider them different entities
● Antipattern: Union View
CREATE VIEW everyone AS
(SELECT id, first_name, last_name FROM landlord)
UNION
(SELECT id, first_name, last_name FROM tenant)
;
This makes some queries less verbose, at the cost of making them
potentially very slow

Unicity Across Tables /1
● But maybe both landlords and tenants have emails,
and we want to make sure they are UNIQUE
● Question: is there a practical reason?

● If it is necessary, you’re thinking about the problem in the wrong way
● If emails need to be unique, they are an entity of their own, so you’ll guarantee
uniqueness on a single table
TABLE landlord (id, first_name, last_name, vat_number)
TABLE tenant (id, first_name, last_name, landlord_id)
TABLE email (id, email UNIQUE, landlord_id, tenant_id)
Bloody hell! The solution initially looks great, but linking emails to landlords or
tenants in that way is horrific!

Unicity Across Tables /2bis
Why?
● Cannot build foreign keys (I don’t recommend it, but…)
● If in the future we want to link emails to suppliers, employees, etc, we’ll need
to add columns to the table

Even if we keep the landlord and tenant tables separated,
we can create a superset called person.
We decided it’s not a parent class, so it can just have an id column.
Every landlord, tenant and email is linked to a person.
TABLE landlord (id, person_id, first_name, last_name, vat_number)
TABLE tenant (id, person_id, first_name, last_name, landlord_id)
TABLE person (id)
TABLE email (id, person_id, email UNIQUE)
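
A sketch of how the person superset enforces uniqueness across both tables, in Python with sqlite3. Column types and sample data are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY);
    CREATE TABLE landlord (
        id         INTEGER PRIMARY KEY,
        person_id  INTEGER NOT NULL,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL,
        vat_number TEXT NOT NULL
    );
    CREATE TABLE tenant (
        id          INTEGER PRIMARY KEY,
        person_id   INTEGER NOT NULL,
        first_name  TEXT NOT NULL,
        last_name   TEXT NOT NULL,
        landlord_id INTEGER NOT NULL
    );
    -- Emails live in one table, so one UNIQUE index covers everyone.
    CREATE TABLE email (
        id        INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL,
        email     TEXT NOT NULL UNIQUE
    );

    INSERT INTO person (id) VALUES (1), (2);
    INSERT INTO landlord (person_id, first_name, last_name, vat_number)
        VALUES (1, 'Ada', 'Rossi', 'IT0000');
    INSERT INTO tenant (person_id, first_name, last_name, landlord_id)
        VALUES (2, 'Bob', 'Bianchi', 1);
    INSERT INTO email (person_id, email) VALUES (1, 'ada@example.com');
""")

# The tenant tries to register the landlord's address.
try:
    conn.execute(
        "INSERT INTO email (person_id, email) VALUES (2, 'ada@example.com')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```

Linking emails to suppliers or employees later only requires giving those tables a person_id; the email table stays unchanged.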

Catalog of Products
Imagine we have a catalogue of products where:
● Every product has certain common characteristics
● It’s important to be able to run queries on all products
○ SELECT id FROM p WHERE qty = 0;
○ SELECT vendor, MAX(price) FROM p GROUP BY vendor;
● Each product type also has a unique set of characteristics

Antipattern: Spreadsheet Table
● Keep all products in the same table
● Add a column for every characteristic that applies to at least one product
● Where a column doesn’t make sense, set to NULL
Problems:
● Too many columns and indexes
○ Generally bad for query performance, especially INSERTs
○ Generally bad for operations (repair, backup, restore, ALTER TABLE…)
● Adding/removing a product type means adding/removing a set of columns
○ But in practice columns will hardly be removed and will remain unused
● NULL means both “no value for this product” and “doesn’t apply to this type of
products”, leading to endless confusion

Antipattern: Table per Type
● Store products of different types in different tables
Problems:
● Metadata become data
○ How to get the list of product types?
● Some queries become overcomplicated
○ Get the ids of out of stock products
○ Most expensive product for each vendor

Hybrid
● A single table for characteristics common to all product types
● A separate table per product type, for non-common characteristics
Problems:
● Many JOINs
● Adding/removing product types means adding/removing tables

Semi-Structured Data
● A single table for all products
● A regular column for each column common to all product types
● A semi-structured column for all type-specific characteristics
○ JSON, HStore…
○ Not arrays
○ Not CSV
● Proper indexes on semi-structured data (depending on your technology)
Problems:
● Still a big table
● Queries on semi-structured data may be complicated and not supported by
ORMs
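
A sketch of the single-table layout, in Python with sqlite3. To stay portable it stores the JSON as text and decodes it in the application; a real deployment would rather use the database's own JSON type, functions and indexes. The column names, the sample products and the price expressed in cents are all assumptions:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product (
        id         INTEGER PRIMARY KEY,
        type       TEXT NOT NULL,
        vendor     TEXT NOT NULL,
        price      INTEGER NOT NULL,   -- in cents (assumption)
        qty        INTEGER NOT NULL,
        attributes TEXT NOT NULL       -- JSON document, type-specific
    )
""")

products = [
    (1, "bed", "Acme", 30000, 0, {"material": "wood", "size": "king"}),
    (2, "lamp", "Acme", 4500, 12, {"watts": 40}),
    (3, "lamp", "Lumen", 9900, 0, {"watts": 60, "dimmable": True}),
]
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?, ?, ?, ?)",
    [(i, t, v, p, q, json.dumps(a)) for i, t, v, p, q, a in products],
)

# Queries on common characteristics work on all products at once.
out_of_stock = [row[0] for row in conn.execute(
    "SELECT id FROM product WHERE qty = 0 ORDER BY id"
)]
max_price = dict(conn.execute(
    "SELECT vendor, MAX(price) FROM product GROUP BY vendor"
))

# Type-specific characteristics come back as one document.
bed_attributes = json.loads(conn.execute(
    "SELECT attributes FROM product WHERE id = 1"
).fetchone()[0])

print(out_of_stock)                # [1, 3]
print(max_price)                   # {'Acme': 30000, 'Lumen': 9900}
print(bed_attributes["material"])  # wood
```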

Antipattern: Entity,Attribute,Value
TABLE entity (id, name)
TABLE attribute (id, entity_id, name)
TABLE value (id, attribute_id, value)
● Each product type is an entity
● Each type’s characteristics are stored in attribute
● Each product is a set of values
Example:
Entity { id: 24, name: "Bed" }
Attribute [ { id: 123, entity_id: 24, name: "material" }, ... ]
Value [ { id: 999, attribute_id: 123, value: "wood" } ]

Antipattern: Entity,Attribute,Value
Problems:
● We JOIN 3 tables every time we want to get a single value!
● All values must be treated as texts
○ Unless we create multiple value tables: int_value, text_value...
○ Which means, even more JOINs
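
Both problems show up as soon as we read one value. A sketch in Python with sqlite3, reusing the Bed example; the width_cm attribute is an assumption added to show the typing issue:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE attribute (
        id        INTEGER PRIMARY KEY,
        entity_id INTEGER NOT NULL,
        name      TEXT NOT NULL
    );
    CREATE TABLE value (
        id           INTEGER PRIMARY KEY,
        attribute_id INTEGER NOT NULL,
        value        TEXT NOT NULL
    );

    INSERT INTO entity (id, name) VALUES (24, 'Bed');
    INSERT INTO attribute (id, entity_id, name) VALUES
        (123, 24, 'material'),
        (124, 24, 'width_cm');
    INSERT INTO value (id, attribute_id, value) VALUES
        (999, 123, 'wood'),
        (1000, 124, '160');
""")


def get_value(entity_name, attribute_name):
    """Fetch a single value: three tables are joined every time."""
    row = conn.execute("""
        SELECT v.value
        FROM entity e
        JOIN attribute a ON a.entity_id = e.id
        JOIN value v ON v.attribute_id = a.id
        WHERE e.name = ? AND a.name = ?
    """, (entity_name, attribute_name)).fetchone()
    return row[0] if row else None


material = get_value("Bed", "material")
width = get_value("Bed", "width_cm")

print(material)     # wood
print(repr(width))  # '160' (a string, even though it is a number)
```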

Names Beyond Comprehension
● I saw the following table names in production:
○ marco2015 # Marco was the table’s creator
○ jan2015 # jan was the month
○ tmp_tmp_tmp_fix
○ tmp_fix_fix_fix # Because symmetry is cool
I forgot many other examples because...
“Ultimate horror often paralyses memory in a merciful way.”
― H.P. Lovecraft

Data in Metadata
● Include data in table names
○ invoice_2020, invoice_2019, invoice_2018…
● Use a year column instead
● If the table is too big, there are other ways to contain the problem
(partitioning)

Bad Names in General
● A name should tell everyone what a table or column is
○ Even to new hires!
○ Even to you… in 5 years from now!
● Otherwise people have to look at other documentation sources
○ ….which typically don’t exist
● Names should follow a standard across all company databases
○ singular/plural, long/short names, ...
● So people don’t have to check what a table or column is called exactly

Thank you for listening!
federico-razzoli.com/services
Telegram channel:
open_source_databases
