Transactional Storage for MySQL
FAST. RELIABLE. PROVEN.
InnoDB Internals: InnoDB File Formats and Source Code Structure
MySQL University, October 2009
Calvin Sun
Principal Engineer
Oracle Corporation
Today’s Topics
• Goals of InnoDB
• Key Functional Characteristics
• InnoDB Design Considerations
• InnoDB Architecture
• InnoDB On Disk Format
• Source Code Structure
• Q & A
Goals of InnoDB
• OLTP oriented
• Performance, Reliability, Scalability
• Data Protection
• Portability
InnoDB Key Functional Characteristics
• Full transaction support
• Row-level locking
• MVCC
• Crash recovery
• Efficient IO
Design Considerations
• Modeled on Gray & Reuter’s “Transaction Processing: Concepts and Techniques”
• Also emulated the Oracle architecture
• Added unique subsystems
  • Doublewrite
  • Insert buffering
  • Adaptive hash index
• Designed to evolve with changing hardware & requirements
InnoDB Architecture
[Diagram: InnoDB architecture layers. Components shown: Server, Applications, Handler API, Embedded InnoDB API, Transaction, Cursor / Row, Lock, B-tree, Mini-transaction, Page, Buffer, File Space Manager, IO.]
InnoDB On Disk Format
• InnoDB Database Files
• InnoDB Tablespaces
• InnoDB Pages / Extents
• InnoDB Rows
• InnoDB Indexes
• InnoDB Logs
• File Format Design Considerations
InnoDB Database Files
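[Diagram: the MySQL data directory holds .frm files and ibdata files. The ibdata files form the system tablespace, which contains the internal data dictionary, undo logs, the insert buffer, and InnoDB tables; with innodb_file_per_table, InnoDB tables are stored in .ibd files instead.]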
InnoDB Tablespaces
• A tablespace consists of multiple files and/or raw disk partitions.
  file_name:file_size[:autoextend[:max:max_file_size]]
• A file/partition is a collection of segments.
• A segment consists of fixed-length pages.
• The page size is always 16KB in uncompressed tablespaces, and 1KB-16KB in compressed tablespaces (for both data and index).
System Tablespace
• Internal Data Dictionary
• Undo
• Insert Buffer
• Doublewrite Buffer
• MySQL Replication Info
InnoDB Tablespaces
[Diagram: a tablespace contains segments (leaf node segment, non-leaf node segment, rollback segment). A segment is made of extents; an extent = 64 pages. A page holds rows, and a row holds field pointers, Trx id, Roll pointer, and Field 1 through Field n.]
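The page and extent sizes above imply some simple arithmetic. The following is a minimal sketch, assuming the uncompressed 16KB page size and a tablespace stored in a single file; the function names are hypothetical and not taken from the InnoDB source.

```c
#include <stdint.h>

/* Sizes from the slides: 16KB uncompressed pages, 64 pages per extent. */
enum { PAGE_SIZE_BYTES = 16 * 1024, PAGES_PER_EXTENT = 64 };

/* Size of one extent in bytes. */
uint64_t extent_size_bytes(void)
{
    return (uint64_t)PAGE_SIZE_BYTES * PAGES_PER_EXTENT;
}

/* Byte offset of a page, assuming the tablespace is a single file. */
uint64_t page_byte_offset(uint32_t page_no)
{
    return (uint64_t)page_no * PAGE_SIZE_BYTES;
}

/* Index of the extent that contains the given page. */
uint32_t extent_of_page(uint32_t page_no)
{
    return page_no / PAGES_PER_EXTENT;
}
```

With these sizes one extent is 1 MB, so page 64 is the first page of the second extent.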
InnoDB Pages
InnoDB Page Types
Symbol                 Value   Notes
FIL_PAGE_INODE         3       File segment inode
FIL_PAGE_INDEX         17855   B-tree node
FIL_PAGE_TYPE_BLOB     10      Uncompressed BLOB page
FIL_PAGE_TYPE_ZBLOB    11      1st compressed BLOB page
FIL_PAGE_TYPE_ZBLOB2   12      Subsequent compressed BLOB page
FIL_PAGE_TYPE_SYS      6       System page
FIL_PAGE_TYPE_TRX_SYS  7       Transaction system page
others                         i-buf bitmap, i-buf free list, file space header, extent descriptor page, newly allocated page
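The page type values listed for this slide can be written as a C enum. This is a sketch that uses only the symbols and values from the slide; the two helper functions are hypothetical and exist only for illustration.

```c
/* Page type codes as listed on the slide. */
enum fil_page_type {
    FIL_PAGE_INODE        = 3,
    FIL_PAGE_TYPE_SYS     = 6,
    FIL_PAGE_TYPE_TRX_SYS = 7,
    FIL_PAGE_TYPE_BLOB    = 10,
    FIL_PAGE_TYPE_ZBLOB   = 11,
    FIL_PAGE_TYPE_ZBLOB2  = 12,
    FIL_PAGE_INDEX        = 17855
};

/* Hypothetical helper: describe a page type code using the slide's notes. */
const char *page_type_name(unsigned type)
{
    switch (type) {
    case FIL_PAGE_INODE:        return "File segment inode";
    case FIL_PAGE_INDEX:        return "B-tree node";
    case FIL_PAGE_TYPE_BLOB:    return "Uncompressed BLOB page";
    case FIL_PAGE_TYPE_ZBLOB:   return "1st compressed BLOB page";
    case FIL_PAGE_TYPE_ZBLOB2:  return "Subsequent compressed BLOB page";
    case FIL_PAGE_TYPE_SYS:     return "System page";
    case FIL_PAGE_TYPE_TRX_SYS: return "Transaction system page";
    default:                    return "other";
    }
}

/* Hypothetical helper: nonzero for any of the three BLOB page types. */
int page_type_is_blob(unsigned type)
{
    return type == FIL_PAGE_TYPE_BLOB
        || type == FIL_PAGE_TYPE_ZBLOB
        || type == FIL_PAGE_TYPE_ZBLOB2;
}
```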
InnoDB Pages
A page consists of a page header, a page trailer, and a page body (rows or other contents).
[Diagram: page layout. The page header comes first and rows fill the page body; the row offset array and the page trailer sit at the end of the page.]
Page Declares
typedef struct /* a space address */
{
    ulint pageno;  /* page number within the file */
    ulint boffset; /* byte offset within the page */
} fil_addr_t;

typedef struct
{
    ulint      checksum;         /* checksum of the page (since 4.0.14) */
    ulint      page_offset;      /* page offset inside space */
    fil_addr_t previous;         /* offset or fil_addr_t */
    fil_addr_t next;             /* offset or fil_addr_t */
    dulint     page_lsn;         /* lsn of the end of the newest
                                    modification log record to the page */
    PAGE_TYPE  page_type;        /* file page type */
    dulint     file_flush_lsn;   /* the file has been flushed to disk
                                    at least up to this lsn */
    int        space_id;         /* space id of the page */
    char       data[];           /* page body; will grow (members are listed in
                                    on-page order, so this is a layout sketch
                                    rather than compilable C) */
    ulint      trailer_lsn;      /* the last 4 bytes of page_lsn */
    ulint      trailer_checksum; /* page checksum, or checksum magic, or 0 */
} PAGE, *PAGE_PTR;
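To make the header fields concrete, here is a minimal sketch that decodes them from a raw page buffer. It rests on assumptions not stated on the slide: that the on-disk header is 38 bytes with the offsets below, that integers are stored big-endian, and that the previous/next links are 4-byte page numbers. The OFF_* names, struct page_header, and parse_page_header are hypothetical.

```c
#include <stdint.h>

/* Assumed byte offsets of the header fields within a page. */
enum {
    OFF_CHECKSUM  = 0,   /* 4 bytes */
    OFF_PAGE_NO   = 4,   /* 4 bytes */
    OFF_PREV      = 8,   /* 4 bytes */
    OFF_NEXT      = 12,  /* 4 bytes */
    OFF_LSN       = 16,  /* 8 bytes */
    OFF_PAGE_TYPE = 24,  /* 2 bytes */
    OFF_FLUSH_LSN = 26,  /* 8 bytes */
    OFF_SPACE_ID  = 34,  /* 4 bytes */
    HEADER_BYTES  = 38
};

struct page_header {
    uint32_t checksum;
    uint32_t page_no;
    uint32_t prev;
    uint32_t next;
    uint64_t lsn;
    uint16_t page_type;
    uint64_t flush_lsn;
    uint32_t space_id;
};

/* Read a big-endian 32-bit integer. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Read a big-endian 64-bit integer as two 32-bit halves. */
static uint64_t read_be64(const uint8_t *p)
{
    return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/* Decode the header; `page` must point to at least HEADER_BYTES bytes. */
void parse_page_header(const uint8_t *page, struct page_header *h)
{
    h->checksum  = read_be32(page + OFF_CHECKSUM);
    h->page_no   = read_be32(page + OFF_PAGE_NO);
    h->prev      = read_be32(page + OFF_PREV);
    h->next      = read_be32(page + OFF_NEXT);
    h->lsn       = read_be64(page + OFF_LSN);
    h->page_type = (uint16_t)((page[OFF_PAGE_TYPE] << 8) | page[OFF_PAGE_TYPE + 1]);
    h->flush_lsn = read_be64(page + OFF_FLUSH_LSN);
    h->space_id  = read_be32(page + OFF_SPACE_ID);
}
```

Reading byte by byte avoids alignment and host-endianness problems that a direct struct cast over the buffer would have.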
InnoDB Compressed Pages
• InnoDB keeps a “modification log” in each page
• Updates & inserts of small records are written to the log without page reconstruction; deletes don’t even require uncompression
• Log also tells InnoDB if the page will compress to fit page size
• When log space runs out, InnoDB uncompresses the page, applies the changes and recompresses the page
[Diagram: compressed page layout. Regions shown: page header, compressed data, modification log, empty space, BLOB pointers, page directory, page trailer.]
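The modification-log behaviour described above can be modelled with a toy counter. This is a sketch only: struct zpage and zpage_log_change are hypothetical names, the log capacity is assumed to stay fixed, and no real compression takes place.

```c
#include <stddef.h>

/* Toy model of a compressed page's modification log. */
struct zpage {
    size_t   log_capacity;    /* bytes available for the modification log */
    size_t   log_used;        /* bytes of the log already filled */
    unsigned recompressions;  /* how often the page had to be rebuilt */
};

/*
 * Record a change of rec_bytes bytes.
 * Returns 1 if the change fit into the modification log, or 0 if the
 * log was full and the page was (notionally) uncompressed, updated,
 * and recompressed, which leaves the log empty again.
 */
int zpage_log_change(struct zpage *p, size_t rec_bytes)
{
    if (p->log_used + rec_bytes <= p->log_capacity) {
        p->log_used += rec_bytes;
        return 1;
    }
    p->recompressions++;
    p->log_used = 0;
    return 0;
}
```

The point of the design is visible in the counters: the expensive recompression happens once per filled log rather than once per change.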
InnoDB Rows
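[Diagram: COMPACT format. The record holds Record hdr, Trx ID, Roll ptr, Fld ptrs, and Field values; a long column keeps a 768-byte prefix in the record, with an overflow-page ptr to the remainder on overflow pages.]
[Diagram: DYNAMIC format. A long column is stored entirely on overflow pages; the record keeps only a 20-byte overflow-page ptr.]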
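The difference between the two row formats comes down to how many bytes of a long column stay in the record. The sketch below assumes the column has already been selected for off-page storage and that the overflow-page pointer is 20 bytes in both formats; the function names are hypothetical.

```c
#include <stddef.h>

enum row_format { ROW_FORMAT_COMPACT, ROW_FORMAT_DYNAMIC };

enum { LOCAL_PREFIX_BYTES = 768, OVERFLOW_PTR_BYTES = 20 };

/* Bytes kept inside the record for one off-page column of col_len bytes. */
size_t in_record_bytes(enum row_format fmt, size_t col_len)
{
    if (fmt == ROW_FORMAT_COMPACT) {
        size_t prefix = col_len < LOCAL_PREFIX_BYTES ? col_len : LOCAL_PREFIX_BYTES;
        return prefix + OVERFLOW_PTR_BYTES;
    }
    return OVERFLOW_PTR_BYTES;  /* DYNAMIC: pointer only */
}

/* Bytes of the same column that end up on overflow pages. */
size_t overflow_bytes(enum row_format fmt, size_t col_len)
{
    if (fmt == ROW_FORMAT_COMPACT) {
        return col_len > LOCAL_PREFIX_BYTES ? col_len - LOCAL_PREFIX_BYTES : 0;
    }
    return col_len;
}
```

For a 10,000-byte column this gives 788 bytes in the record under COMPACT and 20 bytes under DYNAMIC, which is why DYNAMIC fits more rows per page when long columns are common.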
InnoDB Indexes - Primary
● Data rows are stored in the B-tree leaf nodes of a clustered index.
● The B-tree is organized by the primary key or a non-null unique key of the table, if defined; otherwise, an internal column with a 6-byte ROW_ID is added.
[Diagram: Primary Index. The clustered (primary key) index is a B-tree over PK values 001 to nnn. Non-leaf nodes cover the ranges 001 to 500, 500 to 800, and 801 to nnn; leaf nodes cover 001 to 275, 276 to 500, 501 to 630, 631 to 768, 769 to 800, 801 to 949, 950 to xxx, and so on. Each leaf node holds its key values (for example 501 to 630) plus the data for the corresponding rows.]
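The rule for choosing the clustering key can be written as a small decision function. This is a sketch of the rule as stated on the slide; the enum and function names are hypothetical.

```c
/* Which key the clustered index is organized by. */
enum cluster_key {
    CLUSTER_BY_PRIMARY_KEY,
    CLUSTER_BY_NOT_NULL_UNIQUE_KEY,
    CLUSTER_BY_ROW_ID          /* internal 6-byte ROW_ID column */
};

enum { ROW_ID_BYTES = 6 };

/* Apply the slide's rule: primary key, else non-null unique key, else ROW_ID. */
enum cluster_key choose_cluster_key(int has_primary_key, int has_not_null_unique_key)
{
    if (has_primary_key) {
        return CLUSTER_BY_PRIMARY_KEY;
    }
    if (has_not_null_unique_key) {
        return CLUSTER_BY_NOT_NULL_UNIQUE_KEY;
    }
    return CLUSTER_BY_ROW_ID;
}
```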
InnoDB Indexes - Secondary
● Secondary index B-tree leaf nodes contain, for each key value, the primary keys of the corresponding rows, which are used to access the clustered index to obtain the data.
[Diagram: Secondary Index. Two secondary indexes, each over key values A to Z with B-tree leaf nodes containing PKs, point into the clustered (primary key) index, whose B-tree leaf nodes (PK values 001 to nnn) contain the data.]
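A lookup through a secondary index is therefore a two-step search. The sketch below stands in for the two B-trees with sorted arrays and binary search, and assumes unique secondary keys for simplicity; all type and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

struct sec_entry { const char *key; int pk; };    /* secondary leaf: key -> PK */
struct row       { int pk; const char *data; };   /* clustered leaf: PK -> row */

/* Step 1: find the secondary index entry for a key (array sorted by key). */
static const struct sec_entry *find_sec(const struct sec_entry *sec, size_t n,
                                        const char *key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = strcmp(sec[mid].key, key);
        if (c == 0) return &sec[mid];
        if (c < 0) lo = mid + 1; else hi = mid;
    }
    return NULL;
}

/* Step 2: find the row in the clustered index by PK (array sorted by pk). */
static const struct row *find_row(const struct row *rows, size_t n, int pk)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rows[mid].pk == pk) return &rows[mid];
        if (rows[mid].pk < pk) lo = mid + 1; else hi = mid;
    }
    return NULL;
}

/* Secondary key -> PK -> row data; returns NULL if the key is absent. */
const struct row *lookup_by_secondary(const struct sec_entry *sec, size_t nsec,
                                      const struct row *rows, size_t nrows,
                                      const char *key)
{
    const struct sec_entry *e = find_sec(sec, nsec, key);
    return e ? find_row(rows, nrows, e->pk) : NULL;
}
```

The second search is the cost of the design: every secondary-index read that needs full row data pays for a clustered-index traversal as well.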
InnoDB Logging
[Diagram: InnoDB logging. Elements shown: DATA, Log Buffer, Buffer Pool, redo log, rollback, rollback segments, log thread, write thread, log files (Log File #1, Log File #2), ibdata files.]
InnoDB Redo Log
Redo log structure: Space id | PageNo | OpCode | Data
[Diagram: the log laid out from start of log to end of log, with markers for the last checkpoint and the min LSN.]
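The four-field record structure can be illustrated with a toy encoder. This is a sketch under stated assumptions, not InnoDB's actual on-disk log encoding: the fixed field widths, the big-endian byte order, and the explicit 2-byte data length are all choices made here for illustration, and the names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One redo record with the fields named on the slide. */
struct redo_rec {
    uint32_t       space_id;
    uint32_t       page_no;
    uint8_t        opcode;
    const uint8_t *data;
    uint16_t       data_len;   /* illustrative addition: length of data */
};

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/*
 * Serialize as: space_id (4) | page_no (4) | opcode (1) | data_len (2) | data.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
size_t redo_rec_encode(const struct redo_rec *r, uint8_t *buf, size_t cap)
{
    size_t need = 4 + 4 + 1 + 2 + (size_t)r->data_len;
    if (cap < need) {
        return 0;
    }
    put_be32(buf, r->space_id);
    put_be32(buf + 4, r->page_no);
    buf[8]  = r->opcode;
    buf[9]  = (uint8_t)(r->data_len >> 8);
    buf[10] = (uint8_t)(r->data_len & 0xFF);
    if (r->data_len > 0) {
        memcpy(buf + 11, r->data, r->data_len);
    }
    return need;
}
```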
File Format Management
• Built-in InnoDB format: “Antelope”
• New “Barracuda” format enables compression and ROW_FORMAT=DYNAMIC
• Fast index creation and other features do not require the Barracuda file format
• Built-in InnoDB can access “Antelope” databases, but not “Barracuda” databases
• Check file format tag in system tablespace on startup
• Enable a file format with new dynamic parameter innodb_file_format
• Preserves ability to downgrade easily
[Diagram label: .ibd data files (file per table)]
InnoDB File Format Design Considerations
• Durability
  • Logging, doublewrite, checksum
• Performance
  • Insert buffering, table compression
• Efficiency
  • Dynamic row format, table compression
• Compatibility
  • File format management
Source Code Structure
• 31 subdirectories
• Relevant InnoDB source files on file formats
  • Tablespace: fsp0fsp {.c, .ic, .h}
  • Page: page0page, page0zip {.c, .ic, .h}
  • Log: log0log {.c, .ic, .h}
Source Code Subdirectories
• buf
• data
• db
• dict
• dyn
• eval
• fil
• fsp
• fut
• ha
• handler
• ibuf
• include
• lock
• log
• math
• mem
• mtr
• os
• page
• pars
• que
• read
• rem
• row
• srv
• sync
• thr
• trx
• usr
• ut
Summary: Durability, Performance, Compatibility & Efficiency
• InnoDB is the leading transactional storage engine for MySQL.
• InnoDB’s architecture is well suited to modern online transactional applications, as well as embedded applications.
• InnoDB’s file format is designed for high durability, better performance, and ease of management.
Questions & Answers
InnoDB Size Limits
• Max # of tables: 4 G
• Max size of a table: 32 TB
• Columns per table: 1000
• Max row size: n*4 GB
  • 8 kB if stored on the same page
  • n*4 GB with n BLOBs
• Max key length: 3500 bytes
• Maximum tablespace size: 64 TB
• Max # of concurrent trxs: 1023
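The 64 TB tablespace limit matches what 32-bit page numbers and 16KB pages would give. That derivation is an assumption here, since the slide states only the limit; the function name is hypothetical.

```c
#include <stdint.h>

/* Largest tablespace addressable with 32-bit page numbers. */
uint64_t max_tablespace_bytes(uint32_t page_size_bytes)
{
    return ((uint64_t)1 << 32) * page_size_bytes;
}
```

With a 16KB page this is 2^32 * 2^14 = 2^46 bytes, which is 64 TB.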
