DBA Level 400
About me
An independent SQL Consultant
A user of SQL Server from version 2000 onwards, with 12+ years' experience.
Speaker, both at UK user group events and at conferences.
I have a passion for understanding how the database engine works
at a deep level.
Demonstration
“Everything fits in memory, so performance is as good as it will get. It fits in memory, therefore end of story.”
Which Query Has The Lowest Elapsed Time ?
28.98Mb column store vs. 107.30Mb column store
How Well Do The Two Column Stores Scale On Larger Hardware ?
[Chart: elapsed time in ms (0 to 80,000) against degree of parallelism (2 to 24), non-sorted column store vs. sorted column store.]
Column store created on 1,095,600,000 rows
Can We Use All Available CPU Resource ?
[Chart: percentage CPU utilisation (0 to 100) against degree of parallelism (2 to 24), non-sorted vs. sorted.]
Memory access should consume all available CPU cycles ?!?
Looking For Clues
 Why does the query using the column store on pre-sorted data run faster ?
 Why can we not utilise 100% CPU capacity ?
 Let's start with tried and trusted tools and techniques.
Wait Statistics Do Not Help Here !
Stats are for the query run with a DOP of 24, a warm column store object pool, and the column store created on pre-sorted data ( 1,095,600,000 rows ).
CXPACKET waits can be ignored 99.99 % of the time.
Spin Locks Do Not Provide Any Clues Either
Executes in 775 ms with a warm column store object pool.
12 cores x 2.0 GHz x 0.775 s = 18,600,000,000 CPU cycles
Total spins: 293,491
SELECT [CalendarQuarter]
,SUM([Price1])
,SUM([Price2])
,SUM([Price3])
FROM [FactInternetSalesBig] f
JOIN [DimDate] d
ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
OPTION (MAXDOP 24)
Could Query Costs Help Solve The Two Mysteries ?
Assumptions:
The buffer cache is cold.
IO cannot be performed in parallel.
Data in different columns is never correlated ( better in SQL 2014 )
Hash distribution is always uniform.
Etc. . . .
Costings are based on the amount of time it took a developer's machine, a Dell OptiPlex ( according to legend ), to complete certain operations.
Our View Of Database Engine Resource Usage Is Based On . . .
wait stats, perfmon counters, extended
events and dynamic management views.
We need to know and understand:
 Where all our CPU cycles are going.
 How the database engine utilises the CPU at a deep architectural level.
“i Series” CPU Architecture
[Diagram: CPU die with multiple cores on a bi-directional ring bus. Per core: L1 instruction cache ( 32KB ), L1 data cache ( 32KB ), L0 UOP cache, L2 unified cache ( 256K ). Shared ‘un-core’: L3 cache, memory controller, QPI links, power and clock.]
 A system-on-chip ( SoC ) design with CPU cores as the basic building block.
 Utility services provisioned by the ‘un-core’ part of the CPU die.
 Three level cache hierarchy ( four for Sandy Bridge+ ).
The CPU Cache Hierarchy Latencies In CPU Cycles
L1 cache sequential access: 4
L1 cache in-page random access: 4
L1 cache full random access: 4
L2 cache sequential access: 11
L2 cache in-page random access: 11
L2 cache full random access: 11
L3 cache sequential access: 14
L3 cache in-page random access: 18
L3 cache full random access: 38
Main memory: 167
Memory Access Can Become More Costly With NUMA
[Diagram: two NUMA nodes, each with four cores ( per-core L1 and L2 caches, shared L3 ). Memory access within a node is local; access to the other node's memory is remote.]
NUMA Node Remote Memory Access Latency
An additional 20% overhead when accessing ‘foreign’ memory ! ( from coreinfo )
Local Vs Remote Memory Access and Thread Locality
How does SQLOS schedule hyper-threads in relation to physical cores ?
[Diagram: two CPU sockets, 6 cores per socket ( cores 0 to 5 ).]
Main Memory Is Holding The CPU Back, Solutions . . .
 Leverage the pre-fetcher as much as possible.
 Larger CPU caches
 L4 cache => Crystalwell eDRAM
 DDR4 memory
 Bypass main memory
 Stacked memory
 Hybrid Memory Cube ( Intel )
 High Bandwidth Memory ( AMD )
What The Pre-Fetcher Loves
A sequential scan of a column store index.
What The Pre-Fetcher Hates !
Random access into a hash table. Can this be improved ?
Making Use Of CPU Stalls With Hyper-Threading ( Core i series onwards )
Session 1 performs an index seek; the page is not found in the CPU cache. A CPU stall takes place ( 160+ clock cycles ) whilst the page is retrieved from memory. The ‘dead’ CPU stall cycles give the physical core the opportunity to run a 2nd ( hyper ) thread.
Back To Our Mystery, A Not So Well Documented Tool To The Rescue !
Stack Walking The Database Engine
SELECT p.EnglishProductName
,SUM([OrderQuantity])
,SUM([UnitPrice])
,SUM([ExtendedAmount])
,SUM([UnitPriceDiscountPct])
,SUM([DiscountAmount])
,SUM([ProductStandardCost])
,SUM([TotalProductCost])
,SUM([SalesAmount])
,SUM([TaxAmt])
,SUM([Freight])
FROM [dbo].[FactInternetSales] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
xperf -on base -stackwalk profile
xperf -d stackwalk.etl
Open stackwalk.etl in WPA.
Call Stack For Query Against Column Store On Non-Pre-Sorted Data
Hash aggregate lookup weight: 65,329.87
Column store scan weight: 28,488.73
Where Is The Bottleneck In The Plan ?
The stack trace indicates that the bottleneck is right here:
Call Stack For Query Against Column Store On Pre-Sorted Data
Hash aggregate lookup weight:
 now 275.00
 before 65,329.87
Column store scan weight:
 now 45,764.07
 before 28,488.73
Does The OrderDateKey Column Fit In The L3 Cache ?
Table Name                  Column Name   Size (Mb)
FactInternetSalesBigNoSort  OrderDateKey  1,786,182
                            Price1            3,871
                            Price2            3,871
                            Price3            3,871
FactInternetSalesBigSorted  OrderDateKey        738
                            Price1        2,965,127
                            Price2        2,965,127
                            Price3        2,965,127
No, the L3 cache is 20Mb in size.
The Case Of The Two Column Store Index Sizes: Conclusion
Turning the memory access on
the hash aggregate table from random
to sequential probes
=
CPU savings > cost of scanning an
enlarged column store
Batch Mode Hash Join / Aggregate Performance and Skew
Row Mode:
 Expensive to repartition inputs.
 Data skew reduces parallelism.
[Diagram: build and probe inputs flow through exchange operators into the hash join, repartitioned across per-thread hash table partitions B1 … Bn.]
Batch Mode:
 No repartitioning.
 Data skew speeds up processing.
[Diagram: all build and probe threads work against a single shared hash table.]
The Case Of The 60% CPU Utilisation Ceiling
If CPU capacity or IO bandwidth
cannot be fully consumed, some
form of contention must be
present . . .
Batch Engine Call Stack
Throttling !
The Case Of The 60% CPU Utilisation Ceiling: Conclusion
The hash aggregate cannot
keep up with the column store
scan.
The batch engine therefore throttles the column store scan by making sleep system calls !
The integration services engine does something very similar, known as data flow engine “back pressure”.
The Case Of The Two Column Stores
The hash aggregate using the column store created on pre-sorted data is very CPU efficient.
Why ?
Introducing Intel VTune Amplifier XE
Investigating events at the CPU cache,
clock cycle and instruction level requires
software outside the standard Windows
and SQL Server tool set.
Refer to Appendix D for an overview of what “General exploration”
provides.
This Is What The CPU Stall Picture Looks Like Against DOP
LLC ( last level cache ) misses by degree of parallelism:
DOP    Non-sorted        Sorted
 2        13,200,924     3,000,210
 4        30,902,163     1,200,084
 6       161,411,298    16,203,164
 8     1,835,828,499    29,102,037
10     2,069,544,858    34,802,436
12     4,580,720,628    35,102,457
14     2,796,495,741    48,903,413
16     3,080,615,628    64,204,494
18     3,950,376,507    63,004,410
20     4,419,593,391    85,205,964
22     4,952,446,647    68,404,788
24     5,311,271,763    72,605,082
Which Memory Your Data Is In Matters ! Locality Matters !
Where is my data ? [Diagram: in the L1 data cache ? the L2 unified cache ? the L3 cache ? Hopefully not out across the memory bus in main memory ?!?]
The Case of The CPU Pressure Point
Where are the pressure points on the CPU and what can be done to resolve them ?
CPU Pipeline Architecture
[Diagram: CPU pipeline running from allocation at the front end through to retirement at the back end.]
 A ‘pipeline’ of logical slots runs through the processor.
 The front end can issue four micro-ops per clock cycle.
 The back end can retire up to four micro-ops per clock cycle.
Pipeline ‘Bubbles’ Are Bad !
[Diagram: pipeline with empty ‘bubble’ slots between allocation and retirement.]
 Empty slots are referred to as ‘bubbles’.
 Causes of front end bubbles:
 Bad speculation.
 CPU stalls.
 Data dependencies, e.g.
A = B + C
E = A + D
 Back end bubbles can be due to excessive demand for specific types of execution unit.
Making Efficient Use Of The CPU In The “In Memory” World
Frontend pressure: the front end issues < 4 uops per cycle whilst the back end is ready to accept uops ( CPU stalls, bad speculation, data dependencies ).
Backend pressure: retirement is throttled due to pressure on back end resources ( port saturation ).
Lots Of KPIs To Choose From, Which To Select ?
 CPU cycles per retired instruction ( CPI ): this should ideally be 0.25; anything approaching 1.0 is bad.
 Front end bound: the front end under-supplying the back end with work ( lower values are better ).
 Back end bound: the back end cannot accept work from the front end because there is excessive demand for specific execution units ( lower values are better ).
These Are The Pressure Point Statistics For The ‘Sorted’ Column Store
[Chart: CPI, front end bound and back end bound KPI values (0 to 0.8) against degree of parallelism (2 to 24) for the ‘sorted’ column store.]
Refer to Appendix C for the formulae from
which these metrics are derived.
The Backend Of The CPU Is Now The Bottleneck For The Batch mode Engine
[Diagram: the front end is keeping up; back end performance is now the constraint.]
Can we help the back end keep up with the front end ?
Single Instruction Multiple Data ( SIMD )
 A class of CPU instruction that can process multiple data points simultaneously.
 A form of vectorised processing.
 Once CPU stalls are minimised, the challenge becomes processing data on the CPU ( rate of instruction retirement ) as fast as possible.
Lowering Clock Cycles Per Instruction By Leveraging SIMD
Using conventional processing, adding together two arrays of four elements, A(1..4) + B(1..4) = C(1..4), requires four separate add instructions.
Lowering Clock Cycles Per Instruction By Leveraging SIMD
Using SIMD ( “single instruction, multiple data” ) commands, the same addition, A(1..4) + B(1..4) = C(1..4), can be performed with a single instruction.
Does The SQL Server Database Engine Leverage SIMD Instructions ?
 VTune Amplifier does not provide the option to pick out streaming SIMD extension ( SSE ) integer events.
 However, for a floating point hash aggregate we would hope to see floating point AVX instructions in use.
SQL Server engine batch mode and CPU architectures
chris1adkin@yahoo.co.uk
http://uk.linkedin.com/in/wollatondba
ChrisAdkin8
Appendix A: Instruction Execution And The CPU Front / Back Ends
[Diagram: front end: branch predict, cache, fetch, decode, decoded instruction buffer. Back end: execute, reorder and retire.]
Appendix B - The CPU Front / Back Ends In Detail
[Diagram: the front end and back end pipeline stages in detail.]
Appendix C - CPU Pressure Points, Important Calculations
 Front end bound ( smaller is better )
 IDQ_UOPS_NOT_DELIVERED.CORE / (4 * Clock ticks)
 Bad speculation
 (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRED_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * Clock ticks)
 Retiring
 UOPS_RETIRED.RETIRED_SLOTS / (4 * Clock ticks)
 Back end bound ( ideally, should = 1 - Retiring )
 1 - (Front end bound + Bad speculation + Retiring)
Appendix D - VTune Amplifier General Exploration
An illustration of what the “General exploration” analysis capability of the tool provides.
Ad

Recommended

Sql server engine cpu cache as the new ram
Sql server engine cpu cache as the new ram
Chris Adkin
 
Scaling out SSIS with Parallelism, Diving Deep Into The Dataflow Engine
Scaling out SSIS with Parallelism, Diving Deep Into The Dataflow Engine
Chris Adkin
 
An introduction to column store indexes and batch mode
An introduction to column store indexes and batch mode
Chris Adkin
 
Building scalable application with sql server
Building scalable application with sql server
Chris Adkin
 
Sql server scalability fundamentals
Sql server scalability fundamentals
Chris Adkin
 
Leveraging memory in sql server
Leveraging memory in sql server
Chris Adkin
 
Column store indexes and batch processing mode (nx power lite)
Column store indexes and batch processing mode (nx power lite)
Chris Adkin
 
Super scaling singleton inserts
Super scaling singleton inserts
Chris Adkin
 
Scaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insert
Chris Adkin
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
Tanel Poder
 
Oracle in-Memory Column Store for BI
Oracle in-Memory Column Store for BI
Franck Pachot
 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
BertrandDrouvot
 
Testing Delphix: easy data virtualization
Testing Delphix: easy data virtualization
Franck Pachot
 
Christo kutrovsky oracle, memory & linux
Christo kutrovsky oracle, memory & linux
Kyle Hailey
 
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
Equnix Business Solutions
 
PostgreSQL and RAM usage
PostgreSQL and RAM usage
Alexey Bashtanov
 
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
Equnix Business Solutions
 
Really Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DW
PostgreSQL Experts, Inc.
 
Exadata X3 in action: Measuring Smart Scan efficiency with AWR
Exadata X3 in action: Measuring Smart Scan efficiency with AWR
Franck Pachot
 
Dbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easy
Franck Pachot
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
PGConf APAC
 
Oracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known Features
Tanel Poder
 
Hug Hbase Presentation.
Hug Hbase Presentation.
Jack Levin
 
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
Franck Pachot
 
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Altinity Ltd
 
Introduction to Vacuum Freezing and XID
Introduction to Vacuum Freezing and XID
PGConf APAC
 
PostgreSQL 9.6 Performance-Scalability Improvements
PostgreSQL 9.6 Performance-Scalability Improvements
PGConf APAC
 
IMCSummit 2015 - Day 1 Developer Session - The Science and Engineering Behind...
IMCSummit 2015 - Day 1 Developer Session - The Science and Engineering Behind...
In-Memory Computing Summit
 
PARALLEL DATABASE SYSTEM in Computer Science.pptx
PARALLEL DATABASE SYSTEM in Computer Science.pptx
Sisodetrupti
 

More Related Content

What's hot (20)

Scaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insert
Chris Adkin
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
Tanel Poder
 
Oracle in-Memory Column Store for BI
Oracle in-Memory Column Store for BI
Franck Pachot
 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
BertrandDrouvot
 
Testing Delphix: easy data virtualization
Testing Delphix: easy data virtualization
Franck Pachot
 
Christo kutrovsky oracle, memory & linux
Christo kutrovsky oracle, memory & linux
Kyle Hailey
 
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
Equnix Business Solutions
 
PostgreSQL and RAM usage
PostgreSQL and RAM usage
Alexey Bashtanov
 
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
Equnix Business Solutions
 
Really Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DW
PostgreSQL Experts, Inc.
 
Exadata X3 in action: Measuring Smart Scan efficiency with AWR
Exadata X3 in action: Measuring Smart Scan efficiency with AWR
Franck Pachot
 
Dbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easy
Franck Pachot
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
PGConf APAC
 
Oracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known Features
Tanel Poder
 
Hug Hbase Presentation.
Hug Hbase Presentation.
Jack Levin
 
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
Franck Pachot
 
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Altinity Ltd
 
Introduction to Vacuum Freezing and XID
Introduction to Vacuum Freezing and XID
PGConf APAC
 
PostgreSQL 9.6 Performance-Scalability Improvements
PostgreSQL 9.6 Performance-Scalability Improvements
PGConf APAC
 
Scaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insert
Chris Adkin
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
Tanel Poder
 
Oracle in-Memory Column Store for BI
Oracle in-Memory Column Store for BI
Franck Pachot
 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
BertrandDrouvot
 
Testing Delphix: easy data virtualization
Testing Delphix: easy data virtualization
Franck Pachot
 
Christo kutrovsky oracle, memory & linux
Christo kutrovsky oracle, memory & linux
Kyle Hailey
 
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
Equnix Business Solutions
 
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
Equnix Business Solutions
 
Exadata X3 in action: Measuring Smart Scan efficiency with AWR
Exadata X3 in action: Measuring Smart Scan efficiency with AWR
Franck Pachot
 
Dbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easy
Franck Pachot
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
PGConf APAC
 
Oracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known Features
Tanel Poder
 
Hug Hbase Presentation.
Hug Hbase Presentation.
Jack Levin
 
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
Franck Pachot
 
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Altinity Ltd
 
Introduction to Vacuum Freezing and XID
Introduction to Vacuum Freezing and XID
PGConf APAC
 
PostgreSQL 9.6 Performance-Scalability Improvements
PostgreSQL 9.6 Performance-Scalability Improvements
PGConf APAC
 

Similar to Sql sever engine batch mode and cpu architectures (20)

IMCSummit 2015 - Day 1 Developer Session - The Science and Engineering Behind...
IMCSummit 2015 - Day 1 Developer Session - The Science and Engineering Behind...
In-Memory Computing Summit
 
PARALLEL DATABASE SYSTEM in Computer Science.pptx
PARALLEL DATABASE SYSTEM in Computer Science.pptx
Sisodetrupti
 
Oracle ebs capacity_analysisusingstatisticalmethods
Oracle ebs capacity_analysisusingstatisticalmethods
Ajith Narayanan
 
In-memory Data Management Trends & Techniques
In-memory Data Management Trends & Techniques
Hazelcast
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder
 
In-Memory Databases, Trends and Technologies (2012)
In-Memory Databases, Trends and Technologies (2012)
Vilho Raatikka
 
Investigate SQL Server Memory Like Sherlock Holmes
Investigate SQL Server Memory Like Sherlock Holmes
Richard Douglas
 
Extra performance out of thin air
Extra performance out of thin air
Konstantine Krutiy
 
Best storage engine for MySQL
Best storage engine for MySQL
tomflemingh2
 
Tips and Tricks for SAP Sybase IQ
Tips and Tricks for SAP Sybase IQ
Don Brizendine
 
Database story by DevOps
Database story by DevOps
Anton Martynenko
 
SQL Server 2014 for Developers (Cristian Lefter)
SQL Server 2014 for Developers (Cristian Lefter)
ITCamp
 
Dba lounge-sql server-performance-troubleshooting
Dba lounge-sql server-performance-troubleshooting
Dan Andrei Stefan
 
Hekaton introduction for .Net developers
Hekaton introduction for .Net developers
Shy Engelberg
 
JSSUG: SQL Sever Performance Tuning
JSSUG: SQL Sever Performance Tuning
Kenichiro Nakamura
 
SQL Server Performance Analysis
SQL Server Performance Analysis
Eduardo Castro
 
Ajuste (tuning) del rendimiento de SQL Server 2008
Ajuste (tuning) del rendimiento de SQL Server 2008
Eduardo Castro
 
PASS Summit 2009 Keynote Dave DeWitt
PASS Summit 2009 Keynote Dave DeWitt
Mark Ginnebaugh
 
5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance
Command Prompt., Inc
 
Five steps perform_2009 (1)
Five steps perform_2009 (1)
PostgreSQL Experts, Inc.
 
IMCSummit 2015 - Day 1 Developer Session - The Science and Engineering Behind...
IMCSummit 2015 - Day 1 Developer Session - The Science and Engineering Behind...
In-Memory Computing Summit
 
PARALLEL DATABASE SYSTEM in Computer Science.pptx
PARALLEL DATABASE SYSTEM in Computer Science.pptx
Sisodetrupti
 
Oracle ebs capacity_analysisusingstatisticalmethods
Oracle ebs capacity_analysisusingstatisticalmethods
Ajith Narayanan
 
In-memory Data Management Trends & Techniques
In-memory Data Management Trends & Techniques
Hazelcast
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder
 
In-Memory Databases, Trends and Technologies (2012)
In-Memory Databases, Trends and Technologies (2012)
Vilho Raatikka
 
Investigate SQL Server Memory Like Sherlock Holmes
Investigate SQL Server Memory Like Sherlock Holmes
Richard Douglas
 
Extra performance out of thin air
Extra performance out of thin air
Konstantine Krutiy
 
Best storage engine for MySQL
Best storage engine for MySQL
tomflemingh2
 
Tips and Tricks for SAP Sybase IQ
Tips and Tricks for SAP Sybase IQ
Don Brizendine
 
SQL Server 2014 for Developers (Cristian Lefter)
SQL Server 2014 for Developers (Cristian Lefter)
ITCamp
 
Dba lounge-sql server-performance-troubleshooting
Dba lounge-sql server-performance-troubleshooting
Dan Andrei Stefan
 
Hekaton introduction for .Net developers
Hekaton introduction for .Net developers
Shy Engelberg
 
JSSUG: SQL Sever Performance Tuning
JSSUG: SQL Sever Performance Tuning
Kenichiro Nakamura
 
SQL Server Performance Analysis
SQL Server Performance Analysis
Eduardo Castro
 
Ajuste (tuning) del rendimiento de SQL Server 2008
Ajuste (tuning) del rendimiento de SQL Server 2008
Eduardo Castro
 
PASS Summit 2009 Keynote Dave DeWitt
PASS Summit 2009 Keynote Dave DeWitt
Mark Ginnebaugh
 
Ad

More from Chris Adkin (9)

Bdc from bare metal to k8s
Bdc from bare metal to k8s
Chris Adkin
 
Data weekender deploying prod grade sql 2019 big data clusters
Data weekender deploying prod grade sql 2019 big data clusters
Chris Adkin
 
Data relay introduction to big data clusters
Data relay introduction to big data clusters
Chris Adkin
 
Ci with jenkins docker and mssql belgium
Ci with jenkins docker and mssql belgium
Chris Adkin
 
Continuous Integration With Jenkins Docker SQL Server
Continuous Integration With Jenkins Docker SQL Server
Chris Adkin
 
TSQL Coding Guidelines
TSQL Coding Guidelines
Chris Adkin
 
J2EE Performance And Scalability Bp
J2EE Performance And Scalability Bp
Chris Adkin
 
J2EE Batch Processing
J2EE Batch Processing
Chris Adkin
 
Oracle Sql Tuning
Oracle Sql Tuning
Chris Adkin
 
Bdc from bare metal to k8s
Bdc from bare metal to k8s
Chris Adkin
 
Data weekender deploying prod grade sql 2019 big data clusters
Data weekender deploying prod grade sql 2019 big data clusters
Chris Adkin
 
Data relay introduction to big data clusters
Data relay introduction to big data clusters
Chris Adkin
 
Ci with jenkins docker and mssql belgium
Ci with jenkins docker and mssql belgium
Chris Adkin
 
Continuous Integration With Jenkins Docker SQL Server
Continuous Integration With Jenkins Docker SQL Server
Chris Adkin
 
TSQL Coding Guidelines
TSQL Coding Guidelines
Chris Adkin
 
J2EE Performance And Scalability Bp
J2EE Performance And Scalability Bp
Chris Adkin
 
J2EE Batch Processing
J2EE Batch Processing
Chris Adkin
 
Oracle Sql Tuning
Oracle Sql Tuning
Chris Adkin
 
Ad

Recently uploaded (20)

Advanced_English_Pronunciation_in_Use.pdf
Advanced_English_Pronunciation_in_Use.pdf
leogoemmanguyenthao
 
Indigo_Airlines_Strategy_Presentation.pptx
Indigo_Airlines_Strategy_Presentation.pptx
mukeshpurohit991
 
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
Taqyea
 
MRI Pulse Sequence in radiology physics.pptx
MRI Pulse Sequence in radiology physics.pptx
BelaynehBishaw
 
REGRESSION DIAGNOSTIC II: HETEROSCEDASTICITY
REGRESSION DIAGNOSTIC II: HETEROSCEDASTICITY
Ameya Patekar
 
定制OCAD学生卡加拿大安大略艺术与设计大学成绩单范本,OCAD成绩单复刻
定制OCAD学生卡加拿大安大略艺术与设计大学成绩单范本,OCAD成绩单复刻
taqyed
 
THE LINEAR REGRESSION MODEL: AN OVERVIEW
THE LINEAR REGRESSION MODEL: AN OVERVIEW
Ameya Patekar
 
LECTURE_skakkakwowowkwkkwkskwkqowoowoaoaoa.cooos
LECTURE_skakkakwowowkwkkwkskwkqowoowoaoaoa.cooos
ssuseraf13da
 
Measurecamp Copenhagen - Consent Context
Measurecamp Copenhagen - Consent Context
Human37
 
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
Ameya Patekar
 
Residential Zone 4 for industrial village
Residential Zone 4 for industrial village
MdYasinArafat13
 
LECTURE_2skkkkskskskskksksksosoowowowowkwkw.ccoo
LECTURE_2skkkkskskskskksksksosoowowowowkwkw.ccoo
ssuseraf13da
 
Pause Travail 22 Hostiou Girard 12 juin 2025.pdf
Pause Travail 22 Hostiou Girard 12 juin 2025.pdf
Institut de l'Elevage - Idele
 
Verweven van EM Legacy en OTL-data bij AWV
Verweven van EM Legacy en OTL-data bij AWV
jacoba18
 
Team_Mercury.pdf hai kya hai kya hai kya hai kya hai kya
Team_Mercury.pdf hai kya hai kya hai kya hai kya hai kya
genadit49
 
Power BI API Connectors - Best Practices for Scalable Data Connections
Power BI API Connectors - Best Practices for Scalable Data Connections
Vidicorp Ltd
 
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
taqyea
 
Boost Business Efficiency with Professional Data Entry Services
Boost Business Efficiency with Professional Data Entry Services
eloiacs eloiacs
 
最新版美国史蒂文斯理工学院毕业证(SIT毕业证书)原版定制
最新版美国史蒂文斯理工学院毕业证(SIT毕业证书)原版定制
Taqyea
 
llm_presentation and deep learning methods
llm_presentation and deep learning methods
sayedabdussalam11
 
Advanced_English_Pronunciation_in_Use.pdf
Advanced_English_Pronunciation_in_Use.pdf
leogoemmanguyenthao
 
Indigo_Airlines_Strategy_Presentation.pptx
Indigo_Airlines_Strategy_Presentation.pptx
mukeshpurohit991
 
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
Taqyea
 
MRI Pulse Sequence in radiology physics.pptx
MRI Pulse Sequence in radiology physics.pptx
BelaynehBishaw
 
REGRESSION DIAGNOSTIC II: HETEROSCEDASTICITY
REGRESSION DIAGNOSTIC II: HETEROSCEDASTICITY
Ameya Patekar
 
定制OCAD学生卡加拿大安大略艺术与设计大学成绩单范本,OCAD成绩单复刻
定制OCAD学生卡加拿大安大略艺术与设计大学成绩单范本,OCAD成绩单复刻
taqyed
 
THE LINEAR REGRESSION MODEL: AN OVERVIEW
THE LINEAR REGRESSION MODEL: AN OVERVIEW
Ameya Patekar
 
LECTURE_skakkakwowowkwkkwkskwkqowoowoaoaoa.cooos
LECTURE_skakkakwowowkwkkwkskwkqowoowoaoaoa.cooos
ssuseraf13da
 
Measurecamp Copenhagen - Consent Context
Measurecamp Copenhagen - Consent Context
Human37
 
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
Ameya Patekar
 
Residential Zone 4 for industrial village
Residential Zone 4 for industrial village
MdYasinArafat13
 
LECTURE_2skkkkskskskskksksksosoowowowowkwkw.ccoo
LECTURE_2skkkkskskskskksksksosoowowowowkwkw.ccoo
ssuseraf13da
 
Verweven van EM Legacy en OTL-data bij AWV
Verweven van EM Legacy en OTL-data bij AWV
jacoba18
 
Team_Mercury.pdf hai kya hai kya hai kya hai kya hai kya
Team_Mercury.pdf hai kya hai kya hai kya hai kya hai kya
genadit49
 
Power BI API Connectors - Best Practices for Scalable Data Connections
Power BI API Connectors - Best Practices for Scalable Data Connections
Vidicorp Ltd
 
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
taqyea
 
Boost Business Efficiency with Professional Data Entry Services
Boost Business Efficiency with Professional Data Entry Services
eloiacs eloiacs
 
最新版美国史蒂文斯理工学院毕业证(SIT毕业证书)原版定制
最新版美国史蒂文斯理工学院毕业证(SIT毕业证书)原版定制
Taqyea
 
llm_presentation and deep learning methods
llm_presentation and deep learning methods
sayedabdussalam11
 

Sql sever engine batch mode and cpu architectures

  • 2. About me An independent SQL Consultant A user of SQL Server from version 2000 onwards with 12+ years experience. Speaker, both at UK user group events and at conferences. I have a passion for understanding how the database engine works at a deep level.
  • 3. Demonstration “Everything fits in memory, so performance is as good as it will get. It fits in memory therefore end of story”
  • 4. Which Query Has The Lowest Elapsed Time ? 28.98Mb column store Vs. 107.30Mb column store
  • 5. How Well Do The Two Column Stores Scale On Larger Hardware ? 0 10000 20000 30000 40000 50000 60000 70000 80000 2 4 6 8 10 12 14 16 18 20 22 24 Time(ms) Degree of Parallelism Non-sorted column store Sorted column store Column store created on 1095600000 rows
  • 6. Can We Use All Available CPU Resource ? 0 10 20 30 40 50 60 70 80 90 100 2 4 6 8 10 12 14 16 18 20 22 24 PercentageCPUUtilization Degree of Parallelism Non-sorted Sorted Memory access should consume all available CPU cycles ?!?
  • 7. Looking For Clues  Why does the query using the column store on pre-sorted data run faster ?  Why can we not utilise 100% CPU capacity ?  Lets start with tried and trusted tools and techniques.
  • 8. Wait Statistics Do Not Help Here ! Stats are for the query ran with a DOP of 24, a warm column store object pool and the column store created on pre-sorted data (1095600000 rows ). CXPACKET waits can be ignored 99.99 % of the time.
  • 9. Spin Locks Do Not Provide Any Clues Either Executes in 775 ms for a warm column store object pool 12 cores x 2.0 Ghz x 0.775 = 1,860,000,000 CPU cycles Total spins 293,491 SELECT [CalendarQuarter] ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3]) FROM [FactInternetSalesBig] f JOIN [DimDate] d ON f.OrderDateKey = d.DateKey GROUP BY CalendarQuarter OPTION (MAXDOP 24)
  • 10. Could Query Costs Help Solve The Two Mysteries? Assumptions: The buffer cache is cold. IO cannot be performed in parallel. Data in different columns is never correlated (better in SQL Server 2014). Hash distribution is always uniform. Etc. Costings are based on the amount of time it took a developer's machine, a Dell OptiPlex (according to legend), to complete certain operations.
  • 11. Our View Of Database Engine Resource Usage Is Based On . . . wait stats, perfmon counters, extended events and dynamic management views. We need to know and understand: where all our CPU cycles are going, and how the database engine utilises the CPU at a deep architectural level.
  • 12. Core "i Series" CPU Architecture [Diagram: each core has a 32 KB L1 instruction cache, an L0 uop cache, a 32 KB L1 data cache and a 256 KB unified L2 cache; the cores, shared L3 cache, memory controller, QPI links and power/clock units are connected by a bi-directional ring bus.] A system-on-chip (SoC) design with CPU cores as the basic building block. Utility services are provisioned by the 'un-core' part of the CPU die. A three-level cache hierarchy (four for Sandy Bridge onwards).
  • 13. The CPU Cache Hierarchy: Latencies In CPU Cycles
        L1 cache, sequential access          4
        L1 cache, in-page random access      4
        L1 cache, full random access         4
        L2 cache, sequential access         11
        L2 cache, in-page random access     11
        L2 cache, full random access        11
        L3 cache, sequential access         14
        L3 cache, in-page random access     18
        L3 cache, full random access        38
        Main memory                        167
  • 14. Memory Access Can Become More Costly With NUMA [Diagram: two NUMA nodes, each with four cores (per-core L1 and L2 caches) sharing an L3 cache; local memory access stays within a node, remote memory access crosses to the other node.]
  • 15. NUMA Node Remote Memory Access Latency An additional 20% overhead when accessing 'foreign' memory! (from coreinfo)
  • 16. Local vs Remote Memory Access and Thread Locality How does SQLOS schedule hyper-threads in relation to physical cores? (6 cores per socket)
  • 17. Main Memory Is Holding The CPU Back, Solutions . . . Leverage the pre-fetcher as much as possible. Larger CPU caches: L4 cache (Crystal Well eDRAM). DDR4 memory. Bypass main memory with stacked memory: hybrid memory cubes (Intel) and high bandwidth memory (AMD).
  • 18. What The Pre-Fetcher Loves: a sequential scan of a column store index.
  • 19. What The Pre-Fetcher Hates: random probes into a hash table. Can this be improved?
  • 20. Making Use Of CPU Stalls With Hyper-Threading (Core i series onwards) Session 1 performs an index seek; the page is not found in the CPU cache. A CPU stall takes place (160+ clock cycles) whilst the page is retrieved from memory. The 'dead' CPU stall cycles give the physical core the opportunity to run a 2nd (hyper) thread.
  • 21. Back To Our Mystery, A Not So Well Documented Tool To The Rescue! Stack Walking The Database Engine: xperf -on base -stackwalk profile; xperf -d stackwalk.etl; WPA. SELECT p.EnglishProductName ,SUM([OrderQuantity]) ,SUM([UnitPrice]) ,SUM([ExtendedAmount]) ,SUM([UnitPriceDiscountPct]) ,SUM([DiscountAmount]) ,SUM([ProductStandardCost]) ,SUM([TotalProductCost]) ,SUM([SalesAmount]) ,SUM([TaxAmt]) ,SUM([Freight]) FROM [dbo].[FactInternetSales] f JOIN [dbo].[DimProduct] p ON f.ProductKey = p.ProductKey GROUP BY p.EnglishProductName
  • 22. Call Stack For Query Against Column Store On Non-Pre-Sorted Data Hash agg lookup weight: 65,329.87. Column store scan weight: 28,488.73.
  • 23. Where Is The Bottleneck In The Plan? [Plan diagram showing control flow and data flow.] The stack trace indicates that the bottleneck is the hash aggregate lookup.
  • 24. Call Stack For Query Against Column Store On Pre-Sorted Data Hash agg lookup weight: now 275.00, before 65,329.87. Column store scan weight: now 45,764.07, before 28,488.73.
  • 25. Does The OrderDateKey Column Fit In The L3 Cache?
        Table Name                  Column Name   Size (MB)
        FactInternetSalesBigNoSort  OrderDateKey  1786182
                                    Price1        3871
                                    Price2        3871
                                    Price3        3871
        FactInternetSalesBigSorted  OrderDateKey  738
                                    Price1        2965127
                                    Price2        2965127
                                    Price3        2965127
        No, the L3 cache is 20 MB in size.
  • 26. The Case Of The Two Column Store Index Sizes: Conclusion Turning the memory access on the hash aggregate table from random to sequential probes yields CPU savings greater than the cost of scanning an enlarged column store.
  • 27. Batch Mode Hash Join / Aggregate Performance and Skew [Diagram comparing the two execution modes.] Row mode: the build and probe inputs are repartitioned across threads via exchanges; it is expensive to repartition inputs and data skew reduces parallelism. Batch mode: all threads build into and probe a shared hash table; there is no repartitioning and data skew speeds up processing.
  • 28. The Case Of The 60% CPU Utilisation Ceiling If CPU capacity or IO bandwidth cannot be fully consumed, some form of contention must be present . . .
  • 29. Batch Engine Call Stack Throttling !
  • 30. The Case Of The 60% CPU Utilisation Ceiling: Conclusion The hash aggregate cannot keep up with the column store scan. The batch engine therefore throttles the column store scan by issuing sleep system calls! The integration services engine does something very similar, known as data flow engine "back pressure".
  • 31. The Case Of The Two Column Stores The hash aggregate using the column store created on pre-sorted data is very CPU efficient. Why?
  • 32. Introducing Intel VTune Amplifier XE Investigating events at the CPU cache, clock cycle and instruction level requires software outside the standard Windows and SQL Server tool set. Refer to Appendix D for an overview of what “General exploration” provides.
  • 33. This Is What The CPU Stall Picture Looks Like: last level cache (LLC) misses by degree of parallelism
        DOP   Non-sorted       Sorted
        2     13,200,924       3,000,210
        4     30,902,163       1,200,084
        6     161,411,298      16,203,164
        8     1,835,828,499    29,102,037
        10    2,069,544,858    34,802,436
        12    4,580,720,628    35,102,457
        14    2,796,495,741    48,903,413
        16    3,080,615,628    64,204,494
        18    3,950,376,507    63,004,410
        20    4,419,593,391    85,205,964
        22    4,952,446,647    68,404,788
        24    5,311,271,763    72,605,082
  • 34. Which Memory Your Data Is In Matters! Locality Matters! [Diagram: where is my data? The L1 data cache? The L2 unified cache? The L3 cache? Hopefully not main memory?!?]
  • 35. The Case of The CPU Pressure Point Where are the pressure points on the CPU and what can be done to resolve them?
  • 36. CPU Pipeline Architecture [Diagram: the front end (allocation) feeds the back end (retirement).] A 'pipeline' of logical slots runs through the processor. The front end can issue four micro-ops per clock cycle. The back end can retire up to four micro-ops per clock cycle.
  • 37. Pipeline 'Bubbles' Are Bad! Empty slots are referred to as 'bubbles'. Causes of front end bubbles: bad speculation; CPU stalls; data dependencies (A = B + C; E = A + D). Back end bubbles can be due to excessive demand for specific types of execution unit.
  • 38. Making Efficient Use Of The CPU In The "In Memory" World [Diagram: data flows through the CPU front end and back end.] Front end pressure: the front end issues fewer than 4 uops per cycle whilst the back end is ready to accept uops (CPU stalls, bad speculation, data dependencies). Back end pressure: retirement is throttled due to pressure on back end resources (port saturation).
  • 39. Lots Of KPIs To Choose From, Which To Select? CPU cycles per retired instruction (CPI): this should ideally be 0.25; anything approaching 1.0 is bad. Front end bound: the front end is under-supplying the back end with work (lower values are better). Back end bound: the back end cannot accept work from the front end because there is excessive demand for specific execution units (lower values are better).
  • 40. These Are The Pressure Point Statistics For The 'Sorted' Column Store [Chart: CPI, front end bound and back end bound against degree of parallelism 2–24.] Refer to Appendix C for the formulae from which these metrics are derived.
  • 41. The Back End Of The CPU Is Now The Bottleneck For The Batch Mode Engine Can we help the back end keep up with the front end?
  • 42. Single Instruction Multiple Data (SIMD) A class of CPU instruction that allows multiple data points to be processed simultaneously. A form of vectorised processing. Once CPU stalls are minimised, the challenge becomes processing data on the CPU (the rate of instruction retirement) as fast as possible.
  • 43. Lowering Clock Cycles Per Instruction By Leveraging SIMD Using conventional processing, adding together two arrays of four elements each requires four instructions: A(1) + B(1) = C(1), A(2) + B(2) = C(2), A(3) + B(3) = C(3), A(4) + B(4) = C(4).
  • 44. Lowering Clock Cycles Per Instruction By Leveraging SIMD Using SIMD ("single instruction multiple data") commands, the addition can be performed using a single instruction: A(1..4) + B(1..4) = C(1..4).
  • 45. Does The SQL Server Database Engine Leverage SIMD Instructions? VTune Amplifier does not provide the option to pick out streaming SIMD extension (SSE) integer events. However, for a floating point hash aggregate we would hope to see floating point AVX instructions in use.
  • 49. Appendix A: Instruction Execution And The CPU Front / Back Ends [Diagram: front end (branch predict, fetch, decode, decoded instruction buffer) feeding the back end (execute, reorder and retire).]
  • 50. Appendix B - The CPU Front / Back Ends In Detail Front end Back end
  • 51. Appendix C - CPU Pressure Points, Important Calculations Front end bound (smaller is better) = IDQ_UOPS_NOT_DELIVERED.CORE / (4 * clock ticks). Bad speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * clock ticks). Retiring = UOPS_RETIRED.RETIRE_SLOTS / (4 * clock ticks). Back end bound (smaller is better) = 1 - (front end bound + bad speculation + retiring).
  • 52. Appendix D - VTune Amplifier General Exploration An illustration of what the "General exploration" analysis capability of the tool provides.

Editor's Notes

  • #6: If the TOP 5000 is changed to TOP 30000 in the query used to create the enlarged FactInternetSales table, it can be scaled up to 1,095,600,000 rows.
  • #7: The key takeaway from this slide is that memory access is 100% CPU intensive; it is not subject to any locking or throttling back via contention.
  • #9: Signal wait time = total wait time is to be expected for short waits on uncontended spin locks. The big problem with wait statistics is that they shed no light whatsoever on where CPU cycles are being expended whilst a thread is using its 4 ms quantum of scheduler time.
  • #13: The current generation of Xeons is the i-series. Whereas the previous Core 2 architecture achieved a quad core design by gluing two dual core CPUs together on the same packaging, this is a native, "from the ground up" modular design. The memory controller was integrated onto the processor from the outset; also, with the Sandy Bridge architecture, the IO hub is integrated into the un-core area and the L3 cache is connected to the cores via a bi-directional ring bus.
  • #14: A big penalty is paid for going off CPU for data; this is also the case when using main memory to access the memory page table. Hekaton is not mentioned in this slide deck for the simple fact that memory optimised tables need to fit into memory; level 3 caches tend to be around 24 MB in size at a maximum for an eight core CPU, hence it is highly unlikely that any table of note can fit inside a CPU cache. When data is required that cannot be found in the L0, L1, L2 or L3 cache, this is known as a "CPU stall" or "last level cache miss".
  • #15: Note the difference in latency between accessing the on-CPU cache and main memory. Accessing main memory incurs a large penalty in terms of lost CPU cycles; this is important, and one of the drivers behind the new batch execution mode that was introduced in SQL Server 2012 in order to support column store indexes.
  • #21: The Core i series uses the empty pipeline slots caused by CPU stalls to schedule a second (hyper) thread per core. Prior to this, Intel used a technology called 'NetBurst', which aimed to fill all empty slots; as a technology this produced disappointing results, but hyper-threading with the Core i series is a different matter. AMD favours 'strong' threads, whereby each thread runs on its own dedicated core.
  • #22: Xperf can provide deep insights into the database engine that other tools cannot; in this case we can walk the stack associated with query execution and observe the total CPU consumption, in milliseconds, up to any point in the stack.
  • #23: This slide is the “Big reveal”, for the column store created on the pre-sorted data, there are four points of interest when comparing the two call stacks: The hash table lookup is more expensive for the query using the column store created on the non-sorted data, by an order of magnitude. We see a line below the Flookup for the query against the column store created on non-sorted data which we do not see in the other call stack. The hash aggregate lookup for the query on the “Non-sorted” column store is throttling back the column store scan more than its counterpart in the other call stack, this is based on the number of times this is called. The scan for the sorted column store is twice as expensive as that for the non-sorted column store.
  • #28: Thread numbers taken away… Removed bottom
  • #37: Modern Intel CPUs speak two languages: programs are compiled into x86 architecture instructions, while internally the CPU uses micro operations, which are similar in nature to the RISC instructions that ARM processors use. In short, this is so that the CPU can leverage RISC-like performance optimisations.
  • #38: To get the best possible throughput from the CPU, all pipeline slots should be utilised; 'bubbles', or empty slots, are to be avoided at all costs. Of all the things that can degrade CPU performance, getting branch prediction wrong is devastating, in that it leads to the flushing of whole instruction pipelines. There are techniques for utilising empty slots, such as out of order execution and hyper-threading. However, great effort is put into the design of branch prediction logic in order to avoid pipeline flushing in the first place; generally speaking, each new micro architecture includes some form of enhancement to the branch prediction logic.
  • #41: The results for this graph were obtained using a single instance with large pages, LPIM, 40 GB max memory, the 'Sorted' column store index and a warm large object cache.