DBA Level 400
About me
An independent SQL Consultant
A user of SQL Server from version 2000 onwards, with 12+ years' experience.
Speaker, both at UK user group events and at conferences.
I have a passion for understanding how the database engine works
at a deep level.
Demonstration
“Everything fits in memory, so performance is as good as it will get. It fits in memory, therefore end of story.”
Which Query Has The Lowest Elapsed Time ?
28.98Mb column store vs. 107.30Mb column store
How Well Do The Two Column Stores Scale On Larger Hardware ?
[Chart: elapsed time in ms (0 to 80,000) against degree of parallelism (2 to 24), non-sorted column store vs. sorted column store.]
Column store created on 1,095,600,000 rows
Can We Use All Available CPU Resource ?
[Chart: percentage CPU utilisation (0 to 100) against degree of parallelism (2 to 24), non-sorted vs. sorted.]
Memory access should consume all available CPU cycles ?!?
Looking For Clues
 Why does the query using the column store on pre-sorted data run faster ?
 Why can we not utilise 100% CPU capacity ?
 Let's start with tried and trusted tools and techniques.
Wait Statistics Do Not Help Here !
Stats are for the query run with a DOP of 24, a warm column store object pool, and the column store created on pre-sorted data ( 1,095,600,000 rows ).
CXPACKET waits can be ignored 99.99 % of the time.
Spin Locks Do Not Provide Any Clues Either
Executes in 775 ms with a warm column store object pool.
12 cores x 2.0 GHz x 0.775 s = 18,600,000,000 CPU cycles
Total spins: 293,491
SELECT [CalendarQuarter]
,SUM([Price1])
,SUM([Price2])
,SUM([Price3])
FROM [FactInternetSalesBig] f
JOIN [DimDate] d
ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
OPTION (MAXDOP 24)
Could Query Costs Help Solve The Two Mysteries ?
Assumptions:
The buffer cache is cold.
IO cannot be performed in parallel.
Data in different columns is never correlated ( better in SQL 2014 )
Hash distribution is always uniform.
Etc. . . .
Costings are based on the amount of time it took a developer's machine, a Dell OptiPlex ( according to legend ), to complete certain operations.
Our View Of Database Engine Resource Usage Is Based On . . .
wait stats, perfmon counters, extended
events and dynamic management views.
We need to know and understand:
 Where all our CPU cycles are going.
 How the database engine utilises the CPU at a deep architectural level.
“i Series” CPU Architecture
[Diagram: CPU die with multiple cores on a bi-directional ring bus. Per core: L1 instruction cache ( 32KB ), L1 data cache ( 32KB ), L0 UOP cache, L2 unified cache ( 256K ). Shared ‘un-core’: L3 cache, memory controller, QPI links, power and clock.]
 A system-on-chip ( SoC ) design with CPU cores as the basic building block.
 Utility services provisioned by the ‘un-core’ part of the CPU die.
 Three level cache hierarchy ( four for Sandy Bridge+ ).
The CPU Cache Hierarchy Latencies In CPU Cycles
L1 cache sequential access: 4
L1 cache in-page random access: 4
L1 cache full random access: 4
L2 cache sequential access: 11
L2 cache in-page random access: 11
L2 cache full random access: 11
L3 cache sequential access: 14
L3 cache in-page random access: 18
L3 cache full random access: 38
Main memory: 167
Memory Access Can Become More Costly With NUMA
[Diagram: two NUMA nodes, each with four cores ( per-core L1 and L2 caches, shared L3 ). Memory access within a node is local; access to the other node's memory is remote.]
NUMA Node Remote Memory Access Latency
An additional 20% overhead when accessing ‘foreign’ memory ! ( from coreinfo )
Local Vs Remote Memory Access and Thread Locality
How does SQLOS schedule hyper-threads in relation to physical cores ?
[Diagram: two CPU sockets, 6 cores per socket ( cores 0 to 5 ).]
Main Memory Is Holding The CPU Back, Solutions . . .
 Leverage the pre-fetcher as much as possible.
 Larger CPU caches
 L4 cache => Crystalwell eDRAM
 DDR4 memory
 Bypass main memory
 Stacked memory
 Hybrid Memory Cube ( Intel )
 High Bandwidth Memory ( AMD )
What The Pre-Fetcher Loves
A sequential scan of a column store index.
What The Pre-Fetcher Hates !
Random access into a hash table. Can this be improved ?
Making Use Of CPU Stalls With Hyper-Threading ( Core i series onwards )
Session 1 performs an index seek; the page is not found in the CPU cache. A CPU stall takes place ( 160+ clock cycles ) whilst the page is retrieved from memory. The ‘dead’ CPU stall cycles give the physical core the opportunity to run a 2nd ( hyper ) thread.
Back To Our Mystery, A Not So Well Documented Tool To The Rescue !
Stack Walking The Database Engine
SELECT p.EnglishProductName
,SUM([OrderQuantity])
,SUM([UnitPrice])
,SUM([ExtendedAmount])
,SUM([UnitPriceDiscountPct])
,SUM([DiscountAmount])
,SUM([ProductStandardCost])
,SUM([TotalProductCost])
,SUM([SalesAmount])
,SUM([TaxAmt])
,SUM([Freight])
FROM [dbo].[FactInternetSales] f
JOIN [dbo].[DimProduct] p
ON f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
xperf -on base -stackwalk profile
xperf -d stackwalk.etl
Open stackwalk.etl in WPA.
Call Stack For Query Against Column Store On Non-Pre-Sorted Data
Hash aggregate lookup weight: 65,329.87
Column store scan weight: 28,488.73
Where Is The Bottleneck In The Plan ?
The stack trace indicates that the bottleneck is right here:
Call Stack For Query Against Column Store On Pre-Sorted Data
Hash aggregate lookup weight:
 now 275.00
 before 65,329.87
Column store scan weight:
 now 45,764.07
 before 28,488.73
Does The OrderDateKey Column Fit In The L3 Cache ?
Table Name                  Column Name   Size (Mb)
FactInternetSalesBigNoSort  OrderDateKey  1,786,182
                            Price1            3,871
                            Price2            3,871
                            Price3            3,871
FactInternetSalesBigSorted  OrderDateKey        738
                            Price1        2,965,127
                            Price2        2,965,127
                            Price3        2,965,127
No, the L3 cache is 20Mb in size.
The Case Of The Two Column Store Index Sizes: Conclusion
Turning the memory access on
the hash aggregate table from random
to sequential probes
=
CPU savings > cost of scanning an
enlarged column store
Batch Mode Hash Join / Aggregate Performance and Skew
Row Mode:
 Expensive to repartition inputs.
 Data skew reduces parallelism.
[Diagram: build and probe inputs flow through exchange operators into the hash join, repartitioned across per-thread hash table partitions B1 … Bn.]
Batch Mode:
 No repartitioning.
 Data skew speeds up processing.
[Diagram: all build and probe threads work against a single shared hash table.]
The Case Of The 60% CPU Utilisation Ceiling
If CPU capacity or IO bandwidth
cannot be fully consumed, some
form of contention must be
present . . .
Batch Engine Call Stack
Throttling !
The Case Of The 60% CPU Utilisation Ceiling: Conclusion
The hash aggregate cannot
keep up with the column store
scan.
The batch engine therefore throttles the column store scan by making sleep system calls !
The integration services engine does something very similar, known as data flow engine “back pressure”.
The Case Of The Two Column Stores
The hash aggregate using the column store created on pre-sorted data is very CPU efficient.
Why ?
Introducing Intel VTune Amplifier XE
Investigating events at the CPU cache,
clock cycle and instruction level requires
software outside the standard Windows
and SQL Server tool set.
Refer to Appendix D for an overview of what “General exploration”
provides.
This Is What The CPU Stall Picture Looks Like Against DOP
LLC ( last level cache ) misses by degree of parallelism:
DOP    Non-sorted        Sorted
 2        13,200,924     3,000,210
 4        30,902,163     1,200,084
 6       161,411,298    16,203,164
 8     1,835,828,499    29,102,037
10     2,069,544,858    34,802,436
12     4,580,720,628    35,102,457
14     2,796,495,741    48,903,413
16     3,080,615,628    64,204,494
18     3,950,376,507    63,004,410
20     4,419,593,391    85,205,964
22     4,952,446,647    68,404,788
24     5,311,271,763    72,605,082
Which Memory Your Data Is In Matters ! Locality Matters !
Where is my data ? [Diagram: in the L1 data cache ? the L2 unified cache ? the L3 cache ? Hopefully not out across the memory bus in main memory ?!?]
The Case of The CPU Pressure Point
Where are the pressure points on the CPU and what can be done to resolve them ?
CPU Pipeline Architecture
[Diagram: CPU pipeline running from allocation at the front end through to retirement at the back end.]
 A ‘pipeline’ of logical slots runs through the processor.
 The front end can issue four micro-ops per clock cycle.
 The back end can retire up to four micro-ops per clock cycle.
Pipeline ‘Bubbles’ Are Bad !
[Diagram: pipeline with empty ‘bubble’ slots between allocation and retirement.]
 Empty slots are referred to as ‘bubbles’.
 Causes of front end bubbles:
 Bad speculation.
 CPU stalls.
 Data dependencies, e.g.
A = B + C
E = A + D
 Back end bubbles can be due to excessive demand for specific types of execution unit.
Making Efficient Use Of The CPU In The “In Memory” World
Frontend pressure: the front end issues < 4 uops per cycle whilst the back end is ready to accept uops ( CPU stalls, bad speculation, data dependencies ).
Backend pressure: retirement is throttled due to pressure on back end resources ( port saturation ).
Lots Of KPIs To Choose From, Which To Select ?
 CPU cycles per retired instruction ( CPI ): this should ideally be 0.25; anything approaching 1.0 is bad.
 Front end bound: the front end under-supplying the back end with work ( lower values are better ).
 Back end bound: the back end cannot accept work from the front end because there is excessive demand for specific execution units ( lower values are better ).
These Are The Pressure Point Statistics For The ‘Sorted’ Column Store
[Chart: CPI, front end bound and back end bound KPI values (0 to 0.8) against degree of parallelism (2 to 24) for the ‘sorted’ column store.]
Refer to Appendix C for the formulae from
which these metrics are derived.
The Backend Of The CPU Is Now The Bottleneck For The Batch mode Engine
[Diagram: the front end is keeping up; back end performance is now the constraint.]
Can we help the back end keep up with the front end ?
Single Instruction Multiple Data ( SIMD )
 A class of CPU instruction that can process multiple data points simultaneously.
 A form of vectorised processing.
 Once CPU stalls are minimised, the challenge becomes processing data on the CPU ( rate of instruction retirement ) as fast as possible.
Lowering Clock Cycles Per Instruction By Leveraging SIMD
Using conventional processing, adding together two arrays of four elements, A(1..4) + B(1..4) = C(1..4), requires four separate add instructions.
Lowering Clock Cycles Per Instruction By Leveraging SIMD
Using SIMD ( “single instruction, multiple data” ) commands, the same addition, A(1..4) + B(1..4) = C(1..4), can be performed with a single instruction.
Does The SQL Server Database Engine Leverage SIMD Instructions ?
 VTune Amplifier does not provide the option to pick out streaming SIMD extension ( SSE ) integer events.
 However, for a floating point hash aggregate we would hope to see floating point AVX instructions in use.
SQL Server engine batch mode and CPU architectures
chris1adkin@yahoo.co.uk
http://uk.linkedin.com/in/wollatondba
ChrisAdkin8
Appendix A: Instruction Execution And The CPU Front / Back Ends
[Diagram: front end: branch predict, cache, fetch, decode, decoded instruction buffer. Back end: execute, reorder and retire.]
Appendix B - The CPU Front / Back Ends In Detail
[Diagram: the front end and back end pipeline stages in detail.]
Appendix C - CPU Pressure Points, Important Calculations
 Front end bound ( smaller is better )
 IDQ_UOPS_NOT_DELIVERED.CORE / (4 * Clock ticks)
 Bad speculation
 (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRED_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * Clock ticks)
 Retiring
 UOPS_RETIRED.RETIRED_SLOTS / (4 * Clock ticks)
 Back end bound ( ideally, should = 1 - Retiring )
 1 - (Front end bound + Bad speculation + Retiring)
Appendix D - VTune Amplifier General Exploration
An illustration of what the “General exploration” analysis capability of the tool provides.
Ad

Recommended

Sql server engine cpu cache as the new ram
Sql server engine cpu cache as the new ram
Chris Adkin
 
Scaling out SSIS with Parallelism, Diving Deep Into The Dataflow Engine
Scaling out SSIS with Parallelism, Diving Deep Into The Dataflow Engine
Chris Adkin
 
An introduction to column store indexes and batch mode
An introduction to column store indexes and batch mode
Chris Adkin
 
Building scalable application with sql server
Building scalable application with sql server
Chris Adkin
 
Sql server scalability fundamentals
Sql server scalability fundamentals
Chris Adkin
 
Leveraging memory in sql server
Leveraging memory in sql server
Chris Adkin
 
Column store indexes and batch processing mode (nx power lite)
Column store indexes and batch processing mode (nx power lite)
Chris Adkin
 
Super scaling singleton inserts
Super scaling singleton inserts
Chris Adkin
 
Scaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insert
Chris Adkin
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
Tanel Poder
 
Oracle in-Memory Column Store for BI
Oracle in-Memory Column Store for BI
Franck Pachot
 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
BertrandDrouvot
 
Testing Delphix: easy data virtualization
Testing Delphix: easy data virtualization
Franck Pachot
 
Christo kutrovsky oracle, memory & linux
Christo kutrovsky oracle, memory & linux
Kyle Hailey
 
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
Equnix Business Solutions
 
PostgreSQL and RAM usage
PostgreSQL and RAM usage
Alexey Bashtanov
 
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
Equnix Business Solutions
 
Really Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DW
PostgreSQL Experts, Inc.
 
Exadata X3 in action: Measuring Smart Scan efficiency with AWR
Exadata X3 in action: Measuring Smart Scan efficiency with AWR
Franck Pachot
 
Dbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easy
Franck Pachot
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
PGConf APAC
 
Oracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known Features
Tanel Poder
 
Hug Hbase Presentation.
Hug Hbase Presentation.
Jack Levin
 
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
Franck Pachot
 
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Altinity Ltd
 
Introduction to Vacuum Freezing and XID
Introduction to Vacuum Freezing and XID
PGConf APAC
 
PostgreSQL 9.6 Performance-Scalability Improvements
PostgreSQL 9.6 Performance-Scalability Improvements
PGConf APAC
 
IMCSummit 2015 - Day 1 Developer Session - The Science and Engineering Behind...
IMCSummit 2015 - Day 1 Developer Session - The Science and Engineering Behind...
In-Memory Computing Summit
 
PARALLEL DATABASE SYSTEM in Computer Science.pptx
PARALLEL DATABASE SYSTEM in Computer Science.pptx
Sisodetrupti
 

More Related Content

What's hot (20)

Scaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insert
Chris Adkin
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
Tanel Poder
 
Oracle in-Memory Column Store for BI
Oracle in-Memory Column Store for BI
Franck Pachot
 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
BertrandDrouvot
 
Testing Delphix: easy data virtualization
Testing Delphix: easy data virtualization
Franck Pachot
 
Christo kutrovsky oracle, memory & linux
Christo kutrovsky oracle, memory & linux
Kyle Hailey
 
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
Equnix Business Solutions
 
PostgreSQL and RAM usage
PostgreSQL and RAM usage
Alexey Bashtanov
 
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
Equnix Business Solutions
 
Really Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DW
PostgreSQL Experts, Inc.
 
Exadata X3 in action: Measuring Smart Scan efficiency with AWR
Exadata X3 in action: Measuring Smart Scan efficiency with AWR
Franck Pachot
 
Dbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easy
Franck Pachot
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
PGConf APAC
 
Oracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known Features
Tanel Poder
 
Hug Hbase Presentation.
Hug Hbase Presentation.
Jack Levin
 
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
Franck Pachot
 
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Altinity Ltd
 
Introduction to Vacuum Freezing and XID
Introduction to Vacuum Freezing and XID
PGConf APAC
 
PostgreSQL 9.6 Performance-Scalability Improvements
PostgreSQL 9.6 Performance-Scalability Improvements
PGConf APAC
 
Scaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insert
Chris Adkin
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
Tanel Poder
 
Oracle in-Memory Column Store for BI
Oracle in-Memory Column Store for BI
Franck Pachot
 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
BertrandDrouvot
 
Testing Delphix: easy data virtualization
Testing Delphix: easy data virtualization
Franck Pachot
 
Christo kutrovsky oracle, memory & linux
Christo kutrovsky oracle, memory & linux
Kyle Hailey
 
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
Equnix Business Solutions
 
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
Equnix Business Solutions
 
Exadata X3 in action: Measuring Smart Scan efficiency with AWR
Exadata X3 in action: Measuring Smart Scan efficiency with AWR
Franck Pachot
 
Dbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easy
Franck Pachot
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
PGConf APAC
 
Oracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known Features
Tanel Poder
 
Hug Hbase Presentation.
Hug Hbase Presentation.
Jack Levin
 
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
CBO choice between Index and Full Scan: the good, the bad and the ugly param...
Franck Pachot
 
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder
 
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Analytics at Speed: Introduction to ClickHouse and Common Use Cases. By Mikha...
Altinity Ltd
 
Introduction to Vacuum Freezing and XID
Introduction to Vacuum Freezing and XID
PGConf APAC
 
PostgreSQL 9.6 Performance-Scalability Improvements
PostgreSQL 9.6 Performance-Scalability Improvements
PGConf APAC
 

Similar to Sql sever engine batch mode and cpu architectures (20)

IMCSummit 2015 - Day 1 Developer Session - The Science and Engineering Behind...
IMCSummit 2015 - Day 1 Developer Session - The Science and Engineering Behind...
In-Memory Computing Summit
 
PARALLEL DATABASE SYSTEM in Computer Science.pptx
PARALLEL DATABASE SYSTEM in Computer Science.pptx
Sisodetrupti
 
Oracle ebs capacity_analysisusingstatisticalmethods
Oracle ebs capacity_analysisusingstatisticalmethods
Ajith Narayanan
 
In-memory Data Management Trends & Techniques
In-memory Data Management Trends & Techniques
Hazelcast
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder
 
In-Memory Databases, Trends and Technologies (2012)
In-Memory Databases, Trends and Technologies (2012)
Vilho Raatikka
 
Investigate SQL Server Memory Like Sherlock Holmes
Investigate SQL Server Memory Like Sherlock Holmes
Richard Douglas
 
Extra performance out of thin air
Extra performance out of thin air
Konstantine Krutiy
 
Best storage engine for MySQL
Best storage engine for MySQL
tomflemingh2
 
Tips and Tricks for SAP Sybase IQ
Tips and Tricks for SAP Sybase IQ
Don Brizendine
 
Database story by DevOps
Database story by DevOps
Anton Martynenko
 
SQL Server 2014 for Developers (Cristian Lefter)
SQL Server 2014 for Developers (Cristian Lefter)
ITCamp
 
Dba lounge-sql server-performance-troubleshooting
Dba lounge-sql server-performance-troubleshooting
Dan Andrei Stefan
 
Hekaton introduction for .Net developers
Hekaton introduction for .Net developers
Shy Engelberg
 
JSSUG: SQL Sever Performance Tuning
JSSUG: SQL Sever Performance Tuning
Kenichiro Nakamura
 
SQL Server Performance Analysis
SQL Server Performance Analysis
Eduardo Castro
 
Ajuste (tuning) del rendimiento de SQL Server 2008
Ajuste (tuning) del rendimiento de SQL Server 2008
Eduardo Castro
 
PASS Summit 2009 Keynote Dave DeWitt
PASS Summit 2009 Keynote Dave DeWitt
Mark Ginnebaugh
 
5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance
Command Prompt., Inc
 
Five steps perform_2009 (1)
Five steps perform_2009 (1)
PostgreSQL Experts, Inc.
 
IMCSummit 2015 - Day 1 Developer Session - The Science and Engineering Behind...
IMCSummit 2015 - Day 1 Developer Session - The Science and Engineering Behind...
In-Memory Computing Summit
 
PARALLEL DATABASE SYSTEM in Computer Science.pptx
PARALLEL DATABASE SYSTEM in Computer Science.pptx
Sisodetrupti
 
Oracle ebs capacity_analysisusingstatisticalmethods
Oracle ebs capacity_analysisusingstatisticalmethods
Ajith Narayanan
 
In-memory Data Management Trends & Techniques
In-memory Data Management Trends & Techniques
Hazelcast
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder
 
In-Memory Databases, Trends and Technologies (2012)
In-Memory Databases, Trends and Technologies (2012)
Vilho Raatikka
 
Investigate SQL Server Memory Like Sherlock Holmes
Investigate SQL Server Memory Like Sherlock Holmes
Richard Douglas
 
Extra performance out of thin air
Extra performance out of thin air
Konstantine Krutiy
 
Best storage engine for MySQL
Best storage engine for MySQL
tomflemingh2
 
Tips and Tricks for SAP Sybase IQ
Tips and Tricks for SAP Sybase IQ
Don Brizendine
 
SQL Server 2014 for Developers (Cristian Lefter)
SQL Server 2014 for Developers (Cristian Lefter)
ITCamp
 
Dba lounge-sql server-performance-troubleshooting
Dba lounge-sql server-performance-troubleshooting
Dan Andrei Stefan
 
Hekaton introduction for .Net developers
Hekaton introduction for .Net developers
Shy Engelberg
 
JSSUG: SQL Sever Performance Tuning
JSSUG: SQL Sever Performance Tuning
Kenichiro Nakamura
 
SQL Server Performance Analysis
SQL Server Performance Analysis
Eduardo Castro
 
Ajuste (tuning) del rendimiento de SQL Server 2008
Ajuste (tuning) del rendimiento de SQL Server 2008
Eduardo Castro
 
PASS Summit 2009 Keynote Dave DeWitt
PASS Summit 2009 Keynote Dave DeWitt
Mark Ginnebaugh
 
Ad

More from Chris Adkin (9)

Bdc from bare metal to k8s
Bdc from bare metal to k8s
Chris Adkin
 
Data weekender deploying prod grade sql 2019 big data clusters
Data weekender deploying prod grade sql 2019 big data clusters
Chris Adkin
 
Data relay introduction to big data clusters
Data relay introduction to big data clusters
Chris Adkin
 
Ci with jenkins docker and mssql belgium
Ci with jenkins docker and mssql belgium
Chris Adkin
 
Continuous Integration With Jenkins Docker SQL Server
Continuous Integration With Jenkins Docker SQL Server
Chris Adkin
 
TSQL Coding Guidelines
TSQL Coding Guidelines
Chris Adkin
 
J2EE Performance And Scalability Bp
J2EE Performance And Scalability Bp
Chris Adkin
 
J2EE Batch Processing
J2EE Batch Processing
Chris Adkin
 
Oracle Sql Tuning
Oracle Sql Tuning
Chris Adkin
 
Bdc from bare metal to k8s
Bdc from bare metal to k8s
Chris Adkin
 
Data weekender deploying prod grade sql 2019 big data clusters
Data weekender deploying prod grade sql 2019 big data clusters
Chris Adkin
 
Data relay introduction to big data clusters
Data relay introduction to big data clusters
Chris Adkin
 
Ci with jenkins docker and mssql belgium
Ci with jenkins docker and mssql belgium
Chris Adkin
 
Continuous Integration With Jenkins Docker SQL Server
Continuous Integration With Jenkins Docker SQL Server
Chris Adkin
 
TSQL Coding Guidelines
TSQL Coding Guidelines
Chris Adkin
 
J2EE Performance And Scalability Bp
J2EE Performance And Scalability Bp
Chris Adkin
 
J2EE Batch Processing
J2EE Batch Processing
Chris Adkin
 
Oracle Sql Tuning
Oracle Sql Tuning
Chris Adkin
 
Ad

Recently uploaded (20)

Advanced_English_Pronunciation_in_Use.pdf
Advanced_English_Pronunciation_in_Use.pdf
leogoemmanguyenthao
 
Indigo_Airlines_Strategy_Presentation.pptx
Indigo_Airlines_Strategy_Presentation.pptx
mukeshpurohit991
 
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
Taqyea
 
MRI Pulse Sequence in radiology physics.pptx
MRI Pulse Sequence in radiology physics.pptx
BelaynehBishaw
 
REGRESSION DIAGNOSTIC II: HETEROSCEDASTICITY
REGRESSION DIAGNOSTIC II: HETEROSCEDASTICITY
Ameya Patekar
 
定制OCAD学生卡加拿大安大略艺术与设计大学成绩单范本,OCAD成绩单复刻
定制OCAD学生卡加拿大安大略艺术与设计大学成绩单范本,OCAD成绩单复刻
taqyed
 
THE LINEAR REGRESSION MODEL: AN OVERVIEW
THE LINEAR REGRESSION MODEL: AN OVERVIEW
Ameya Patekar
 
LECTURE_skakkakwowowkwkkwkskwkqowoowoaoaoa.cooos
LECTURE_skakkakwowowkwkkwkskwkqowoowoaoaoa.cooos
ssuseraf13da
 
Measurecamp Copenhagen - Consent Context
Measurecamp Copenhagen - Consent Context
Human37
 
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
Ameya Patekar
 
Residential Zone 4 for industrial village
Residential Zone 4 for industrial village
MdYasinArafat13
 
LECTURE_2skkkkskskskskksksksosoowowowowkwkw.ccoo
LECTURE_2skkkkskskskskksksksosoowowowowkwkw.ccoo
ssuseraf13da
 
Pause Travail 22 Hostiou Girard 12 juin 2025.pdf
Pause Travail 22 Hostiou Girard 12 juin 2025.pdf
Institut de l'Elevage - Idele
 
Verweven van EM Legacy en OTL-data bij AWV
Verweven van EM Legacy en OTL-data bij AWV
jacoba18
 
Team_Mercury.pdf hai kya hai kya hai kya hai kya hai kya
Team_Mercury.pdf hai kya hai kya hai kya hai kya hai kya
genadit49
 
Power BI API Connectors - Best Practices for Scalable Data Connections
Power BI API Connectors - Best Practices for Scalable Data Connections
Vidicorp Ltd
 
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
taqyea
 
Boost Business Efficiency with Professional Data Entry Services
Boost Business Efficiency with Professional Data Entry Services
eloiacs eloiacs
 
最新版美国史蒂文斯理工学院毕业证(SIT毕业证书)原版定制
最新版美国史蒂文斯理工学院毕业证(SIT毕业证书)原版定制
Taqyea
 
llm_presentation and deep learning methods
llm_presentation and deep learning methods
sayedabdussalam11
 
Advanced_English_Pronunciation_in_Use.pdf
Advanced_English_Pronunciation_in_Use.pdf
leogoemmanguyenthao
 
Indigo_Airlines_Strategy_Presentation.pptx
Indigo_Airlines_Strategy_Presentation.pptx
mukeshpurohit991
 
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
最新版美国威斯康星大学拉克罗斯分校毕业证(UW–L毕业证书)原版定制
Taqyea
 
MRI Pulse Sequence in radiology physics.pptx
MRI Pulse Sequence in radiology physics.pptx
BelaynehBishaw
 
REGRESSION DIAGNOSTIC II: HETEROSCEDASTICITY
REGRESSION DIAGNOSTIC II: HETEROSCEDASTICITY
Ameya Patekar
 
定制OCAD学生卡加拿大安大略艺术与设计大学成绩单范本,OCAD成绩单复刻
定制OCAD学生卡加拿大安大略艺术与设计大学成绩单范本,OCAD成绩单复刻
taqyed
 
THE LINEAR REGRESSION MODEL: AN OVERVIEW
THE LINEAR REGRESSION MODEL: AN OVERVIEW
Ameya Patekar
 
LECTURE_skakkakwowowkwkkwkskwkqowoowoaoaoa.cooos
LECTURE_skakkakwowowkwkkwkskwkqowoowoaoaoa.cooos
ssuseraf13da
 
Measurecamp Copenhagen - Consent Context
Measurecamp Copenhagen - Consent Context
Human37
 
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
REGRESSION DIAGNOSTIC I: MULTICOLLINEARITY
Ameya Patekar
 
Residential Zone 4 for industrial village
Residential Zone 4 for industrial village
MdYasinArafat13
 
LECTURE_2skkkkskskskskksksksosoowowowowkwkw.ccoo
LECTURE_2skkkkskskskskksksksosoowowowowkwkw.ccoo
ssuseraf13da
 
Verweven van EM Legacy en OTL-data bij AWV
Verweven van EM Legacy en OTL-data bij AWV
jacoba18
 
Team_Mercury.pdf hai kya hai kya hai kya hai kya hai kya
Team_Mercury.pdf hai kya hai kya hai kya hai kya hai kya
genadit49
 
Power BI API Connectors - Best Practices for Scalable Data Connections
Power BI API Connectors - Best Practices for Scalable Data Connections
Vidicorp Ltd
 
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
taqyea
 
Boost Business Efficiency with Professional Data Entry Services
Boost Business Efficiency with Professional Data Entry Services
eloiacs eloiacs
 
最新版美国史蒂文斯理工学院毕业证(SIT毕业证书)原版定制
最新版美国史蒂文斯理工学院毕业证(SIT毕业证书)原版定制
Taqyea
 
llm_presentation and deep learning methods
llm_presentation and deep learning methods
sayedabdussalam11
 

Sql sever engine batch mode and cpu architectures

  • 2. About me An independent SQL Consultant A user of SQL Server from version 2000 onwards with 12+ years experience. Speaker, both at UK user group events and at conferences. I have a passion for understanding how the database engine works at a deep level.
  • 3. Demonstration “Everything fits in memory, so performance is as good as it will get. It fits in memory therefore end of story”
  • 4. Which Query Has The Lowest Elapsed Time ? 28.98Mb column store Vs. 107.30Mb column store
  • 5. How Well Do The Two Column Stores Scale On Larger Hardware ? 0 10000 20000 30000 40000 50000 60000 70000 80000 2 4 6 8 10 12 14 16 18 20 22 24 Time(ms) Degree of Parallelism Non-sorted column store Sorted column store Column store created on 1095600000 rows
  • 6. Can We Use All Available CPU Resource ? 0 10 20 30 40 50 60 70 80 90 100 2 4 6 8 10 12 14 16 18 20 22 24 PercentageCPUUtilization Degree of Parallelism Non-sorted Sorted Memory access should consume all available CPU cycles ?!?
  • 7. Looking For Clues  Why does the query using the column store on pre-sorted data run faster ?  Why can we not utilise 100% CPU capacity ?  Lets start with tried and trusted tools and techniques.
  • 8. Wait Statistics Do Not Help Here ! Stats are for the query ran with a DOP of 24, a warm column store object pool and the column store created on pre-sorted data (1095600000 rows ). CXPACKET waits can be ignored 99.99 % of the time.
  • 9. Spin Locks Do Not Provide Any Clues Either Executes in 775 ms for a warm column store object pool 12 cores x 2.0 Ghz x 0.775 = 1,860,000,000 CPU cycles Total spins 293,491 SELECT [CalendarQuarter] ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3]) FROM [FactInternetSalesBig] f JOIN [DimDate] d ON f.OrderDateKey = d.DateKey GROUP BY CalendarQuarter OPTION (MAXDOP 24)
  • 10. Could Query Costs Help Solve The Two Mysteries? Assumptions: The buffer cache is cold. IO cannot be performed in parallel. Data in different columns is never correlated (better in SQL Server 2014). Hash distribution is always uniform. Etc. Costings are based on the amount of time it took a developer's machine, a Dell OptiPlex (according to legend), to complete certain operations.
  • 11. Our View Of Database Engine Resource Usage Is Based On . . . wait stats, perfmon counters, extended events and dynamic management views. We need to know and understand: where all our CPU cycles are going, and how the database engine utilises the CPU at a deep architectural level.
  • 12. Core "i Series" CPU Architecture [Diagram: each core has a 32 KB L1 instruction cache, an L0 uop cache, a 32 KB L1 data cache and a 256 KB unified L2 cache; the cores, shared L3 cache, memory controller, QPI links and power/clock units are connected by a bi-directional ring bus.] A system-on-chip (SoC) design with CPU cores as the basic building block. Utility services are provisioned by the 'un-core' part of the CPU die. A three-level cache hierarchy (four for Sandy Bridge onwards).
  • 13. The CPU Cache Hierarchy: Latencies In CPU Cycles
        L1 cache, sequential access          4
        L1 cache, in-page random access      4
        L1 cache, full random access         4
        L2 cache, sequential access         11
        L2 cache, in-page random access     11
        L2 cache, full random access        11
        L3 cache, sequential access         14
        L3 cache, in-page random access     18
        L3 cache, full random access        38
        Main memory                        167
  • 14. Memory Access Can Become More Costly With NUMA [Diagram: two NUMA nodes, each with four cores (per-core L1 and L2 caches) sharing an L3 cache; local memory access stays within a node, remote memory access crosses to the other node.]
  • 15. NUMA Node Remote Memory Access Latency An additional 20% overhead when accessing 'foreign' memory! (from coreinfo)
  • 16. Local vs Remote Memory Access and Thread Locality How does SQLOS schedule hyper-threads in relation to physical cores? (6 cores per socket)
  • 17. Main Memory Is Holding The CPU Back, Solutions . . . Leverage the pre-fetcher as much as possible. Larger CPU caches: L4 cache (Crystal Well eDRAM). DDR4 memory. Bypass main memory with stacked memory: hybrid memory cubes (Intel) and high bandwidth memory (AMD).
  • 18. What The Pre-Fetcher Loves: a sequential scan of a column store index.
  • 19. What The Pre-Fetcher Hates: random probes into a hash table. Can this be improved?
  • 20. Making Use Of CPU Stalls With Hyper-Threading (Core i series onwards) Session 1 performs an index seek; the page is not found in the CPU cache. A CPU stall takes place (160+ clock cycles) whilst the page is retrieved from memory. The 'dead' CPU stall cycles give the physical core the opportunity to run a 2nd (hyper) thread.
  • 21. Back To Our Mystery, A Not So Well Documented Tool To The Rescue! Stack Walking The Database Engine: xperf -on base -stackwalk profile; xperf -d stackwalk.etl; WPA. SELECT p.EnglishProductName ,SUM([OrderQuantity]) ,SUM([UnitPrice]) ,SUM([ExtendedAmount]) ,SUM([UnitPriceDiscountPct]) ,SUM([DiscountAmount]) ,SUM([ProductStandardCost]) ,SUM([TotalProductCost]) ,SUM([SalesAmount]) ,SUM([TaxAmt]) ,SUM([Freight]) FROM [dbo].[FactInternetSales] f JOIN [dbo].[DimProduct] p ON f.ProductKey = p.ProductKey GROUP BY p.EnglishProductName
  • 22. Call Stack For Query Against Column Store On Non-Pre-Sorted Data Hash agg lookup weight: 65,329.87. Column store scan weight: 28,488.73.
  • 23. Where Is The Bottleneck In The Plan? [Plan diagram showing control flow and data flow.] The stack trace indicates that the bottleneck is the hash aggregate lookup.
  • 24. Call Stack For Query Against Column Store On Pre-Sorted Data Hash agg lookup weight: now 275.00, before 65,329.87. Column store scan weight: now 45,764.07, before 28,488.73.
  • 25. Does The OrderDateKey Column Fit In The L3 Cache?
        Table Name                  Column Name   Size (MB)
        FactInternetSalesBigNoSort  OrderDateKey  1786182
                                    Price1        3871
                                    Price2        3871
                                    Price3        3871
        FactInternetSalesBigSorted  OrderDateKey  738
                                    Price1        2965127
                                    Price2        2965127
                                    Price3        2965127
        No, the L3 cache is 20 MB in size.
  • 26. The Case Of The Two Column Store Index Sizes: Conclusion Turning the memory access on the hash aggregate table from random to sequential probes yields CPU savings greater than the cost of scanning an enlarged column store.
  • 27. Batch Mode Hash Join / Aggregate Performance and Skew [Diagram comparing the two execution modes.] Row mode: the build and probe inputs are repartitioned across threads via exchanges; it is expensive to repartition inputs and data skew reduces parallelism. Batch mode: all threads build into and probe a shared hash table; there is no repartitioning and data skew speeds up processing.
  • 28. The Case Of The 60% CPU Utilisation Ceiling If CPU capacity or IO bandwidth cannot be fully consumed, some form of contention must be present . . .
  • 29. Batch Engine Call Stack Throttling !
  • 30. The Case Of The 60% CPU Utilisation Ceiling: Conclusion The hash aggregate cannot keep up with the column store scan. The batch engine therefore throttles the column store scan by issuing sleep system calls! The integration services engine does something very similar, known as data flow engine "back pressure".
  • 31. The Case Of The Two Column Stores The hash aggregate using the column store created on pre-sorted data is very CPU efficient. Why?
  • 32. Introducing Intel VTune Amplifier XE Investigating events at the CPU cache, clock cycle and instruction level requires software outside the standard Windows and SQL Server tool set. Refer to Appendix D for an overview of what “General exploration” provides.
  • 33. This Is What The CPU Stall Picture Looks Like: last level cache (LLC) misses by degree of parallelism
        DOP   Non-sorted       Sorted
        2     13,200,924       3,000,210
        4     30,902,163       1,200,084
        6     161,411,298      16,203,164
        8     1,835,828,499    29,102,037
        10    2,069,544,858    34,802,436
        12    4,580,720,628    35,102,457
        14    2,796,495,741    48,903,413
        16    3,080,615,628    64,204,494
        18    3,950,376,507    63,004,410
        20    4,419,593,391    85,205,964
        22    4,952,446,647    68,404,788
        24    5,311,271,763    72,605,082
  • 34. Which Memory Your Data Is In Matters! Locality Matters! [Diagram: where is my data? The L1 data cache? The L2 unified cache? The L3 cache? Hopefully not main memory?!?]
  • 35. The Case of The CPU Pressure Point Where are the pressure points on the CPU and what can be done to resolve them?
  • 36. CPU Pipeline Architecture [Diagram: the front end (allocation) feeds the back end (retirement).] A 'pipeline' of logical slots runs through the processor. The front end can issue four micro-ops per clock cycle. The back end can retire up to four micro-ops per clock cycle.
  • 37. Pipeline 'Bubbles' Are Bad! Empty slots are referred to as 'bubbles'. Causes of front end bubbles: bad speculation; CPU stalls; data dependencies (A = B + C; E = A + D). Back end bubbles can be due to excessive demand for specific types of execution unit.
  • 38. Making Efficient Use Of The CPU In The "In Memory" World [Diagram: data flows through the CPU front end and back end.] Front end pressure: the front end issues fewer than 4 uops per cycle whilst the back end is ready to accept uops (CPU stalls, bad speculation, data dependencies). Back end pressure: retirement is throttled due to pressure on back end resources (port saturation).
  • 39. Lots Of KPIs To Choose From, Which To Select? CPU cycles per retired instruction (CPI): this should ideally be 0.25; anything approaching 1.0 is bad. Front end bound: the front end is under-supplying the back end with work (lower values are better). Back end bound: the back end cannot accept work from the front end because there is excessive demand for specific execution units (lower values are better).
  • 40. These Are The Pressure Point Statistics For The 'Sorted' Column Store [Chart: CPI, front end bound and back end bound against degree of parallelism 2–24.] Refer to Appendix C for the formulae from which these metrics are derived.
  • 41. The Back End Of The CPU Is Now The Bottleneck For The Batch Mode Engine Can we help the back end keep up with the front end?
  • 42. Single Instruction Multiple Data (SIMD) A class of CPU instruction that allows multiple data points to be processed simultaneously. A form of vectorised processing. Once CPU stalls are minimised, the challenge becomes processing data on the CPU (the rate of instruction retirement) as fast as possible.
  • 43. Lowering Clock Cycles Per Instruction By Leveraging SIMD Using conventional processing, adding together two arrays of four elements each requires four instructions: A(1) + B(1) = C(1), A(2) + B(2) = C(2), A(3) + B(3) = C(3), A(4) + B(4) = C(4).
  • 44. Lowering Clock Cycles Per Instruction By Leveraging SIMD Using SIMD ("single instruction multiple data") commands, the addition can be performed using a single instruction: A(1..4) + B(1..4) = C(1..4).
  • 45. Does The SQL Server Database Engine Leverage SIMD Instructions? VTune Amplifier does not provide the option to pick out streaming SIMD extension (SSE) integer events. However, for a floating point hash aggregate we would hope to see floating point AVX instructions in use.
  • 49. Appendix A: Instruction Execution And The CPU Front / Back Ends [Diagram: front end (branch predict, fetch, decode, decoded instruction buffer) feeding the back end (execute, reorder and retire).]
  • 50. Appendix B - The CPU Front / Back Ends In Detail Front end Back end
  • 51. Appendix C - CPU Pressure Points, Important Calculations Front end bound (smaller is better) = IDQ_UOPS_NOT_DELIVERED.CORE / (4 * clock ticks). Bad speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * clock ticks). Retiring = UOPS_RETIRED.RETIRE_SLOTS / (4 * clock ticks). Back end bound (smaller is better) = 1 - (front end bound + bad speculation + retiring).
  • 52. Appendix D - VTune Amplifier General Exploration An illustration of what the "General exploration" analysis capability of the tool provides.

Editor's Notes

  • #6: If the TOP 5000 is changed to TOP 30000 in the query used to create the enlarged FactInternetSales table, it can be scaled up to 1,095,600,000 rows.
  • #7: The key takeaway from this slide is that memory access is 100% CPU intensive; it is not subject to any locking or throttling back via contention.
  • #9: Signal wait time = total wait time is to be expected for short waits on uncontended spin locks. The big problem with wait statistics is that they shed no light whatsoever on where CPU cycles are being expended whilst a thread is using its 4 ms quantum of scheduler time.
  • #13: The current generation of Xeons is the i-series. Whereas the previous Core 2 architecture achieved a quad core design by gluing two dual core CPUs together on the same packaging, this is a native, "from the ground up" modular design. The memory controller was integrated onto the processor from the outset; also, with the Sandy Bridge architecture, the IO hub is integrated into the un-core area and the L3 cache is connected to the cores via a bi-directional ring bus.
  • #14: A big penalty is paid for going off CPU for data; this is also the case when using main memory to access the memory page table. Hekaton is not mentioned in this slide deck for the simple fact that memory optimised tables need to fit into memory; level 3 caches tend to be around 24 MB in size at a maximum for an eight core CPU, hence it is highly unlikely that any table of note can fit inside a CPU cache. When data is required that cannot be found in the L0, L1, L2 or L3 cache, this is known as a "CPU stall" or "last level cache miss".
  • #15: Note the difference in latency between accessing the on-CPU cache and main memory. Accessing main memory incurs a large penalty in terms of lost CPU cycles; this is important, and one of the drivers behind the new batch execution mode that was introduced in SQL Server 2012 in order to support column store indexes.
  • #21: The Core i series uses the empty pipeline slots caused by CPU stalls to schedule a second (hyper) thread per core. Prior to this, Intel used a technology called 'NetBurst', which aimed to fill all empty slots; as a technology this produced disappointing results, but hyper-threading with the Core i series is a different matter. AMD favours 'strong' threads, whereby each thread runs on its own dedicated core.
  • #22: Xperf can provide deep insights into the database engine that other tools cannot; in this case we can walk the stack associated with query execution and observe the total CPU consumption, in milliseconds, up to any point in the stack.
  • #23: This slide is the “Big reveal”, for the column store created on the pre-sorted data, there are four points of interest when comparing the two call stacks: The hash table lookup is more expensive for the query using the column store created on the non-sorted data, by an order of magnitude. We see a line below the Flookup for the query against the column store created on non-sorted data which we do not see in the other call stack. The hash aggregate lookup for the query on the “Non-sorted” column store is throttling back the column store scan more than its counterpart in the other call stack, this is based on the number of times this is called. The scan for the sorted column store is twice as expensive as that for the non-sorted column store.
  • #28: Thread numbers taken away… Removed bottom
  • #37: Modern Intel CPUs speak two languages: programs are compiled into x86 architecture instructions, while internally the CPU uses micro operations, which are similar in nature to the RISC instructions that ARM processors use. In short, this is so that the CPU can leverage RISC-like performance optimisations.
  • #38: To get the best possible throughput from the CPU, all pipeline slots should be utilised; 'bubbles', or empty slots, are to be avoided at all costs. Of all the things that can degrade CPU performance, getting branch prediction wrong is devastating, in that it leads to the flushing of whole instruction pipelines. There are techniques for utilising empty slots, such as out of order execution and hyper-threading. However, great effort is put into the design of branch prediction logic in order to avoid pipeline flushing in the first place; generally speaking, each new micro architecture includes some form of enhancement to the branch prediction logic.
  • #41: The results for this graph were obtained using a single instance with large pages, LPIM, 40 GB max memory, the 'Sorted' column store index and a warm large object cache.