Physical Memory Models: the ways the Linux kernel addresses physical memory (physical page frames)
Adrian Huang | June, 2022
* Kernel 5.11 (x86_64)
Agenda
• Four physical memory models
✓Purpose: page descriptor <-> PFN (Page Frame Number) conversion
• Sparse memory model
• Sparse Memory Virtual Memmap: subsection
• page->flags
Four Physical Memory Models
• Flat Memory Model (CONFIG_FLATMEM)
✓UMA (Uniform Memory Access) with mostly contiguous physical memory
• Discontiguous Memory Model (CONFIG_DISCONTIGMEM)
✓NUMA (Non-Uniform Memory Access) with mostly contiguous physical memory
✓Removed since v5.14 because the sparse memory model covers this scope
• https://lore.kernel.org/linux-mm/20210602105348.13387-1-rppt@kernel.org/
• Sparse Memory (CONFIG_SPARSEMEM)
✓NUMA with discontiguous physical memory
• Sparse Memory Virtual Memmap (CONFIG_SPARSEMEM_VMEMMAP)
✓NUMA with discontiguous physical memory: a quick way to convert between page struct and PFN
Memory Model – Flat Memory
[Figure] mem_map: struct page *mem_map points to a pre-allocated page structure array (struct page #0 … #n) in kernel virtual address space; struct page #i describes page frame #i of physical memory.
1. [mem_map] Dynamic page structure: pre-allocate all page structures based on the number of page frames
✓ Allocate/init page structures based on the node’s memory info (struct pglist_data)
▪ Refer to: pglist_data.node_start_pfn & pglist_data.node_spanned_pages
2. Scenario: contiguous page frames (no memory holes) in UMA
3. Drawbacks
✓ Wastes mem_map space if there are memory holes
✓ Does not support memory hotplug
4. Check kernel function alloc_node_mem_map() in mm/page_alloc.c
Memory Model – Discontiguous Memory
[Figure] node_data[]: an array of struct pglist_data pointers (one per NUMA node, in kernel virtual address space); each node’s node_mem_map is a separate struct page array covering that node’s page frames (e.g. node #0: page frames #0–#999, node #1: page frames #1000 onward).
Note
1. [node_mem_map] Dynamic page structure: pre-allocate all page structures based on the number of page frames
✓ Allocate/init page structures based on each node’s memory info (struct pglist_data)
▪ Refer to: pglist_data.node_start_pfn & pglist_data.node_spanned_pages
2. Scenario: each node has contiguous page frames (no memory holes) in NUMA
3. Drawbacks
✓ Wastes node_mem_map space if there are memory holes
✓ Does not support memory hotplug
Memory Model – Sparse Memory
[Figure] mem_section: a two-dimensional array reached via struct mem_section **mem_section (root pointers → per-root arrays of struct mem_section); each present section’s section_mem_map points to a struct page array for its page frames. Only sections backed by memory are populated, so a hot-pluggable node (e.g. node #1) adds sections on demand.
Note
1. [section_mem_map] Dynamic page structure: pre-allocate page structures based on the number of available page frames
✓ Refer to: the memblock structure
2. Supports physical memory hotplug
3. Minimum unit: PAGES_PER_SECTION = 32768
✓ Each memory section addresses 32768 * 4KB (page size) = 128MB of memory
4. [NUMA] The per-section granularity of “struct mem_section” reduces the waste caused by memory holes
Memory Model – Sparse Memory Virtual Memmap
[Figure] vmemmap: with CONFIG_SPARSEMEM_VMEMMAP, the per-section struct page arrays are placed in one virtually contiguous region starting at vmemmap; the memory sections remain a two-dimensional array for section metadata.
Note
1. [section_mem_map] Dynamic page structure: pre-allocate page structures based on the number of available page frames
✓ Refer to: the memblock structure
2. Supports physical memory hotplug
3. Minimum unit: PAGES_PER_SECTION = 32768
✓ Each memory section addresses 32768 * 4KB (page size) = 128MB of memory
4. [NUMA] The per-section granularity of “struct mem_section” reduces the waste caused by memory holes
5. Employs a virtual memory map (vmemmap / vmemmap_base): a quick way to convert between page struct and PFN
6. Default memory model in the Linux kernel
Memory Model – Sparse Memory Virtual Memmap: Detail
[Figure] PFN decomposition (x86_64): PFN bits 14:0 index the page within a section (PAGES_PER_SECTION), bits 22:15 select the struct mem_section within a root (SECTIONS_PER_ROOT = 256 entries per root), and bits 33:23 select one of the NR_SECTION_ROOTS (2048) root pointers in the two-dimensional **mem_section array. Each present section’s 32768 struct pages are mapped into the vmemmap region and describe 128 MB of page frames; whole sections can be hot-added or hot-removed independently.
Sparse Memory Model
1. How to know available memory pages in a system?
2. Page Table Configuration for Direct Mapping
3. Sparse Memory Model Initialization – Detail
How to know available memory pages in a system?
BIOS e820 → memblock → zone page frame allocator
e820__memblock_setup() … __free_pages_core()
[Call Path] memblock frees available memory to the zone page frame allocator.
Zone page allocator details will be discussed in another session: physical memory management.
setup_arch() -- Focus on memory portion
setup_arch
Reserve memblock for kernel code +
data/bss sections, page #0 and init ramdisk
e820__memory_setup
Set up init_mm members ‘start_code’, ‘end_code’, ‘end_data’ and ‘brk’
memblock_x86_reserve_range_setup_data
e820__reserve_setup_data
e820__finish_early_params
efi_init
dmi_setup
e820_add_kernel_range
trim_bios_range
max_pfn = e820__end_of_ram_pfn()
kernel_randomize_memory
e820__memblock_setup
init_mem_mapping
x86_init.paging.pagetable_init
early_alloc_pgt_buf
reserve_brk
init_memory_mapping()
• Create the 4-level page tables (direct mapping) based on
the ‘memory’-type regions of the memblock configuration.
x86_init.paging.pagetable_init()
• Init sparse
• Init zone structure
x86 - setup_arch() -- init_mem_mapping() – Page Table
Configuration for Direct Mapping
init_mem_mapping
probe_page_size_mask
setup_pcid
memory_map_top_down(ISA_END_ADDRESS, end)
init_memory_mapping(0, ISA_END_ADDRESS, PAGE_KERNEL)
init_range_memory_mapping(start, last_start)
split_mem_range
kernel_physical_mapping_init
add_pfn_range_mapped
early_ioremap_page_table_range_init [x86 only]
load_cr3(swapper_pg_dir)
__flush_tlb_all
init_memory_mapping() -> kernel_physical_mapping_init()
• Create 4-level page table (direct mapping) based on
‘memory’ type of memblock configuration.
split_mem_range()
• Split the input memory range (start and end address) into
groups by page size
✓ Try the larger page size first
▪ 1G huge page -> 2M huge page -> 4K page
while (last_start > map_start)
init_memory_mapping(start, end, PAGE_KERNEL)
for_each_mem_pfn_range() → memblock stuff
Page Table Configuration for Direct Mapping
[Figure] x86_64 virtual address layout (64-bit virtual address; see Documentation/x86/x86_64/mm.rst):
• User space: 0 – 0x0000_7FFF_FFFF_FFFF (128TB)
• Guard hole (8TB) from 0xFFFF_8000_0000_0000; LDT remap for PTI (0.5TB); unused hole (0.5TB)
• page_offset_base = 0xFFFF_8880_0000_0000: page frame direct mapping (64TB) covering physical memory — ZONE_DMA (0–16MB), ZONE_DMA32, ZONE_NORMAL
• vmalloc_base = 0xFFFF_C900_0000_0000: vmalloc/ioremap (32TB); unused hole (1TB)
• vmemmap_base = 0xFFFF_EA00_0000_0000: virtual memory map (1TB), storing the page frame descriptors (struct page array)
• __START_KERNEL_map = 0xFFFF_FFFF_8000_0000: kernel text mapping from physical address 0; __START_KERNEL = 0xFFFF_FFFF_8100_0000: kernel code [.text, .data…] (1GB or 512MB)
• MODULES_VADDR: modules (1GB or 1.5GB)
• FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000: fix-mapped address space, FIXADDR_START (expanded to 4MB: 05ab1d8a4b36)
• Unused hole (2MB) at 0xFFFF_FFFF_FFE0_0000 up to 0xFFFF_FFFF_FFFF_FFFF
* page_offset_base, vmalloc_base and vmemmap_base can be dynamically configured by KASLR (Kernel Address Space Layout Randomization - "arch/x86/mm/kaslr.c"); the values above are the default configuration.
Note: Refer to page #5 in the slide deck Decompressed vmlinux: linux kernel initialization from page table configuration perspective
init_mem_mapping() – Page Table Configuration for Direct Mapping
Note
• 2-socket server with 32GB memory
setup_arch() -- init_mem_mapping() – Page Table
Configuration for Direct Mapping
init_memory_mapping() -> kernel_physical_mapping_init()
• Create 4-level page table (direct mapping) based on
‘memory’ type of the memblock configuration.
x86 - setup_arch() -- x86_init.paging.pagetable_init()
x86_init.paging.pagetable_init
native_pagetable_init
Remove mappings in the end of physical
memory from the boot time page table
paging_init
pagetable_init
__flush_tlb_all
sparse_init
zone_sizes_init
permanent_kmaps_init
x86_init.paging.pagetable_init
native_pagetable_init
paging_init
sparse_init
zone_sizes_init
x86 x86_64
Configure the number of PFNs for each zone
free_area_init
Sparse Memory Model Initialization: sparse_init()
sparse_init
memblocks_present
pnum_begin = first_present_section_nr();
nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
for_each_mem_pfn_range(..)
memory_present(nid, start, end)
1. for_each_mem_pfn_range(): walk through the available memory ranges
from the memblock subsystem
Allocate pointer array of section root if necessary
for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
sparse_index_init
set_section_nid
section_mark_present
cfg ‘ms->section_mem_map’ via
sparse_encode_early_nid()
for_each_present_section_nr(pnum_begin + 1, pnum_end)
sparse_init_nid
sparse_init_nid [Cover last cpu node]
Mark the present bit for each allocated mem_section
cfg ms->section_mem_map flag bits
1. Allocate a mem_section_usage struct
2. cfg ms->section_mem_map with the valid page descriptor
[During boot]
Temporarily store the nid in ms->section_mem_map
[During boot]
Temporarily read the nid back from ms->section_mem_map
memory_present()
[Figure] Walk the memblock ‘memory’ regions — e.g. region #0 (base=0x1000, size=0x9f000, nid=0), region #1 (base=0x100000, size=0x2ff00000, nid=0), region #2 (base=0x3004_2000, size=0x1d6_e000, nid=0), …, region #7 (base=0x4_5000_0000, size=0x3_ffc0_0000, nid=1) — and allocate the root pointer array struct mem_section *[2048] (**mem_section); each root covers 256 struct mem_section entries, all still uninitialized at this point.
memory_present()
[Figure] The first walked region marks struct mem_section #0 of root #0 present: section_mem_map temporarily holds the encoded nid plus the flag bits O=1, P=1, M=0, E=0; usage is still unset.
P: Present, M: Memory map, O: Online, E: Early
memory_present()
[Figure] As the walk continues over memblock regions #1–#2, struct mem_section #0 through #6 of root #0 are marked present (P=1) — one mem_section per 128MB of spanned memory.
memory_present()
[Figure] Final state for this 2-node example: node 0 populates sections #0–#6 of root #0; node 1 (region #7: base=0x4_5000_0000, size=0x3_ffc0_0000) populates sections #138–#255 of root #0 and sections #0–#9 of root #1. All present sections carry P=1; every other mem_section stays uninitialized.
sparse_init_nid(): cfg section_mem_map
[Figure] Per-node pass (the number of available struct mem_section entries is map_count): allocate a struct mem_section_usage array for the node’s present sections, then allocate the struct page arrays for each mem_section and map them into the page table at vmemmap (= VMEMMAP_START = vmemmap_base) — section #0 of section_roots #0 covers struct page #0–#32767, section #1 covers struct page #32768–#65535, and so on.
Note
Allocate page structs for each mem_section and map them into the page table (Virtual Memory Map).
[Figure] After sparse_init_nid() completes, every present section has flags O=1, P=1, M=1, E=1 and section_mem_map points at valid page descriptors in the vmemmap: node 0’s sections #0–#6 (section_roots #0) cover struct page #0–#229375; node 1’s sections #138–#255 (section_roots #0) and #0–#9 (section_roots #1) cover struct page #4521984–#8683520. Only the sections backed by memblock regions are allocated and initialized.
Re-visit sparse memory
• Sparse Memory: refer to section_mem_map
• Sparse Memory with vmemmap: refer to vmemmap
Sparse Memory Virtual Memmap:
subsection
1. Introduction
2. Subsection users?
3. pageblock_flags: pageblock migration type
Sparse Memory Virtual Memmap: subsection (1/4)
[Figure] Subsection layout (SECTION_SIZE_BITS = 27): within a section, PFN bits 8:0 index the page inside a subsection (PAGES_PER_SUBSECTION = 512) and bits 14:9 select one of the 64 subsections (SUBSECTIONS_PER_SECTION); the higher PFN bits still select the section (SECTIONS_PER_ROOT) and its root (NR_SECTION_ROOTS). struct mem_section_usage holds subsection_map[1] (bitmap) and pageblock_flags[0]; subsection #0 covers struct page #0–#511, …, subsection #63 covers struct page #32256–#32767.
• subsection_map: bitmap indicating whether the corresponding subsection is valid
• pageblock_flags: all pages of a subsection share the same flag (migration type)
• Subsections exist with sparsemem vmemmap *only*
Sparse Memory Virtual Memmap: subsection (2/4)
Some macros are expanded manually
Note
Sparse Memory Virtual Memmap: subsection (3/4)
• PAGES_PER_SUBSECTION = 512 pages
✓ 512 pages * 4KB = 2MB → one 2MB huge page on x86_64
Sparse Memory Virtual Memmap: subsection (4/4)
• SUBSECTION_SIZE
✓ (1UL << 21) = 2MB → one 2MB huge page on x86_64
Some macros are expanded manually
Note
subsection: subsection_map users?
• Init stage
✓ paging_init -> zone_sizes_init -> free_area_init -> subsection_map_init -> subsection_mask_set
➢ Set the corresponding bitmap bits for the specific subsection
• Reference stage
✓ pfn_section_valid(struct mem_section *ms, unsigned long pfn)
➢ Users
▪ [mm/page_alloc.c: 5089] free_pages -> virt_addr_valid -> __virt_addr_valid -> pfn_valid -> pfn_section_valid
▪ [drivers/char/mem.c: 416] mmap_kmem -> pfn_valid -> pfn_section_valid ➔ /dev/mem (`man mem`)
▪ …
subsection_map users
• Hotplug stage
✓ Add
➢ #A1 [drivers/acpi/acpi_memhotplug.c: 311] acpi_memory_device_add -> acpi_memory_enable_device ->
__add_memory -> add_memory_resource -> arch_add_memory -> add_pages -> __add_pages -> sparse_add_section
-> section_activate -> fill_subsection_map -> subsection_mask_set
➢ #A2 [drivers/dax/kmem.c: 43] dev_dax_kmem_probe -> add_memory_driver_managed -> add_memory_resource ->
same as #A1
✓ Remove
➢ #R1 [drivers/acpi/acpi_memhotplug.c: 311] acpi_memory_device_remove -> __remove_memory ->
try_remove_memory -> arch_remove_memory -> __remove_pages -> __remove_section -> sparse_remove_section ->
section_deactivate -> clear_subsection_map
➢ #R2 [drivers/dax/kmem.c: 139] dev_dax_kmem_remove -> remove_memory -> try_remove_memory -> same as #R1
pageblock_flags: pageblock migration type
[Figure] struct mem_section_usage ends with unsigned long pageblock_flags[4] (dynamically allocated). Each unsigned long packs sixteen 4-bit migration-type (MT) fields: pageblock_flags[0] holds the migration types of subsections #0–#15, [1] of #16–#31, [2] of #32–#47, [3] of #48–#63.
Migration type is configured in setup_arch -> … -> memmap_init_zone
pageblock: set migration type
free_area_init
print zone ranges and early memory node ranges
for_each_mem_pfn_range(..)
print memory range for each memblock
subsection_map_init
mminit_verify_pageflags_layout
setup_nr_node_ids
init_unavailable_mem
for_each_online_node(nid)
free_area_init_node
node_set_state
check_for_memory
get_pfn_range_for_nid
calculate_node_totalpages
pgdat_set_deferred_range
free_area_init_core
free_area_init_core
memmap_init
for (j = 0; j < MAX_NR_ZONES; j++)
memmap_init_zone
subsection_map_init
subsection_mask_set
for (nr = start_sec; nr <= end_sec; nr++)
bitmap_set
calculate arch_zone_{lowest, highest}_possible_pfn[]
for (pfn = start_pfn; pfn < end_pfn;)
set_pageblock_migratetype
__init_single_page
set_pageblock_migratetype
• [System init stage] each pageblock is initialized to MIGRATE_MOVABLE
zone (present_pages = 1311744)
[Figure] A zone is divided into pageblocks #0 … #N, where N = round_up(present_pages / pageblock_size) - 1.
pageblock size:
• CONFIG_HUGETLB_PAGE=y: 512 pages (= huge page size)
• CONFIG_HUGETLB_PAGE=n: 1024 pages (2^(MAX_ORDER - 1))
Example: pageblocks = round_up(1311744 / 512) = 2562 (16 + 2544 + 2 = 2562, summed over the zones)
[CONFIG_HUGETLB_PAGE=y]
pages per subsection = pages per pageblock = 512 pages (order = 9)
page->flags layout
[Figure] Field layout variants (bit 63 down to bit 0):
• No sparsemem, or sparsemem vmemmap: Node | Zone | … | flags
• No sparsemem, or sparsemem vmemmap + last_cpupid: Node | Zone | LAST_CPUPID | … | flags
• sparsemem: Section | Node | Zone | … | flags
• sparsemem + last_cpupid: Section | Node | Zone | LAST_CPUPID | … | flags
• sparsemem w/o node: Section | Zone | … | flags
Note
1. last_cpupid: support for NUMA balancing (NUMA-optimizing scheduler)
2. sparsemem: enabled by CONFIG_SPARSEMEM (without vmemmap)
page->flags layout: sparsemem vmemmap + last_cpupid
Kernel Configuration: qemu – v5.11 kernel
...
CONFIG_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
…
CONFIG_NR_CPUS=64
…
CONFIG_NODES_SHIFT=10
…
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
…
# CONFIG_KASAN is not set
[Figure] Resulting page->flags layout (high to low): Node (bits 63:54) | Zone (bits 53:52, 2-bit zone) | LAST_CPUPID (bits 51:38) | unused | flags (bits 22:0, 23-bit pageflags from enum pageflags)
page->flags: section field (sparsemem w/o vmemmap)
Sparse Memory: refer to section_mem_map
Memory Model – Sparse Memory (sparsemem w/o vmemmap)
[Figure] Same sparse-memory layout as before: the mem_section two-dimensional array, with each present section’s section_mem_map pointing at its struct page array (node #1 hot-pluggable).
Note
1. [section_mem_map] Dynamic page structure: pre-allocate page structures based on the number of available page frames
✓ Refer to: the memblock structure
2. Supports physical memory hotplug
3. Minimum unit: mem_section - PAGES_PER_SECTION = 32768
✓ Each memory section addresses 32768 * 4KB (page size) = 128MB of memory
4. [NUMA] The per-section granularity of “struct mem_section” reduces the waste caused by memory holes
Without vmemmap, converting a struct page back to its PFN needs the section number stored in page->flags to locate the owning mem_section and its section_mem_map.
Reference
• https://www.kernel.org/doc/html/v5.17/vm/memory-model.html
Backup
/sys/devices/system/memory/block_size_bytes
[Flowchart — source code: arch/x86/mm/init_64.c: probe_memory_block_size(); SGI UV platform ignored]
• System memory < 64GB? → Y: block_size_bytes = 0x800_0000 (MIN_MEMORY_BLOCK_SIZE = 128 MB)
• N → !X86_FEATURE_HYPERVISOR? → Y (bare metal): block_size_bytes = 0x8000_0000 (MAX_BLOCK_SIZE = 2 GB)
• N (running on a hypervisor, e.g. a QEMU guest OS): find the largest allowed block size that aligns to the memory end (check ‘max_pfn’), in the range 0x8000_0000 down to 0x800_0000

Physical Memory Models.pdf
  • 1. Physical Memory Models: the ways Linux kernel addresses physical memory (physical page frame) Adrian Huang | June, 2022 * Kernel 5.11 (x86_64)
  • 2. Agenda • Four physical memory models ✓Purpose: page descriptor <-> PFN (Page Frame Number) • Sparse memory model • Sparse Memory Virtual Memmap: subsection • page->flags
  • 3. Four Physical Memory Models • Flat Memory Model (CONFIG_FLATMEM) ✓UMA (Uniform Memory Access) with mostly continuous physical memory • Discontinuous Memory Model (CONFIG_DISCONTIGMEM) ✓NUMA (Non-Uniform Memory Access) with mostly continuous physical memory ✓Removed since v5.14 because the sparse memory model can cover this scope • https://lore.kernel.org/linux-mm/20210602105348.13387-1-rppt@kernel.org/ • Sparse Memory (CONFIG_SPARSEMEM) ✓NUMA with discontinuous physical memory • Sparse Memory Virtual Memmap (CONFIG_SPARSEMEM_VMEMMAP) ✓NUMA with discontinuous physical memory: a quick way to get the page struct and pfn
  • 4. Memory Model – Flat Memory struct page #n .... struct page #1 struct page #0 Dynamic page structure (Kernel Virtual Address Space) struct page *mem_map page frame #n .... page frame #1 page frame #0 Physical Memory Note Page structure array (Kernel Virtual Address Space) 1. [mem_map] Dynamic page structure: pre-allocate all page structures based on the number of page frames ✓ Allocate/Init page structures based on node’s memory info (struct pglist_data) ▪ Refer from: pglist_data.node_start_pfn & pglist_data.node_spanned_pages 2. Scenario: Continuous page frames (no memory holes) in UMA 3. Drawbacks ✓ Wastes node_mem_map space if there are memory holes ✓ Does not support memory hotplug 4. Check kernel function alloc_node_mem_map() in mm/page_alloc.c
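With CONFIG_FLATMEM, the page <-> PFN conversion helpers (include/asm-generic/memory_model.h) are plain array indexing into mem_map. A minimal userspace sketch of that idea, with a toy static backing array standing in for what alloc_node_mem_map() would allocate, and ARCH_PFN_OFFSET assumed to be 0 as on x86:

```c
#include <stddef.h>

struct page { unsigned long flags; /* ... */ };

#define ARCH_PFN_OFFSET 0UL   /* x86: physical memory starts at PFN 0 */

static struct page mem_map_store[64];          /* toy backing array for the sketch */
static struct page *mem_map = mem_map_store;   /* kernel: allocated at boot */

/* FLATMEM flavour: pfn_to_page() is a single array index */
static struct page *pfn_to_page(unsigned long pfn)
{
    return mem_map + (pfn - ARCH_PFN_OFFSET);
}

static unsigned long page_to_pfn(struct page *page)
{
    return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

This is exactly why flat memory wastes space on holes: every PFN in the spanned range needs a slot in the array, mapped or not.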
  • 5. Memory Model – Flat Memory
  • 6. Memory Model – Discontinuous Memory struct pglist_data * page frame #000 .... page frame #1000 Physical Memory 1. [node_mem_map] Dynamic page structure: pre-allocate all page structures based on the number of page frames ✓ Allocate/Init page structures based on node’s memory info (struct pglist_data) ▪ Refer from: pglist_data.node_start_pfn & pglist_data.node_spanned_pages 2. Scenario: Each node has continuous page frames (no memory holes) in NUMA 3. Drawbacks ✓ Wastes node_mem_map space if there are memory holes ✓ Does not support memory hotplug NUMA Node Structure (Kernel Virtual Address Space) struct pglist_data * struct pglist_data * … struct pglist_data *node_data[] page frame #999 .... page frame #0 struct page #n .... struct page #0 node_mem_map node_mem_map Node #1 Node #0 Note
  • 7. Memory Model – Sparse Memory struct mem_section page frame .... page frame Physical Memory **mem_section struct mem_section struct mem_section … struct mem_section * page frame .... page frame .... struct page #0 struct page #n .... struct page #0 Node #1 (hotplug) Node #0 … struct mem_section * 1. [section_mem_map] Dynamic page structure: pre-allocate page structures based on the number of available page frames ✓ Refer from: memblock structure 2. Support physical memory hotplug 3. Minimum unit: PAGES_PER_SECTION = 32768 ✓ Each memory section addresses the memory size: 32768 * 4KB (page size) = 128MB 4. [NUMA] : reduce the memory hole impact due to “struct mem_section” Note struct page #m+n-1
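Under CONFIG_SPARSEMEM without vmemmap, pfn_to_page() must first find the mem_section covering the PFN in the two-level mem_section[] array, then index that section's page array. A simplified userspace sketch of the lookup — the real kernel also packs flag bits into the low bits of section_mem_map (masked off before use) and encodes the pointer so that adding the raw PFN lands on the right struct page; the SECTIONS_PER_ROOT/NR_SECTION_ROOTS values follow this slide deck's x86_64 configuration:

```c
#include <stddef.h>

/* Simplified sparsemem constants (x86_64, SECTION_SIZE_BITS = 27, 4 KB pages) */
#define PFN_SECTION_SHIFT   15                       /* 2^15 pages = 128 MB/section */
#define PAGES_PER_SECTION   (1UL << PFN_SECTION_SHIFT)
#define SECTIONS_PER_ROOT   256
#define NR_SECTION_ROOTS    2048

struct page { unsigned long flags; };

struct mem_section {
    unsigned long section_mem_map;  /* encoded pointer to the section's page array */
};

static struct mem_section *mem_section[NR_SECTION_ROOTS];  /* root pointer array */

/* Toy objects so the sketch can run: one root, one page array for section 0 */
static struct mem_section root0[SECTIONS_PER_ROOT];
static struct page sec0_pages[PAGES_PER_SECTION];

static struct mem_section *nr_to_section(unsigned long nr)
{
    struct mem_section *root = mem_section[nr / SECTIONS_PER_ROOT];
    return root ? &root[nr % SECTIONS_PER_ROOT] : NULL;
}

/* SPARSEMEM flavour: section lookup, then index by the raw PFN.
 * Assumes the section is present (no NULL/validity checks here). */
static struct page *pfn_to_page(unsigned long pfn)
{
    struct mem_section *ms = nr_to_section(pfn >> PFN_SECTION_SHIFT);
    return (struct page *)ms->section_mem_map + pfn;
}
```

Because section 0 starts at PFN 0, encoding its map is just storing the array's base address; for section N the kernel stores `mem_map - section_start_pfn` so the same "+ pfn" works everywhere.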
  • 8. struct mem_section page frame .... page frame Physical Memory struct mem_section struct mem_section … struct mem_section * page frame .... page frame struct page #m+n-1 .... struct page #m struct page #n .... struct page #0 Node #1 Node #0 … struct mem_section * Memory Model – Sparse Memory Virtual Memmap vmemmap Memory Section (two-dimension array) Note 1. [section_mem_map] Dynamic page structure: pre-allocate page structures based on the number of available page frames ✓ Refer from: memblock structure 2. Support physical memory hotplug 3. Minimum unit: PAGES_PER_SECTION = 32768 ✓ Each memory section addresses the memory size: 32768 * 4KB (page size) = 128MB 4. [NUMA] : reduce the memory hole impact due to “struct mem_section” 5. Employ virtual memory map (vmemmap/ vmemmap_base) – A quick way to get page struct and pfn 6. Default configuration in Linux kernel
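With CONFIG_SPARSEMEM_VMEMMAP, all struct pages live in one virtually contiguous array at vmemmap_base, so the conversion collapses to a single add/subtract with no section lookup at all — this is the "quick way" the slide refers to. A userspace sketch with a toy backing array in place of the real vmemmap region:

```c
struct page { unsigned long flags; };

static struct page toy_pages[64];          /* stand-in for the vmemmap region   */
static struct page *vmemmap = toy_pages;   /* kernel: vmemmap_base, e.g.
                                            * 0xFFFF_EA00_0000_0000 (KASLR may
                                            * shift it) */

/* SPARSEMEM_VMEMMAP flavour of the helpers: pure pointer arithmetic */
static struct page *pfn_to_page(unsigned long pfn)
{
    return vmemmap + pfn;
}

static unsigned long page_to_pfn(struct page *p)
{
    return (unsigned long)(p - vmemmap);
}
```

The cost is page-table work: the vmemmap pages backing each present section must themselves be allocated and mapped, which is what sparse_init_nid() sets up.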
  • 9. Memory Model – Sparse Memory Virtual Memmap: Detail SECTIONS_PER_ROOT PAGES_PER_SECTION 0 14 15 22 NR_SECTION_ROOTS 23 33 PFN 63 struct mem_section Physical Memory struct mem_section … struct mem_section * page frame struct page #32767 .... struct page #0 … struct mem_section * vmemmap **mem_section (two-dimension array) struct mem_section struct mem_section … . . . 0 0 0 255 255 struct page .... struct page struct page .... struct page 2047 + … page frame page frame … page frame page frame … Hot add Hot add Hot remove .... + page frame 128 MB PFN
  • 10. SECTIONS_PER_ROOT PAGES_PER_SECTION 0 14 15 22 NR_SECTION_ROOTS 23 33 PFN 63 struct mem_section Physical Memory struct mem_section … struct mem_section * page frame struct page #32767 .... struct page #0 … struct mem_section * vmemmap **mem_section (two-dimension array) struct mem_section struct mem_section … . . . 0 0 0 255 255 struct page .... struct page struct page .... struct page 2047 + … page frame page frame … page frame page frame … Hot add Hot add Hot remove .... + page frame 128 MB PFN Memory Model – Sparse Memory Virtual Memmap: Detail
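The bit layout in the diagram above (PFN bits 0-14 select the page within a 128 MB section, bits 15-22 the mem_section slot within a root, bits 23-33 the root slot) can be checked with a few lines of arithmetic. The root/in-root helper names are illustrative; only pfn_to_section_nr mirrors an actual kernel helper:

```c
/* x86_64 sparsemem defaults from the diagram:
 * PAGES_PER_SECTION = 32768 (2^15), SECTIONS_PER_ROOT = 256, NR_SECTION_ROOTS = 2048 */
#define PFN_SECTION_SHIFT 15
#define SECTIONS_PER_ROOT 256UL

static unsigned long pfn_to_section_nr(unsigned long pfn)
{
    return pfn >> PFN_SECTION_SHIFT;            /* strip bits 0-14 */
}

static unsigned long section_nr_to_root(unsigned long nr)
{
    return nr / SECTIONS_PER_ROOT;              /* PFN bits 23-33 */
}

static unsigned long section_nr_in_root(unsigned long nr)
{
    return nr % SECTIONS_PER_ROOT;              /* PFN bits 15-22 */
}

static unsigned long pfn_offset_in_section(unsigned long pfn)
{
    return pfn & ((1UL << PFN_SECTION_SHIFT) - 1);  /* PFN bits 0-14 */
}
```

For instance, the PFN of physical address 4 GB (0x100000) lands in section 32, i.e. root 0, slot 32, offset 0.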
  • 11. Sparse Memory Model 1. How to know available memory pages in a system? 2. Page Table Configuration for Direct Mapping 3. Sparse Memory Model Initialization – Detail
  • 12. How to know available memory pages in a system? BIOS e820 memblock Zone Page Frame Allocator e820__memblock_setup() __free_pages_core() [Call Path] memblock frees available memory space to zone page frame allocator Zone page allocator detail will be discussed in another session: physical memory management
  • 13. setup_arch() -- Focus on memory portion setup_arch Reserve memblock for kernel code + data/bss sections, page #0 and init ramdisk e820__memory_setup Setup init_mm struct for members ‘start_code’, ‘end_code’, ‘end_data’ and ‘brk’ memblock_x86_reserve_range_setup_data e820__reserve_setup_data e820__finish_early_params efi_init dmi_setup e820_add_kernel_range trim_bios_range max_pfn = e820__end_of_ram_pfn() kernel_randomize_memory e820__memblock_setup init_mem_mapping x86_init.paging.pagetable_init early_alloc_pgt_buf reserve_brk init_memory_mapping() • Create 4-level page table (direct mapping) based on ‘memory’ type of memblock configuration. x86_init.paging.pagetable_init() • Init sparse • Init zone structure
  • 14. x86 - setup_arch() -- init_mem_mapping() – Page Table Configuration for Direct Mapping init_mem_mapping probe_page_size_mask setup_pcid memory_map_top_down(ISA_END_ADDRESS, end) init_memory_mapping(0, ISA_END_ADDRESS, PAGE_KERNEL) init_range_memory_mapping(start, last_start) split_mem_range kernel_physical_mapping_init add_pfn_range_mapped early_ioremap_page_table_range_init [x86 only] load_cr3(swapper_pg_dir) __flush_tlb_all init_memory_mapping() -> kernel_physical_mapping_init() • Create 4-level page table (direct mapping) based on ‘memory’ type of memblock configuration. split_mem_range() • Split the input memory range (start address and end address) into groups by page size ✓ Try larger page sizes first ▪ 1G huge page -> 2M huge page -> 4K page while (last_start > map_start) init_memory_mapping(start, end, PAGE_KERNEL) for_each_mem_pfn_range() → memblock stuff
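The split_mem_range() idea — carve a physical range into runs mappable with 4 KB, 2 MB and 1 GB pages, preferring the larger sizes in the middle — can be sketched in userspace C. This is a simplified model of the algorithm, not the kernel function (the real code works on PFNs and handles 32-bit quirks):

```c
#include <stdint.h>
#include <stddef.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

struct map_range { uint64_t start, end, page_size; };

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

/* Carve [start, end) into up to five runs: 4K head, 2M run, 1G middle,
 * 2M tail, 4K tail -- each run aligned for its page size. */
static size_t split_mem_range(struct map_range *mr, uint64_t start, uint64_t end)
{
    const uint64_t sizes[5] = { SZ_4K, SZ_2M, SZ_1G, SZ_2M, SZ_4K };
    uint64_t bounds[5];
    uint64_t pos = start;
    size_t n = 0;

    bounds[0] = (start + SZ_2M - 1) & ~(SZ_2M - 1);           /* first 2M boundary */
    bounds[1] = min_u64((start + SZ_1G - 1) & ~(SZ_1G - 1),   /* first 1G boundary */
                        end & ~(SZ_2M - 1));                  /* but stay 2M-aligned */
    bounds[2] = end & ~(SZ_1G - 1);                           /* 1G-mappable middle */
    bounds[3] = end & ~(SZ_2M - 1);                           /* 2M tail */
    bounds[4] = end;                                          /* 4K tail */

    for (int i = 0; i < 5; i++) {
        uint64_t b = min_u64(bounds[i], end);
        if (b > pos) {
            mr[n++] = (struct map_range){ pos, b, sizes[i] };
            pos = b;
        }
    }
    return n;
}
```

Mapping [1 MB, 2 GB + 2 MB + 4 KB) this way yields a 4 KB head up to 2 MB, a 2 MB run up to 1 GB, a 1 GB middle, then 2 MB and 4 KB tails.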
  • 15. Page Table Configuration for Direct Mapping — x86_64 virtual address layout (Reference: Documentation/x86/x86_64/mm.rst):
User space: 0 - 0x0000_7FFF_FFFF_FFFF (128 TB), then a guard hole (8 TB), LDT remap for PTI (0.5 TB) and an unused hole (0.5 TB).
Kernel space, from 0xFFFF_8000_0000_0000 (128 TB):
✓ Page frame direct mapping (64 TB) of physical memory (ZONE_DMA: 0-16 MB, ZONE_DMA32, ZONE_NORMAL): page_offset_base = 0xFFFF_8880_0000_0000
✓ vmalloc/ioremap (32 TB): vmalloc_base = 0xFFFF_C900_0000_0000, followed by an unused hole (1 TB)
✓ Virtual memory map – 1 TB (stores the page frame descriptors: *page …): vmemmap_base = 0xFFFF_EA00_0000_0000
✓ Kernel text mapping from physical address 0 (kernel code: .text, .data, …): __START_KERNEL_map = 0xFFFF_FFFF_8000_0000, __START_KERNEL = 0xFFFF_FFFF_8100_0000 (1 GB or 512 MB)
✓ Modules: MODULES_VADDR (1 GB or 1.5 GB)
✓ Fix-mapped address space (expanded to 4 MB: 05ab1d8a4b36): FIXADDR_START up to FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000, then an unused hole (2 MB) from 0xFFFF_FFFF_FFE0_0000 to 0xFFFF_FFFF_FFFF_FFFF
* page_offset_base, vmalloc_base and vmemmap_base above are the default configuration; they can be dynamically configured by KASLR (Kernel Address Space Layout Randomization - "arch/x86/mm/kaslr.c")
Note: Refer from page #5 in the slide deck Decompressed vmlinux: linux kernel initialization from page table configuration perspective
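The layout ties the two kernel regions together: for a direct-map address, subtracting page_offset_base gives the physical address (__pa()), and shifting that by PAGE_SHIFT indexes the vmemmap array of page descriptors. A userspace sketch with the default (non-KASLR) bases and a toy vmemmap array; only the arithmetic is real, the addresses are treated as plain numbers:

```c
#include <stdint.h>

struct page { unsigned long flags; };

#define PAGE_SHIFT 12

/* Default (non-KASLR) base from the layout above */
static const uint64_t page_offset_base = 0xffff888000000000ULL; /* direct map */

static struct page toy_pages[8];          /* stand-in for the vmemmap region */
static struct page *vmemmap = toy_pages;  /* kernel: 0xffffea0000000000 */

/* virt_to_page() for a direct-map address: __pa(), then index vmemmap by PFN */
static struct page *virt_to_page(uint64_t vaddr)
{
    uint64_t paddr = vaddr - page_offset_base;  /* __pa() for the direct map */
    return vmemmap + (paddr >> PAGE_SHIFT);
}
```

So the direct-map virtual address of physical page 3 (plus any offset within the page) resolves to the fourth descriptor in vmemmap.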
  • 16. init_mem_mapping() – Page Table Configuration for Direct Mapping Note • 2-socket server with 32GB memory
  • 17. init_mem_mapping() – Page Table Configuration for Direct Mapping Note • 2-socket server with 32GB memory
  • 18. setup_arch() -- init_mem_mapping() – Page Table Configuration for Direct Mapping init_memory_mapping() -> kernel_physical_mapping_init() • Create 4-level page table (direct mapping) based on ‘memory’ type of the memblock configuration.
  • 19. x86 - setup_arch() -- x86_init.paging.pagetable_init() x86_init.paging.pagetable_init native_pagetable_init Remove mappings in the end of physical memory from the boot time page table paging_init pagetable_init __flush_tlb_all sparse_init zone_sizes_init permanent_kmaps_init x86_init.paging.pagetable_init native_pagetable_init paging_init sparse_init zone_sizes_init x86 x86_64 cfg number of pfn for each zone free_area_init
  • 20. Sparse Memory Model Initialization: sparse_init() sparse_init memblocks_present pnum_begin = first_present_section_nr(); nid_begin = sparse_early_nid(__nr_to_section(pnum_begin)); for_each_mem_pfn_range(..) memory_present(nid, start, end) 1. for_each_mem_pfn_range(): Walk through available memory range from memblock subsystem Allocate pointer array of section root if necessary for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) sparse_index_init set_section_nid section_mark_present cfg ‘ms->section_mem_map’ via sparse_encode_early_nid() for_each_present_section_nr(pnum_begin + 1, pnum_end) sparse_init_nid sparse_init_nid [Cover last cpu node] Mark the present bit for each allocated mem_section cfg ms->section_mem_map flag bits 1. Allocate a mem_section_usage struct 2. cfg ms->section_mem_map with the valid page descriptor [During boot] Temporary: Store nid in ms->section_mem_map [During boot] Temporary: get nid in ms->section_mem_map
  • 21. memblock bottom_up current_limit memory reserved memblock_type cnt max total_size *regions name memblock_region #0 base = 0x1000 size = 0x9f000 flags nid = 0 memory_present() memblock_type cnt max total_size *regions name memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 struct mem_section * #2047 … struct mem_section * #0 struct mem_section #0 struct mem_section #255 … **mem_section Initialized object Initialized object Uninitialized object
  • 22. memblock bottom_up current_limit memory reserved memblock_type cnt max total_size *regions name memblock_region #0 base = 0x1000 size = 0x9f000 flags nid = 0 memory_present() memblock_type cnt max total_size *regions name memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 struct mem_section #255 … **mem_section struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section #0 struct mem_section * #2047 … struct mem_section * #0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 . . . Initialized object Initialized object Uninitialized object P: Present, M: Memory map, O: Online, E: Early
  • 23. memblock bottom_up current_limit memory reserved memblock_type cnt max total_size *regions name memblock_region #0 base = 0x1000 size = 0x9f000 flags nid = 0 memory_present() memblock_type cnt max total_size *regions name memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 … struct mem_section #255 … **mem_section 0 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section #5 struct mem_section #0 . . . struct mem_section * #2047 … struct mem_section * #0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 . . . Initialized object Initialized object Uninitialized object
  • 24. memblock bottom_up current_limit memory reserved memblock_type cnt max total_size *regions name memblock_region #0 base = 0x1000 size = 0x9f000 flags nid = 0 memory_present() memblock_type cnt max total_size *regions name memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 … struct mem_section #255 … **mem_section 0 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section #5 struct mem_section #0 . . . struct mem_section #6 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section * #2047 … struct mem_section * #0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 . . . Initialized object Initialized object Uninitialized object
  • 25. memblock bottom_up current_limit memory reserved memblock_type cnt max total_size *regions name memblock_region #0 base = 0x1000 size = 0x9f000 flags nid = 0 memory_present() memblock_type cnt max total_size *regions name memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 struct mem_section * #2047 … struct mem_section * #0 … struct mem_section #255 … **mem_section struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section #5 struct mem_section #0 . . . struct mem_section #6 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 … struct mem_section #138 struct mem_section * #1 … struct mem_section #9 struct mem_section #0 … struct mem_section #255 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 . . . struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 . . . Initialized object Initialized object Uninitialized object
  • 26. memblock_region #0 base = 0x1000 size = 0x9f000 flags nid = 0 sparse_init_nid(): cfg mem_section_map memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 struct mem_section * #2047 … struct mem_section * #0 … struct mem_section #255 … **mem_section struct mem_section #5 struct mem_section #0 struct mem_section #6 … struct mem_section #138 struct mem_section * #1 … struct mem_section #9 struct mem_section #0 … struct mem_section #255 . . . struct page #65535 struct page #32767 struct page #0 struct page #32768 … .... ... vmemmap = VMEMMAP_START = vmemmap_base section #0 section_roots #0 section #1 section_roots #0 struct mem_section_usage #n … struct mem_section_usage #0 Per-node basis Number of available ‘struct mem_section (map_count)’. Initialized object Uninitialized object Allocate page structs for each mem_section and map them to the page table (Virtual Memory Map) Note struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 . . . struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 . . . struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=0 M=1
  • 27. memblock_region #0 base = 0x1000 size = 0x9f000 flags nid = 0 memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 struct mem_section * #2047 … struct mem_section * #0 … struct mem_section #255 … **mem_section struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 struct mem_section #5 struct mem_section #0 . . . struct mem_section #6 … struct mem_section #138 struct mem_section * #1 … struct mem_section #9 struct mem_section #0 … struct mem_section #255 struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 . . . struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 . . . struct page #0 vmemmap = VMEMMAP_START = vmemmap_base section #0, section_roots #0 section #1, section_roots #0 struct mem_section_usage #n … struct mem_section_usage #0 Per-node basis Number of available ‘struct mem_section (map_count)’. … struct page #32767 struct page #32768 … struct page #65535 … struct page #229375 … … struct page #4521984 … struct page #8388607 struct page #8388608 … struct page #8683520 section #2-6, section_roots #0 section #138-255, section_roots #0 … section #0-9, section_roots #1 Initialized object Allocated & Uninitialized object Unallocated object sparse_init_nid(): cfg mem_section_map Allocate page structs for each mem_section and map them to the page table (Virtual Memory Map) Note struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=0 M=1
  • 28. 64-bit Virtual Address — repeats the x86_64 virtual address layout from slide 15 (user space; page frame direct mapping at page_offset_base = 0xFFFF_8880_0000_0000; vmalloc/ioremap at vmalloc_base = 0xFFFF_C900_0000_0000; virtual memory map with the page frame descriptors at vmemmap_base = 0xFFFF_EA00_0000_0000; kernel text mapping, modules and fix-mapped space; Reference: Documentation/x86/x86_64/mm.rst) as context for where sparse_init_nid() places the page structs.
  • 29. struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 . . . struct page #0 vmemmap = VMEMMAP_START = vmemmap_base section #0, section_roots #0 section #1, section_roots #0 … struct page #32767 struct page #32768 … struct page #65535 … struct page #229375 … … struct page #4521984 … struct page #8388607 struct page #8388608 … struct page #8683520 section #2-6, section_roots #0 section #138-255, section_roots #0 … section #0-9, section_roots #1 Re-visit sparse memory Sparse Memory: Refer to section_mem_map Sparse Memory with vmemmap: Refer to vmemmap struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=0 M=1
  • 30. Sparse Memory Virtual Memmap: subsection 1. Introduction 2. Subsection users? 3. pageblock_flags: pageblock migration type
  • 31. Sparse Memory Virtual Memmap: subsection (1/4) SECTIONS_PER_ROOT PAGES_PER_SECTION 0 14 15 22 NR_SECTION_ROOTS 23 33 PFN 63 SECTION_SIZE_BITS = 27 PAGES_PER_SUBSECTION SUBSECTIONS_PER _SECTION 14 9 0 8 struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … • subsection_map: bitmap to indicate if the corresponding subsection is valid • pageblock_flags: pages of a subsection have the same flag (migration type) sparsemem vmemmap *only*
  • 32. Sparse Memory Virtual Memmap: subsection (2/4) Some macros are expanded manually Note
  • 33. Sparse Memory Virtual Memmap: subsection (3/4) SECTIONS_PER_ROOT PAGES_PER_SECTION 0 14 15 22 NR_SECTION_ROOTS 23 33 PFN 63 SECTION_SIZE_BITS = 27 PAGES_PER_SUBSECTION SUBSECTIONS_PER _SECTION 14 9 0 8 struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … • PAGES_PER_SUBSECTION = 512 pages ✓ 512 pages * 4KB = 2MB → 2MB huge page in x86_64
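The subsection arithmetic on these slides (SECTION_SIZE_BITS = 27, SUBSECTION_SHIFT = 9, so 64 subsections of 512 pages = 2 MB each per 128 MB section) reduces to a few shifts and masks. A sketch of the index computation that subsection_map/pfn_section_valid() rely on; subsection_index is an illustrative name, not a kernel helper:

```c
/* x86_64 defaults: 128 MB section, 2 MB subsection */
#define PFN_SECTION_SHIFT        15                /* 32768 pages per section  */
#define SUBSECTION_SHIFT         9                 /* 512 pages = 2 MB         */
#define PAGES_PER_SUBSECTION     (1UL << SUBSECTION_SHIFT)
#define SUBSECTIONS_PER_SECTION  (1UL << (PFN_SECTION_SHIFT - SUBSECTION_SHIFT))

/* Which bit of subsection_map covers this PFN: take the PFN's offset
 * within its section, then drop the within-subsection bits. */
static unsigned long subsection_index(unsigned long pfn)
{
    return (pfn & ((1UL << PFN_SECTION_SHIFT) - 1)) >> SUBSECTION_SHIFT;
}
```

The 2 MB subsection granularity is what lets device-dax/PMEM ranges that are only 2 MB-aligned be hot-added without claiming a whole 128 MB section.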
  • 34. Sparse Memory Virtual Memmap: subsection (4/4) • SUBSECTION_SIZE ✓ (1UL << 21) = 2MB → 2MB huge page in x86_64. SECTIONS_PER_ROOT PAGES_PER_SECTION 0 14 15 22 NR_SECTION_ROOTS 23 33 PFN 63 SECTION_SIZE_BITS = 27 PAGES_PER_SUBSECTION SUBSECTIONS_PER _SECTION 14 9 0 8 Some macros are expanded manually Note
  • 35. subsection: subsection_map users? struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … • init stage ✓ paging_init -> zone_sizes_init -> free_area_init -> subsection_map_init -> subsection_mask_set ➢ Set the corresponding bit map for the specific subsection • Reference stage ✓ pfn_section_valid(struct mem_section *ms, unsigned long pfn) ➢ Users ▪ [mm/page_alloc.c: 5089] free_pages -> virt_addr_valid -> __virt_addr_valid -> pfn_valid -> pfn_section_valid ▪ [drivers/char/mem.c: 416] mmap_kmem -> pfn_valid -> pfn_section_valid ➔ /dev/mem (`man mem`) ▪ … subsection_map users
  • 36. struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … • Hotplug stage ✓ Add ➢ #A1 [drivers/acpi/acpi_memhotplug.c: 311] acpi_memory_device_add -> acpi_memory_enable_device -> __add_memory -> add_memory_resource -> arch_add_memory -> add_pages -> __add_pages -> sparse_add_section -> section_activate -> fill_subsection_map -> subsection_mask_set ➢ #A2 [drivers/dax/kmem.c: 43] dev_dax_kmem_probe -> add_memory_driver_managed -> add_memory_resource -> same with #A1 ✓ Remove ➢ #R1 [drivers/acpi/acpi_memhotplug.c: 311] acpi_memory_device_remove -> __remove_memory -> try_remove_memory -> arch_remove_memory -> __remove_pages -> __remove_section -> sparse_remove_section -> section_deactivate -> clear_subsection_map ➢ #R2 [drivers/dax/kmem.c: 139] dev_dax_kmem_remove -> remove_memory -> try_remove_memory -> same with #R1 subsection_map users subsection: subsection_map users?
  • 37. pageblock_flags: pageblock migration type struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … unsigned long pageblock_flags[4] 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT [0] Dynamically allocated [1] [2] [3] subsection #0: Migration Type subsection #16: Migration Type subsection #32: Migration Type subsection #48: Migration Type Migration type is configured in setup_arch -> … -> memmap_init_zone
  • 38. pageblock_flags: pageblock migration type struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … unsigned long pageblock_flags[4] 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT [0] Dynamically allocated [1] [2] [3] subsection #0: Migration Type subsection #16: Migration Type subsection #32: Migration Type subsection #48: Migration Type
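The packing shown above — a 4-bit migration-type field per pageblock, 16 pageblocks per 64-bit word of pageblock_flags[] — is plain bit arithmetic. A simplified sketch of the get/set logic behind get_pfnblock_flags_mask()/set_pageblock_migratetype(); the kernel version additionally takes a mask and uses cmpxchg for the update:

```c
#define NR_PAGEBLOCK_BITS 4                    /* 3 migratetype bits + skip bit */
#define BLOCKS_PER_WORD   (64 / NR_PAGEBLOCK_BITS)

static unsigned long get_pageblock_mt(const unsigned long *flags, unsigned long block)
{
    unsigned long word  = flags[block / BLOCKS_PER_WORD];
    unsigned long shift = (block % BLOCKS_PER_WORD) * NR_PAGEBLOCK_BITS;

    return (word >> shift) & ((1UL << NR_PAGEBLOCK_BITS) - 1);
}

static void set_pageblock_mt(unsigned long *flags, unsigned long block,
                             unsigned long mt)
{
    unsigned long shift = (block % BLOCKS_PER_WORD) * NR_PAGEBLOCK_BITS;
    unsigned long mask  = ((1UL << NR_PAGEBLOCK_BITS) - 1) << shift;

    flags[block / BLOCKS_PER_WORD] =
        (flags[block / BLOCKS_PER_WORD] & ~mask) | (mt << shift);
}
```

With 64 pageblocks per section, this is exactly why the slide shows pageblock_flags as a 4-element unsigned long array: 64 × 4 bits = 256 bits.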
  • 39. pageblock: set migration type free_area_init print zone ranges and early memory node ranges for_each_mem_pfn_range(..) print memory range for each memblock subsection_map_init mminit_verify_pageflags_layout setup_nr_node_ids init_unavailable_mem for_each_online_node(nid) free_area_init_node node_set_state check_for_memory get_pfn_range_for_nid calculate_node_totalpages pgdat_set_deferred_range free_area_init_core free_area_init_core memmap_init for (j = 0; j < MAX_NR_ZONES; j++) memmap_init_zone subsection_map_init subsection_mask_set for (nr = start_sec; nr <= end_sec; nr++) bitmap_set calculate arch_zone_{lowest, highest}_possible_pfn[] for (pfn = start_pfn; pfn < end_pfn;) set_pageblock_migratetype __init_single_page set_pageblock_migratetype • [System init stage] each pageblock is initialized to MIGRATE_MOVABLE
  • 40. zone present_pages = 1311744 Page . . . pageblock #0 Page pageblock #1 Page pageblock #N CONFIG_HUGETLB_PAGE Number of Pages Y 512 = Huge page size N 1024 (MAX_ORDER - 1) pageblock size N = round_up(present_pages / pageblock_size) - 1 Example pageblocks = round_up(1311744 / 512) = 2562 pageblock 16 + 2544 + 2 = 2562 1 1 2 2
  • 41. pageblock_flags: pageblock migration type struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … unsigned long pageblock_flags[4] 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT [0] Dynamically allocated [1] [2] [3] subsection #0: Migration Type subsection #16: Migration Type subsection #32: Migration Type subsection #48: Migration Type [CONFIG_HUGETLB_PAGE=y] pages of subsection = pages of pageblock = 512 pages (order = 9)
• 43. page->flags layout (bit 63 … bit 0)
✓ No sparsemem or sparsemem vmemmap: | Node | Zone | … | flags |
✓ No sparsemem or sparsemem vmemmap + last_cpupid: | Node | Zone | LAST_CPUPID | … | flags |
✓ sparsemem: | Section | Node | Zone | … | flags |
✓ sparsemem + last_cpupid: | Section | Node | Zone | LAST_CPUPID | … | flags |
✓ sparsemem wo/ node: | Section | Zone | … | flags |
Note
1. last_cpupid: support for NUMA balancing (NUMA-optimizing scheduler)
2. sparsemem: enabled by CONFIG_SPARSEMEM
• 44. page->flags layout: sparsemem vmemmap + last_cpupid
✓ Kernel configuration: qemu – v5.11 kernel
  CONFIG_NUMA_BALANCING=y
  CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
  CONFIG_NR_CPUS=64
  CONFIG_NODES_SHIFT=10
  CONFIG_SPARSEMEM_MANUAL=y
  CONFIG_SPARSEMEM=y
  CONFIG_NEED_MULTIPLE_NODES=y
  CONFIG_SPARSEMEM_EXTREME=y
  CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
  CONFIG_SPARSEMEM_VMEMMAP=y
  # CONFIG_KASAN is not set
✓ Layout (bit 63 → bit 0): | Node: bits 54-63 (10 bits, CONFIG_NODES_SHIFT) | Zone: bits 52-53 (2 bits) | LAST_CPUPID: bits 38-51 (14 bits) | unused | flags (enum pageflags): bits 0-22 (23-bit pageflags) |
• 46. page->flags: section field (sparsemem wo/ vmemmap)
✓ sparsemem + last_cpupid layout: | Section | Node | Zone | LAST_CPUPID | … | flags |
✓ Sparse Memory: the section field in page->flags is used to refer to section_mem_map
• 47. Memory Model – Sparse Memory (sparsemem wo/ vmemmap)
✓ struct mem_section **mem_section → array of struct mem_section * → struct mem_section entries; each section_mem_map refers to the page structures (struct page #0 … struct page #m+n-1) covering that section's page frames, across Node #0, Node #1 (hotplug), …
Note
1. [section_mem_map] Dynamic page structure: pre-allocate page structures based on the number of available page frames
  ✓ Refer from: memblock structure
2. Support physical memory hotplug
3. Minimum unit: mem_section - PAGES_PER_SECTION = 32768
  ✓ Each memory section addresses the memory size: 32768 * 4KB (page size) = 128MB
4. [NUMA] Reduce the memory-hole impact, since only sections with present memory need a "struct mem_section"
• 51. /sys/devices/system/memory/block_size_bytes
✓ Decision flow (source code: arch/x86/mm/init_64.c: probe_memory_block_size(); SGI UV system platform ignored):
  ▪ System memory < 64GB? Y → block_size_bytes = 0x800_0000 (MIN_MEMORY_BLOCK_SIZE = 128 MB)
  ▪ N → !X86_FEATURE_HYPERVISOR (bare metal)? Y → block_size_bytes = 0x8000_0000 (MAX_BLOCK_SIZE = 2 GB)
  ▪ N → Find the largest allowed block size that aligns to memory end (check 'max_pfn'); range: 0x8000_0000 down to 0x800_0000
✓ QEMU – Guest OS: X86_FEATURE_HYPERVISOR is set, so it takes the "find the largest block size that aligns to memory end" path