| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The config is in fact an additional upper limit of pageblock_order, so
rename it to avoid confusion.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: Juan Yescas <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: "Isaac J. Manjarres" <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: T.J. Mercier <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- "zram: support algorithm-specific parameters" from Sergey Senozhatsky
adds infrastructure for passing algorithm-specific parameters into
zram. A single parameter `winbits' is implemented at this time.
- "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg
charging nmi-safe, which is required by BFP, which can operate in NMI
context.
- "Some random fixes and cleanup to shmem" from Kemeng Shi implements
small fixes and cleanups in the shmem code.
- "Skip mm selftests instead when kernel features are not present" from
Zi Yan fixes some issues in the MM selftest code.
- "mm/damon: build-enable essential DAMON components by default" from
SeongJae Park reworks DAMON Kconfig to make it easier to enable
CONFIG_DAMON.
- "sched/numa: add statistics of numa balance task migration" from Libo
Chen adds more info into sysfs and procfs files to improve visibility
into the NUMA balancer's task migration activity.
- "selftests/mm: cow and gup_longterm cleanups" from Mark Brown
provides various updates to some of the MM selftests to make them
play better with the overall containing framework.
* tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits)
mm/khugepaged: clean up refcount check using folio_expected_ref_count()
selftests/mm: fix test result reporting in gup_longterm
selftests/mm: report unique test names for each cow test
selftests/mm: add helper for logging test start and results
selftests/mm: use standard ksft_finished() in cow and gup_longterm
selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled
sched/numa: add statistics of numa balance task
sched/numa: fix task swap by skipping kernel threads
tools/testing: check correct variable in open_procmap()
tools/testing/vma: add missing function stub
mm/gup: update comment explaining why gup_fast() disables IRQs
selftests/mm: two fixes for the pfnmap test
mm/khugepaged: fix race with folio split/free using temporary reference
mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order
mmu_notifiers: remove leftover stub macros
selftests/mm: deduplicate test names in madv_populate
kcov: rust: add flags for KCOV with Rust
mm: rust: make CONFIG_MMU ifdefs more narrow
mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables()
mm/damon/Kconfig: enable CONFIG_DAMON by default
...
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Problem: On large page size configurations (16KiB, 64KiB), the CMA
alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
and this causes the CMA reservations to be larger than necessary. This
means that system will have less available MIGRATE_UNMOVABLE and
MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fallback to them.
The CMA_MIN_ALIGNMENT_BYTES increases because it depends on MAX_PAGE_ORDER
which depends on ARCH_FORCE_MAX_ORDER. The value of ARCH_FORCE_MAX_ORDER
increases on 16k and 64k kernels.
For example, in ARM, the CMA alignment requirement when:
- CONFIG_ARCH_FORCE_MAX_ORDER default value is used
- CONFIG_TRANSPARENT_HUGEPAGE is set:
PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
-----------------------------------------------------------------------
4KiB | 10 | 9 | 4KiB * (2 ^ 9) = 2MiB
16Kib | 11 | 11 | 16KiB * (2 ^ 11) = 32MiB
64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB
There are some extreme cases for the CMA alignment requirement when:
- CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
- CONFIG_TRANSPARENT_HUGEPAGE is NOT set:
- CONFIG_HUGETLB_PAGE is NOT set
PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
------------------------------------------------------------------------
4KiB | 15 | 15 | 4KiB * (2 ^ 15) = 128MiB
16Kib | 13 | 13 | 16KiB * (2 ^ 13) = 128MiB
64KiB | 13 | 13 | 64KiB * (2 ^ 13) = 512MiB
This affects the CMA reservations for the drivers. If a driver in a
4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal
reservation has to be 32MiB due to the alignment requirements:
reserved-memory {
...
cma_test_reserve: cma_test_reserve {
compatible = "shared-dma-pool";
size = <0x0 0x400000>; /* 4 MiB */
...
};
};
reserved-memory {
...
cma_test_reserve: cma_test_reserve {
compatible = "shared-dma-pool";
size = <0x0 0x2000000>; /* 32 MiB */
...
};
};
Solution: Add a new config CONFIG_PAGE_BLOCK_ORDER that allows to set the
page block order in all the architectures. The maximum page block order
will be given by ARCH_FORCE_MAX_ORDER.
By default, CONFIG_PAGE_BLOCK_ORDER will have the same value that
ARCH_FORCE_MAX_ORDER. This will make sure that current kernel
configurations won't be affected by this change. It is a opt-in change.
This patch will allow to have the same CMA alignment requirements for
large page sizes (16KiB, 64KiB) as that in 4kb kernels by setting a lower
pageblock_order.
Tests:
- Verified that HugeTLB pages work when pageblock_order is 1, 7, 10 on
4k and 16k kernels.
- Verified that Transparent Huge Pages work when pageblock_order is 1,
7, 10 on 4k and 16k kernels.
- Verified that dma-buf heaps allocations work when pageblock_order is
1, 7, 10 on 4k and 16k kernels.
Benchmarks:
The benchmarks compare 16kb kernels with pageblock_order 10 and 7. The
reason for the pageblock_order 7 is because this value makes the min CMA
alignment requirement the same as that in 4kb kernels (2MB).
- Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
(https://developer.android.com/ndk/guides/simpleperf) to measure the #
of instructions and page-faults on 16k kernels. The benchmark was
executed 10 times. The averages are below:
# instructions | #page-faults
order 10 | order 7 | order 10 | order 7
--------------------------------------------------------
13,891,765,770 | 11,425,777,314 | 220 | 217
14,456,293,487 | 12,660,819,302 | 224 | 219
13,924,261,018 | 13,243,970,736 | 217 | 221
13,910,886,504 | 13,845,519,630 | 217 | 221
14,388,071,190 | 13,498,583,098 | 223 | 224
13,656,442,167 | 12,915,831,681 | 216 | 218
13,300,268,343 | 12,930,484,776 | 222 | 218
13,625,470,223 | 14,234,092,777 | 219 | 218
13,508,964,965 | 13,432,689,094 | 225 | 219
13,368,950,667 | 13,683,587,37 | 219 | 225
-------------------------------------------------------------------
13,803,137,433 | 13,131,974,268 | 220 | 220 Averages
There were 4.85% #instructions when order was 7, in comparison with order
10.
13,803,137,433 - 13,131,974,268 = -671,163,166 (-4.86%)
The number of page faults in order 7 and 10 were the same.
These results didn't show any significant regression when the
pageblock_order is set to 7 on 16kb kernels.
- Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
on the 16k kernels with pageblock_order 7 and 10.
order 10 | order 7 | order 7 - order 10 | (order 7 - order 10) %
-------------------------------------------------------------------
15.8 | 16.4 | 0.6 | 3.80%
16.4 | 16.2 | -0.2 | -1.22%
16.6 | 16.3 | -0.3 | -1.81%
16.8 | 16.3 | -0.5 | -2.98%
16.6 | 16.8 | 0.2 | 1.20%
-------------------------------------------------------------------
16.44 16.4 -0.04 -0.24% Averages
The results didn't show any significant regression when the
pageblock_order is set to 7 on 16kb kernels.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Juan Yescas <[email protected]>
Acked-by: Zi Yan <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |\|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
creating a pte which addresses the first page in a folio and reduces
the amount of plumbing which architecture must implement to provide
this.
- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
largely unrelated folio infrastructure changes which clean things up
and better prepare us for future work.
- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
Price adds early-init code to prevent x86 from leaving physical
memory unused when physical address regions are not aligned to memory
block size.
- "mm/compaction: allow more aggressive proactive compaction" from
Michal Clapinski provides some tuning of the (sadly, hard-coded (more
sadly, not auto-tuned)) thresholds for our invokation of proactive
compaction. In a simple test case, the reduction of a guest VM's
memory consumption was dramatic.
- "Minor cleanups and improvements to swap freeing code" from Kemeng
Shi provides some code cleaups and a small efficiency improvement to
this part of our swap handling code.
- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
adds the ability for a ptracer to modify syscalls arguments. At this
time we can alter only "system call information that are used by
strace system call tampering, namely, syscall number, syscall
arguments, and syscall return value.
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
against /proc/pid/pagemap. This permits CRIU to more efficiently get
at the info about guard regions.
- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
implements that fix. No runtime effect is expected because
validate_page_before_insert() happens to fix up this error.
- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
Hildenbrand basically brings uprobe text poking into the current
decade. Remove a bunch of hand-rolled implementation in favor of
using more current facilities.
- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
Khandual provides enhancements and generalizations to the pte dumping
code. This might be needed when 128-bit Page Table Descriptors are
enabled for ARM.
- "Always call constructor for kernel page tables" from Kevin Brodsky
ensures that the ctor/dtor is always called for kernel pgtables, as
it already is for user pgtables.
This permits the addition of more functionality such as "insert hooks
to protect page tables". This change does result in various
architectures performing unnecesary work, but this is fixed up where
it is anticipated to occur.
- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
Ryhl adds plumbing to permit Rust access to core MM structures.
- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
Stoakes takes advantage of some VMA merging opportunities which we've
been missing for 15 years.
- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
SeongJae Park optimizes process_madvise()'s TLB flushing.
Instead of flushing each address range in the provided iovec, we
batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from
Sidhartha Kumar makes the maple tree smarter about its node
preallocation.
stress-ng mmap performance increased by single-digit percentages and
the amount of unnecessarily preallocated memory was dramaticelly
reduced.
- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
a few unnecessary things which Baoquan noted when reading the code.
- ""Enhance sysfs handling for memory hotplug in weighted interleave"
from Rakie Kim "enhances the weighted interleave policy in the memory
management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug
support". Fixes things on error paths which we are unlikely to hit.
- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
from SeongJae Park introduces new DAMOS quota goal metrics which
eliminate the manual tuning which is required when utilizing DAMON
for memory tiering.
- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
provides cleanups and small efficiency improvements which Baoquan
found via code inspection.
- "vmscan: enforce mems_effective during demotion" from Gregory Price
changes reclaim to respect cpuset.mems_effective during demotion when
possible. because presently, reclaim explicitly ignores
cpuset.mems_effective when demoting, which may cause the cpuset
settings to violated.
This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently.
- "Clean up split_huge_pmd_locked() and remove unnecessary folio
pointers" from Gavin Guo provides minor cleanups and efficiency gains
in in the huge page splitting and migrating code.
- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
for `struct mem_cgroup', yielding improved memory utilization.
- "add max arg to swappiness in memory.reclaim and lru_gen" from
Zhongkun He adds a new "max" argument to the "swappiness=" argument
for memory.reclaim MGLRU's lru_gen.
This directs proactive reclaim to reclaim from only anon folios
rather than file-backed folios.
- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
first step on the path to permitting the kernel to maintain existing
VMs while replacing the host kernel via file-based kexec. At this
time only memblock's reserve_mem is preserved.
- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
and uses a smarter way of looping over a pfn range. By skipping
ranges of invalid pfns.
- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
when a task is pinned a single NUMA mode.
Dramatic performance benefits were seen in some real world cases.
- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
Garg addresses a warning which occurs during memory compaction when
using JFS.
- "move all VMA allocation, freeing and duplication logic to mm" from
Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
appropriate mm/vma.c.
- "mm, swap: clean up swap cache mapping helper" from Kairui Song
provides code consolidation and cleanups related to the folio_index()
function.
- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.
- "memcg: Fix test_memcg_min/low test failures" from Waiman Long
addresses some bogus failures which are being reported by the
test_memcontrol selftest.
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
Stoakes commences the deprecation of file_operations.mmap() in favor
of the new file_operations.mmap_prepare().
The latter is more restrictive and prevents drivers from messing with
things in ways which, amongst other problems, may defeat VMA merging.
- "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
the per-cpu memcg charge cache from the objcg's one.
This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- "mm/damon: minor fixups and improvements for code, tests, and
documents" from SeongJae Park is yet another batch of miscellaneous
DAMON changes. Fix and improve minor problems in code, tests and
documents.
- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
stats to be irq safe. Another step along the way to making memcg
charging and stats updates NMI-safe, a BPF requirement.
- "Let unmap_hugepage_range() and several related functions take folio
instead of page" from Fan Ni provides folio conversions in the
hugetlb code.
* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
mm: pcp: increase pcp->free_count threshold to trigger free_high
mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
mm/hugetlb: pass folio instead of page to unmap_ref_private()
memcg: objcg stock trylock without irq disabling
memcg: no stock lock for cpu hot-unplug
memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
memcg: make count_memcg_events re-entrant safe against irqs
memcg: make mod_memcg_state re-entrant safe against irqs
memcg: move preempt disable to callers of memcg_rstat_updated
memcg: memcg_rstat_updated re-entrant safe against irqs
mm: khugepaged: decouple SHMEM and file folios' collapse
selftests/eventfd: correct test name and improve messages
alloc_tag: check mem_profiling_support in alloc_tag_init
Docs/damon: update titles and brief introductions to explain DAMOS
selftests/damon/_damon_sysfs: read tried regions directories in order
mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
...
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently, memmap_init initializes pfn_hole with 0 instead of
ARCH_PFN_OFFSET. Then init_unavailable_range will start iterating each
page from the page at address zero to the first available page, but it
won't do anything for pages below ARCH_PFN_OFFSET because pfn_valid
won't pass.
If ARCH_PFN_OFFSET is very large (e.g., something like 2^64-2GiB if the
kernel is used as a library and loaded at a very high address), the
pointless iteration for pages below ARCH_PFN_OFFSET will take a very long
time, and the kernel will look stuck at boot time.
Use for_each_valid_pfn() to skip the pointless iterations.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Woodhouse <[email protected]>
Reported-by: Ruihan Li <[email protected]>
Suggested-by: Mike Rapoport <[email protected]>
Reviewed-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Ruihan Li <[email protected]>
Tested-by: Lorenzo Stoakes <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Marc Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Patch series "mm: Introduce for_each_valid_pfn()", v4.
There are cases where a naïve loop over a PFN range, calling pfn_valid()
on each one, is horribly inefficient. Ruihan Li reported the case where
memmap_init() iterates all the way from zero to a potentially large value
of ARCH_PFN_OFFSET, and we at Amazon found the reserve_bootmem_region()
one as it affects hypervisor live update. Others are more cosmetic.
By introducing a for_each_valid_pfn() helper it can optimise away a lot of
pointless calls to pfn_valid(), skipping immediately to the next valid PFN
and also skipping *all* checks within a valid (sub)region according to the
granularity of the memory model in use.
This patch (of 7)
Especially since commit 9092d4f7a1f8 ("memblock: update initialization of
reserved pages"), the reserve_bootmem_region() function can spend a
significant amount of time iterating over every 4KiB PFN in a range,
calling pfn_valid() on each one, and ultimately doing absolutely nothing.
On a platform used for virtualization, with large NOMAP regions that
eventually get used for guest RAM, this leads to a significant increase in
steal time experienced during kexec for a live update.
Introduce for_each_valid_pfn() and use it from reserve_bootmem_region().
This implementation is precisely the same naïve loop that the functio
used to have, but subsequent commits will provide optimised versions for
FLATMEM and SPARSEMEM, and this version will remain for those
architectures which provide their own pfn_valid() implementation,
until/unless they also provide a matching for_each_valid_pfn().
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Woodhouse <[email protected]>
Reviewed-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Marc Rutland <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Ruihan Li <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add the infrastructure to generate Kexec HandOver metadata. Kexec
HandOver is a mechanism that allows Linux to preserve state - arbitrary
properties as well as memory locations - across kexec.
It does so using 2 concepts:
1) KHO FDT - Every KHO kexec carries a KHO specific flattened device tree
blob that describes preserved memory regions. Device drivers can
register to KHO to serialize and preserve their states before kexec.
2) Scratch Regions - CMA regions that we allocate in the first kernel.
CMA gives us the guarantee that no handover pages land in those
regions, because handover pages must be at a static physical memory
location. We use these regions as the place to load future kexec
images so that they won't collide with any handover data.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Graf <[email protected]>
Co-developed-by: Mike Rapoport (Microsoft) <[email protected]>
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Co-developed-by: Pratyush Yadav <[email protected]>
Signed-off-by: Pratyush Yadav <[email protected]>
Co-developed-by: Changyuan Lyu <[email protected]>
Signed-off-by: Changyuan Lyu <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Anthony Yznaga <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Ashish Kalra <[email protected]>
Cc: Ben Herrenschmidt <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Gowans <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Marc Rutland <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Saravana Kannan <[email protected]>
Cc: Stanislav Kinsburskii <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Thomas Lendacky <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
With deferred initialization of struct page it will be necessary to
initialize memory map for KHO scratch regions early.
Add memmap_init_kho_scratch() method that will allow such initialization
in upcoming patches.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Signed-off-by: Changyuan Lyu <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Anthony Yznaga <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Ashish Kalra <[email protected]>
Cc: Ben Herrenschmidt <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Gowans <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Marc Rutland <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Pratyush Yadav <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Saravana Kannan <[email protected]>
Cc: Stanislav Kinsburskii <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Thomas Lendacky <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Corrected minor typos in comments:
- 'contigious' -> 'contiguous'
- 'hierarcy' -> 'hierarchy'
This is a non-functional change in comment text only.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Prabhav Kumar Vaish <[email protected]>
Reviewed-by: Mike Rapoport (Microsoft) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |\ \
| |/
|/|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull compiler version requirement update from Arnd Bergmann:
"Require gcc-8 and binutils-2.30
x86 already uses gcc-8 as the minimum version, this changes all other
architectures to the same version. gcc-8 is used is Debian 10 and Red
Hat Enterprise Linux 8, both of which are still supported, and
binutils 2.30 is the oldest corresponding version on those.
Ubuntu Pro 18.04 and SUSE Linux Enterprise Server 15 both use gcc-7 as
the system compiler but additionally include toolchains that remain
supported.
With the new minimum toolchain versions, a number of workarounds for
older versions can be dropped, in particular on x86_64 and arm64.
Importantly, the updated compiler version allows removing two of the
five remaining gcc plugins, as support for sancov and structeak
features is already included in modern compiler versions.
I tried collecting the known changes that are possible based on the
new toolchain version, but expect that more cleanups will be possible.
Since this touches multiple architectures, I merged the patches
through the asm-generic tree."
* tag 'gcc-minimum-version-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
Makefile.kcov: apply needed compiler option unconditionally in CFLAGS_KCOV
Documentation: update binutils-2.30 version reference
gcc-plugins: remove SANCOV gcc plugin
Kbuild: remove structleak gcc plugin
arm64: drop binutils version checks
raid6: skip avx512 checks
kbuild: require gcc-8 and binutils-2.30
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
gcc-12 and higher support the -ftrivial-auto-var-init= flag, after
gcc-8 is the minimum version, this is half of the supported ones, and
the vast majority of the versions that users are actually likely to
have, so it seems like a good time to stop having the fallback
plugin implementation
Older toolchains are still able to build kernels normally without
this plugin, but won't be able to use variable initialization..
Signed-off-by: Arnd Bergmann <[email protected]>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The page allocator tracks the number of zones that have unaccepted memory
using static_branch_enc/dec() and uses that static branch in hot paths to
determine if it needs to deal with unaccepted memory.
Borislav and Thomas pointed out that the tracking is racy: operations on
static_branch are not serialized against adding/removing unaccepted pages
to/from the zone.
Sanity checks inside static_branch machinery detects it:
WARNING: CPU: 0 PID: 10 at kernel/jump_label.c:276 __static_key_slow_dec_cpuslocked+0x8e/0xa0
The comment around the WARN() explains the problem:
/*
* Warn about the '-1' case though; since that means a
* decrement is concurrent with a first (0->1) increment. IOW
* people are trying to disable something that wasn't yet fully
* enabled. This suggests an ordering problem on the user side.
*/
The effect of this static_branch optimization is only visible on
microbenchmark.
Instead of adding more complexity around it, remove it altogether.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory")
Link: https://lore.kernel.org/all/20250506092445.GBaBnVXXyvnazly6iF@fat_crate.local
Reported-by: Borislav Petkov <[email protected]>
Tested-by: Borislav Petkov (AMD) <[email protected]>
Reported-by: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Brendan Jackman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: <[email protected]> [6.5+]
Signed-off-by: Andrew Morton <[email protected]>
|
| |/
|
|
|
|
|
|
|
|
|
|
|
| |
set_high_memory() touches arch_zone_lowest_possible_pfn which is
marked as __initdata, which creates a section mismatch.
Since the only user of the function is free_area_init() which is also marked
as __init, mark set_high_memory() as __init as well.
Signed-off-by: Oscar Salvador <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When the last page in the zone is accepted, __accept_page() calls
static_branch_dec(). This function takes cpu_hotplug_lock, which can lead
to a deadlock if the allocation occurs during CPU bringup path as
_cpu_up() also takes the lock.
To prevent this deadlock, defer static_branch_dec() to a workqueue.
Call static_branch_dec() only when the workqueue is not yet initialized.
Workqueues are initialized before CPU bring up, so this will not conflict
with the first scenario.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 55ad43e8ba0f ("mm: add a helper to accept page")
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Srikanth Aithal <[email protected]>
Tested-by: Srikanth Aithal <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ashish Kalra <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Edgecombe, Rick P" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: "Mike Rapoport (IBM)" <[email protected]>
Cc: Thomas Lendacky <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Patch series "mm: fixes for fallouts from mem_init() cleanup".
These are the fixes for fallouts from mem_init() cleanup reported by
Nathan Chancellor and kbuild. The details are in the commit messages.
This patch (of 2):
Kernel test robot reports the following crash on 32-bit system with
FLATMEM and DEBUG_VM_PGFLAGS enabled:
[ 0.478822][ T0] kernel BUG at include/linux/page-flags.h:536!
[ 0.479312][ T0] Oops: invalid opcode: 0000 [#1] PREEMPT SMP
[ 0.479768][ T0] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.14.0-rc6-00357-g8268af309d07 #1
[ 0.480470][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 0.481260][ T0] EIP: reserve_bootmem_region (include/linux/page-flags.h:536)
[ 0.481683][ T0] Code: 5d c3 01 f1 89 c8 ba e1 38 f4 c3 e8 1e 37 8e fc 0f 0b b8 90 e2 62 c4 e8 e2 05 5e fc 01 f1 89 c8 ba be 85 f7 c3 e8 04 37 8e fc <0f> 0b b8 80 e2 62 c4 e8 c8 05 5e fc 55 89 e5 53 57 56 83 ec 10 89
[ 0.483177][ T0] EAX: 00000000 EBX: c425df50 ECX: 00000000 EDX: 00000000
[ 0.483712][ T0] ESI: 017ffc00 EDI: ffffffff EBP: c425df34 ESP: c425df2c
[ 0.484248][ T0] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
[ 0.484846][ T0] CR0: 80050033 CR2: 00000000 CR3: 04b48000 CR4: 00000090
[ 0.485376][ T0] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 0.485907][ T0] DR6: fffe0ff0 DR7: 00000400
[ 0.486253][ T0] Call Trace:
[ 0.486494][ T0] ? __die_body (arch/x86/kernel/dumpstack.c:478)
[ 0.486822][ T0] ? die (arch/x86/kernel/dumpstack.c:?)
[ 0.487099][ T0] ? do_trap (arch/x86/kernel/traps.c:? arch/x86/kernel/traps.c:197)
[ 0.487409][ T0] ? do_error_trap (arch/x86/kernel/traps.c:217)
[ 0.487752][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536)
[ 0.488153][ T0] ? exc_overflow (arch/x86/kernel/traps.c:301)
[ 0.488490][ T0] ? handle_invalid_op (arch/x86/kernel/traps.c:254)
[ 0.488869][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536)
[ 0.489271][ T0] ? exc_invalid_op (arch/x86/kernel/traps.c:316)
[ 0.489619][ T0] ? handle_exception (arch/x86/entry/entry_32.S:1055)
[ 0.489996][ T0] ? exc_overflow (arch/x86/kernel/traps.c:301)
[ 0.490332][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536)
[ 0.490733][ T0] ? exc_overflow (arch/x86/kernel/traps.c:301)
[ 0.491068][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536)
[ 0.491470][ T0] memmap_init_reserved_pages (mm/memblock.c:2203)
[ 0.491887][ T0] free_low_memory_core_early (mm/memblock.c:?)
[ 0.492302][ T0] memblock_free_all (mm/memblock.c:2272 include/linux/atomic/atomic-arch-fallback.h:546 include/linux/atomic/atomic-long.h:123 include/linux/atomic/atomic-instrumented.h:3261 include/linux/mm.h:67 mm/memblock.c:2273)
[ 0.492659][ T0] mem_init (arch/x86/mm/init_32.c:735)
[ 0.492952][ T0] mm_core_init (mm/mm_init.c:2730)
[ 0.493271][ T0] start_kernel (init/main.c:958)
[ 0.493604][ T0] i386_start_kernel (arch/x86/kernel/head32.c:79)
[ 0.493969][ T0] startup_32_smp (arch/x86/kernel/head_32.S:292)
The crash happens because after commit 8268af309d07 ("arch, mm: set
max_mapnr when allocating memory map for FLATMEM") max_mapnr is rounded up
to MAX_ORDER_NR_PAGES and the pages in the end of the memory map are
passing pfn_valid() check in reserve_bootmem_region().
Make sure that that pages in the end of the memory map are initialized,
just like the pages in the end of the last section for SPARSEMEM.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 8268af309d07 ("arch, mm: set max_mapnr when allocating memory map for FLATMEM")
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, init_reserved_page()
function performs initialization of a struct page that would have been
deferred normally.
Rename it to init_deferred_page() to better reflect what the function does.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Cc: Frank van der Linden <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Changyuan Lyu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
__init_reserved_page_zone() function finds the zone for pfn and nid and
performs initialization of a struct page with that zone and nid. There is
nothing in that function about reserved pages and it is misnamed.
Rename it to __init_page_from_nid() to better reflect what the function
does.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Cc: Frank van der Linden <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The point where the memory is released from memblock to the buddy
allocator is hidden inside arch-specific mem_init()s and the call to
memblock_free_all() is needlessly duplicated in every artiste cure and
after introduction of arch_mm_preinit() hook, mem_init() implementation on
many architecture only contains the call to memblock_free_all().
Pull memblock_free_all() call into mm_core_init() and drop mem_init() on
relevant architectures to make it more explicit where the free memory is
released from memblock to the buddy allocator and to reduce code
duplication in architecture specific code.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: Dave Hansen <[email protected]> [x86]
Acked-by: Geert Uytterhoeven <[email protected]> [m68k]
Tested-by: Mark Brown <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Guo Ren (csky) <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russel King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, implementation of mem_init() in every architecture consists of
one or more of the following:
* initializations that must run before page allocator is active, for
instance swiotlb_init()
* a call to memblock_free_all() to release all the memory to the buddy
allocator
* initializations that must run after page allocator is ready and there is
no arch-specific hook other than mem_init() for that, like for example
register_page_bootmem_info() in x86 and sparc64 or simple setting of
mem_init_done = 1 in several architectures
* a bunch of semi-related stuff that apparently had no better place to
live, for example a ton of BUILD_BUG_ON()s in parisc.
Introduce arch_mm_preinit() that will be the first thing called from
mm_core_init(). On architectures that have initializations that must happen
before the page allocator is ready, move those into arch_mm_preinit() along
with the code that does not depend on ordering with page allocator setup.
On several architectures this results in reduction of mem_init() to a
single call to memblock_free_all() that allows its consolidation next.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: Dave Hansen <[email protected]> [x86]
Tested-by: Mark Brown <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Guo Ren (csky) <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russel King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
high_memory defines upper bound on the directly mapped memory. This bound
is defined by the beginning of ZONE_HIGHMEM when a system has high memory
and by the end of memory otherwise.
All this is known to generic memory management initialization code that
can set high_memory while initializing core mm structures.
Add a generic calculation of high_memory to free_area_init() and remove
per-architecture calculation except for the architectures that set and use
high_memory earlier than that.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: Dave Hansen <[email protected]> [x86]
Tested-by: Mark Brown <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Guo Ren (csky) <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russel King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
max_mapnr is essentially the size of the memory map for systems that use
FLATMEM. There is no reason to calculate it in each and every architecture
when it's anyway calculated in alloc_node_mem_map().
Drop setting of max_mapnr from architecture code and set it once in
alloc_node_mem_map().
While on it, move definition of mem_map and max_mapnr to mm/mm_init.c so
there won't be two copies for MMU and !MMU variants.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Acked-by: Dave Hansen <[email protected]> [x86]
Tested-by: Mark Brown <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Betkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Guo Ren (csky) <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russel King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleinxer <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently fs dax pages are considered free when the refcount drops to one
and their refcounts are not increased when mapped via PTEs or decreased
when unmapped. This requires special logic in mm paths to detect that
these pages should not be properly refcounted, and to detect when the
refcount drops to one instead of zero.
On the other hand get_user_pages(), etc. will properly refcount fs dax
pages by taking a reference and dropping it when the page is unpinned.
Tracking this special behaviour requires extra PTE bits (eg. pte_devmap)
and introduces rules that are potentially confusing and specific to FS DAX
pages. To fix this, and to possibly allow removal of the special PTE bits
in future, convert the fs dax page refcounts to be zero based and instead
take a reference on the page each time it is mapped as is currently the
case for normal pages.
This may also allow a future clean-up to remove the pgmap refcounting that
is currently done in mm/gup.c.
Link: https://lkml.kernel.org/r/c7d886ad7468a20452ef6e0ddab6cfe220874e7c.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Tested-by: Alison Schofield <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Asahi Lina <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Chunyan Zhang <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: linmiaohe <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michael "Camp Drill Sergeant" Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Ted Ts'o <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Zone device pages are used to represent various type of device memory
managed by device drivers. Currently compound zone device pages are not
supported. This is because MEMORY_DEVICE_FS_DAX pages are the only user
of higher order zone device pages and have their own page reference
counting.
A future change will unify FS DAX reference counting with normal page
reference counting rules and remove the special FS DAX reference counting.
Supporting that requires compound zone device pages.
Supporting compound zone device pages requires compound_head() to
distinguish between head and tail pages whilst still preserving the
special struct page fields that are specific to zone device pages.
A tail page is distinguished by having bit zero being set in
page->compound_head, with the remaining bits pointing to the head page.
For zone device pages page->compound_head is shared with page->pgmap.
The page->pgmap field must be common to all pages within a folio, even if
the folio spans memory sections. Therefore pgmap is the same for both
head and tail pages and can be moved into the folio and we can use the
standard scheme to find compound_head from a tail page.
Link: https://lkml.kernel.org/r/67055d772e6102accf85161d0b57b0b3944292bf.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Tested-by: Alison Schofield <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Asahi Lina <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Chunyan Zhang <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: linmiaohe <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michael "Camp Drill Sergeant" Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Ted Ts'o <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently ZONE_DEVICE page reference counts are initialised by core memory
management code in __init_zone_device_page() as part of the memremap()
call which driver modules make to obtain ZONE_DEVICE pages. This
initialises page refcounts to 1 before returning them to the driver.
This was presumably done because it drivers had a reference of sorts on
the page. It also ensured the page could always be mapped with
vm_insert_page() for example and would never get freed (ie. have a zero
refcount), freeing drivers of manipulating page reference counts.
However it complicates figuring out whether or not a page is free from the
mm perspective because it is no longer possible to just look at the
refcount. Instead the page type must be known and if GUP is used a
secondary pgmap reference is also sometimes needed.
To simplify this it is desirable to remove the page reference count for
the driver, so core mm can just use the refcount without having to account
for page type or do other types of tracking. This is possible because
drivers can always assume the page is valid as core kernel will never
offline or remove the struct page.
This means it is now up to drivers to initialise the page refcount as
required. P2PDMA uses vm_insert_page() to map the page, and that requires
a non-zero reference count when initialising the page so set that when the
page is first mapped.
Link: https://lkml.kernel.org/r/6aedb0ac2886dcc4503cb705273db5b3863a0b66.1740713401.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Tested-by: Alison Schofield <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Asahi Lina <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Chunyan Zhang <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: linmiaohe <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Matthew Wilcow (Oracle) <[email protected]>
Cc: Michael "Camp Drill Sergeant" Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Ted Ts'o <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It can be desirable to reserve memory in a CMA area before it is
activated, early in boot. Such reservations would effectively be memblock
allocations, but they can be returned to the CMA area later. This
functionality can be used to allow hugetlb bootmem allocations from a
hugetlb CMA area.
A new interface, cma_reserve_early is introduced. This allows for
pageblock-aligned reservations. These reservations are skipped during the
initial handoff of pages in a CMA area to the buddy allocator. The caller
is responsible for making sure that the page structures are set up, and
that the migrate type is set correctly, as with other memblock allocations
that stick around. If the CMA area fails to activate (because it
intersects with multiple zones), the reserved memory is not given to the
buddy allocator, the caller needs to take care of that.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Frank van der Linden <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Roman Gushchin (Cruise) <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Bootmem hugetlb pages are allocated using memblock, which isn't (and
mostly can't be) aware of zones.
So, they may end up crossing zone boundaries. This would create
confusion, a hugetlb page that is part of multiple zones is bad. Worse,
HVO might then end up stealthily re-assigning pages to a different zone
when a hugetlb page is freed, since the tail page structures beyond the
first vmemmap page would inherit the zone of the first page structures.
While the chance of this happening is low, you can definitely create a
configuration where this happens (especially using ZONE_MOVABLE).
To avoid this issue, check if bootmem hugetlb pages intersect with
multiple zones during the gather phase, and discard them, handing them to
the page allocator, if they do. Record the number of invalid bootmem
pages per node and subtract them from the number of available pages at the
end, making it easier to do these checks in multiple places later on.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Frank van der Linden <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Roman Gushchin (Cruise) <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Sometimes page structs must be unconditionally initialized as reserved,
regardless of DEFERRED_STRUCT_PAGE_INIT.
Define a function, __init_reserved_page_zone, containing code that already
did all of the work in init_reserved_page, and make it available for use.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Frank van der Linden <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Roman Gushchin (Cruise) <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add functions that are called just before the per-section memmap is
initialized and just before the memmap page structures are initialized.
They are called sparse_vmemmap_init_nid_early and
sparse_vmemmap_init_nid_late, respectively.
This allows for mm subsystems to add calls to initialize memmap and page
structures in a specific way, if using SPARSEMEM_VMEMMAP. Specifically,
hugetlb can pre-HVO bootmem allocated pages that way, so that no time and
resources are wasted on allocating vmemmap pages, only to free them later
(and possibly unnecessarily running the system out of memory in the
process).
Refactor some code and export a few convenience functions for external
use.
In sparse_init_nid, skip any sections that are already initialized, e.g.
they have been initialized by sparse_vmemmap_init_nid_early already.
The hugetlb code to use these functions will be added in a later commit.
Export section_map_size, as any alternate memmap init code will want to
use it.
The internal config option to enable this is SPARSEMEM_VMEMMAP_PREINIT,
which is selected if an architecture-specific option,
ARCH_WANT_HUGETLB_VMEMMAP_PREINIT, is set. In the future, if other
subsystems want to do preinit too, they can do it in a similar fashion.
The internal config option is there because a section flag is used, and
the number of flags available is architecture-dependent (see mmzone.h).
Architecures can decide if there is room for the flag when enabling
options that select SPARSEMEM_VMEMMAP_PREINIT.
Fortunately, as of right now, all sparse vmemmap using architectures do
have room.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Frank van der Linden <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Roman Gushchin (Cruise) <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Convert the cmdline parameters (hugepagesz, hugepages, default_hugepagesz
and hugetlb_free_vmemmap) to early parameters.
Since parse_early_param might run before MMU setups on some platforms
(powerpc), validation of huge page sizes as specified in command line
parameters would fail. So instead, for the hstate-related values, just
record the them and parse them on demand, from hugetlb_bootmem_alloc.
The allocation of hugetlb bootmem pages is now done in
hugetlb_bootmem_alloc, which is called explicitly at the start of
mm_core_init(). core_initcall would be too late, as that happens with
memblock already torn down.
This change will allow earlier allocation and initialization of bootmem
hugetlb pages later on.
No functional change intended.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Frank van der Linden <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Roman Gushchin (Cruise) <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since pageblock_nr_pages and BITS_PER_LONG are power of 2, we could use
round_up() to calculate it.
Also we have renamed blockflags to pageblock_flags, adjust the comment
accordingly.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Suggested-by: Shivank Garg <[email protected]>
Reviewed-by: Shivank Garg <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
At the beginning of find_zone_movable_pfns_for_nodes(), it has properly
set node_states[N_MEMORY] in early_calculate_totalpages().
Instead of iterating over all possible nodes, we can just do the alignment
on nodes with memory.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Dev Jain <[email protected]>
Reviewed-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
| |
Since MAX_ORDER_NR_PAGES is power of 2, let's use a faster version.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Reviewed-by: Mike Rapoport (Microsoft) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
kmemleak explicitly scans the mem_map through the valid struct page
objects. However, memmap_alloc() was also adding this memory to the gray
object list, causing it to be scanned twice. Remove memmap_alloc() from
the scan list and add a comment to clarify the behavior.
Link: https://lore.kernel.org/lkml/CAOm6qn=FVeTpH54wGDFMHuCOeYtvoTx30ktnv9-w3Nh8RMofEA@mail.gmail.com/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Guo Weikang <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Mike Rapoport (Microsoft) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock updates from Mike Rapoport:
- replace hardcoded strings with str_on_off() in report_meminit()
- initialize reserved pages to MIGRATE_MOVABLE when deferred struct
page initialization is enabled so that if the reserved pages are
freed they are put on movable free lists like it is done now when
deferred struct page initialization is disabled
* tag 'memblock-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock: uniformly initialize all reserved pages to MIGRATE_MOVABLE
mm: Use str_on_off() helper function in report_meminit()
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently when CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set, the reserved
pages are initialized to MIGRATE_MOVABLE by default in memmap_init.
Reserved memory mainly store the metadata of struct page. When
HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=Y and hugepages are allocated,
the HVO will remap the vmemmap virtual address range to the page which
vmemmap_reuse is mapped to. The pages previously mapping the range will
be freed to the buddy system.
Before this patch:
when CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set, the freed memory was
placed on the Movable list;
When CONFIG_DEFERRED_STRUCT_PAGE_INIT=Y, the freed memory was placed on
the Unmovable list.
After this patch, the freed memory is placed on the Movable list
regardless of whether CONFIG_DEFERRED_STRUCT_PAGE_INIT is set.
Eg:
Tested on a virtual machine(1000GB):
Intel(R) Xeon(R) Platinum 8358P CPU
After vm start:
echo 500000 > /proc/sys/vm/nr_hugepages
cat /proc/meminfo | grep -i huge
HugePages_Total: 500000
HugePages_Free: 500000
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 1024000000 kB
cat /proc/pagetypeinfo
before:
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
…
Node 0, zone Normal, type Unmovable 51 2 1 28 53 35 35 43 40 69 3852
Node 0, zone Normal, type Movable 6485 4610 666 202 200 185 208 87 54 2 240
Node 0, zone Normal, type Reclaimable 2 2 1 23 13 1 2 1 0 1 0
Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Unmovable ≈ 15GB
after:
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
…
Node 0, zone Normal, type Unmovable 0 1 1 0 0 0 0 1 1 1 0
Node 0, zone Normal, type Movable 1563 4107 1119 189 256 368 286 132 109 4 3841
Node 0, zone Normal, type Reclaimable 2 2 1 23 13 1 2 1 0 1 0
Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Signed-off-by: Hua Su <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
|
| | |
| |
| |
| |
| |
| |
| |
| | |
Remove hard-coded strings by using the helper function str_on_off().
Signed-off-by: Thorsten Blum <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
|
| |/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Implement support for storing page allocation tag references directly in
the page flags instead of page extensions. sysctl.vm.mem_profiling boot
parameter it extended to provide a way for a user to request this mode.
Enabling compression eliminates memory overhead caused by page_ext and
results in better performance for page allocations. However this mode
will not work if the number of available page flag bits is insufficient to
address all kernel allocations. Such condition can happen during boot or
when loading a module. If this condition is detected, memory allocation
profiling gets disabled with an appropriate warning. By default
compression mode is disabled.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Daniel Gomez <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kalesh Singh <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport (Microsoft) <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Petr Pavlu <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Steven Rostedt (Google) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Thomas Huth <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Xiongwei Song <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are no users of HAVE_ARCH_NODEDATA_EXTENSION left, so
arch_alloc_nodedata() and arch_refresh_nodedata() are not needed anymore.
Replace the call to arch_alloc_nodedata() in free_area_init() with a new
helper alloc_offline_node_data(), remove arch_refresh_nodedata() and
cleanup include/linux/memory_hotplug.h from the associated ifdefery.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64
Acked-by: Dan Williams <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiaxun Yang <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Rob Herring (Arm) <[email protected]>
Cc: Samuel Holland <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make accept_memory() and range_contains_unaccepted_memory() take 'start'
and 'size' arguments instead of 'start' and 'end'.
Remove accept_page(), replacing it with direct calls to accept_memory().
The accept_page() name is going to be used for a different function.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Rapoport (Microsoft) <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
During CMA activation, pages in CMA area are prepared and then freed
without being allocated. This triggers warnings when memory allocation
debug config (CONFIG_MEM_ALLOC_PROFILING_DEBUG) is enabled. Fix this by
marking these pages not tagged before freeing them.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: d224eb0287fb ("codetag: debug: mark codetags for reserved pages as empty")
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]> [6.10]
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In several cases we are freeing pages which were not allocated using
common page allocators. For such cases, in order to keep allocation
accounting correct, we should clear the page tag to indicate that the page
being freed is expected to not have a valid allocation tag. Introduce
clear_page_tag_ref() helper function to be used for this.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: d224eb0287fb ("codetag: debug: mark codetags for reserved pages as empty")
Signed-off-by: Suren Baghdasaryan <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]> [6.10]
Signed-off-by: Andrew Morton <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fix invalid access to pgdat during hot-remove operation:
ndctl users reported a GPF when trying to destroy a namespace:
$ ndctl destroy-namespace all -r all -f
Segmentation fault
dmesg:
Oops: general protection fault, probably for
non-canonical address 0xdffffc0000005650: 0000 [#1] PREEMPT SMP KASAN
PTI
KASAN: probably user-memory-access in range
[0x000000000002b280-0x000000000002b287]
CPU: 26 UID: 0 PID: 1868 Comm: ndctl Not tainted 6.11.0-rc1 #1
Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS
2.20.1 09/13/2023
RIP: 0010:mod_node_page_state+0x2a/0x110
cxl-test users report a GPF when trying to unload the test module:
$ modrpobe -r cxl-test
dmesg
BUG: unable to handle page fault for address: 0000000000004200
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 UID: 0 PID: 1076 Comm: modprobe Tainted: G O N 6.11.0-rc1 #197
Tainted: [O]=OOT_MODULE, [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/15
RIP: 0010:mod_node_page_state+0x6/0x90
Currently, when memory is hot-plugged or hot-removed the accounting is
done based on the assumption that memmap is allocated from the same node
as the hot-plugged/hot-removed memory, which is not always the case.
In addition, there are challenges with keeping the node id of the memory
that is being remove to the time when memmap accounting is actually
performed: since this is done after remove_pfn_range_from_zone(), and
also after remove_memory_block_devices(). Meaning that we cannot use
pgdat nor walking though memblocks to get the nid.
Given all of that, account the memmap overhead system wide instead.
For this we are going to be using global atomic counters, but given that
memmap size is rarely modified, and normally is only modified either
during early boot when there is only one CPU, or under a hotplug global
mutex lock, therefore there is no need for per-cpu optimizations.
Also, while we are here rename nr_memmap to nr_memmap_pages, and
nr_memmap_boot to nr_memmap_boot_pages to be self explanatory that the
units are in page count.
[[email protected]: address a few nits from David Hildenbrand]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 15995a352474 ("mm: report per-page metadata information")
Signed-off-by: Pasha Tatashin <[email protected]>
Reported-by: Yi Zhang <[email protected]>
Closes: https://lore.kernel.org/linux-cxl/CAHj4cs9Ax1=CoJkgBGP_+sNu6-6=6v=_L-ZBZY0bVLD3wUWZQg@mail.gmail.com
Reported-by: Alison Schofield <[email protected]>
Closes: https://lore.kernel.org/linux-mm/Zq0tPd2h6alFz8XF@aschofie-mobl2/#t
Tested-by: Dan Williams <[email protected]>
Tested-by: Alison Schofield <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: David Rientjes <[email protected]>
Tested-by: Yi Zhang <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Fan Ni <[email protected]>
Cc: Joel Granados <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Li Zhijian <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| |\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- In the series "mm: Avoid possible overflows in dirty throttling" Jan
Kara addresses a couple of issues in the writeback throttling code.
These fixes are also targetted at -stable kernels.
- Ryusuke Konishi's series "nilfs2: fix potential issues related to
reserved inodes" does that. This should actually be in the
mm-nonmm-stable tree, along with the many other nilfs2 patches. My
bad.
- More folio conversions from Kefeng Wang in the series "mm: convert to
folio_alloc_mpol()"
- Kemeng Shi has sent some cleanups to the writeback code in the series
"Add helper functions to remove repeated code and improve readability
of cgroup writeback"
- Kairui Song has made the swap code a little smaller and a little
faster in the series "mm/swap: clean up and optimize swap cache
index".
- In the series "mm/memory: cleanly support zeropage in
vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
Hildenbrand has reworked the rather sketchy handling of the use of
the zeropage in MAP_SHARED mappings. I don't see any runtime effects
here - more a cleanup/understandability/maintainablity thing.
- Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling
of higher addresses, for aarch64. The (poorly named) series is
"Restructure va_high_addr_switch".
- The core TLB handling code gets some cleanups and possible slight
optimizations in Bang Li's series "Add update_mmu_tlb_range() to
simplify code".
- Jane Chu has improved the handling of our
fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in
the series "Enhance soft hwpoison handling and injection".
- Jeff Johnson has sent a billion patches everywhere to add
MODULE_DESCRIPTION() to everything. Some landed in this pull.
- In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang
has simplified migration's use of hardware-offload memory copying.
- Yosry Ahmed performs more folio API conversions in his series "mm:
zswap: trivial folio conversions".
- In the series "large folios swap-in: handle refault cases first",
Chuanhua Han inches us forward in the handling of large pages in the
swap code. This is a cleanup and optimization, working toward the end
objective of full support of large folio swapin/out.
- In the series "mm,swap: cleanup VMA based swap readahead window
calculation", Huang Ying has contributed some cleanups and a possible
fixlet to his VMA based swap readahead code.
- In the series "add mTHP support for anonymous shmem" Baolin Wang has
taught anonymous shmem mappings to use multisize THP. By default this
is a no-op - users must opt in vis sysfs controls. Dramatic
improvements in pagefault latency are realized.
- David Hildenbrand has some cleanups to our remaining use of
page_mapcount() in the series "fs/proc: move page_mapcount() to
fs/proc/internal.h".
- David also has some highmem accounting cleanups in the series
"mm/highmem: don't track highmem pages manually".
- Build-time fixes and cleanups from John Hubbard in the series
"cleanups, fixes, and progress towards avoiding "make headers"".
- Cleanups and consolidation of the core pagemap handling from Barry
Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
and utilize them".
- Lance Yang's series "Reclaim lazyfree THP without splitting" has
reduced the latency of the reclaim of pmd-mapped THPs under fairly
common circumstances. A 10x speedup is seen in a microbenchmark.
It does this by punting to aother CPU but I guess that's a win unless
all CPUs are pegged.
- hugetlb_cgroup cleanups from Xiu Jianfeng in the series
"mm/hugetlb_cgroup: rework on cftypes".
- Miaohe Lin's series "Some cleanups for memory-failure" does just that
thing.
- Someone other than SeongJae has developed a DAMON feature in Honggyu
Kim's series "DAMON based tiered memory management for CXL memory".
This adds DAMON features which may be used to help determine the
efficiency of our placement of CXL/PCIe attached DRAM.
- DAMON user API centralization and simplificatio work in SeongJae
Park's series "mm/damon: introduce DAMON parameters online commit
function".
- In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
David Hildenbrand does some maintenance work on zsmalloc - partially
modernizing its use of pageframe fields.
- Kefeng Wang provides more folio conversions in the series "mm: remove
page_maybe_dma_pinned() and page_mkclean()".
- More cleanup from David Hildenbrand, this time in the series
"mm/memory_hotplug: use PageOffline() instead of PageReserved() for
!ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
pages" and permits the removal of some virtio-mem hacks.
- Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()" is a cleanup to the anon folio handling in
preparation for mTHP (multisize THP) swapin.
- Kefeng Wang's series "mm: improve clear and copy user folio"
implements more folio conversions, this time in the area of large
folio userspace copying.
- The series "Docs/mm/damon/maintaier-profile: document a mailing tool
and community meetup series" tells people how to get better involved
with other DAMON developers. From SeongJae Park.
- A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
that.
- David Hildenbrand sends along more cleanups, this time against the
migration code. The series is "mm/migrate: move NUMA hinting fault
folio isolation + checks under PTL".
- Jan Kara has found quite a lot of strangenesses and minor errors in
the readahead code. He addresses this in the series "mm: Fix various
readahead quirks".
- SeongJae Park's series "selftests/damon: test DAMOS tried regions and
{min,max}_nr_regions" adds features and addresses errors in DAMON's
self testing code.
- Gavin Shan has found a userspace-triggerable WARN in the pagecache
code. The series "mm/filemap: Limit page cache size to that supported
by xarray" addresses this. The series is marked cc:stable.
- Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
and cleanup" cleans up and slightly optimizes KSM.
- Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
code motion. The series (which also makes the memcg-v1 code
Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put
under config option" and "mm: memcg: put cgroup v1-specific memcg
data under CONFIG_MEMCG_V1"
- Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
adds an additional feature to this cgroup-v2 control file.
- The series "Userspace controls soft-offline pages" from Jiaqi Yan
permits userspace to stop the kernel's automatic treatment of
excessive correctable memory errors. In order to permit userspace to
monitor and handle this situation.
- Kefeng Wang's series "mm: migrate: support poison recover from
migrate folio" teaches the kernel to appropriately handle migration
from poisoned source folios rather than simply panicing.
- SeongJae Park's series "Docs/damon: minor fixups and improvements"
does those things.
- In the series "mm/zsmalloc: change back to per-size_class lock"
Chengming Zhou improves zsmalloc's scalability and memory
utilization.
- Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
pinning memfd folios" makes the GUP code use FOLL_PIN rather than
bare refcount increments. So these paes can first be moved aside if
they reside in the movable zone or a CMA block.
- Andrii Nakryiko has added a binary ioctl()-based API to
/proc/pid/maps for much faster reading of vma information. The series
is "query VMAs from /proc/<pid>/maps".
- In the series "mm: introduce per-order mTHP split counters" Lance
Yang improves the kernel's presentation of developer information
related to multisize THP splitting.
- Michael Ellerman has developed the series "Reimplement huge pages
without hugepd on powerpc (8xx, e500, book3s/64)". This permits
userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault
injection calls" Vlastimil Babka removes a performance-affecting and
not very useful feature from slab fault injection.
* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
mm/mglru: fix ineffective protection calculation
mm/zswap: fix a white space issue
mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
mm/hugetlb: fix possible recursive locking detected warning
mm/gup: clear the LRU flag of a page before adding to LRU batch
mm/numa_balancing: teach mpol_to_str about the balancing mode
mm: memcg1: convert charge move flags to unsigned long long
alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
lib: reuse page_ext_data() to obtain codetag_ref
lib: add missing newline character in the warning message
mm/mglru: fix overshooting shrinker memory
mm/mglru: fix div-by-zero in vmpressure_calc_level()
mm/kmemleak: replace strncpy() with strscpy()
mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
mm: ignore data-race in __swap_writepage
hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
mm: shmem: rename mTHP shmem counters
mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
mm/migrate: putback split folios when numa hint migration fails
...
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Current check on MAX_ZONELISTS is wrapped in CONFIG_DEBUG_MEMORY_INIT,
which may not be triggered all the time.
Let's move it out to a more general place.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
instead of PageReserved()
We currently initialize the memmap such that PG_reserved is set and the
refcount of the page is 1. In virtio-mem code, we have to manually clear
that PG_reserved flag to make memory offlining with partially hotplugged
memory blocks possible: has_unmovable_pages() would otherwise bail out on
such pages.
We want to avoid PG_reserved where possible and move to typed pages
instead. Further, we want to further enlighten memory offlining code
about PG_offline: offline pages in an online memory section. One example
is handling managed page count adjustments in a cleaner way during memory
offlining.
So let's initialize the pages with PG_offline instead of PG_reserved.
generic_online_page()->__free_pages_core() will now clear that flag before
handing that memory to the buddy.
Note that the page refcount is still 1 and would forbid offlining of such
memory except when special care is take during GOING_OFFLINE as currently
only implemented by virtio-mem.
With this change, we can now get non-PageReserved() pages in the XEN
balloon list. From what I can tell, that can already happen via
decrease_reservation(), so that should be fine.
HV-balloon should not really observe a change: partial online memory
blocks still cannot get surprise-offlined, because the refcount of these
PageOffline() pages is 1.
Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
hotplugged pages are now PageOffline() instead of PageReserved() before
they are handed over to the buddy.
We'll leave the ZONE_DEVICE case alone for now.
Note that self-hosted vmemmap pages will no longer be marked as
reserved. This matches ordinary vmemmap pages allocated from the buddy
during memory hotplug. Now, really only vmemmap pages allocated from
memblock during early boot will be marked reserved. Existing
PageReserved() checks seem to be handling all relevant cases correctly
even after this change.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Oscar Salvador <[email protected]> [generic memory-hotplug bits]
Cc: Alexander Potapenko <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eugenio Pérez <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Oleksandr Tyshchenko <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: Xuan Zhuo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Patch series "mm/memory_hotplug: use PageOffline() instead of
PageReserved() for !ZONE_DEVICE".
This can be a considered a long-overdue follow-up to some parts of [1].
The patches are based on [2], but they are not strictly required -- just
makes it clearer why we can use adjust_managed_page_count() for memory
hotplug without going into details about highmem.
We stop initializing pages with PageReserved() in memory hotplug code --
except when dealing with ZONE_DEVICE for now. Instead, we use
PageOffline(): all pages are initialized to PageOffline() when onlining a
memory section, and only the ones actually getting exposed to the
system/page allocator will get PageOffline cleared.
This way, we enlighten memory hotplug more about PageOffline() pages and
can cleanup some hacks we have in virtio-mem code.
What about ZONE_DEVICE? PageOffline() is wrong, but we might just stop
using PageReserved() for them later by simply checking for
is_zone_device_page() at suitable places. That will be a separate patch
set / proposal.
This primarily affects virtio-mem, HV-balloon and XEN balloon. I only
briefly tested with virtio-mem, which benefits most from these cleanups.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lkml.kernel.org/r/[email protected]
This patch (of 3):
In preparation for further changes, let's teach __free_pages_core() about
the differences of memory hotplug handling.
Move the memory hotplug specific handling from generic_online_page() to
__free_pages_core(), use adjust_managed_page_count() on the memory hotplug
path, and spell out why memory freed via memblock cannot currently use
adjust_managed_page_count().
[[email protected]: add missed CONFIG_DEFERRED_STRUCT_PAGE_INIT]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix up the memblock comment, per Oscar]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: add the parameter name also in the declaration]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dexuan Cui <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eugenio Pérez <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: "K. Y. Srinivasan" <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Oleksandr Tyshchenko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: Xuan Zhuo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Let's simply reinitialize the page->_mapcount directly. We can now get
rid of page_mapcount_reset().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Tested-by: Sergey Senozhatsky <[email protected]> [zram/zsmalloc workloads]
Cc: Hyeonggon Yoo <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Function deferred_[init|free]_pages are only used in
deferred_init_maxorder(), which makes sure the range to init/free is
within MAX_ORDER_NR_PAGES size.
With this knowledge, we can simplify these two functions. Since
* only the first pfn could be IS_MAX_ORDER_ALIGNED()
Also since the range passed to deferred_[init|free]_pages is always from
memblock.memory for those we have already allocated memmap to cover,
pfn_valid() always return true. Then we can remove related check.
[[email protected]: adjust function declaration indention per David]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Today, we do not have any observability of per-page metadata and how much
it takes away from the machine capacity. Thus, we want to describe the
amount of memory that is going towards per-page metadata, which can vary
depending on build configuration, machine architecture, and system use.
This patch adds 2 fields to /proc/vmstat that can used as shown below:
Accounting per-page metadata allocated by boot-allocator:
/proc/vmstat:nr_memmap_boot * PAGE_SIZE
Accounting per-page metadata allocated by buddy-allocator:
/proc/vmstat:nr_memmap * PAGE_SIZE
Accounting total Perpage metadata allocated on the machine:
(/proc/vmstat:nr_memmap_boot +
/proc/vmstat:nr_memmap) * PAGE_SIZE
Utility for userspace:
Observability: Describe the amount of memory overhead that is going to
per-page metadata on the system at any given time since this overhead is
not currently observable.
Debugging: Tracking the changes or absolute value in struct pages can help
detect anomalies as they can be correlated with other metrics in the
machine (e.g., memtotal, number of huge pages, etc).
page_ext overheads: Some kernel features such as page_owner
page_table_check that use page_ext can be optionally enabled via kernel
parameters. Having the total per-page metadata information helps users
precisely measure impact. Furthermore, page-metadata metrics will reflect
the amount of struct pages reliquished (or overhead reduced) when
hugetlbfs pages are reserved which will vary depending on whether hugetlb
vmemmap optimization is enabled or not.
For background and results see:
lore.kernel.org/all/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sourav Panda <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Chen Linxuan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Ivan Babrou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tomas Mudrunka <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Yang Yang <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Current call flow looks like this:
start_kernel
mm_core_init
mem_init
mem_init_print_info
rest_init
kernel_init
kernel_init_freeable
page_alloc_init_late
deferred_init_memmap
If CONFIG_DEFERRED_STRUCT_PAGE_INIT, the time mem_init_print_info()
calls, pages are not totally initialized and freed to buddy.
This has one issue
* nr_free_pages() just contains partial free pages in the system,
which is not we expect.
Let's print the mem info after defer_init is done.
Also this would help changing totalram_pages accounting, since we plan
to move the accounting into __free_pages_core().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|