path: root/mm
Commit message | Author | Age | Files | Lines
* Merge tag 'mm-hotfixes-stable-2025-09-27-22-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm (Linus Torvalds, 2025-09-28; 4 files, -14/+31)

    Pull misc fixes from Andrew Morton:
    "7 hotfixes. 4 are cc:stable and the remainder address post-6.16 issues
    or aren't considered necessary for -stable kernels. 6 of these fixes
    are for MM.

    All singletons, please see the changelogs for details"

    * tag 'mm-hotfixes-stable-2025-09-27-22-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
      include/linux/pgtable.h: convert arch_enter_lazy_mmu_mode() and friends to static inlines
      mm/damon/sysfs: do not ignore callback's return value in damon_sysfs_damon_call()
      mailmap: add entry for Bence Csókás
      fs/proc/task_mmu: check p->vec_buf for NULL
      kmsan: fix out-of-bounds access to shadow memory
      mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_count
      mm/hugetlb: fix folio is still mapped when deleted
| * mm/damon/sysfs: do not ignore callback's return value in damon_sysfs_damon_call() (Akinobu Mita, 2025-09-25; 1 file, -1/+3)

    The callback return value is ignored in damon_sysfs_damon_call(), which
    means that it is not possible to detect invalid user input when writing
    commands such as 'commit' to
    /sys/kernel/mm/damon/admin/kdamonds/<K>/state.  Fix it.

    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: f64539dcdb87 ("mm/damon/sysfs: use damon_call() for update_schemes_stats")
    Signed-off-by: Akinobu Mita <[email protected]>
    Reviewed-by: SeongJae Park <[email protected]>
    Cc: <[email protected]> [6.14+]
    Signed-off-by: Andrew Morton <[email protected]>
| * kmsan: fix out-of-bounds access to shadow memory (Eric Biggers, 2025-09-25; 2 files, -3/+23)

    Running sha224_kunit on a KMSAN-enabled kernel results in a crash in
    kmsan_internal_set_shadow_origin():

        BUG: unable to handle page fault for address: ffffbc3840291000
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0
        Oops: 0000 [#1] SMP NOPTI
        CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G N 6.17.0-rc3 #10 PREEMPT(voluntary)
        Tainted: [N]=TEST
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
        RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100
        [...]
        Call Trace:
         <TASK>
         __msan_memset+0xee/0x1a0
         sha224_final+0x9e/0x350
         test_hash_buffer_overruns+0x46f/0x5f0
         ? kmsan_get_shadow_origin_ptr+0x46/0xa0
         ? __pfx_test_hash_buffer_overruns+0x10/0x10
         kunit_try_run_case+0x198/0xa00

    This occurs when memset() is called on a buffer that is not 4-byte
    aligned and extends to the end of a guard page, i.e. the next page is
    unmapped.

    The bug is that the loop at the end of
    kmsan_internal_set_shadow_origin() accesses the wrong shadow memory
    bytes when the address is not 4-byte aligned.  Since each 4 bytes are
    associated with an origin, it rounds the address and size so that it
    can access all the origins that contain the buffer.  However, when it
    checks the corresponding shadow bytes for a particular origin, it
    incorrectly uses the original unrounded shadow address.  This results
    in reads from shadow memory beyond the end of the buffer's shadow
    memory, which crashes when that memory is not mapped.

    To fix this, correctly align the shadow address before accessing the 4
    shadow bytes corresponding to each origin.

    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: 2ef3cec44c60 ("kmsan: do not wipe out origin when doing partial unpoisoning")
    Signed-off-by: Eric Biggers <[email protected]>
    Tested-by: Alexander Potapenko <[email protected]>
    Reviewed-by: Alexander Potapenko <[email protected]>
    Cc: Dmitriy Vyukov <[email protected]>
    Cc: Marco Elver <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
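    A hedged sketch of the fix's shape (variable names are illustrative,
    not the exact mm/kmsan code): the shadow cursor is aligned down to the
    4-byte origin granule before each per-origin read, so the loop never
    reads past the rounded shadow region.

        /* One 32-bit origin covers KMSAN_ORIGIN_SIZE (4) shadow bytes;
         * align the shadow cursor down before reading each group. */
        u32 *align_shadow = (u32 *)ALIGN_DOWN((u64)shadow, KMSAN_ORIGIN_SIZE);

        for (; origin < origin_end; origin++, align_shadow++) {
                if (*align_shadow)      /* was: read via unaligned 'shadow' */
                        *origin = new_origin;
        }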
| * mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_count (Jane Chu, 2025-09-25; 1 file, -10/+5)

    Commit 59d9094df3d79 ("mm: hugetlb: independent PMD page table shared
    count") introduced ->pt_share_count, dedicated to hugetlb PMD share
    count tracking, but omitted fixing copy_hugetlb_page_range(), leaving
    the function relying on page_count() tracking that no longer works.

    When lazy page table copy for hugetlb is disabled, that is, when commit
    bcd51a3c679d ("hugetlb: lazy page table copies in fork()") is reverted,
    fork()'ing with hugetlb PMD sharing quickly locks up -

        [ 239.446559] watchdog: BUG: soft lockup - CPU#75 stuck for 27s!
        [ 239.446611] RIP: 0010:native_queued_spin_lock_slowpath+0x7e/0x2e0
        [ 239.446631] Call Trace:
        [ 239.446633] <TASK>
        [ 239.446636] _raw_spin_lock+0x3f/0x60
        [ 239.446639] copy_hugetlb_page_range+0x258/0xb50
        [ 239.446645] copy_page_range+0x22b/0x2c0
        [ 239.446651] dup_mmap+0x3e2/0x770
        [ 239.446654] dup_mm.constprop.0+0x5e/0x230
        [ 239.446657] copy_process+0xd17/0x1760
        [ 239.446660] kernel_clone+0xc0/0x3e0
        [ 239.446661] __do_sys_clone+0x65/0xa0
        [ 239.446664] do_syscall_64+0x82/0x930
        [ 239.446668] ? count_memcg_events+0xd2/0x190
        [ 239.446671] ? syscall_trace_enter+0x14e/0x1f0
        [ 239.446676] ? syscall_exit_work+0x118/0x150
        [ 239.446677] ? arch_exit_to_user_mode_prepare.constprop.0+0x9/0xb0
        [ 239.446681] ? clear_bhb_loop+0x30/0x80
        [ 239.446684] ? clear_bhb_loop+0x30/0x80
        [ 239.446686] entry_SYSCALL_64_after_hwframe+0x76/0x7e

    There are two options to resolve the potential latent issue:

    1. warn against PMD sharing in copy_hugetlb_page_range(),
    2. fix it.

    This patch opts for the second option.  While at it, simplify the
    comment; the details are not actually relevant anymore.

    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
    Signed-off-by: Jane Chu <[email protected]>
    Reviewed-by: Harry Yoo <[email protected]>
    Acked-by: Oscar Salvador <[email protected]>
    Acked-by: David Hildenbrand <[email protected]>
    Cc: Jann Horn <[email protected]>
    Cc: Liu Shixin <[email protected]>
    Cc: Muchun Song <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
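    A hedged sketch of the idea (helper names follow the ptdesc API that
    commit 59d9094df3d79 is described as adding; treat the surrounding
    code as illustrative): sharing is decided from the dedicated counter,
    not page_count().

        /* In copy_hugetlb_page_range(): a PMD page table is shared iff
         * its ptdesc share counter is non-zero; page_count() no longer
         * encodes this once the dedicated counter exists. */
        if (ptdesc_pmd_pts_count(virt_to_ptdesc(src_pte)))
                pmd_is_shared = true;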
* | Merge tag 'mm-hotfixes-stable-2025-09-17-21-10' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm (Linus Torvalds, 2025-09-18; 6 files, -41/+62)

    Pull misc fixes from Andrew Morton:
    "15 hotfixes. 11 are cc:stable and the remainder address post-6.16
    issues or aren't considered necessary for -stable kernels. 13 of these
    fixes are for MM.

    The usual shower of singletons, plus

    - fixes from Hugh to address various misbehaviors in get_user_pages()

    - patches from SeongJae to address a quite severe issue in DAMON

    - another series also from SeongJae which completes some fixes for a
      DAMON startup issue"

    * tag 'mm-hotfixes-stable-2025-09-17-21-10' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
      zram: fix slot write race condition
      nilfs2: fix CFI failure when accessing /sys/fs/nilfs2/features/*
      samples/damon/mtier: avoid starting DAMON before initialization
      samples/damon/prcl: avoid starting DAMON before initialization
      samples/damon/wsse: avoid starting DAMON before initialization
      MAINTAINERS: add Lance Yang as a THP reviewer
      MAINTAINERS: add Jann Horn as rmap reviewer
      mm/damon/sysfs: use dynamically allocated repeat mode damon_call_control
      mm/damon/core: introduce damon_call_control->dealloc_on_cancel
      mm: folio_may_be_lru_cached() unless folio_test_large()
      mm: revert "mm: vmscan.c: fix OOM on swap stress test"
      mm: revert "mm/gup: clear the LRU flag of a page before adding to LRU batch"
      mm/gup: local lru_add_drain() to avoid lru_add_drain_all()
      mm/gup: check ref_count instead of lru before migration
| * mm/damon/sysfs: use dynamically allocated repeat mode damon_call_control (SeongJae Park, 2025-09-13; 1 file, -8/+15)

    DAMON sysfs interface is using a single global repeat mode
    damon_call_control variable for refresh_ms handling, for all DAMON
    contexts.  As a result, when there is more than one context, the single
    global damon_call_control is unexpectedly over-written (corrupted).
    Particularly the ->link field is overwritten by the multiple contexts,
    and this can cause a user hangup and/or a kernel crash.  Fix it by
    using a dynamically allocated damon_call_control object per DAMON
    context.

    Link: https://lkml.kernel.org/r/[email protected]
    Link: https://lore.kernel.org/[email protected] [1]
    Link: https://lore.kernel.org/[email protected] [2]
    Fixes: d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file internal work")
    Signed-off-by: SeongJae Park <[email protected]>
    Reported-by: Yunjeong Mun <[email protected]>
    Closes: https://lore.kernel.org/[email protected]
    Signed-off-by: Andrew Morton <[email protected]>
| * mm/damon/core: introduce damon_call_control->dealloc_on_cancel (SeongJae Park, 2025-09-13; 1 file, -2/+6)

    Patch series "mm/damon/sysfs: fix refresh_ms control overwriting on
    multi-kdamonds usages".

    The automatic essential DAMON/DAMOS status update feature of the DAMON
    sysfs interface (refresh_ms) is broken [1] for the multiple DAMON
    contexts (kdamonds) use case, since it uses a global single
    damon_call_control object for all created DAMON contexts.  The fields
    of the object, particularly the list field, are over-written by the
    contexts, and this causes unexpected results including user-space
    hangups and kernel crashes [2].  Fix it by extending
    damon_call_control for the use case and updating the usage on the
    DAMON sysfs interface to use a per-context, dynamically allocated
    damon_call_control object.

    This patch (of 2):

    When damon_call_control->repeat is set, damon_call() is executed
    asynchronously, and is eventually canceled when kdamond finishes.  If
    the damon_call_control object is dynamically allocated, finding the
    place to deallocate the object is difficult.  Introduce a new
    damon_call_control field, namely dealloc_on_cancel, to ask kdamond to
    deallocate such dynamically allocated objects when they are canceled.

    Link: https://lkml.kernel.org/r/[email protected]
    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file internal work")
    Signed-off-by: SeongJae Park <[email protected]>
    Cc: Yunjeong Mun <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
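    A hedged sketch of the new field and the cancel-time cleanup it
    enables (paraphrased from the changelog, not the exact core.c code):

        /* sketch: the new flag on the control structure */
        struct damon_call_control {
                bool repeat;             /* run repeatedly, asynchronously */
                bool dealloc_on_cancel;  /* kfree() this object on cancel */
                /* ... existing fields (fn, data, link, ...) ... */
        };

        /* sketch: kdamond's cancellation path, per canceled control */
        static void kdamond_cancel_call(struct damon_call_control *control)
        {
                /* mark canceled, wake any waiters, then: */
                if (control->dealloc_on_cancel)
                        kfree(control);
        }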
| * mm: folio_may_be_lru_cached() unless folio_test_large() (Hugh Dickins, 2025-09-13; 3 files, -6/+6)

    mm/swap.c and mm/mlock.c agree to drain any per-CPU batch as soon as a
    large folio is added: so collect_longterm_unpinnable_folios() just
    wastes effort when calling lru_add_drain[_all]() on a large folio.

    But although there is good reason not to batch up PMD-sized folios, we
    might well benefit from batching a small number of low-order mTHPs
    (though unclear how that "small number" limitation will be
    implemented).

    So ask if folio_may_be_lru_cached() rather than !folio_test_large(), to
    insulate those particular checks from future change.  Name preferred to
    "folio_is_batchable" because large folios can well be put on a batch:
    it's just the per-CPU LRU caches, drained much later, which need care.

    Marked for stable, to counter the increase in lru_add_drain_all()s from
    "mm/gup: check ref_count instead of lru before migration".

    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
    Signed-off-by: Hugh Dickins <[email protected]>
    Suggested-by: David Hildenbrand <[email protected]>
    Acked-by: David Hildenbrand <[email protected]>
    Cc: "Aneesh Kumar K.V" <[email protected]>
    Cc: Axel Rasmussen <[email protected]>
    Cc: Chris Li <[email protected]>
    Cc: Christoph Hellwig <[email protected]>
    Cc: Jason Gunthorpe <[email protected]>
    Cc: Johannes Weiner <[email protected]>
    Cc: John Hubbard <[email protected]>
    Cc: Keir Fraser <[email protected]>
    Cc: Konstantin Khlebnikov <[email protected]>
    Cc: Li Zhe <[email protected]>
    Cc: Matthew Wilcox (Oracle) <[email protected]>
    Cc: Peter Xu <[email protected]>
    Cc: Rik van Riel <[email protected]>
    Cc: Shivank Garg <[email protected]>
    Cc: Vlastimil Babka <[email protected]>
    Cc: Wei Xu <[email protected]>
    Cc: Will Deacon <[email protected]>
    Cc: yangge <[email protected]>
    Cc: Yuanchu Xie <[email protected]>
    Cc: Yu Zhao <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
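    Per the subject line, the helper is tiny; a sketch matching the
    changelog's description (the real definition lives in the mm headers):

        /* Small folios may linger in per-CPU LRU caches and may need a
         * drain before migration; large folios are drained on add. */
        static inline bool folio_may_be_lru_cached(struct folio *folio)
        {
                return !folio_test_large(folio);
        }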
| * mm: revert "mm: vmscan.c: fix OOM on swap stress test"Hugh Dickins2025-09-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 0885ef470560: that was a fix to the reverted 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Chris Li <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Keir Fraser <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Li Zhe <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Peter Xu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Shivank Garg <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Xu <[email protected]> Cc: Will Deacon <[email protected]> Cc: yangge <[email protected]> Cc: Yuanchu Xie <[email protected]> Cc: Yu Zhao <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
| * mm: revert "mm/gup: clear the LRU flag of a page before adding to LRU batch"Hugh Dickins2025-09-131-24/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 33dfe9204f29: now that collect_longterm_unpinnable_folios() is checking ref_count instead of lru, and mlock/munlock do not participate in the revised LRU flag clearing, those changes are misleading, and enlarge the window during which mlock/munlock may miss an mlock_count update. It is possible (I'd hesitate to claim probable) that the greater likelihood of missed mlock_count updates would explain the "Realtime threads delayed due to kcompactd0" observed on 6.12 in the Link below. If that is the case, this reversion will help; but a complete solution needs also a further patch, beyond the scope of this series. Included some 80-column cleanup around folio_batch_add_and_move(). The role of folio_test_clear_lru() (before taking per-memcg lru_lock) is questionable since 6.13 removed mem_cgroup_move_account() etc; but perhaps there are still some races which need it - not examined here. Link: https://lore.kernel.org/linux-mm/DU0PR01MB10385345F7153F334100981888259A@DU0PR01MB10385.eurprd01.prod.exchangelabs.com/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Chris Li <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Keir Fraser <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Li Zhe <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Peter Xu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Shivank Garg <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Xu <[email protected]> Cc: Will Deacon <[email protected]> Cc: yangge <[email protected]> Cc: Yuanchu Xie <[email protected]> Cc: Yu Zhao <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
| * mm/gup: local lru_add_drain() to avoid lru_add_drain_all() (Hugh Dickins, 2025-09-13; 1 file, -4/+11)

    In many cases, if collect_longterm_unpinnable_folios() does need to
    drain the LRU cache to release a reference, the cache in question is on
    this same CPU, and much more efficiently drained by a preliminary local
    lru_add_drain(), than the later cross-CPU lru_add_drain_all().

    Marked for stable, to counter the increase in lru_add_drain_all()s from
    "mm/gup: check ref_count instead of lru before migration".

    Note for clean backports: can take 6.16 commit a03db236aebf ("gup:
    optimize longterm pin_user_pages() for large folio") first.

    Link: https://lkml.kernel.org/r/[email protected]
    Signed-off-by: Hugh Dickins <[email protected]>
    Acked-by: David Hildenbrand <[email protected]>
    Cc: "Aneesh Kumar K.V" <[email protected]>
    Cc: Axel Rasmussen <[email protected]>
    Cc: Chris Li <[email protected]>
    Cc: Christoph Hellwig <[email protected]>
    Cc: Jason Gunthorpe <[email protected]>
    Cc: Johannes Weiner <[email protected]>
    Cc: John Hubbard <[email protected]>
    Cc: Keir Fraser <[email protected]>
    Cc: Konstantin Khlebnikov <[email protected]>
    Cc: Li Zhe <[email protected]>
    Cc: Matthew Wilcox (Oracle) <[email protected]>
    Cc: Peter Xu <[email protected]>
    Cc: Rik van Riel <[email protected]>
    Cc: Shivank Garg <[email protected]>
    Cc: Vlastimil Babka <[email protected]>
    Cc: Wei Xu <[email protected]>
    Cc: Will Deacon <[email protected]>
    Cc: yangge <[email protected]>
    Cc: Yuanchu Xie <[email protected]>
    Cc: Yu Zhao <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
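    A hedged sketch of the staged draining this describes (control flow
    simplified; the re-check helper name is illustrative, not from the
    patch):

        /* stage 0 -> 1: cheap, flush only this CPU's LRU caches */
        if (drained == 0) {
                lru_add_drain();
                drained = 1;
        }
        /* stage 1 -> 2: expensive fallback, drain every CPU */
        if (drained == 1 && folio_still_has_unexpected_refs(folio)) {
                lru_add_drain_all();
                drained = 2;
        }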
| * mm/gup: check ref_count instead of lru before migration (Hugh Dickins, 2025-09-13; 1 file, -1/+2)

    Patch series "mm: better GUP pin lru_add_drain_all()", v2.

    Series of lru_add_drain_all()-related patches, arising from the recent
    mm/gup migration report from Will Deacon.

    This patch (of 5):

    Will Deacon reports:-

     When taking a longterm GUP pin via pin_user_pages(),
     __gup_longterm_locked() tries to migrate target folios that should not
     be longterm pinned, for example because they reside in a CMA region or
     movable zone.  This is done by first pinning all of the target folios
     anyway, collecting all of the longterm-unpinnable target folios into a
     list, dropping the pins that were just taken and finally handing the
     list off to migrate_pages() for the actual migration.

     It is critically important that no unexpected references are held on
     the folios being migrated, otherwise the migration will fail and
     pin_user_pages() will return -ENOMEM to its caller.  Unfortunately, it
     is relatively easy to observe migration failures when running pKVM
     (which uses pin_user_pages() on crosvm's virtual address space to
     resolve stage-2 page faults from the guest) on a 6.15-based Pixel 6
     device and this results in the VM terminating prematurely.

     In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
     mapping of guest memory prior to the pinning.  Subsequently, when
     pin_user_pages() walks the page-table, the relevant 'pte' is not
     present and so the faulting logic allocates a new folio, mlocks it
     with mlock_folio() and maps it in the page-table.

     Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
     batch by pagevec"), mlock/munlock operations on a folio (formerly
     page), are deferred.  For example, mlock_folio() takes an additional
     reference on the target folio before placing it into a per-cpu
     'folio_batch' for later processing by mlock_folio_batch(), which drops
     the refcount once the operation is complete.  Processing of the
     batches is coupled with the LRU batch logic and can be forcefully
     drained with lru_add_drain_all() but as long as a folio remains
     unprocessed on the batch, its refcount will be elevated.

     This deferred batching therefore interacts poorly with the pKVM
     pinning scenario as we can find ourselves in a situation where the
     migration code fails to migrate a folio due to the elevated refcount
     from the pending mlock operation.

    Hugh Dickins adds:-

     !folio_test_lru() has never been a very reliable way to tell if an
     lru_add_drain_all() is worth calling, to remove LRU cache references
     to make the folio migratable: the LRU flag may be set even while the
     folio is held with an extra reference in a per-CPU LRU cache.

     5.18 commit 2fbb0c10d1e8 may have made it more unreliable.  Then 6.11
     commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
     adding to LRU batch") tried to make it reliable, by moving LRU flag
     clearing; but missed the mlock/munlock batches, so still unreliable as
     reported.

     And it turns out to be difficult to extend 33dfe9204f29's LRU flag
     clearing to the mlock/munlock batches: if they do benefit from
     batching, mlock/munlock cannot be so effective when easily suppressed
     while !LRU.

    Instead, switch to an expected ref_count check, which was more reliable
    all along: some more false positives (unhelpful drains) than before,
    and never a guarantee that the folio will prove migratable, but better.

    Note on PG_private_2: ceph and nfs are still using the deprecated
    PG_private_2 flag, with the aid of netfs and filemap support functions.
    Although it is consistently matched by an increment of folio ref_count,
    folio_expected_ref_count() intentionally does not recognize it, and
    ceph folio migration currently depends on that for PG_private_2 folios
    to be rejected.  New references to the deprecated flag are discouraged,
    so do not add it into the collect_longterm_unpinnable_folios()
    calculation: but longterm pinning of transiently PG_private_2 ceph and
    nfs folios (an uncommon case) may invoke a redundant
    lru_add_drain_all().  And this makes the backport to earlier releases
    easy: up to and including 6.12, btrfs also used PG_private_2, but
    without a ref_count increment.

    Note for stable backports: requires 6.16 commit 86ebd50224c0 ("mm: add
    folio_expected_ref_count() for reference count calculation").

    Link: https://lkml.kernel.org/r/[email protected]
    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
    Signed-off-by: Hugh Dickins <[email protected]>
    Reported-by: Will Deacon <[email protected]>
    Closes: https://lore.kernel.org/linux-mm/[email protected]/
    Acked-by: Kiryl Shutsemau <[email protected]>
    Acked-by: David Hildenbrand <[email protected]>
    Cc: "Aneesh Kumar K.V" <[email protected]>
    Cc: Axel Rasmussen <[email protected]>
    Cc: Chris Li <[email protected]>
    Cc: Christoph Hellwig <[email protected]>
    Cc: Jason Gunthorpe <[email protected]>
    Cc: Johannes Weiner <[email protected]>
    Cc: John Hubbard <[email protected]>
    Cc: Keir Fraser <[email protected]>
    Cc: Konstantin Khlebnikov <[email protected]>
    Cc: Li Zhe <[email protected]>
    Cc: Matthew Wilcox (Oracle) <[email protected]>
    Cc: Peter Xu <[email protected]>
    Cc: Rik van Riel <[email protected]>
    Cc: Shivank Garg <[email protected]>
    Cc: Vlastimil Babka <[email protected]>
    Cc: Wei Xu <[email protected]>
    Cc: yangge <[email protected]>
    Cc: Yuanchu Xie <[email protected]>
    Cc: Yu Zhao <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
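    A hedged sketch of the revised test (folio_expected_ref_count() is the
    6.16 helper the changelog names; the +1 accounts for the GUP pin just
    taken; surrounding control flow simplified):

        /* Drain only when the refcount differs from what the folio's
         * mappings, private data and our own pin can explain; the LRU
         * flag is no longer consulted. */
        if (drained < 2 &&
            folio_ref_count(folio) != folio_expected_ref_count(folio) + 1) {
                lru_add_drain_all();
                drained = 2;
        }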
* | Merge tag 'mm-hotfixes-stable-2025-09-10-20-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm (Linus Torvalds, 2025-09-11; 12 files, -45/+94)

    Pull misc fixes from Andrew Morton:
    "20 hotfixes. 15 are cc:stable and the remainder address post-6.16
    issues or aren't considered necessary for -stable kernels. 14 of these
    fixes are for MM.

    This includes

    - kexec fixes from Breno for a recently introduced use-uninitialized
      bug

    - DAMON fixes from Quanmin Yan to avoid div-by-zero crashes which can
      occur if the operator uses poorly-chosen insmod parameters

    and misc singleton fixes"

    * tag 'mm-hotfixes-stable-2025-09-10-20-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
      MAINTAINERS: add tree entry to numa memblocks and emulation block
      mm/damon/sysfs: fix use-after-free in state_show()
      proc: fix type confusion in pde_set_flags()
      compiler-clang.h: define __SANITIZE_*__ macros only when undefined
      mm/vmalloc, mm/kasan: respect gfp mask in kasan_populate_vmalloc()
      ocfs2: fix recursive semaphore deadlock in fiemap call
      mm/memory-failure: fix VM_BUG_ON_PAGE(PagePoisoned(page)) when unpoison memory
      mm/mremap: fix regression in vrm->new_addr check
      percpu: fix race on alloc failed warning limit
      mm/memory-failure: fix redundant updates for already poisoned pages
      s390: kexec: initialize kexec_buf struct
      riscv: kexec: initialize kexec_buf struct
      arm64: kexec: initialize kexec_buf struct in load_other_segments()
      mm/damon/reclaim: avoid divide-by-zero in damon_reclaim_apply_parameters()
      mm/damon/lru_sort: avoid divide-by-zero in damon_lru_sort_apply_parameters()
      mm/damon/core: set quota->charged_from to jiffies at first charge window
      mm/hugetlb: add missing hugetlb_lock in __unmap_hugepage_range()
      init/main.c: fix boot time tracing crash
      mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range()
      mm/khugepaged: fix the address passed to notifier on testing young
| * mm/damon/sysfs: fix use-after-free in state_show() (Stanislav Fort, 2025-09-09; 1 file, -5/+9)

    state_show() reads kdamond->damon_ctx without holding
    damon_sysfs_lock.  This allows a use-after-free race:

        CPU 0                         CPU 1
        -----                         -----
        state_show()                  damon_sysfs_turn_damon_on()
        ctx = kdamond->damon_ctx;     mutex_lock(&damon_sysfs_lock);
                                      damon_destroy_ctx(kdamond->damon_ctx);
                                      kdamond->damon_ctx = NULL;
                                      mutex_unlock(&damon_sysfs_lock);
        damon_is_running(ctx);        /* ctx is freed */
        mutex_lock(&ctx->kdamond_lock);  /* UAF */

    (The race can also occur with damon_sysfs_kdamonds_rm_dirs() and
    damon_sysfs_kdamond_release(), which free or replace the context under
    damon_sysfs_lock.)

    Fix by taking damon_sysfs_lock before dereferencing the context,
    mirroring the locking used in pid_show().

    The bug has existed since state_show() first accessed
    kdamond->damon_ctx.

    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: a61ea561c871 ("mm/damon/sysfs: link DAMON for virtual address spaces monitoring")
    Signed-off-by: Stanislav Fort <[email protected]>
    Reported-by: Stanislav Fort <[email protected]>
    Reviewed-by: SeongJae Park <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
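    A hedged sketch of the fixed reader (mirroring pid_show()'s locking as
    the changelog says; whether the real patch uses mutex_lock() or
    mutex_trylock(), and the exact running check, are details beyond this
    log):

        static ssize_t state_show(struct kobject *kobj,
                                  struct kobj_attribute *attr, char *buf)
        {
                struct damon_sysfs_kdamond *kdamond = container_of(kobj,
                                struct damon_sysfs_kdamond, kobj);
                struct damon_ctx *ctx;
                bool running;

                if (!mutex_trylock(&damon_sysfs_lock))
                        return -EBUSY;
                ctx = kdamond->damon_ctx;   /* safe under the lock */
                running = ctx && damon_is_running(ctx);
                mutex_unlock(&damon_sysfs_lock);

                return sysfs_emit(buf, "%s\n", running ? "on" : "off");
        }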
| * mm/vmalloc, mm/kasan: respect gfp mask in kasan_populate_vmalloc() (Uladzislau Rezki (Sony), 2025-09-09; 2 files, -11/+28)

    kasan_populate_vmalloc() and its helpers ignore the caller's gfp_mask
    and always allocate memory using the hardcoded GFP_KERNEL flag.  This
    makes them inconsistent with vmalloc(), which was recently extended to
    support GFP_NOFS and GFP_NOIO allocations.  Page table allocations
    performed during shadow population also ignore the external gfp_mask.
    To preserve the intended semantics of GFP_NOFS and GFP_NOIO, wrap the
    apply_to_page_range() calls into the appropriate memalloc scope.

    xfs calls vmalloc with GFP_NOFS, so this bug could lead to deadlock.
    There was a report here:
    https://lkml.kernel.org/r/[email protected]

    This patch:

    - Extends kasan_populate_vmalloc() and helpers to take gfp_mask;
    - Passes gfp_mask down to alloc_pages_bulk() and __get_free_page();
    - Enforces GFP_NOFS/NOIO semantics with memalloc_*_save()/restore()
      around apply_to_page_range();
    - Updates vmalloc.c and percpu allocator call sites accordingly.

    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: 451769ebb7e7 ("mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc")
    Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
    Reported-by: [email protected]
    Reviewed-by: Andrey Ryabinin <[email protected]>
    Cc: Baoquan He <[email protected]>
    Cc: Michal Hocko <[email protected]>
    Cc: Alexander Potapenko <[email protected]>
    Cc: Andrey Konovalov <[email protected]>
    Cc: Dmitry Vyukov <[email protected]>
    Cc: Vincenzo Frascino <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
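    A hedged sketch of the scoping described in the list above (the flag
    plumbing and error handling are simplified relative to the real
    patch):

        unsigned int flags = 0;
        int ret;

        if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
                flags = memalloc_nofs_save();     /* GFP_NOFS caller */
        else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
                flags = memalloc_noio_save();     /* GFP_NOIO caller */

        /* page-table allocations below now inherit the restriction */
        ret = apply_to_page_range(&init_mm, shadow_start,
                                  shadow_end - shadow_start,
                                  kasan_populate_vmalloc_pte, &page);

        if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
                memalloc_nofs_restore(flags);
        else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
                memalloc_noio_restore(flags);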
| * mm/memory-failure: fix VM_BUG_ON_PAGE(PagePoisoned(page)) when unpoison memory (Miaohe Lin, 2025-09-09; 1 file, -4/+3)

    When I did memory failure tests, the panic below occurred:

        page dumped because: VM_BUG_ON_PAGE(PagePoisoned(page))
        kernel BUG at include/linux/page-flags.h:616!
        Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 3 PID: 720 Comm: bash Not tainted 6.10.0-rc1-00195-g148743902568 #40
        RIP: 0010:unpoison_memory+0x2f3/0x590
        RSP: 0018:ffffa57fc8787d60 EFLAGS: 00000246
        RAX: 0000000000000037 RBX: 0000000000000009 RCX: ffff9be25fcdc9c8
        RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff9be25fcdc9c0
        RBP: 0000000000300000 R08: ffffffffb4956f88 R09: 0000000000009ffb
        R10: 0000000000000284 R11: ffffffffb4926fa0 R12: ffffe6b00c000000
        R13: ffff9bdb453dfd00 R14: 0000000000000000 R15: fffffffffffffffe
        FS: 00007f08f04e4740(0000) GS:ffff9be25fcc0000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000564787a30410 CR3: 000000010d4e2000 CR4: 00000000000006f0
        Call Trace:
         <TASK>
         unpoison_memory+0x2f3/0x590
         simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110
         debugfs_attr_write+0x42/0x60
         full_proxy_write+0x5b/0x80
         vfs_write+0xd5/0x540
         ksys_write+0x64/0xe0
         do_syscall_64+0xb9/0x1d0
         entry_SYSCALL_64_after_hwframe+0x77/0x7f
        RIP: 0033:0x7f08f0314887
        RSP: 002b:00007ffece710078 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f08f0314887
        RDX: 0000000000000009 RSI: 0000564787a30410 RDI: 0000000000000001
        RBP: 0000564787a30410 R08: 000000000000fefe R09: 000000007fffffff
        R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009
        R13: 00007f08f041b780 R14: 00007f08f0417600 R15: 00007f08f0416a00
         </TASK>
        Modules linked in: hwpoison_inject
        ---[ end trace 0000000000000000 ]---
        RIP: 0010:unpoison_memory+0x2f3/0x590
        RSP: 0018:ffffa57fc8787d60 EFLAGS: 00000246
        RAX: 0000000000000037 RBX: 0000000000000009 RCX: ffff9be25fcdc9c8
        RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff9be25fcdc9c0
        RBP: 0000000000300000 R08: ffffffffb4956f88 R09: 0000000000009ffb
        R10: 0000000000000284 R11: ffffffffb4926fa0 R12: ffffe6b00c000000
        R13: ffff9bdb453dfd00 R14: 0000000000000000 R15: fffffffffffffffe
        FS: 00007f08f04e4740(0000) GS:ffff9be25fcc0000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000564787a30410 CR3: 000000010d4e2000 CR4: 00000000000006f0
        Kernel panic - not syncing: Fatal exception
        Kernel Offset: 0x31c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
        ---[ end Kernel panic - not syncing: Fatal exception ]---

    The root cause is that unpoison_memory() tries to check the PG_HWPoison
    flags of an uninitialized page, so VM_BUG_ON_PAGE(PagePoisoned(page))
    is triggered.  This can be reproduced by the steps below:

    1. Offline a memory block:
       echo offline > /sys/devices/system/memory/memory12/state
    2. Get an offlined memory pfn:
       page-types -b n -rlN
    3. Write the pfn to unpoison-pfn:
       echo <pfn> > /sys/kernel/debug/hwpoison/unpoison-pfn

    This scenario can be identified by pfn_to_online_page() returning NULL,
    and ZONE_DEVICE pages are never expected, so we can simply fail if
    pfn_to_online_page() == NULL to fix the bug.

    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
    Signed-off-by: Miaohe Lin <[email protected]>
    Suggested-by: David Hildenbrand <[email protected]>
    Acked-by: David Hildenbrand <[email protected]>
    Cc: Naoya Horiguchi <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
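    A hedged sketch of the check (its placement inside unpoison_memory()
    and the exact error value are simplified):

        struct page *page = pfn_to_online_page(pfn);

        if (!page) {
                /* Offline section or ZONE_DEVICE: the struct page may be
                 * uninitialized, so don't touch its flags at all. */
                return -EIO;
        }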
| * mm/mremap: fix regression in vrm->new_addr check (Carlos Llamas, 2025-09-09; 1 file, -3/+6)

    Commit 3215eaceca87 ("mm/mremap: refactor initial parameter sanity
    checks") moved the sanity check for vrm->new_addr from mremap_to() to
    check_mremap_params().  However, this caused a regression as
    vrm->new_addr is now checked even when MREMAP_FIXED and
    MREMAP_DONTUNMAP flags are not specified.  In this case, vrm->new_addr
    can be garbage and create unexpected failures.

    Fix this by moving the new_addr check after the vrm_implies_new_addr()
    guard.  This ensures that the new_addr is only checked when the user
    has specified one explicitly.

    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: 3215eaceca87 ("mm/mremap: refactor initial parameter sanity checks")
    Signed-off-by: Carlos Llamas <[email protected]>
    Reviewed-by: Liam R. Howlett <[email protected]>
    Reviewed-by: Vlastimil Babka <[email protected]>
    Reviewed-by: Lorenzo Stoakes <[email protected]>
    Cc: Carlos Llamas <[email protected]>
    Cc: Jann Horn <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
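    A hedged sketch of the reordering in check_mremap_params()
    (vrm_implies_new_addr() is named by the changelog; the validation body
    shown here is illustrative):

        /* Only MREMAP_FIXED / MREMAP_DONTUNMAP supply a new_addr;
         * otherwise the field is uninitialized garbage from userspace
         * and must not be validated. */
        if (vrm_implies_new_addr(vrm)) {
                if (offset_in_page(vrm->new_addr))
                        return -EINVAL;
                /* ... further new_addr sanity checks ... */
        }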
| * percpu: fix race on alloc failed warning limit (Vlad Dumitrescu, 2025-09-09; 1 file, -8/+12)

    The 'allocation failed, ...' warning messages can cause unlimited log
    spam, contrary to the implementation's intent.

    The warn_limit variable is accessed without synchronization.  If more
    than <warn_limit> threads enter the warning path at the same time, the
    variable will get decremented past 0.  Once it becomes negative, the
    non-zero check will always return true, leading to unlimited log spam.

    Use an atomic operation to access warn_limit and change the condition
    to test for non-negative (>= 0) - atomic_dec_if_positive will return -1
    once warn_limit becomes 0.  Continue to print the disable message
    alongside the last warning.

    While the change cited in Fixes is only adjacent, the warning limit
    implementation was correct before it.  Only non-atomic allocations were
    considered for warnings, and those happened to hold pcpu_alloc_mutex
    while accessing warn_limit.

    [[email protected]: prevent warn_limit from going negative, per Christoph Lameter]
    Link: https://lkml.kernel.org/r/[email protected]
    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: f7d77dfc91f7 ("mm/percpu.c: print error message too if atomic alloc failed")
    Signed-off-by: Vlad Dumitrescu <[email protected]>
    Reviewed-by: Baoquan He <[email protected]>
    Cc: Christoph Lameter (Ampere) <[email protected]>
    Cc: Dennis Zhou <[email protected]>
    Cc: Tejun Heo <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
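    A hedged sketch of the atomic scheme (limit value and message text are
    illustrative): atomic_dec_if_positive() never moves the counter below
    zero, so at most <warn_limit> threads can win the race, and the thread
    that sees the counter reach zero prints the disable notice.

        static atomic_t warn_limit = ATOMIC_INIT(10);

        if (!(gfp & __GFP_NOWARN)) {
                /* returns the decremented value, or -1 once exhausted */
                int remaining = atomic_dec_if_positive(&warn_limit);

                if (remaining >= 0) {
                        pr_warn("percpu: allocation failed, size=%zu align=%zu\n",
                                size, align);
                        if (!remaining)
                                pr_info("percpu: limit reached, disable warning\n");
                }
        }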
| * mm/memory-failure: fix redundant updates for already poisoned pages (Kyle Meyer, 2025-09-04; 1 file, -7/+6)

    Duplicate memory errors can be reported by multiple sources.

    Passing an already poisoned page to action_result() causes issues:

    * The amount of hardware corrupted memory is incorrectly updated.
    * Per NUMA node MF stats are incorrectly updated.
    * Redundant "already poisoned" messages are printed.

    Avoid those issues by:

    * Skipping hardware corrupted memory updates for already poisoned
      pages.
    * Skipping per NUMA node MF stats updates for already poisoned pages.
    * Dropping redundant "already poisoned" messages.

    Make MF_MSG_ALREADY_POISONED consistent with other action_page_types
    and make calls to action_result() consistent for already poisoned
    normal pages and huge pages.

    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: b8b9488d50b7 ("mm/memory-failure: improve memory failure action_result messages")
    Signed-off-by: Kyle Meyer <[email protected]>
    Reviewed-by: Jiaqi Yan <[email protected]>
    Acked-by: David Hildenbrand <[email protected]>
    Reviewed-by: Jane Chu <[email protected]>
    Acked-by: Miaohe Lin <[email protected]>
    Cc: Borislav Betkov <[email protected]>
    Cc: Kyle Meyer <[email protected]>
    Cc: Liam Howlett <[email protected]>
    Cc: Lorenzo Stoakes <[email protected]>
    Cc: "Luck, Tony" <[email protected]>
    Cc: Michal Hocko <[email protected]>
    Cc: Mike Rapoport <[email protected]>
    Cc: Naoya Horiguchi <[email protected]>
    Cc: Oscar Salvador <[email protected]>
    Cc: Russ Anderson <[email protected]>
    Cc: Suren Baghdasaryan <[email protected]>
    Cc: Vlastimil Babka <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
| * mm/damon/reclaim: avoid divide-by-zero in damon_reclaim_apply_parameters() (Quanmin Yan, 2025-09-04; 1 file, -0/+5)

    When creating a new scheme of DAMON_RECLAIM, the calculation of
    'min_age_region' uses 'aggr_interval' as the divisor, which may lead to
    division-by-zero errors.  Fix it by directly returning -EINVAL when
    such a case occurs.

    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: f5a79d7c0c87 ("mm/damon: introduce struct damos_access_pattern")
    Signed-off-by: Quanmin Yan <[email protected]>
    Reviewed-by: SeongJae Park <[email protected]>
    Cc: Kefeng Wang <[email protected]>
    Cc: ze zuo <[email protected]>
    Cc: <[email protected]> [6.1+]
    Signed-off-by: Andrew Morton <[email protected]>
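    A hedged sketch of the guard in damon_reclaim_apply_parameters() (the
    attrs variable name is illustrative):

        /* min_age_region is computed as min_age / aggr_interval:
         * reject a zero divisor before any scheme is created */
        if (!damon_reclaim_mon_attrs.aggr_interval)
                return -EINVAL;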
| * mm/damon/lru_sort: avoid divide-by-zero in damon_lru_sort_apply_parameters() (Quanmin Yan, 2025-09-04; 1 file, -0/+5)

    Patch series "mm/damon: avoid divide-by-zero in DAMON module's
    parameters application".

    DAMON's RECLAIM and LRU_SORT modules perform no validation on
    user-configured parameters during application, which may lead to
    division-by-zero errors.  Avoid the divide-by-zero by adding validation
    checks when DAMON modules attempt to apply the parameters.

    This patch (of 2):

    During the calculation of 'hot_thres' and 'cold_thres', either
    'sample_interval' or 'aggr_interval' is used as the divisor, which may
    lead to division-by-zero errors.  Fix it by directly returning -EINVAL
    when such a case occurs.  Additionally, since 'aggr_interval' is
    already required to be set no smaller than 'sample_interval' in
    damon_set_attrs(), only the case where 'sample_interval' is zero needs
    to be checked.

    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: 40e983cca927 ("mm/damon: introduce DAMON-based LRU-lists Sorting")
    Signed-off-by: Quanmin Yan <[email protected]>
    Reviewed-by: SeongJae Park <[email protected]>
    Cc: Kefeng Wang <[email protected]>
    Cc: ze zuo <[email protected]>
    Cc: <[email protected]> [6.0+]
    Signed-off-by: Andrew Morton <[email protected]>
| * mm/damon/core: set quota->charged_from to jiffies at first charge window (Sang-Heon Jeon, 2025-09-04; 1 file, -0/+4)

    The kernel initializes the "jiffies" timer to 5 minutes below zero, as
    shown in include/linux/jiffies.h:

        /*
         * Have the 32 bit jiffies value wrap 5 minutes after boot
         * so jiffies wrap bugs show up earlier.
         */
        #define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))

    And the jiffies comparison helper functions cast unsigned values to
    signed to cover wraparound:

        #define time_after_eq(a,b)              \
            (typecheck(unsigned long, a) &&     \
             typecheck(unsigned long, b) &&     \
             ((long)((a) - (b)) >= 0))

    When quota->charged_from is initialized to 0, time_after_eq() can
    incorrectly return FALSE even after reset_interval has elapsed.  This
    occurs when (jiffies - reset_interval) produces a value with MSB=1,
    which is interpreted as negative in signed arithmetic.

    This issue primarily affects 32-bit systems because:

    On 64-bit systems: MSB=1 values occur only after ~292 million years
    from boot (assuming HZ=1000), which is practically impossible.

    On 32-bit systems: MSB=1 values occur during the first 5 minutes after
    boot, and during the second half of every jiffies wraparound cycle,
    starting from day 25 (assuming HZ=1000).

    When the above unexpected FALSE return from time_after_eq() occurs, the
    charging window will not reset.  The user impact depends on the esz
    value at that time.  If esz is 0, the scheme ignores configured quotas
    and runs without any limits.  If esz is not 0, the scheme stops working
    once the quota is exhausted, and remains so until the charging window
    finally resets.

    So, set quota->charged_from to jiffies in damos_adjust_quota() when it
    is considered the first charge window.  With this change, the
    unexpected FALSE return from time_after_eq() is avoided.

    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: 2b8a248d5873 ("mm/damon/schemes: implement size quota for schemes application speed control") # 5.16
    Signed-off-by: Sang-Heon Jeon <[email protected]>
    Reviewed-by: SeongJae Park <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
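    A hedged sketch of the change in damos_adjust_quota():

        /* First charge window: charged_from still holds its zero
         * initializer, which interacts badly with time_after_eq()
         * near jiffies wraparound; start the window from "now". */
        if (quota->charged_from == 0)
                quota->charged_from = jiffies;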
| * mm/hugetlb: add missing hugetlb_lock in __unmap_hugepage_range() (Jeongjun Park, 2025-09-04; 1 file, -3/+6)

    When restoring a reservation for an anonymous page, we need to check
    whether we are freeing a surplus.  However, __unmap_hugepage_range()
    causes a data race because it reads h->surplus_huge_pages without the
    protection of hugetlb_lock.

    Also, adjust_reservation is a boolean variable that indicates whether
    reservations for anonymous pages in each folio should be restored, so
    it should be initialized to false for each round of the loop.  However,
    this variable is only initialized to false where the current
    adjust_reservation variable is defined.  This means that once
    adjust_reservation is set to true even once within the loop,
    reservations for anonymous pages will be restored unconditionally in
    all subsequent rounds, regardless of the folio's state.

    To fix this, we need to add the missing hugetlb_lock, unlock the
    page_table_lock earlier so that we don't take hugetlb_lock inside the
    page_table_lock, and initialize adjust_reservation to false on each
    round within the loop.

    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: df7a6d1f6405 ("mm/hugetlb: restore the reservation if needed")
    Signed-off-by: Jeongjun Park <[email protected]>
    Reported-by: [email protected]
    Closes: https://syzkaller.appspot.com/bug?extid=417aeb05fd190f3a6da9
    Reviewed-by: Sidhartha Kumar <[email protected]>
    Cc: Breno Leitao <[email protected]>
    Cc: David Hildenbrand <[email protected]>
    Cc: Muchun Song <[email protected]>
    Cc: Oscar Salvador <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
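    A hedged skeleton of the reworked loop body (locking order per the
    changelog; the surplus condition and surrounding unmap code are
    simplified placeholders, not the real __unmap_hugepage_range() logic):

        for ( /* each address in the range */ ; ; ) {
                bool adjust_reservation = false; /* reset every iteration */

                /* ... unmap the page under the page table lock ... */
                spin_unlock(ptl);                /* drop ptl first */

                spin_lock_irq(&hugetlb_lock);    /* protects surplus count */
                if (h->surplus_huge_pages)       /* freeing a surplus? */
                        adjust_reservation = true;
                spin_unlock_irq(&hugetlb_lock);

                /* ... restore the reservation iff adjust_reservation ... */
        }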
| * mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range() (Jinjiang Tu, 2025-09-04; 1 file, -2/+8)

    In do_migrate_range(), the hwpoisoned folio may be a large folio, which
    can't be handled by unmap_poisoned_folio().  I can reproduce this issue
    in qemu after adding a delay in memory_failure():

        BUG: kernel NULL pointer dereference, address: 0000000000000000
        Workqueue: kacpi_hotplug acpi_hotplug_work_fn
        RIP: 0010:try_to_unmap_one+0x16a/0xfc0
        <TASK>
        rmap_walk_anon+0xda/0x1f0
        try_to_unmap+0x78/0x80
        ? __pfx_try_to_unmap_one+0x10/0x10
        ? __pfx_folio_not_mapped+0x10/0x10
        ? __pfx_folio_lock_anon_vma_read+0x10/0x10
        unmap_poisoned_folio+0x60/0x140
        do_migrate_range+0x4d1/0x600
        ? slab_memory_callback+0x6a/0x190
        ? notifier_call_chain+0x56/0xb0
        offline_pages+0x3e6/0x460
        memory_subsys_offline+0x130/0x1f0
        device_offline+0xba/0x110
        acpi_bus_offline+0xb7/0x130
        acpi_scan_hot_remove+0x77/0x290
        acpi_device_hotplug+0x1e0/0x240
        acpi_hotplug_work_fn+0x1a/0x30
        process_one_work+0x186/0x340

    Besides, do_migrate_range() may be called between memory_failure()
    setting the hwpoison flag and isolating the folio from the lru, so
    remove the WARN_ON().  In other places, unmap_poisoned_folio() is
    called when the folio is isolated; obey that in do_migrate_range() too.

    [[email protected]: don't abort offlining, fixed typo, add comment]
    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined")
    Signed-off-by: Jinjiang Tu <[email protected]>
    Signed-off-by: David Hildenbrand <[email protected]>
    Acked-by: Zi Yan <[email protected]>
    Reviewed-by: Miaohe Lin <[email protected]>
    Cc: Kefeng Wang <[email protected]>
    Cc: Luis Chamberalin <[email protected]>
    Cc: Matthew Wilcox (Oracle) <[email protected]>
    Cc: Michal Hocko <[email protected]>
    Cc: Oscar Salvador <[email protected]>
    Cc: Pankaj Raghav <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
| * mm/khugepaged: fix the address passed to notifier on testing young (Wei Yang, 2025-09-04; 1 file, -2/+2)

    Commit 8ee53820edfd ("thp: mmu_notifier_test_young") introduced
    mmu_notifier_test_young(), but we are passing the wrong address.  In
    xxx_scan_pmd(), the actual iteration address is "_address", not
    "address".  We seem to have misused the variable from the very
    beginning.

    Change it to the right one.

    [[email protected] fix whitespace, per everyone]
    Link: https://lkml.kernel.org/r/[email protected]
    Fixes: 8ee53820edfd ("thp: mmu_notifier_test_young")
    Signed-off-by: Wei Yang <[email protected]>
    Reviewed-by: Dev Jain <[email protected]>
    Reviewed-by: Zi Yan <[email protected]>
    Acked-by: David Hildenbrand <[email protected]>
    Reviewed-by: Lorenzo Stoakes <[email protected]>
    Cc: Baolin Wang <[email protected]>
    Cc: Liam R. Howlett <[email protected]>
    Cc: Nico Pache <[email protected]>
    Cc: Ryan Roberts <[email protected]>
    Cc: Barry Song <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
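    The one-line nature of the fix, sketched (the scan-loop condition is
    abbreviated; only the notifier argument changes):

        /* _pte/_address walk the PMD range; 'address' stays at its start */
        for (_address = address, _pte = pte; /* ... */;
             _pte++, _address += PAGE_SIZE) {
                /* ... */
                if (pte_young(pteval) ||
                    folio_test_referenced(folio) ||
                    mmu_notifier_test_young(vma->vm_mm, _address))
                        /* was: mmu_notifier_test_young(vma->vm_mm, address) */
                        referenced++;
        }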
* | Merge tag 'slab-for-6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab (Linus Torvalds, 2025-09-04; 1 file, -11/+26)

    Pull slab fixes from Vlastimil Babka:

    - Stable fix to make slub_debug code not access invalid pointers in the
      process of reporting issues (Li Qiong)

    - Stable fix to make object tracking pass gfp flags to stackdepot to
      avoid deadlock in contexts that can't even wake up kswapd due to e.g.
      timers debugging enabled (yangshiguang)

    * tag 'slab-for-6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
      mm: slub: avoid wake up kswapd in set_track_prepare
      mm/slub: avoid accessing metadata when pointer is invalid in object_err()
| * mm: slub: avoid wake up kswapd in set_track_prepare (yangshiguang, 2025-09-01; 1 file, -10/+20)

    set_track_prepare() can incur lock recursion.  The issue is that it is
    called from hrtimer_start_range_ns() holding the
    per_cpu(hrtimer_bases)[n].lock, but when CONFIG_DEBUG_OBJECTS_TIMERS is
    enabled, it may wake up kswapd in set_track_prepare and try to take the
    same per_cpu(hrtimer_bases)[n].lock.

    Avoid the deadlock caused by implicitly waking up kswapd by passing in
    allocation flags, which do not contain __GFP_KSWAPD_RECLAIM in the
    debug_objects_fill_pool() case.  Inside stack depot they are processed
    by gfp_nested_mask().  Since ___slab_alloc() has preemption disabled,
    we mask out __GFP_DIRECT_RECLAIM from the flags there.

    The oops looks something like:

        BUG: spinlock recursion on CPU#3, swapper/3/0
        lock: 0xffffff8a4bf29c80, .magic: dead4ead, .owner: swapper/3/0, .owner_cpu: 3
        Hardware name: Qualcomm Technologies, Inc. Popsicle based on SM8850 (DT)
        Call trace:
        spin_bug+0x0
        _raw_spin_lock_irqsave+0x80
        hrtimer_try_to_cancel+0x94
        task_contending+0x10c
        enqueue_dl_entity+0x2a4
        dl_server_start+0x74
        enqueue_task_fair+0x568
        enqueue_task+0xac
        do_activate_task+0x14c
        ttwu_do_activate+0xcc
        try_to_wake_up+0x6c8
        default_wake_function+0x20
        autoremove_wake_function+0x1c
        __wake_up+0xac
        wakeup_kswapd+0x19c
        wake_all_kswapds+0x78
        __alloc_pages_slowpath+0x1ac
        __alloc_pages_noprof+0x298
        stack_depot_save_flags+0x6b0
        stack_depot_save+0x14
        set_track_prepare+0x5c
        ___slab_alloc+0xccc
        __kmalloc_cache_noprof+0x470
        __set_page_owner+0x2bc
        post_alloc_hook[jt]+0x1b8
        prep_new_page+0x28
        get_page_from_freelist+0x1edc
        __alloc_pages_noprof+0x13c
        alloc_slab_page+0x244
        allocate_slab+0x7c
        ___slab_alloc+0x8e8
        kmem_cache_alloc_noprof+0x450
        debug_objects_fill_pool+0x22c
        debug_object_activate+0x40
        enqueue_hrtimer[jt]+0xdc
        hrtimer_start_range_ns+0x5f8
        ...

    Signed-off-by: yangshiguang <[email protected]>
    Fixes: 5cf909c553e9 ("mm/slub: use stackdepot to save stack trace in objects")
    Cc: [email protected]
    Signed-off-by: Vlastimil Babka <[email protected]>
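    A hedged sketch of the plumbing (set_track_prepare() previously took
    no gfp argument; the call sites and trace capture are simplified):

        static noinline depot_stack_handle_t set_track_prepare(gfp_t gfp_flags)
        {
                unsigned long entries[TRACK_ADDRS_COUNT];
                unsigned int nr_entries;

                nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 3);
                /* stack depot masks these further via gfp_nested_mask() */
                return stack_depot_save(entries, nr_entries, gfp_flags);
        }

        /* in ___slab_alloc(), where preemption is disabled: */
        handle = set_track_prepare(gfpflags & ~__GFP_DIRECT_RECLAIM);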
| * mm/slub: avoid accessing metadata when pointer is invalid in object_err() (Li Qiong, 2025-08-25; 1 file, -1/+6)

    object_err() reports details of an object for further debugging, such
    as the freelist pointer, redzone, etc.  However, if the pointer is
    invalid, attempting to access object metadata can lead to a crash
    since it does not point to a valid object.

    One known path to the crash is when alloc_consistency_checks()
    determines the pointer to the allocated object is invalid because of a
    freelist corruption, and calls object_err() to report it.  The debug
    code should report and handle the corruption gracefully and not crash
    in the process.

    In case the pointer is NULL or check_valid_pointer() returns false for
    the pointer, only print the pointer value and skip accessing metadata.

    Fixes: 81819f0fc828 ("SLUB core")
    Cc: <[email protected]>
    Signed-off-by: Li Qiong <[email protected]>
    Reviewed-by: Harry Yoo <[email protected]>
    Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
    Signed-off-by: Vlastimil Babka <[email protected]>
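    A hedged sketch of the guard at the top of object_err() (message text
    and the trailing report calls are illustrative):

        void object_err(struct kmem_cache *s, struct slab *slab,
                        u8 *object, char *reason)
        {
                if (!object || !check_valid_pointer(s, slab, object)) {
                        /* Metadata (freelist pointer, redzone) would be
                         * read relative to a bogus address: report only
                         * the raw pointer value and bail out. */
                        slab_err(s, slab, "Invalid object pointer 0x%p",
                                 object);
                        return;
                }
                /* ... existing report path: slab_bug(), print_trailer() ... */
        }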
* | Merge tag 'mm-hotfixes-stable-2025-09-01-17-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm (Linus Torvalds, 2025-09-02; 9 files, -42/+66)

    Pull misc fixes from Andrew Morton:
    "17 hotfixes. 13 are cc:stable and the remainder address post-6.16
    issues or aren't considered necessary for -stable kernels. 11 of these
    fixes are for MM.

    This includes a three-patch series from Harry Yoo which fixes an
    intermittent boot failure which can occur on x86 systems. And a
    two-patch series from Alexander Gordeev which fixes a KASAN crash on
    S390 systems"

    * tag 'mm-hotfixes-stable-2025-09-01-17-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
      mm: fix possible deadlock in kmemleak
      x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings()
      mm: introduce and use {pgd,p4d}_populate_kernel()
      mm: move page table sync declarations to linux/pgtable.h
      proc: fix missing pde_set_flags() for net proc files
      mm: fix accounting of memmap pages
      mm/damon/core: prevent unnecessary overflow in damos_set_effective_quota()
      kexec: add KEXEC_FILE_NO_CMA as a legal flag
      kasan: fix GCC mem-intrinsic prefix with sw tags
      mm/kasan: avoid lazy MMU mode hazards
      mm/kasan: fix vmalloc shadow memory (de-)population races
      kunit: kasan_test: disable fortify string checker on kasan_strings() test
      selftests/mm: fix FORCE_READ to read input value correctly
      mm/userfaultfd: fix kmap_local LIFO ordering for CONFIG_HIGHPTE
      ocfs2: prevent release journal inode after journal shutdown
      rust: mm: mark VmaNew as transparent
      of_numa: fix uninitialized memory nodes causing kernel panic
| * | mm: fix possible deadlock in kmemleak (Gu Bowen, 2025-09-02; 1 file, -7/+20)

    There are some AA deadlock issues in kmemleak, similar to the situation
    reported by Breno [1].  The deadlock path is as follows:

        mem_pool_alloc()
          -> raw_spin_lock_irqsave(&kmemleak_lock, flags);
          -> pr_warn()
          -> netconsole subsystem
          -> netpoll
          -> __alloc_skb
          -> __create_object
          -> raw_spin_lock_irqsave(&kmemleak_lock, flags);

    To solve this problem, switch to printk_safe mode before printing the
    warning message.  This will redirect all printk()-s to a special
    per-CPU buffer, which will be flushed later from a safe context (irq
    work), and this deadlock problem can be avoided.  The proper API to use
    is printk_deferred_enter()/printk_deferred_exit() [2].

    Another way is to place the warn print after kmemleak is released.

    Link: https://lkml.kernel.org/r/[email protected]
    Link: https://lore.kernel.org/all/[email protected]/#t [1]
    Link: https://lore.kernel.org/all/[email protected]/ [2]
    Signed-off-by: Gu Bowen <[email protected]>
    Reviewed-by: Waiman Long <[email protected]>
    Reviewed-by: Catalin Marinas <[email protected]>
    Reviewed-by: Breno Leitao <[email protected]>
    Cc: Greg Kroah-Hartman <[email protected]>
    Cc: John Ogness <[email protected]>
    Cc: Lu Jialin <[email protected]>
    Cc: Petr Mladek <[email protected]>
    Cc: <[email protected]>
    Signed-off-by: Andrew Morton <[email protected]>
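    A hedged sketch of the pattern (the exact warning sites inside
    kmemleak differ; the point is only where the deferral brackets go):

        raw_spin_lock_irqsave(&kmemleak_lock, flags);
        /* ... */
        /* pr_warn() must not re-enter the allocator (netconsole ->
         * __alloc_skb -> kmemleak) while kmemleak_lock is held, so
         * buffer it and let irq work flush it later */
        printk_deferred_enter();
        pr_warn("kmemleak: mem pool exhausted\n");
        printk_deferred_exit();
        /* ... */
        raw_spin_unlock_irqrestore(&kmemleak_lock, flags);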
| * | mm: introduce and use {pgd,p4d}_populate_kernel()Harry Yoo2025-08-283-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce and use {pgd,p4d}_populate_kernel() in core MM code when populating PGD and P4D entries for the kernel address space. These helpers ensure proper synchronization of page tables when updating the kernel portion of top-level page tables. Until now, the kernel has relied on each architecture to handle synchronization of top-level page tables in an ad-hoc manner. For example, see commit 9b861528a801 ("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes"). However, this approach has proven fragile for following reasons: 1) It is easy to forget to perform the necessary page table synchronization when introducing new changes. For instance, commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps") overlooked the need to synchronize page tables for the vmemmap area. 2) It is also easy to overlook that the vmemmap and direct mapping areas must not be accessed before explicit page table synchronization. For example, commit 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")) caused crashes by accessing the vmemmap area before calling sync_global_pgds(). To address this, as suggested by Dave Hansen, introduce _kernel() variants of the page table population helpers, which invoke architecture-specific hooks to properly synchronize page tables. These are introduced in a new header file, include/linux/pgalloc.h, so they can be called from common code. They reuse existing infrastructure for vmalloc and ioremap. Synchronization requirements are determined by ARCH_PAGE_TABLE_SYNC_MASK, and the actual synchronization is performed by arch_sync_kernel_mappings(). This change currently targets only x86_64, so only PGD and P4D level helpers are introduced. Currently, these helpers are no-ops since no architecture sets PGTBL_{PGD,P4D}_MODIFIED in ARCH_PAGE_TABLE_SYNC_MASK. In theory, PUD and PMD level helpers can be added later if needed by other architectures. For now, 32-bit architectures (x86-32 and arm) only handle PGTBL_PMD_MODIFIED, so p*d_populate_kernel() will never affect them unless we introduce a PMD level helper. 
[[email protected]: fix KASAN build error due to p*d_populate_kernel()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges") Signed-off-by: Harry Yoo <[email protected]> Suggested-by: Dave Hansen <[email protected]> Acked-by: Kiryl Shutsemau <[email protected]> Reviewed-by: Mike Rapoport (Microsoft) <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: bibo mao <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christoph Lameter (Ampere) <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Dev Jain <[email protected]> Cc: Dmitriy Vyukov <[email protected]> Cc: Gwan-gyeong Mun <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jane Chu <[email protected]> Cc: Joao Martins <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Huth <[email protected]> Cc: "Uladzislau Rezki (Sony)" <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
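A sketch of the helper shape described above (the upstream body may differ in detail):

    /* include/linux/pgalloc.h */
    #include <linux/pgtable.h>
    #include <asm/pgalloc.h>

    static inline void pgd_populate_kernel(unsigned long addr, pgd_t *pgd,
                                           p4d_t *p4d)
    {
            pgd_populate(&init_mm, pgd, p4d);
            /* sync only when the architecture asks for PGD-level sync */
            if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED)
                    arch_sync_kernel_mappings(addr, addr);
    }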
| * | mm: fix accounting of memmap pagesSumanth Korikkar2025-08-282-11/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For !CONFIG_SPARSEMEM_VMEMMAP, memmap page accounting is currently done upfront in sparse_buffer_init(). However, sparse_buffer_alloc() may return NULL in a failure scenario. Also, memmap pages may be allocated either from the memblock allocator during early boot or from the buddy allocator. When removed via arch_remove_memory(), accounting of memmap pages must reflect the original allocation source. To ensure correctness: * Account memmap pages after successful allocation in sparse_init_nid() and section_activate(). * Account memmap pages in section_deactivate() based on the allocation source. Link: https://lkml.kernel.org/r/[email protected] Fixes: 15995a352474 ("mm: report per-page metadata information") Signed-off-by: Sumanth Korikkar <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Reviewed-by: Wei Yang <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
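The intended ordering, sketched with hypothetical account_memmap()/account_memmap_boot() helpers standing in for the real accounting calls:

    /* account only after the allocation is known to have succeeded */
    map = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
    if (!map)
            return -ENOMEM;
    account_memmap(DIV_ROUND_UP(memmap_size, PAGE_SIZE));   /* hypothetical */

    /* on removal, match the original allocation source */
    if (early_section(ms))
            account_memmap_boot(-nr_memmap_pages);          /* hypothetical */
    else
            account_memmap(-nr_memmap_pages);               /* hypothetical */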
| * | mm/damon/core: prevent unnecessary overflow in damos_set_effective_quota()Quanmin Yan2025-08-281-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On 32-bit systems, the throughput calculation in damos_set_effective_quota() is prone to unnecessary multiplication overflow. Use mult_frac() to fix it. Andrew Paniakin also recently found and privately reported this issue on 64-bit systems, where it can happen once the charged size exceeds ~17 TiB. On systems running in production for a long time, this issue can actually happen. More specifically, when a DAMOS scheme with a time quota runs for a long time, the throughput calculation can overflow and set esz too small. As a result, the speed of the scheme gets unexpectedly slow. Link: https://lkml.kernel.org/r/[email protected] Fixes: 1cd243030059 ("mm/damon/schemes: implement time quota") Signed-off-by: Quanmin Yan <[email protected]> Reported-by: Andrew Paniakin <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: ze zuo <[email protected]> Cc: <[email protected]> [5.16+] Signed-off-by: Andrew Morton <[email protected]>
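mult_frac(x, n, d) from include/linux/math.h computes x * n / d as x / d * n + (x % d) * n / d, so the intermediate product cannot overflow the way a plain x * n / d can. An illustrative before/after with hypothetical variable names:

    #include <linux/math.h>

    /* before: the intermediate product may overflow the type's range */
    throughput = processed_bytes * 1000000 / elapsed_us;
    /* after: overflow-safe division-first form */
    throughput = mult_frac(processed_bytes, 1000000, elapsed_us);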
| * | mm/kasan: avoid lazy MMU mode hazardsAlexander Gordeev2025-08-281-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Functions __kasan_populate_vmalloc() and __kasan_depopulate_vmalloc() use apply_to_pte_range(), which enters lazy MMU mode. In that mode, PTE updates may not be observed until the mode is left. That may lead to a situation in which otherwise correct reads and writes to a PTE using ptep_get(), set_pte(), pte_clear() and other access primitives return wrong results while the vmalloc shadow memory is being (de-)populated. To avoid these hazards, leave the lazy MMU mode before and re-enter it after each PTE manipulation. Link: https://lkml.kernel.org/r/0d2efb7ddddbff6b288fbffeeb10166e90771718.1755528662.git.agordeev@linux.ibm.com Fixes: 3c5c3cfb9ef4 ("kasan: support backing vmalloc space with real shadow memory") Signed-off-by: Alexander Gordeev <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Daniel Axtens <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Ryan Roberts <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
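The shape of the change inside the apply_to_pte_range() callbacks, where pte is the prepared shadow PTE (a sketch, not the exact upstream hunk):

    /* apply_to_pte_range() runs the callback in lazy MMU mode, where
     * PTE updates may be deferred; step out around direct PTE access. */
    arch_leave_lazy_mmu_mode();
    if (likely(pte_none(ptep_get(ptep))))
            set_pte_at(&init_mm, addr, ptep, pte);
    arch_enter_lazy_mmu_mode();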
| * | mm/kasan: fix vmalloc shadow memory (de-)population racesAlexander Gordeev2025-08-281-10/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While working on lazy MMU mode enablement for s390, I hit some curious issues in the KASAN code. The first is related to a custom KASAN-based sanitizer aimed at catching invalid accesses to PTEs, inspired by the conversation in [1]. KASAN complains about valid PTE accesses, while the shadow memory is reported as unpoisoned: [ 102.783993] ================================================================== [ 102.784008] BUG: KASAN: out-of-bounds in set_pte_range+0x36c/0x390 [ 102.784016] Read of size 8 at addr 0000780084cf9608 by task vmalloc_test/0/5542 [ 102.784019] [ 102.784040] CPU: 1 UID: 0 PID: 5542 Comm: vmalloc_test/0 Kdump: loaded Tainted: G OE 6.16.0-gcc-ipte-kasan-11657-gb2d930c4950e #340 PREEMPT [ 102.784047] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 102.784049] Hardware name: IBM 8561 T01 703 (LPAR) [ 102.784052] Call Trace: [ 102.784054] [<00007fffe0147ac0>] dump_stack_lvl+0xe8/0x140 [ 102.784059] [<00007fffe0112484>] print_address_description.constprop.0+0x34/0x2d0 [ 102.784066] [<00007fffe011282c>] print_report+0x10c/0x1f8 [ 102.784071] [<00007fffe090785a>] kasan_report+0xfa/0x220 [ 102.784078] [<00007fffe01d3dec>] set_pte_range+0x36c/0x390 [ 102.784083] [<00007fffe01d41c2>] leave_ipte_batch+0x3b2/0xb10 [ 102.784088] [<00007fffe07d3650>] apply_to_pte_range+0x2f0/0x4e0 [ 102.784094] [<00007fffe07e62e4>] apply_to_pmd_range+0x194/0x3e0 [ 102.784099] [<00007fffe07e820e>] __apply_to_page_range+0x2fe/0x7a0 [ 102.784104] [<00007fffe07e86d8>] apply_to_page_range+0x28/0x40 [ 102.784109] [<00007fffe090a3ec>] __kasan_populate_vmalloc+0xec/0x310 [ 102.784114] [<00007fffe090aa36>] kasan_populate_vmalloc+0x96/0x130 [ 102.784118] [<00007fffe0833a04>] alloc_vmap_area+0x3d4/0xf30 [ 102.784123] [<00007fffe083a8ba>] __get_vm_area_node+0x1aa/0x4c0 [ 102.784127] [<00007fffe083c4f6>] __vmalloc_node_range_noprof+0x126/0x4e0 [ 102.784131] [<00007fffe083c980>] __vmalloc_node_noprof+0xd0/0x110 [ 102.784135] [<00007fffe083ca32>] vmalloc_noprof+0x32/0x40 [ 102.784139] [<00007fff608aa336>] fix_size_alloc_test+0x66/0x150 [test_vmalloc] [ 102.784147] [<00007fff608aa710>] test_func+0x2f0/0x430 [test_vmalloc] [ 102.784153] [<00007fffe02841f8>] kthread+0x3f8/0x7a0 [ 102.784159] [<00007fffe014d8b4>] __ret_from_fork+0xd4/0x7d0 [ 102.784164] [<00007fffe299c00a>] ret_from_fork+0xa/0x30 [ 102.784173] no locks held by vmalloc_test/0/5542.
[ 102.784176] [ 102.784178] The buggy address belongs to the physical page: [ 102.784186] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x84cf9 [ 102.784198] flags: 0x3ffff00000000000(node=0|zone=1|lastcpupid=0x1ffff) [ 102.784212] page_type: f2(table) [ 102.784225] raw: 3ffff00000000000 0000000000000000 0000000000000122 0000000000000000 [ 102.784234] raw: 0000000000000000 0000000000000000 f200000000000001 0000000000000000 [ 102.784248] page dumped because: kasan: bad access detected [ 102.784250] [ 102.784252] Memory state around the buggy address: [ 102.784260] 0000780084cf9500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 102.784274] 0000780084cf9580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 102.784277] >0000780084cf9600: fd 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 102.784290] ^ [ 102.784293] 0000780084cf9680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 102.784303] 0000780084cf9700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 102.784306] ================================================================== The second issue hits when the custom sanitizer above is not implemented, but the kasan itself is still active: [ 1554.438028] Unable to handle kernel pointer dereference in virtual kernel address space [ 1554.438065] Failing address: 001c0ff0066f0000 TEID: 001c0ff0066f0403 [ 1554.438076] Fault in home space mode while using kernel ASCE. [ 1554.438103] AS:00000000059d400b R2:0000000ffec5c00b R3:00000000c6c9c007 S:0000000314470001 P:00000000d0ab413d [ 1554.438158] Oops: 0011 ilc:2 [#1]SMP [ 1554.438175] Modules linked in: test_vmalloc(E+) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) sunrpc(E) pkey_pckmo(E) uvdevice(E) s390_trng(E) rng_core(E) eadm_sch(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) drm(E) loop(E) i2c_core(E) drm_panel_orientation_quirks(E) nfnetlink(E) ctcm(E) fsm(E) zfcp(E) scsi_transport_fc(E) diag288_wdt(E) watchdog(E) ghash_s390(E) prng(E) aes_s390(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha512_s390(E) sha1_s390(E) sha_common(E) pkey(E) autofs4(E) [ 1554.438319] Unloaded tainted modules: pkey_uv(E):1 hmac_s390(E):2 [ 1554.438354] CPU: 1 UID: 0 PID: 1715 Comm: vmalloc_test/0 Kdump: loaded Tainted: G E 6.16.0-gcc-ipte-kasan-11657-gb2d930c4950e #350 PREEMPT [ 1554.438368] Tainted: [E]=UNSIGNED_MODULE [ 1554.438374] Hardware name: IBM 8561 T01 703 (LPAR) [ 1554.438381] Krnl PSW : 0704e00180000000 00007fffe1d3d6ae (memset+0x5e/0x98) [ 1554.438396] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 [ 1554.438409] Krnl GPRS: 0000000000000001 001c0ff0066f0000 001c0ff0066f0000 00000000000000f8 [ 1554.438418] 00000000000009fe 0000000000000009 0000000000000000 0000000000000002 [ 1554.438426] 0000000000005000 000078031ae655c8 00000feffdcf9f59 0000780258672a20 [ 1554.438433] 0000780243153500 00007f8033780000 00007fffe083a510 00007f7fee7cfa00 [ 1554.438452] Krnl Code: 00007fffe1d3d6a0: eb540008000c srlg %r5,%r4,8 00007fffe1d3d6a6: b9020055 ltgr %r5,%r5 #00007fffe1d3d6aa: a784000b brc 8,00007fffe1d3d6c0 >00007fffe1d3d6ae: 42301000 stc %r3,0(%r1) 00007fffe1d3d6b2: d2fe10011000 mvc 1(255,%r1),0(%r1) 00007fffe1d3d6b8: 41101100 la %r1,256(%r1) 00007fffe1d3d6bc: a757fff9 brctg %r5,00007fffe1d3d6ae 00007fffe1d3d6c0: 42301000 stc %r3,0(%r1) [ 1554.438539] Call Trace: [ 1554.438545] [<00007fffe1d3d6ae>] 
memset+0x5e/0x98 [ 1554.438552] ([<00007fffe083a510>] remove_vm_area+0x220/0x400) [ 1554.438562] [<00007fffe083a9d6>] vfree.part.0+0x26/0x810 [ 1554.438569] [<00007fff6073bd50>] fix_align_alloc_test+0x50/0x90 [test_vmalloc] [ 1554.438583] [<00007fff6073c73a>] test_func+0x46a/0x6c0 [test_vmalloc] [ 1554.438593] [<00007fffe0283ac8>] kthread+0x3f8/0x7a0 [ 1554.438603] [<00007fffe014d8b4>] __ret_from_fork+0xd4/0x7d0 [ 1554.438613] [<00007fffe299ac0a>] ret_from_fork+0xa/0x30 [ 1554.438622] INFO: lockdep is turned off. [ 1554.438627] Last Breaking-Event-Address: [ 1554.438632] [<00007fffe1d3d65c>] memset+0xc/0x98 [ 1554.438644] Kernel panic - not syncing: Fatal exception: panic_on_oops This series fixes the above issues and is a prerequisite for the s390 lazy MMU mode implementation. test_vmalloc was used to stress-test the fixes. This patch (of 2): When vmalloc shadow memory is established, the modification of the corresponding page tables is not protected by any locks. Instead, the locking is done per-PTE. This scheme, however, has defects. kasan_populate_vmalloc_pte() - while the ptep_get() read is atomic, the sequence pte_none(ptep_get()) is not. Doing that outside of the lock might race with a concurrent PTE update, and what looks like shadow memory corruption can result. kasan_depopulate_vmalloc_pte() - a page whose address was extracted from a ptep_get() read and cached in a local variable outside of the lock could already be freed by the time an attempt is made to free it. To avoid these issues, put the ptep_get() read itself and the code that manipulates its result under the lock. In addition, move freeing of the page out of the atomic context. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/adb258634194593db294c0d1fb35646e894d6ead.1755528662.git.agordeev@linux.ibm.com Link: https://lore.kernel.org/linux-mm/[email protected]/ [1] Fixes: 3c5c3cfb9ef4 ("kasan: support backing vmalloc space with real shadow memory") Signed-off-by: Alexander Gordeev <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Daniel Axtens <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Ryan Roberts <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
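A sketch of the depopulate side after the fix, assuming this overall shape: the PTE read and clear happen under init_mm.page_table_lock, and the page is freed only after the lock is dropped:

    static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
                                            void *unused)
    {
            pte_t pte;
            int none;

            spin_lock(&init_mm.page_table_lock);
            pte = ptep_get(ptep);           /* read and test under the lock */
            none = pte_none(pte);
            if (likely(!none))
                    pte_clear(&init_mm, addr, ptep);
            spin_unlock(&init_mm.page_table_lock);

            if (likely(!none))              /* free out of the atomic context */
                    __free_page(pfn_to_page(pte_pfn(pte)));
            return 0;
    }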
| * | kunit: kasan_test: disable fortify string checker on kasan_strings() testYeoreum Yun2025-08-281-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to commit 09c6304e38e4 ("kasan: test: fix compatibility with FORTIFY_SOURCE"), the kernel panics in kasan_strings(). This is due to `src` and `ptr` not being hidden from the optimizer; hiding them would disable the runtime fortify string checker. Call trace: __fortify_panic+0x10/0x20 (P) kasan_strings+0x980/0x9b0 kunit_try_run_case+0x68/0x190 kunit_generic_run_threadfn_adapter+0x34/0x68 kthread+0x1c4/0x228 ret_from_fork+0x10/0x20 Code: d503233f a9bf7bfd 910003fd 9424b243 (d4210000) ---[ end trace 0000000000000000 ]--- note: kunit_try_catch[128] exited with irqs disabled note: kunit_try_catch[128] exited with preempt_count 1 # kasan_strings: try faulted: last ** replaying previous printk message ** # kasan_strings: try faulted: last line seen mm/kasan/kasan_test_c.c:1600 # kasan_strings: internal error occurred preventing test case from running: -4 Link: https://lkml.kernel.org/r/[email protected] Fixes: 73228c7ecc5e ("KASAN: port KASAN Tests to KUnit") Signed-off-by: Yeoreum Yun <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitriy Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
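The usual fix pattern (as in the earlier FORTIFY_SOURCE compatibility commit) is to launder the pointers through OPTIMIZER_HIDE_VAR(); variable names here are illustrative:

    #include <linux/compiler.h>

    char *ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO);
    char *src = "test string";

    /* keep FORTIFY_SOURCE from proving buffer sizes at compile time,
     * so the intentional OOB accesses reach KASAN instead */
    OPTIMIZER_HIDE_VAR(ptr);
    OPTIMIZER_HIDE_VAR(src);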
| * | mm/userfaultfd: fix kmap_local LIFO ordering for CONFIG_HIGHPTESasha Levin2025-08-281-2/+7
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With CONFIG_HIGHPTE on 32-bit ARM, move_pages_pte() maps PTE pages using kmap_local_page(), which requires unmapping in Last-In-First-Out order. The current code maps dst_pte first, then src_pte, but unmaps them in the same order (dst_pte, src_pte), violating the LIFO requirement. This causes the warning in kunmap_local_indexed(): WARNING: CPU: 0 PID: 604 at mm/highmem.c:622 kunmap_local_indexed+0x178/0x17c addr != __fix_to_virt(FIX_KMAP_BEGIN + idx) Fix this by reversing the unmap order to respect LIFO ordering. This issue follows the same pattern as similar fixes: - commit eca6828403b8 ("crypto: skcipher - fix mismatch between mapping and unmapping order") - commit 8cf57c6df818 ("nilfs2: eliminate staggered calls to kunmap in nilfs_rename") Both of which addressed the same fundamental requirement that kmap_local operations must follow LIFO ordering. Link: https://lkml.kernel.org/r/[email protected] Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Sasha Levin <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
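kmap_local mappings nest like a stack, so teardown must mirror setup; schematically, with hypothetical page variables:

    dst_pte = kmap_local_page(dst_pte_page);
    src_pte = kmap_local_page(src_pte_page);

    /* ... move the entry ... */

    kunmap_local(src_pte);  /* mapped last, unmapped first */
    kunmap_local(dst_pte);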
* | Merge tag 'fixes-2025-08-28' of ↵Linus Torvalds2025-08-283-11/+18
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock fixes from Mike Rapoport: - printk cleanups in memblock and numa_memblks - update kernel-doc for MEMBLOCK_RSRV_NOINIT to be more accurate and detailed * tag 'fixes-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: memblock: fix kernel-doc for MEMBLOCK_RSRV_NOINIT mm: numa,memblock: Use SZ_1M macro to denote bytes to MB conversion mm/numa_memblks: Use pr_debug instead of printk(KERN_DEBUG)
| * memblock: fix kernel-doc for MEMBLOCK_RSRV_NOINITMike Rapoport (Microsoft)2025-08-261-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel-doc descriptions of MEMBLOCK_RSRV_NOINIT and memblock_reserved_mark_noinit() do not accurately describe their functionality. Expand their kernel-doc to make it clear that the user of MEMBLOCK_RSRV_NOINIT is responsible for properly initializing the struct pages for such regions, and add more details about the effects of using this flag. Reviewed-by: David Hildenbrand <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
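Usage, schematically: the caller opts out of memmap initialization for a reserved range and then owns initializing those struct pages itself:

    #include <linux/memblock.h>

    memblock_reserve(base, size);
    memblock_reserved_mark_noinit(base, size);
    /* caller must initialize the struct pages for [base, base + size) */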
| * mm: numa,memblock: Use SZ_1M macro to denote bytes to MB conversionPratyush Brahma2025-08-203-6/+6
| | | | | | | | | | Replace the manual bitwise conversion of bytes to MB with the SZ_1M macro, a standard macro used within the mm subsystem, to improve readability. Signed-off-by: Pratyush Brahma <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
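For example (illustrative values):

    #include <linux/sizes.h>

    u64 size = end - start;
    pr_info("node %d: %llu MB\n", nid, size >> 20);   /* before: magic shift */
    pr_info("node %d: %llu MB\n", nid, size / SZ_1M); /* after: self-documenting */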
| * mm/numa_memblks: Use pr_debug instead of printk(KERN_DEBUG)Pratyush Brahma2025-08-131-1/+1
| | | | | | | | | | | | | | | | | | | | Replace the direct usage of printk(KERN_DEBUG ...) with pr_debug(...) to align with the consistent `pr_*` API usage within the file. Reviewed-by: Joshua Hahn <[email protected]> Signed-off-by: Pratyush Brahma <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
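Note the two are not strictly equivalent: printk(KERN_DEBUG ...) always emits at debug level, while pr_debug() compiles to a no-op unless DEBUG is defined or CONFIG_DYNAMIC_DEBUG enables the call site:

    printk(KERN_DEBUG "NUMA: node %d\n", nid); /* unconditional */
    pr_debug("NUMA: node %d\n", nid);          /* gated by DEBUG/dynamic debug */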
* | Merge tag 'driver-core-6.17-rc3' of ↵Linus Torvalds2025-08-231-2/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core fixes from Danilo Krummrich: - Fix swapped handling of lru_gen and lru_gen_full debugfs files in vmscan - Fix debugfs mount options (uid, gid, mode) being silently ignored - Fix leak of devres action in the unwind path of Devres::new() - Documentation: - Expand and fix documentation of (outdated) Device, DeviceContext and generic driver infrastructure - Fix C header link of faux device abstractions - Clarify expected interaction with the security team - Smooth text flow in the security bug reporting process documentation * tag 'driver-core-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: Documentation: smooth the text flow in the security bug reporting process Documentation: clarify the expected collaboration with security bugs reporters debugfs: fix mount options not being applied rust: devres: fix leaking call to devm_add_action() rust: faux: fix C header link driver: rust: expand documentation for driver infrastructure device: rust: expand documentation for Device device: rust: expand documentation for DeviceContext mm/vmscan: fix inverted polarity in lru_gen_seq_show()
| * | mm/vmscan: fix inverted polarity in lru_gen_seq_show()Danilo Krummrich2025-08-101-2/+2
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit a7694ff11aa9 ("vmscan: don't bother with debugfs_real_fops()") started using debugfs_get_aux_num() to distinguish between the RW "lru_gen" and the RO "lru_gen_full" file [1]. Willy reported the inverted polarity [2] and Al fixed it up in [3]. However, the patch in [1] was applied. Hence, fix this up accordingly. Cc: Alexander Viro <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/all/20250704040720.GP1880847@ZenIV/ [1] Link: https://lore.kernel.org/all/[email protected]/ [2] Link: https://lore.kernel.org/all/20250704040720.GP1880847@ZenIV/ [3] Fixes: a7694ff11aa9 ("vmscan: don't bother with debugfs_real_fops()") Acked-by: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Danilo Krummrich <[email protected]>
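Schematically, the polarity fix keys the full output off the aux number the debugfs file was created with (a hedged sketch; the exact call site may differ):

    /* "lru_gen" registered with aux num 0, "lru_gen_full" with 1 */
    bool full = debugfs_get_aux_num(m->file);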
* | mm/mremap: fix WARN with uffd that has remap events disabledDavid Hildenbrand2025-08-191-18/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Registering userfaultfd on a VMA that spans at least one PMD and then mremap()'ing that VMA can trigger a WARN when recovering from a failed page table move due to a page table allocation error. The code ends up doing the right thing (recursing, avoiding moving actual page tables), but triggering that WARN is unpleasant: WARNING: CPU: 2 PID: 6133 at mm/mremap.c:357 move_normal_pmd mm/mremap.c:357 [inline] WARNING: CPU: 2 PID: 6133 at mm/mremap.c:357 move_pgt_entry mm/mremap.c:595 [inline] WARNING: CPU: 2 PID: 6133 at mm/mremap.c:357 move_page_tables+0x3832/0x44a0 mm/mremap.c:852 Modules linked in: CPU: 2 UID: 0 PID: 6133 Comm: syz.0.19 Not tainted 6.17.0-rc1-syzkaller-00004-g53e760d89498 #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:move_normal_pmd mm/mremap.c:357 [inline] RIP: 0010:move_pgt_entry mm/mremap.c:595 [inline] RIP: 0010:move_page_tables+0x3832/0x44a0 mm/mremap.c:852 Code: ... RSP: 0018:ffffc900037a76d8 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000032930007 RCX: ffffffff820c6645 RDX: ffff88802e56a440 RSI: ffffffff820c7201 RDI: 0000000000000007 RBP: ffff888037728fc0 R08: 0000000000000007 R09: 0000000000000000 R10: 0000000032930007 R11: 0000000000000000 R12: 0000000000000000 R13: ffffc900037a79a8 R14: 0000000000000001 R15: dffffc0000000000 FS: 000055556316a500(0000) GS:ffff8880d68bc000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b30863fff CR3: 0000000050171000 CR4: 0000000000352ef0 Call Trace: <TASK> copy_vma_and_data+0x468/0x790 mm/mremap.c:1215 move_vma+0x548/0x1780 mm/mremap.c:1282 mremap_to+0x1b7/0x450 mm/mremap.c:1406 do_mremap+0xfad/0x1f80 mm/mremap.c:1921 __do_sys_mremap+0x119/0x170 mm/mremap.c:1977 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0x4c0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f00d0b8ebe9 Code: ... RSP: 002b:00007ffe5ea5ee98 EFLAGS: 00000246 ORIG_RAX: 0000000000000019 RAX: ffffffffffffffda RBX: 00007f00d0db5fa0 RCX: 00007f00d0b8ebe9 RDX: 0000000000400000 RSI: 0000000000c00000 RDI: 0000200000000000 RBP: 00007ffe5ea5eef0 R08: 0000200000c00000 R09: 0000000000000000 R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000002 R13: 00007f00d0db5fa0 R14: 00007f00d0db5fa0 R15: 0000000000000005 </TASK> The underlying issue is that we recurse during the original page table move, but not during the recovery move. Fix it by checking for both VMAs and performing the check before the pmd_none() sanity check. Add a new helper where we perform and document that check for the PMD and PUD levels. Thanks to Harry for bisecting. Link: https://lkml.kernel.org/r/[email protected] Fixes: 0cef0bb836e3 ("mm: clear uffd-wp PTE/PMD state on mremap()") Signed-off-by: David Hildenbrand <[email protected]> Reported-by: [email protected] Closes: https://lkml.kernel.org/r/[email protected] Tested-by: Harry Yoo <[email protected]> Cc: "Liam R. Howlett" <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jann Horn <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
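A sketch of the helper shape described above; the name and the pagetable_move_control layout are assumptions based on mm/mremap.c:

    /* Both the source and target VMA must allow a page-table-level move;
     * uffd registrations without remap events need per-PTE handling so
     * their PTE markers are preserved. */
    static bool uffd_supports_page_table_move(struct pagetable_move_control *pmc)
    {
            return !vma_has_uffd_without_event_remap(pmc->old) &&
                   !vma_has_uffd_without_event_remap(pmc->new);
    }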
* | mm/damon/sysfs-schemes: put damos dests dir after removing its filesSeongJae Park2025-08-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | damon_sysfs_scheme_rm_dirs() puts the dests directory kobject before removing its internal files. Since putting the kobject frees its container struct while the internal file removal still accesses the container, a use-after-free happens. Fix it by putting the reference _after_ removing the files. Link: https://lkml.kernel.org/r/[email protected] Fixes: 2cd0bf85a203 ("mm/damon/sysfs-schemes: implement DAMOS action destinations directory") Signed-off-by: SeongJae Park <[email protected]> Reported-by: Alexandre Ghiti <[email protected]> Closes: https://lore.kernel.org/[email protected] Tested-by: Alexandre Ghiti <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* | mm/migrate: fix NULL movable_ops if CONFIG_ZSMALLOC=mHuacai Chen2025-08-193-8/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit 84caf98838a3 ("mm: stop storing migration_ops in page->mapping") we get the following error message if CONFIG_ZSMALLOC=m: WARNING: CPU: 3 PID: 42 at mm/migrate.c:142 isolate_movable_ops_page+0xa8/0x1c0 CPU: 3 UID: 0 PID: 42 Comm: kcompactd0 Not tainted 6.16.0-rc5+ #2133 PREEMPT pc 9000000000540bd8 ra 9000000000540b84 tp 9000000100420000 sp 9000000100423a60 a0 9000000100193a80 a1 000000000000000c a2 000000000000001b a3 ffffffffffffffff a4 ffffffffffffffff a5 0000000000000267 a6 0000000000000000 a7 9000000100423ae0 t0 00000000000000f1 t1 00000000000000f6 t2 0000000000000000 t3 0000000000000001 t4 ffffff00010eb834 t5 0000000000000040 t6 900000010c89d380 t7 90000000023fcc70 t8 0000000000000018 u0 0000000000000000 s9 ffffff00010eb800 s0 ffffff00010eb800 s1 000000000000000c s2 0000000000043ae0 s3 0000800000000000 s4 900000000219cc40 s5 0000000000000000 s6 ffffff00010eb800 s7 0000000000000001 s8 90000000025b4000 ra: 9000000000540b84 isolate_movable_ops_page+0x54/0x1c0 ERA: 9000000000540bd8 isolate_movable_ops_page+0xa8/0x1c0 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 00000004 (PPLV0 +PIE -PWE) EUEN: 00000000 (-FPE -SXE -ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 000c0000 [BRK] (IS= ECode=12 EsubCode=0) PRID: 0014c010 (Loongson-64bit, Loongson-3A5000) CPU: 3 UID: 0 PID: 42 Comm: kcompactd0 Not tainted 6.16.0-rc5+ #2133 PREEMPT Stack : 90000000021fd000 0000000000000000 9000000000247720 9000000100420000 90000001004236a0 90000001004236a8 0000000000000000 90000001004237e8 90000001004237e0 90000001004237e0 9000000100423550 0000000000000001 0000000000000001 90000001004236a8 725a84864a19e2d9 90000000023fcc58 9000000100420000 90000000024c6848 9000000002416848 0000000000000001 0000000000000000 000000000000000a 0000000007fe0000 ffffff00010eb800 0000000000000000 90000000021fd000 0000000000000000 900000000205cf30 000000000000008e 0000000000000009 ffffff00010eb800 0000000000000001 90000000025b4000 0000000000000000 900000000024773c 00007ffff103d748 00000000000000b0 0000000000000004 0000000000000000 0000000000071c1d ... Call Trace: [<900000000024773c>] show_stack+0x5c/0x190 [<90000000002415e0>] dump_stack_lvl+0x70/0x9c [<90000000004abe6c>] isolate_migratepages_block+0x3bc/0x16e0 [<90000000004af408>] compact_zone+0x558/0x1000 [<90000000004b0068>] compact_node+0xa8/0x1e0 [<90000000004b0aa4>] kcompactd+0x394/0x410 [<90000000002b3c98>] kthread+0x128/0x140 [<9000000001779148>] ret_from_kernel_thread+0x28/0xc0 [<9000000000245528>] ret_from_kernel_thread_asm+0x10/0x88 The reason is that defined(CONFIG_ZSMALLOC) evaluates to 1 only when CONFIG_ZSMALLOC=y, so we should use IS_ENABLED(CONFIG_ZSMALLOC) instead. But with IS_ENABLED(CONFIG_ZSMALLOC), page_movable_ops() cannot access zsmalloc_mops because zsmalloc_mops is in a module. To solve this problem, we define a set_movable_ops() interface to register and unregister offline_movable_ops / zsmalloc_movable_ops in mm/migrate.c, and call it from mm/balloon_compaction.c and mm/zsmalloc.c. Since offline_movable_ops / zsmalloc_movable_ops are then always accessible, all #ifdef / #endif are removed from page_movable_ops().
Link: https://lkml.kernel.org/r/[email protected] Fixes: 84caf98838a3 ("mm: stop storing migration_ops in page->mapping") Signed-off-by: Huacai Chen <[email protected]> Acked-by: Zi Yan <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
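The registration interface, sketched per the description above (the exact signature is an assumption):

    /* mm/migrate.c: let modules (un)register their movable_operations */
    int set_movable_ops(const struct movable_operations *ops, enum pagetype type);

    /* zsmalloc module init / exit */
    set_movable_ops(&zsmalloc_mops, PGTY_zsmalloc);
    set_movable_ops(NULL, PGTY_zsmalloc);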
* | mm/damon/core: fix damos_commit_filter not changing allowSang-Heon Jeon2025-08-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current damos_commit_filter() does not persist the `allow' value of the filter. As a result, changing the `allow' value of a filter and committing doesn't change the `allow' value. Add the missing `allow' value update, so that committing the filter changes the `allow' value as expected. Link: https://lkml.kernel.org/r/[email protected] Fixes: fe6d7fdd6249 ("mm/damon/core: add damos_filter->allow field") Signed-off-by: Sang-Heon Jeon <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Cc: <[email protected]> [6.14.x] Signed-off-by: Andrew Morton <[email protected]>
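The fix amounts to copying the missing field in damos_commit_filter(); the surrounding assignments are sketched for context and may differ upstream:

    static void damos_commit_filter(struct damos_filter *dst,
                                    struct damos_filter *src)
    {
            dst->type = src->type;
            dst->matching = src->matching;
            dst->allow = src->allow;  /* previously missing */
    }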
* | mm/memory-failure: fix infinite UCE for VM_PFNMAP pfnJinjiang Tu2025-08-191-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When memory_failure() is called for an already hwpoisoned pfn, kill_accessing_process() will be called to kill the current task. However, if the vma of the accessing vaddr is VM_PFNMAP, walk_page_range() will skip the vma in walk_page_test() and return 0. Before commit aaf99ac2ceb7 ("mm/hwpoison: do not send SIGBUS to processes with recovered clean pages"), kill_accessing_process() would return EFAULT, and for x86 the current task would be killed in kill_me_maybe(). However, after this commit, kill_accessing_process() simply returns 0, which claims the UCE is handled properly when it actually isn't. In such a case, the user task will trigger the UCE infinitely. To fix it, add a .test_walk callback for hwpoison_walk_ops to scan all vmas. Link: https://lkml.kernel.org/r/[email protected] Fixes: aaf99ac2ceb7 ("mm/hwpoison: do not send SIGBUS to processes with recovered clean pages") Signed-off-by: Jinjiang Tu <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Miaohe Lin <[email protected]> Reviewed-by: Jane Chu <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Shuai Xue <[email protected]> Cc: Zi Yan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
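A sketch of the callback: returning 0 from .test_walk makes the walk descend into every VMA, including the VM_PFNMAP ones that walk_page_test() skips by default (the other hwpoison_walk_ops members shown for context are assumptions):

    static int hwpoison_test_walk(unsigned long start, unsigned long end,
                                  struct mm_walk *walk)
    {
            /* scan all VMAs */
            return 0;
    }

    static const struct mm_walk_ops hwpoison_walk_ops = {
            .pmd_entry = hwpoison_pte_range,
            .hugetlb_entry = hwpoison_hugetlb_range,
            .test_walk = hwpoison_test_walk,
            .walk_lock = PGWALK_RDLOCK,
    };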
* | mm/mremap: catch invalid multi VMA moves earlierLorenzo Stoakes2025-08-191-8/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, any attempt to solely move a VMA would require that the specified span reside within the span of that single VMA, with no gaps before or afterwards. After commit d23cb648e365 ("mm/mremap: permit mremap() move of multiple VMAs"), the multi-VMA move permitted a gap to exist only after VMAs. This was done to provide maximum flexibility. However, we consequently permitted this behaviour for the move of a single VMA, including VMAs not eligible for a multi-VMA move. The change introduced here means that we no longer permit non-eligible VMAs to be moved in this way. This is consistent, as it means all eligible VMA moves are treated the same, and all non-eligible moves are treated as they were before. This change does not break previous behaviour, which would equally have disallowed such a move (in all cases). [[email protected]: do not incorrectly reference invalid VMA in VM_WARN_ON_ONCE()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/2b5aad5681573be85b5b8fac61399af6fb6b68b6.1754218667.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jann Horn <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* | mm/mremap: allow multi-VMA move when filesystem uses thp_get_unmapped_areaLorenzo Stoakes2025-08-191-9/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The multi-VMA move functionality introduced in commit d23cb648e365 ("mm/mremap: permit mremap() move of multiple VMAs") doesn't allow moves of file-backed mappings which specify a custom f_op->get_unmapped_area handler, excepting hugetlb and shmem. We expand this to include thp_get_unmapped_area, to support file-backed mappings for filesystems which use large folios. Additionally, when the first VMA in a range is not compatible with a multi-VMA move, instead of moving the first VMA and returning an error, this series results in us not moving anything and returning an error immediately. Examining this second change in detail: the semantics of multi-VMA moves in mremap() very clearly indicate that a failure can result in a partial move of VMAs. This is in line with other aggregate operations within the kernel, which share these semantics. There are two classes of failures we're concerned with - eligibility for multi-VMA move, and transient failures that would occur even if the user individually moved each VMA. The latter is due to out-of-memory conditions (which, given the allocations involved are small, would likely be fatal in any case), or hitting the mapping limit. Regardless of the cause, transient issues would be fatal anyway, so it isn't really material which VMAs succeeded at being moved or not. However, when it comes to multi-VMA move eligibility, we face another issue - we must allow a single VMA to succeed regardless of this eligibility (as, of course, it is not a multi-VMA move) - but we must then fail multi-VMA operations. The two means by which VMAs may fail the eligibility test are the VMA being UFFD-armed, or the VMA being file-backed and providing its own f_op->get_unmapped_area() helper (because this may result in MREMAP_FIXED being disregarded), excepting those known to correctly handle MREMAP_FIXED. It is therefore conceivable that a user could erroneously try to use this functionality in these instances, and would prefer to not perform any move at all should that occur. This series therefore avoids any move of subsequent VMAs should the first be ineligible for a multi-VMA move and the input span exceed that of the first VMA. We also add detailed test logic to assert that multi-VMA move with ineligible VMAs functions as expected. This patch (of 3): We currently restrict multi-VMA move to avoid filesystems or drivers which provide a custom f_op->get_unmapped_area handler, unless it is known to correctly handle MREMAP_FIXED. We do this so we do not get unexpected results when moving from one area to another (for instance, if the handler would align things, resulting in the moved VMAs having different gaps than the original mapping). More and more filesystems are moving to using large folios, and typically do so (in part) by setting f_op->get_unmapped_area to thp_get_unmapped_area. When mremap() invokes the filesystem's get_unmapped_area handler with MREMAP_FIXED, it does so via get_unmapped_area(), called in vrm_set_new_addr(). In order to do so, it converts the MREMAP_FIXED flag to a MAP_FIXED flag and passes this to the unmapped area handler.
The __get_unmapped_area() function (called by get_unmapped_area()) in turn invokes the filesystem or driver's f_op->get_unmapped_area() handler. Therefore this is a point at which thp_get_unmapped_area() may be called (this is also the case for anonymous mappings where the size is huge page aligned). thp_get_unmapped_area() calls thp_get_unmapped_area_vmflags() and __thp_get_unmapped_area() in turn (falling back to mm_get_unmapped_area_vmflags(), which is known to handle MAP_FIXED correctly). The __thp_get_unmapped_area() function in turn does nothing to change the address hint, nor the MAP_FIXED flag, only adjusting alignment parameters. It then calls mm_get_unmapped_area_vmflags(), and in turn arch-specific unmapped area functions, all of which honour MAP_FIXED correctly. Therefore, we can safely add thp_get_unmapped_area to the known-good handlers. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/4f2542340c29c84d3d470b0c605e916b192f6c81.1754218667.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jann Horn <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
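A hedged sketch of the eligibility check this enables (helper name and exact fallbacks are assumptions):

    static bool vma_multi_allowed(struct vm_area_struct *vma)
    {
            const struct file_operations *fop;

            if (userfaultfd_armed(vma))
                    return false;
            if (!vma->vm_file)
                    return true;

            fop = vma->vm_file->f_op;
            if (!fop->get_unmapped_area)
                    return true;
            if (fop->get_unmapped_area == thp_get_unmapped_area)
                    return true;  /* known to honour MREMAP_FIXED */
            return vma_is_shmem(vma) || is_vm_hugetlb_page(vma);
    }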