path: root/mm/sparse-vmemmap.c
Commit message (author, date, files changed, lines -/+)
* mm: introduce and use {pgd,p4d}_populate_kernel() (Harry Yoo, 2025-08-28, 1 file, -3/+3)

Introduce and use {pgd,p4d}_populate_kernel() in core MM code when populating PGD and P4D entries for the kernel address space. These helpers ensure proper synchronization of page tables when updating the kernel portion of top-level page tables.

Until now, the kernel has relied on each architecture to handle synchronization of top-level page tables in an ad-hoc manner. For example, see commit 9b861528a801 ("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes").

However, this approach has proven fragile for the following reasons:

1) It is easy to forget to perform the necessary page table synchronization when introducing new changes. For instance, commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps") overlooked the need to synchronize page tables for the vmemmap area.

2) It is also easy to overlook that the vmemmap and direct mapping areas must not be accessed before explicit page table synchronization. For example, commit 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges") caused crashes by accessing the vmemmap area before calling sync_global_pgds().

To address this, as suggested by Dave Hansen, introduce _kernel() variants of the page table population helpers, which invoke architecture-specific hooks to properly synchronize page tables. These are introduced in a new header file, include/linux/pgalloc.h, so they can be called from common code.

They reuse existing infrastructure for vmalloc and ioremap. Synchronization requirements are determined by ARCH_PAGE_TABLE_SYNC_MASK, and the actual synchronization is performed by arch_sync_kernel_mappings().

This change currently targets only x86_64, so only PGD and P4D level helpers are introduced.
Currently, these helpers are no-ops since no architecture sets PGTBL_{PGD,P4D}_MODIFIED in ARCH_PAGE_TABLE_SYNC_MASK. In theory, PUD and PMD level helpers can be added later if needed by other architectures. For now, 32-bit architectures (x86-32 and arm) only handle PGTBL_PMD_MODIFIED, so p*d_populate_kernel() will never affect them unless we introduce a PMD level helper. [[email protected]: fix KASAN build error due to p*d_populate_kernel()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges") Signed-off-by: Harry Yoo <[email protected]> Suggested-by: Dave Hansen <[email protected]> Acked-by: Kiryl Shutsemau <[email protected]> Reviewed-by: Mike Rapoport (Microsoft) <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: bibo mao <[email protected]> Cc: Borislav Betkov <[email protected]> Cc: Christoph Lameter (Ampere) <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Dev Jain <[email protected]> Cc: Dmitriy Vyukov <[email protected]> Cc: Gwan-gyeong Mun <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jane Chu <[email protected]> Cc: Joao Martins <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Ryan 
Roberts <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleinxer <[email protected]> Cc: Thomas Huth <[email protected]> Cc: "Uladzislau Rezki (Sony)" <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
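The populate-then-sync pattern described in this commit message can be sketched in user-space C. This is only an illustration under stated assumptions: the types, the simplified signatures, the flag values, and the counting stub for arch_sync_kernel_mappings() are hypothetical stand-ins, not the kernel's definitions. The mask is set here so that the sync path is visible; per the message, no architecture set these bits at the time, which makes the real helpers no-ops.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins; none of this is the kernel's real code. */
#define PGTBL_PGD_MODIFIED (1u << 0)   /* assumed flag value */
#define PGTBL_P4D_MODIFIED (1u << 1)   /* assumed flag value */

/* An architecture that needs syncing would set bits here; 0 disables the hook. */
#ifndef ARCH_PAGE_TABLE_SYNC_MASK
#define ARCH_PAGE_TABLE_SYNC_MASK PGTBL_PGD_MODIFIED
#endif

typedef struct { uintptr_t val; } pgd_t;

static int sync_calls; /* counts invocations of the mock arch hook */

/* Mock arch hook: a real one would propagate the entry to all page tables. */
static void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	sync_calls++;
}

/* Plain populate: install the lower-level table into the top-level entry. */
static void pgd_populate(pgd_t *pgd, uintptr_t p4d_table)
{
	pgd->val = p4d_table;
}

/* _kernel() variant: populate, then sync if the architecture asked for it. */
static void pgd_populate_kernel(unsigned long addr, pgd_t *pgd, uintptr_t p4d_table)
{
	pgd_populate(pgd, p4d_table);
	if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED)
		arch_sync_kernel_mappings(addr, addr);
}
```

Because the caller can no longer populate without passing through the hook check, forgetting the synchronization step (the first failure mode listed in the message) is ruled out by construction in this model.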
* mm: fix accounting of memmap pages (Sumanth Korikkar, 2025-08-28, 1 file, -5/+0)

For !CONFIG_SPARSEMEM_VMEMMAP, memmap page accounting is currently done upfront in sparse_buffer_init(). However, sparse_buffer_alloc() may return NULL on failure.

Also, memmap pages may be allocated either from the memblock allocator during early boot or from the buddy allocator. When removed via arch_remove_memory(), accounting of memmap pages must reflect the original allocation source.

To ensure correctness:

- Account memmap pages after successful allocation in sparse_init_nid() and section_activate().
- Account memmap pages in section_deactivate() based on allocation source.

Link: https://lkml.kernel.org/r/[email protected] Fixes: 15995a352474 ("mm: report per-page metadata information") Signed-off-by: Sumanth Korikkar <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Reviewed-by: Wei Yang <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* mm/hugetlb: do pre-HVO for bootmem allocated pages (Frank van der Linden, 2025-03-17, 1 file, -0/+4)

For large systems, the overhead of vmemmap pages for hugetlb is substantial. It's about 1.5% of memory, which is about 45G for a 3T system. If you want to configure most of that system for hugetlb (e.g. to use as backing memory for VMs), there is a chance of running out of memory on boot, even though you know that the 45G will become available later.

To avoid this scenario, and since it's a waste to first allocate and then free that 45G during boot, do pre-HVO for hugetlb bootmem allocated pages ('gigantic' pages).

pre-HVO is done by adding functions that are called from sparse_init_nid_early and sparse_init_nid_late. The first is called before memmap allocation, so it takes care of allocating memmap HVO-style. The second verifies that all bootmem pages look good; specifically, it checks that they do not intersect with multiple zones. This can only be done from the sparse_init_nid_late path, when zones have been initialized.

The hugetlb page size must be aligned to the section size, and aligned to the size of memory described by the number of page structures contained in one PMD (since pre-HVO is not prepared to split PMDs). This should be true for most 'gigantic' pages; it is for 1G pages on x86, where both of these alignment requirements are 128M.

This will only have an effect if hugetlb_bootmem_alloc was called early in boot. If not, it won't do anything, and HVO for bootmem hugetlb pages works as before.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Frank van der Linden <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Joao Martins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Madhavan Srinivasan <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin (Cruise) <[email protected]> Cc: Usama Arif <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
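The figures quoted in this commit message can be reproduced with a little arithmetic. The sketch below assumes typical x86-64 values that the message itself does not state: 4 KiB base pages, a 64-byte struct page, and 2 MiB PMD mappings. The function and macro names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed typical x86-64 values; not taken from the commit message. */
#define PAGE_SIZE_BYTES   4096ULL        /* base page size */
#define STRUCT_PAGE_BYTES 64ULL          /* size of one page structure */
#define PMD_SIZE_BYTES    (2ULL << 20)   /* memory mapped by one PMD entry */

/* Bytes of page structures needed to describe mem_bytes of memory. */
static uint64_t vmemmap_bytes(uint64_t mem_bytes)
{
	return mem_bytes / PAGE_SIZE_BYTES * STRUCT_PAGE_BYTES;
}

/* Bytes of memory described by the page structures that fit in one PMD. */
static uint64_t mem_per_vmemmap_pmd(void)
{
	return PMD_SIZE_BYTES / STRUCT_PAGE_BYTES * PAGE_SIZE_BYTES;
}
```

Under these assumptions the overhead is 64/4096, roughly 1.56% of memory, or 48 GiB for 3 TiB, which is in line with the "about 1.5%" and "about 45G" in the message. One PMD of page structures describes 128 MiB, matching the stated alignment requirement, and a 1 GiB page is a multiple of that.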
* mm/sparse: add vmemmap_*_hvo functions (Frank van der Linden, 2025-03-17, 1 file, -14/+127)

Add a few functions to enable early HVO:

vmemmap_populate_hvo
vmemmap_undo_hvo
vmemmap_wrprotect_hvo

The populate and undo functions are expected to be used in early init, from the sparse_init_nid_early() function. The wrprotect function is to be used, potentially, later.

To implement these functions, mostly re-use the existing compound pages vmemmap logic used by DAX. vmemmap_populate_address has its arguments changed a bit in this commit: the page structure passed in to be reused in the mapping is replaced by a PFN and a flag. The flag indicates whether an extra ref should be taken on the vmemmap page containing the head page structure. Taking the ref is appropriate for DAX / ZONE_DEVICE, but not for HugeTLB HVO.

The HugeTLB vmemmap optimization maps tail page structure pages read-only. The vmemmap_wrprotect_hvo function that does this is implemented separately, because it cannot be guaranteed that reserved page structures will not be write accessed during memory initialization. Even with CONFIG_DEFERRED_STRUCT_PAGE_INIT, they might still be written to (if they are at the bottom of a zone). So, vmemmap_populate_hvo leaves the tail page structure pages RW initially, and then later during initialization, after memmap init is fully done, vmemmap_wrprotect_hvo must be called to finish the job.

Subsequent commits will use these functions for early HugeTLB HVO.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Frank van der Linden <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Joao Martins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Madhavan Srinivasan <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin (Cruise) <[email protected]> Cc: Usama Arif <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* mm/sparse: allow for alternate vmemmap section init at boot (Frank van der Linden, 2025-03-17, 1 file, -0/+23)

Add functions that are called just before the per-section memmap is initialized and just before the memmap page structures are initialized. They are called sparse_vmemmap_init_nid_early and sparse_vmemmap_init_nid_late, respectively.

This allows for mm subsystems to add calls to initialize memmap and page structures in a specific way, if using SPARSEMEM_VMEMMAP. Specifically, hugetlb can pre-HVO bootmem allocated pages that way, so that no time and resources are wasted on allocating vmemmap pages, only to free them later (and possibly unnecessarily running the system out of memory in the process).

Refactor some code and export a few convenience functions for external use.

In sparse_init_nid, skip any sections that are already initialized, e.g. they have been initialized by sparse_vmemmap_init_nid_early already.

The hugetlb code to use these functions will be added in a later commit.

Export section_map_size, as any alternate memmap init code will want to use it.

The internal config option to enable this is SPARSEMEM_VMEMMAP_PREINIT, which is selected if an architecture-specific option, ARCH_WANT_HUGETLB_VMEMMAP_PREINIT, is set. In the future, if other subsystems want to do preinit too, they can do it in a similar fashion.

The internal config option is there because a section flag is used, and the number of flags available is architecture-dependent (see mmzone.h). Architectures can decide if there is room for the flag when enabling options that select SPARSEMEM_VMEMMAP_PREINIT.

Fortunately, as of right now, all architectures using sparse vmemmap do have room.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Frank van der Linden <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Joao Martins <[email protected]> Cc: Madhavan Srinivasan <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin (Cruise) <[email protected]> Cc: Usama Arif <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* mm/memmap: prevent double scanning of memmap by kmemleak (Guo Weikang, 2025-01-26, 1 file, -2/+3)

kmemleak explicitly scans the mem_map through the valid struct page objects. However, memmap_alloc() was also adding this memory to the gray object list, causing it to be scanned twice. Remove memmap_alloc() from the scan list and add a comment to clarify the behavior.

Link: https://lore.kernel.org/lkml/CAOm6qn=FVeTpH54wGDFMHuCOeYtvoTx30ktnv9-w3Nh8RMofEA@mail.gmail.com/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Guo Weikang <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Cc: Mike Rapoport (Microsoft) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* mm: define general function pXd_init() (Bibo Mao, 2024-11-12, 1 file, -12/+0)

pud_init(), pmd_init() and kernel_pte_init() are defined twice, as weak functions, in kasan.c and sparse-vmemmap.c. Move them to the generic header file pgtable.h, where architectures can redefine them.

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Bibo Mao <[email protected]> Reviewed-by: Huacai Chen <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: WANG Xuerui <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* LoongArch: Set initial pte entry with PAGE_GLOBAL for kernel space (Bibo Mao, 2024-10-21, 1 file, -0/+5)

On LoongArch, one TLB entry covers two pages. For kernel space, both pte entries of the pair (buddies) must have the PAGE_GLOBAL bit set; otherwise the hardware treats the entry as a non-global TLB entry. A non-global TLB entry for kernel space can cause problems, such as failing to flush the kernel TLB with local_flush_tlb_kernel_range(), which is supposed to flush only TLB entries with the global bit set.

Kernel address space areas include the percpu, vmalloc, vmemmap, fixmap and kasan areas. For these areas, both consecutive page table entries should have the PAGE_GLOBAL bit set. So set_pte() and pte_clear() check and set the buddy pte entry in addition to their own pte entry. However, setting both pte entries is not an atomic operation, which causes a problem with the test_vmalloc test case.

So kernel_pte_init() is added to initialize a pte table when it is created for the kernel address space, with the default initial pte value being PAGE_GLOBAL rather than zero. After that, set_pte() and pte_clear() only need to update their own pte entry and no longer touch the buddy entry.

Signed-off-by: Bibo Mao <[email protected]> Signed-off-by: Huacai Chen <[email protected]>
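The idea behind kernel_pte_init() in this commit can be modelled in a few lines of user-space C. This is a minimal sketch, not the LoongArch implementation: the table size, the PAGE_GLOBAL bit position, and the helper pair_is_global() are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_PTE 512          /* hypothetical table size */
#define PAGE_GLOBAL  (1ULL << 6)  /* hypothetical bit position */

/* Pre-fill a new kernel pte table so that every entry starts out global. */
static void kernel_pte_init(uint64_t *table)
{
	for (size_t i = 0; i < PTRS_PER_PTE; i++)
		table[i] = PAGE_GLOBAL;
}

/* Both halves of an even/odd pair must be global for the TLB entry to be global. */
static int pair_is_global(const uint64_t *table, size_t idx)
{
	size_t even = idx & ~(size_t)1;

	return (table[even] & PAGE_GLOBAL) && (table[even + 1] & PAGE_GLOBAL);
}
```

With the table pre-filled, writing a single entry leaves its buddy global without a second store, so there is no window in which the pair disagrees.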
* mm: don't account memmap per-node (Pasha Tatashin, 2024-08-16, 1 file, -7/+4)

Fix invalid access to pgdat during hot-remove operation:

ndctl users reported a GPF when trying to destroy a namespace:

$ ndctl destroy-namespace all -r all -f
Segmentation fault

dmesg:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000005650: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: probably user-memory-access in range [0x000000000002b280-0x000000000002b287]
CPU: 26 UID: 0 PID: 1868 Comm: ndctl Not tainted 6.11.0-rc1 #1
Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.20.1 09/13/2023
RIP: 0010:mod_node_page_state+0x2a/0x110

cxl-test users report a GPF when trying to unload the test module:

$ modprobe -r cxl-test

dmesg
BUG: unable to handle page fault for address: 0000000000004200
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 UID: 0 PID: 1076 Comm: modprobe Tainted: G O N 6.11.0-rc1 #197
Tainted: [O]=OOT_MODULE, [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/15
RIP: 0010:mod_node_page_state+0x6/0x90

Currently, when memory is hot-plugged or hot-removed the accounting is done based on the assumption that memmap is allocated from the same node as the hot-plugged/hot-removed memory, which is not always the case.

In addition, there are challenges with keeping the node id of the memory that is being removed until the time when memmap accounting is actually performed, since this is done after remove_pfn_range_from_zone(), and also after remove_memory_block_devices(). This means that we can neither use pgdat nor walk through memblocks to get the nid.

Given all of that, account the memmap overhead system wide instead.
For this we are going to be using global atomic counters, but given that memmap size is rarely modified, and normally is only modified either during early boot when there is only one CPU, or under a hotplug global mutex lock, therefore there is no need for per-cpu optimizations. Also, while we are here rename nr_memmap to nr_memmap_pages, and nr_memmap_boot to nr_memmap_boot_pages to be self explanatory that the units are in page count. [[email protected]: address a few nits from David Hildenbrand] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 15995a352474 ("mm: report per-page metadata information") Signed-off-by: Pasha Tatashin <[email protected]> Reported-by: Yi Zhang <[email protected]> Closes: https://lore.kernel.org/linux-cxl/CAHj4cs9Ax1=CoJkgBGP_+sNu6-6=6v=_L-ZBZY0bVLD3wUWZQg@mail.gmail.com Reported-by: Alison Schofield <[email protected]> Closes: https://lore.kernel.org/linux-mm/Zq0tPd2h6alFz8XF@aschofie-mobl2/#t Tested-by: Dan Williams <[email protected]> Tested-by: Alison Schofield <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: David Rientjes <[email protected]> Tested-by: Yi Zhang <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Fan Ni <[email protected]> Cc: Joel Granados <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Li Zhijian <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
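The system-wide accounting this commit describes can be sketched with C11 atomics. This is a hedged user-space model, not the kernel code: the counter names follow the renamed fields mentioned in the message, while the helper functions and their signatures are assumptions for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space model of the system-wide memmap counters. */
static atomic_long nr_memmap_pages;      /* memmap pages from the buddy allocator */
static atomic_long nr_memmap_boot_pages; /* memmap pages from the boot allocator */

/* Adjust the buddy-allocated count; delta is negative on hot-remove. */
static void memmap_pages_add(long delta)
{
	atomic_fetch_add(&nr_memmap_pages, delta);
}

/* Adjust the boot-allocated count. */
static void memmap_boot_pages_add(long delta)
{
	atomic_fetch_add(&nr_memmap_boot_pages, delta);
}

/* Total memmap overhead in pages, with no node id involved anywhere. */
static long memmap_total_pages(void)
{
	return atomic_load(&nr_memmap_pages) + atomic_load(&nr_memmap_boot_pages);
}
```

Since neither helper takes a node id, the hot-remove path has nothing to look up after the zone and memory block devices are gone, which is the failure the oopses above illustrate. Plain global atomics are enough in this model for the reason the message gives: updates are rare and already serialized.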
* mm: report per-page metadata information (Sourav Panda, 2024-07-04, 1 file, -0/+8)

Today, we do not have any observability of per-page metadata and how much it takes away from the machine capacity. Thus, we want to describe the amount of memory that is going towards per-page metadata, which can vary depending on build configuration, machine architecture, and system use.

This patch adds 2 fields to /proc/vmstat that can be used as shown below:

Accounting per-page metadata allocated by boot-allocator:
/proc/vmstat:nr_memmap_boot * PAGE_SIZE

Accounting per-page metadata allocated by buddy-allocator:
/proc/vmstat:nr_memmap * PAGE_SIZE

Accounting total per-page metadata allocated on the machine:
(/proc/vmstat:nr_memmap_boot + /proc/vmstat:nr_memmap) * PAGE_SIZE

Utility for userspace:

Observability: Describe the amount of memory overhead that is going to per-page metadata on the system at any given time since this overhead is not currently observable.

Debugging: Tracking the changes or absolute value in struct pages can help detect anomalies as they can be correlated with other metrics in the machine (e.g., memtotal, number of huge pages, etc).

page_ext overheads: Some kernel features such as page_owner and page_table_check that use page_ext can be optionally enabled via kernel parameters. Having the total per-page metadata information helps users precisely measure impact. Furthermore, page-metadata metrics will reflect the amount of struct pages relinquished (or overhead reduced) when hugetlbfs pages are reserved, which will vary depending on whether hugetlb vmemmap optimization is enabled or not.
For background and results see: lore.kernel.org/all/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sourav Panda <[email protected]> Acked-by: David Rientjes <[email protected]> Reviewed-by: Pasha Tatashin <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Chen Linxuan <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ivan Babrou <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Muchun Song <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Tomas Mudrunka <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Xu <[email protected]> Cc: Yang Yang <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
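The formulas in this commit message can be applied from userspace roughly as follows. The sketch parses /proc/vmstat-style text passed in as a string, so it does not depend on the running kernel; the function names are hypothetical. It uses the field names from this commit, nr_memmap_boot and nr_memmap; the later entry "mm: don't account memmap per-node" in this log renames them to nr_memmap_boot_pages and nr_memmap_pages.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns the value of `key` in /proc/vmstat-style text, or -1 if absent. */
static long long vmstat_value(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		long long v;

		/* Require a space after the key so "nr_memmap" skips "nr_memmap_boot". */
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ' &&
		    sscanf(line + klen, "%lld", &v) == 1)
			return v;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Total per-page metadata in bytes: (boot + buddy) * page size, or -1. */
static long long memmap_bytes(const char *text, long long page_size)
{
	long long boot = vmstat_value(text, "nr_memmap_boot");
	long long buddy = vmstat_value(text, "nr_memmap");

	if (boot < 0 || buddy < 0)
		return -1;
	return (boot + buddy) * page_size;
}
```

In practice the text would come from reading /proc/vmstat and the page size from sysconf(_SC_PAGESIZE); both are left to the caller here to keep the sketch self-contained.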
* mm/vmemmap: allow architectures to override how vmemmap optimization works (Aneesh Kumar K.V, 2023-08-18, 1 file, -0/+3)

Architectures like powerpc would like to use different page table allocators and mapping mechanisms to implement vmemmap optimization. Similar to vmemmap_populate, allow architectures to implement vmemmap_populate_compound_pages.

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Joao Martins <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* mm: ptep_get() conversion (Ryan Roberts, 2023-06-19, 1 file, -4/+4)

Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics.

But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source.

Conversion was done using Coccinelle:

----
// $ make coccicheck \
//          COCCI=ptepget.cocci \
//          SPFLAGS="--include-headers" \
//          MODE=patch

virtual patch

@ depends on patch @
pte_t *v;
@@

- *v
+ ptep_get(v)
----

Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex.

Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n.
This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. Link: https://lkml.kernel.org/r/[email protected] Reported-by: kernel test robot <[email protected]> Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: Ryan Roberts <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Alex Williamson <[email protected]> Cc: Al Viro <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Dave Airlie <[email protected]> Cc: Dimitri Sivanich <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Muchun Song <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Oleksandr Tyshchenko <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Uladzislau Rezki (Sony) <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
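The conversion pattern from this commit, one ptep_get() call whose result is kept in a local variable, can be illustrated with a small user-space sketch. The pte_t layout, the bit positions, and the use of a volatile read to approximate READ_ONCE() are assumptions for illustration; the kernel's default helper and arch overrides may differ.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t pte; } pte_t;

#define PTE_PRESENT (1ULL << 0)   /* hypothetical bit positions */
#define PTE_DIRTY   (1ULL << 6)

/* Default-style helper: one volatile read, approximating READ_ONCE(). */
static inline pte_t ptep_get(const pte_t *ptep)
{
	pte_t v = { *(const volatile uint64_t *)&ptep->pte };

	return v;
}

/* Read the entry once, then test the snapshot instead of re-reading *ptep. */
static int pte_present_and_dirty(const pte_t *ptep)
{
	pte_t pte = ptep_get(ptep);

	return (pte.pte & PTE_PRESENT) && (pte.pte & PTE_DIRTY);
}
```

Testing both bits against the same snapshot avoids two separate loads that hardware could change in between, and it keeps the number of helper calls at one if an architecture supplies a more expensive ptep_get().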
* mm/vmemmap/devdax: fix kernel crash when probing devdax devices (Aneesh Kumar K.V, 2023-04-18, 1 file, -2/+1)

commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps") added support for using optimized vmemmap for devdax devices. But how vmemmap mappings are created is architecture specific. For example, powerpc with hash translation doesn't have vmemmap mappings in the init_mm page table; instead, they are bolted table entries in the hardware page table.

vmemmap_populate_compound_pages(), used by the vmemmap optimization code, is not aware of these architecture-specific mappings. Hence allow architectures to opt in to this feature. I selected architectures supporting the HUGETLB_PAGE_OPTIMIZE_VMEMMAP option as also supporting this feature.

This patch fixes the below crash on ppc64.

BUG: Unable to handle kernel data access on write at 0xc00c000100400038
Faulting instruction address: 0xc000000001269d90
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 7 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc5-150500.34-default+ #2 5c90a668b6bbd142599890245c2fb5de19d7d28a
Hardware name: IBM,9009-42G POWER9 (raw) 0x4e0202 0xf000005 of:IBM,FW950.40 (VL950_099) hv:phyp pSeries
NIP: c000000001269d90 LR: c0000000004c57d4 CTR: 0000000000000000
REGS: c000000003632c30 TRAP: 0300 Not tainted (6.3.0-rc5-150500.34-default+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24842228 XER: 00000000
CFAR: c0000000004c57d0 DAR: c00c000100400038 DSISR: 42000000 IRQMASK: 0
....
NIP [c000000001269d90] __init_single_page.isra.74+0x14/0x4c
LR [c0000000004c57d4] __init_zone_device_page+0x44/0xd0
Call Trace:
[c000000003632ed0] [c000000003632f60] 0xc000000003632f60 (unreliable)
[c000000003632f10] [c0000000004c5ca0] memmap_init_zone_device+0x170/0x250
[c000000003632fe0] [c0000000005575f8] memremap_pages+0x2c8/0x7f0
[c0000000036330c0] [c000000000557b5c] devm_memremap_pages+0x3c/0xa0
[c000000003633100] [c000000000d458a8] dev_dax_probe+0x108/0x3e0
[c0000000036331a0] [c000000000d41430] dax_bus_probe+0xb0/0x140
[c0000000036331d0] [c000000000cef27c] really_probe+0x19c/0x520
[c000000003633260] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
[c0000000036332e0] [c000000000cef888] driver_probe_device+0x58/0x120
[c000000003633320] [c000000000cefa6c] __device_attach_driver+0x11c/0x1e0
[c0000000036333a0] [c000000000cebc58] bus_for_each_drv+0xa8/0x130
[c000000003633400] [c000000000ceefcc] __device_attach+0x15c/0x250
[c0000000036334a0] [c000000000ced458] bus_probe_device+0x108/0x110
[c0000000036334f0] [c000000000ce92dc] device_add+0x7fc/0xa10
[c0000000036335b0] [c000000000d447c8] devm_create_dev_dax+0x1d8/0x530
[c000000003633640] [c000000000d46b60] __dax_pmem_probe+0x200/0x270
[c0000000036337b0] [c000000000d46bf0] dax_pmem_probe+0x20/0x70
[c0000000036337d0] [c000000000d2279c] nvdimm_bus_probe+0xac/0x2b0
[c000000003633860] [c000000000cef27c] really_probe+0x19c/0x520
[c0000000036338f0] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
[c000000003633970] [c000000000cef888] driver_probe_device+0x58/0x120
[c0000000036339b0] [c000000000cefd08] __driver_attach+0x1d8/0x240
[c000000003633a30] [c000000000cebb04] bus_for_each_dev+0xb4/0x130
[c000000003633a90] [c000000000cee564] driver_attach+0x34/0x50
[c000000003633ab0] [c000000000ced878] bus_add_driver+0x218/0x300
[c000000003633b40] [c000000000cf1144] driver_register+0xa4/0x1b0
[c000000003633bb0] [c000000000d21a0c] __nd_driver_register+0x5c/0x100
[c000000003633c10] [c00000000206a2e8] dax_pmem_init+0x34/0x48
[c000000003633c30] [c0000000000132d0] do_one_initcall+0x60/0x320
[c000000003633d00] [c0000000020051b0] kernel_init_freeable+0x360/0x400
[c000000003633de0] [c000000000013764] kernel_init+0x34/0x1d0
[c000000003633e50] [c00000000000de14] ret_from_kernel_thread+0x5c/0x64

Link: https://lkml.kernel.org/r/[email protected] Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps") Signed-off-by: Aneesh Kumar K.V <[email protected]> Reported-by: Tarun Sahu <[email protected]> Reviewed-by: Joao Martins <[email protected]> Cc: Muchun Song <[email protected]> Cc: Dan Williams <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* mm/sparse-vmemmap: generalise vmemmap_populate_hugepages()Feiyang Chen2022-12-121-0/+63
| | | | | | | | | | | | | | | | | | | | | | | | | Generalise vmemmap_populate_hugepages() so ARM64 & X86 & LoongArch can share its implementation. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Feiyang Chen <[email protected]> Signed-off-by: Huacai Chen <[email protected]> Acked-by: Will Deacon <[email protected]> Acked-by: Dave Hansen <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Guo Ren <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: Min Zhou <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Philippe Mathieu-Daudé <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Xuefeng Li <[email protected]> Cc: Xuerui Wang <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* LoongArch: add sparse memory vmemmap supportFeiyang Chen2022-12-121-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | Add sparse memory vmemmap support for LoongArch. SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise pfn_to_page and page_to_pfn operations. This is the most efficient option when sufficient kernel resources are available. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Min Zhou <[email protected]> Signed-off-by: Feiyang Chen <[email protected]> Signed-off-by: Huacai Chen <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Guo Ren <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Philippe Mathieu-Daudé <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Will Deacon <[email protected]> Cc: Xuefeng Li <[email protected]> Cc: Xuerui Wang <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.cMuchun Song2022-08-091-399/+0
| | | | | | | | | | | | | | | | | | | | | When I first introduced vmemmap manipulation functions related to HugeTLB, I thought those functions may be reused by other modules (e.g. using similar approach to optimize vmemmap pages, unfortunately, the DAX used the same approach but does not use those functions). After two years, we didn't see any other users. So move those functions to hugetlb_vmemmap.c. Code movement without any functional change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Will Deacon <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* Merge tag 'mm-stable-2022-08-03' of ↵Linus Torvalds2022-08-051-5/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. 
- mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...
| * mm: sparsemem: drop unexpected word 'a' in commentsXueBing Chen2022-07-041-1/+1
| | | | | | | | | | | | | | | | There is an unexpected word 'a' in the comments that needs to be dropped. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: XueBing Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
| * docs: rename Documentation/vm to Documentation/mmMike Rapoport2022-06-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | so it will be consistent with code mm directory and with Documentation/admin-guide/mm and won't be confused with virtual machines. Signed-off-by: Mike Rapoport <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Tested-by: Ira Weiny <[email protected]> Acked-by: Jonathan Corbet <[email protected]> Acked-by: Wu XiangCheng <[email protected]>
| * mm/sparse-vmemmap.c: remove unwanted initialization in ↵Gautam Menghani2022-06-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vmemmap_populate_compound_pages() Remove unnecessary initialization for the variable 'next'. This fixes the clang scan warning: Value stored to 'next' during its initialization is never read [deadcode.DeadStores] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Gautam Menghani <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Joao Martins <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
| * mm: use PAGE_ALIGNED instead of IS_ALIGNEDFanjun Kong2022-06-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | <linux/mm.h> already provides the PAGE_ALIGNED macro. Let's use this macro instead of IS_ALIGNED and passing PAGE_SIZE directly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Fanjun Kong <[email protected]> Acked-by: Muchun Song <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
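As a rough illustration of why the two spellings are interchangeable, here is a minimal userspace sketch. The `DEMO_*` macros and `demo_range_is_page_aligned()` are hypothetical stand-ins written for this example; they are not the kernel's definitions, and the 4 KiB page size is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for the sketch; real kernels configure this per arch. */
#define DEMO_PAGE_SIZE 4096UL

/* Stand-in for a generic power-of-two alignment check. */
#define DEMO_IS_ALIGNED(x, a) ((((uintptr_t)(x)) & ((uintptr_t)(a) - 1)) == 0)

/* Stand-in for the page-specific shorthand: same check, PAGE_SIZE implied. */
#define DEMO_PAGE_ALIGNED(x) DEMO_IS_ALIGNED((x), DEMO_PAGE_SIZE)

/* Returns true when both ends of [start, end) sit on a page boundary. */
static bool demo_range_is_page_aligned(uintptr_t start, uintptr_t end)
{
	return DEMO_PAGE_ALIGNED(start) && DEMO_PAGE_ALIGNED(end);
}
```

The shorthand only removes the repeated `PAGE_SIZE` argument; the check itself is unchanged.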
* | Merge tag 'efi-next-for-v5.20' of ↵Linus Torvalds2022-08-031-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI updates from Ard Biesheuvel: - Enable mirrored memory for arm64 - Fix up several abuses of the efivar API - Refactor the efivar API in preparation for moving the 'business logic' part of it into efivarfs - Enable ACPI PRM on arm64 * tag 'efi-next-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (24 commits) ACPI: Move PRM config option under the main ACPI config ACPI: Enable Platform Runtime Mechanism(PRM) support on ARM64 ACPI: PRM: Change handler_addr type to void pointer efi: Simplify arch_efi_call_virt() macro drivers: fix typo in firmware/efi/memmap.c efi: vars: Drop __efivar_entry_iter() helper which is no longer used efi: vars: Use locking version to iterate over efivars linked lists efi: pstore: Omit efivars caching EFI varstore access layer efi: vars: Add thin wrapper around EFI get/set variable interface efi: vars: Don't drop lock in the middle of efivar_init() pstore: Add priv field to pstore_record for backend specific use Input: applespi - avoid efivars API and invoke EFI services directly selftests/kexec: remove broken EFI_VARS secure boot fallback check brcmfmac: Switch to appropriate helper to load EFI variable contents iwlwifi: Switch to proper EFI variable store interface media: atomisp_gmin_platform: stop abusing efivar API efi: efibc: avoid efivar API for setting variables efi: avoid efivars layer when loading SSDTs from variables efi: Correct comment on efi_memmap_alloc memblock: Disable mirror feature if kernelcore is not specified ...
| * | mm: Limit warning message in vmemmap_verify() to onceMa Wupeng2022-06-151-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For a system that has only limited mirrored memory, or some NUMA nodes without mirrored memory, the per-node vmemmap page_structs prefer to allocate memory from the mirrored region, which leads vmemmap_verify() in vmemmap_populate_basepages() to report lots of warning messages. This patch changes the frequency of the "potential offnode page_structs" warning message to only once, to avoid a very long print during bootup. Signed-off-by: Ma Wupeng <[email protected]> Acked-by: David Hildenbrand <[email protected]> Link: https://lore.kernel.org/r/[email protected] Acked-by: Mike Rapoport <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]>
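The print-once behaviour described above can be sketched in userspace C as follows. This is only an approximation of the idea behind the kernel's `pr_warn_once()`; the names `demo_verify_node()` and `demo_warnings_printed` are hypothetical and exist only for this example.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts how many warnings were actually emitted (for demonstration only). */
static int demo_warnings_printed;

/*
 * Warn when a structure landed on a different node than expected,
 * but only the first time the mismatch is seen.
 */
static void demo_verify_node(int expected_node, int actual_node)
{
	static bool warned; /* persists across calls */

	if (expected_node == actual_node || warned)
		return;

	warned = true;
	demo_warnings_printed++;
	fprintf(stderr, "potential offnode allocation: wanted node %d, got node %d\n",
		expected_node, actual_node);
}
```

Calling `demo_verify_node(0, 1)` repeatedly prints a single line, which is the effect the commit aims for during boot.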
* / mm: sparsemem: fix missing higher order allocation splittingMuchun Song2022-07-031-0/+8
|/ | | | | | | | | | | | | | | | | | | | Higher order allocations for vmemmap pages from buddy allocator must be able to be treated as independent small pages as they can be freed individually by the caller. There is no problem for higher order vmemmap pages allocated at boot time since each individual small page will be initialized at boot time. However, it will be an issue for memory hotplug case since those higher order vmemmap pages are allocated from buddy allocator without initializing each individual small page's refcount. The system will panic in put_page_testzero() when CONFIG_DEBUG_VM is enabled if the vmemmap page is freed. Link: https://lkml.kernel.org/r/[email protected] Fixes: d8d55f5616cf ("mm: sparsemem: use page table lock to protect kernel pmd operations") Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
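The refcount problem described above can be modelled with a small userspace sketch. Everything here is a simplified assumption: `struct demo_page` holds only a reference count, and `demo_split()` merely imitates the spirit of splitting a higher-order allocation into independently refcounted pages. It is not the kernel's implementation.

```c
/* Hypothetical stand-in for struct page: only a reference count. */
struct demo_page {
	int refcount;
};

/* Models a higher-order allocation: only the first page holds a reference. */
static void demo_alloc_higher_order(struct demo_page *pages, unsigned int order)
{
	unsigned int i, nr = 1u << order;

	for (i = 0; i < nr; i++)
		pages[i].refcount = 0;
	pages[0].refcount = 1;
}

/* Models splitting: every remaining small page gets its own reference. */
static void demo_split(struct demo_page *pages, unsigned int order)
{
	unsigned int i, nr = 1u << order;

	for (i = 1; i < nr; i++)
		pages[i].refcount = 1;
}

/*
 * Drops one reference. Returns 1 if the page became free, 0 if it is still
 * referenced, and -1 for the underflow case that a debug check would flag.
 */
static int demo_put_page(struct demo_page *page)
{
	if (page->refcount <= 0)
		return -1;
	return --page->refcount == 0 ? 1 : 0;
}
```

Without the split step, dropping a reference on any page other than the first hits the underflow branch, which mirrors the panic the commit message reports.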
* mm/sparse-vmemmap: improve memory savings for compound devmapsJoao Martins2022-04-291-10/+122
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A compound devmap is a dev_pagemap with @vmemmap_shift > 0 and it means that pages are mapped at a given huge page alignment and uses compound pages as opposed to order-0 pages. Take advantage of the fact that most tail pages look the same (except the first two) to minimize struct page overhead. Allocate a separate page for the vmemmap area which contains the head page and separate for the next 64 pages. The rest of the subsections then reuse this tail vmemmap page to initialize the rest of the tail pages. Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and when initializing compound devmap with big enough @vmemmap_shift (e.g. 1G PUD) it may cross multiple sections. The vmemmap code needs to consult @pgmap so that multiple sections that all map the same tail data can refer back to the first copy of that data for a given gigantic page. On compound devmaps with 2M align, this mechanism lets 6 pages be saved out of the 8 PFNs necessary to set the subsection's 512 struct pages being mapped. On a 1G compound devmap it saves 4094 pages. Altmap isn't supported yet, given various restrictions in altmap pfn allocator, thus fall back to the already in use vmemmap_populate(). It is worth noting that altmap for devmap mappings was there to relieve the pressure of inordinate amounts of memmap space to map terabytes of pmem. With compound pages the motivation for altmaps for pmem gets reduced. 
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Joao Martins <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jane Chu <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Vishal Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
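The savings figures quoted in the commit message above (6 of 8 vmemmap pages for 2M, 4094 for 1G) can be reproduced with a short calculation. The sketch below assumes a 4 KiB base page and a 64-byte struct page, which matches the "64 struct pages per vmemmap page" figure in the text. The function names are hypothetical.

```c
/* Assumptions for the sketch: 4 KiB base pages, 64-byte struct page. */
#define DEMO_PAGE_SIZE        4096UL
#define DEMO_STRUCT_PAGE_SIZE 64UL

/* Number of vmemmap pages needed to describe one compound page. */
static unsigned long demo_vmemmap_pages(unsigned long compound_bytes)
{
	unsigned long nr_struct_pages = compound_bytes / DEMO_PAGE_SIZE;

	return nr_struct_pages * DEMO_STRUCT_PAGE_SIZE / DEMO_PAGE_SIZE;
}

/*
 * vmemmap pages saved when only two are kept: the one holding the head
 * page and one tail page that is reused for every remaining slot.
 */
static unsigned long demo_vmemmap_pages_saved(unsigned long compound_bytes)
{
	unsigned long total = demo_vmemmap_pages(compound_bytes);

	return total > 2 ? total - 2 : 0;
}
```

For a 2M compound page this gives 512 struct pages in 8 vmemmap pages, so 6 are saved. For 1G it gives 4096 vmemmap pages, so 4094 are saved.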
* mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helperJoao Martins2022-04-291-17/+36
| | | | | | | | | | | | | | | | | | | | | | | In preparation for describing a memmap with compound pages, move the actual pte population logic into a separate function vmemmap_populate_address() and have a new helper vmemmap_populate_range() walk through all base pages it needs to populate. While doing that, change the helper to use a pte_t* as return value, rather than an hardcoded errno of 0 or -ENOMEM. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Joao Martins <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jane Chu <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Vishal Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* mm/sparse-vmemmap: add a pgmap argument to section activationJoao Martins2022-04-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "sparse-vmemmap: memory savings for compound devmaps (device-dax)", v9. This series minimizes 'struct page' overhead by pursuing a similar approach as Muchun Song's series "Free some vmemmap pages of hugetlb page" (now merged since v5.14), but applied to devmap with @vmemmap_shift (device-dax). The original vmemmap deduplication idea (already used in HugeTLB) is to reuse/deduplicate tail page vmemmap areas, particularly the area which only describes tail pages. So a vmemmap page describes 64 struct pages, and the first page for a given ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second vmemmap page would contain only tail pages, and that's what gets reused across the rest of the subsection/section. The bigger the page size, the bigger the savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages). This is done for PMEM /specifically only/ on device-dax configured namespaces, not fsdax. In other words, a devmap with a @vmemmap_shift. In terms of savings, per 1Tb of memory, the struct page cost would go down with compound devmap: * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory) * with 1G pages we lose 40MB instead of 16G (0.0014% instead of 1.5% of total memory) The series is mostly summed up by patch 4, and to summarize what the series does: Patches 1 - 3: Minor cleanups in preparation for patch 4. Move the very nice docs of hugetlb_vmemmap.c into a Documentation/vm/ entry. Patch 4: Patch 4 is the one that takes care of the struct page savings (also referred to here as tail-page/vmemmap deduplication). 
Much like Muchun's series, we reuse the second PTE tail page vmemmap areas across a given @vmemmap_shift. One important difference, though, is that contrary to the hugetlbfs series, there's no vmemmap for the area because we are late-populating it as opposed to remapping a system-ram range. IOW no freeing of pages of already initialized vmemmap like the case for hugetlbfs, which greatly simplifies the logic (besides not being arch-specific). The altmap case is unchanged and still goes via vmemmap_populate(). Also adjust the newly added docs to the device-dax case. [Note that device-dax is still a little behind HugeTLB in terms of savings. I have an additional simple patch that reuses the head vmemmap page too, as a follow-up. That will double the savings and namespaces initialization.] Patch 5: Initialize fewer struct pages depending on the page size with DRAM backed struct pages -- because fewer pages are unique and most tail pages (with bigger vmemmap_shift). NVDIMM namespace bootstrap improves from ~268-358 ms to ~80-110/<1ms on 128G NVDIMMs with 2M and 1G respectively. And struct page needed capacity will be 3.8x / 1071x smaller for 2M and 1G respectively. Tested on x86 with 1.5Tb of pmem (including pinning, and RDMA registration/deregistration scalability with 2M MRs) This patch (of 5): In support of using compound pages for devmap mappings, plumb the pgmap down to the vmemmap_populate implementation. Note that while altmap is retrievable from pgmap the memory hotplug code passes altmap without pgmap[*], so both need to be independently plumbed. 
So in addition to @altmap, pass @pgmap to sparse section populate functions namely: sparse_add_section section_activate populate_section_memmap __populate_section_memmap Passing @pgmap allows __populate_section_memmap() to both fetch the vmemmap_shift in which memmap metadata is created for and also to let sparse-vmemmap fetch pgmap ranges to co-relate to a given section and pick whether to just reuse tail pages from past onlined sections. While at it, fix the kdoc for @altmap for sparse_add_section(). [*] https://lore.kernel.org/linux-mm/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Joao Martins <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jane Chu <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*Muchun Song2022-04-291-2/+2
| | | | | | | | | | | | | The word "free" is not expressive enough to describe the feature of optimizing vmemmap pages associated with each HugeTLB page, so rename this keyword to "optimize". In this patch, clean up configs to make the code more expressive. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* mm: sparsemem: move vmemmap related to HugeTLB to ↵Muchun Song2022-03-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | CONFIG_HUGETLB_PAGE_FREE_VMEMMAP The vmemmap_remap_free/alloc are relevant to HugeTLB, so move those functions to the scope of CONFIG_HUGETLB_PAGE_FREE_VMEMMAP. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Barry Song <[email protected]> Cc: Bodeddula Balasubramaniam <[email protected]> Cc: Chen Huang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
* mm: sparsemem: use page table lock to protect kernel pmd operationsMuchun Song2022-03-221-16/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The init_mm.page_table_lock is used to protect kernel page tables, so we can use it to serialize splitting vmemmap PMD mappings instead of the mmap write lock, which can increase the concurrency of vmemmap_remap_free(). Actually, it increases the concurrency between allocations of HugeTLB pages. But that is not the only benefit. There are a lot of users of the mmap read lock of init_mm. The mmap write lock is held throughout vmemmap_remap_free(); removing the mmap write lock usage means it no longer affects other users of the mmap read lock. It does not make anything worse and is always a win. Now the kernel page table walker does not hold the page_table_lock when walking pmd entries. There may be a consistency issue with a pmd entry, because a pmd entry might change from a huge pmd entry to a PTE page table. There is only one user of the kernel page table walker, namely ptdump. ptdump already considers the consistency, using a local variable to cache the value of the pmd entry. But we also need to update ->action to ACTION_CONTINUE to make sure the walker does not walk every pte entry again when a concurrent thread has split the huge pmd. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Barry Song <[email protected]> Cc: Bodeddula Balasubramaniam <[email protected]> Cc: Chen Huang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
* mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB pageMuchun Song2022-03-221-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Free the 2nd vmemmap page associated with each HugeTLB page", v7. This series can minimize the overhead of struct page for 2MB HugeTLB pages significantly. It further reduces the overhead of struct page by 12.5% for a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks. The main implementation and details can refer to the commit log of patch 1. In this series, I have changed the following four helpers, the following table shows the impact of the overhead of those helpers. +------------------+-----------------------+ | APIs | head page | tail page | +------------------+-----------+-----------+ | PageHead() | Y | N | +------------------+-----------+-----------+ | PageTail() | Y | N | +------------------+-----------+-----------+ | PageCompound() | N | N | +------------------+-----------+-----------+ | compound_head() | Y | N | +------------------+-----------+-----------+ Y: Overhead is increased. N: Overhead is _NOT_ increased. It shows that the overhead of those helpers on a tail page don't change between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off". But the overhead on a head page will be increased when "hugetlb_free_vmemmap=on" (except PageCompound()). So I believe that Matthew Wilcox's folio series will help with this. The users of PageHead() and PageTail() are much less than compound_head() and most users of PageTail() are VM_BUG_ON(), so I have done some tests about the overhead of compound_head() on head pages. 
I have tested the overhead of calling compound_head() on a head page, which is 2.11ns (measured by timing 10 million calls to compound_head() and averaging). For a head page whose address is not aligned with PAGE_SIZE or a non-compound page, the overhead of compound_head() is 2.54ns which is increased by 20%. For a head page whose address is aligned with PAGE_SIZE, the overhead of compound_head() is 2.97ns which is increased by 40%. Most pages are the former. I do not think the overhead is significant since the overhead of compound_head() itself is low. This patch (of 5): This patch minimizes the overhead of struct page for 2MB HugeTLB pages significantly. It further reduces the overhead of struct page by 12.5% for a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB HugeTLB (2MB type). After the feature of "Free some vmemmap pages of HugeTLB page" is enabled, the mapping of the vmemmap addresses associated with a 2MB HugeTLB page becomes the figure below.

       HugeTLB                    struct pages(8 pages)         page frame(8 pages)
    +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
    |           |                     |     0     | -------------> |     0     |
    |           |                     +-----------+                +-----------+
    |           |                     |     1     | -------------> |     1     |
    |           |                     +-----------+                +-----------+
    |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
    |           |                     +-----------+                   | | | | |
    |           |                     |     3     | ------------------+ | | | |
    |           |                     +-----------+                     | | | |
    |           |                     |     4     | --------------------+ | | |
    |    2MB    |                     +-----------+                       | | |
    |           |                     |     5     | ----------------------+ | |
    |           |                     +-----------+                         | |
    |           |                     |     6     | ------------------------+ |
    |           |                     +-----------+                           |
    |           |                     |     7     | --------------------------+
    |           |                     +-----------+
    |           |
    |           |
    +-----------+

As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and remapped. However, the 2nd vmemmap page frame can also be freed to the buddy allocator, and then we can change the mapping from the figure above to the figure below. 
       HugeTLB                    struct pages(8 pages)         page frame(8 pages)
    +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
    |           |                     |     0     | -------------> |     0     |
    |           |                     +-----------+                +-----------+
    |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
    |           |                     +-----------+                  | | | | | |
    |           |                     |     2     | -----------------+ | | | | |
    |           |                     +-----------+                    | | | | |
    |           |                     |     3     | -------------------+ | | | |
    |           |                     +-----------+                      | | | |
    |           |                     |     4     | ---------------------+ | | |
    |    2MB    |                     +-----------+                        | | |
    |           |                     |     5     | -----------------------+ | |
    |           |                     +-----------+                          | |
    |           |                     |     6     | -------------------------+ |
    |           |                     +-----------+                            |
    |           |                     |     7     | ---------------------------+
    |           |                     +-----------+
    |           |
    |           |
    +-----------+

After we do this, all tail vmemmap pages (1-7) are mapped to the head vmemmap page frame (0). In other words, there are more than one page struct with PG_head associated with each HugeTLB page. We __know__ that there is only one head page struct, the tail page structs with PG_head are fake head page structs. We need an approach to distinguish between those two different types of page structs so that compound_head(), PageHead() and PageTail() can work properly if the parameter is the tail page struct but with PG_head. The following code snippet describes how to distinguish between real and fake head page struct.

    if (test_bit(PG_head, &page->flags)) {
            unsigned long head = READ_ONCE(page[1].compound_head);

            if (head & 1) {
                    if (head == (unsigned long)page + 1)
                            ==> head page struct
                    else
                            ==> tail page struct
            } else
                    ==> head page struct
    }

We can safely access the field of the @page[1] with PG_head because the @page is a compound page composed with at least two contiguous pages. 
[[email protected]: restore lost comment changes] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Barry Song <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Chen Huang <[email protected]> Cc: Bodeddula Balasubramaniam <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Xiongchun Duan <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Qi Zheng <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
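The real-versus-fake head check described in the commit message above can be exercised in userspace with a simplified model. In this sketch `struct demo_page`, `DEMO_PG_HEAD` and `demo_is_real_head()` are hypothetical names. A "fake head" is modelled as a second array holding a byte-for-byte copy of the real one, which imitates two virtual addresses backed by the same page frame.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PG_HEAD 0x1UL /* assumed flag bit for this sketch */

/* Hypothetical stand-in for struct page with only the fields the check uses. */
struct demo_page {
	unsigned long flags;
	uintptr_t compound_head; /* bit 0 set: tail page, remaining bits: head address */
};

/*
 * Returns true only for the genuine head page struct. A struct that shows
 * PG_head but whose successor points at a different head is a fake head.
 */
static bool demo_is_real_head(const struct demo_page *page)
{
	uintptr_t head;

	if (!(page->flags & DEMO_PG_HEAD))
		return false;

	head = page[1].compound_head;
	if (head & 1)
		return head == (uintptr_t)page + 1;

	/* The successor is not marked as a tail, so nothing was remapped here. */
	return true;
}
```

The key point is that the second struct page records the address of the true head, so any aliased copy sees a mismatch between that address and its own.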
* mm: remove redundant smp_wmb()Qi Zheng2021-11-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The smp_wmb() which is in the __pte_alloc() is used to ensure all ptes setup is visible before the pte is made visible to other CPUs by being put into page tables. We only need this when the pte is actually populated, so move it to pmd_install(). __pte_alloc_kernel(), __p4d_alloc(), __pud_alloc() and __pmd_alloc() are similar to this case. We can also defer smp_wmb() to the place where the pmd entry is really populated by preallocated pte. There are two kinds of user of preallocated pte, one is filemap & finish_fault(), another is THP. The former does not need another smp_wmb() because the smp_wmb() has been done by pmd_install(). Fortunately, the latter also does not need another smp_wmb() because there is already a smp_wmb() before populating the new pte when the THP uses a preallocated pte to split a huge pmd. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: Muchun Song <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mika Penttila <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
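The idea of paying for the ordering only at the point where a new table is actually made visible can be approximated in userspace with C11 atomics. This is an analogy, not the kernel's code: `demo_slot`, `demo_install()` and `demo_lookup()` are hypothetical, and a release compare-and-exchange stands in for the write barrier followed by the populate step.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_ENTRIES 8

/* Hypothetical lower-level table that must be fully set up before use. */
struct demo_table {
	int entries[DEMO_ENTRIES];
};

/* Hypothetical upper-level slot that other threads may read concurrently. */
static _Atomic(struct demo_table *) demo_slot;

/*
 * Prepare a table, then publish it only if the slot is still empty.
 * The release ordering applies on success, i.e. only when the table is
 * really installed; a losing caller pays no ordering cost for its writes.
 */
static bool demo_install(struct demo_table *new_table)
{
	struct demo_table *expected = NULL;
	size_t i;

	for (i = 0; i < DEMO_ENTRIES; i++)
		new_table->entries[i] = 0;

	return atomic_compare_exchange_strong_explicit(&demo_slot, &expected,
						       new_table,
						       memory_order_release,
						       memory_order_relaxed);
}

/* Readers pair with the release above so they never see a half-built table. */
static struct demo_table *demo_lookup(void)
{
	return atomic_load_explicit(&demo_slot, memory_order_acquire);
}
```

A caller whose `demo_install()` returns false would simply discard its preallocated table, much as a preallocated page table is freed when another thread populated the entry first.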
* mm: sparsemem: split the huge PMD mapping of vmemmap pagesMuchun Song2021-07-011-38/+125
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Split huge PMD mapping of vmemmap pages", v4. In order to reduce the difficulty of code review in series [1], we disabled huge PMD mapping of vmemmap pages when that feature is enabled. In this series, we do not disable huge PMD mapping of vmemmap pages anymore. We will split huge PMD mapping when needed. When HugeTLB pages are freed from the pool we do not attempt to coalesce and move back to a PMD mapping because it is much more complex. [1] https://lore.kernel.org/linux-doc/[email protected]/ This patch (of 3): In [1], PMD mappings of vmemmap pages were disabled if the feature hugetlb_free_vmemmap was enabled. This was done to simplify the initial implementation of vmemmap freeing for hugetlb pages. Now, remove this simplification by allowing PMD mapping and switching to PTE mappings as needed for allocated hugetlb pages. When a hugetlb page is allocated, the vmemmap page tables are walked to free vmemmap pages. During this walk, split huge PMD mappings to PTE mappings as required. In the unlikely case PTE pages can not be allocated, return error(ENOMEM) and do not optimize vmemmap of the hugetlb page. When HugeTLB pages are freed from the pool, we do not attempt to coalesce and move back to a PMD mapping because it is much more complex. [1] https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Chen Huang <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Xiongchun Duan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
* mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB pageMuchun Song2021-07-011-1/+74
When we free a HugeTLB page to the buddy allocator, we need to allocate the vmemmap pages associated with it. However, we may not be able to allocate the vmemmap pages when the system is under memory pressure. In this case, we just refuse to free the HugeTLB page. This changes behavior in some corner cases as listed below:

 1) Failing to free a huge page triggered by the user (decrease nr_pages). The user needs to try again later.

 2) Failing to free a surplus huge page when freed by the application. Try again later when freeing a huge page next time.

 3) Failing to dissolve a free huge page on ZONE_MOVABLE via offline_pages(). This can happen when we have plenty of ZONE_MOVABLE memory, but not enough kernel memory to allocate vmemmap pages. We may even be able to migrate huge page contents, but will not be able to dissolve the source huge page. This will prevent an offline operation and is unfortunate as memory offlining is expected to succeed on movable zones. Users that depend on memory hotplug to succeed for movable zones should carefully consider whether the memory savings gained from this feature are worth the risk of possibly not being able to offline memory in certain situations.

 4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via alloc_contig_range(), once we have that handling in place. Mainly affects CMA and virtio-mem. Similar to 3). virtio-mem will handle migration errors gracefully. CMA might be able to fall back on other free areas within the CMA region.

Vmemmap pages are allocated from the page freeing context. In order for those allocations not to be disruptive (e.g. trigger the oom killer), __GFP_NORETRY is used. hugetlb_lock is dropped for the allocation because a non-sleeping allocation would be too fragile and could fail too easily under memory pressure. GFP_ATOMIC and other modes that access memory reserves are not used because we want to avoid consuming reserves under heavy hugetlb freeing.

[[email protected]: fix dissolve_free_huge_page use of tail/head page]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix alloc_vmemmap_page_list documentation warning]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Bodeddula Balasubramaniam <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Chen Huang <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: HORIGUCHI NAOYA <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Oliver Neukum <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Pawan Gupta <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
* mm: hugetlb: free the vmemmap pages associated with each HugeTLB pageMuchun Song2021-07-011-0/+194
Every HugeTLB page has more than one struct page structure. We __know__ that we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to store metadata associated with each HugeTLB page.

There are a lot of struct page structures associated with each HugeTLB page. For tail pages, the value of compound_head is the same, so we can reuse the first page of tail page structures. We map the virtual addresses of the remaining pages of tail page structures to the first tail page struct, and then free these page frames. Therefore, we need to reserve two pages as vmemmap areas.

When we allocate a HugeTLB page from the buddy allocator, we can free some of the vmemmap pages associated with it. It is most appropriate to do this in prep_new_huge_page().

free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages associated with a HugeTLB page can be freed, returns zero for now, which means the feature is disabled. We will enable it once all the infrastructure is there.

[[email protected]: fix documentation warning]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Tested-by: Chen Huang <[email protected]>
Tested-by: Bodeddula Balasubramaniam <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: HORIGUCHI NAOYA <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Mina Almasry <[email protected]>
Cc: Oliver Neukum <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Pawan Gupta <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Xiongchun Duan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
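The saving described in the commit above can be worked through with a little arithmetic. The sketch below is illustrative only: the 4 KiB base page and the 64-byte struct page are assumptions (typical for x86_64, not stated in the commit), and the ex_ names are hypothetical. Only the two reserved vmemmap pages come from the commit message.

```c
#include <assert.h>

/* Assumed values for illustration; they are not taken from the commit. */
#define EX_PAGE_SIZE        4096UL /* base page size in bytes */
#define EX_STRUCT_PAGE_SIZE 64UL   /* assumed sizeof(struct page) */
/* The commit message says two vmemmap pages are reserved per HugeTLB page. */
#define EX_RESERVED_PAGES   2UL

/* Number of vmemmap pages holding the struct pages of one huge page. */
static unsigned long ex_vmemmap_pages(unsigned long hpage_size)
{
	unsigned long nr_struct_pages;

	assert(hpage_size % EX_PAGE_SIZE == 0);
	nr_struct_pages = hpage_size / EX_PAGE_SIZE;
	return nr_struct_pages * EX_STRUCT_PAGE_SIZE / EX_PAGE_SIZE;
}

/* Number of vmemmap pages that could be freed once the feature is enabled. */
static unsigned long ex_free_vmemmap_pages(unsigned long hpage_size)
{
	unsigned long total = ex_vmemmap_pages(hpage_size);

	return total > EX_RESERVED_PAGES ? total - EX_RESERVED_PAGES : 0;
}
```

Under these assumptions a 2 MiB HugeTLB page is described by 8 vmemmap pages, of which 6 could be freed, and a 1 GiB page by 4096, of which 4094 could be freed.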
* mm/sparse: only sub-section aligned range would be populatedWei Yang2020-08-071-13/+5
There are two code paths which invoke __populate_section_memmap():

  * sparse_init_nid()
  * sparse_add_section()

In both cases, we are sure the memory range is sub-section aligned:

  * we pass PAGES_PER_SECTION to sparse_init_nid()
  * we check the range with check_pfn_span() before calling sparse_add_section()

Also, in the counterpart of __populate_section_memmap(), we don't do such a calculation and check, since the range is checked by check_pfn_span() in __remove_pages().

Remove the calculation and check to keep it simple and consistent with its counterpart.

Signed-off-by: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
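As a rough illustration of what "sub-section aligned" means for a pfn range, the sketch below checks both the start pfn and the page count against the sub-section size. It is not the kernel's check_pfn_span(); the ex_ names are hypothetical, and the 4 KiB page, 2 MiB sub-section and 64-byte struct page are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed for illustration: 4 KiB pages and 2 MiB sub-sections. */
#define EX_PAGE_SHIFT           12
#define EX_SUBSECTION_SHIFT     21
#define EX_PAGES_PER_SUBSECTION (1UL << (EX_SUBSECTION_SHIFT - EX_PAGE_SHIFT))
#define EX_STRUCT_PAGE_SIZE     64UL /* assumed sizeof(struct page) */

/* True if [pfn, pfn + nr_pages) starts and ends on a sub-section boundary. */
static bool ex_subsection_aligned(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long mask = EX_PAGES_PER_SUBSECTION - 1;

	return nr_pages != 0 && (pfn & mask) == 0 && (nr_pages & mask) == 0;
}

/*
 * Callee in the spirit of the commit: it trusts that its callers already
 * validated the range and only documents the contract with an assert.
 */
static unsigned long ex_memmap_bytes(unsigned long pfn, unsigned long nr_pages)
{
	assert(ex_subsection_aligned(pfn, nr_pages));
	return nr_pages * EX_STRUCT_PAGE_SIZE;
}
```

Keeping the validation in the callers and leaving only a contract check in the callee mirrors the reasoning of the commit: both call paths already guarantee alignment, so repeating the calculation in the populate function adds nothing.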
* mm/sparsemem: enable vmem_altmap support in vmemmap_alloc_block_buf()Anshuman Khandual2020-08-071-15/+13
There are many instances where vmemmap allocation is switched between regular memory and device memory just based on whether an altmap is available or not. vmemmap_alloc_block_buf() is used on various platforms to allocate vmemmap mappings. Let's also enable it to handle altmap-based device memory allocation along with the existing regular memory allocations. This will help avoid the altmap-based allocation switch in many places. To summarize, there are two different ways to call vmemmap_alloc_block_buf():

  vmemmap_alloc_block_buf(size, node, NULL)   /* Allocate from system RAM */
  vmemmap_alloc_block_buf(size, node, altmap) /* Allocate from altmap */

This converts altmap_alloc_block_buf() into a static function, drops its entry from the header and updates Documentation/vm/memory-model.rst.

Suggested-by: Robin Murphy <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Jia He <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Hsin-Yi Wang <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Steve Capper <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Yu Zhao <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
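The calling convention from the commit above, one entry point that picks the backing store depending on whether an altmap is passed, can be sketched in user space as follows. This is a simplified model, not the kernel implementation: struct ex_altmap and the ex_ functions are hypothetical, calloc() stands in for system RAM, and a caller-provided byte pool stands in for device memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vmem_altmap: a pre-reserved device area. */
struct ex_altmap {
	unsigned char *base; /* start of the reserved area */
	size_t size;         /* total bytes in the area */
	size_t used;         /* bytes already handed out */
};

/* Allocate from the altmap area; returns NULL once the area is exhausted. */
static void *ex_altmap_alloc(struct ex_altmap *altmap, size_t size)
{
	void *p;

	assert(altmap->used <= altmap->size);
	if (altmap->size - altmap->used < size)
		return NULL;
	p = altmap->base + altmap->used;
	altmap->used += size;
	return p;
}

/*
 * Single entry point, mirroring the two call forms in the commit message:
 * a NULL altmap means "system RAM", a non-NULL altmap means "device memory".
 */
static void *ex_alloc_block_buf(size_t size, int node, struct ex_altmap *altmap)
{
	(void)node; /* NUMA node is ignored in this user-space sketch */
	if (altmap)
		return ex_altmap_alloc(altmap, size);
	return calloc(1, size);
}
```

Folding the decision into the allocator means call sites no longer need their own "if (altmap) ... else ..." branches, which is the simplification the commit message describes.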
* mm/sparsemem: enable vmem_altmap support in vmemmap_populate_basepages()Anshuman Khandual2020-08-071-5/+11
Patch series "arm64: Enable vmemmap mapping from device memory", v4.

This series enables vmemmap backing memory allocation from device memory ranges on arm64. But before that, it enables vmemmap_populate_basepages() and vmemmap_alloc_block_buf() to accommodate struct vmem_altmap based allocation requests.

This patch (of 3):

vmemmap_populate_basepages() is used across platforms to allocate backing memory for the vmemmap mapping. It is used as the default choice or as a fallback when the intended huge page allocation fails, and it creates the entire vmemmap mapping with base pages (PAGE_SIZE).

On arm64 platforms, vmemmap_populate_basepages() is called instead of the platform-specific vmemmap_populate() when ARM64_SWAPPER_USES_SECTION_MAPS is not enabled, as is the case for ARM64_16K_PAGES and ARM64_64K_PAGES configs.

At present, vmemmap_populate_basepages() does not support allocating from a driver-defined struct vmem_altmap while trying to create the vmemmap mapping for a device memory range. This prevents ARM64_16K_PAGES and ARM64_64K_PAGES configs on arm64 from supporting device memory with vmem_altmap requests.

This enables vmem_altmap support in vmemmap_populate_basepages(), unlocking device memory allocation for the vmemmap mapping on arm64 platforms with 16K or 64K base page configs.

Each architecture should evaluate and decide whether to opt in to device-memory-based base page allocation through vmemmap_populate_basepages(). Hence, let's keep it disabled on all archs in order to preserve the existing semantics. A subsequent patch enables it on arm64.

Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Jia He <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Will Deacon <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Hsin-Yi Wang <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Steve Capper <[email protected]>
Cc: Yu Zhao <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
* mm: don't include asm/pgtable.h if linux/mm.h is already includedMike Rapoport2020-06-091-1/+0
Patch series "mm: consolidate definitions of page table accessors", v2.

The low level page table accessors (pXY_index(), pXY_offset()) are duplicated across all architectures and sometimes more than once. For instance, we have 31 definitions of pgd_offset() for 25 supported architectures.

Most of these definitions are actually identical and typically it boils down to, e.g.

  static inline unsigned long pmd_index(unsigned long address)
  {
          return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
  }

  static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
  {
          return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
  }

These definitions can be shared among 90% of the arches provided XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

For architectures that really need a custom version, there is always the possibility to override the generic version with the usual ifdefs magic.

These patches introduce include/linux/pgtable.h, which replaces include/asm-generic/pgtable.h, and add the definitions of the page table accessors to the new header.

This patch (of 12):

The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the functions involving page table manipulations, e.g. pte_alloc() and pmd_alloc(). So, there is no point in explicitly including <asm/pgtable.h> in the files that include <linux/mm.h>.

The include statements in such cases are removed with a simple loop:

  for f in $(git grep -l "include <linux/mm.h>") ; do
          sed -i -e '/include <asm\/pgtable.h>/ d' $f
  done

Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Cain <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nick Hu <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vincent Chen <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
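To make the accessor arithmetic in the commit above concrete, here is a stand-alone sketch of a generic pmd_index()-style helper, including the "define the name to itself" guard that lets an architecture supply its own version. The ex_ names are hypothetical, the shift of 21 and the 512 entries per table are assumptions (the usual x86_64 values with 4 KiB pages), and ex_pmd_offset() takes the table directly instead of dereferencing a pud_t as the kernel helper does.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout for illustration: 2 MiB per PMD entry, 512 entries per table. */
#define EX_PMD_SHIFT    21
#define EX_PTRS_PER_PMD 512UL

typedef struct { unsigned long val; } ex_pmd_t;

/*
 * Generic definition, used only if no override was provided earlier.
 * An architecture would define its own ex_pmd_index() plus the matching
 * "#define ex_pmd_index ex_pmd_index" before this point to replace it.
 */
#ifndef ex_pmd_index
static inline unsigned long ex_pmd_index(unsigned long address)
{
	return (address >> EX_PMD_SHIFT) & (EX_PTRS_PER_PMD - 1);
}
#define ex_pmd_index ex_pmd_index
#endif

/* Simplified pmd_offset(): index into a PMD table that is passed in directly. */
static inline ex_pmd_t *ex_pmd_offset(ex_pmd_t *pmd_table, unsigned long address)
{
	assert(pmd_table != NULL);
	return pmd_table + ex_pmd_index(address);
}
```

With these values, address 0x200000 selects entry 1, address 0x3fffffff selects the last entry (511), and address 0x40000000 wraps back to entry 0 because it belongs to the next table one level up.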
* mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()Dan Williams2019-07-191-7/+14
Allow sub-section sized ranges to be added to the memmap.

populate_section_memmap() takes an explicit pfn range rather than assuming a full section, and those parameters are plumbed all the way through to vmemmap_populate(). There should be no sub-section usage in current deployments. New warnings are added to clarify which memmap allocation paths are sub-section capable.

Link: http://lkml.kernel.org/r/156092352058.979959.6551283472062305149.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]> [ppc64]
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
* mm: remove include/linux/bootmem.hMike Rapoport2018-10-311-1/+0
Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header.

The includes were replaced with the semantic patch below, followed by semi-automated removal of duplicated '#include <linux/memblock.h>' lines:

  @@
  @@
  - #include <linux/bootmem.h>
  + #include <linux/memblock.h>

[[email protected]: dma-direct: fix up for the removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: powerpc: fix up for removal of linux/bootmem.h]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Serge Semin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
* memblock: replace BOOTMEM_ALLOC_* with MEMBLOCK variantsMike Rapoport2018-10-311-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drop BOOTMEM_ALLOC_ACCESSIBLE and BOOTMEM_ALLOC_ANYWHERE in favor of identical MEMBLOCK definitions. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chris Zankel <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Burton <[email protected]> Cc: Richard Kuo <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Serge Semin <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
* memblock: remove _virt from APIs returning virtual addressMike Rapoport2018-10-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The conversion is done using sed -i 's@memblock_virt_alloc@memblock_alloc@g' \ $(git grep -l memblock_virt_alloc) Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chris Zankel <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Mark Salter <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Simek <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Burton <[email protected]> Cc: Richard Kuo <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Serge Semin <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
* mm/sparse: delete old sparse_init and enable new onePavel Tatashin2018-08-171-21/+0
Rename new_sparse_init() to sparse_init(), which enables it. Delete the old sparse_init() and all the code that became obsolete with it.

[[email protected]: remove unused sparse_mem_maps_populate_node()]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Tested-by: Michael Ellerman <[email protected]> [powerpc]
Tested-by: Oscar Salvador <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Abdul Haleem <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: Steven Sistare <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
* mm/sparse: move buffer init/fini to the common placePavel Tatashin2018-08-171-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that both variants of sparse memory use the same buffers to populate memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the common place. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Tatashin <[email protected]> Tested-by: Michael Ellerman <[email protected]> [powerpc] Tested-by: Oscar Salvador <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Abdul Haleem <[email protected]> Cc: Baoquan He <[email protected]> Cc: Daniel Jordan <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Souptick Joarder <[email protected]> Cc: Steven Sistare <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
* mm/sparse: abstract sparse buffer allocationsPavel Tatashin2018-08-171-34/+6
Patch series "sparse_init rewrite", v6.

In sparse_init() we allocate two large buffers to temporarily hold the usemap and memmap for the whole machine. However, we can avoid doing that if we change sparse_init() to operate on a per-node basis instead of doing it for the whole machine beforehand.

As shown by Baoquan in http://lkml.kernel.org/r/[email protected], the buffers are large enough to prevent small-memory systems from booting.

Another benefit of these changes is that they also obsolete CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.

This patch (of 5):

When struct pages are allocated for the sparse-vmemmap VA layout, we first try to allocate one large buffer, and then, if that fails, allocate struct pages for each section as we go.

The code that allocates the buffer uses global variables and is spread across several call sites.

Clean up the code by introducing three functions to handle the global buffer:

  sparse_buffer_init()   initialize the buffer
  sparse_buffer_fini()   free the remaining part of the buffer
  sparse_buffer_alloc()  alloc from the buffer, and if the buffer is empty return NULL

Define these functions in sparse.c instead of sparse-vmemmap.c because later we will use them for non-vmemmap sparse allocations as well.

[[email protected]: use PTR_ALIGN()]
[[email protected]: s/BUG_ON/WARN_ON/]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Tested-by: Michael Ellerman <[email protected]> [powerpc]
Reviewed-by: Oscar Salvador <[email protected]>
Tested-by: Oscar Salvador <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Steven Sistare <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Abdul Haleem <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
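The three-function interface described in the commit above behaves like a simple bump allocator over one global buffer. The following user-space sketch shows that shape under stated assumptions: the ex_ names are hypothetical, malloc() stands in for the boot-time allocator, alignment handling is omitted, and fini releases the whole backing region (the kernel only gives back the unused tail because the allocated part stays in use).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Global buffer state, mirroring the "one large buffer" from the commit. */
static unsigned char *ex_base;    /* start of the backing allocation */
static unsigned char *ex_buf;     /* next free byte */
static unsigned char *ex_buf_end; /* one past the last byte */

/* Try to set up the large buffer; allocation failure is tolerated. */
static void ex_sparse_buffer_init(size_t size)
{
	assert(ex_base == NULL); /* no buffer may be active already */
	ex_base = malloc(size);
	ex_buf = ex_base;
	ex_buf_end = ex_base ? ex_base + size : NULL;
}

/* Carve size bytes from the buffer; NULL if the buffer is missing or too small. */
static void *ex_sparse_buffer_alloc(size_t size)
{
	void *p;

	if (!ex_buf || (size_t)(ex_buf_end - ex_buf) < size)
		return NULL;
	p = ex_buf;
	ex_buf += size;
	return p;
}

/*
 * Tear down the buffer and report how many bytes were never handed out.
 * In this sketch the whole region is freed, so earlier pointers become invalid.
 */
static size_t ex_sparse_buffer_fini(void)
{
	size_t unused = ex_buf ? (size_t)(ex_buf_end - ex_buf) : 0;

	free(ex_base);
	ex_base = ex_buf = ex_buf_end = NULL;
	return unused;
}
```

A caller that gets NULL from the alloc function is expected to fall back to a per-section allocation, which matches the "and then, if that fails" behaviour the commit message describes.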
* mm/sparse: optimize memmap allocation during sparse_init()Baoquan He2018-08-171-2/+4
In sparse_init(), two temporary pointer arrays, usemap_map and map_map, are allocated with the size of NR_MEM_SECTIONS. They are used to store each memory section's usemap and mem map if marked as present. With the help of these two arrays, a contiguous memory chunk is allocated for the usemap and memmap of the memory sections on one node. This avoids excessive memory fragmentation. In the diagram below, '1' indicates a present memory section and '0' an absent one. The number 'n' could be much smaller than NR_MEM_SECTIONS on most systems.

  |1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0|
  -------------------------------------------------------
   0 1 2 3         4 5         i   i+1     n-1   n

If we fail to populate the page tables to map one section's memmap, its ->section_mem_map will finally be cleared to indicate that it is not present. After use, these two arrays are released at the end of sparse_init().

In 4-level paging mode, each array costs 4M, which is negligible. In 5-level paging, however, they cost 256M each, 512M altogether. A kdump kernel usually reserves very little memory, e.g. 256M. So, even though they are only temporarily allocated, this is still not acceptable.

In fact, there is no need to allocate them with the size of NR_MEM_SECTIONS. Since the ->section_mem_map clearing has been deferred to the end, the number of present memory sections stays the same during sparse_init() until we finally clear the ->section_mem_map of any memory section whose usemap or memmap is not correctly handled. Thus, whenever the for_each_present_section_nr() loop is taken in the middle, the i-th present memory section is always the same one.

Here, only allocate usemap_map and map_map with the size of 'nr_present_sections'. For the i-th present memory section, install its usemap and memmap to usemap_map[i] and map_map[i] during allocation. Then, in the last for_each_present_section_nr() loop, which clears the failed memory section's ->section_mem_map, fetch the usemap and memmap from the usemap_map[] and map_map[] arrays and set them into mem_section[] accordingly.

[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
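The 4M and 256M figures in the commit above can be reproduced with a short calculation. The sketch assumes x86_64 parameters that the commit message does not spell out: 128 MiB sections (27 bits), 46 physical address bits with 4-level paging, 52 with 5-level paging, and 8-byte pointers. All ex_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed x86_64 values for illustration. */
#define EX_SECTION_SIZE_BITS 27   /* 128 MiB per memory section */
#define EX_PTR_SIZE          8ULL /* bytes per array slot (one pointer) */

/* Size in bytes of one pointer array with a slot for every possible section. */
static unsigned long long ex_map_bytes(unsigned int max_physmem_bits)
{
	unsigned long long nr_mem_sections;

	assert(max_physmem_bits >= EX_SECTION_SIZE_BITS);
	nr_mem_sections = 1ULL << (max_physmem_bits - EX_SECTION_SIZE_BITS);
	return nr_mem_sections * EX_PTR_SIZE;
}

/* Count present sections; the compact arrays need only this many slots. */
static size_t ex_nr_present(const unsigned char *present, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (present[i])
			count++;
	return count;
}
```

With 46 bits the array has 2^19 slots, i.e. 4 MiB; with 52 bits it has 2^25 slots, i.e. 256 MiB. Sizing the arrays by the present count instead makes the cost proportional to the memory actually installed.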
* mm/sparsemem.c: defer the ms->section_mem_map clearingBaoquan He2018-08-171-4/+0
In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, the system allocates one contiguous memory chunk for the mem maps on one node and populates the relevant page tables to map the memory sections one by one. If populating fails for a certain mem section, a warning is printed and its ->section_mem_map is cleared to cancel the marking of being present. As a result, the number of mem sections marked as present can decrease during sparse_init() execution.

Here, just defer the ms->section_mem_map clearing, if populating its page tables failed, until the last for_each_present_section_nr() loop. This is in preparation for later optimizing the mem map allocation.

[[email protected]: remove now-unused local `ms', per Oscar]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Baoquan He <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
* mm: merge vmem_altmap_alloc into altmap_alloc_block_bufChristoph Hellwig2018-01-081-29/+16
There is no clear separation between the two, so merge them.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
* mm: split altmap memory map allocation from normal caseChristoph Hellwig2018-01-081-12/+3
No functional changes, just untangling the call chain and documenting why the altmap is passed around the hotplug code.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Dan Williams <[email protected]>