aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'block-6.17-20250918' of git://git.kernel.dk/linuxLinus Torvalds2025-09-195-0/+5
|\ | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "A set of fixes for an issue with md array assembly and drbd for devices supporting write zeros" * tag 'block-6.17-20250918' of git://git.kernel.dk/linux: drbd: init queue_limits->max_hw_wzeroes_unmap_sectors parameter md: init queue_limits->max_hw_wzeroes_unmap_sectors parameter
| * md: init queue_limits->max_hw_wzeroes_unmap_sectors parameterZhang Yi2025-09-165-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The parameter max_hw_wzeroes_unmap_sectors in queue_limits should be equal to max_write_zeroes_sectors if it is set to a non-zero value. However, the stacked md drivers call md_init_stacking_limits() to initialize this parameter to UINT_MAX but only adjust max_write_zeroes_sectors when setting limits. Therefore, this discrepancy triggers a value check failure in blk_validate_limits(). $ modprobe scsi_debug num_parts=2 dev_size_mb=8 lbprz=1 lbpws=1 $ mdadm --create /dev/md0 --level=0 --raid-device=2 /dev/sda1 /dev/sda2 mdadm: Defaulting to version 1.2 metadata mdadm: RUN_ARRAY failed: Invalid argument Fix this failure by explicitly setting max_hw_wzeroes_unmap_sectors to max_write_zeroes_sectors. Since the linear and raid0 drivers support write zeroes, so they can support unmap write zeroes operation if all of the backend devices support it. However, the raid1/10/5 drivers don't support write zeroes, so we have to set it to zero. Fixes: 0c40d7cb5ef3 ("block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits") Reported-by: John Garry <[email protected]> Closes: https://lore.kernel.org/linux-block/[email protected]/ Signed-off-by: Zhang Yi <[email protected]> Tested-by: John Garry <[email protected]> Reviewed-by: Li Nan <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
* | Merge tag 'for-6.17/dm-fixes' of ↵Linus Torvalds2025-09-173-6/+12
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mikulas Patocka: - fix integer overflow in dm-stripe - limit tag size in dm-integrity to 255 bytes - fix 'alignment inconsistency' warning in dm-raid * tag 'for-6.17/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-raid: don't set io_min and io_opt for raid1 dm-integrity: limit MAX_TAG_SIZE to 255 dm-stripe: fix a possible integer overflow
| * dm-raid: don't set io_min and io_opt for raid1Mikulas Patocka2025-09-171-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These commands modprobe brd rd_size=1048576 vgcreate vg /dev/ram* lvcreate -m4 -L10 -n lv vg trigger the following warnings: device-mapper: table: 252:10: adding target device (start sect 0 len 24576) caused an alignment inconsistency device-mapper: table: 252:10: adding target device (start sect 0 len 24576) caused an alignment inconsistency The warnings are caused by the fact that io_min is 512 and physical block size is 4096. If there's chunk-less raid, such as raid1, io_min shouldn't be set to zero because it would be raised to 512 and it would trigger the warning. Signed-off-by: Mikulas Patocka <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Cc: [email protected]
| * dm-integrity: limit MAX_TAG_SIZE to 255Mikulas Patocka2025-09-081-1/+1
| | | | | | | | | | | | | | MAX_TAG_SIZE was 0x1a8 and it may be truncated in the "bi->metadata_size = ic->tag_size" assignment. We need to limit it to 255. Signed-off-by: Mikulas Patocka <[email protected]>
| * dm-stripe: fix a possible integer overflowMikulas Patocka2025-08-111-3/+7
| | | | | | | | | | | | | | | | | | | | | | There's a possible integer overflow in stripe_io_hints if we have too large chunk size. Test if the overflow happened, and if it did, don't set limits->io_min and limits->io_opt; Signed-off-by: Mikulas Patocka <[email protected]> Reviewed-by: John Garry <[email protected]> Suggested-by: Dongsheng Yang <[email protected]> Cc: [email protected]
* | md: prevent incorrect update of resync/recovery offsetLi Nan2025-09-041-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In md_do_sync(), when md_sync_action returns ACTION_FROZEN, subsequent call to md_sync_position() will return MaxSector. This causes 'curr_resync' (and later 'recovery_offset') to be set to MaxSector too, which incorrectly signals that recovery/resync has completed, even though disk data has not actually been updated. To fix this issue, skip updating any offset values when the sync action is FROZEN. The same holds true for IDLE. Fixes: 7d9f107a4e94 ("md: use new helpers in md_do_sync()") Signed-off-by: Li Nan <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
* | md/raid1: fix data lost for writemostly rdevYu Kuai2025-09-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If writemostly is enabled, alloc_behind_master_bio() will allocate a new bio for rdev, with bi_opf set to 0. Later, raid1_write_request() will clone from this bio, hence bi_opf is still 0 for the cloned bio. Submit this cloned bio will end up to be read, causing write data lost. Fix this problem by inheriting bi_opf from original bio for behind_mast_bio. Fixes: e879a0d9cb08 ("md/raid1,raid10: don't ignore IO flags") Reported-and-tested-by: Ian Dall <[email protected]> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220507 Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Li Nan <[email protected]>
* | md: fix sync_action incorrect display during resyncZheng Qixing2025-08-161-2/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During raid resync, if a disk becomes faulty, the operation is briefly interrupted. The MD_RECOVERY_RECOVER flag triggered by the disk failure causes sync_action to incorrectly show "recover" instead of "resync". The same issue affects reshape operations. Reproduction steps: mdadm -Cv /dev/md1 -l1 -n4 -e1.2 /dev/sd{a..d} // -> resync happened mdadm -f /dev/md1 /dev/sda // -> resync interrupted cat sync_action -> recover Add progress checks in md_sync_action() for resync/recover/reshape to ensure the interface correctly reports the actual operation type. Fixes: 4b10a3bc67c1 ("md: ensure resync is prioritized over recovery") Signed-off-by: Zheng Qixing <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
* | md: add helper rdev_needs_recovery()Zheng Qixing2025-08-161-11/+12
| | | | | | | | | | | | | | | | Add a helper for checking if an rdev needs recovery. Signed-off-by: Zheng Qixing <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
* | md: keep recovery_cp in mdp_superblock_sXiao Ni2025-08-161-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | commit 907a99c314a5 ("md: rename recovery_cp to resync_offset") replaces recovery_cp with resync_offset in mdp_superblock_s which is in md_p.h. md_p.h is used in userspace too. So mdadm building fails because of this. This patch revert this change. Fixes: 907a99c314a5 ("md: rename recovery_cp to resync_offset") Signed-off-by: Xiao Ni <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
* | md: add legacy_async_del_gendisk modeXiao Ni2025-08-131-14/+42
|/ | | | | | | | | | | | | | | | | | | | | | | | commit 9e59d609763f ("md: call del_gendisk in control path") changes the async way to sync way of calling del_gendisk. But it breaks mdadm --assemble command. The assemble command runs like this: 1. create the array 2. stop the array 3. access the sysfs files after stopping The sync way calls del_gendisk in step 2, so all sysfs files are removed. Now to avoid breaking mdadm assemble command, this patch adds the parameter legacy_async_del_gendisk that can be used to choose which way. The default is async way. In future, we plan to change default to sync way in kernel 7.0. Then users need to upgrade to mdadm 4.5+ which removes step 2. Fixes: 9e59d609763f ("md: call del_gendisk in control path") Reported-by: Mikulas Patocka <[email protected]> Closes: https://lore.kernel.org/linux-raid/CAMw=ZnQ=ET2St-+hnhsuq34rRPnebqcXqP1QqaHW5Bh4aaaZ4g@mail.gmail.com/T/#t Suggested-and-reviewed-by: Yu Kuai <[email protected]> Signed-off-by: Xiao Ni <[email protected]> Reviewed-by: Paul Menzel <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
* Merge tag 'block-6.17-20250808' of git://git.kernel.dk/linuxLinus Torvalds2025-08-0912-173/+144
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more block updates from Jens Axboe: - MD pull request via Yu: - mddev null-ptr-dereference fix, by Erkun - md-cluster fail to remove the faulty disk regression fix, by Heming - minor cleanup, by Li Nan and Jinchao - mdadm lifetime regression fix reported by syzkaller, by Yu Kuai - MD pull request via Christoph - add support for getting the FDP featuee in fabrics passthru path (Nitesh Shetty) - add capability to connect to an administrative controller (Kamaljit Singh) - fix a leak on sgl setup error (Keith Busch) - initialize discovery subsys after debugfs is initialized (Mohamed Khalfella) - fix various comment typos (Bjorn Helgaas) - remove unneeded semicolons (Jiapeng Chong) - nvmet debugfs ordering issue fix - Fix UAF in the tag_set in zloop - Ensure sbitmap shallow depth covers entire set - Reduce lock roundtrips in io context lookup - Move scheduler tags alloc/free out of elevator and freeze lock, to fix some lockdep found issues - Improve robustness of queue limits checking - Fix a regression with IO priorities, if no io context exists * tag 'block-6.17-20250808' of git://git.kernel.dk/linux: (26 commits) lib/sbitmap: make sbitmap_get_shallow() internal lib/sbitmap: convert shallow_depth from one word to the whole sbitmap nvmet: exit debugfs after discovery subsystem exits block, bfq: Reorder struct bfq_iocq_bfqq_data md: make rdev_addable usable for rcu mode md/raid1: remove struct pool_info and related code md/raid1: change r1conf->r1bio_pool to a pointer type block: ensure discard_granularity is zero when discard is not supported zloop: fix KASAN use-after-free of tag set block: Fix default IO priority if there is no IO context nvme: fix various comment typos nvme-auth: remove unneeded semicolon nvme-pci: fix leak on sgl setup error nvmet: initialize discovery subsys after debugfs is initialized nvme: add capability to connect to an administrative controller nvmet: add support for FDP in fabrics passthru path md: rename recovery_cp to resync_offset md/md-cluster: handle REMOVE message earlier md: fix create on open mddev lifetime regression block: fix potential deadlock while running nr_hw_queue update ...
| * md: make rdev_addable usable for rcu modeYang Erkun2025-08-031-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Our testcase trigger panic: BUG: kernel NULL pointer dereference, address: 00000000000000e0 ... Oops: Oops: 0000 [#1] SMP NOPTI CPU: 2 UID: 0 PID: 85 Comm: kworker/2:1 Not tainted 6.16.0+ #94 PREEMPT(none) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 Workqueue: md_misc md_start_sync RIP: 0010:rdev_addable+0x4d/0xf0 ... Call Trace: <TASK> md_start_sync+0x329/0x480 process_one_work+0x226/0x6d0 worker_thread+0x19e/0x340 kthread+0x10f/0x250 ret_from_fork+0x14d/0x180 ret_from_fork_asm+0x1a/0x30 </TASK> Modules linked in: raid10 CR2: 00000000000000e0 ---[ end trace 0000000000000000 ]--- RIP: 0010:rdev_addable+0x4d/0xf0 md_spares_need_change in md_start_sync will call rdev_addable which protected by rcu_read_lock/rcu_read_unlock. This rcu context will help protect rdev won't be released, but rdev->mddev will be set to NULL before we call synchronize_rcu in md_kick_rdev_from_array. Fix this by using READ_ONCE and check does rdev->mddev still alive. Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration") Fixes: 570b9147deb6 ("md: use RCU lock to protect traversal in md_spares_need_change()") Signed-off-by: Yang Erkun <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
| * md/raid1: remove struct pool_info and related codeWang Jinchao2025-08-032-55/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | The struct pool_info was originally introduced mainly to support reshape operations, serving as a parameter for mempool_init() when raid_disks changes. Now that mempool_create_kmalloc_pool() is sufficient for this purpose, struct pool_info and its related code are no longer needed. Remove struct pool_info and all associated code. Signed-off-by: Wang Jinchao <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
| * md/raid1: change r1conf->r1bio_pool to a pointer typeWang Jinchao2025-08-032-22/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In raid1_reshape(), newpool is a stack variable. mempool_init() initializes newpool->wait with the stack address. After assigning newpool to conf->r1bio_pool, the wait queue need to be reinitialized, which is not ideal. Change raid1_conf->r1bio_pool to a pointer type and replace mempool_init() with mempool_create_kmalloc_pool() to avoid referencing a stack-based wait queue. Signed-off-by: Wang Jinchao <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
| * md: rename recovery_cp to resync_offsetLi Nan2025-07-3011-94/+94
| | | | | | | | | | | | | | | | | | | | 'recovery_cp' was used to represent the progress of sync, but its name contains recovery, which can cause confusion. Replaces 'recovery_cp' with 'resync_offset' for clarity. Signed-off-by: Li Nan <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
| * md/md-cluster: handle REMOVE message earlierHeming Zhao2025-07-301-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit a1fd37f97808 ("md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl") introduced a regression in the md_cluster module. (Failed cases 02r1_Manage_re-add & 02r10_Manage_re-add) Consider a 2-node cluster: - node1 set faulty & remove command on a disk. - node2 must correctly update the array metadata. Before a1fd37f97808, on node1, the delay between msg:METADATA_UPDATED (triggered by faulty) and msg:REMOVE was sufficient for node2 to reload the disk info (written by node1). After a1fd37f97808, node1 no longer waits between faulty and remove, causing it to send msg:REMOVE while node2 is still reloading disk info. This often results in node2 failing to remove the faulty disk. == how to trigger == set up a 2-node cluster (node1 & node2) with disks vdc & vdd. on node1: mdadm -CR /dev/md0 -l1 -b clustered -n2 /dev/vdc /dev/vdd --assume-clean ssh node2-ip mdadm -A /dev/md0 /dev/vdc /dev/vdd mdadm --manage /dev/md0 --fail /dev/vdc --remove /dev/vdc check array status on both nodes with "mdadm -D /dev/md0". node1 output: Number Major Minor RaidDevice State - 0 0 0 removed 1 254 48 1 active sync /dev/vdd node2 output: Number Major Minor RaidDevice State - 0 0 0 removed 1 254 48 1 active sync /dev/vdd 0 254 32 - faulty /dev/vdc Fixes: a1fd37f97808 ("md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl") Signed-off-by: Heming Zhao <[email protected]> Reviewed-by: Su Yue <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
| * md: fix create on open mddev lifetime regressionYu Kuai2025-07-301-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 9e59d609763f ("md: call del_gendisk in control path") moves setting MD_DELETED from __mddev_put() to do_md_stop(), however, for the case create on open, mddev can be freed without do_md_stop(): 1) open md_probe md_alloc_and_put md_alloc mddev_alloc atomic_set(&mddev->active, 1); mddev->hold_active = UNTIL_IOCTL mddev_put atomic_dec_and_test(&mddev->active) if (mddev->hold_active) -> active is 0, hold_active is set md_open mddev_get atomic_inc(&mddev->active); 2) ioctl that is not STOP_ARRAY, for example, GET_ARRAY_INFO: md_ioctl mddev->hold_active = 0 3) close md_release mddev_put(mddev); atomic_dec_and_lock(&mddev->active, &all_mddevs_lock) __mddev_put -> hold_active is cleared, mddev will be freed queue_work(md_misc_wq, &mddev->del_work) Now that MD_DELETED is not set, before mddev is freed by mddev_delayed_delete(), md_open can still succeed and break mddev lifetime, causing mddev->kobj refcount underflow or mddev uaf problem. Fix this problem by setting MD_DELETED before queuing del_work. Reported-by: [email protected] Closes: https://lore.kernel.org/all/[email protected]/ Reported-by: [email protected] Closes: https://lore.kernel.org/all/[email protected]/ Fixes: 9e59d609763f ("md: call del_gendisk in control path") Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Paul Menzel <[email protected]> Reviewed-by: Xiao Ni <[email protected]>
* | Merge tag 'for-6.17/dm-changes' of ↵Linus Torvalds2025-08-0419-253/+102
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - fix checking for request-based stackable devices (dm-table) - fix corrupt_bio_byte setup checks (dm-flakey) - add support for resync w/o metadata devices (dm raid) - small code simplification (dm, dm-mpath, vm-vdo, dm-raid) - remove support for asynchronous hashes (dm-verity) - close smatch warning (dm-zoned-target) - update the documentation and enable inline-crypto passthrough (dm-thin) * tag 'for-6.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: set DM_TARGET_PASSES_CRYPTO feature for dm-thin dm-thin: update the documentation dm-raid: do not include dm-core.h vdo: omit need_resched() before cond_resched() md: dm-zoned-target: Initialize return variable r to avoid uninitialized use dm-verity: remove support for asynchronous hashes dm-mpath: don't print the "loaded" message if registering fails dm-mpath: make dm_unregister_path_selector return void dm: ima: avoid extra calls to strlen() dm: Simplify dm_io_complete() dm: Remove unnecessary return in dm_zone_endio() dm raid: add support for resync w/o metadata devices dm-flakey: Fix corrupt_bio_byte setup checks dm-table: fix checking for rq stackable devices
| * | dm: set DM_TARGET_PASSES_CRYPTO feature for dm-thinLongPing Wei2025-07-311-3/+4
| | | | | | | | | | | | | | | | | | | | | dm-thin obviously can pass through inline crypto support. Signed-off-by: LongPing Wei <[email protected]> Signed-off-by: Mikulas Patocka <[email protected]>
| * | dm-raid: do not include dm-core.hPavel Tikhomirov2025-07-311-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 4cc96131afce ("dm: move request-based code out to dm-rq.[hc]") we have a note: "DM targets should _never_ include dm-core.h!". And it is not used in any DM targets except dm-raid now, so let's remove it from dm-raid for consistency, also use special helpers instead of accessing dm_table and mapper_device fields directly. This change is merely a cleanup and should not affect functionality. Signed-off-by: Pavel Tikhomirov <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Signed-off-by: Mikulas Patocka <[email protected]>
| * | vdo: omit need_resched() before cond_resched()Mikulas Patocka2025-07-311-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | There's no need to call need_resched() because cond_resched() will do nothing if need_resched() returns false. Reviewed-by: Matthew Sakai <[email protected]> Signed-off-by: Mikulas Patocka <[email protected]>
| * | md: dm-zoned-target: Initialize return variable r to avoid uninitialized usePurva Yeshi2025-07-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix Smatch-detected error: drivers/md/dm-zoned-target.c:1073 dmz_iterate_devices() error: uninitialized symbol 'r'. Smatch detects a possible use of the uninitialized variable 'r' in dmz_iterate_devices() because if dmz->nr_ddevs is zero, the loop is skipped and 'r' is returned without being set, leading to undefined behavior. Initialize 'r' to 0 before the loop. This ensures that if there are no devices to iterate over, the function still returns a defined value. Signed-off-by: Purva Yeshi <[email protected]> Signed-off-by: Mikulas Patocka <[email protected]>
| * | dm-verity: remove support for asynchronous hashesEric Biggers2025-07-313-173/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The support for asynchronous hashes in dm-verity has outlived its usefulness. It adds significant code complexity and opportunity for bugs. I don't know of anyone using it in practice. (The original submitter of the code possibly was, but that was 8 years ago.) Data I recently collected for en/decryption shows that using off-CPU crypto "accelerators" is consistently much slower than the CPU (https://lore.kernel.org/r/[email protected]/), even on CPUs that lack dedicated cryptographic instructions. Similar results are likely to be seen for hashing. I already removed support for asynchronous hashes from fsverity two years ago, and no one ever complained. Moreover, neither dm-verity, fsverity, nor fscrypt has ever actually used the asynchronous crypto algorithms in a truly asynchronous manner. The lack of interest in such optimizations provides further evidence that it's only the CPU-based crypto that actually matters. Historically, it's also been common for people to forget to enable the optimized SHA-256 code, which could contribute to an off-CPU crypto engine being perceived as more useful than it really is. In 6.16 I fixed that: the optimized SHA-256 code is now enabled by default. Therefore, let's drop the support for asynchronous hashes in dm-verity. Tested with verity-compat-test. Acked-by: Ard Biesheuvel <[email protected]> Signed-off-by: Eric Biggers <[email protected]> Signed-off-by: Mikulas Patocka <[email protected]>
| * | dm-mpath: don't print the "loaded" message if registering failsMikulas Patocka2025-06-304-4/+12
| | | | | | | | | | | | | | | | | | If dm_register_path_selector, don't print the "version X loaded" message. Signed-off-by: Mikulas Patocka <[email protected]>
| * | dm-mpath: make dm_unregister_path_selector return voidMikulas Patocka2025-06-307-26/+9
| | | | | | | | | | | | | | | | | | | | | | | | dm_unregister_path_selector may only return error if there's a bug in the code - so we make it return void and print a warning if the user abuses this function to unregister a target that was not registered. Signed-off-by: Mikulas Patocka <[email protected]>
| * | dm: ima: avoid extra calls to strlen()Dmitry Antipov2025-06-271-23/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | Since 'scnprintf()' returns the number of characters emitted (not including the trailing '\0'), use that return value instead of the subsequent calls to 'strlen()' where appropriate. Compile tested only. Signed-off-by: Dmitry Antipov <[email protected]> Signed-off-by: Mikulas Patocka <[email protected]>
| * | dm: Simplify dm_io_complete()Damien Le Moal2025-06-271-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The local variable first_requeue is not needed since it is always equal to dm_io_flagged(io, DM_IO_WAS_SPLIT). Call __dm_io_complete() passing this value directly and remove first_requeue. Also declare dm_io_complete() as inline to make sure it is inlined in its single call site, thus avoiding the cost of a function call. Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Mikulas Patocka <[email protected]>
| * | dm: Remove unnecessary return in dm_zone_endio()Damien Le Moal2025-06-271-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | The return statement at the end of dm_zone_endio() is not needed. Remove it. Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Mikulas Patocka <[email protected]>
| * | dm raid: add support for resync w/o metadata devicesHeinz Mauelshagen2025-06-231-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Target does not honour the "sync" argument when activated w/o metadata devices, e.g. with table line: "0 $(blockdev --getsz $data1) raid raid1 2 0 sync 2 - $data1 - $data2". Fix this to support temporary, transient raid devices useful for data duplication. Signed-off-by: Heinz Mauelshagen <[email protected]> Signed-off-by: Mikulas Patocka <[email protected]>
| * | dm-flakey: Fix corrupt_bio_byte setup checksKent Overstreet2025-06-231-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the error_reads mode - it's incompatible with corrupt_bio_byte, but that's only enabled if corrupt_bio_byte is nonzero. Cc: Benjamin Marzinski <[email protected]> Cc: Mikulas Patocka <[email protected]> Cc: Mike Snitzer <[email protected]> Cc: [email protected] Signed-off-by: Kent Overstreet <[email protected]> Reviewed-by: Benjamin Marzinski <[email protected]> Fixes: 19da6b2c9e8e ("dm-flakey: Clean up parsing messages") Signed-off-by: Mikulas Patocka <[email protected]>
| * | dm-table: fix checking for rq stackable devicesBenjamin Marzinski2025-06-231-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to the semantics of iterate_devices(), the current code allows a request-based dm table as long as it includes one request-stackable device. It is supposed to only allow tables where there are no non-request-stackable devices. Signed-off-by: Benjamin Marzinski <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Signed-off-by: Mikulas Patocka <[email protected]>
* | | Merge tag 'mm-stable-2025-07-30-15-25' of ↵Linus Torvalds2025-07-316-11/+10
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "As usual, many cleanups. The below blurbiage describes 42 patchsets. 21 of those are partially or fully cleanup work. "cleans up", "cleanup", "maintainability", "rationalizes", etc. I never knew the MM code was so dirty. "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes) addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for merging with existing adjacent VMAs. "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park) adds a new kernel module which simplifies the setup and usage of DAMON in production environments. "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig) is a cleanup to the writeback code which removes a couple of pointers from struct writeback_control. "drivers/base/node.c: optimization and cleanups" (Donet Tom) contains largely uncorrelated cleanups to the NUMA node setup and management code. "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman) does some maintenance work on the userfaultfd code. "Readahead tweaks for larger folios" (Ryan Roberts) implements some tuneups for pagecache readahead when it is reading into order>0 folios. "selftests/mm: Tweaks to the cow test" (Mark Brown) provides some cleanups and consistency improvements to the selftests code. "Optimize mremap() for large folios" (Dev Jain) does that. A 37% reduction in execution time was measured in a memset+mremap+munmap microbenchmark. "Remove zero_user()" (Matthew Wilcox) expunges zero_user() in favor of the more modern memzero_page(). "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand) addresses some warts which David noticed in the huge page code. These were not known to be causing any issues at this time. "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park) provides some cleanup and consolidation work in DAMON. "use vm_flags_t consistently" (Lorenzo Stoakes) uses vm_flags_t in places where we were inappropriately using other types. "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy) increases the reliability of large page allocation in the memfd code. "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple) removes several now-unneeded PFN_* flags. "mm/damon: decouple sysfs from core" (SeongJae Park) implememnts some cleanup and maintainability work in the DAMON sysfs layer. "madvise cleanup" (Lorenzo Stoakes) does quite a lot of cleanup/maintenance work in the madvise() code. "madvise anon_name cleanups" (Vlastimil Babka) provides additional cleanups on top or Lorenzo's effort. "Implement numa node notifier" (Oscar Salvador) creates a standalone notifier for NUMA node memory state changes. Previously these were lumped under the more general memory on/offline notifier. "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan) cleans up the pageblock isolation code and fixes a potential issue which doesn't seem to cause any problems in practice. "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park) adds additional drgn- and python-based DAMON selftests which are more comprehensive than the existing selftest suite. "Misc rework on hugetlb faulting path" (Oscar Salvador) fixes a rather obscure deadlock in the hugetlb fault code and follows that fix with a series of cleanups. "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport) rationalizes and cleans up the highmem-specific code in the CMA allocator. "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand) provides cleanups and future-preparedness to the migration code. "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park) adds some tracepoints to some DAMON auto-tuning code. "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park) does that. "mm/damon: misc cleanups" (SeongJae Park) also does what it claims. "mm: folio_pte_batch() improvements" (David Hildenbrand) cleans up the large folio PTE batching code. "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park) facilitates dynamic alteration of DAMON's inter-node allocation policy. "Remove unmap_and_put_page()" (Vishal Moola) provides a couple of page->folio conversions. "mm: per-node proactive reclaim" (Davidlohr Bueso) implements a per-node control of proactive reclaim - beyond the current memcg-based implementation. "mm/damon: remove damon_callback" (SeongJae Park) replaces the damon_callback interface with a more general and powerful damon_call()+damos_walk() interface. "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes) implements a number of mremap cleanups (of course) in preparation for adding new mremap() functionality: newly permit the remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It still excludes some specialized situations where this cannot be performed reliably. "drop hugetlb_free_pgd_range()" (Anthony Yznaga) switches some sparc hugetlb code over to the generic version and removes the thus-unneeded hugetlb_free_pgd_range(). "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park) augments the present userspace-requested update of DAMON sysfs monitoring files. Automatic update is now provided, along with a tunable to control the update interval. "Some randome fixes and cleanups to swapfile" (Kemeng Shi) does what is claims. "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand) provides (and uses) a means by which debug-style functions can grab a copy of a pageframe and inspect it locklessly without tripping over the races inherent in operating on the live pageframe directly. "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan) addresses the large contention issues which can be triggered by reads from that procfs file. Latencies are reduced by more than half in some situations. The series also introduces several new selftests for the /proc/pid/maps interface. "__folio_split() clean up" (Zi Yan) cleans up __folio_split()! "Optimize mprotect() for large folios" (Dev Jain) provides some quite large (>3x) speedups to mprotect() when dealing with large folios. "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian) does some cleanup work in the selftests code. "tools/testing: expand mremap testing" (Lorenzo Stoakes) extends the mremap() selftest in several ways, including adding more checking of Lorenzo's recently added "permit mremap() move of multiple VMAs" feature. "selftests/damon/sysfs.py: test all parameters" (SeongJae Park) extends the DAMON sysfs interface selftest so that it tests all possible user-requested parameters. Rather than the present minimal subset" * tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits) MAINTAINERS: add missing headers to mempory policy & migration section MAINTAINERS: add missing file to cgroup section MAINTAINERS: add MM MISC section, add missing files to MISC and CORE MAINTAINERS: add missing zsmalloc file MAINTAINERS: add missing files to page alloc section MAINTAINERS: add missing shrinker files MAINTAINERS: move memremap.[ch] to hotplug section MAINTAINERS: add missing mm_slot.h file THP section MAINTAINERS: add missing interval_tree.c to memory mapping section MAINTAINERS: add missing percpu-internal.h file to per-cpu section mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info() selftests/damon: introduce _common.sh to host shared function selftests/damon/sysfs.py: test runtime reduction of DAMON parameters selftests/damon/sysfs.py: test non-default parameters runtime commit selftests/damon/sysfs.py: generalize DAMON context commit assertion selftests/damon/sysfs.py: generalize monitoring attributes commit assertion selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion selftests/damon/sysfs.py: test DAMOS filters commitment selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion selftests/damon/sysfs.py: test DAMOS destinations commitment ...
| * | mm: remove callers of pfn_t functionalityAlistair Popple2025-07-106-11/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All PFN_* pfn_t flags have been removed. Therefore there is no longer a need for the pfn_t type and all uses can be replaced with normal pfns. Link: https://lkml.kernel.org/r/bbedfa576c9822f8032494efbe43544628698b1f.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Björn Töpel <[email protected]> Cc: Björn Töpel <[email protected]> Cc: Chunyan Zhang <[email protected]> Cc: Dan Williams <[email protected]> Cc: Deepak Gupta <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Inki Dae <[email protected]> Cc: John Groves <[email protected]> Cc: John Hubbard <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
* | | Merge tag 'for-6.17/block-20250728' of git://git.kernel.dk/linuxLinus Torvalds2025-07-289-72/+158
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: - MD pull request via Yu: - call del_gendisk synchronously (Xiao) - cleanup unused variable (John) - cleanup workqueue flags (Ryo) - fix faulty rdev can't be removed during resync (Qixing) - NVMe pull request via Christoph: - try PCIe function level reset on init failure (Keith Busch) - log TLS handshake failures at error level (Maurizio Lombardi) - pci-epf: do not complete commands twice if nvmet_req_init() fails (Rick Wertenbroek) - misc cleanups (Alok Tiwari) - Removal of the pktcdvd driver This has been more than a decade coming at this point, and some recently revealed breakages that had it causing issues even for cases where it isn't required made me re-pull the trigger on this one. It's known broken and nobody has stepped up to maintain the code - Series for ublk supporting batch commands, enabling the use of multishot where appropriate - Speed up ublk exit handling - Fix for the two-stage elevator fixing which could leak data - Convert NVMe to use the new IOVA based API - Increase default max transfer size to something more reasonable - Series fixing write operations on zoned DM devices - Add tracepoints for zoned block device operations - Prep series working towards improving blk-mq queue management in the presence of isolated CPUs - Don't allow updating of the block size of a loop device that is currently under exclusively ownership/open - Set chunk sectors from stacked device stripe size and use it for the atomic write size limit - Switch to folios in bcache read_super() - Fix for CD-ROM MRW exit flush handling - Various tweaks, fixes, and cleanups * tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits) block: restore two stage elevator switch while running nr_hw_queue update cdrom: Call cdrom_mrw_exit from cdrom_release function sunvdc: Balance device refcount in vdc_port_mpgroup_check nvme-pci: try function level reset on init failure dm: split write BIOs on zone boundaries when zone append is not emulated block: use chunk_sectors when evaluating stacked atomic write limits dm-stripe: limit chunk_sectors to the stripe size md/raid10: set chunk_sectors limit md/raid0: set chunk_sectors limit block: sanitize chunk_sectors for atomic write limits ilog2: add max_pow_of_two_factor() nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails nvme-tcp: log TLS handshake failures at error level docs: nvme: fix grammar in nvme-pci-endpoint-target.rst nvme: fix typo in status code constant for self-test in progress nvmet: remove redundant assignment of error code in nvmet_ns_enable() nvme: fix incorrect variable in io cqes error message nvme: fix multiple spelling and grammar issues in host drivers block: fix blk_zone_append_update_request_bio() kernel-doc md/raid10: fix set but not used variable in sync_request_write() ...
| * \ \ Merge tag 'md-6.17-20250722' of ↵Jens Axboe2025-07-224-36/+68
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.17/block Pull MD updates from Yu: "- call del_gendisk synchronously, from Xiao - cleanup unused variable, from John - cleanup workqueue flags, from Ryo - fix faulty rdev can't be removed during resync, from Qixing" * tag 'md-6.17-20250722' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux: md/raid10: fix set but not used variable in sync_request_write() md: allow removing faulty rdev during resync md/raid5: unset WQ_CPU_INTENSIVE for raid5 unbound workqueue md: remove/add redundancy group only in level change md: Don't clear MD_CLOSING until mddev is freed md: call del_gendisk in control path
| | * | | md/raid10: fix set but not used variable in sync_request_write()John Garry2025-07-161-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Building with W=1 reports the following: drivers/md/raid10.c: In function ‘sync_request_write’: drivers/md/raid10.c:2441:21: error: variable ‘d’ set but not used [-Werror=unused-but-set-variable] 2441 | int d; | ^ cc1: all warnings being treated as errors Remove the usage of that variable. Fixes: 752d0464b78a ("md: clean up accounting for issued sync IO") Signed-off-by: John Garry <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
| | * | | md: allow removing faulty rdev during resyncZheng Qixing2025-07-121-7/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During RAID resync, faulty rdev cannot be removed and will result in "Device or resource busy" error when attempting hot removal. Reproduction steps: mdadm -Cv /dev/md0 -l1 -n3 -e1.2 /dev/sd{b..d} mdadm /dev/md0 -f /dev/sdb mdadm /dev/md0 -r /dev/sdb -> mdadm: hot remove failed for /dev/sdb: Device or resource busy After commit 4b10a3bc67c1 ("md: ensure resync is prioritized over recovery"), when a device becomes faulty during resync, the md_choose_sync_action() function returns early without calling remove_and_add_spares(), preventing faulty device removal. This patch extracts a helper function remove_spares() to support removing faulty devices during RAID resync operations. Fixes: 4b10a3bc67c1 ("md: ensure resync is prioritized over recovery") Signed-off-by: Zheng Qixing <[email protected]> Reviewed-by: Li Nan <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
| | * | | md/raid5: unset WQ_CPU_INTENSIVE for raid5 unbound workqueueRyo Takakura2025-07-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When specified with WQ_CPU_INTENSIVE, the workqueue doesn't participate in concurrency management. This behaviour is already accounted for WQ_UNBOUND workqueues given that they are assigned to their own worker threads. Unset WQ_CPU_INTENSIVE as the use of flag has no effect when used with WQ_UNBOUND. Signed-off-by: Ryo Takakura <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
| | * | | md: remove/add redundancy group only in level changeXiao Ni2025-07-121-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | del_gendisk is called in synchronous way now. So it doesn't need to handle redundancy group in stop path separately. Reviewed-by: Yu Kuai <[email protected]> Signed-off-by: Xiao Ni <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
| | * | | md: Don't clear MD_CLOSING until mddev is freedXiao Ni2025-07-121-12/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UNTIL_STOP is used to avoid mddev is freed on the last close before adding disks to mddev. And it should be cleared when stopping an array which is mentioned in commit efeb53c0e572 ("md: Allow md devices to be created by name."). So reset ->hold_active to 0 in md_clean. And MD_CLOSING should be kept until mddev is freed to avoid reopen. Reviewed-by: Yu Kuai <[email protected]> Signed-off-by: Xiao Ni <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
| | * | | md: call del_gendisk in control pathXiao Ni2025-07-122-12/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now del_gendisk and put_disk are called asynchronously in workqueue work. The asynchronous way has a problem that the device node can still exist after mdadm --stop command returns in a short window. So udev rule can open this device node and create the struct mddev in kernel again. So put del_gendisk in control path and still leave put_disk in md_kobj_release to avoid uaf of gendisk. Function del_gendisk can't be called with reconfig_mutex. If it's called with reconfig mutex, a deadlock can happen. del_gendisk waits all sysfs files access to finish and sysfs file access waits reconfig mutex. So put del_gendisk after releasing reconfig mutex. But there is still a window that sysfs can be accessed between mddev_unlock and del_gendisk. So some actions (add disk, change level, .e.g) can happen which lead unexpected results. MD_DELETED is used to resolve this problem. MD_DELETED is set before releasing reconfig mutex and it should be checked for these sysfs access which need reconfig mutex. For sysfs access which don't need reconfig mutex, del_gendisk will wait them to finish. But it doesn't need to do this in function mddev_lock_nointr. There are ten places that call it. * Five of them are in dm raid which we don't need to care. MD_DELETED is only used for md raid. * stop_sync_thread, md_do_sync and md_start_sync are related sync request, and it needs to wait sync thread to finish before stopping an array. * md_ioctl: md_open is called before md_ioctl, so ->openers is added. It will fail to stop the array. So it doesn't need to check MD_DELETED here * md_set_readonly: It needs to call mddev_set_closing_and_sync_blockdev when setting readonly or read_auto. So it will fail to stop the array too because MD_CLOSING is already set. Reviewed-by: Yu Kuai <[email protected]> Signed-off-by: Xiao Ni <[email protected]> Link: https://lore.kernel.org/linux-raid/[email protected] Signed-off-by: Yu Kuai <[email protected]>
| * | | | dm: split write BIOs on zone boundaries when zone append is not emulatedShin'ichiro Kawasaki2025-07-171-11/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 2df7168717b7 ("dm: Always split write BIOs to zoned device limits") updates the device-mapper driver to perform splits for the write BIOs. However, it did not address the cases where DM targets do not emulate zone append, such as in the cases of dm-linear or dm-flakey. For these targets, when the write BIOs span across zone boundaries, they trigger WARN_ON_ONCE(bio_straddles_zones(bio)) in blk_zone_wplug_handle_write(). This results in I/O errors. The errors are reproduced by running blktests test case zbd/004 using zoned dm-linear or dm-flakey devices. To avoid the I/O errors, handle the write BIOs regardless whether DM targets emulate zone append or not, so that all write BIOs are split at zone boundaries. For that purpose, drop the check for zone append emulation in dm_zone_bio_needs_split(). Its argument 'md' is no longer used then drop it also. Fixes: 2df7168717b7 ("dm: Always split write BIOs to zoned device limits") Signed-off-by: Shin'ichiro Kawasaki <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Mikulas Patocka <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
| * | | | dm-stripe: limit chunk_sectors to the stripe sizeJohn Garry2025-07-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Same as done for raid0, set chunk_sectors limit to appropriately set the atomic write size limit. Setting chunk_sectors limit in this way overrides the stacked limit already calculated based on the bottom device limits. This is ok, as when any bios are sent to the bottom devices, the block layer will still respect the bottom device chunk_sectors. Reviewed-by: Nilay Shroff <[email protected]> Reviewed-by: Mikulas Patocka <[email protected]> Signed-off-by: John Garry <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
| * | | | md/raid10: set chunk_sectors limitJohn Garry2025-07-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Same as done for raid0, set chunk_sectors limit to appropriately set the atomic write size limit. Reviewed-by: Nilay Shroff <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Signed-off-by: John Garry <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
| * | | | md/raid0: set chunk_sectors limitJohn Garry2025-07-171-0/+1
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we use min io size as the chunk size when deciding on the atomic write size limits - see blk_stack_atomic_writes_head(). The limit min_io size is not a reliable value to store the chunk size, as this may be mutated by the block stacking code. Such an example would be for the min io size less than the physical block size, and the min io size is raised to the physical block size - see blk_stack_limits(). The block stacking limits will rely on chunk_sectors in future, so set this value (to the chunk size). Reviewed-by: Nilay Shroff <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Signed-off-by: John Garry <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
| * | | bcache: switch from pages to folios in read_super()Matthew Wilcox (Oracle)2025-07-031-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Retrieve a folio from the page cache instead of a page. Removes a hidden call to compound_head(). Then be sure to call folio_put() instead of put_page() to release it. That doesn't save any calls to compound_head(), just moves them around. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Coly Li <[email protected]> Acked-back: Coly Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: commit message massaging] Signed-off-by: Jens Axboe <[email protected]>
| * | | dm: Check for forbidden splitting of zone write operationsDamien Le Moal2025-06-301-4/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DM targets must not split zone append and write operations using dm_accept_partial_bio() as doing so is forbidden for zone append BIOs, breaks zone append emulation using regular write BIOs and potentially creates deadlock situations with queue freeze operations. Modify dm_accept_partial_bio() to add missing BUG_ON() checks for all these cases, that is, check that the BIO is a write or write zeroes operation. This change packs all the zone related checks together under a static_branch_unlikely(&zoned_enabled) and done only if the target is a zoned device. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: [email protected] Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Mikulas Patocka <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
| * | | dm: dm-crypt: Do not partially accept write BIOs with zoned targetsDamien Le Moal2025-06-301-9/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Read and write operations issued to a dm-crypt target may be split according to the dm-crypt internal limits defined by the max_read_size and max_write_size module parameters (default is 128 KB). The intent is to improve processing time of large BIOs by splitting them into smaller operations that can be parallelized on different CPUs. For zoned dm-crypt targets, this BIO splitting is still done but without the parallel execution to ensure that the issuing order of write operations to the underlying devices remains sequential. However, the splitting itself causes other problems: 1) Since dm-crypt relies on the block layer zone write plugging to handle zone append emulation using regular write operations, the reminder of a split write BIO will always be plugged into the target zone write plugged. Once the on-going write BIO finishes, this reminder BIO is unplugged and issued from the zone write plug work. If this reminder BIO itself needs to be split, the reminder will be re-issued and plugged again, but that causes a call to a blk_queue_enter(), which may block if a queue freeze operation was initiated. This results in a deadlock as DM submission still holds BIOs that the queue freeze side is waiting for. 2) dm-crypt relies on the emulation done by the block layer using regular write operations for processing zone append operations. This still requires to properly return the written sector as the BIO sector of the original BIO. However, this can be done correctly only and only if there is a single clone BIO used for processing the original zone append operation issued by the user. If the size of a zone append operation is larger than dm-crypt max_write_size, then the orginal BIO will be split and processed as a chain of regular write operations. Such chaining result in an incorrect written sector being returned to the zone append issuer using the original BIO sector. This in turn results in file system data corruptions using xfs or btrfs. Fix this by modifying get_max_request_size() to always return the size of the BIO to avoid it being split with dm_accpet_partial_bio() in crypt_map(). get_max_request_size() is renamed to get_max_request_sectors() to clarify the unit of the value returned and its interface is changed to take a struct dm_target pointer and a pointer to the struct bio being processed. In addition to this change, to ensure that crypt_alloc_buffer() works correctly, set the dm-crypt device max_hw_sectors limit to be at most BIO_MAX_VECS << PAGE_SECTORS_SHIFT (1 MB with a 4KB page architecture). This forces DM core to split write BIOs before passing them to crypt_map(), and thus guaranteeing that dm-crypt can always accept an entire write BIO without needing to split it. This change does not have any effect on the read path of dm-crypt. Read operations can still be split and the BIO fragments processed in parallel. There is also no impact on the performance of the write path given that all zone write BIOs were already processed inline instead of in parallel. This change also does not affect in any way regular dm-crypt block devices. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: [email protected] Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Mikulas Patocka <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>