path: root/io_uring
* Merge tag 'io_uring-6.17-20250919' of git://git.kernel.dk/linux (Linus Torvalds, 2025-09-19; 8 files, -37/+15)

    Pull io_uring fixes from Jens Axboe:

    - Fix for a regression introduced in the io-wq worker creation logic.

    - Remove the allocation cache for the msg_ring io_kiocb allocations. I
      have a suspicion that there's a bug there, and since we just fixed
      one in that area, let's just yank the use of that cache entirely.
      It's not that important, and it kills some code.

    - Treat a closed ring like task exiting, in that any requests that
      trigger after that condition should just get canceled. Doesn't fix
      any real issues, outside of letting tasks rely on that guarantee.

    - Fix for a bug in the network zero-copy notification mechanism, where
      a comparison for matching tctx/ctx for notifications was buggy in
      that it didn't correctly compare with the previous notification.

    * tag 'io_uring-6.17-20250919' of git://git.kernel.dk/linux:
      io_uring: fix incorrect io_kiocb reference in io_link_skb
      io_uring/msg_ring: kill alloc_cache for io_kiocb allocations
      io_uring: include dying ring in task_work "should cancel" state
      io_uring/io-wq: fix `max_workers` breakage and `nr_workers` underflow

| * io_uring: fix incorrect io_kiocb reference in io_link_skb (Yang Xiuwei, 2025-09-19; 1 file, -1/+1)

    In io_link_skb(), prev_notif is incorrectly assigned using 'nd'
    instead of 'prev_nd'. This causes the context validation check to
    compare the current notification with itself instead of comparing it
    with the previous notification.

    Fix by using the correct prev_nd parameter when obtaining prev_notif.

    Signed-off-by: Yang Xiuwei <[email protected]>
    Reviewed-by: Pavel Begunkov <[email protected]>
    Fixes: 6fe4220912d19 ("io_uring/notif: implement notification stacking")
    Signed-off-by: Jens Axboe <[email protected]>

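    As an illustration, the bug reduces to resolving the wrong source
    variable; roughly (simplified from the commit text, not the verbatim
    kernel source):

        struct io_notif_data *nd = io_notif_to_data(notif);
        struct io_notif_data *prev_nd = container_of(prev_uarg,
                                                struct io_notif_data, uarg);
        struct io_kiocb *prev_notif;

        /* buggy: resolves the *current* notification, so the ctx/tctx
         * validation below compares the request against itself */
        prev_notif = cmd_to_io_kiocb(nd);

        /* fixed: resolve the previous notification */
        prev_notif = cmd_to_io_kiocb(prev_nd);
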
| * io_uring/msg_ring: kill alloc_cache for io_kiocb allocations (Jens Axboe, 2025-09-18; 2 files, -26/+2)

    A recent commit:

      fc582cd26e88 ("io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU")

    fixed an issue with not deferring freeing of io_kiocb structs that
    msg_ring allocates until after the current RCU grace period. But this
    only covers requests that don't end up in the allocation cache. If a
    request goes into the alloc cache, it can get reused before it is sane
    to do so. A recent syzbot report would seem to indicate that there's
    something there, however it may very well just be because of the KASAN
    poisoning that the alloc_cache handles manually.

    Rather than attempt to make the alloc_cache sane for that use case,
    just drop the usage of the alloc_cache for msg_ring request payload
    data.

    Fixes: 50cf5f3842af ("io_uring/msg_ring: add an alloc cache for io_kiocb entries")
    Link: https://lore.kernel.org/io-uring/[email protected]/
    Reported-by: [email protected]
    Signed-off-by: Jens Axboe <[email protected]>

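    The underlying hazard is generic to mixing RCU-deferred freeing with
    an allocation cache; a minimal sketch of the two shapes (illustrative
    only; assumes the struct embeds an rcu_head member):

        /* safe: reclaim happens only after an RCU grace period, so any
         * concurrent RCU reader can finish with the object first */
        kfree_rcu(req, rcu_head);

        /* unsafe with concurrent RCU readers: the cache may hand the
         * object straight to the next allocation, before the grace
         * period has elapsed */
        io_alloc_cache_put(&ctx->msg_cache, req);
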
| * io_uring: include dying ring in task_work "should cancel" state (Jens Axboe, 2025-09-18; 5 files, -7/+9)

    When running task_work for an exiting task, rather than perform the
    issue retry attempt, the task_work is canceled. However, this isn't
    done for a ring that has been closed. This can lead to requests being
    successfully completed after the ring has been closed, which is
    somewhat confusing and surprising to an application. Rather than just
    check the task exit state, also include the ring ref state in deciding
    whether or not to terminate a given request when run from task_work.

    Cc: [email protected] # 6.1+
    Link: https://github.com/axboe/liburing/discussions/1459
    Reported-by: Benedek Thaler <[email protected]>
    Signed-off-by: Jens Axboe <[email protected]>

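    Schematically, the termination check grows a second condition (names
    assumed from the commit text; percpu_ref_is_dying() is the stock test
    for a percpu ref that has been killed):

        static inline bool io_should_terminate_tw(struct io_ring_ctx *ctx)
        {
                /* task is exiting... */
                if (current->flags & PF_EXITING)
                        return true;
                /* ...or the ring itself has been closed and is dying */
                return percpu_ref_is_dying(&ctx->refs);
        }
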
| * io_uring/io-wq: fix `max_workers` breakage and `nr_workers` underflow (Max Kellermann, 2025-09-15; 1 file, -3/+3)

    Commit 88e6c42e40de ("io_uring/io-wq: add check free worker before
    create new worker") reused the variable `do_create` for something
    else, abusing it for the free worker check.

    This caused the value to effectively always be `true` at the time
    `nr_workers < max_workers` was checked, when it should really have
    been `false`. This means the `max_workers` setting was ignored, and
    worse: if the limit had already been reached, incrementing
    `nr_workers` was skipped even though another worker would be created.

    When lots of workers later exit, the `nr_workers` field could easily
    underflow, making the problem worse because more and more workers
    would be created without incrementing `nr_workers`.

    The simple solution is to use a different variable for the free worker
    check instead of using one variable for two different things.

    Cc: [email protected]
    Fixes: 88e6c42e40de ("io_uring/io-wq: add check free worker before create new worker")
    Signed-off-by: Max Kellermann <[email protected]>
    Reviewed-by: Fengnan Chang <[email protected]>
    Signed-off-by: Jens Axboe <[email protected]>

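    The bug class is plain variable reuse; a schematic of the two shapes
    (hypothetical helper names, not the verbatim io-wq code):

        /* buggy shape: one flag answers two unrelated questions */
        bool do_create = worker_needed(acct);
        do_create = !activate_free_worker(acct);        /* clobbers answer #1 */
        if (do_create && acct->nr_workers < acct->max_workers)
                acct->nr_workers++;     /* skipped once the limit is hit... */
        if (do_create)
                create_io_worker(wq, acct);     /* ...yet a worker is created,
                                                   so worker exits can later
                                                   underflow nr_workers */

        /* fixed shape: a separate variable per question */
        bool do_create = worker_needed(acct);
        bool no_free_worker = !activate_free_worker(acct);
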
* | Merge tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs (Linus Torvalds, 2025-09-08; 1 file, -0/+3)

    Pull vfs fixes from Christian Brauner:
     "fuse:

      - Prevent opening of non-regular backing files. Fuse doesn't support
        non-regular files anyway.

      - Check whether copy_file_range() returns a larger size than
        requested.

      - Prevent overflow in copy_file_range(), as fuse currently only
        supports 32-bit sized copies.

      - Cache the blocksize value if the server returned a new value, as
        inode->i_blkbits isn't modified directly anymore.

      - Fix i_blkbits handling for iomap partial writes. By default
        i_blkbits is set based on PAGE_SIZE, which causes iomap to mark
        the whole folio as uptodate even on a partial write. But fuseblk
        filesystems support choosing a blocksize smaller than PAGE_SIZE,
        risking data corruption. Simply enforce PAGE_SIZE as the blocksize
        for fuseblk's internal inode for now.

      - Prevent out-of-bounds access in fuse_dev_write() when the number
        of bytes to be retrieved is truncated to the fc->max_pages limit.

      virtiofs:

      - Fix page faults for DAX page addresses.

      Misc:

      - Tighten file handle decoding from userns: check that the decoded
        dentry itself has a valid idmapping in the user namespace.

      - Fix mount-notify selftests.

      - Fix some indentation errors.

      - Add an FMODE_ flag to indicate IOCB_HAS_METADATA availability.
        This will be moved to an FOP_* flag later; that needs a bit more
        rework, which isn't suitable for a fix.

      - Don't silently ignore metadata for sync read/write.

      - Don't pointlessly log a warning when reading coredump sysctls"

    * tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
      fuse: virtio_fs: fix page fault for DAX page address
      selftests/fs/mount-notify: Fix compilation failure.
      fhandle: use more consistent rules for decoding file handle from userns
      fuse: Block access to folio overlimit
      fuse: fix fuseblk i_blkbits for iomap partial writes
      fuse: reflect cached blocksize if blocksize was changed
      fuse: prevent overflow in copy_file_range return value
      fuse: check if copy_file_range() returns larger than requested size
      fuse: do not allow mapping a non-regular backing file
      coredump: don't pointlessly check and spew warnings
      fs: fix indentation style
      block: don't silently ignore metadata for sync read/write
      fs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availability

| * Merge tag 'fuse-fixes-6.17-rc5' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse into vfs.fixes (Christian Brauner, 2025-09-01; 2 files, -0/+4)

    fuse fixes for 6.17-rc5

    * tag 'fuse-fixes-6.17-rc5' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (6 commits)
      fuse: Block access to folio overlimit
      fuse: fix fuseblk i_blkbits for iomap partial writes
      fuse: reflect cached blocksize if blocksize was changed
      fuse: prevent overflow in copy_file_range return value
      fuse: check if copy_file_range() returns larger than requested size
      fuse: do not allow mapping a non-regular backing file

    Link: https://lore.kernel.org/CAJfpeguEVMMyw_zCb+hbOuSxdE2Z3Raw=SJsq=Y56Ae6dn2W3g@mail.gmail.com
    Signed-off-by: Christian Brauner <[email protected]>

| * | fs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availability (Christoph Hellwig, 2025-08-20; 1 file, -0/+3)

    Currently the kernel will happily route io_uring requests with
    metadata to file operations that don't support it. Add a FMODE_ flag
    to guard that.

    Fixes: 4de2ce04c862 ("fs: introduce IOCB_HAS_METADATA for metadata")
    Signed-off-by: Christoph Hellwig <[email protected]>
    Link: https://lore.kernel.org/[email protected]
    Signed-off-by: Christian Brauner <[email protected]>

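    A sketch of the guard this enables (the FMODE_HAS_METADATA name is an
    assumption for illustration; this log only confirms "a FMODE_ flag"):

        /* only honor metadata if the file declared support for it */
        if ((kiocb->ki_flags & IOCB_HAS_METADATA) &&
            !(file->f_mode & FMODE_HAS_METADATA))
                return -EOPNOTSUPP;
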
* | | io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengths (Jens Axboe, 2025-08-28; 1 file, -7/+13)

    Since the buffers are mapped from userspace, it is prudent to use
    READ_ONCE() to read the value into a local variable, and use that for
    any other actions taken. Having a stable read of the buffer length
    avoids worrying about it changing after checking, or being read
    multiple times.

    Similarly, the buffer may well change in between it being picked and
    being committed. Ensure the looping for incremental ring buffer
    commits stops if it hits a zero-sized buffer, as no further progress
    can be made at that point.

    Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption")
    Link: https://lore.kernel.org/io-uring/[email protected]/
    Reported-by: Qingyue Zhang <[email protected]>
    Reported-by: Suoxing Zhang <[email protected]>
    Signed-off-by: Jens Axboe <[email protected]>

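    The pattern is standard TOCTOU hardening for memory shared with
    userspace; a minimal sketch of the commit-loop body (simplified, not
    the verbatim kbuf code):

        struct io_uring_buf *buf = &br->bufs[head & mask];
        u32 len = READ_ONCE(buf->len);  /* one stable snapshot */

        if (!len)
                break;  /* zero-sized buffer: no progress possible */
        len = min_t(u32, len, bytes_needed);    /* checks use the snapshot,
                                                   never re-read buf->len */
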
* | io_uring/kbuf: fix signedness in this_len calculation (Qingyue Zhang, 2025-08-27; 1 file, -1/+1)

    When importing and using buffers, buf->len is treated as unsigned.
    However, buf->len is converted to a signed int when committing. This
    can lead to unexpected behavior if the buffer is large enough to be
    interpreted as a negative value. Make the min_t() calculation
    unsigned.

    Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption")
    Co-developed-by: Suoxing Zhang <[email protected]>
    Signed-off-by: Suoxing Zhang <[email protected]>
    Signed-off-by: Qingyue Zhang <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

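    Concretely, with a signed comparison a huge user-supplied length wins
    the min(); a self-contained illustration:

        u32 buf_len = 0x80000000u;      /* user-controlled, >= 2^31 */
        u32 used = 4096;

        int bad  = min_t(int, buf_len, used);   /* buf_len casts to a
                                                   negative int, so the
                                                   "minimum" is negative */
        u32 good = min_t(u32, buf_len, used);   /* unsigned compare: 4096 */
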
* | io_uring: clear ->async_data as part of normal init (Jens Axboe, 2025-08-21; 1 file, -0/+1)

    Opcode handlers like POLL_ADD will use ->async_data as the pointer for
    double poll handling, which is a bit different than the usual case
    where it's strictly gated by the REQ_F_ASYNC_DATA flag. Be a bit more
    proactive in handling ->async_data, and clear it to NULL as part of
    regular init. Init is touching that cacheline anyway, so might as well
    clear it.

    Signed-off-by: Jens Axboe <[email protected]>

* | io_uring/futex: ensure io_futex_wait() cleans up properly on failure (Jens Axboe, 2025-08-21; 1 file, -0/+3)

    The io_futex_data is allocated upfront and assigned to the io_kiocb
    async_data field, but the request isn't marked with REQ_F_ASYNC_DATA
    at that point. Those two should always go together, as the flag tells
    io_uring whether the field is valid or not. Additionally, on failure
    cleanup, the futex handler frees the data but does not clear
    ->async_data. Clear the data and the flag in the error path as well.

    Thanks to Trend Micro Zero Day Initiative and particularly ReDress for
    reporting this.

    Cc: [email protected]
    Fixes: 194bb58c6090 ("io_uring: add support for futex wake and wait")
    Signed-off-by: Jens Axboe <[email protected]>

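    The invariant at stake, sketched (simplified):

        /* setup: pointer and flag must travel together */
        req->async_data = ifd;
        req->flags |= REQ_F_ASYNC_DATA;

        /* error path: undo both, not just the allocation */
        kfree(ifd);
        req->async_data = NULL;
        req->flags &= ~REQ_F_ASYNC_DATA;        /* no dangling pointer */
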
* io_uring/io-wq: add check free worker before create new worker (Fengnan Chang, 2025-08-13; 1 file, -0/+8)

    After commit 0b2b066f8a85 ("io_uring/io-wq: only create a new worker
    if it can make progress"), in our production environment, we still
    observe that some io_worker threads keep being created and destroyed.

    After analysis, it was confirmed that this is due to a more complex
    scenario involving a large number of fsync operations, which can be
    abstracted as frequent write + fsync operations on multiple files in a
    single uring instance. Since write is a hashed operation while fsync
    is not, and fsync is likely to be suspended during execution, the
    check of the hash value in io_wqe_dec_running cannot handle such
    scenarios. Similarly, if hash-based work and non-hash-based work are
    submitted at the same time, similar issues are likely to occur.

    Returning to the starting point of the issue: when new work arrives,
    io_wq_enqueue may wake up free worker A, while io_wq_dec_running may
    create worker B. Ultimately, only one of A and B can obtain and
    process the task, leaving the other idle. In the end, the issue is
    caused by inconsistent logic in the checks performed by io_wq_enqueue
    and io_wq_dec_running. Therefore, the problem can be resolved by also
    checking for available free workers in io_wq_dec_running.

    Signed-off-by: Fengnan Chang <[email protected]>
    Reviewed-by: Diangang Li <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

* io_uring/net: commit partial buffers on retry (Jens Axboe, 2025-08-12; 1 file, -12/+15)

    Ring provided buffers are potentially only valid within the single
    execution context in which they were acquired. io_uring deals with
    this and invalidates them on retry. But on the networking side, if
    MSG_WAITALL is set, or if the socket is of the streaming type and too
    little was processed, then it will hang on to the buffer rather than
    recycle or commit it. This is problematic for two reasons:

    1) If someone unregisters the provided buffer ring before a later
       retry, then the req->buf_list will no longer be valid.

    2) If multiple sockets are using the same buffer group, then multiple
       receives can consume the same memory. This can cause data
       corruption in the application, as either receive could land in the
       same userspace buffer.

    Fix this by disallowing partial retries from pinning a provided buffer
    across multiple executions, if ring provided buffers are used.

    Cc: [email protected]
    Reported-by: pt x <[email protected]>
    Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring")
    Signed-off-by: Jens Axboe <[email protected]>

* io_uring/memmap: cast nr_pages to size_t before shifting (Jens Axboe, 2025-08-08; 1 file, -1/+1)

    If the allocated size exceeds UINT_MAX, then it's necessary to cast
    the mr->nr_pages value to size_t to prevent it from overflowing. In
    practice this isn't much of a concern, as the required memory size
    will have been validated upfront, and accounted to the user. And > 4GB
    sizes will be necessary to make the lack of a cast a problem, which
    greatly exceeds normal user locked_vm settings that are generally in
    the KB to MB range. However, if root is used, then accounting isn't
    done, and then it's possible to hit this issue.

    Link: https://lore.kernel.org/all/[email protected]/
    Cc: [email protected]
    Reported-by: [email protected]
    Fixes: 087f997870a9 ("io_uring/memmap: implement mmap for regions")
    Signed-off-by: Jens Axboe <[email protected]>

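    The arithmetic issue is the usual 32-bit shift truncation; a
    self-contained illustration (assuming 4K pages, PAGE_SHIFT of 12):

        unsigned int nr_pages = 1U << 20;       /* 4GB worth of 4K pages */

        size_t bad  = nr_pages << PAGE_SHIFT;   /* computed in 32 bits,
                                                   wraps to 0 */
        size_t good = (size_t)nr_pages << PAGE_SHIFT;   /* 64-bit shift,
                                                           yields 4GB */
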
* io_uring/net: Allow to do vectorized send (Norman Maurer, 2025-07-30; 1 file, -2/+7)

    At the moment you have to use sendmsg for vectorized sends. While this
    works, it's suboptimal as it also means you need to allocate a struct
    msghdr that must be kept alive until submission happens. We can remove
    this limitation by just allowing send to be used directly.

    Signed-off-by: Norman Maurer <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    [axboe: remove -EINVAL return for SENDMSG and SEND_VECTORIZED]
    [axboe: allow send_zc to set SEND_VECTORIZED too]
    Signed-off-by: Jens Axboe <[email protected]>

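    From userspace this would look roughly as follows with liburing (a
    sketch: the IORING_SEND_VECTORIZED flag name follows the bracketed
    notes above, and carrying it in sqe->ioprio like the other send/recv
    flags is an assumption, not confirmed by this log):

        struct iovec iov[2] = {
                { .iov_base = hdr,  .iov_len = hdr_len },
                { .iov_base = body, .iov_len = body_len },
        };
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        /* addr points at an iovec array; len is the iovec count */
        io_uring_prep_send(sqe, sockfd, iov, 2, 0);
        sqe->ioprio |= IORING_SEND_VECTORIZED;
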
* Merge tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux (Linus Torvalds, 2025-07-28; 18 files, -216/+867)

    Pull io_uring updates from Jens Axboe:

    - Optimization to avoid reference counts on non-cloned registered
      buffers. This is how these buffers were handled prior to having
      cloning support, and we can still use that approach as long as the
      buffers haven't been cloned to another ring.

    - Cleanup and improvement for uring_cmd, where btrfs was the only user
      of storing allocated data for the lifetime of the uring_cmd. Clean
      that up so we can get rid of the need to do that.

    - Avoid unnecessary memory copies in uring_cmd usage. This is
      particularly important as a lot of uring_cmd usage necessitates the
      use of 128b SQEs.

    - A few updates for recv multishot, where it's now possible to add
      fairness limits capping how much is transferred in each retry loop.
      Additionally, recv multishot now supports an overall cap as well;
      once reached, the multishot recv will terminate. The latter is
      useful for buffer management and for juggling many recv streams at
      the same time.

    - Add support for returning the TX timestamps via a new socket
      command. This feature can work in either singleshot or multishot
      mode, where the latter triggers a completion whenever new timestamps
      are available. This is an alternative to using the existing error
      queue.

    - Add support for an io_uring "mock" file, which is the start of being
      able to do 100% targeted testing in terms of exercising io_uring
      request handling. The idea is to have a file type that can behave
      however the tester would like, in terms of hitting the code paths
      being tested.

    - Improve zcrx by using sgtables to de-duplicate and improve dma
      address handling.

    - Prep work for supporting larger pages for zcrx.

    - Various little improvements and fixes.

    * tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux: (42 commits)
      io_uring/zcrx: fix leaking pages on sg init fail
      io_uring/zcrx: don't leak pages on account failure
      io_uring/zcrx: fix null ifq on area destruction
      io_uring: fix breakage in EXPERT menu
      io_uring/cmd: remove struct io_uring_cmd_data
      btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd
      io_uring/cmd: introduce IORING_URING_CMD_REISSUE flag
      io_uring/zcrx: account area memory
      io_uring: export io_[un]account_mem
      io_uring/net: Support multishot receive len cap
      io_uring: deduplicate wakeup handling
      io_uring/net: cast min_not_zero() type
      io_uring/poll: cleanup apoll freeing
      io_uring/net: allow multishot receive per-invocation cap
      io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags
      io_uring/net: use passed in 'len' in io_recv_buf_select()
      io_uring/zcrx: prepare fallback for larger pages
      io_uring/zcrx: assert area type in io_zcrx_iov_page
      io_uring/zcrx: allocate sgtable for umem areas
      io_uring/zcrx: introduce io_populate_area_dma
      ...

| * io_uring/zcrx: fix leaking pages on sg init fail (Pavel Begunkov, 2025-07-21; 1 file, -1/+3)

    If sg_alloc_table_from_pages() fails, io_import_umem() returns without
    first cleaning up the pinned pages. Fix it.

    Fixes: b84621d96ee02 ("io_uring/zcrx: allocate sgtable for umem areas")
    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/9fd94d1bc8c316611eccfec7579799182ff3fb0a.1753091564.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

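    The fix shape, sketched (simplified; sg_alloc_table_from_pages() and
    unpin_user_pages() carry their stock kernel signatures):

        ret = sg_alloc_table_from_pages(&area->sgt, pages, nr_pages, 0,
                                        (unsigned long)nr_pages << PAGE_SHIFT,
                                        GFP_KERNEL);
        if (ret) {
                unpin_user_pages(pages, nr_pages);      /* was missing */
                return ret;
        }
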
| * io_uring/zcrx: don't leak pages on account failure (Pavel Begunkov, 2025-07-21; 1 file, -4/+2)

    Someone needs to release the pinned pages in io_import_umem() if
    accounting fails. Assign them to the area but return an error; the
    following io_zcrx_free_area() will clean them up.

    Fixes: 262ab205180d2 ("io_uring/zcrx: account area memory")
    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/e19f283a912f200c0d427e376cb789fc3f3d69bc.1753091564.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/zcrx: fix null ifq on area destruction (Pavel Begunkov, 2025-07-21; 1 file, -3/+2)

    Dan reports that ifq can be NULL when inferring arguments for
    io_unaccount_mem() from io_zcrx_free_area(). Fix it by always setting
    a correct ifq.

    Reported-by: kernel test robot <[email protected]>
    Reported-by: Dan Carpenter <[email protected]>
    Closes: https://lore.kernel.org/r/[email protected]/
    Fixes: 262ab205180d2 ("io_uring/zcrx: account area memory")
    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/20670d163bb90dba2a81a4150f1125603cefb101.1753091564.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/cmd: remove struct io_uring_cmd_data (Caleb Sander Mateos, 2025-07-18; 2 files, -12/+1)

    There are no more users of struct io_uring_cmd_data and its op_data
    field. Remove it to shave 8 bytes from struct io_async_cmd and
    eliminate a store and load for every uring_cmd.

    Signed-off-by: Caleb Sander Mateos <[email protected]>
    Acked-by: David Sterba <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/cmd: introduce IORING_URING_CMD_REISSUE flag (Caleb Sander Mateos, 2025-07-18; 1 file, -1/+5)

    Add a flag IORING_URING_CMD_REISSUE that ->uring_cmd() implementations
    can use to tell whether this is the first or a subsequent issue of the
    uring_cmd. This will allow ->uring_cmd() implementations to store
    information in the io_uring_cmd's pdu across issues.

    Signed-off-by: Caleb Sander Mateos <[email protected]>
    Acked-by: David Sterba <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

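    A sketch of how a ->uring_cmd() implementation might consume the flag
    (driver names hypothetical):

        static int foo_uring_cmd(struct io_uring_cmd *cmd,
                                 unsigned int issue_flags)
        {
                struct foo_pdu *pdu = io_uring_cmd_to_pdu(cmd, struct foo_pdu);

                /* first issue: initialize pdu state; on reissue the pdu
                 * still holds what the previous attempt stored */
                if (!(cmd->flags & IORING_URING_CMD_REISSUE))
                        pdu->attempts = 0;
                pdu->attempts++;

                return foo_issue(cmd, pdu);     /* hypothetical driver logic */
        }
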
| * io_uring/zcrx: account area memory (Pavel Begunkov, 2025-07-16; 2 files, -0/+28)

    zcrx areas can be quite large and need to be accounted and checked
    against RLIMIT_MEMLOCK. In practice it shouldn't be a big issue as the
    interface already requires cap_net_admin.

    Cc: [email protected]
    Fixes: cf96310c5f9a0 ("io_uring/zcrx: add io_zcrx_area")
    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/4b53f0c575bd062f63d12bec6cac98037fc66aeb.1752699568.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring: export io_[un]account_mem (Pavel Begunkov, 2025-07-16; 2 files, -2/+4)

    Export the pinned memory accounting helpers; they'll be used by zcrx
    shortly.

    Cc: [email protected]
    Fixes: cf96310c5f9a0 ("io_uring/zcrx: add io_zcrx_area")
    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/9a61e54bd89289b39570ae02fe620e12487439e4.1752699568.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/net: Support multishot receive len cap (Norman Maurer, 2025-07-16; 1 file, -4/+34)

    At the moment it's very hard to apply fine-grained backpressure when
    using multishot, as the kernel might produce a lot of completions
    before the user has a chance to cancel a previously submitted
    multishot recv.

    This change adds support for issuing a multishot recv that is capped
    by a length: the kernel will only re-arm until X amount of data has
    been received. When the limit is reached, the completion signals to
    the user that a re-arm needs to happen manually, by not setting the
    IORING_CQE_F_MORE flag.

    Signed-off-by: Norman Maurer <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

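    On the userspace side the contract is the usual multishot one; a
    sketch with liburing (handler names hypothetical):

        struct io_uring_cqe *cqe;

        io_uring_wait_cqe(&ring, &cqe);
        consume(cqe->res, cqe->flags);          /* hypothetical data handler */
        if (!(cqe->flags & IORING_CQE_F_MORE)) {
                /* cap reached (or error): the kernel stopped re-arming,
                 * so submit a fresh multishot recv when ready for more */
                rearm_multishot_recv(&ring, sockfd);    /* hypothetical */
        }
        io_uring_cqe_seen(&ring, cqe);
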
| * io_uring: deduplicate wakeup handling (Jens Axboe, 2025-07-15; 1 file, -11/+16)

    Both io_poll_wq_wake() and io_cqring_wake() contain the exact same
    code, and most of the comment in the latter applies equally to both.
    Move the test and wakeup handling into a basic helper that they can
    both use, and move the part of the comment that applies generically
    into this new helper.

    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/net: cast min_not_zero() type (Jens Axboe, 2025-07-14; 1 file, -1/+1)

    The kernel test robot reports that xtensa complains about differing
    signedness in a min_not_zero() comparison. Cast the int part to size_t
    to avoid this issue.

    Fixes: e227c8cdb47b ("io_uring/net: use passed in 'len' in io_recv_buf_select()")
    Reported-by: kernel test robot <[email protected]>
    Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/poll: cleanup apoll freeing (Jens Axboe, 2025-07-12; 1 file, -8/+3)

    There's no point having REQ_F_POLLED in both IO_REQ_CLEAN_FLAGS and in
    IO_REQ_CLEAN_SLOW_FLAGS, and having both io_free_batch_list() and then
    io_clean_op() check for it and clean it. Move REQ_F_POLLED to
    IO_REQ_CLEAN_SLOW_FLAGS and drop it from IO_REQ_CLEAN_FLAGS, and have
    only io_free_batch_list() do the check and freeing.

    Link: https://lore.kernel.org/io-uring/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/net: allow multishot receive per-invocation cap (Jens Axboe, 2025-07-10; 1 file, -6/+17)

    If an application is handling multiple receive streams using recv
    multishot, then the amount of retries and buffer peeking for multishot
    and bundles can process too much per socket before moving on. This
    isn't directly controllable by the application.

    By default, io_uring will retry a recv MULTISHOT_MAX_RETRY (32) times
    if the socket keeps having data to receive. And if using bundles, each
    bundle peek will potentially map up to PEEK_MAX_IMPORT (256) iovecs of
    data. Once these limits are hit, a requeue operation will be done,
    where the request will get retried after other pending requests have
    had a chance to get executed.

    Add support for capping the per-invocation receive length, before a
    requeue condition is considered for each receive. This is done by
    setting sqe->mshot_len to the byte value. For example, if this is set
    to 1024, then each receive will be requeued once 1024 bytes have been
    received.

    Link: https://lore.kernel.org/io-uring/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags (Jens Axboe, 2025-07-10; 1 file, -11/+15)

    There's plenty of space left, as sr->flags is a 16-bit type. The UAPI
    bits are the lower 8 bits, as that's all that sqe->ioprio can carry in
    the SQE anyway. Use a few of the upper 8 bits for internal uses,
    rather than having two separate flags fields.

    Link: https://lore.kernel.org/io-uring/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/net: use passed in 'len' in io_recv_buf_select() (Jens Axboe, 2025-07-10; 1 file, -1/+1)

    len is a pointer to the desired length; use that rather than grab it
    from sr->len again. No functional changes in this patch, but it does
    prepare io_recv_buf_select() for getting passed a value that differs
    from sr->len.

    Link: https://lore.kernel.org/io-uring/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/zcrx: prepare fallback for larger pages (Pavel Begunkov, 2025-07-08; 1 file, -27/+56)

    io_zcrx_copy_chunk() processes one page at a time, which won't be
    sufficient when the net_iov size grows. Introduce a structure keeping
    the target niov page and other parameters; it's more convenient and
    can be reused later. And add a helper function that can efficiently
    copy buffers of an arbitrary length. For 64-bit archs the loop inside
    should be compiled out.

    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/e84bc705a4e1edeb9aefff470d96558d8232388f.1751466461.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/zcrx: assert area type in io_zcrx_iov_page (Pavel Begunkov, 2025-07-08; 1 file, -0/+2)

    Add a simple debug assertion to io_zcrx_iov_page() making sure it's
    not trying to return pages for a dmabuf area.

    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/c3c30a926a18436a399a1768f3cc86c76cd17fa7.1751466461.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/zcrx: allocate sgtable for umem areas (Pavel Begunkov, 2025-07-08; 2 files, -51/+28)

    Currently, dma addresses for umem areas are stored directly in niovs.
    It's memory efficient but inconvenient. I need a better format 1) to
    share code with dmabuf areas, and 2) for disentangling page, folio and
    niov sizes. dmabuf already provides an sg_table; create one for user
    memory as well.

    Signed-off-by: Pavel Begunkov <[email protected]>
    Reviewed-by: David Wei <[email protected]>
    Link: https://lore.kernel.org/r/f3c15081827c1bf5427d3a2e693bc526476b87ee.1751466461.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/zcrx: introduce io_populate_area_dma (Pavel Begunkov, 2025-07-08; 1 file, -25/+31)

    Add a helper that initialises page-pool dma addresses from an sg
    table. It'll be reused in following patches.

    Signed-off-by: Pavel Begunkov <[email protected]>
    Reviewed-by: David Wei <[email protected]>
    Link: https://lore.kernel.org/r/a8972a77be9b5675abc585d6e2e6e30f9c7dbd85.1751466461.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/zcrx: return error from io_zcrx_map_area_* (Pavel Begunkov, 2025-07-08; 1 file, -13/+14)

    The io_zcrx_map_area_*() helpers return the number of processed niovs,
    which we use to unroll some of the mappings for user memory areas.
    It's unhandy, and dmabuf doesn't care about it. Return an error code
    instead, and move the partial unmapping on failure into
    io_zcrx_map_area_umem().

    Signed-off-by: Pavel Begunkov <[email protected]>
    Reviewed-by: David Wei <[email protected]>
    Link: https://lore.kernel.org/r/42668e82be3a84b07ee8fc76d1d6d5ac0f137fe5.1751466461.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/zcrx: always pass page to io_zcrx_copy_chunk (Pavel Begunkov, 2025-07-08; 1 file, -11/+10)

    io_zcrx_copy_chunk() currently takes either a page or a virtual
    address. Unify the parameters: make it take pages, and resolve the
    linear part into a page the same way general networking code does.

    Signed-off-by: Pavel Begunkov <[email protected]>
    Reviewed-by: David Wei <[email protected]>
    Link: https://lore.kernel.org/r/b8f9f4bac027f5f44a9ccf85350912d1db41ceb8.1751466461.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * io_uring/rw: cast rw->flags assignment to rwf_t (Jens Axboe, 2025-07-07; 1 file, -1/+1)

    The kernel test robot reports that a recent change to the
    sqe->rw_flags field throws a sparse warning on 32-bit archs:

      >> io_uring/rw.c:291:19: sparse: sparse: incorrect type in assignment
         (different base types)
         @@ expected restricted __kernel_rwf_t [usertype] flags @@ got unsigned int @@
         io_uring/rw.c:291:19: sparse: expected restricted __kernel_rwf_t [usertype] flags
         io_uring/rw.c:291:19: sparse: got unsigned int

    Force-cast it to rwf_t to silence that new sparse warning.

    Fixes: cf73d9970ea4 ("io_uring: don't use int for ABI")
    Reported-by: kernel test robot <[email protected]>
    Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
    Signed-off-by: Jens Axboe <[email protected]>

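    The fix amounts to a __force cast to the bitwise type; roughly (a
    sketch based on the warning's context, not the verbatim change):

        /* __kernel_rwf_t is a sparse "bitwise" type, so a plain integer
         * can't be assigned to it without a __force cast */
        rw->flags = (__force rwf_t) READ_ONCE(sqe->rw_flags);
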
| * Merge branch 'io_uring-6.16' into for-6.17/io_uring (Jens Axboe, 2025-07-06; 8 files, -25/+54)

    Merge in 6.16 io_uring fixes, to avoid clashes with pending net and
    settings changes.

    * io_uring-6.16:
      io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well
      io_uring/kbuf: flag partial buffer mappings
      io_uring/net: mark iov as dynamically allocated even for single segments
      io_uring: fix resource leak in io_import_dmabuf()
      io_uring: don't assume uaddr alignment in io_vec_fill_bvec
      io_uring/rsrc: don't rely on user vaddr alignment
      io_uring/rsrc: fix folio unpinning
      io_uring: make fallocate be hashed work

| * | io_uring/rsrc: skip atomic refcount for uncloned buffers (Caleb Sander Mateos, 2025-07-02; 1 file, -2/+4)

    io_buffer_unmap() performs an atomic decrement of the io_mapped_ubuf's
    reference count in case it has been cloned into another io_ring_ctx's
    registered buffer table. This is an expensive operation, and
    unnecessary in the common case where the io_mapped_ubuf is only
    registered once. Load the reference count first and check whether it's
    1. In that case, skip the atomic decrement and immediately free the
    io_mapped_ubuf.

    Signed-off-by: Caleb Sander Mateos <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

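    The fast path described above, sketched (simplified; the free helper
    name is hypothetical):

        /* sole owner: nothing else can race on the count, so a plain
         * read suffices and the atomic decrement can be skipped */
        if (refcount_read(&imu->refs) == 1 ||
            refcount_dec_and_test(&imu->refs))
                io_free_imu(ctx, imu);
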
| * | io_uring/mock: add trivial poll handler (Pavel Begunkov, 2025-07-02; 1 file, -2/+35)

    Add a flag that enables polling on the mock file. For now it trivially
    says that there is always data available; it'll be extended in the
    future.

    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/f16de043ec4876d65fae294fc99ade57415fba0c.1750599274.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * | io_uring/mock: support for async read/write (Pavel Begunkov, 2025-07-02; 1 file, -4/+55)

    Let the user specify a delay for read/write requests. io_uring will
    start a timer, return -EIOCBQUEUED, and complete the request
    asynchronously after the delay passes.

    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/38f9d2e143fda8522c90a724b74630e68f9bbd16.1750599274.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * | io_uring/mock: allow to choose FMODE_NOWAIT (Pavel Begunkov, 2025-07-02; 1 file, -1/+7)

    Add an option to choose whether the file supports FMODE_NOWAIT, which
    changes the execution path an io_uring request takes.

    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/1e532565b05a05b23589d237c24ee1a3d90c2fd9.1750599274.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * | io_uring/mock: add sync read/write (Pavel Begunkov, 2025-07-02; 1 file, -5/+62)

    Add support for synchronous zero read/write for mock files.

    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/571f3c9fe688e918256a06a722d3db6ced9ca3d5.1750599274.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * | io_uring/mock: add cmd using vectored regbufs (Pavel Begunkov, 2025-07-02; 1 file, -1/+69)

    There is a command API allowing import of vectored registered buffers.
    Add a new mock command that uses the feature and simply copies the
    specified registered buffer into user space, or vice versa.

    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/229a113fd7de6b27dbef9567f7c0bf4475c9017d.1750599274.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * | io_uring/mock: add basic infra for test mock files (Pavel Begunkov, 2025-07-02; 2 files, -0/+149)

    io_uring commands provide an ioctl-style interface for files to
    implement file-specific operations. io_uring provides many features
    and an advanced API to commands, and it's getting hard to test as it
    requires specific files/devices. Add basic infrastructure for creating
    special mock files that will implement the cmd API and use the various
    io_uring features we want to test.

    It'll also be useful for testing some more obscure read/write/polling
    edge cases in the future.

    Suggested-by: chase xd <[email protected]>
    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/93f21b0af58c1367a2b22635d5a7d694ad0272fc.1750599274.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * | io_uring/netcmd: add tx timestamping cmd support (Pavel Begunkov, 2025-06-23; 1 file, -0/+82)

    Add a new socket command which returns tx timestamps to the user. It
    provides an alternative to the existing error-queue recvmsg interface.
    The command works in a polled multishot mode, which means io_uring
    will poll the socket and keep posting timestamps until the request is
    cancelled or fails in any other way (e.g. with no space in the CQ). It
    reuses the net infra and grabs timestamps from the socket's error
    queue.

    The command requires IORING_SETUP_CQE32. All non-final CQEs (marked
    with IORING_CQE_F_MORE) have cqe->res set to the tskey, and the upper
    16 bits of cqe->flags keep the tstype (i.e. offset by
    IORING_CQE_BUFFER_SHIFT). The time value is stored in the upper part
    of the extended CQE. The final completion won't have IORING_CQE_F_MORE
    set and will have cqe->res storing 0/error.

    Suggested-by: Vadim Fedorenko <[email protected]>
    Acked-by: Willem de Bruijn <[email protected]>
    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/92ee66e6b33b8de062a977843d825f58f21ecd37.1750065793.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

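    Decoding the completions as laid out above would look roughly like
    this (field placement per the commit text; which big_cqe slot holds
    the timestamp is an assumption, and the helpers are hypothetical):

        static void handle_ts_cqe(struct io_uring_cqe *cqe)
        {
                if (cqe->flags & IORING_CQE_F_MORE) {
                        __u32 tskey  = cqe->res;
                        __u32 tstype = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
                        __u64 tstamp = cqe->big_cqe[0]; /* extended CQE area */

                        record_timestamp(tskey, tstype, tstamp);
                } else {
                        finish(cqe->res);       /* 0 or -errno */
                }
        }
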
| * | io_uring: add mshot helper for posting CQE32 (Pavel Begunkov, 2025-06-23; 4 files, -0/+56)

    Add a helper for posting 32-byte CQEs in multishot mode, and add a cmd
    helper on top. As it specifically works with requests, the helper
    ignores the passed-in cqe->user_data and sets it to the one stored in
    the request. The command helper is only valid with multishot requests.

    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/c29d7720c16e1f981cfaa903df187138baa3946b.1750065793.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

| * | io_uring/cmd: allow multishot polled commands (Pavel Begunkov, 2025-06-23; 2 files, -0/+26)

    Some commands, like timestamping in the next patch, can make use of
    multishot polling, i.e. REQ_F_APOLL_MULTISHOT. Add support for that,
    condensed into a single helper called io_cmd_poll_multishot(). A user
    who wants to continue with a request in multishot mode must call the
    function, and only if it returns 0 is the user free to proceed. Apart
    from normal terminal errors, it can also end up with -EIOCBQUEUED, in
    which case the user must forward it to the core io_uring.

    It's forbidden to use task work while the request is executing in
    multishot mode. The API is not foolproof, hence it's not exported to
    modules nor exposed in public headers.

    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/bcf97c31659662c72b69fc8fcdf2a88cfc16e430.1750065793.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>

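    The calling convention described above, sketched (the exact signature
    is assumed; the poll mask parameter mirrors io_arm_apoll() below):

        ret = io_cmd_poll_multishot(cmd, issue_flags, POLLIN);
        if (ret == -EIOCBQUEUED)
                return ret;     /* armed for poll; forward to core io_uring */
        if (ret)
                return ret;     /* terminal error */
        /* ret == 0: safe to proceed in multishot mode */
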
| * | io_uring/poll: introduce io_arm_apoll() (Pavel Begunkov, 2025-06-23; 2 files, -17/+28)

    In preparation for allowing commands to do file polling, add a helper
    that takes the desired poll event mask and arms it for polling. We
    won't be able to use io_arm_poll_handler() with IORING_OP_URING_CMD,
    as it tries to infer the mask from the opcode data, and we can't unify
    that across all commands.

    Signed-off-by: Pavel Begunkov <[email protected]>
    Link: https://lore.kernel.org/r/7ee5633f2dc45fd15243f1a60965f7e30e1c48e8.1750065793.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <[email protected]>