aboutsummaryrefslogtreecommitdiffstats
path: root/net/mptcp/protocol.c
Commit message (Collapse)AuthorAgeFilesLines
* mptcp: propagate shutdown to subflows when possibleMatthieu Baerts (NGI0)2025-09-161-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the MPTCP DATA FIN have been ACKed, there is no more MPTCP related metadata to exchange, and all subflows can be safely shutdown. Before this patch, the subflows were actually terminated at 'close()' time. That's certainly fine most of the time, but not when the userspace 'shutdown()' a connection, without close()ing it. When doing so, the subflows were staying in LAST_ACK state on one side -- and consequently in FIN_WAIT2 on the other side -- until the 'close()' of the MPTCP socket. Now, when the DATA FIN have been ACKed, all subflows are shutdown. A consequence of this is that the TCP 'FIN' flag can be set earlier now, but the end result is the same. This affects the packetdrill tests looking at the end of the MPTCP connections, but for a good reason. Note that tcp_shutdown() will check the subflow state, so no need to do that again before calling it. Fixes: 3721b9b64676 ("mptcp: Track received DATA_FIN sequence number and add related helpers") Cc: [email protected] Fixes: 16a9a9da1723 ("mptcp: Add helper to process acks of DATA_FIN") Reviewed-by: Mat Martineau <[email protected]> Reviewed-by: Geliang Tang <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* net: lockless sock_i_ino()Eric Dumazet2025-09-031-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Followup of commit c51da3f7a161 ("net: remove sock_i_uid()") A recent syzbot report was the trigger for this change. Over the years, we had many problems caused by the read_lock[_bh](&sk->sk_callback_lock) in sock_i_uid(). We could fix smc_diag_dump_proto() or make a more radical move: Instead of waiting for new syzbot reports, cache the socket inode number in sk->sk_ino, so that we no longer need to acquire sk->sk_callback_lock in sock_i_ino(). This makes socket dumps faster (one less cache line miss, and two atomic ops avoided). Prior art: commit 25a9c8a4431c ("netlink: Add __sock_i_ino() for __netlink_diag_dump().") commit 4f9bf2a2f5aa ("tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.") commit efc3dbc37412 ("rds: Make rds_sock_lock BH rather than IRQ safe.") Fixes: d2d6422f8bd1 ("x86: Allow to enable PREEMPT_RT.") Reported-by: [email protected] Closes: https://lore.kernel.org/netdev/[email protected]/T/#u Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Sebastian Andrzej Siewior <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: remove pr_fallback()Paolo Abeni2025-07-251-1/+0
| | | | | | | | | | | We can now track fully the fallback status of a given connection via the relevant mibs, the mentioned helper is redundant. Remove it completely. Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20250723-net-next-mptcp-track-fallbacks-v1-2-a83cce08f2d5@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: track fallbacks accurately via mibsPaolo Abeni2025-07-251-14/+30
| | | | | | | | | | | | | | | | | | | | | Add the mibs required to cover the few possible fallback causes still lacking suck info. Move the relevant mib increment into the fallback helper, so that no eventual future fallback operation will miss a paired mib increment. Additionally track failed fallback via its own mib, such mib is incremented only when a fallback mandated by the protocol fails - due to racing subflow creation. While at the above, rename an existing helper to reduce long lines problems all along. Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20250723-net-next-mptcp-track-fallbacks-v1-1-a83cce08f2d5@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: fix typo in a commentmoyuanhao2025-07-221-1/+1
| | | | | | | | | | | | This patch fixes the follow spelling mistake in a comment: greter -> greater Signed-off-by: moyuanhao <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2025-07-171-8/+48
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR (net-6.16-rc7). Conflicts: Documentation/netlink/specs/ovpn.yaml 880d43ca9aa4 ("netlink: specs: clean up spaces in brackets") af52020fc599 ("ovpn: reject unexpected netlink attributes") drivers/net/phy/phy_device.c a44312d58e78 ("net: phy: Don't register LEDs for genphy") f0f2b992d818 ("net: phy: Don't register LEDs for genphy") https://lore.kernel.org/[email protected] drivers/net/wireless/intel/iwlwifi/fw/regulatory.c drivers/net/wireless/intel/iwlwifi/mld/regulatory.c 5fde0fcbd760 ("wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap") ea045a0de3b9 ("wifi: iwlwifi: add support for accepting raw DSM tables by firmware") net/ipv6/mcast.c ae3264a25a46 ("ipv6: mcast: Delay put pmc->idev in mld_del_delrec()") a8594c956cc9 ("ipv6: mcast: Avoid a duplicate pointer check in mld_del_delrec()") https://lore.kernel.org/[email protected] No adjacent changes. Signed-off-by: Jakub Kicinski <[email protected]>
| * mptcp: reset fallback status gracefully at disconnect() timePaolo Abeni2025-07-161-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mptcp_disconnect() clears the fallback bit unconditionally, without touching the associated flags. The bit clear is safe, as no fallback operation can race with that -- all subflow are already in TCP_CLOSE status thanks to the previous FASTCLOSE -- but we need to consistently reset all the fallback related status. Also acquire the relevant lock, to avoid fouling static analyzers. Fixes: b29fcfb54cd7 ("mptcp: full disconnect implementation") Cc: [email protected] Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
| * mptcp: plug races between subflow fail and subflow creationPaolo Abeni2025-07-161-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have races similar to the one addressed by the previous patch between subflow failing and additional subflow creation. They are just harder to trigger. The solution is similar. Use a separate flag to track the condition 'socket state prevent any additional subflow creation' protected by the fallback lock. The socket fallback makes such flag true, and also receiving or sending an MP_FAIL option. The field 'allow_infinite_fallback' is now always touched under the relevant lock, we can drop the ONCE annotation on write. Fixes: 478d770008b0 ("mptcp: send out MP_FAIL when data checksum fails") Cc: [email protected] Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
| * mptcp: make fallback action and fallback decision atomicPaolo Abeni2025-07-161-5/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzkaller reported the following splat: WARNING: CPU: 1 PID: 7704 at net/mptcp/protocol.h:1223 __mptcp_do_fallback net/mptcp/protocol.h:1223 [inline] WARNING: CPU: 1 PID: 7704 at net/mptcp/protocol.h:1223 mptcp_do_fallback net/mptcp/protocol.h:1244 [inline] WARNING: CPU: 1 PID: 7704 at net/mptcp/protocol.h:1223 check_fully_established net/mptcp/options.c:982 [inline] WARNING: CPU: 1 PID: 7704 at net/mptcp/protocol.h:1223 mptcp_incoming_options+0x21a8/0x2510 net/mptcp/options.c:1153 Modules linked in: CPU: 1 UID: 0 PID: 7704 Comm: syz.3.1419 Not tainted 6.16.0-rc3-gbd5ce2324dba #20 PREEMPT(voluntary) Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:__mptcp_do_fallback net/mptcp/protocol.h:1223 [inline] RIP: 0010:mptcp_do_fallback net/mptcp/protocol.h:1244 [inline] RIP: 0010:check_fully_established net/mptcp/options.c:982 [inline] RIP: 0010:mptcp_incoming_options+0x21a8/0x2510 net/mptcp/options.c:1153 Code: 24 18 e8 bb 2a 00 fd e9 1b df ff ff e8 b1 21 0f 00 e8 ec 5f c4 fc 44 0f b7 ac 24 b0 00 00 00 e9 54 f1 ff ff e8 d9 5f c4 fc 90 <0f> 0b 90 e9 b8 f4 ff ff e8 8b 2a 00 fd e9 8d e6 ff ff e8 81 2a 00 RSP: 0018:ffff8880a3f08448 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8880180a8000 RCX: ffffffff84afcf45 RDX: ffff888090223700 RSI: ffffffff84afdaa7 RDI: 0000000000000001 RBP: ffff888017955780 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff8880180a8910 R14: ffff8880a3e9d058 R15: 0000000000000000 FS: 00005555791b8500(0000) GS:ffff88811c495000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000110c2800b7 CR3: 0000000058e44000 CR4: 0000000000350ef0 Call Trace: <IRQ> tcp_reset+0x26f/0x2b0 net/ipv4/tcp_input.c:4432 tcp_validate_incoming+0x1057/0x1b60 net/ipv4/tcp_input.c:5975 tcp_rcv_established+0x5b5/0x21f0 net/ipv4/tcp_input.c:6166 tcp_v4_do_rcv+0x5dc/0xa70 net/ipv4/tcp_ipv4.c:1925 tcp_v4_rcv+0x3473/0x44a0 net/ipv4/tcp_ipv4.c:2363 ip_protocol_deliver_rcu+0xba/0x480 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x2f1/0x500 net/ipv4/ip_input.c:233 NF_HOOK include/linux/netfilter.h:317 [inline] NF_HOOK include/linux/netfilter.h:311 [inline] ip_local_deliver+0x1be/0x560 net/ipv4/ip_input.c:254 dst_input include/net/dst.h:469 [inline] ip_rcv_finish net/ipv4/ip_input.c:447 [inline] NF_HOOK include/linux/netfilter.h:317 [inline] NF_HOOK include/linux/netfilter.h:311 [inline] ip_rcv+0x514/0x810 net/ipv4/ip_input.c:567 __netif_receive_skb_one_core+0x197/0x1e0 net/core/dev.c:5975 __netif_receive_skb+0x1f/0x120 net/core/dev.c:6088 process_backlog+0x301/0x1360 net/core/dev.c:6440 __napi_poll.constprop.0+0xba/0x550 net/core/dev.c:7453 napi_poll net/core/dev.c:7517 [inline] net_rx_action+0xb44/0x1010 net/core/dev.c:7644 handle_softirqs+0x1d0/0x770 kernel/softirq.c:579 do_softirq+0x3f/0x90 kernel/softirq.c:480 </IRQ> <TASK> __local_bh_enable_ip+0xed/0x110 kernel/softirq.c:407 local_bh_enable include/linux/bottom_half.h:33 [inline] inet_csk_listen_stop+0x2c5/0x1070 net/ipv4/inet_connection_sock.c:1524 mptcp_check_listen_stop.part.0+0x1cc/0x220 net/mptcp/protocol.c:2985 mptcp_check_listen_stop net/mptcp/mib.h:118 [inline] __mptcp_close+0x9b9/0xbd0 net/mptcp/protocol.c:3000 mptcp_close+0x2f/0x140 net/mptcp/protocol.c:3066 inet_release+0xed/0x200 net/ipv4/af_inet.c:435 inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:487 __sock_release+0xb3/0x270 net/socket.c:649 sock_close+0x1c/0x30 net/socket.c:1439 __fput+0x402/0xb70 fs/file_table.c:465 task_work_run+0x150/0x240 kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop+0xd4/0xe0 kernel/entry/common.c:114 exit_to_user_mode_prepare include/linux/entry-common.h:330 [inline] syscall_exit_to_user_mode_work include/linux/entry-common.h:414 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:449 [inline] do_syscall_64+0x245/0x360 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fc92f8a36ad Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffcf52802d8 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4 RAX: 0000000000000000 RBX: 00007ffcf52803a8 RCX: 00007fc92f8a36ad RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003 RBP: 00007fc92fae7ba0 R08: 0000000000000001 R09: 0000002800000000 R10: 00007fc92f700000 R11: 0000000000000246 R12: 00007fc92fae5fac R13: 00007fc92fae5fa0 R14: 0000000000026d00 R15: 0000000000026c51 </TASK> irq event stamp: 4068 hardirqs last enabled at (4076): [<ffffffff81544816>] __up_console_sem+0x76/0x80 kernel/printk/printk.c:344 hardirqs last disabled at (4085): [<ffffffff815447fb>] __up_console_sem+0x5b/0x80 kernel/printk/printk.c:342 softirqs last enabled at (3096): [<ffffffff840e1be0>] local_bh_enable include/linux/bottom_half.h:33 [inline] softirqs last enabled at (3096): [<ffffffff840e1be0>] inet_csk_listen_stop+0x2c0/0x1070 net/ipv4/inet_connection_sock.c:1524 softirqs last disabled at (3097): [<ffffffff813b6b9f>] do_softirq+0x3f/0x90 kernel/softirq.c:480 Since we need to track the 'fallback is possible' condition and the fallback status separately, there are a few possible races open between the check and the actual fallback action. Add a spinlock to protect the fallback related information and use it close all the possible related races. While at it also remove the too-early clearing of allow_infinite_fallback in __mptcp_subflow_connect(): the field will be correctly cleared by subflow_finish_connect() if/when the connection will complete successfully. If fallback is not possible, as per RFC, reset the current subflow. Since the fallback operation can now fail and return value should be checked, rename the helper accordingly. Fixes: 0530020a7c8f ("mptcp: track and update contiguous data status") Cc: [email protected] Reported-by: Matthieu Baerts <[email protected]> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/570 Reported-by: [email protected] Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/555 Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* | tcp: move tcp_memory_allocated into net_aligned_dataEric Dumazet2025-07-021-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | ____cacheline_aligned_in_smp attribute only makes sure to align a field to a cache line. It does not prevent the linker to use the remaining of the cache line for other variables, causing potential false sharing. Move tcp_memory_allocated into a dedicated cache line. Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* | net: annotate races around sk->sk_uidEric Dumazet2025-06-241-1/+1
|/ | | | | | | | | | | | | | | | sk->sk_uid can be read while another thread changes its value in sockfs_setattr(). Add sk_uid(const struct sock *sk) helper to factorize the needed READ_ONCE() annotations, and add corresponding WRITE_ONCE() where needed. Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.") Signed-off-by: Eric Dumazet <[email protected]> Cc: Lorenzo Colitti <[email protected]> Reviewed-by: Maciej Żenczykowski <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* treewide, timers: Rename from_timer() to timer_container_of()Ingo Molnar2025-06-081-3/+3
| | | | | | | | | | | Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
* mptcp: Use nested-BH locking for hmac_storageSebastian Andrzej Siewior2025-05-151-1/+3
| | | | | | | | | | | | | | | | | | | mptcp_delegated_actions is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Add a local_lock_t to the data structure and use local_lock_nested_bh() for locking. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT. Cc: Matthieu Baerts <[email protected]> Cc: Mat Martineau <[email protected]> Cc: Geliang Tang <[email protected]> Cc: [email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
* mptcp: Align mptcp_inet6_sk with other protocolsPedro Falcato2025-05-031-2/+2
| | | | | | | | | | | | | | | | Ever since commit f5f80e32de12 ("ipv6: remove hard coded limitation on ipv6_pinfo") that protocols stopped using the old "obj_size - sizeof(struct ipv6_pinfo)" way of grabbing ipv6_pinfo, that severely restricted struct layout and caused fun, hard to see issues. However, mptcp_inet6_sk wasn't fixed (unlike tcp_inet6_sk). Do so. The non-cloned sockets already do the right thing using ipv6_pinfo_offset + the generic IPv6 code. Signed-off-by: Pedro Falcato <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: add MPJoinRejected MIB counterMatthieu Baerts (NGI0)2025-04-151-1/+3
| | | | | | | | | | | | | This counter is useful to understand why some paths are rejected, and not created as expected. It is incremented when receiving a connection request, if the PM didn't allow the creation of new subflows. Reviewed-by: Geliang Tang <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20250413-net-next-mptcp-sched-mib-sft-misc-v2-5-0f83a4350150@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
* tcp/dccp: remove icsk->icsk_ack.timeoutEric Dumazet2025-03-251-1/+0
| | | | | | | | | | | icsk->icsk_ack.timeout can be replaced by icsk->csk_delack_timer.expires This saves 8 bytes in TCP/DCCP sockets and helps for better cache locality. Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* tcp/dccp: remove icsk->icsk_timeoutEric Dumazet2025-03-251-1/+1
| | | | | | | | | | | icsk->icsk_timeout can be replaced by icsk->icsk_retransmit_timer.expires This saves 8 bytes in TCP/DCCP sockets and helps for better cache locality. Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: pm: avoid calling PM specific code from coreMatthieu Baerts (NGI0)2025-03-101-2/+1
| | | | | | | | | | | | | | | | | | | | | When destroying an MPTCP socket, some userspace PM specific code was called from mptcp_destroy_common() in protocol.c. That feels wrong, and it is the only case. Instead, the core now calls mptcp_pm_destroy() from pm.c which is now in charge of cleaning the announced addresses list, and ask the different PMs to do extra cleaning if needed, e.g. the userspace PM, if used, will clean the local addresses list. While at it, the userspace PM specific helper has been prefixed with 'mptcp_userspace_pm_' like the other ones. No behavioural changes intended. Reviewed-by: Geliang Tang <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: pm: remove '_nl' from mptcp_pm_nl_workMatthieu Baerts (NGI0)2025-03-101-1/+1
| | | | | | | | | | | | | | | | | | Currently, in-kernel PM specific helpers are prefixed with 'mptcp_pm_nl_'. But here 'mptcp_pm_nl_work' is not specific to this PM: it is called from the core to call helpers, some of them needed by both the in-kernel and userspace PMs. To avoid confusions, the '_nl' bit has been removed from the name. Also used 'worker' instead of 'work', similar to protocol.c's worker. No behavioural changes intended. Reviewed-by: Geliang Tang <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* net: use sock_kmemdup for ip_optionsGeliang Tang2025-03-041-5/+2
| | | | | | | | | | | | | Instead of using sock_kmalloc() to allocate an ip_options and then immediately duplicate another ip_options to the newly allocated one in ipv6_dup_options(), mptcp_copy_ip_options() and sctp_v4_copy_ip_options(), the newly added sock_kmemdup() helper can be used to simplify the code. Signed-off-by: Geliang Tang <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Acked-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/91ae749d66600ec6fb679e0e518fda6acb5c3e6f.1740735165.git.tanggeliang@kylinos.cn Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: micro-optimize __mptcp_move_skb()Paolo Abeni2025-02-201-61/+50
| | | | | | | | | | | | | | | | After the RX path refactor the mentioned function is expected to run frequently, let's optimize it a bit. Scan for ready subflow from the last processed one, and stop after traversing the list once or reaching the msk memory limit - instead of looking for dubious per-subflow conditions. Also re-order the memory limit checks, to avoid duplicate tests. Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-7-4a47d90d7998@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: dismiss __mptcp_rmem()Paolo Abeni2025-02-201-4/+4
| | | | | | | | | | | After the RX path refactor, it become a wrapper for sk_rmem_alloc access, with a slightly misleading name. Just drop it. Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-6-4a47d90d7998@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: cleanup mem accountingPaolo Abeni2025-02-201-107/+8
| | | | | | | | | | | | After the previous patch, updating sk_forward_memory is cheap and we can drop a lot of complexity from the MPTCP memory accounting, removing the custom fwd mem allocations for rmem. Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-4-4a47d90d7998@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: move the whole rx path under msk socket lock protectionPaolo Abeni2025-02-201-65/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit c2e6048fa1cf ("mptcp: fix race in release_cb") we can move the whole MPTCP rx path under the socket lock leveraging the release_cb. We can drop a bunch of spin_lock pairs in the receive functions, use a single receive queue and invoke __mptcp_move_skbs only when subflows ask for it. This will allow more cleanup in the next patch. Some changes are worth specific mention: The msk rcvbuf update now always happens under both the msk and the subflow socket lock: we can drop a bunch of ONCE annotation and consolidate the checks. When the skbs move is delayed at msk release callback time, even the msk rcvbuf update is delayed; additionally take care of such action in __mptcp_move_skbs(). Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-3-4a47d90d7998@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: drop __mptcp_fastopen_gen_msk_ackseq()Paolo Abeni2025-02-201-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | When we will move the whole RX path under the msk socket lock, updating the already queued skb for passive fastopen socket at 3rd ack time will be extremely painful and race prone The map_seq for already enqueued skbs is used only to allow correct coalescing with later data; preventing collapsing to the first skb of a fastopen connect we can completely remove the __mptcp_fastopen_gen_msk_ackseq() helper. Before dropping this helper, a new item had to be added to the mptcp_skb_cb structure. Because this item will be frequently tested in the fast path -- almost on every packet -- and because there is free space there, a single byte is used instead of a bitfield. This micro optimisation slightly reduces the number of CPU operations to do the associated check. Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-2-4a47d90d7998@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: handle fastopen disconnect correctlyPaolo Abeni2025-01-271-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot was able to trigger a data stream corruption: WARNING: CPU: 0 PID: 9846 at net/mptcp/protocol.c:1024 __mptcp_clean_una+0xddb/0xff0 net/mptcp/protocol.c:1024 Modules linked in: CPU: 0 UID: 0 PID: 9846 Comm: syz-executor351 Not tainted 6.13.0-rc2-syzkaller-00059-g00a5acdbf398 #0 Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 11/25/2024 RIP: 0010:__mptcp_clean_una+0xddb/0xff0 net/mptcp/protocol.c:1024 Code: fa ff ff 48 8b 4c 24 18 80 e1 07 fe c1 38 c1 0f 8c 8e fa ff ff 48 8b 7c 24 18 e8 e0 db 54 f6 e9 7f fa ff ff e8 e6 80 ee f5 90 <0f> 0b 90 4c 8b 6c 24 40 4d 89 f4 e9 04 f5 ff ff 44 89 f1 80 e1 07 RSP: 0018:ffffc9000c0cf400 EFLAGS: 00010293 RAX: ffffffff8bb0dd5a RBX: ffff888033f5d230 RCX: ffff888059ce8000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffc9000c0cf518 R08: ffffffff8bb0d1dd R09: 1ffff110170c8928 R10: dffffc0000000000 R11: ffffed10170c8929 R12: 0000000000000000 R13: ffff888033f5d220 R14: dffffc0000000000 R15: ffff8880592b8000 FS: 00007f6e866496c0(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6e86f491a0 CR3: 00000000310e6000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __mptcp_clean_una_wakeup+0x7f/0x2d0 net/mptcp/protocol.c:1074 mptcp_release_cb+0x7cb/0xb30 net/mptcp/protocol.c:3493 release_sock+0x1aa/0x1f0 net/core/sock.c:3640 inet_wait_for_connect net/ipv4/af_inet.c:609 [inline] __inet_stream_connect+0x8bd/0xf30 net/ipv4/af_inet.c:703 mptcp_sendmsg_fastopen+0x2a2/0x530 net/mptcp/protocol.c:1755 mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1830 sock_sendmsg_nosec net/socket.c:711 [inline] __sock_sendmsg+0x1a6/0x270 net/socket.c:726 ____sys_sendmsg+0x52a/0x7e0 net/socket.c:2583 ___sys_sendmsg net/socket.c:2637 [inline] __sys_sendmsg+0x269/0x350 net/socket.c:2669 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f6e86ebfe69 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 1f 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f6e86649168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f6e86f491b8 RCX: 00007f6e86ebfe69 RDX: 0000000030004001 RSI: 0000000020000080 RDI: 0000000000000003 RBP: 00007f6e86f491b0 R08: 00007f6e866496c0 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6e86f491bc R13: 000000000000006e R14: 00007ffe445d9420 R15: 00007ffe445d9508 </TASK> The root cause is the bad handling of disconnect() generated internally by the MPTCP protocol in case of connect FASTOPEN errors. Address the issue increasing the socket disconnect counter even on such a case, to allow other threads waiting on the same socket lock to properly error out. Fixes: c2b2ae3925b6 ("mptcp: handle correctly disconnect() failures") Cc: [email protected] Reported-by: [email protected] Closes: https://lore.kernel.org/[email protected] Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/537 Tested-by: [email protected] Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* net: remove init_dummy_netdev()Jakub Kicinski2025-01-141-3/+5
| | | | | | | | | | | | | | | | init_dummy_netdev() can initialize statically declared or embedded net_devices. Such netdevs did not come from alloc_netdev_mqs(). After recent work by Breno, there are the only two cases where we have do that. Switch those cases to alloc_netdev_mqs() and delete init_dummy_netdev(). Dealing with static netdevs is not worth the maintenance burden. Reviewed-by: Alexander Lobakin <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Reviewed-by: Joe Damato <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: prevent excessive coalescing on receivePaolo Abeni2025-01-031-0/+1
| | | | | | | | | | | | | | | | Currently the skb size after coalescing is only limited by the skb layout (the skb must not carry frag_list). A single coalesced skb covering several MSS can potentially fill completely the receive buffer. In such a case, the snd win will zero until the receive buffer will be empty again, affecting tput badly. Fixes: 8268ed4c9d19 ("mptcp: introduce and use mptcp_try_coalesce()") Cc: [email protected] # please delay 2 weeks after 6.13-final release Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: don't always assume copied data in mptcp_cleanup_rbuf()Paolo Abeni2025-01-031-9/+9
| | | | | | | | | | | | | | | | | Under some corner cases the MPTCP protocol can end-up invoking mptcp_cleanup_rbuf() when no data has been copied, but such helper assumes the opposite condition. Explicitly drop such assumption and performs the costly call only when strictly needed - before releasing the msk socket lock. Fixes: fd8976790a6c ("mptcp: be careful on MPTCP-level ack.") Cc: [email protected] Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: fix recvbuffer adjust on sleeping rcvmsgPaolo Abeni2025-01-031-3/+3
| | | | | | | | | | | | | | | | | | | | | | If the recvmsg() blocks after receiving some data - i.e. due to SO_RCVLOWAT - the MPTCP code will attempt multiple times to adjust the receive buffer size, wrongly accounting every time the cumulative of received data - instead of accounting only for the delta. Address the issue moving mptcp_rcv_space_adjust just after the data reception and passing it only the just received bytes. This also removes an unneeded difference between the TCP and MPTCP RX code path implementation. Fixes: 581302298524 ("mptcp: error out earlier on disconnect") Cc: [email protected] Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2024-11-141-5/+11
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR (net-6.12-rc8). Conflicts: tools/testing/selftests/net/.gitignore 252e01e68241 ("selftests: net: add netlink-dumps to .gitignore") be43a6b23829 ("selftests: ncdevmem: Move ncdevmem under drivers/net/hw") https://lore.kernel.org/all/[email protected]/ drivers/net/phy/phylink.c 671154f174e0 ("net: phylink: ensure PHY momentary link-fails are handled") 7530ea26c810 ("net: phylink: remove "using_mac_select_pcs"") Adjacent changes: drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c 5b366eae7193 ("stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines") e96321fad3ad ("net: ethernet: Switch back to struct platform_driver::remove()") Signed-off-by: Jakub Kicinski <[email protected]>
| * mptcp: cope racing subflow creation in mptcp_rcv_space_adjustPaolo Abeni2024-11-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Additional active subflows - i.e. created by the in kernel path manager - are included into the subflow list before starting the 3whs. A racing recvmsg() spooling data received on an already established subflow would unconditionally call tcp_cleanup_rbuf() on all the current subflows, potentially hitting a divide by zero error on the newly created ones. Explicitly check that the subflow is in a suitable state before invoking tcp_cleanup_rbuf(). Fixes: c76c6956566f ("mptcp: call tcp_cleanup_rbuf on subflows") Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/02374660836e1b52afc91966b7535c8c5f7bafb0.1731060874.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <[email protected]>
| * mptcp: error out earlier on disconnectPaolo Abeni2024-11-121-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Eric reported a division by zero splat in the MPTCP protocol: Oops: divide error: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 1 UID: 0 PID: 6094 Comm: syz-executor317 Not tainted 6.12.0-rc5-syzkaller-00291-g05b92660cdfe #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 RIP: 0010:__tcp_select_window+0x5b4/0x1310 net/ipv4/tcp_output.c:3163 Code: f6 44 01 e3 89 df e8 9b 75 09 f8 44 39 f3 0f 8d 11 ff ff ff e8 0d 74 09 f8 45 89 f4 e9 04 ff ff ff e8 00 74 09 f8 44 89 f0 99 <f7> 7c 24 14 41 29 d6 45 89 f4 e9 ec fe ff ff e8 e8 73 09 f8 48 89 RSP: 0018:ffffc900041f7930 EFLAGS: 00010293 RAX: 0000000000017e67 RBX: 0000000000017e67 RCX: ffffffff8983314b RDX: 0000000000000000 RSI: ffffffff898331b0 RDI: 0000000000000004 RBP: 00000000005d6000 R08: 0000000000000004 R09: 0000000000017e67 R10: 0000000000003e80 R11: 0000000000000000 R12: 0000000000003e80 R13: ffff888031d9b440 R14: 0000000000017e67 R15: 00000000002eb000 FS: 00007feb5d7f16c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007feb5d8adbb8 CR3: 0000000074e4c000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __tcp_cleanup_rbuf+0x3e7/0x4b0 net/ipv4/tcp.c:1493 mptcp_rcv_space_adjust net/mptcp/protocol.c:2085 [inline] mptcp_recvmsg+0x2156/0x2600 net/mptcp/protocol.c:2289 inet_recvmsg+0x469/0x6a0 net/ipv4/af_inet.c:885 sock_recvmsg_nosec net/socket.c:1051 [inline] sock_recvmsg+0x1b2/0x250 net/socket.c:1073 __sys_recvfrom+0x1a5/0x2e0 net/socket.c:2265 __do_sys_recvfrom net/socket.c:2283 [inline] __se_sys_recvfrom net/socket.c:2279 [inline] __x64_sys_recvfrom+0xe0/0x1c0 net/socket.c:2279 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7feb5d857559 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007feb5d7f1208 EFLAGS: 00000246 ORIG_RAX: 000000000000002d RAX: ffffffffffffffda RBX: 00007feb5d8e1318 RCX: 00007feb5d857559 RDX: 000000800000000e RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007feb5d8e1310 R08: 0000000000000000 R09: ffffffff81000000 R10: 0000000000000100 R11: 0000000000000246 R12: 00007feb5d8e131c R13: 00007feb5d8ae074 R14: 000000800000000e R15: 00000000fffffdef and provided a nice reproducer. The root cause is the current bad handling of racing disconnect. After the blamed commit below, sk_wait_data() can return (with error) with the underlying socket disconnected and a zero rcv_mss. Catch the error and return without performing any additional operations on the current socket. Reported-by: Eric Dumazet <[email protected]> Fixes: 419ce133ab92 ("tcp: allow again tcp_disconnect() when threads are waiting") Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/8c82ecf71662ecbc47bf390f9905de70884c9f2d.1731060874.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <[email protected]>
* | mptcp: fix possible integer overflow in mptcp_reset_tout_timerDmitry Kandybka2024-11-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | In 'mptcp_reset_tout_timer', promote 'probe_timestamp' to unsigned long to avoid possible integer overflow. Compile tested only. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Dmitry Kandybka <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2024-11-011-0/+2
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR (net-6.12-rc6). Conflicts: drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c cbe84e9ad5e2 ("wifi: iwlwifi: mvm: really send iwl_txpower_constraints_cmd") 188a1bf89432 ("wifi: mac80211: re-order assigning channel in activate links") https://lore.kernel.org/all/[email protected]/ net/mac80211/cfg.c c4382d5ca1af ("wifi: mac80211: update the right link for tx power") 8dd0498983ee ("wifi: mac80211: Fix setting txpower with emulate_chanctx") drivers/net/ethernet/intel/ice/ice_ptp_hw.h 6e58c3310622 ("ice: fix crash on probe for DPLL enabled E810 LOM") e4291b64e118 ("ice: Align E810T GPIO to other products") ebb2693f8fbd ("ice: Read SDP section from NVM for pin definitions") ac532f4f4251 ("ice: Cleanup unused declarations") https://lore.kernel.org/all/[email protected]/ No adjacent changes. Signed-off-by: Jakub Kicinski <[email protected]>
| * mptcp: init: protect sched with rcu_read_lockMatthieu Baerts (NGI0)2024-10-281-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enabling CONFIG_PROVE_RCU_LIST with its dependence CONFIG_RCU_EXPERT creates this splat when an MPTCP socket is created: ============================= WARNING: suspicious RCU usage 6.12.0-rc2+ #11 Not tainted ----------------------------- net/mptcp/sched.c:44 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by mptcp_connect/176. stack backtrace: CPU: 0 UID: 0 PID: 176 Comm: mptcp_connect Not tainted 6.12.0-rc2+ #11 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:123) lockdep_rcu_suspicious (kernel/locking/lockdep.c:6822) mptcp_sched_find (net/mptcp/sched.c:44 (discriminator 7)) mptcp_init_sock (net/mptcp/protocol.c:2867 (discriminator 1)) ? sock_init_data_uid (arch/x86/include/asm/atomic.h:28) inet_create.part.0.constprop.0 (net/ipv4/af_inet.c:386) ? __sock_create (include/linux/rcupdate.h:347 (discriminator 1)) __sock_create (net/socket.c:1576) __sys_socket (net/socket.c:1671) ? __pfx___sys_socket (net/socket.c:1712) ? do_user_addr_fault (arch/x86/mm/fault.c:1419 (discriminator 1)) __x64_sys_socket (net/socket.c:1728) do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) That's because when the socket is initialised, rcu_read_lock() is not used despite the explicit comment written above the declaration of mptcp_sched_find() in sched.c. Adding the missing lock/unlock avoids the warning. Fixes: 1730b2b2c5a5 ("mptcp: add sched in mptcp_sock") Cc: [email protected] Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/523 Reviewed-by: Geliang Tang <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* | mptcp: implement mptcp_pm_connection_closedGeliang Tang2024-10-281-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The MPTCP path manager event handler mptcp_pm_connection_closed interface has been added in the commit 1b1c7a0ef7f3 ("mptcp: Add path manager interface") but it was an empty function from then on. With such name, it sounds good to invoke mptcp_event with the MPTCP_EVENT_CLOSED event type from it. It also removes a bit of duplicated code. Signed-off-by: Geliang Tang <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* | mptcp: annotate data-races around subflow->fully_establishedGang Yan2024-10-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We introduce the same handling for potential data races with the 'fully_established' flag in subflow as previously done for msk->fully_established. Additionally, we make a crucial change: convert the subflow's 'fully_established' from 'bit_field' to 'bool' type. This is necessary because methods for avoiding data races don't work well with 'bit_field'. Specifically, the 'READ_ONCE' needs to know the size of the variable being accessed, which is not supported in 'bit_field'. Also, 'test_bit' expect the address of 'bit_field'. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/516 Signed-off-by: Gang Yan <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2024-10-101-3/+21
|\| | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR (net-6.12-rc3). No conflicts and no adjacent changes. Signed-off-by: Jakub Kicinski <[email protected]>
| * mptcp: handle consistently DSS corruptionPaolo Abeni2024-10-101-3/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bugged peer implementation can send corrupted DSS options, consistently hitting a few warning in the data path. Use DEBUG_NET assertions, to avoid the splat on some builds and handle consistently the error, dumping related MIBs and performing fallback and/or reset according to the subflow type. Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue") Cc: [email protected] Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* | tcp: add a fast path in tcp_delack_timer()Eric Dumazet2024-10-041-1/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | delack timer is not stopped from inet_csk_clear_xmit_timer() because we do not define INET_CSK_CLEAR_TIMERS. This is a conscious choice : inet_csk_clear_xmit_timer() is often called from another cpu. Calling del_timer() would cause false sharing and lock contention. This means that very often, tcp_delack_timer() is called at the timer expiration, while there is no ACK to transmit. This can be detected very early, avoiding the socket spinlock. Notes: - test about tp->compressed_ack is racy, but in the unlikely case there is a race, the dedicated compressed_ack_timer hrtimer would close it. - Even if the fast path is not taken, reading icsk->icsk_ack.pending and tp->compressed_ack before acquiring the socket spinlock reduces acquisition time and chances of contention. Signed-off-by: Eric Dumazet <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: disable active MPTCP in case of blackholeMatthieu Baerts (NGI0)2024-09-111-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | An MPTCP firewall blackhole can be detected if the following SYN retransmission after a fallback to "plain" TCP is accepted. In case of blackhole, a similar technique to the one in place with TFO is now used: MPTCP can be disabled for a certain period of time, 1h by default. This time period will grow exponentially when more blackhole issues get detected right after MPTCP is re-enabled and will reset to the initial value when the blackhole issue goes away. The blackhole period can be modified thanks to a new sysctl knob: blackhole_timeout. Two new MIB counters help understanding what's happening: - 'Blackhole', incremented when a blackhole is detected. - 'MPCapableSYNTXDisabled', incremented when an MPTCP connection directly falls back to TCP during the blackhole period. Because the technique is inspired by the one used by TFO, an important part of the new code is similar to what can find in tcp_fastopen.c, with some adaptations to the MPTCP case. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/57 Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20240909-net-next-mptcp-fallback-x-mpc-v1-3-da7ebb4cd2a3@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: export mptcp_subflow_early_fallback()Matthieu Baerts (NGI0)2024-09-111-7/+0
| | | | | | | | | | | This helper will be used outside protocol.h in the following commit. While at it, also add a 'pr_fallback()' debug print, to help identifying fallbacks. Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20240909-net-next-mptcp-fallback-x-mpc-v1-1-da7ebb4cd2a3@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: avoid duplicated SUB_CLOSED eventsMatthieu Baerts (NGI0)2024-08-291-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The initial subflow might have already been closed, but still in the connection list. When the worker is instructed to close the subflows that have been marked as closed, it might then try to close the initial subflow again. A consequence of that is that the SUB_CLOSED event can be seen twice: # ip mptcp endpoint 1.1.1.1 id 1 subflow dev eth0 2.2.2.2 id 2 subflow dev eth1 # ip mptcp monitor & [ CREATED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9 [ ESTABLISHED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9 [ SF_ESTABLISHED] remid=0 locid=2 saddr4=2.2.2.2 daddr4=9.9.9.9 # ip mptcp endpoint delete id 1 [ SF_CLOSED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9 [ SF_CLOSED] remid=0 locid=0 saddr4=1.1.1.1 daddr4=9.9.9.9 The first one is coming from mptcp_pm_nl_rm_subflow_received(), and the second one from __mptcp_close_subflow(). To avoid doing the post-closed processing twice, the subflow is now marked as closed the first time. Note that it is not enough to check if we are dealing with the first subflow and check its sk_state: the subflow might have been reset or closed before calling mptcp_close_ssk(). Fixes: b911c97c7dc7 ("mptcp: add netlink event support") Cc: [email protected] Tested-by: Arınç ÜNAL <[email protected]> Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
* mptcp: pr_debug: add missing \n at the endMatthieu Baerts (NGI0)2024-08-271-26/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pr_debug() have been added in various places in MPTCP code to help developers to debug some situations. With the dynamic debug feature, it is easy to enable all or some of them, and asks users to reproduce issues with extra debug. Many of these pr_debug() don't end with a new line, while no 'pr_cont()' are used in MPTCP code. So the goal was not to display multiple debug messages on one line: they were then not missing the '\n' on purpose. Not having the new line at the end causes these messages to be printed with a delay, when something else needs to be printed. This issue is not visible when many messages need to be printed, but it is annoying and confusing when only specific messages are expected, e.g. # echo "func mptcp_pm_add_addr_echoed +fmp" \ > /sys/kernel/debug/dynamic_debug/control # ./mptcp_join.sh "signal address"; \ echo "$(awk '{print $1}' /proc/uptime) - end"; \ sleep 5s; \ echo "$(awk '{print $1}' /proc/uptime) - restart"; \ ./mptcp_join.sh "signal address" 013 signal address (...) 10.75 - end 15.76 - restart 013 signal address [ 10.367935] mptcp:mptcp_pm_add_addr_echoed: MPTCP: msk=(...) (...) => a delay of 5 seconds: printed with a 10.36 ts, but after 'restart' which was printed at the 15.76 ts. The 'Fixes' tag here below points to the first pr_debug() used without '\n' in net/mptcp. This patch could be split in many small ones, with different Fixes tag, but it doesn't seem worth it, because it is easy to re-generate this patch with this simple 'sed' command: git grep -l pr_debug -- net/mptcp | xargs sed -i "s/\(pr_debug(\".*[^n]\)\(\"[,)]\)/\1\\\n\2/g" So in case of conflicts, simply drop the modifications, and launch this command. Fixes: f870fa0b5768 ("mptcp: Add MPTCP socket stubs") Cc: [email protected] Reviewed-by: Geliang Tang <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20240826-net-mptcp-close-extra-sf-fin-v1-4-905199fe1172@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: sched: check both backup in retransMatthieu Baerts (NGI0)2024-08-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | The 'mptcp_subflow_context' structure has two items related to the backup flags: - 'backup': the subflow has been marked as backup by the other peer - 'request_bkup': the backup flag has been set by the host Looking only at the 'backup' flag can make sense in some cases, but it is not the behaviour of the default packet scheduler when selecting paths. As explained in the commit b6a66e521a20 ("mptcp: sched: check both directions for backup"), the packet scheduler should look at both flags, because that was the behaviour from the beginning: the 'backup' flag was set by accident instead of the 'request_bkup' one. Now that the latter has been fixed, get_retrans() needs to be adapted as well. Fixes: b6a66e521a20 ("mptcp: sched: check both directions for backup") Cc: [email protected] Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20240826-net-mptcp-close-extra-sf-fin-v1-3-905199fe1172@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: close subflow when receiving TCP+FINMatthieu Baerts (NGI0)2024-08-271-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a peer decides to close one subflow in the middle of a connection having multiple subflows, the receiver of the first FIN should accept that, and close the subflow on its side as well. If not, the subflow will stay half closed, and would even continue to be used until the end of the MPTCP connection or a reset from the network. The issue has not been seen before, probably because the in-kernel path-manager always sends a RM_ADDR before closing the subflow. Upon the reception of this RM_ADDR, the other peer will initiate the closure on its side as well. On the other hand, if the RM_ADDR is lost, or if the path-manager of the other peer only closes the subflow without sending a RM_ADDR, the subflow would switch to TCP_CLOSE_WAIT, but that's it, leaving the subflow half-closed. So now, when the subflow switches to the TCP_CLOSE_WAIT state, and if the MPTCP connection has not been closed before with a DATA_FIN, the kernel owning the subflow schedules its worker to initiate the closure on its side as well. This issue can be easily reproduced with packetdrill, as visible in [1], by creating an additional subflow, injecting a FIN+ACK before sending the DATA_FIN, and expecting a FIN+ACK in return. Fixes: 40947e13997a ("mptcp: schedule worker when subflow is closed") Cc: [email protected] Link: https://github.com/multipath-tcp/packetdrill/pull/154 [1] Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Link: https://patch.msgid.link/20240826-net-mptcp-close-extra-sf-fin-v1-1-905199fe1172@kernel.org Signed-off-by: Jakub Kicinski <[email protected]>
* mptcp: fix bad RCVPRUNED mib accountingPaolo Abeni2024-08-011-4/+4
| | | | | | | | | | | | | | | | | Since its introduction, the mentioned MIB accounted for the wrong event: wake-up being skipped as not-needed on some edge condition instead of incoming skb being dropped after landing in the (subflow) receive queue. Move the increment in the correct location. Fixes: ce599c516386 ("mptcp: properly account bulk freed memory") Cc: [email protected] Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
* mptcp: sched: check both directions for backupMatthieu Baerts (NGI0)2024-07-301-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | The 'mptcp_subflow_context' structure has two items related to the backup flags: - 'backup': the subflow has been marked as backup by the other peer - 'request_bkup': the backup flag has been set by the host Before this patch, the scheduler was only looking at the 'backup' flag. That can make sense in some cases, but it looks like that's not what we wanted for the general use, because either the path-manager was setting both of them when sending an MP_PRIO, or the receiver was duplicating the 'backup' flag in the subflow request. Note that the use of these two flags in the path-manager are going to be fixed in the next commits, but this change here is needed not to modify the behaviour. Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests") Cc: [email protected] Reviewed-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts (NGI0) <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2024-06-131-0/+1
|\ | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR. No conflicts, no adjacent changes. Signed-off-by: Jakub Kicinski <[email protected]>