Add a proper kernel-doc description for the sk_stream_write_space()
function, replacing the FIXME comment that previously marked it as
undocumented.
No functional changes.
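A hedged sketch of the kind of kernel-doc comment this adds (the wording
in the actual patch may differ):

  /**
   * sk_stream_write_space - stream socket write_space callback
   * @sk: socket with newly available send buffer space
   *
   * Called when send buffer space frees up; if the socket is
   * writeable again, wakes tasks polling for EPOLLOUT and
   * signals asynchronous writers.
   */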
Signed-off-by: Suchit Karunakaran <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
The DMA mapping functions can fail and should be checked for errors
with dma_mapping_error(). If a mapping fails, unmap the buffers already
mapped and return an error.
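A minimal sketch of the pattern, with illustrative buffer and device
names (not the driver's actual fields):

  dma_addr_t addr;

  addr = dma_map_single(&pdev->dev, buf, len, DMA_TO_DEVICE);
  if (dma_mapping_error(&pdev->dev, addr)) {
          /* undo mappings made earlier in this setup path */
          dma_unmap_single(&pdev->dev, prev_addr, prev_len, DMA_TO_DEVICE);
          return -ENOMEM;
  }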
Signed-off-by: Thomas Fourier <[email protected]>
Acked-by: Mark Einon <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
The DMA map functions can fail and should be tested for errors.
Signed-off-by: Thomas Fourier <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
Add IPv6 environment support to the napi_id selftest.
Test Plan:
./run_kselftest.sh -t drivers/net:napi_id.py
TAP version 13
1..1
# timeout set to 45
# selftests: drivers/net: napi_id.py
# TAP version 13
# 1..1
# ok 1 napi_id.test_napi_id
# # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 1 selftests: drivers/net: napi_id.py
Signed-off-by: Tianyi Cui <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
VNIC testing on multi-core Power systems showed SAR stats drift
and packet-rate inconsistencies under load.
Implement ndo_get_stats64 to safely aggregate the queue-level
atomic64 counters into rtnl_link_stats64 for use by tools like 'ip -s',
'ifconfig', and 'sar', aligning SAR reporting with the standard
kernel interface for retrieving netdev stats.
This removes redundant per-adapter stat updates, reduces overhead,
eliminates cacheline bouncing from hot-path updates, and improves
the accuracy of reported packet rates.
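A hedged sketch of the ndo_get_stats64 aggregation (struct and field
names here are illustrative, not the ibmvnic driver's actual layout):

  static void my_get_stats64(struct net_device *dev,
                             struct rtnl_link_stats64 *stats)
  {
          struct my_adapter *adapter = netdev_priv(dev);
          int i;

          for (i = 0; i < adapter->num_queues; i++) {
                  const struct my_queue_stats *q = &adapter->queue_stats[i];

                  stats->rx_packets += q->rx_packets;
                  stats->rx_bytes   += q->rx_bytes;
                  stats->tx_packets += q->tx_packets;
                  stats->tx_bytes   += q->tx_bytes;
          }
  }

  /* hooked up via .ndo_get_stats64 = my_get_stats64 in net_device_ops */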
Signed-off-by: Mingming Cao <[email protected]>
Reviewed-by: Brian King <[email protected]>
Reviewed-by: Dave Marquardt <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
---
Changes since v3:
link to v3: https://www.spinics.net/lists/netdev/msg1107999.html
-- keep per queue counters as u64 (this patch) and drop off patch 1 in v3
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
Tariq Toukan says:
====================
net/mlx5: misc changes 2025-07-16
This series contains misc enhancements to the mlx5 driver.
v1: https://lore.kernel.org/[email protected]
====================
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
The qdisc_sleeping variable is declared as "struct Qdisc __rcu" and
so needs a proper RCU accessor.
Without rtnl_dereference(), sparse generates the following warning:
drivers/net/ethernet/mellanox/mlx5/core/en/qos.c:377:40: warning:
incorrect type in initializer (different address spaces)
drivers/net/ethernet/mellanox/mlx5/core/en/qos.c:377:40: expected
struct Qdisc *qdisc
drivers/net/ethernet/mellanox/mlx5/core/en/qos.c:377:40: got struct
Qdisc [noderef] __rcu *qdisc_sleeping
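A minimal sketch of the fix, assuming RTNL is held at this point (which
is what rtnl_dereference() asserts):

  /* a plain assignment drops the __rcu annotation and trips sparse */
  struct Qdisc *qdisc = rtnl_dereference(dev_queue->qdisc_sleeping);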
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Michal Swiatkowski <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
Fix the following kernel-doc warning:
git ls-files *.[ch] | egrep drivers/net/ethernet/mellanox/mlx5/core/ |\
xargs scripts/kernel-doc --none
drivers/net/ethernet/mellanox/mlx5/core/eswitch.h:824: warning: cannot
understand function prototype: 'struct mlx5_esw_event_info '
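This warning usually means the struct's kernel-doc header is malformed;
a hedged sketch of the expected form (the member shown is illustrative):

  /**
   * struct mlx5_esw_event_info - eswitch mode change event data
   * @new_mode: eswitch mode being switched to
   */
  struct mlx5_esw_event_info {
          u16 new_mode;
  };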
Signed-off-by: Moshe Shemesh <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Michal Swiatkowski <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
IPsec hardware offload in legacy mode should not be affected by the
steering mode, so it should also work properly in hmfs mode.
Remove the steering mode validation when calculating the caps for
packet offload; this also enables the previously missing
MLX5_IPSEC_CAP_PRIO cap needed for crypto offload.
Signed-off-by: Lama Kayal <[email protected]>
Reviewed-by: Jianbo Liu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Michal Swiatkowski <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
readl() returns a 32-bit value, but Clause 22/45 MDIO registers are only
16 bits wide. Masking with 0xFFFF prevents garbage upper bits from
leaking into the returned value.
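A minimal sketch of the pattern (the register offset name is
illustrative):

  /* only the low 16 bits of the MDIO data register are valid */
  data = readl(priv->base + MDIO_DATA_REG) & 0xFFFF;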
Signed-off-by: Jack Ping CHNG <[email protected]>
Reviewed-by: Maxime Chevallier <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
The __esw_qos_alloc_node() function returns NULL on error. It doesn't
return error pointers. Update the error checking to match.
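A hedged sketch of the corrected check (arguments elided):

  node = __esw_qos_alloc_node(/* ... */);
  if (!node)      /* was IS_ERR(node), never true for a NULL-returning helper */
          return -ENOMEM;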
Fixes: 96619c485fa6 ("net/mlx5: Add support for setting tc-bw on nodes")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
We recently changed this from using devm_ioremap() to using
devm_ioremap_resource(), and unfortunately the former returns NULL on
failure while the latter returns error pointers. The error check needs
to be updated to match.
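A minimal sketch of the updated check:

  void __iomem *base;

  base = devm_ioremap_resource(&pdev->dev, res);
  if (IS_ERR(base))       /* returns ERR_PTR(), never NULL */
          return PTR_ERR(base);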
Fixes: e27dba1951ce ("net: Use of_reserved_mem_region_to_resource{_byname}() for "memory-region"")
Signed-off-by: Dan Carpenter <[email protected]>
Acked-by: Lorenzo Bianconi <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
The devm_ioremap_resource() function returns error pointers. It never
returns NULL. Update the check to match.
Fixes: e27dba1951ce ("net: Use of_reserved_mem_region_to_resource{_byname}() for "memory-region"")
Signed-off-by: Dan Carpenter <[email protected]>
Acked-by: Lorenzo Bianconi <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
Luo Jie says:
====================
Add shared PHY counter support for QCA807x and QCA808x
The implementation of the PHY counter is identical for both QCA808x and
QCA807x series devices. This includes counters for both good and bad CRC
frames in the RX and TX directions, which are active when CRC checking
is enabled.
This patch series introduces PHY counter functions into a shared library,
enabling counter support for the QCA808x and QCA807x families through this
common infrastructure. Additionally, enable CRC checking and configure
automatic clearing of counters after reading within config_init() to ensure
accurate counter recording.
v2: https://lore.kernel.org/[email protected]
v1: https://lore.kernel.org/[email protected]
====================
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
In the QCA807x PHY driver's config_init() function, enable CRC
checking for received and transmitted frames, and configure the
counters to clear after being read, to support accurate counter
recording. Additionally, add support for the PHY counter operations.
Signed-off-by: Luo Jie <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
Enable CRC checking for received and transmitted frames, and configure
counters to clear after being read within config_init() for accurate
counter recording. Additionally, add PHY counter operations and integrate
shared functions.
Signed-off-by: Luo Jie <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
Add PHY counter functionality to the shared library. The implementation
is identical for the current QCA807X and QCA808X PHYs.
The PHY counter can be configured to perform CRC checking for both received
and transmitted packets. Additionally, the packet counter can be set to
automatically clear after it is read.
The PHY counter includes 32-bit packet counters for both RX (received) and
TX (transmitted) packets, as well as 16-bit counters for recording CRC
error packets for both RX and TX.
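A hedged sketch of reading one such counter over the MMD registers
(register names and layout here are hypothetical, not the actual QCA
definitions):

  int hi, lo;
  u64 rx_pkts;

  hi = phy_read_mmd(phydev, MDIO_MMD_VEND1, QCA_RX_CNT_HI); /* hypothetical regs */
  lo = phy_read_mmd(phydev, MDIO_MMD_VEND1, QCA_RX_CNT_LO);
  if (hi < 0 || lo < 0)
          return -EIO;
  rx_pkts = ((u64)hi << 16) | lo;   /* 32-bit RX packet counter */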
Signed-off-by: Luo Jie <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
The bool notify variable is used only to decide whether to call
rtnl_offload_xstats_notify(). Remove it and call the function directly
from the branch that previously set it.
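A hedged sketch of the refactor (surrounding logic simplified):

  /* before */
  bool notify = false;

  if (some_condition)
          notify = true;
  /* ... */
  if (notify)
          rtnl_offload_xstats_notify(dev);

  /* after */
  if (some_condition)
          rtnl_offload_xstats_notify(dev);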
Signed-off-by: Dennis Chen <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Petr Machata <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2025-07-17
We've added 13 non-merge commits during the last 20 day(s) which contain
a total of 4 files changed, 712 insertions(+), 84 deletions(-).
The main changes are:
1) Avoid skipping or repeating a sk when using a TCP bpf_iter,
from Jordan Rife.
2) Clarify the driver requirements for using XDP metadata,
from Song Yoong Siang.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
doc: xdp: Clarify driver implementation for XDP Rx metadata
selftests/bpf: Add tests for bucket resume logic in established sockets
selftests/bpf: Create iter_tcp_destroy test program
selftests/bpf: Create established sockets in socket iterator tests
selftests/bpf: Make ehash buckets configurable in socket iterator tests
selftests/bpf: Allow for iteration over multiple states
selftests/bpf: Allow for iteration over multiple ports
selftests/bpf: Add tests for bucket resume logic in listening sockets
bpf: tcp: Avoid socket skips and repeats during iteration
bpf: tcp: Use bpf_tcp_iter_batch_item for bpf_tcp_iter_state batch items
bpf: tcp: Get rid of st_bucket_done
bpf: tcp: Make sure iter->batch always contains a full bucket snapshot
bpf: tcp: Make mem flags configurable through bpf_iter_tcp_realloc_batch
====================
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
Clarify that drivers must remove device-reserved metadata from the
data_meta area before passing frames to XDP programs.
Additionally, expand the explanation of how userspace and BPF programs
should coordinate the use of METADATA_SIZE, and add a detailed diagram
to illustrate pointer adjustments and metadata layout.
Also describe the requirements and constraints enforced by
bpf_xdp_adjust_meta().
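A minimal sketch of an XDP program reserving metadata space with
bpf_xdp_adjust_meta() (the 8-byte layout is illustrative and must match
what the consuming userspace or BPF program expects):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #define META_SZ sizeof(__u64)    /* illustrative METADATA_SIZE */

  SEC("xdp")
  int xdp_set_meta(struct xdp_md *ctx)
  {
          __u64 *meta;
          void *data;

          /* grow the data_meta area into the headroom by META_SZ bytes */
          if (bpf_xdp_adjust_meta(ctx, -(int)META_SZ))
                  return XDP_PASS;

          data = (void *)(long)ctx->data;
          meta = (void *)(long)ctx->data_meta;
          if ((void *)(meta + 1) > data)   /* verifier-mandated bounds check */
                  return XDP_PASS;

          *meta = bpf_ktime_get_ns();      /* illustrative metadata payload */
          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";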
Signed-off-by: Song Yoong Siang <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
Jordan Rife says:
====================
bpf: tcp: Exactly-once socket iteration
TCP socket iterators use iter->offset to track progress through a
bucket, which is a measure of the number of matching sockets from the
current bucket that have been seen or processed by the iterator. On
subsequent iterations, if the current bucket has unprocessed items, we
skip at least iter->offset matching items in the bucket before adding
any remaining items to the next batch. However, iter->offset isn't
always an accurate measure of "things already seen" when the underlying
bucket changes between reads, which can lead to repeated or skipped
sockets. Instead, this series remembers the cookies of the sockets we
haven't seen yet in the current bucket and resumes from the first cookie
in that list that we can find on the next iteration.
This is a continuation of the work started in [1]. This series largely
replicates the patterns applied to UDP socket iterators, applying them
instead to TCP socket iterators.
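A hedged sketch of the resume-by-cookie idea (helper and field names are
illustrative; the actual patches differ in detail):

  /* on stop: remember cookies of sockets not yet shown to userspace */
  for (i = iter->cur_sk; i < iter->end_sk; i++)
          iter->batch[i].cookie = sock_gen_cookie(iter->batch[i].sk);

  /* on resume: the first remembered cookie still in the bucket wins */
  sk_nulls_for_each(sk, node, &bucket->chain)
          if (cookie_remembered(iter, sock_gen_cookie(sk))) /* illustrative helper */
                  return sk;   /* resume iteration from here */
  return first_sk;             /* all remembered sockets vanished; take the bucket head */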
CHANGES
=======
v5 -> v6:
* In patch ten ("selftests/bpf: Create established sockets in socket
iterator tests"), use poll() to choose a socket that has a connection
ready to be accept()ed. Before, connect_to_server would set the
O_NONBLOCK flag on all listening sockets so that accept_from_one could
loop through them all and find the one that connect_to_addr_str
connected to. However, this is subtly buggy and could potentially lead
to test flakes, since the 3-way handshake isn't necessarily done when
connect() returns, so it's possible none of the accept() calls succeed.
Use poll() instead to guarantee that the socket we accept() from is
ready and eliminate the need for the O_NONBLOCK flag (Martin).
v4 -> v5:
* Move WARN_ON_ONCE before the `done` label in patch two ("bpf: tcp:
Make sure iter->batch always contains a full bucket snapshot")
(Martin).
* Remove unnecessary kfunc declaration in patch eleven ("selftests/bpf:
Create iter_tcp_destroy test program") (Martin).
* Make sure to close the socket fd at the end of `destroy` in patch
twelve ("selftests/bpf: Add tests for bucket resume logic in
established sockets") (Martin).
v3 -> v4:
* Drop braces around sk_nulls_for_each_from in patch five ("bpf: tcp:
Avoid socket skips and repeats during iteration") (Stanislav).
* Add a break after the TCP_SEQ_STATE_ESTABLISHED case in patch five
(Stanislav).
* Add an `if (sock_type == SOCK_STREAM)` check before assigning
TCP_LISTEN to skel->rodata->ss in patch eight ("selftests/bpf: Allow
for iteration over multiple states") to more clearly express the
intent that the option is only consumed for SOCK_STREAM tests
(Stanislav).
* Move the `i = 0` assignment into the for loop in patch ten
("selftests/bpf: Create established sockets in socket iterator
tests") (Stanislav).
v2 -> v3:
* Unroll the loop inside bpf_iter_tcp_batch to make the logic easier to
follow in patch two ("bpf: tcp: Make sure iter->batch always contains
a full bucket snapshot"). This gets rid of the `resizes` variable from
v2 and eliminates the extra conditional that checks how many batch
resize attempts have occurred so far (Stanislav).
Note: This changes the behavior slightly. Before, in the case that
the second call to tcp_seek_last_pos (and later bpf_iter_tcp_resume)
advances to a new bucket, which may happen if the current bucket is
emptied after releasing its lock, the `resizes` "budget" would be
reset, the net effect being that we would try a batch resize with
GFP_USER at most once per bucket. Now, we try to resize the batch
with GFP_USER at most once per call, so it makes it slightly more
likely that we hit the GFP_NOWAIT scenario. However, this edge case
should be rare in practice anyway, and the new behavior is more or
less consistent with the original retry logic, so avoid the loop and
prefer code clarity.
* Move the call to bpf_iter_tcp_put_batch out of
bpf_iter_tcp_realloc_batch and call it directly before invoking
bpf_iter_tcp_realloc_batch with GFP_USER inside bpf_iter_tcp_batch.
/Don't/ call it before invoking bpf_iter_tcp_realloc_batch the second
time while we hold the lock with GFP_NOWAIT. This avoids a conditional
inside bpf_iter_tcp_realloc_batch from v2 that only calls
bpf_iter_tcp_put_batch if flags != GFP_NOWAIT and is a bit more
explicit (Stanislav).
* Adjust patch five ("bpf: tcp: Avoid socket skips and repeats during
iteration") to fit with the new logic in patch two.
v1 -> v2:
* In patch five ("bpf: tcp: Avoid socket skips and repeats during
iteration"), remove unnecessary bucket bounds checks in
bpf_iter_tcp_resume. In either case, if st->bucket is outside the
current table's range then bpf_iter_tcp_resume_* calls *_get_first
which immediately returns NULL anyway and the logic will fall through.
(Martin)
* Add a check at the top of bpf_iter_tcp_resume_listening and
bpf_iter_tcp_resume_established to see if we're done with the current
bucket and advance it immediately instead of wasting time finding the
first matching socket in that bucket with
(listening|established)_get_first. In v1, we originally discussed
adding logic to advance the bucket in bpf_iter_tcp_seq_next and
bpf_iter_tcp_seq_stop, but after trying this the logic seemed harder
to track. Overall, keeping everything inside bpf_iter_tcp_resume_*
seemed a bit clearer. (Martin)
* Instead of using a timeout in the last patch ("selftests/bpf: Add
tests for bucket resume logic in established sockets") to wait for
sockets to leave the ehash table after calling close(), use
bpf_sock_destroy to deterministically destroy and remove them. This
introduces one more patch ("selftests/bpf: Create iter_tcp_destroy
test program") to create the iterator program that destroys a selected
socket. Drive this through a destroy() function in the last patch
which, just like close(), accepts a socket file descriptor. (Martin)
* Introduce one more patch ("selftests/bpf: Allow for iteration over
multiple states") to fix a latent bug in iter_tcp_soreuse where the
sk->sk_state != TCP_LISTEN check was ignored. Add the "ss" variable to
allow test code to configure which socket states to allow.
[1]: https://lore.kernel.org/bpf/[email protected]/
====================
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
Replicate the set of test cases used for UDP socket iterators to test
similar scenarios for TCP established sockets.
Signed-off-by: Jordan Rife <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
|
Prepare for bucket resume tests for established TCP sockets by creating
a program to immediately destroy and remove sockets from the TCP ehash
table, since close() is not deterministic.
Signed-off-by: Jordan Rife <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
|
Prepare for bucket resume tests for established TCP sockets by creating
established sockets. Collect socket fds from connect() and accept()
sides and pass them to test cases.
Signed-off-by: Jordan Rife <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
|
Prepare for bucket resume tests for established TCP sockets by making
the number of ehash buckets configurable. Subsequent patches force all
established sockets into the same bucket by setting ehash_buckets to
one.
Signed-off-by: Jordan Rife <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
|
Add parentheses around the loopback address check to fix its precedence,
and make the socket state filter configurable for the TCP socket
iterators. Iterators can skip the socket state check by setting ss to 0.
Signed-off-by: Jordan Rife <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
|
Prepare to test TCP socket iteration over both listening and established
sockets by allowing the BPF iterator programs to skip the port check.
Signed-off-by: Jordan Rife <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
|
Replicate the set of test cases used for UDP socket iterators to test
similar scenarios for TCP listening sockets.
Signed-off-by: Jordan Rife <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
|
Replace the offset-based approach for tracking progress through a bucket
in the TCP table with one based on socket cookies. Remember the cookies
of unprocessed sockets from the last batch and use this list to
pick up where we left off or, in the case that the next socket
disappears between reads, find the first socket after that point that
still exists in the bucket and resume from there.
This approach guarantees that all sockets that existed when iteration
began and continue to exist throughout will be visited exactly once.
Sockets that are added to the table during iteration may or may not be
seen, but if they are they will be seen exactly once.
Signed-off-by: Jordan Rife <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
|
Prepare for the next patch that tracks cookies between iterations by
converting struct sock **batch to union bpf_tcp_iter_batch_item *batch
inside struct bpf_tcp_iter_state.
Signed-off-by: Jordan Rife <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
|
Get rid of the st_bucket_done field to simplify TCP iterator state and
logic. Before, st_bucket_done could be false if bpf_iter_tcp_batch
returned a partial batch; however, with the last patch ("bpf: tcp: Make
sure iter->batch always contains a full bucket snapshot"),
st_bucket_done == true is equivalent to iter->cur_sk == iter->end_sk.
Signed-off-by: Jordan Rife <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
|
Require that iter->batch always contains a full bucket snapshot. This
invariant is important to avoid skipping or repeating sockets during
iteration when combined with the next few patches. Before, there were
two cases where a call to bpf_iter_tcp_batch might capture only part of
a bucket:
1. When bpf_iter_tcp_realloc_batch() returns -ENOMEM.
2. When more sockets are added to the bucket while calling
bpf_iter_tcp_realloc_batch(), making the updated batch size
insufficient.
In cases where the batch size only covers part of a bucket, it is
possible to forget which sockets were already visited, especially if we
have to process a bucket in more than two batches. This forces us to
choose between repeating or skipping sockets, so don't allow this:
1. Stop iteration and propagate -ENOMEM up to userspace if reallocation
fails instead of continuing with a partial batch.
2. Try bpf_iter_tcp_realloc_batch() with GFP_USER just as before, but if
we still aren't able to capture the full bucket, call
bpf_iter_tcp_realloc_batch() again while holding the bucket lock to
guarantee the bucket does not change. On the second attempt use
GFP_NOWAIT since we hold the spin lock (see the sketch below).
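A hedged sketch of the two-phase reallocation (control flow simplified;
bucket_size() is an illustrative helper, not the kernel's):

  /* first attempt: may sleep; bucket lock not held */
  err = bpf_iter_tcp_realloc_batch(iter, want, GFP_USER);
  if (err)
          return ERR_PTR(err);

  spin_lock_bh(&bucket->lock);
  if (bucket_size(bucket) > iter->max_sk) {
          /* the bucket grew meanwhile: retry atomically under the lock */
          err = bpf_iter_tcp_realloc_batch(iter, bucket_size(bucket),
                                           GFP_NOWAIT);
          if (err) {
                  spin_unlock_bh(&bucket->lock);
                  return ERR_PTR(err);
          }
  }
  /* ... snapshot the whole bucket into iter->batch, then unlock ... */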
I did some manual testing to exercise the code paths where GFP_NOWAIT is
used and where ERR_PTR(err) is returned. I used the realloc test cases
included later in this series to trigger a scenario where a realloc
happens inside bpf_iter_tcp_batch and made a small code tweak to force
the first realloc attempt to allocate a too-small batch, thus requiring
another attempt with GFP_NOWAIT. Some printks showed both reallocs with
the tests passing:
Jun 27 00:00:53 crow kernel: again GFP_USER
Jun 27 00:00:53 crow kernel: again GFP_NOWAIT
Jun 27 00:00:53 crow kernel: again GFP_USER
Jun 27 00:00:53 crow kernel: again GFP_NOWAIT
With this setup, I also forced each of the bpf_iter_tcp_realloc_batch
calls to return -ENOMEM to ensure that iteration ends and that the
read() in userspace fails.
Signed-off-by: Jordan Rife <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
|
Prepare for the next patch which needs to be able to choose either
GFP_USER or GFP_NOWAIT for calls to bpf_iter_tcp_realloc_batch.
Signed-off-by: Jordan Rife <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
|
Make sure Python doesn't buffer the output, otherwise for some
tests we may see false-positive timeouts in NIPA. NIPA thinks that
a machine has hung if the test doesn't print anything for 3 minutes.
This is also nice to have when running the tests manually,
especially in vng.
Reviewed-by: Petr Machata <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
Kuniyuki Iwashima says:
====================
neighbour: Convert RTM_GETNEIGH to RCU and make pneigh RTNL-free.
This is effectively v3 of the series below [0], but without the
NEIGHTBL patches.
Patches 1~4 and 9 come from the series converting RTM_GETNEIGH to RCU.
The other patches clean up pneigh_lookup() and convert the pneigh code
to RCU plus a private mutex so that we can easily remove RTNL from
RTM_NEWNEIGH in a later series.
[0]: https://lore.kernel.org/netdev/[email protected]/
v2: https://lore.kernel.org/[email protected]
v1: https://lore.kernel.org/[email protected]
====================
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
neigh_add() updates the pneigh_entry found or created by pneigh_create().
This update is serialised by RTNL, but we will remove that dependency.
Let's move the update into pneigh_create() and make it return an errno
instead of a pneigh_entry pointer.
Now, the pneigh code is RTNL-free.
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
tbl->phash_buckets[] is only modified in the slow path, by pneigh_create()
and pneigh_delete() under the table lock.
Both of them are called under RTNL, so no extra lock is needed, but we
will remove RTNL from these paths.
pneigh_create() looks up a pneigh_entry, and this part could be made
lockless, but that would complicate the logic, something like:
1. lookup
2. allocate a pneigh_entry with GFP_KERNEL
3. look up again, but under the lock
4. if found, return the existing entry after freeing the allocated memory
5. else, return the new one
Instead, let's add a per-table mutex and run the lookup and allocation
under it.
Note that the pneigh_entry update in neigh_add() is still protected
by RTNL and will be moved to pneigh_create() in the next patch.
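A hedged sketch of the mutex-serialised lookup-or-create (names and
signatures simplified; the per-table mutex name is hypothetical):

  mutex_lock(&tbl->phash_lock);               /* hypothetical mutex name */
  pn = pneigh_lookup(tbl, net, pkey, dev);
  if (!pn) {
          /* GFP_KERNEL may sleep, which is fine under a mutex */
          pn = kzalloc(sizeof(*pn) + key_len, GFP_KERNEL);
          if (pn) {
                  /* ... initialise pn, then publish it ... */
                  rcu_assign_pointer(tbl->phash_buckets[hash], pn);
          }
  }
  mutex_unlock(&tbl->phash_lock);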
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
Now, all callers of pneigh_lookup() are under RCU, and the read
lock there is no longer needed.
Let's drop the lock, inline __pneigh_lookup_1() into pneigh_lookup(),
and call it from pneigh_create().
The next patch will remove tbl->lock from pneigh_create().
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
__pneigh_lookup() is the lockless version of pneigh_lookup(),
but its only caller, pndisc_is_router(), holds the table lock
while reading pneigh_entry.flags.
This is because accessing a pneigh_entry after pneigh_lookup() used
to be illegal unless the caller held RTNL or the table lock.
Now, pneigh_entry is guaranteed to be alive during the RCU critical
section.
Let's call pneigh_lookup() and use READ_ONCE() for n->flags in
pndisc_is_router(), and remove __pneigh_lookup().
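A hedged sketch of the lockless read (lookup signature simplified):

  bool is_router;

  rcu_read_lock();
  pn = pneigh_lookup(&nd_tbl, dev_net(dev), addr, dev);
  is_router = pn && (READ_ONCE(pn->flags) & NTF_ROUTER);
  rcu_read_unlock();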
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
Now pneigh_entry is guaranteed to be alive during the
RCU critical section even without holding tbl->lock.
Let's use rcu_dereference() in pneigh_get_{first,next}().
Note that neigh_seq_start() still holds tbl->lock for the
normal neighbour entry.
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
Now pneigh_entry is guaranteed to be alive during the
RCU critical section even without holding tbl->lock.
Let's drop read_lock_bh(&tbl->lock) and use rcu_dereference()
to iterate tbl->phash_buckets[] in pneigh_dump_table().
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
__dev_get_by_index() is the only RTNL-dependent call in neigh_get().
Let's replace it with dev_get_by_index_rcu() and convert RTM_GETNEIGH
to RCU.
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
We will convert pneigh readers to RCU, so the flags and protocol
fields will be read locklessly.
Let's annotate accesses to these two fields.
Note that all access to pn->permanent is under RTNL (neigh_add()
and pneigh_ifdown_and_unlock()), so WRITE_ONCE() and READ_ONCE()
are not needed.
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
We will convert RTM_GETNEIGH to RCU.
neigh_get() looks up pneigh_entry by pneigh_lookup() and passes
it to pneigh_fill_info().
Then, we must ensure that the entry stays alive until pneigh_fill_info()
completes, but read_lock_bh(&tbl->lock) in pneigh_lookup() does not
guarantee that.
Also, we will convert all readers of tbl->phash_buckets[] to RCU.
Let's use call_rcu() to free pneigh_entry and update phash_buckets[]
and ->next by rcu_assign_pointer().
pneigh_ifdown_and_unlock() uses list_head to avoid overwriting
->next and moving RCU iterators to another list.
pndisc_destructor() (only IPv6 ndisc uses this) takes a mutex, so it
cannot be deferred to call_rcu(), where sleeping is not allowed. This is
fine because the mcast code works with RCU and ipv6_dev_mc_dec() frees
mcast objects after the RCU grace period.
While at it, we change the return type of pneigh_ifdown_and_unlock()
to void.
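A hedged sketch of the call_rcu() pattern (field placement and names are
illustrative):

  struct pneigh_entry {
          struct pneigh_entry __rcu *next;
          struct rcu_head rcu;
          /* ... */
  };

  static void pneigh_free_rcu(struct rcu_head *head)
  {
          kfree(container_of(head, struct pneigh_entry, rcu));
  }

  /* publish a new entry: */
  rcu_assign_pointer(tbl->phash_buckets[hash], pn);

  /* delete: unlink under the write-side lock, free after a grace period */
  call_rcu(&pn->rcu, pneigh_free_rcu);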
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
The next patch will free pneigh_entry with call_rcu().
Then, we need to annotate neigh_table.phash_buckets[] and
pneigh_entry.next with __rcu.
To make the next patch cleaner, let's annotate the fields in advance.
Currently, all accesses to the fields are under the neigh table lock,
so rcu_dereference_protected() is used with 1 for now, but most of them
(except in pneigh_delete() and pneigh_ifdown_and_unlock()) will be
replaced with rcu_dereference() and rcu_dereference_check().
Note that pneigh_ifdown_and_unlock() links entries into a local list
through pneigh_entry.next, which is illegal under RCU because a
concurrent iterator could be walked onto that list. This part will be
fixed in the next patch.
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which
is confusing.
When called with its last argument, creat, set to 0, pneigh_lookup()
simply looks up a proxy neighbour entry. This is the reader path, used
by the fast path and RTM_GETNEIGH.
When called with creat set to 1, however, pneigh_lookup() creates a
pneigh_entry; this happens from RTM_NEWNEIGH and SIOCSARP, which
require RTNL.
Let's split pneigh_lookup() into two functions.
We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock)
in the new pneigh_lookup() will be dropped.
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
neigh_valid_get_req() calls neigh_find_table() to fetch neigh_tables[].
neigh_find_table() uses rcu_dereference_rtnl(), but RTNL actually does
not protect it at all; neigh_table_clear() can be called without RTNL
and only waits for RCU readers via synchronize_rcu().
Fortunately, there is no bug because IPv4 is built-in, IPv6 cannot be
unloaded, and DECnet was removed.
To fetch neigh_tables[] with rcu_dereference() later, let's move
neigh_find_table() from neigh_valid_get_req() to neigh_get().
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
We will remove RTNL for neigh_get() and run it under RCU instead.
neigh_get_reply() and pneigh_get_reply() allocate skb with GFP_KERNEL.
Let's move the allocation before __dev_get_by_index() in neigh_get().
Now, neigh_get_reply() and pneigh_get_reply() are inlined and
rtnl_unicast() is factorised.
We will convert pneigh_lookup() to __pneigh_lookup() later.
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
We will remove RTNL for neigh_get() and run it under RCU instead.
neigh_get() returns -EINVAL in the following cases:
* NDA_DST is not specified
* Both ndm->ndm_ifindex and NTF_PROXY are not specified
These validations do not require RCU.
Let's move them to neigh_valid_get_req().
While at it, the extack string for the first case is replaced with
NL_SET_ERR_ATTR_MISS().
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
neigh_get() passes four local variable pointers to neigh_valid_get_req().
If it instead returns a struct ndmsg pointer, two of them no longer
need to be passed.
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|