path: root/net/devlink
Commit message    Author    Age    Files    Lines
* devlink: rate: Unset parent pointer in devl_rate_nodes_destroy    Shay Drory    2025-11-19    1    -1/+3
| The function devl_rate_nodes_destroy is documented to "Unset parent for all rate objects". However, it was only calling the driver-specific `rate_leaf_parent_set` or `rate_node_parent_set` ops and decrementing the parent's refcount, without actually setting the `devlink_rate->parent` pointer to NULL.
|
| This leaves a dangling pointer in the `devlink_rate` struct, which causes refcount errors in netdevsim[1] and mlx5[2]. In addition, this is inconsistent with the behavior of `devlink_nl_rate_parent_node_set`, where the parent pointer is correctly cleared.
|
| This patch fixes the issue by explicitly setting `devlink_rate->parent` to NULL after notifying the driver, thus fulfilling the function's documented behavior for all rate objects.
|
| [1] repro steps:
|   echo 1 > /sys/bus/netdevsim/new_device
|   devlink dev eswitch set netdevsim/netdevsim1 mode switchdev
|   echo 1 > /sys/bus/netdevsim/devices/netdevsim1/sriov_numvfs
|   devlink port function rate add netdevsim/netdevsim1/test_node
|   devlink port function rate set netdevsim/netdevsim1/128 parent test_node
|   echo 1 > /sys/bus/netdevsim/del_device
|
| dmesg:
|   refcount_t: decrement hit 0; leaking memory.
|   WARNING: CPU: 8 PID: 1530 at lib/refcount.c:31 refcount_warn_saturate+0x42/0xe0
|   CPU: 8 UID: 0 PID: 1530 Comm: bash Not tainted 6.18.0-rc4+ #1 NONE
|   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
|   RIP: 0010:refcount_warn_saturate+0x42/0xe0
|   Call Trace:
|    <TASK>
|    devl_rate_leaf_destroy+0x8d/0x90
|    __nsim_dev_port_del+0x6c/0x70 [netdevsim]
|    nsim_dev_reload_destroy+0x11c/0x140 [netdevsim]
|    nsim_drv_remove+0x2b/0xb0 [netdevsim]
|    device_release_driver_internal+0x194/0x1f0
|    bus_remove_device+0xc6/0x130
|    device_del+0x159/0x3c0
|    device_unregister+0x1a/0x60
|    del_device_store+0x111/0x170 [netdevsim]
|    kernfs_fop_write_iter+0x12e/0x1e0
|    vfs_write+0x215/0x3d0
|    ksys_write+0x5f/0xd0
|    do_syscall_64+0x55/0x10f0
|    entry_SYSCALL_64_after_hwframe+0x4b/0x53
|
| [2] repro steps:
|   devlink dev eswitch set pci/0000:08:00.0 mode switchdev
|   devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 1000
|   devlink port function rate add pci/0000:08:00.0/group1
|   devlink port function rate set pci/0000:08:00.0/32768 parent group1
|   modprobe -r mlx5_ib mlx5_fwctl mlx5_core
|
| dmesg:
|   refcount_t: decrement hit 0; leaking memory.
|   WARNING: CPU: 7 PID: 16151 at lib/refcount.c:31 refcount_warn_saturate+0x42/0xe0
|   CPU: 7 UID: 0 PID: 16151 Comm: bash Not tainted 6.17.0-rc7_for_upstream_min_debug_2025_10_02_12_44 #1 NONE
|   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
|   RIP: 0010:refcount_warn_saturate+0x42/0xe0
|   Call Trace:
|    <TASK>
|    devl_rate_leaf_destroy+0x8d/0x90
|    mlx5_esw_offloads_devlink_port_unregister+0x33/0x60 [mlx5_core]
|    mlx5_esw_offloads_unload_rep+0x3f/0x50 [mlx5_core]
|    mlx5_eswitch_unload_sf_vport+0x40/0x90 [mlx5_core]
|    mlx5_sf_esw_event+0xc4/0x120 [mlx5_core]
|    notifier_call_chain+0x33/0xa0
|    blocking_notifier_call_chain+0x3b/0x50
|    mlx5_eswitch_disable_locked+0x50/0x110 [mlx5_core]
|    mlx5_eswitch_disable+0x63/0x90 [mlx5_core]
|    mlx5_unload+0x1d/0x170 [mlx5_core]
|    mlx5_uninit_one+0xa2/0x130 [mlx5_core]
|    remove_one+0x78/0xd0 [mlx5_core]
|    pci_device_remove+0x39/0xa0
|    device_release_driver_internal+0x194/0x1f0
|    unbind_store+0x99/0xa0
|    kernfs_fop_write_iter+0x12e/0x1e0
|    vfs_write+0x215/0x3d0
|    ksys_write+0x5f/0xd0
|    do_syscall_64+0x53/0x1f0
|    entry_SYSCALL_64_after_hwframe+0x4b/0x53
|
| Fixes: d75559845078 ("devlink: Allow setting parent node of rate objects")
| Signed-off-by: Shay Drory <[email protected]>
| Reviewed-by: Carolina Jubran <[email protected]>
| Signed-off-by: Tariq Toukan <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
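The refcount underflow described in this commit can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct rate_obj` and every function below are hypothetical stand-ins, not the kernel's `struct devlink_rate` or its helpers, and a plain `int` replaces `refcount_t`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's struct devlink_rate. */
struct rate_obj {
    struct rate_obj *parent; /* parent rate node, or NULL */
    int refcnt;              /* number of children referencing this node */
};

/* Attach a child to a parent node and take a reference on the parent. */
void rate_set_parent(struct rate_obj *child, struct rate_obj *parent)
{
    assert(child->parent == NULL);
    child->parent = parent;
    parent->refcnt++;
}

/* Models the old behavior: the reference is dropped, the pointer is kept. */
void nodes_destroy_buggy(struct rate_obj *child)
{
    if (child->parent)
        child->parent->refcnt--;
}

/* Models the fixed behavior: drop the reference and clear the pointer. */
void nodes_destroy_fixed(struct rate_obj *child)
{
    if (child->parent) {
        child->parent->refcnt--;
        child->parent = NULL;
    }
}

/* Models a later per-leaf teardown that drops the reference if a parent is still set. */
void leaf_destroy(struct rate_obj *child)
{
    if (child->parent) {
        child->parent->refcnt--;
        child->parent = NULL;
    }
}
```

In this model, running `nodes_destroy_buggy()` followed by `leaf_destroy()` decrements the parent's count twice for a single reference, which corresponds to the "decrement hit 0" warning in the traces. With `nodes_destroy_fixed()` the second teardown sees a NULL parent and does nothing.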
* net: replace use of system_wq with system_percpu_wq    Marco Crivellari    2025-09-23    1    -1/+1
| Currently, if a user enqueues a work item using schedule_delayed_work(), the workqueue used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again uses WORK_CPU_UNBOUND.
|
| This lack of consistency cannot be addressed without refactoring the API.
|
| system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required.
|
| Adding system_dfl_wq to encourage its use when unbound work should be used.
|
| The old system_unbound_wq will be kept for a few release cycles.
|
| Suggested-by: Tejun Heo <[email protected]>
| Signed-off-by: Marco Crivellari <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net    Jakub Kicinski    2025-09-18    1    -2/+2
|\
| | Cross-merge networking fixes after downstream PR (net-6.17-rc7).
| |
| | No conflicts.
| |
| | Adjacent changes:
| |
| | drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
| |   9536fbe10c9d ("net/mlx5e: Add PSP steering in local NIC RX")
| |   7601a0a46216 ("net/mlx5e: Add a miss level for ipsec crypto offload")
| |
| | Signed-off-by: Jakub Kicinski <[email protected]>
| * devlink rate: Remove unnecessary 'static' from a couple places    Cosmin Ratiu    2025-09-18    1    -2/+2
| | devlink_rate_node_get_by_name() and devlink_rate_nodes_destroy() have a couple of unnecessary static variables for iterating over devlink rates.
| |
| | This could lead to races/corruption/unhappiness if two concurrent operations execute the same function.
| |
| | Remove 'static' from both. It's amazing this was missed for 4+ years. While at it, I confirmed there are no more examples of this mistake in net/ with 1, 2 or 3 levels of indentation.
| |
| | Fixes: a8ecb93ef03d ("devlink: Introduce rate nodes")
| | Signed-off-by: Cosmin Ratiu <[email protected]>
| | Signed-off-by: Jakub Kicinski <[email protected]>
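Why a `static` loop cursor is dangerous can be shown with a small userspace analogy. This is a sketch only: the functions are hypothetical, an array index stands in for the list cursor, and a nested call stands in for a second caller that overlaps with the first (the kernel case involves concurrent operations rather than recursion).

```c
#include <stddef.h>

/*
 * Buggy variant: 'i' is static, so every invocation shares one cursor.
 * The nested call stands in for a second, overlapping caller.
 */
int sum_shared_cursor(const int *vals, size_t n, int nested)
{
    static size_t i; /* BUG: shared by all invocations */
    int sum = 0;

    for (i = 0; i < n; i++) {
        sum += vals[i];
        if (nested)
            sum_shared_cursor(vals, n, 0); /* leaves i == n behind */
    }
    return sum;
}

/* Fixed variant: each invocation owns its cursor. */
int sum_private_cursor(const int *vals, size_t n, int nested)
{
    size_t i;
    int sum = 0;

    for (i = 0; i < n; i++) {
        sum += vals[i];
        if (nested)
            sum_private_cursor(vals, n, 0);
    }
    return sum;
}
```

With the shared cursor, the overlapping call moves the outer loop's position to the end of the array, so the outer loop stops after its first element. With a private cursor, each loop visits every element regardless of what other callers do.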
* | devlink: Add a 'num_doorbells' driverinit param    Cosmin Ratiu    2025-09-18    1    -0/+5
| | This parameter can be used by drivers to configure a different number of doorbells.
| |
| | Signed-off-by: Cosmin Ratiu <[email protected]>
| | Reviewed-by: Dragos Tatulea <[email protected]>
| | Reviewed-by: Jiri Pirko <[email protected]>
| | Signed-off-by: Tariq Toukan <[email protected]>
| | Reviewed-by: Simon Horman <[email protected]>
| | Signed-off-by: Jakub Kicinski <[email protected]>
* | devlink: Add 'total_vfs' generic device param    Vlad Dumitrescu    2025-09-10    1    -0/+5
| | NICs are typically configured with total_vfs=0, forcing users to rely on external tools to enable SR-IOV (a widely used and essential feature).
| |
| | Add a total_vfs parameter to devlink for SR-IOV max VF configurability. This enables standard kernel tools to manage SR-IOV, addressing the need for flexible VF configuration.
| |
| | Signed-off-by: Vlad Dumitrescu <[email protected]>
| | Tested-by: Kamal Heib <[email protected]>
| | Reviewed-by: Jiri Pirko <[email protected]>
| | Signed-off-by: Saeed Mahameed <[email protected]>
| | Reviewed-by: Simon Horman <[email protected]>
| | Link: https://patch.msgid.link/[email protected]
| | Signed-off-by: Jakub Kicinski <[email protected]>
* | devlink: Make health reporter burst period configurable    Shahar Shitrit    2025-08-27    2    -4/+29
| | Enable configuration of the burst period: a time window starting from the first error recovery, during which the reporter allows recovery attempts for each reported error.
| |
| | This feature is helpful when a single underlying issue causes multiple errors, as it delays the start of the grace period to allow sufficient time for recovering all related errors. For example, if multiple TX queues time out simultaneously, a sufficient burst period could allow all affected TX queues to be recovered within that window. Without this period, only the first TX queue that reports a timeout will undergo recovery, while the remaining TX queues will be blocked once the grace period begins.
| |
| | Configuration example:
| |   $ devlink health set pci/0000:00:09.0 reporter tx burst_period 500
| |
| | Configuration example with ynl:
| |   ./tools/net/ynl/pyynl/cli.py \
| |     --spec Documentation/netlink/specs/devlink.yaml \
| |     --do health-reporter-set --json '{
| |       "bus-name": "auxiliary",
| |       "dev-name": "mlx5_core.eth.0",
| |       "port-index": 65535,
| |       "health-reporter-name": "tx",
| |       "health-reporter-burst-period": 500
| |     }'
| |
| | Signed-off-by: Shahar Shitrit <[email protected]>
| | Reviewed-by: Jiri Pirko <[email protected]>
| | Reviewed-by: Dragos Tatulea <[email protected]>
| | Reviewed-by: Carolina Jubran <[email protected]>
| | Signed-off-by: Mark Bloch <[email protected]>
| | Link: https://patch.msgid.link/[email protected]
| | Signed-off-by: Jakub Kicinski <[email protected]>
* | devlink: Introduce burst period for health reporter    Shahar Shitrit    2025-08-27    1    -1/+21
| | Currently, the devlink health reporter starts the grace period immediately after handling an error, blocking any further recoveries until it finishes.
| |
| | However, when a single root cause triggers multiple errors in a short time frame, it is desirable to treat them as a bulk of errors and to allow their recoveries, avoiding premature blocking of subsequent related errors, and reducing the risk of inconsistent or incomplete error handling.
| |
| | To address this, introduce a configurable burst period for devlink health reporter. Start this period when the first error is handled, and allow recovery attempts for reported errors during this window. Once the burst period expires, begin the grace period to block further recoveries until it concludes.
| |
| | Timeline summary:
| |
| | ----|--------|------------------------------/----------------------/--
| | error is  error is       burst period           grace period
| | reported  recovered  (recoveries allowed)   (recoveries blocked)
| |
| | For calculating the burst period duration, use the same last_recovery_ts as the grace period. Update it on recovery only when the burst period is inactive (either disabled or at the first error).
| |
| | This patch implements the framework for the burst period and effectively sets its value to 0 at reporter creation, so the current behavior remains unchanged, which ensures backward compatibility.
| |
| | A downstream patch will make the burst period configurable.
| |
| | Signed-off-by: Shahar Shitrit <[email protected]>
| | Reviewed-by: Jiri Pirko <[email protected]>
| | Signed-off-by: Mark Bloch <[email protected]>
| | Link: https://patch.msgid.link/[email protected]
| | Signed-off-by: Jakub Kicinski <[email protected]>
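The burst/grace timeline described in this commit can be sketched as a small userspace model. Everything below is an assumption made for illustration: `struct reporter_model` and its functions are hypothetical, times are plain millisecond counters rather than jiffies, and the real devlink code also checks conditions (auto-recovery, previous state) that are left out here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a reporter; not the kernel structure. */
struct reporter_model {
    uint64_t last_recovery_ts; /* ms; 0 means no recovery has happened yet */
    uint64_t burst_period;     /* ms; 0 disables the burst period */
    uint64_t grace_period;     /* ms; 0 disables the grace period */
};

/* True while 'now' falls inside the burst window that follows last_recovery_ts. */
bool in_burst(const struct reporter_model *r, uint64_t now)
{
    return r->burst_period && r->last_recovery_ts &&
           now < r->last_recovery_ts + r->burst_period;
}

/* True while 'now' falls inside the grace window that follows the burst window. */
bool in_grace(const struct reporter_model *r, uint64_t now)
{
    uint64_t grace_start;

    if (!r->last_recovery_ts || !r->grace_period)
        return false;
    grace_start = r->last_recovery_ts + r->burst_period;
    return now >= grace_start && now < grace_start + r->grace_period;
}

/*
 * Returns true if a recovery is allowed at time 'now'. The timestamp is
 * refreshed only when no burst window is active, as the commit describes.
 */
bool try_recover(struct reporter_model *r, uint64_t now)
{
    if (in_grace(r, now))
        return false;
    if (!in_burst(r, now))
        r->last_recovery_ts = now;
    return true;
}
```

With `burst_period` set to 0 the grace window starts at the recovery itself, which matches the pre-existing behavior the commit says is preserved.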
* | devlink: Move health reporter recovery abort logic to a separate function    Shahar Shitrit    2025-08-27    1    -8/+23
| | Extract the health reporter recovery abort logic into a separate function devlink_health_recover_abort(). The function encapsulates the conditions for aborting recovery:
| | - When auto-recovery is disabled
| | - When previous error wasn't recovered
| | - When within the grace period after last recovery
| |
| | Signed-off-by: Shahar Shitrit <[email protected]>
| | Reviewed-by: Jiri Pirko <[email protected]>
| | Reviewed-by: Dragos Tatulea <[email protected]>
| | Reviewed-by: Carolina Jubran <[email protected]>
| | Signed-off-by: Mark Bloch <[email protected]>
| | Link: https://patch.msgid.link/[email protected]
| | Signed-off-by: Jakub Kicinski <[email protected]>
* | devlink: Move graceful period parameter to reporter ops    Shahar Shitrit    2025-08-27    1    -17/+11
| | Move the default graceful period from a parameter to devlink_health_reporter_create() to a field in the devlink_health_reporter_ops structure.
| |
| | This change improves consistency, as the graceful period is inherently tied to the reporter's behavior and recovery policy. It simplifies the signature of devlink_health_reporter_create() and its internal helper functions. It also centralizes the reporter configuration at the ops structure, preparing the groundwork for a downstream patch that will introduce a devlink health reporter burst period attribute whose default value will similarly be provided by the driver via the ops structure.
| |
| | Signed-off-by: Shahar Shitrit <[email protected]>
| | Reviewed-by: Jiri Pirko <[email protected]>
| | Signed-off-by: Mark Bloch <[email protected]>
| | Link: https://patch.msgid.link/[email protected]
| | Signed-off-by: Jakub Kicinski <[email protected]>
* | devlink/port: Check attributes early and constify    Parav Pandit    2025-08-15    1    -2/+2
| | Constify the devlink port attributes to indicate that they are read-only and do not depend on anything else. Therefore, validate them early, before setting them in the devlink port.
| |
| | Reviewed-by: Jiri Pirko <[email protected]>
| | Signed-off-by: Parav Pandit <[email protected]>
| | Link: https://patch.msgid.link/[email protected]
| | Signed-off-by: Jakub Kicinski <[email protected]>
* | devlink/port: Simplify return checks    Parav Pandit    2025-08-15    1    -23/+6
|/
| Stop returning a value from the helper routine, which always returned 0, and simplify its callers.
|
| Reviewed-by: Jiri Pirko <[email protected]>
| Signed-off-by: Parav Pandit <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: let driver opt out of automatic phys_port_name generation    Jedrzej Jagielski    2025-08-12    1    -1/+1
| Currently, when adding a devlink port, phys_port_name is automatically generated within the devlink port initialization flow. As a result, adding devlink port support to a driver may result in forced changes of interface names, which breaks already existing network configs.
|
| This is expected behavior, but in some scenarios it is not desirable to impose this limitation on a legacy driver that is then unable to keep its 'pre-devlink' interface name.
|
| Add a no_phys_port_name flag to the devlink_port_attrs struct which indicates that devlink should not alter the name of the interface.
|
| Suggested-by: Jiri Pirko <[email protected]>
| Link: https://lore.kernel.org/all/nbwrfnjhvrcduqzjl4a2jafnvvud6qsbxlvxaxilnryglf4j7r@btuqrimnfuly/
| Signed-off-by: Jedrzej Jagielski <[email protected]>
| Signed-off-by: Tony Nguyen <[email protected]>
* devlink: Fix excessive stack usage in rate TC bandwidth parsing    Carolina Jubran    2025-07-24    3    -14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The devlink_nl_rate_tc_bw_parse function uses a large stack array for devlink attributes, which triggers a warning about excessive stack usage: net/devlink/rate.c: In function 'devlink_nl_rate_tc_bw_parse': net/devlink/rate.c:382:1: error: the frame size of 1648 bytes is larger than 1536 bytes [-Werror=frame-larger-than=] Introduce a separate attribute set specifically for rate TC bandwidth parsing that only contains the two attributes actually used: index and bandwidth. This reduces the stack array from DEVLINK_ATTR_MAX entries to just 2 entries, solving the stack usage issue. Update devlink selftest to use the new 'index' and 'bw' attribute names consistent with the YAML spec. Example usage with ynl with the new spec: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-set --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1, "rate-tc-bws": [ {"index": 0, "bw": 50}, {"index": 1, "bw": 50}, {"index": 2, "bw": 0}, {"index": 3, "bw": 0}, {"index": 4, "bw": 0}, {"index": 5, "bw": 0}, {"index": 6, "bw": 0}, {"index": 7, "bw": 0} ] }' ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-get --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1 }' output for rate-get: {'bus-name': 'pci', 'dev-name': '0000:08:00.0', 'port-index': 1, 'rate-tc-bws': [{'bw': 50, 'index': 0}, {'bw': 50, 'index': 1}, {'bw': 0, 'index': 2}, {'bw': 0, 'index': 3}, {'bw': 0, 'index': 4}, {'bw': 0, 'index': 5}, {'bw': 0, 'index': 6}, {'bw': 0, 'index': 7}], 'rate-tx-max': 0, 'rate-tx-priority': 0, 'rate-tx-share': 0, 'rate-tx-weight': 0, 'rate-type': 'leaf'} Fixes: 566e8f108fc7 ("devlink: Extend devlink rate API with traffic classes bandwidth management") Reported-by: Arnd Bergmann <[email protected]> Closes: https://lore.kernel.org/netdev/[email 
protected]/ Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Suggested-by: Jakub Kicinski <[email protected]> Signed-off-by: Carolina Jubran <[email protected]> Tested-by: Carolina Jubran <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: Add new "clock_id" generic device param    Ivan Vecera    2025-07-10    1    -0/+5
| Add a new device generic parameter to specify the clock ID that should be used by the device for registering DPLL devices and pins.
|
| Signed-off-by: Ivan Vecera <[email protected]>
| Reviewed-by: Jiri Pirko <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: Add support for u64 parameters    Ivan Vecera    2025-07-10    1    -0/+10
| Only 8, 16 and 32-bit integers are supported for numeric devlink parameters. The subsequent patch adds support for the DPLL clock ID, which is defined as a 64-bit number. Add support for the u64 parameter type.
|
| Signed-off-by: Ivan Vecera <[email protected]>
| Reviewed-by: Jiri Pirko <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: Extend devlink rate API with traffic classes bandwidth management    Carolina Jubran    2025-07-02    3    -4/+139
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce support for specifying relative bandwidth shares between traffic classes (TC) in the devlink-rate API. This new option allows users to allocate bandwidth across multiple traffic classes in a single command. This feature provides a more granular control over traffic management, especially for scenarios requiring Enhanced Transmission Selection. Users can now define a relative bandwidth share for each traffic class. For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5 (RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the total bandwidth. The actual percentage each class receives depends on the ratio of its share value to the sum of all shares. Example: DEV=pci/0000:08:00.0 $ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \ tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0 $ devlink port function rate set $DEV/vfs_group \ tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0 Example usage with ynl: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-set --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1, "rate-tc-bws": [ {"rate-tc-index": 0, "rate-tc-bw": 50}, {"rate-tc-index": 1, "rate-tc-bw": 50}, {"rate-tc-index": 2, "rate-tc-bw": 0}, {"rate-tc-index": 3, "rate-tc-bw": 0}, {"rate-tc-index": 4, "rate-tc-bw": 0}, {"rate-tc-index": 5, "rate-tc-bw": 0}, {"rate-tc-index": 6, "rate-tc-bw": 0}, {"rate-tc-index": 7, "rate-tc-bw": 0} ] }' ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-get --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1 }' output for rate-get: {'bus-name': 'pci', 'dev-name': '0000:08:00.0', 'port-index': 1, 'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0}, {'rate-tc-bw': 50, 'rate-tc-index': 1}, {'rate-tc-bw': 0, 'rate-tc-index': 2}, 
{'rate-tc-bw': 0, 'rate-tc-index': 3}, {'rate-tc-bw': 0, 'rate-tc-index': 4}, {'rate-tc-bw': 0, 'rate-tc-index': 5}, {'rate-tc-bw': 0, 'rate-tc-index': 6}, {'rate-tc-bw': 0, 'rate-tc-index': 7}], 'rate-tx-max': 0, 'rate-tx-priority': 0, 'rate-tx-share': 0, 'rate-tx-weight': 0, 'rate-type': 'leaf'} Signed-off-by: Carolina Jubran <[email protected]> Reviewed-by: Cosmin Ratiu <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: Mark Bloch <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
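The relation between share values and the resulting bandwidth percentage stated in this commit (each class receives its share divided by the sum of all shares) can be worked through with a short helper. The function name and the integer rounding are assumptions for illustration; how a given device rounds or enforces the shares is not described in the commit.

```c
#include <stddef.h>

/*
 * Percentage of total bandwidth for traffic class 'idx', computed as
 * share[idx] / sum(shares). Integer division truncates. Returns 0 when
 * idx is out of range or when all shares are zero.
 */
unsigned int tc_bw_percent(const unsigned int *shares, size_t n, size_t idx)
{
    unsigned int sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += shares[i];
    if (idx >= n || sum == 0)
        return 0;
    return shares[idx] * 100 / sum;
}
```

For the commit's example, `tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0`, this yields 20 for TC0 and 80 for TC5. Shares that do not add up to 100 are still valid in this model: shares of 1 and 3 give 25 and 75.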
* devlink: Add new "enable_phc" generic device param    David Arinzon    2025-06-19    1    -0/+5
| Add a new device generic parameter to enable/disable the PHC (PTP Hardware Clock) functionality in the device associated with the devlink instance.
|
| Signed-off-by: David Arinzon <[email protected]>
| Reviewed-by: Jiri Pirko <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: use DEVLINK_VAR_ATTR_TYPE_* instead of NLA_* in fmsg    Jiri Pirko    2025-05-07    1    -44/+21
| Use the newly introduced DEVLINK_VAR_ATTR_TYPE_* enum values instead of the internal NLA_* values in the fmsg health reporter code.
|
| Signed-off-by: Jiri Pirko <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: avoid param type value translations    Jiri Pirko    2025-05-07    1    -44/+2
| Assign DEVLINK_PARAM_TYPE_* enum values to DEVLINK_VAR_ATTR_TYPE_* to ensure the same values are used internally and in UAPI. Benefit from that by removing the value translations.
|
| Signed-off-by: Jiri Pirko <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: define enum for attr types of dynamic attributes    Jiri Pirko    2025-05-07    3    -18/+58
| Devlink param and health reporter fmsg use attributes with a dynamic type which is determined according to a different type. Currently used values are NLA_*. The problem is, they are not part of UAPI. They may change, which would cause a break.
|
| To make this future safe, introduce an enum that shadows the NLA_* values and is part of UAPI.
|
| Also, this makes it possible to carry types that are unrelated to NLA_* values.
|
| Signed-off-by: Saeed Mahameed <[email protected]>
| Signed-off-by: Jiri Pirko <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: add value check to devlink_info_version_put()    Jedrzej Jagielski    2025-04-15    1    -1/+1
| Do not proceed if there's nothing to print.
|
| Suggested-by: Przemek Kitszel <[email protected]>
| Reviewed-by: Jiri Pirko <[email protected]>
| Reviewed-by: Kalesh AP <[email protected]>
| Tested-by: Bharath R <[email protected]>
| Signed-off-by: Jedrzej Jagielski <[email protected]>
| Signed-off-by: Tony Nguyen <[email protected]>
* devlink: fix xa_alloc_cyclic() error handling    Michal Swiatkowski    2025-03-19    1    -1/+1
| If xa_alloc_cyclic() returns 1 (the allocation wrapped), ERR_PTR(1) will be returned, for which IS_ERR() is false. This can lead to dereferencing an unallocated pointer (rel).
|
| Fix it by checking whether err is lower than zero.
|
| This wasn't found in a real use case, only noticed. Credit to Pierre.
|
| Fixes: c137743bce02 ("devlink: introduce object and nested devlink relationship infra")
| Signed-off-by: Michal Swiatkowski <[email protected]>
| Reviewed-by: Andrew Lunn <[email protected]>
| Reviewed-by: Jiri Pirko <[email protected]>
| Signed-off-by: David S. Miller <[email protected]>
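The reason a return value of 1 slips past the caller's IS_ERR() check can be reproduced in userspace. This is a sketch: `err_ptr()` and `is_err()` are local re-implementations written to mirror the kernel's ERR_PTR()/IS_ERR() convention (the top 4095 address values encode errors), and the two check functions are hypothetical names for the old and new tests on the allocator's return value.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace stand-in for ERR_PTR(): encode an error code as a pointer. */
void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Userspace stand-in for IS_ERR(): true only for the top MAX_ERRNO values. */
bool is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Old check: any non-zero return, including 1 (success after wrap), is treated as failure. */
bool treats_as_failure_old(int err)
{
    return err != 0;
}

/* New check: only negative returns are treated as failure. */
bool treats_as_failure_new(int err)
{
    return err < 0;
}
```

With the old check, a return of 1 takes the error path and produces `err_ptr(1)`. The caller's `is_err()` test is false for that value, so the caller goes on to use a pointer whose numeric value is 1. With the new check, 1 is handled as the success it is.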
* devlink: Improve the port attributes description    Parav Pandit    2025-01-03    1    -5/+6
| The current PF number description is vague and is sometimes interpreted as some PF index. The VF number in the PCI specification starts at 1; however, in the kernel it starts at 0 for the representor model.
|
| Improve the description of the devlink port attributes PF, VF and SF numbers with these details.
|
| Reviewed-by: Sridhar Samudrala <[email protected]>
| Reviewed-by: Shay Drory <[email protected]>
| Reviewed-by: Mark Bloch <[email protected]>
| Reviewed-by: Jiri Pirko <[email protected]>
| Signed-off-by: Parav Pandit <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: add devlink_fmsg_dump_skb() function    Mateusz Polchlopek    2024-12-17    1    -0/+67
| Add a devlink_fmsg_dump_skb() function that adds some diagnostic information about an skb (like length, pkt type, MAC, etc.) to the devlink fmsg mechanism using a bunch of devlink_fmsg_put() function calls.
|
| Signed-off-by: Mateusz Polchlopek <[email protected]>
| Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel)
| Signed-off-by: Przemek Kitszel <[email protected]>
| Signed-off-by: Tony Nguyen <[email protected]>
* net: convert to nla_get_*_default()    Johannes Berg    2024-11-11    1    -4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of the original conversion is from the spatch below, but I edited some and left out other instances that were either buggy after conversion (where default values don't fit into the type) or just looked strange. @@ expression attr, def; expression val; identifier fn =~ "^nla_get_.*"; fresh identifier dfn = fn ## "_default"; @@ ( -if (attr) - val = fn(attr); -else - val = def; +val = dfn(attr, def); | -if (!attr) - val = def; -else - val = fn(attr); +val = dfn(attr, def); | -if (!attr) - return def; -return fn(attr); +return dfn(attr, def); | -attr ? fn(attr) : def +dfn(attr, def) | -!attr ? def : fn(attr) +dfn(attr, def) ) Signed-off-by: Johannes Berg <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Link: https://patch.msgid.link/20241108114145.0580b8684e7f.I740beeaa2f70ebfc19bfca1045a24d6151992790@changeid Signed-off-by: Jakub Kicinski <[email protected]>
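The transformation performed by the spatch (an explicit NULL check around a getter collapses into one call with a default) can be sketched in userspace. The types and functions here are hypothetical stand-ins: `struct attr_model` is not `struct nlattr`, and `attr_get_u32()` / `attr_get_u32_default()` only imitate the shape of the `nla_get_*()` / `nla_get_*_default()` pairs named in the commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a parsed netlink attribute. */
struct attr_model {
    uint32_t value;
};

/* Plain getter: the caller must guarantee that 'a' is not NULL. */
uint32_t attr_get_u32(const struct attr_model *a)
{
    return a->value;
}

/* Getter with a default: returns 'def' when the attribute is absent. */
uint32_t attr_get_u32_default(const struct attr_model *a, uint32_t def)
{
    return a ? attr_get_u32(a) : def;
}

/* Before the conversion: open-coded check at the call site. */
uint32_t read_timeout_before(const struct attr_model *a)
{
    uint32_t val;

    if (a)
        val = attr_get_u32(a);
    else
        val = 30;
    return val;
}

/* After the conversion: one call carries the default. */
uint32_t read_timeout_after(const struct attr_model *a)
{
    return attr_get_u32_default(a, 30);
}
```

The default of 30 and the `read_timeout_*` names are invented for the example. The commit's caveat still applies to any real conversion: the default value has to fit the getter's return type.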
* devlink: remove unused devlink_resource_register()    Przemek Kitszel    2024-10-29    1    -33/+0
| Remove the unused devlink_resource_register(); all the drivers use the devl_resource_register() variant instead.
|
| Reviewed-by: Wojciech Drewek <[email protected]>
| Reviewed-by: Jiri Pirko <[email protected]>
| Reviewed-by: Joe Damato <[email protected]>
| Signed-off-by: Przemek Kitszel <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: remove unused devlink_resource_occ_get_register() and _unregister()    Przemek Kitszel    2024-10-29    1    -39/+0
| Remove the unused devlink_resource_occ_get_register() and devlink_resource_occ_get_unregister() functions; current devlink resource users are fine with the devl_ variants of the two.
|
| Reviewed-by: Wojciech Drewek <[email protected]>
| Reviewed-by: Jiri Pirko <[email protected]>
| Reviewed-by: Joe Damato <[email protected]>
| Signed-off-by: Przemek Kitszel <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: region: snapshot IDs: consolidate error values    Przemek Kitszel    2024-10-29    1    -2/+2
| Consolidate the error codes for a message that is too big. The current code is written to return -EINVAL when the tailroom in the skb msg would be exhausted precisely when it's time to nest, and to return -EMSGSIZE in all other "not enough space" conditions.
|
| Reviewed-by: Wojciech Drewek <[email protected]>
| Reviewed-by: Jiri Pirko <[email protected]>
| Reviewed-by: Joe Damato <[email protected]>
| Signed-off-by: Przemek Kitszel <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: devl_resource_register(): differentiate error codes    Przemek Kitszel    2024-10-29    1    -1/+1
| Differentiate the error codes of devl_resource_register(). Replace one of the -EINVAL exit paths with -EEXIST. This should aid developers who introduce new resources and register them in the wrong order.
|
| Reviewed-by: Wojciech Drewek <[email protected]>
| Reviewed-by: Jiri Pirko <[email protected]>
| Reviewed-by: Joe Damato <[email protected]>
| Signed-off-by: Przemek Kitszel <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: use devlink_nl_put_u64() helper    Przemek Kitszel    2024-10-29    7    -76/+59
| Use the devlink_nl_put_u64() shortcut added by the previous commit throughout devlink/.
|
| Reviewed-by: Wojciech Drewek <[email protected]>
| Reviewed-by: Jiri Pirko <[email protected]>
| Reviewed-by: Joe Damato <[email protected]>
| Signed-off-by: Przemek Kitszel <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: introduce devlink_nl_put_u64()    Przemek Kitszel    2024-10-29    1    -0/+5
| Add devlink_nl_put_u64() that abstracts padding for u64 values. All u64 values are passed with the very same padding option.
|
| Reviewed-by: Wojciech Drewek <[email protected]>
| Reviewed-by: Jiri Pirko <[email protected]>
| Reviewed-by: Joe Damato <[email protected]>
| Signed-off-by: Przemek Kitszel <[email protected]>
| Link: https://patch.msgid.link/[email protected]
| Signed-off-by: Jakub Kicinski <[email protected]>
* genetlink: extend info user-storage to match NL cb ctx    Paolo Abeni    2024-10-10    1    -1/+1
| This allows a more uniform implementation of non-dump and dump operations, and will be used later in the series to avoid some per-operation allocation.
|
| Additionally rename the NL_ASSERT_DUMP_CTX_FITS macro, to fit a more extended usage.
|
| Suggested-by: Jakub Kicinski <[email protected]>
| Reviewed-by: Jakub Kicinski <[email protected]>
| Reviewed-by: Jiri Pirko <[email protected]>
| Signed-off-by: Paolo Abeni <[email protected]>
| Link: https://patch.msgid.link/1130cc2896626b84587a2a5f96a5c6829638f4da.1728460186.git.pabeni@redhat.com
| Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: Constify the 'table_ops' parameter of devl_dpipe_table_register()    Christophe JAILLET    2024-06-05    1    -1/+1
| "struct devlink_dpipe_table_ops" only contains some function pointers.
|
| Update "struct devlink_dpipe_table" and the 'table_ops' parameter of devl_dpipe_table_register() so that structures in drivers can be constified.
|
| Constifying these structures will move some data to a read-only section, which increases overall security.
|
| Signed-off-by: Christophe JAILLET <[email protected]>
| Reviewed-by: Wojciech Drewek <[email protected]>
| Reviewed-by: Ido Schimmel <[email protected]>
| Signed-off-by: David S. Miller <[email protected]>
* devlink: extend devlink_param *set pointer    Mateusz Polchlopek    2024-04-22    1    -3/+4
| Extend the devlink_param *set function pointer to take extack as a parameter. Sometimes information needs to be passed from the set function to the end user. It is more appropriate to use netlink for that than to print a message to dmesg.
|
| Reviewed-by: Jiri Pirko <[email protected]>
| Reviewed-by: Przemek Kitszel <[email protected]>
| Signed-off-by: Mateusz Polchlopek <[email protected]>
| Signed-off-by: Tony Nguyen <[email protected]>
* devlink: Support setting max_io_eqs    Parav Pandit    2024-04-08    1    -0/+53
| Many devices send event notifications for the IO queues, such as tx and rx queues, through event queues.
|
| Enable a privileged owner, such as a hypervisor PF, to set the number of IO event queues for the VF and SF during the provisioning stage.
|
| example:
| Get maximum IO event queues of the VF device::
|
|   $ devlink port show pci/0000:06:00.0/2
|   pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
|       function:
|           hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10
|
| Set maximum IO event queues of the VF device::
|
|   $ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32
|
|   $ devlink port show pci/0000:06:00.0/2
|   pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
|       function:
|           hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32
|
| Reviewed-by: Jiri Pirko <[email protected]>
| Reviewed-by: Shay Drory <[email protected]>
| Signed-off-by: Parav Pandit <[email protected]>
| Signed-off-by: David S. Miller <[email protected]>
* netlink: add nlmsg_consume() and use it in devlink compat (Jakub Kicinski, 2024-04-06, 1 file, -1/+1)
| | | | | | | | | | | | devlink_compat_running_version() sticks out when running netdevsim tests and watching dropped skbs. Add nlmsg_consume() for cases where we want to free a netlink skb but it is expected, rather than a drop. af_netlink code uses consume_skb() directly, which is fine, but some may prefer the symmetry of nlmsg_new() / nlmsg_consume(). Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
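The difference between an expected free and a drop can be modelled in user space. The sketch below is not kernel code: `struct fake_skb`, both counters, and every `fake_*` function are hypothetical stand-ins. It only illustrates why routing expected frees through a "consume" helper keeps them out of drop statistics.

```c
#include <stdlib.h>

/* Hypothetical user-space stand-in for a netlink skb. */
struct fake_skb {
	size_t len;
};

/* Counters that a drop monitor might watch (illustrative only). */
static unsigned int dropped;
static unsigned int consumed;

static struct fake_skb *fake_nlmsg_new(size_t len)
{
	struct fake_skb *skb = calloc(1, sizeof(*skb));

	if (skb)
		skb->len = len;
	return skb;
}

/* Free that is accounted as a drop. */
static void fake_kfree_skb(struct fake_skb *skb)
{
	if (!skb)
		return;
	dropped++;
	free(skb);
}

/* Free that is accounted as normal consumption. */
static void fake_consume_skb(struct fake_skb *skb)
{
	if (!skb)
		return;
	consumed++;
	free(skb);
}

/* Thin wrapper giving the new/consume symmetry described above. */
static void fake_nlmsg_consume(struct fake_skb *skb)
{
	fake_consume_skb(skb);
}
```

A message freed with `fake_nlmsg_consume()` bumps only `consumed`, so a test that watches `dropped` stays quiet.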
* netlink: introduce type-checking attribute iteration (Johannes Berg, 2024-03-29, 1 file, -8/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are, especially with multi-attr arrays, many cases of needing to iterate all attributes of a specific type in a netlink message or a nested attribute. Add specific macros to support that case. Also convert many instances using this spatch: @@ iterator nla_for_each_attr; iterator name nla_for_each_attr_type; identifier nla; expression head, len, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_attr(nla, head, len, rem) +nla_for_each_attr_type(nla, ATTR, head, len, rem) { <... T x; ...> -if (nla_type(nla) == ATTR) { ... -} } @@ identifier nla; iterator nla_for_each_nested; iterator name nla_for_each_nested_type; expression attr, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_nested(nla, attr, rem) +nla_for_each_nested_type(nla, ATTR, attr, rem) { <... T x; ...> -if (nla_type(nla) == ATTR) { ... -} } @@ iterator nla_for_each_attr; iterator name nla_for_each_attr_type; identifier nla; expression head, len, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_attr(nla, head, len, rem) +nla_for_each_attr_type(nla, ATTR, head, len, rem) { <... T x; ...> -if (nla_type(nla) != ATTR) continue; ... } @@ identifier nla; iterator nla_for_each_nested; iterator name nla_for_each_nested_type; expression attr, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_nested(nla, attr, rem) +nla_for_each_nested_type(nla, ATTR, attr, rem) { <... T x; ...> -if (nla_type(nla) != ATTR) continue; ... } Although I had to undo one bad change this made, and I also adjusted some other code for whitespace and to use direct variable initialization now. Signed-off-by: Johannes Berg <[email protected]> Link: https://lore.kernel.org/r/20240328203144.b5a6c895fb80.I1869b44767379f204998ff44dd239803f39c23e0@changeid Signed-off-by: Jakub Kicinski <[email protected]>
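The effect of the conversion can be shown with a small user-space model of type-filtered iteration. This is a sketch under assumptions, not the kernel macros: `struct attr`, `attr_for_each()`, `attr_for_each_type()` and the helper functions are hypothetical names that only mimic the shape of a length/type attribute stream padded to 4 bytes.

```c
#include <stdint.h>
#include <string.h>

/* Simplified attribute header (hypothetical, loosely modelled on nlattr). */
struct attr {
	uint16_t len;	/* header plus payload, without padding */
	uint16_t type;
};

#define ATTR_ALIGN(n)	(((n) + 3u) & ~3u)
#define ATTR_HDRLEN	((int)ATTR_ALIGN(sizeof(struct attr)))

/* True if a complete attribute fits in the remaining bytes. */
static int attr_ok(const struct attr *a, int rem)
{
	return rem >= (int)sizeof(*a) && a->len >= sizeof(*a) && a->len <= rem;
}

/* Step to the next attribute and shrink the remaining length. */
static const struct attr *attr_next(const struct attr *a, int *rem)
{
	int total = (int)ATTR_ALIGN(a->len);

	*rem -= total;
	return (const struct attr *)((const char *)a + total);
}

/* Visit every attribute in the stream. */
#define attr_for_each(pos, head, len, rem) \
	for (pos = (head), rem = (len); attr_ok(pos, rem); \
	     pos = attr_next(pos, &(rem)))

/* Visit only attributes of one type; replaces an open-coded type check. */
#define attr_for_each_type(pos, want, head, len, rem) \
	attr_for_each(pos, head, len, rem) \
		if ((pos)->type == (want))

/* Append one attribute with a 4-byte payload; returns the new used length. */
static int attr_put_u32(struct attr *buf, int used, uint16_t type, uint32_t v)
{
	struct attr *a = (struct attr *)((char *)buf + used);

	a->len = (uint16_t)(ATTR_HDRLEN + sizeof(v));
	a->type = type;
	memcpy((char *)a + ATTR_HDRLEN, &v, sizeof(v));
	return used + (int)ATTR_ALIGN(a->len);
}

static uint32_t attr_get_u32(const struct attr *a)
{
	uint32_t v;

	memcpy(&v, (const char *)a + ATTR_HDRLEN, sizeof(v));
	return v;
}

/* Build three attributes (types 1, 2, 1) and sum the payloads of one type. */
static uint32_t demo_sum_type(uint16_t want)
{
	struct attr buf[16];	/* 64 bytes of suitably aligned storage */
	const struct attr *pos;
	uint32_t sum = 0;
	int used = 0, rem;

	used = attr_put_u32(buf, used, 1, 10);
	used = attr_put_u32(buf, used, 2, 5);
	used = attr_put_u32(buf, used, 1, 20);

	attr_for_each_type(pos, want, buf, used, rem)
		sum += attr_get_u32(pos);
	return sum;
}
```

The loop body in `demo_sum_type()` no longer needs its own type comparison, which is the simplification the semantic patch applies to each converted call site.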
* devlink: use kvzalloc() to allocate devlink instance resources (Jian Wen, 2024-03-29, 1 file, -3/+3)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During live migration of a virtual machine, the SR-IOV VF needs to be re-registered. Re-registration may fail when memory is badly fragmented. The related log is as follows. kernel: hv_netvsc 6045bdaa-c0d1-6045-bdaa-c0d16045bdaa eth0: VF slot 1 added ... kernel: kworker/0:0: page allocation failure: order:7, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0 kernel: CPU: 0 PID: 24006 Comm: kworker/0:0 Tainted: G E 5.4...x86_64 #1 kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018 kernel: Workqueue: events work_for_cpu_fn kernel: Call Trace: kernel: dump_stack+0x8b/0xc8 kernel: warn_alloc+0xff/0x170 kernel: __alloc_pages_slowpath+0x92c/0xb2b kernel: ? get_page_from_freelist+0x1d4/0x1140 kernel: __alloc_pages_nodemask+0x2f9/0x320 kernel: alloc_pages_current+0x6a/0xb0 kernel: kmalloc_order+0x1e/0x70 kernel: kmalloc_order_trace+0x26/0xb0 kernel: ? __switch_to_asm+0x34/0x70 kernel: __kmalloc+0x276/0x280 kernel: ? _raw_spin_unlock_irqrestore+0x1e/0x40 kernel: devlink_alloc+0x29/0x110 kernel: mlx5_devlink_alloc+0x1a/0x20 [mlx5_core] kernel: init_one+0x1d/0x650 [mlx5_core] kernel: local_pci_probe+0x46/0x90 kernel: work_for_cpu_fn+0x1a/0x30 kernel: process_one_work+0x16d/0x390 kernel: worker_thread+0x1d3/0x3f0 kernel: kthread+0x105/0x140 kernel: ? max_active_store+0x80/0x80 kernel: ? kthread_bind+0x20/0x20 kernel: ret_from_fork+0x3a/0x50 Signed-off-by: Jian Wen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
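The "order:7" in the log can be put into numbers. The helper below is a hypothetical user-space function, not the kernel's implementation; it computes the smallest power-of-two page count that covers a request. Assuming 4 KiB pages (the log is from x86_64), order 7 means 128 physically contiguous pages, i.e. 512 KiB, which is hard to find on a fragmented system. kvzalloc() can fall back to virtually contiguous memory when such a request cannot be satisfied.

```c
#include <stddef.h>

/*
 * Smallest order such that (page_size << order) >= size.
 * Illustrative only; page_size is an assumption supplied by the caller.
 */
static unsigned int order_for_size(size_t size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

With a 4096-byte page, any size above 256 KiB and up to 512 KiB yields order 7. The exact allocation size in the failing call is not stated in the log.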
* devlink: fix port new reply cmd type (Jiri Pirko, 2024-03-20, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | Due to a copy-and-paste error, the port new reply fills cmd with a wrong value, unlike all other existing port command replies and notifications. Fix it by filling cmd with value DEVLINK_CMD_PORT_NEW. Skimmed through devlink userspace implementations; none of them cares about this cmd value. Reported-by: Chenyuan Yang <[email protected]> Closes: https://lore.kernel.org/all/ZfZcDxGV3tSy4qsV@cy-server/ Fixes: cd76dcd68d96 ("devlink: Support add and delete devlink port") Signed-off-by: Jiri Pirko <[email protected]> Reviewed-by: Parav Pandit <[email protected]> Reviewed-by: Kalesh AP <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: Fix devlink parallel commands processing (Shay Drory, 2024-03-13, 1 file, -6/+7)
| | | | | | | | | | | | | | | | | | | | | | | | | Commit 870c7ad4a52b ("devlink: protect devlink->dev by the instance lock") added devlink instance locking inside a loop that iterates over all the registered devlink instances on the machine in the pre-doit phase. This can lead to serialization of devlink commands over different devlink instances. For example: While the first devlink instance is executing firmware flash, all commands to other devlink instances on the machine are forced to wait until the first devlink finishes. Therefore, in the pre-doit phase, take the devlink instance lock only for the devlink instance the command is targeting. Devlink layer is taking a reference on the devlink instance, ensuring the devlink->dev pointer is valid. This reference taking was introduced by commit a380687200e0 ("devlink: take device reference for devlink object"). Without this commit, it would not be safe to access devlink->dev lockless. Fixes: 870c7ad4a52b ("devlink: protect devlink->dev by the instance lock") Signed-off-by: Shay Drory <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
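The change in lookup order can be sketched with a user-space model. Everything here is hypothetical (`struct inst`, the fake lock and both lookup functions); the lock is only a counter so that the number of acquisitions is visible. In the real code the name comparison reads devlink->dev without the instance lock, which the commit message justifies by the reference held on the instance.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a devlink instance. */
struct inst {
	const char *name;
	unsigned int lock_count;	/* times the instance lock was taken */
	int locked;
};

static void inst_lock(struct inst *i)
{
	i->locked = 1;
	i->lock_count++;
}

static void inst_unlock(struct inst *i)
{
	i->locked = 0;
}

/* Old pattern: lock every instance before checking if it is the target. */
static struct inst *get_locked_all(struct inst *v, size_t n, const char *name)
{
	for (size_t k = 0; k < n; k++) {
		inst_lock(&v[k]);
		if (strcmp(v[k].name, name) == 0)
			return &v[k];	/* returned with the lock held */
		inst_unlock(&v[k]);
	}
	return NULL;
}

/* New pattern: compare first, lock only the targeted instance. */
static struct inst *get_locked_target(struct inst *v, size_t n,
				      const char *name)
{
	for (size_t k = 0; k < n; k++) {
		if (strcmp(v[k].name, name) != 0)
			continue;
		inst_lock(&v[k]);
		return &v[k];	/* returned with the lock held */
	}
	return NULL;
}
```

With the old pattern a command for the last instance has to acquire every earlier instance's lock on the way, so a long-running operation holding one of them stalls it. With the new pattern the unrelated locks are never touched.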
* devlink: Fix length of eswitch inline-mode (William Tu, 2024-03-11, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | Set eswitch inline-mode to be u8, not u16. Otherwise, the following errors occur: $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \ inline-mode network Error: Attribute failed policy validation. kernel answers: Numerical result out of range netlink: 'devlink': attribute type 26 has an invalid length. Fixes: f2f9dd164db0 ("netlink: specs: devlink: add the remaining command to generate complete split_ops") Signed-off-by: William Tu <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
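A minimal model of the mismatch, assuming userspace sends inline-mode as a one-byte payload while the policy expected two bytes. The enum and both functions are hypothetical and far simpler than real netlink policy validation; they only show how a too-short payload maps to ERANGE, the errno behind "Numerical result out of range".

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy kinds for an integer attribute. */
enum attr_kind {
	KIND_U8,
	KIND_U16,
};

/* Payload size the policy requires for each kind. */
static size_t kind_size(enum attr_kind kind)
{
	return kind == KIND_U8 ? sizeof(uint8_t) : sizeof(uint16_t);
}

/* Simplified check: a payload shorter than the policy requires is rejected. */
static int check_payload_len(enum attr_kind policy, size_t payload_len)
{
	if (payload_len < kind_size(policy))
		return -ERANGE;
	return 0;
}
```

Under this model a one-byte payload fails against a u16 policy and passes once the policy is u8.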
* devlink: fix port dump cmd type (Jiri Pirko, 2024-02-22, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | Unlike other commands, due to a copy-and-paste error, port dump fills cmd with a wrong value, different from the port-get request cmd, the port-get doit reply and the port notification. Fix it by filling cmd with value DEVLINK_CMD_PORT_NEW. Skimmed through devlink userspace implementations; none of them cares about this cmd value. The only exception is ynl, for which this is actually a fix, as it expects doit and dumpit ops rsp_value to be the same. Omit the fixes tag, even though this is a fix; it is better to target this for the next release. Fixes: bfcd3a466172 ("Introduce devlink infrastructure") Signed-off-by: Jiri Pirko <[email protected]> Reviewed-by: Simon Horman <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: fix possible use-after-free and memory leaks in devlink_init() (Vasiliy Kovalev, 2024-02-20, 1 file, -3/+9)
| | | | | | | | | | | | The pernet operations structure for the subsystem must be registered before registering the generic netlink family. Unregister what was already registered if a later registration fails. Fixes: 687125b5799c ("devlink: split out core code") Signed-off-by: Vasiliy Kovalev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
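The ordering and unwinding described above follow the usual goto-based error handling pattern. The sketch below is a user-space model with hypothetical `fake_*` functions and flags; it is not the body of devlink_init(), only an illustration of registering in dependency order and undoing completed steps in reverse when a later one fails.

```c
/* Set to a negative errno value to simulate a failing family registration. */
static int fail_family;

/* State flags standing in for the real registrations. */
static int pernet_registered;
static int family_registered;

static int fake_register_pernet(void)
{
	pernet_registered = 1;
	return 0;
}

static void fake_unregister_pernet(void)
{
	pernet_registered = 0;
}

static int fake_register_family(void)
{
	if (fail_family)
		return fail_family;
	family_registered = 1;
	return 0;
}

/* Register pernet ops first, then the family; unwind on failure. */
static int fake_subsys_init(void)
{
	int err;

	err = fake_register_pernet();
	if (err)
		goto out;

	err = fake_register_family();
	if (err)
		goto out_unreg_pernet;

	return 0;

out_unreg_pernet:
	fake_unregister_pernet();
out:
	return err;
}
```

If the family registration fails, nothing is left registered, so no later code can reach a half-initialized subsystem.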
* devlink: avoid potential loop in devlink_rel_nested_in_notify_work() (Jiri Pirko, 2024-02-07, 1 file, -6/+6)
| | | | | | | | | | | | devlink_rel_nested_in_notify_work() may be unable to take the devlink lock mutex. Convert the work to delayed work and, when rescheduling is needed, do it one jiffy later to avoid potential looping. Suggested-by: Paolo Abeni <[email protected]> Fixes: c137743bce02 ("devlink: introduce object and nested devlink relationship infra") Signed-off-by: Jiri Pirko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: Fix referring to hw_addr attribute during state validation (Parav Pandit, 2024-01-31, 1 file, -1/+1)
| | | | | | | | | | | | | | When port function state change is requested, and when the driver does not support it, it refers to the hw address attribute instead of state attribute. Seems like a copy paste error. Fix it by referring to the port function state attribute. Fixes: c0bea69d1ca7 ("devlink: Validate port function request") Signed-off-by: Parav Pandit <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* devlink: extend multicast filtering by port index (Jiri Pirko, 2023-12-19, 5 files, -5/+30)
| | | | | | | | | Expose the previously introduced notification multicast messages filtering infrastructure and allow the user to select messages using port index. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
* devlink: add a command to set notification filter and use it for multicasts (Jiri Pirko, 2023-12-19, 4 files, -4/+157)
| | | | | | | | | | | | | | | | | | Currently a user listening on a socket for devlink notifications always gets all messages for all existing instances, even if they are interested in only one of those. That may cause unnecessary overhead on setups with thousands of instances present. The user is currently able to narrow down the devlink object replies to dump commands by specifying select attributes. Allow a similar approach for notifications. Introduce a new devlink command NOTIFY_FILTER_SET through which the user passes the select attributes. Store these per-socket and use them for filtering messages during multicast send. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
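Per-socket filtering can be sketched as a predicate evaluated for each listener at send time. The structures and `filter_match()` below are hypothetical and limited to bus and device name; a missing filter or an unset field is assumed to mean "match anything".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-socket filter; NULL fields match anything. */
struct notify_filter {
	const char *bus_name;
	const char *dev_name;
};

/* Hypothetical description of the object a notification is about. */
struct notify_desc {
	const char *bus_name;
	const char *dev_name;
};

/* Decide whether one socket should receive one notification. */
static bool filter_match(const struct notify_filter *f,
			 const struct notify_desc *d)
{
	if (!f)
		return true;	/* no filter set: deliver everything */
	if (f->bus_name && strcmp(f->bus_name, d->bus_name) != 0)
		return false;
	if (f->dev_name && strcmp(f->dev_name, d->dev_name) != 0)
		return false;
	return true;
}
```

A socket that set a filter for a single device would then see only that device's notifications, while sockets without a filter keep the previous behaviour.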
* devlink: introduce a helper for netlink multicast send (Jiri Pirko, 2023-12-19, 9 files, -22/+18)
| | | | | | | | | Introduce a helper devlink_nl_notify_send() so each object notification function does not have to call genlmsg_multicast_netns() with the same arguments. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
* devlink: send notifications only if there are listeners (Jiri Pirko, 2023-12-19, 9 files, -9/+21)
| | | | | | | | | Introduce the devlink_nl_notify_need() helper and use it at the beginning of notification functions to avoid the overhead of composing notification messages when nobody is listening. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
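The early-exit pattern can be modelled as follows. The listener counter and the `fake_*` functions are hypothetical; incrementing `messages_built` stands in for allocating and filling a notification message.

```c
#include <stdbool.h>

/* Hypothetical count of sockets subscribed to the multicast group. */
static unsigned int listeners;

/* Counts how many notification messages were actually composed. */
static unsigned int messages_built;

/* Stand-in for a "does anybody listen?" check. */
static bool fake_notify_need(void)
{
	return listeners > 0;
}

/* Notification function that bails out before doing any work. */
static void fake_port_notify(void)
{
	if (!fake_notify_need())
		return;		/* nobody listens: skip composing the message */

	messages_built++;	/* stands in for building and sending the message */
}
```

Placing the check first means the cost of a notification with no listeners is a single comparison.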