aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/net/ethernet/intel/i40e/i40e_main.c
Commit message (Collapse)AuthorAgeFilesLines
* i40e: improve VF MAC filters accountingLukasz Czapnik2025-09-181-4/+22
| | | | | | | | | | | | | | | | | | | When adding new VM MAC, driver checks only *active* filters in vsi->mac_filter_hash. Each MAC, even in non-active state is using resources. To determine number of MACs VM uses, count VSI filters in *any* state. Add i40e_count_all_filters() to simply count all filters, and rename i40e_count_filters() to i40e_count_active_filters() to avoid ambiguity. Fixes: cfb1d572c986 ("i40e: Add ensurance of MacVlan resources for every trusted VF") Cc: [email protected] Signed-off-by: Lukasz Czapnik <[email protected]> Reviewed-by: Aleksandr Loktionov <[email protected]> Signed-off-by: Przemek Kitszel <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* i40e: fix Jumbo Frame support after iPXE bootJacob Keller2025-09-091-6/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The i40e hardware has multiple hardware settings which define the Maximum Frame Size (MFS) of the physical port. The firmware has an AdminQ command (0x0603) to configure the MFS, but the i40e Linux driver never issues this command. In most cases this is no problem, as the NVM default value has the device configured for its maximum value of 9728. Unfortunately, recent versions of the iPXE intelxl driver now issue the 0x0603 Set Mac Config command, modifying the MFS and reducing it from its default value of 9728. This occurred as part of iPXE commit 6871a7de705b ("[intelxl] Use admin queue to set port MAC address and maximum frame size"), a prerequisite change for supporting the E800 series hardware in iPXE. Both the E700 and E800 firmware support the AdminQ command, and the iPXE code shares much of the logic between the two device drivers. The ice E800 Linux driver already issues the 0x0603 Set Mac Config command early during probe, and is thus unaffected by the iPXE change. Since commit 3a2c6ced90e1 ("i40e: Add a check to see if MFS is set"), the i40e driver does check the I40E_PRTGL_SAH register, but it only logs a warning message if its value is below the 9728 default. This register also only covers received packets and not transmitted packets. A warning can inform system administrators, but does not correct the issue. No interactions from userspace cause the driver to write to PRTGL_SAH or issue the 0x0603 AdminQ command. Only a GLOBR reset will restore the value to its default value. There is no obvious method to trigger a GLOBR reset from user space. To fix this, introduce the i40e_aq_set_mac_config() function, similar to the one from the ice driver. Call this during early probe to ensure that the device configuration matches driver expectation. Unlike E800, the E700 firmware also has a bit to control whether the MAC should append CRC data. It is on by default, but setting a 0 to this bit would disable CRC. The i40e implementation must set this bit to ensure CRC will be appended by the MAC. In addition to the AQ command, instead of just checking the I40E_PRTGL_SAH register, update its value to the 9728 default and write it back. This ensures that the hardware is in the expected state, regardless of whether the iPXE (or any other early boot driver) has modified this state. This is a better user experience, as we now fix the issues with larger MTU instead of merely warning. It also aligns with the way the ice E800 series driver works. A final note: The Fixes tag provided here is not strictly accurate. The issue occurs as a result of an external entity (the iPXE intelxl driver), and this is not a regression specifically caused by the mentioned change. However, I believe the original change to just warn about PRTGL_SAH being too low was an insufficient fix. Fixes: 3a2c6ced90e1 ("i40e: Add a check to see if MFS is set") Link: https://github.com/ipxe/ipxe/commit/6871a7de705b6f6a4046f0d19da9bcd689c3bc8e Signed-off-by: Jacob Keller <[email protected]> Signed-off-by: Aleksandr Loktionov <[email protected]> Reviewed-by: Michal Schmidt <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* i40e: fix IRQ freeing in i40e_vsi_request_irq_msix error pathMichal Schmidt2025-09-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If request_irq() in i40e_vsi_request_irq_msix() fails in an iteration later than the first, the error path wants to free the IRQs requested so far. However, it uses the wrong dev_id argument for free_irq(), so it does not free the IRQs correctly and instead triggers the warning: Trying to free already-free IRQ 173 WARNING: CPU: 25 PID: 1091 at kernel/irq/manage.c:1829 __free_irq+0x192/0x2c0 Modules linked in: i40e(+) [...] CPU: 25 UID: 0 PID: 1091 Comm: NetworkManager Not tainted 6.17.0-rc1+ #1 PREEMPT(lazy) Hardware name: [...] RIP: 0010:__free_irq+0x192/0x2c0 [...] Call Trace: <TASK> free_irq+0x32/0x70 i40e_vsi_request_irq_msix.cold+0x63/0x8b [i40e] i40e_vsi_request_irq+0x79/0x80 [i40e] i40e_vsi_open+0x21f/0x2f0 [i40e] i40e_open+0x63/0x130 [i40e] __dev_open+0xfc/0x210 __dev_change_flags+0x1fc/0x240 netif_change_flags+0x27/0x70 do_setlink.isra.0+0x341/0xc70 rtnl_newlink+0x468/0x860 rtnetlink_rcv_msg+0x375/0x450 netlink_rcv_skb+0x5c/0x110 netlink_unicast+0x288/0x3c0 netlink_sendmsg+0x20d/0x430 ____sys_sendmsg+0x3a2/0x3d0 ___sys_sendmsg+0x99/0xe0 __sys_sendmsg+0x8a/0xf0 do_syscall_64+0x82/0x2c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e [...] </TASK> ---[ end trace 0000000000000000 ]--- Use the same dev_id for free_irq() as for request_irq(). I tested this with inserting code to fail intentionally. Fixes: 493fb30011b3 ("i40e: Move q_vectors from pointer to array to array of pointers") Signed-off-by: Michal Schmidt <[email protected]> Reviewed-by: Aleksandr Loktionov <[email protected]> Reviewed-by: Subbaraya Sundeep <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* i40e: use libie_aq_strMichal Swiatkowski2025-07-241-133/+76
| | | | | | | | | | | | | | | | | There is no need to store the err string in hw->err_str. Simplify it and use common helper. hw->err_str is still used for other purpouse. It should be marked that previously for unknown error the numeric value was passed as a string. Now the "LIBIE_AQ_RC_UNKNOWN" is used for such cases. Add libie_aminq module in i40e Kconfig. Reviewed-by: Przemek Kitszel <[email protected]> Reviewed-by: Larysa Zaremba <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* i40e: use libie adminq descriptorsMichal Swiatkowski2025-07-241-17/+14
| | | | | | | | | | | | | | | | | | | | | | | | Use libie_aq_desc instead of i40e_aq_desc. Do needed changes to allow clean build. Get version descriptor is a little less detailed on i40e. To not mess up with shifting or union inside libie desc use get version descriptor from i40e. Move additional caps for i40e to libie. Fix RCT in declaration that is using libie_aq_desc; Use libie_aq_raw() wherever it can be used. The libie aq error is extended, cover it in ice driver just to clean build. In next patches the libie code for that will be used in each of intel driver. Reviewed-by: Przemek Kitszel <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* i40e: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()Vladimir Oltean2025-07-031-22/+2
| | | | | | | | | | | | | | | | New timestamping API was introduced in commit 66f7223039c0 ("net: add NDOs for configuring hardware timestamping") from kernel v6.6. It is time to convert the Intel i40e driver to the new API, so that timestamping configuration can be removed from the ndo_eth_ioctl() path completely. Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Simon Horman <[email protected]> Reviewed-by: Vadim Fedorenko <[email protected]> Reviewed-by: Milena Olech <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* udp_tunnel: remove rtnl_lock dependencyStanislav Fomichev2025-06-191-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drivers that are using ops lock and don't depend on RTNL lock still need to manage it because udp_tunnel's RTNL dependency. Introduce new udp_tunnel_nic_lock and use it instead of rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to grab udp_tunnel_nic_lock mutex and might sleep). Cover more places in v4: - netlink - udp_tunnel_notify_add_rx_port (ndo_open) - triggers udp_tunnel_nic_device_sync_work - udp_tunnel_notify_del_rx_port (ndo_stop) - triggers udp_tunnel_nic_device_sync_work - udp_tunnel_get_rx_info (__netdev_update_features) - triggers NETDEV_UDP_TUNNEL_PUSH_INFO - udp_tunnel_drop_rx_info (__netdev_update_features) - triggers NETDEV_UDP_TUNNEL_DROP_INFO - udp_tunnel_nic_reset_ntf (ndo_open) - notifiers - udp_tunnel_nic_netdevice_event, depending on the event: - triggers NETDEV_UDP_TUNNEL_PUSH_INFO - triggers NETDEV_UDP_TUNNEL_DROP_INFO - ethnl_tunnel_info_reply_size - udp_tunnel_nic_set_port_priv (two intel drivers) Cc: Michael Chan <[email protected]> Suggested-by: Jakub Kicinski <[email protected]> Signed-off-by: Stanislav Fomichev <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2025-06-121-1/+1
|\ | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR (net-6.16-rc2). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <[email protected]>
| * treewide, timers: Rename from_timer() to timer_container_of()Ingo Molnar2025-06-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
* | i40e: add link_down_events statisticDawid Osuchowski2025-06-091-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a link_down_events counter to the i40e driver, incremented each time the link transitions from up to down. This counter can help diagnose issues related to link stability, such as port flapping or unexpected link drops. The value is exposed via ethtool's get_link_ext_stats() interface. Co-developed-by: Martyna Szapar-Mudlaw <[email protected]> Signed-off-by: Martyna Szapar-Mudlaw <[email protected]> Reviewed-by: Michal Swiatkowski <[email protected]> Signed-off-by: Dawid Osuchowski <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* | net: intel: move RSS packet classifier types to libieJacob Keller2025-06-091-11/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Intel i40e, iavf, and ice drivers all include a definition of the packet classifier filter types used to program RSS hash enable bits. For i40e, these bits are used for both the PF and VF to configure the PFQF_HENA and VFQF_HENA registers. For ice and iAVF, these bits are used to communicate the desired hash enable filter over virtchnl via its struct virtchnl_rss_hashena. The virtchnl.h header makes no mention of where the bit definitions reside. Maintaining a separate copy of these bits across three drivers is cumbersome. Move the definition to libie as a new pctype.h header file. Each driver can include this, and drop its own definition. The ice implementation also defined a ICE_AVF_FLOW_FIELD_INVALID, intending to use this to indicate when there were no hash enable bits set. This is confusing, since the enumeration is using bit positions. A value of 0 *should* indicate the first bit. Instead, rewrite the code that uses ICE_AVF_FLOW_FIELD_INVALID to just check if the avf_hash is zero. From context this should be clear that we're checking if none of the bits are set. The values are kept as bit positions instead of encoding the BIT_ULL directly into their value. While most users will simply use BIT_ULL immediately, i40e uses the macros both with BIT_ULL and test_bit/set_bit calls. Reviewed-by: Przemek Kitszel <[email protected]> Reviewed-by: Simon Horman <[email protected]> Reviewed-by: Aleksandr Loktionov <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* | net: intel: rename 'hena' to 'hashcfg' for clarityJacob Keller2025-06-091-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | i40e, ice, and iAVF all use 'hena' as a shorthand for the "hash enable" configuration. This comes originally from the X710 datasheet 'xxQF_HENA' registers. In the context of the registers the meaning is fairly clear. However, on its own, hena is a weird name that can be more difficult to understand. This is especially true in ice. The E810 hardware doesn't even have registers with HENA in the name. Replace the shorthand 'hena' with 'hashcfg'. This makes it clear the variables deal with the Hash configuration, not just a single boolean on/off for all hashing. Do not update the register names. These come directly from the datasheet for X710 and X722, and it is more important that the names can be searched. Suggested-by: Przemek Kitszel <[email protected]> Reviewed-by: Aleksandr Loktionov <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* treewide: Switch/rename to timer_delete[_sync]()Thomas Gleixner2025-04-051-2/+2
| | | | | | | | | | timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
* i40e: add ability to reset VF for Tx and Rx MDD eventsAleksandr Loktionov2025-01-081-13/+94
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement "mdd-auto-reset-vf" priv-flag to handle Tx and Rx MDD events for VFs. This flag is also used in other network adapters like ICE. Usage: - "on" - The problematic VF will be automatically reset if a malformed descriptor is detected. - "off" - The problematic VF will be disabled. In cases where a VF sends malformed packets classified as malicious, it can cause the Tx queue to freeze, rendering it unusable for several minutes. When an MDD event occurs, this new implementation allows for a graceful VF reset to quickly restore operational state. Currently, VF queues are disabled if an MDD event occurs. This patch adds the ability to reset the VF if a Tx or Rx MDD event occurs. It also includes MDD event logging throttling to avoid dmesg pollution and unifies the format of Tx and Rx MDD messages. Note: Standard message rate limiting functions like dev_info_ratelimited() do not meet our requirements. Custom rate limiting is implemented, please see the code for details. Co-developed-by: Jan Sokolowski <[email protected]> Signed-off-by: Jan Sokolowski <[email protected]> Co-developed-by: Padraig J Connolly <[email protected]> Signed-off-by: Padraig J Connolly <[email protected]> Signed-off-by: Aleksandr Loktionov <[email protected]> Reviewed-by: Michal Schmidt <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* i40e: Remove unused i40e_commit_partition_bw_settingDr. David Alan Gilbert2025-01-061-83/+0
| | | | | | | | | | | | | | i40e_commit_partition_bw_setting() was added in 2017 by commit 4fc8c6763957 ("i40e: genericize the partition bandwidth control") but hasn't been used. Remove it. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Reviewed-by: Kalesh AP <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* i40e: Remove unused i40e_del_filterDr. David Alan Gilbert2025-01-061-26/+2
| | | | | | | | | | | | | | | | | The last use of i40e_del_filter() was removed in 2016 by commit 9569a9a4547d ("i40e: when adding or removing MAC filters, correctly handle VLANs") Remove it. Fix up a comment that referenced it. Note: The __ version of this function is still used. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* i40e: Remove unused i40e_get_cur_guaranteed_fd_countDr. David Alan Gilbert2025-01-061-13/+0
| | | | | | | | | | | | | The last use of i40e_get_cur_guaranteed_fd_count() was removed in 2015 by commit 04294e38a451 ("i40e: FD filters flush policy changes") Remove it. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Reviewed-by: Kalesh AP <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* module: Convert symbol namespace to string literalPeter Zijlstra2024-12-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up the existing export namespace code along the same lines of commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")") and for the same reason, it is not desired for the namespace argument to be a macro expansion itself. Scripted using git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file; do awk -i inplace ' /^#define EXPORT_SYMBOL_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /^#define MODULE_IMPORT_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /MODULE_IMPORT_NS/ { $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g"); } /EXPORT_SYMBOL_NS/ { if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) { if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ && $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ && $0 !~ /^my/) { getline line; gsub(/[[:space:]]*\\$/, ""); gsub(/[[:space:]]/, "", line); $0 = $0 " " line; } $0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/, "\\1(\\2, \"\\3\")", "g"); } } { print }' $file; done Requested-by: Masahiro Yamada <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc Acked-by: Greg KH <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
* ndo_fdb_add: Add a parameter to report whether notification was sentPetr Machata2024-11-161-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Currently when FDB entries are added to or deleted from a VXLAN netdevice, the VXLAN driver emits one notification, including the VXLAN-specific attributes. The core however always sends a notification as well, a generic one. Thus two notifications are unnecessarily sent for these operations. A similar situation comes up with bridge driver, which also emits notifications on its own: # ip link add name vx type vxlan id 1000 dstport 4789 # bridge monitor fdb & [1] 1981693 # bridge fdb add de:ad:be:ef:13:37 dev vx self dst 192.0.2.1 de:ad:be:ef:13:37 dev vx dst 192.0.2.1 self permanent de:ad:be:ef:13:37 dev vx self permanent In order to prevent this duplicity, add a paremeter to ndo_fdb_add, bool *notified. The flag is primed to false, and if the callee sends a notification on its own, it sets it to true, thus informing the core that it should not generate another notification. Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Amit Cohen <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Link: https://patch.msgid.link/cbf6ae8195e85cbf922f8058ce4eba770f3b71ed.1731589511.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <[email protected]>
* i40e: fix race condition by adding filter's intermediate sync stateAleksandr Loktionov2024-11-041-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a race condition in the i40e driver that leads to MAC/VLAN filters becoming corrupted and leaking. Address the issue that occurs under heavy load when multiple threads are concurrently modifying MAC/VLAN filters by setting mac and port VLAN. 1. Thread T0 allocates a filter in i40e_add_filter() within i40e_ndo_set_vf_port_vlan(). 2. Thread T1 concurrently frees the filter in __i40e_del_filter() within i40e_ndo_set_vf_mac(). 3. Subsequently, i40e_service_task() calls i40e_sync_vsi_filters(), which refers to the already freed filter memory, causing corruption. Reproduction steps: 1. Spawn multiple VFs. 2. Apply a concurrent heavy load by running parallel operations to change MAC addresses on the VFs and change port VLANs on the host. 3. Observe errors in dmesg: "Error I40E_AQ_RC_ENOSPC adding RX filters on VF XX, please set promiscuous on manually for VF XX". Exact code for stable reproduction Intel can't open-source now. The fix involves implementing a new intermediate filter state, I40E_FILTER_NEW_SYNC, for the time when a filter is on a tmp_add_list. These filters cannot be deleted from the hash list directly but must be removed using the full process. Fixes: 278e7d0b9d68 ("i40e: store MAC/VLAN filters in a hash with the MAC Address as key") Signed-off-by: Aleksandr Loktionov <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Reviewed-by: Michal Schmidt <[email protected]> Tested-by: Michal Schmidt <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* i40e: Fix macvlan leak by synchronizing access to mac_filter_hashAleksandr Loktionov2024-10-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch addresses a macvlan leak issue in the i40e driver caused by concurrent access to vsi->mac_filter_hash. The leak occurs when multiple threads attempt to modify the mac_filter_hash simultaneously, leading to inconsistent state and potential memory leaks. To fix this, we now wrap the calls to i40e_del_mac_filter() and zeroing vf->default_lan_addr.addr with spin_lock/unlock_bh(&vsi->mac_filter_hash_lock), ensuring atomic operations and preventing concurrent access. Additionally, we add lockdep_assert_held(&vsi->mac_filter_hash_lock) in i40e_add_mac_filter() to help catch similar issues in the future. Reproduction steps: 1. Spawn VFs and configure port vlan on them. 2. Trigger concurrent macvlan operations (e.g., adding and deleting portvlan and/or mac filters). 3. Observe the potential memory leak and inconsistent state in the mac_filter_hash. This synchronization ensures the integrity of the mac_filter_hash and prevents the described leak. Fixes: fed0d9f13266 ("i40e: Fix VF's MAC Address change on VM") Reviewed-by: Arkadiusz Kubalewski <[email protected]> Signed-off-by: Aleksandr Loktionov <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* i40e: Add Energy Efficient Ethernet ability for X710 Base-T/KR/KX cardsAleksandr Loktionov2024-08-301-3/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add "EEE: Enabled/Disabled" to dmesg for supported X710 Base-T/KR/KX cards. According to the IEEE standard report the EEE ability and the EEE Link Partner ability. Use the kernel's 'ethtool_keee' structure and report EEE link modes. Example: dmesg | grep 'NIC Link is' ethtool --show-eee <device> Before: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None Supported EEE link modes: Not reported Advertised EEE link modes: Not reported Link partner advertised EEE link modes: Not reported After: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None, EEE: Enabled Supported EEE link modes: 100baseT/Full 1000baseT/Full 10000baseT/Full Advertised EEE link modes: 100baseT/Full 1000baseT/Full 10000baseT/Full Link partner advertised EEE link modes: 100baseT/Full 1000baseT/Full 10000baseT/Full Reviewed-by: Arkadiusz Kubalewski <[email protected]> Signed-off-by: Aleksandr Loktionov <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* Merge branch '40GbE' of ↵Jakub Kicinski2024-07-131-1/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2024-07-11 (net/intel) This series contains updates to most Intel network drivers. Tony removes MODULE_AUTHOR from drivers containing the entry. Simon Horman corrects a kdoc entry for i40e. Pawel adds implementation for devlink param "local_forwarding" on ice. Michal removes unneeded call, and code, for eswitch rebuild for ice. Sasha removed a no longer used field from igc. * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: igc: Remove the internal 'eee_advert' field ice: remove eswitch rebuild ice: Add support for devlink local_forwarding param i40e: correct i40e_addr_to_hkey() name in kdoc net: intel: Remove MODULE_AUTHORs ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
| * net: intel: Remove MODULE_AUTHORsTony Nguyen2024-07-111-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | We are moving away from the Sourceforge email address. Rather than removing or updating the email for the affected entries, remove the MODULE_AUTHOR altogether as its usage is incorrect [1]. Link: https://lore.kernel.org/netdev/20200626115236.7f36d379@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/ [1] Reviewed-by: Jesse Brandeburg <[email protected]> Acked-by: Alexander Lobakin <[email protected]> # libeth, libie Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* | i40e: Fix XDP program unloading while removing the driverMichal Kubiak2024-07-101-5/+4
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit 6533e558c650 ("i40e: Fix reset path while removing the driver") introduced a new PF state "__I40E_IN_REMOVE" to block modifying the XDP program while the driver is being removed. Unfortunately, such a change is useful only if the ".ndo_bpf()" callback was called out of the rmmod context because unloading the existing XDP program is also a part of driver removing procedure. In other words, from the rmmod context the driver is expected to unload the XDP program without reporting any errors. Otherwise, the kernel warning with callstack is printed out to dmesg. Example failing scenario: 1. Load the i40e driver. 2. Load the XDP program. 3. Unload the i40e driver (using "rmmod" command). The example kernel warning log: [ +0.004646] WARNING: CPU: 94 PID: 10395 at net/core/dev.c:9290 unregister_netdevice_many_notify+0x7a9/0x870 [...] [ +0.010959] RIP: 0010:unregister_netdevice_many_notify+0x7a9/0x870 [...] [ +0.002726] Call Trace: [ +0.002457] <TASK> [ +0.002119] ? __warn+0x80/0x120 [ +0.003245] ? unregister_netdevice_many_notify+0x7a9/0x870 [ +0.005586] ? report_bug+0x164/0x190 [ +0.003678] ? handle_bug+0x3c/0x80 [ +0.003503] ? exc_invalid_op+0x17/0x70 [ +0.003846] ? asm_exc_invalid_op+0x1a/0x20 [ +0.004200] ? unregister_netdevice_many_notify+0x7a9/0x870 [ +0.005579] ? unregister_netdevice_many_notify+0x3cc/0x870 [ +0.005586] unregister_netdevice_queue+0xf7/0x140 [ +0.004806] unregister_netdev+0x1c/0x30 [ +0.003933] i40e_vsi_release+0x87/0x2f0 [i40e] [ +0.004604] i40e_remove+0x1a1/0x420 [i40e] [ +0.004220] pci_device_remove+0x3f/0xb0 [ +0.003943] device_release_driver_internal+0x19f/0x200 [ +0.005243] driver_detach+0x48/0x90 [ +0.003586] bus_remove_driver+0x6d/0xf0 [ +0.003939] pci_unregister_driver+0x2e/0xb0 [ +0.004278] i40e_exit_module+0x10/0x5f0 [i40e] [ +0.004570] __do_sys_delete_module.isra.0+0x197/0x310 [ +0.005153] do_syscall_64+0x85/0x170 [ +0.003684] ? syscall_exit_to_user_mode+0x69/0x220 [ +0.004886] ? do_syscall_64+0x95/0x170 [ +0.003851] ? exc_page_fault+0x7e/0x180 [ +0.003932] entry_SYSCALL_64_after_hwframe+0x71/0x79 [ +0.005064] RIP: 0033:0x7f59dc9347cb [ +0.003648] Code: 73 01 c3 48 8b 0d 65 16 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 35 16 0c 00 f7 d8 64 89 01 48 [ +0.018753] RSP: 002b:00007ffffac99048 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [ +0.007577] RAX: ffffffffffffffda RBX: 0000559b9bb2f6e0 RCX: 00007f59dc9347cb [ +0.007140] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000559b9bb2f748 [ +0.007146] RBP: 00007ffffac99070 R08: 1999999999999999 R09: 0000000000000000 [ +0.007133] R10: 00007f59dc9a5ac0 R11: 0000000000000206 R12: 0000000000000000 [ +0.007141] R13: 00007ffffac992d8 R14: 0000559b9bb2f6e0 R15: 0000000000000000 [ +0.007151] </TASK> [ +0.002204] ---[ end trace 0000000000000000 ]--- Fix this by checking if the XDP program is being loaded or unloaded. Then, block only loading a new program while "__I40E_IN_REMOVE" is set. Also, move testing "__I40E_IN_REMOVE" flag to the beginning of XDP_SETUP callback to avoid unnecessary operations and checks. Fixes: 6533e558c650 ("i40e: Fix reset path while removing the driver") Signed-off-by: Michal Kubiak <[email protected]> Reviewed-by: Maciej Fijalkowski <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* i40e: Fully suspend and resume IO operations in EEH caseThinh Tran2024-05-301-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | When EEH events occurs, the callback functions in the i40e, which are managed by the EEH driver, will completely suspend and resume all IO operations. - In the PCI error detected callback, replaced i40e_prep_for_reset() with i40e_io_suspend(). The change is to fully suspend all I/O operations - In the PCI error slot reset callback, replaced pci_enable_device_mem() with pci_enable_device(). This change enables both I/O and memory of the device. - In the PCI error resume callback, replaced i40e_handle_reset_warning() with i40e_io_resume(). This change allows the system to resume I/O operations Fixes: a5f3d2c17b07 ("powerpc/pseries/pci: Add MSI domains") Reviewed-by: Jacob Keller <[email protected]> Tested-by: Robert Thomas <[email protected]> Signed-off-by: Thinh Tran <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Link: https://lore.kernel.org/r/20240528-net-2024-05-28-intel-net-fixes-v1-3-dc8593d2bbc6@intel.com Signed-off-by: Jakub Kicinski <[email protected]>
* i40e: factoring out i40e_suspend/i40e_resumeThinh Tran2024-05-301-114/+135
| | | | | | | | | | | | | | | | | | | | | Two new functions, i40e_io_suspend() and i40e_io_resume(), have been introduced. These functions were factored out from the existing i40e_suspend() and i40e_resume() respectively. This factoring was done due to concerns about the logic of the I40E_SUSPENSED state, which caused the device to be unable to recover. The functions are now used in the EEH handling for device suspend/resume callbacks. The function i40e_enable_mc_magic_wake() has been moved ahead of i40e_io_suspend() to ensure it is declared before being used. Tested-by: Robert Thomas <[email protected]> Signed-off-by: Thinh Tran <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Link: https://lore.kernel.org/r/20240528-net-2024-05-28-intel-net-fixes-v1-2-dc8593d2bbc6@intel.com Signed-off-by: Jakub Kicinski <[email protected]>
* i40e: flower: validate control flagsAsbjørn Sloth Tønnesen2024-05-081-0/+4
| | | | | | | | | | | | | | | | | This driver currently doesn't support any control flags. Use flow_rule_has_control_flags() to check for control flags, such as can be set through `tc flower ... ip_flags frag`. In case any control flags are masked, flow_rule_has_control_flags() sets a NL extended error message, and we return -EOPNOTSUPP. Only compile-tested. Signed-off-by: Asbjørn Sloth Tønnesen <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Sujai Buvaneswaran <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* net: annotate writes on dev->mtu from ndo_change_mtu()Eric Dumazet2024-05-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Simon reported that ndo_change_mtu() methods were never updated to use WRITE_ONCE(dev->mtu, new_mtu) as hinted in commit 501a90c94510 ("inet: protect against too small mtu values.") We read dev->mtu without holding RTNL in many places, with READ_ONCE() annotations. It is time to take care of ndo_change_mtu() methods to use corresponding WRITE_ONCE() Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Simon Horman <[email protected]> Closes: https://lore.kernel.org/netdev/[email protected]/ Reviewed-by: Jacob Keller <[email protected]> Reviewed-by: Sabrina Dubroca <[email protected]> Reviewed-by: Simon Horman <[email protected]> Acked-by: Shannon Nelson <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* i40e: Add and use helper to reconfigure TC for given VSIIvan Vecera2024-04-301-8/+24
| | | | | | | | | | | | | | | | | Add helper i40e_vsi_reconfig_tc(vsi) that configures TC for given VSI using previously stored TC bitmap. Effectively replaces open-coded patterns: enabled_tc = vsi->tc_config.enabled_tc; vsi->tc_config.enabled_tc = 0; i40e_vsi_config_tc(vsi, enabled_tc); Reviewed-by: Michal Schmidt <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> Signed-off-by: Ivan Vecera <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* i40e: Add helper to access main VEBIvan Vecera2024-04-301-10/+17
| | | | | | | | | | | | | Add a helper to access main VEB: i40e_pf_get_main_veb(pf) replaces 'pf->veb[pf->lan_veb]' Reviewed-by: Michal Schmidt <[email protected]> Reviewed-by: Aleksandr Loktionov <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> Signed-off-by: Ivan Vecera <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* i40e: Consolidate checks whether given VSI is mainIvan Vecera2024-04-301-10/+9
| | | | | | | | | | | | | | | | | In the driver code there are 3 types of checks whether given VSI is main or not: 1. vsi->type ==/!= I40E_VSI_MAIN 2. vsi ==/!= pf->vsi[pf->lan_vsi] 3. vsi->seid ==/!= pf->vsi[pf->lan_vsi]->seid All of them are equivalent and can be consolidated. Convert cases 2 and 3 to case 1. Reviewed-by: Michal Schmidt <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> Signed-off-by: Ivan Vecera <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* i40e: Add helper to access main VSIIvan Vecera2024-04-301-47/+60
| | | | | | | | | | | Add simple helper i40e_pf_get_main_vsi(pf) to access main VSI that replaces pattern 'pf->vsi[pf->lan_vsi]' Reviewed-by: Michal Schmidt <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> Signed-off-by: Ivan Vecera <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* i40e: Refactor argument of i40e_detect_recover_hung()Ivan Vecera2024-04-301-1/+1
| | | | | | | | | | | | | | Commit 07d44190a389 ("i40e/i40evf: Detect and recover hung queue scenario") changes i40e_detect_recover_hung() argument type from i40e_pf* to i40e_vsi* to be shareable by both i40e and i40evf. Because the i40evf does not exist anymore and the function is exclusively used by i40e we can revert this change. Reviewed-by: Michal Schmidt <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> Signed-off-by: Ivan Vecera <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* i40e: Refactor argument of several client notification functionsIvan Vecera2024-04-301-7/+5
| | | | | | | | | | | | | | | | | Commit 0ef2d5afb12d ("i40e: KISS the client interface") simplified the client interface so in practice it supports only one client per i40e netdev. But we have still 2 notification functions that uses as parameter a pointer to VSI of netdevice associated with the client. After the mentioned commit only possible and used VSI is the main (LAN) VSI. So refactor these functions so they are called with PF pointer argument and the associated VSI (LAN) is taken inside them. Reviewed-by: Michal Schmidt <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> Signed-off-by: Ivan Vecera <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* i40e: Remove flags field from i40e_vebIvan Vecera2024-04-301-8/+5
| | | | | | | | | | | | | The field is initialized always to zero and it is never read. Remove it. Reviewed-by: Michal Schmidt <[email protected]> Reviewed-by: Aleksandr Loktionov <[email protected]> Reviewed-by: Kalesh AP <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> Signed-off-by: Ivan Vecera <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* Merge branch '40GbE' of ↵Jakub Kicinski2024-04-261-0/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== net: intel: start The Great Code Dedup + Page Pool for iavf Alexander Lobakin says: Here's a two-shot: introduce {,Intel} Ethernet common library (libeth and libie) and switch iavf to Page Pool. Details are in the commit messages; here's a summary: Not a secret there's a ton of code duplication between two and more Intel ethernet modules. Before introducing new changes, which would need to be copied over again, start decoupling the already existing duplicate functionality into a new module, which will be shared between several Intel Ethernet drivers. The first name that came to my mind was "libie" -- "Intel Ethernet common library". Also this sounds like "lovelie" (-> one word, no "lib I E" pls) and can be expanded as "lib Internet Explorer" :P The "generic", pure-software part is placed separately, so that it can be easily reused in any driver by any vendor without linking to the Intel pre-200G guts. In a few words, it's something any modern driver does the same way, but nobody moved it level up (yet). The series is only the beginning. From now on, adding every new feature or doing any good driver refactoring will remove much more lines than add for quite some time. There's a basic roadmap with some deduplications planned already, not speaking of that touching every line now asks: "can I share this?". The final destination is very ambitious: have only one unified driver for at least i40e, ice, iavf, and idpf with a struct ops for each generation. That's never gonna happen, right? But you still can at least try. PP conversion for iavf lands within the same series as these two are tied closely. libie will support Page Pool model only, so that a driver can't use much of the lib until it's converted. iavf is only the example, the rest will eventually be converted soon on a per-driver basis. That is when it gets really interesting. Stay tech. * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: MAINTAINERS: add entry for libeth and libie iavf: switch to Page Pool iavf: pack iavf_ring more efficiently libeth: add Rx buffer management page_pool: add DMA-sync-for-CPU inline helper page_pool: constify some read-only function arguments slab: introduce kvmalloc_array_node() and kvcalloc_node() iavf: drop page splitting and recycling iavf: kill "legacy-rx" for good net: intel: introduce {, Intel} Ethernet common library ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
| * net: intel: introduce {, Intel} Ethernet common libraryAlexander Lobakin2024-04-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Not a secret there's a ton of code duplication between two and more Intel ethernet modules. Before introducing new changes, which would need to be copied over again, start decoupling the already existing duplicate functionality into a new module, which will be shared between several Intel Ethernet drivers. Add the lookup table which converts 8/10-bit hardware packet type into a parsed bitfield structure for easy checking packet format parameters, such as payload level, IP version, etc. This is currently used by i40e, ice and iavf and it's all the same in all three drivers. The only difference introduced in this implementation is that instead of defining a 256 (or 1024 in case of ice) element array, add unlikely() condition to limit the input to 154 (current maximum non-reserved packet type). There's no reason to waste 600 (or even 3600) bytes only to not hurt very unlikely exception packets. The hash computation function now takes payload level directly as a pkt_hash_type. There's a couple cases when non-IP ptypes are marked as L3 payload and in the previous versions their hash level would be 2, not 3. But skb_set_hash() only sees difference between L4 and non-L4, thus this won't change anything at all. The module is behind the hidden Kconfig symbol, which the drivers will select when needed. The exports are behind 'LIBIE' namespace to limit the scope of the functions. Not that non-HW-specific symbols will live in yet another module, libeth. This is done to easily distinguish pretty generic code ready for reusing by any other vendor and/or for moving the layer up from the code useful in Intel's 1-100G drivers only. Signed-off-by: Alexander Lobakin <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2024-04-251-3/+3
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/ti/icssg/icssg_prueth.c net/mac80211/chan.c 89884459a0b9 ("wifi: mac80211: fix idle calculation with multi-link") 87f5500285fb ("wifi: mac80211: simplify ieee80211_assign_link_chanctx()") https://lore.kernel.org/all/[email protected]/ net/unix/garbage.c 1971d13ffa84 ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().") 4090fa373f0e ("af_unix: Replace garbage collection algorithm.") drivers/net/ethernet/ti/icssg/icssg_prueth.c drivers/net/ethernet/ti/icssg/icssg_common.c 4dcd0e83ea1d ("net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()") e2dc7bfd677f ("net: ti: icssg-prueth: Move common functions into a separate file") No adjacent changes. Signed-off-by: Jakub Kicinski <[email protected]>
| * i40e: Report MFS in decimal base instead of hexErwan Velu2024-04-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the MFS is set below the default (0x2600), a warning message is reported like the following : MFS for port 1 has been set below the default: 600 This message is a bit confusing as the number shown here (600) is in fact an hexa number: 0x600 = 1536 Without any explicit "0x" prefix, this message is read like the MFS is set to 600 bytes. MFS, as per MTUs, are usually expressed in decimal base. This commit reports both current and default MFS values in decimal so it's less confusing for end-users. A typical warning message looks like the following : MFS for port 1 (1536) has been set below the default (9728) Signed-off-by: Erwan Velu <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Tony Brelinski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Fixes: 3a2c6ced90e1 ("i40e: Add a check to see if MFS is set") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
| * i40e: Do not use WQ_MEM_RECLAIM flag for workqueueSindhu Devale2024-04-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Issue reported by customer during SRIOV testing, call trace: When both i40e and the i40iw driver are loaded, a warning in check_flush_dependency is being triggered. This seems to be because of the i40e driver workqueue is allocated with the WQ_MEM_RECLAIM flag, and the i40iw one is not. Similar error was encountered on ice too and it was fixed by removing the flag. Do the same for i40e too. [Feb 9 09:08] ------------[ cut here ]------------ [ +0.000004] workqueue: WQ_MEM_RECLAIM i40e:i40e_service_task [i40e] is flushing !WQ_MEM_RECLAIM infiniband:0x0 [ +0.000060] WARNING: CPU: 0 PID: 937 at kernel/workqueue.c:2966 check_flush_dependency+0x10b/0x120 [ +0.000007] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore nls_utf8 cifs cifs_arc4 nls_ucs2_utils rdma_cm iw_cm ib_cm cifs_md4 dns_resolver netfs qrtr rfkill sunrpc vfat fat intel_rapl_msr intel_rapl_common irdma intel_uncore_frequency intel_uncore_frequency_common ice ipmi_ssif isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp gnss coretemp ib_uverbs rapl intel_cstate ib_core iTCO_wdt iTCO_vendor_support acpi_ipmi mei_me ipmi_si intel_uncore ioatdma i2c_i801 joydev pcspkr mei ipmi_devintf lpc_ich intel_pch_thermal i2c_smbus ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c ast sd_mod drm_shmem_helper t10_pi drm_kms_helper sg ixgbe drm i40e ahci crct10dif_pclmul libahci crc32_pclmul igb crc32c_intel libata ghash_clmulni_intel i2c_algo_bit mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod fuse [ +0.000050] CPU: 0 PID: 937 Comm: kworker/0:3 Kdump: loaded Not tainted 6.8.0-rc2-Feb-net_dev-Qiueue-00279-gbd43c5687e05 #1 [ +0.000003] Hardware name: Intel Corporation S2600BPB/S2600BPB, BIOS SE5C620.86B.02.01.0013.121520200651 12/15/2020 [ +0.000001] Workqueue: i40e i40e_service_task [i40e] [ +0.000024] RIP: 0010:check_flush_dependency+0x10b/0x120 [ +0.000003] Code: ff 49 8b 54 24 18 48 8d 8b b0 00 00 00 49 89 e8 48 81 c6 b0 00 00 00 48 c7 c7 b0 97 fa 9f c6 05 8a cc 1f 02 01 e8 35 b3 fd ff <0f> 0b e9 10 ff ff ff 80 3d 78 cc 1f 02 00 75 94 e9 46 ff ff ff 90 [ +0.000002] RSP: 0018:ffffbd294976bcf8 EFLAGS: 00010282 [ +0.000002] RAX: 0000000000000000 RBX: ffff94d4c483c000 RCX: 0000000000000027 [ +0.000001] RDX: ffff94d47f620bc8 RSI: 0000000000000001 RDI: ffff94d47f620bc0 [ +0.000001] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffff7fff [ +0.000001] R10: ffffbd294976bb98 R11: ffffffffa0be65e8 R12: ffff94c5451ea180 [ +0.000001] R13: ffff94c5ab5e8000 R14: ffff94c5c20b6e05 R15: ffff94c5f1330ab0 [ +0.000001] FS: 0000000000000000(0000) GS:ffff94d47f600000(0000) knlGS:0000000000000000 [ +0.000002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000001] CR2: 00007f9e6f1fca70 CR3: 0000000038e20004 CR4: 00000000007706f0 [ +0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ +0.000001] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ +0.000001] PKRU: 55555554 [ +0.000001] Call Trace: [ +0.000001] <TASK> [ +0.000002] ? __warn+0x80/0x130 [ +0.000003] ? check_flush_dependency+0x10b/0x120 [ +0.000002] ? report_bug+0x195/0x1a0 [ +0.000005] ? handle_bug+0x3c/0x70 [ +0.000003] ? exc_invalid_op+0x14/0x70 [ +0.000002] ? asm_exc_invalid_op+0x16/0x20 [ +0.000006] ? check_flush_dependency+0x10b/0x120 [ +0.000002] ? check_flush_dependency+0x10b/0x120 [ +0.000002] __flush_workqueue+0x126/0x3f0 [ +0.000015] ib_cache_cleanup_one+0x1c/0xe0 [ib_core] [ +0.000056] __ib_unregister_device+0x6a/0xb0 [ib_core] [ +0.000023] ib_unregister_device_and_put+0x34/0x50 [ib_core] [ +0.000020] i40iw_close+0x4b/0x90 [irdma] [ +0.000022] i40e_notify_client_of_netdev_close+0x54/0xc0 [i40e] [ +0.000035] i40e_service_task+0x126/0x190 [i40e] [ +0.000024] process_one_work+0x174/0x340 [ +0.000003] worker_thread+0x27e/0x390 [ +0.000001] ? __pfx_worker_thread+0x10/0x10 [ +0.000002] kthread+0xdf/0x110 [ +0.000002] ? __pfx_kthread+0x10/0x10 [ +0.000002] ret_from_fork+0x2d/0x50 [ +0.000003] ? __pfx_kthread+0x10/0x10 [ +0.000001] ret_from_fork_asm+0x1b/0x30 [ +0.000004] </TASK> [ +0.000001] ---[ end trace 0000000000000000 ]--- Fixes: 4d5957cbdecd ("i40e: remove WQ_UNBOUND and the task limit of our workqueue") Signed-off-by: Sindhu Devale <[email protected]> Reviewed-by: Arkadiusz Kubalewski <[email protected]> Reviewed-by: Mateusz Polchlopek <[email protected]> Signed-off-by: Aleksandr Loktionov <[email protected]> Tested-by: Robert Ganzynkowicz <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2024-04-051-2/+11
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/ip_gre.c 17af420545a7 ("erspan: make sure erspan_base_hdr is present in skb->head") 5832c4a77d69 ("ip_tunnel: convert __be16 tunnel flags to bitmaps") https://lore.kernel.org/all/[email protected]/ Adjacent changes: net/ipv6/ip6_fib.c d21d40605bca ("ipv6: Fix infinite recursion in fib6_dump_done().") 5fc68320c1fb ("ipv6: remove RTNL protection from inet6_dump_fib()") Signed-off-by: Jakub Kicinski <[email protected]>
| * i40e: fix i40e_count_filters() to count only active/new filtersAleksandr Loktionov2024-03-261-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The bug usually affects untrusted VFs, because they are limited to 18 MACs, it affects them badly, not letting to create MAC all filters. Not stable to reproduce, it happens when VF user creates MAC filters when other MACVLAN operations are happened in parallel. But consequence is that VF can't receive desired traffic. Fix counter to be bumped only for new or active filters. Fixes: 621650cabee5 ("i40e: Refactoring VF MAC filters counting to make more reliable") Signed-off-by: Aleksandr Loktionov <[email protected]> Reviewed-by: Arkadiusz Kubalewski <[email protected]> Reviewed-by: Paul Menzel <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
| * i40e: Enforce software interrupt during busy-poll exitIvan Vecera2024-03-261-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As for ice bug fixed by commit b7306b42beaf ("ice: manage interrupts during poll exit") followed by commit 23be7075b318 ("ice: fix software generating extra interrupts") I'm seeing the similar issue also with i40e driver. In certain situation when busy-loop is enabled together with adaptive coalescing, the driver occasionally misses that there are outstanding descriptors to clean when exiting busy poll. Try to catch the remaining work by triggering a software interrupt when exiting busy poll. No extra interrupts will be generated when busy polling is not used. The issue was found when running sockperf ping-pong tcp test with adaptive coalescing and busy poll enabled (50 as value busy_pool and busy_read sysctl knobs) and results in huge latency spikes with more than 100000us. The fix is inspired from the ice driver and do the following: 1) During napi poll exit in case of busy-poll (napo_complete_done() returns false) this is recorded to q_vector that we were in busy loop. 2) Extends i40e_buildreg_itr() to be able to add an enforced software interrupt into built value 2) In i40e_update_enable_itr() enforces a software interrupt trigger if we are exiting busy poll to catch any pending clean-ups 3) Reuses unused 3rd ITR (interrupt throttle) index and set it to 20K interrupts per second to limit the number of these sw interrupts. Test results ============ Prior: [root@dell-per640-07 net]# sockperf ping-pong -i 10.9.9.1 --tcp -m 1000 --mps=max -t 120 sockperf: == version #3.10-no.git == sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s) [ 0] IP = 10.9.9.1 PORT = 11111 # TCP sockperf: Warmup stage (sending a few dummy messages)... sockperf: Starting test... sockperf: Test end (interrupted by timer) sockperf: Test ended sockperf: [Total Run] RunTime=119.999 sec; Warm up time=400 msec; SentMessages=2438563; ReceivedMessages=2438562 sockperf: ========= Printing statistics for Server No: 0 sockperf: [Valid Duration] RunTime=119.549 sec; SentMessages=2429473; ReceivedMessages=2429473 sockperf: ====> avg-latency=24.571 (std-dev=93.297, mean-ad=4.904, median-ad=1.510, siqr=1.063, cv=3.797, std-error=0.060, 99.0% ci=[24.417, 24.725]) sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0 sockperf: Summary: Latency is 24.571 usec sockperf: Total 2429473 observations; each percentile contains 24294.73 observations sockperf: ---> <MAX> observation = 103294.331 sockperf: ---> percentile 99.999 = 45.633 sockperf: ---> percentile 99.990 = 37.013 sockperf: ---> percentile 99.900 = 35.910 sockperf: ---> percentile 99.000 = 33.390 sockperf: ---> percentile 90.000 = 28.626 sockperf: ---> percentile 75.000 = 27.741 sockperf: ---> percentile 50.000 = 26.743 sockperf: ---> percentile 25.000 = 25.614 sockperf: ---> <MIN> observation = 12.220 After: [root@dell-per640-07 net]# sockperf ping-pong -i 10.9.9.1 --tcp -m 1000 --mps=max -t 120 sockperf: == version #3.10-no.git == sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s) [ 0] IP = 10.9.9.1 PORT = 11111 # TCP sockperf: Warmup stage (sending a few dummy messages)... sockperf: Starting test... sockperf: Test end (interrupted by timer) sockperf: Test ended sockperf: [Total Run] RunTime=119.999 sec; Warm up time=400 msec; SentMessages=2400055; ReceivedMessages=2400054 sockperf: ========= Printing statistics for Server No: 0 sockperf: [Valid Duration] RunTime=119.549 sec; SentMessages=2391186; ReceivedMessages=2391186 sockperf: ====> avg-latency=24.965 (std-dev=5.934, mean-ad=4.642, median-ad=1.485, siqr=1.067, cv=0.238, std-error=0.004, 99.0% ci=[24.955, 24.975]) sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0 sockperf: Summary: Latency is 24.965 usec sockperf: Total 2391186 observations; each percentile contains 23911.86 observations sockperf: ---> <MAX> observation = 195.841 sockperf: ---> percentile 99.999 = 45.026 sockperf: ---> percentile 99.990 = 39.009 sockperf: ---> percentile 99.900 = 35.922 sockperf: ---> percentile 99.000 = 33.482 sockperf: ---> percentile 90.000 = 28.902 sockperf: ---> percentile 75.000 = 27.821 sockperf: ---> percentile 50.000 = 26.860 sockperf: ---> percentile 25.000 = 25.685 sockperf: ---> <MIN> observation = 12.277 Fixes: 0bcd952feec7 ("ethernet/intel: consolidate NAPI and NAPI exit") Reported-by: Hugo Ferreira <[email protected]> Reviewed-by: Michal Schmidt <[email protected]> Signed-off-by: Ivan Vecera <[email protected]> Reviewed-by: Jesse Brandeburg <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* | Merge branch '1GbE' of ↵Jakub Kicinski2024-04-021-6/+4
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2024-03-29 (net: intel) This series contains updates to most Intel drivers. Jesse moves declaration of pci_driver struct to remove need for forward declarations in igb and converts Intel drivers to user newer power management ops. Sasha reworks power management flow on igc to avoid using rtnl_lock() during those flows. Maciej reorganizes i40e_nvm file to avoid forward declarations. * '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: i40e: avoid forward declarations in i40e_nvm.c igc: Refactor runtime power management flow net: intel: implement modern PM ops declarations igb: simplify pci ops declaration ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
| * | net: intel: implement modern PM ops declarationsJesse Brandeburg2024-03-291-6/+4
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Switch the Intel networking drivers to use the new power management ops declaration formats and macros, which allows us to drop __maybe_unused, as well as a bunch of ifdef checking CONFIG_PM. This is safe to do because the compiler drops the unused functions, verified by checking for any of the power management function symbols being present in System.map for a build without CONFIG_PM. If a driver has runtime PM, define the ops with pm_ptr(), and if the driver has Simple PM, use pm_sleep_ptr(), as well as the new versions of the macros for declaring the members of the pm_ops structs. Checked with network-enabled allnoconfig, allyesconfig, allmodconfig on x64_64. Reviewed-by: Alan Brady <[email protected]> Signed-off-by: Jesse Brandeburg <[email protected]> Reviewed-by: Maciej Fijalkowski <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* / netlink: introduce type-checking attribute iterationJohannes Berg2024-03-291-6/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are, especially with multi-attr arrays, many cases of needing to iterate all attributes of a specific type in a netlink message or a nested attribute. Add specific macros to support that case. Also convert many instances using this spatch: @@ iterator nla_for_each_attr; iterator name nla_for_each_attr_type; identifier nla; expression head, len, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_attr(nla, head, len, rem) +nla_for_each_attr_type(nla, ATTR, head, len, rem) { <... T x; ...> -if (nla_type(nla) == ATTR) { ... -} } @@ identifier nla; iterator nla_for_each_nested; iterator name nla_for_each_nested_type; expression attr, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_nested(nla, attr, rem) +nla_for_each_nested_type(nla, ATTR, attr, rem) { <... T x; ...> -if (nla_type(nla) == ATTR) { ... -} } @@ iterator nla_for_each_attr; iterator name nla_for_each_attr_type; identifier nla; expression head, len, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_attr(nla, head, len, rem) +nla_for_each_attr_type(nla, ATTR, head, len, rem) { <... T x; ...> -if (nla_type(nla) != ATTR) continue; ... } @@ identifier nla; iterator nla_for_each_nested; iterator name nla_for_each_nested_type; expression attr, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_nested(nla, attr, rem) +nla_for_each_nested_type(nla, ATTR, attr, rem) { <... T x; ...> -if (nla_type(nla) != ATTR) continue; ... } Although I had to undo one bad change this made, and I also adjusted some other code for whitespace and to use direct variable initialization now. Signed-off-by: Johannes Berg <[email protected]> Link: https://lore.kernel.org/r/20240328203144.b5a6c895fb80.I1869b44767379f204998ff44dd239803f39c23e0@changeid Signed-off-by: Jakub Kicinski <[email protected]>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2024-03-071-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/page_pool_user.c 0b11b1c5c320 ("netdev: let netlink core handle -EMSGSIZE errors") 429679dcf7d9 ("page_pool: fix netlink dump stop/resume") Signed-off-by: Jakub Kicinski <[email protected]>
| * i40e: disable NAPI right after disabling irqs when handling xsk_poolMaciej Fijalkowski2024-03-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Disable NAPI before shutting down queues that this particular NAPI contains so that the order of actions in i40e_queue_pair_disable() mirrors what we do in i40e_queue_pair_enable(). Fixes: 123cecd427b6 ("i40e: added queue pair disable/enable functions") Signed-off-by: Maciej Fijalkowski <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> (A Contingent Worker at Intel) Acked-by: Magnus Karlsson <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* | net: adopt skb_network_header_len() more broadlyEric Dumazet2024-03-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | (skb_transport_header(skb) - skb_network_header(skb)) can be replaced by skb_network_header_len(skb) Add a DEBUG_NET_WARN_ON_ONCE() in skb_network_header_len() to catch cases were the transport_header was not set. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>