aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/net/ethernet/intel/ice/ice.h
Commit message (Collapse)AuthorAgeFilesLines
* ice: fix NULL pointer dereference in ice_unplug_aux_dev() on resetEmil Tantilov2025-08-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Issuing a reset when the driver is loaded without RDMA support, will results in a crash as it attempts to remove RDMA's non-existent auxbus device: echo 1 > /sys/class/net/<if>/device/reset BUG: kernel NULL pointer dereference, address: 0000000000000008 ... RIP: 0010:ice_unplug_aux_dev+0x29/0x70 [ice] ... Call Trace: <TASK> ice_prepare_for_reset+0x77/0x260 [ice] pci_dev_save_and_disable+0x2c/0x70 pci_reset_function+0x88/0x130 reset_store+0x5a/0xa0 kernfs_fop_write_iter+0x15e/0x210 vfs_write+0x273/0x520 ksys_write+0x6b/0xe0 do_syscall_64+0x79/0x3b0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ice_unplug_aux_dev() checks pf->cdev_info->adev for NULL pointer, but pf->cdev_info will also be NULL, leading to the deref in the trace above. Introduce a flag to be set when the creation of the auxbus device is successful, to avoid multiple NULL pointer checks in ice_unplug_aux_dev(). Fixes: c24a65b6a27c7 ("iidc/ice/irdma: Update IDC to support multiple consumers") Signed-off-by: Emil Tantilov <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: use libie_aq_strMichal Swiatkowski2025-07-241-1/+0
| | | | | | | | | | | | | Simple: s/ice_aq_str/libie_aq_str Add libie_aminq module in ice Kconfig. Reviewed-by: Przemek Kitszel <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Reviewed-by: Aleksandr Loktionov <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* ice, libie: move generic adminq descriptors to libMichal Swiatkowski2025-07-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The descriptor structure is the same in ice, ixgbe and i40e. Move it to common libie header to use it across different driver. Leave device specific adminq commands in separate folders. This lead to a change that need to be done in filling/getting descriptor: - previous: struct specific_desc *cmd; cmd = &desc.params.specific_desc; - now: struct specific_desc *cmd; cmd = libie_aq_raw(&desc); Do this changes across the driver to allow clean build. The casting only have to be done in case of specific descriptors, for generic one union can still be used. Changes beside code moving: - change ICE_ prefix to LIBIE_ prefix (ice_ and libie_ too) - remove shift variables not otherwise needed (in libie_aq_flags) - fill/get descriptor data based on desc.params.raw whenever the descriptor isn't defined in libie - move defines from the libie_aq_sth structure outside - add libie_aq_raw helper and use it instead of explicit casting Reviewed by: Przemek Kitszel <[email protected]> Reviewed-by: Aleksandr Loktionov <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* ice: move TSPLL functions to a separate fileKarol Kolacinski2025-06-181-0/+1
| | | | | | | | | | | | | Collect TSPLL related functions and definitions and move them to a separate file to have all TSPLL functionality in one place. Move CGU related functions and definitions to ice_common.* Reviewed-by: Michal Kubiak <[email protected]> Reviewed-by: Milena Olech <[email protected]> Signed-off-by: Karol Kolacinski <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* ice: add link_down_events statisticMartyna Szapar-Mudlaw2025-06-091-0/+1
| | | | | | | | | | | | | | | Introduce a link_down_events counter to the ice driver, incremented each time the link transitions from up to down. This counter can help diagnose issues related to link stability, such as port flapping or unexpected link drops. The value is exposed via ethtool's get_link_ext_stats() interface. Reviewed-by: Kory Maincent <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Martyna Szapar-Mudlaw <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* Merge branch 'for-next' of ↵Jakub Kicinski2025-05-131-5/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux Tony Nguyen says: ==================== Prepare for Intel IPU E2000 (GEN3) This is the first part in introducing RDMA support for idpf. ---------------------------------------------------------------- Tatyana Nikolova says: To align with review comments, the patch series introducing RDMA RoCEv2 support for the Intel Infrastructure Processing Unit (IPU) E2000 line of products is going to be submitted in three parts: 1. Modify ice to use specific and common IIDC definitions and pass a core device info to irdma. 2. Add RDMA support to idpf and modify idpf to use specific and common IIDC definitions and pass a core device info to irdma. 3. Add RDMA RoCEv2 support for the E2000 products, referred to as GEN3 to irdma. This first part is a 5 patch series based on the original "iidc/ice/irdma: Update IDC to support multiple consumers" patch to allow for multiple CORE PCI drivers, using the auxbus. Patches: 1) Move header file to new name for clarity and replace ice specific DSCP define with a kernel equivalent one in irdma 2) Unify naming convention 3) Separate header file into common and driver specific info 4) Replace ice specific DSCP define with a kernel equivalent one in ice 5) Implement core device info struct and update drivers to use it ---------------------------------------------------------------- v1: https://lore.kernel.org/[email protected] IWL reviews: [v5] https://lore.kernel.org/[email protected] [v4] https://lore.kernel.org/[email protected] [v3] https://lore.kernel.org/[email protected] [v2] https://lore.kernel.org/[email protected] [v1] https://lore.kernel.org/[email protected] * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux: iidc/ice/irdma: Update IDC to support multiple consumers ice: Replace ice specific DSCP mapping num with a kernel define iidc/ice/irdma: Break iidc.h into two headers iidc/ice/irdma: Rename to iidc_* convention iidc/ice/irdma: Rename IDC header file ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
| * iidc/ice/irdma: Update IDC to support multiple consumersDave Ertman2025-05-091-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation of supporting more than a single core PCI driver for RDMA, move ice specific structs like qset_params, qos_info and qos_params from iidc_rdma.h to iidc_rdma_ice.h. Previously, the ice driver was just exporting its entire PF struct to the auxiliary driver, but since each core driver will have its own different PF struct, implement a universal struct that all core drivers can provide to the auxiliary driver through the probe call. Reviewed-by: Przemek Kitszel <[email protected]> Signed-off-by: Dave Ertman <[email protected]> Co-developed-by: Mustafa Ismail <[email protected]> Signed-off-by: Mustafa Ismail <[email protected]> Co-developed-by: Shiraz Saleem <[email protected]> Signed-off-by: Shiraz Saleem <[email protected]> Co-developed-by: Tatyana Nikolova <[email protected]> Signed-off-by: Tatyana Nikolova <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* | ice: enable timesync operation on 2xNAC E825 devicesKarol Kolacinski2025-04-111-2/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to the E825C specification, SBQ address for ports on a single complex is device 2 for PHY 0 and device 13 for PHY1. For accessing ports on a dual complex E825C (so called 2xNAC mode), the driver should use destination device 2 (referred as phy_0) for the current complex PHY and device 13 (referred as phy_0_peer) for peer complex PHY. Differentiate SBQ destination device by checking if current PF port number is on the same PHY as target port number. Adjust 'ice_get_lane_number' function to provide unique port number for ports from PHY1 in 'dual' mode config (by adding fixed offset for PHY1 ports). Cache this value in ice_hw struct. Introduce ice_get_primary_hw wrapper to get access to timesync register not available from second NAC. Reviewed-by: Simon Horman <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Signed-off-by: Karol Kolacinski <[email protected]> Co-developed-by: Grzegorz Nitka <[email protected]> Signed-off-by: Grzegorz Nitka <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* | ice: do not add LLDP-specific filter if not necessaryLarysa Zaremba2025-04-111-0/+1
|/ | | | | | | | | | | | | | | | | | | | Commit 34295a3696fb ("ice: implement new LLDP filter command") introduced the ability to use LLDP-specific filter that directs all LLDP traffic to a single VSI. However, current goal is for all trusted VFs to be able to see LLDP neighbors, which is impossible to do with the special filter. Make using the generic filter the default choice and fall back to special one only if a generic filter cannot be added. That way setups with "NVMs where an already existent LLDP filter is blocking the creation of a filter to allow LLDP packets" will still be able to configure software Rx LLDP on PF only, while all other setups would be able to forward them to VFs too. Reviewed-by: Michal Swiatkowski <[email protected]> Signed-off-by: Larysa Zaremba <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: Add E830 checksum offload supportPaul Greenwalt2025-03-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | E830 supports raw receive and generic transmit checksum offloads. Raw receive checksum support is provided by hardware calculating the checksum over the whole packet, regardless of type. The calculated checksum is provided to driver in the Rx flex descriptor. Then the driver assigns the checksum to skb->csum and sets skb->ip_summed to CHECKSUM_COMPLETE. Generic transmit checksum support is provided by hardware calculating the checksum given two offsets: the start offset to begin checksum calculation, and the offset to insert the calculated checksum in the packet. Support is advertised to the stack using NETIF_F_HW_CSUM feature. E830 has the following limitations when both generic transmit checksum offload and TCP Segmentation Offload (TSO) are enabled: 1. Inner packet header modification is not supported. This restriction includes the inability to alter TCP flags, such as the push flag. As a result, this limitation can impact the receiver's ability to coalesce packets, potentially degrading network throughput. 2. The Maximum Segment Size (MSS) is limited to 1023 bytes, which prevents support of Maximum Transmission Unit (MTU) greater than 1063 bytes. Therefore NETIF_F_HW_CSUM and NETIF_F_ALL_TSO features are mutually exclusive. NETIF_F_HW_CSUM hardware feature support is indicated but is not enabled by default. Instead, IP checksums and NETIF_F_ALL_TSO are the defaults. Enforcement of mutual exclusivity of NETIF_F_HW_CSUM and NETIF_F_ALL_TSO is done in ice_set_features(). Mutual exclusivity of IP checksums and NETIF_F_HW_CSUM is handled by netdev_fix_features(). When NETIF_F_HW_CSUM is requested the provided skb->csum_start and skb->csum_offset are passed to hardware in the Tx context descriptor generic checksum (GCS) parameters. Hardware calculates the 1's complement from skb->csum_start to the end of the packet, and inserts the result in the packet at skb->csum_offset. Co-developed-by: Alice Michael <[email protected]> Signed-off-by: Alice Michael <[email protected]> Co-developed-by: Eric Joyner <[email protected]> Signed-off-by: Eric Joyner <[email protected]> Signed-off-by: Paul Greenwalt <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
* ice: use napi's irq affinity and rmap IRQ notifiersAhmed Zaki2025-02-271-3/+0
| | | | | | | | | Delete the driver CPU affinity and aRFS rmap info, use the core's API instead. Signed-off-by: Ahmed Zaki <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* ice: Remove unnecessary ice_is_e8xx() functionsKarol Kolacinski2025-02-101-5/+0
| | | | | | | | | | | | | | | | | | | Remove unnecessary ice_is_e8xx() functions and PHY model. Instead, use MAC type where applicable. Don't check device type in ice_ptp_maybe_trigger_tx_interrupt(), because in reality it depends on the ready bitmap, which only E810 does not have. Call ice_ptp_cfg_phy_interrupt() unconditionally, because all further function calls check the MAC type anyway and this allows simpler code in the future with addition of the new MAC types. Reorder ICE_MAC_* cases in switches in ice_ptp* as in enum ice_mac_type. Signed-off-by: Karol Kolacinski <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* ice: simplify VF MSI-X managingMichal Swiatkowski2025-02-051-7/+3
| | | | | | | | | | | | | | | | | After implementing pf->msix.max field, base vector for other use cases (like VFs) can be fixed. This simplify code when changing MSI-X amount on particular VF, because there is no need to move a base vector. A fixed base vector allows to reserve vectors from the beginning instead of from the end, which is also simpler in code. Store total and rest value in the same struct as max and min for PF. Move tracking vectors from ice_sriov.c to ice_irq.c as it can be also use for other none PF use cases (SIOV). Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice, irdma: move interrupts code to irdmaMichal Swiatkowski2025-02-051-1/+0
| | | | | | | | | | | | | Move responsibility of MSI-X requesting for RDMA feature from ice driver to irdma driver. It is done to allow simple fallback when there is not enough MSI-X available. Change amount of MSI-X used for control from 4 to 1, as it isn't needed to have more than one MSI-X for this purpose. Reviewed-by: Jacob Keller <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: get rid of num_lan_msix fieldMichal Swiatkowski2025-02-051-1/+0
| | | | | | | | | | | | Remove the field to allow having more queues than MSI-X on VSI. As default the number will be the same, but if there won't be more MSI-X available VSI can run with at least one MSI-X. Reviewed-by: Jacob Keller <[email protected]> Reviewed-by: Wojciech Drewek <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: remove splitting MSI-X between featuresMichal Swiatkowski2025-02-051-2/+0
| | | | | | | | | | | | | | | | | With dynamic approach to alloc MSI-X there is no sense to statically split MSI-X between PF features. Splitting was also calculating needed MSI-X. Move this part to separate function and use as max value. Remove ICE_ESWITCH_MSIX, as there is no need for additional MSI-X for switchdev. Reviewed-by: Jacob Keller <[email protected]> Reviewed-by: Wojciech Drewek <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: devlink PF MSI-X max and min parameterMichal Swiatkowski2025-02-051-0/+7
| | | | | | | | | | | Use generic devlink PF MSI-X parameter to allow user to change MSI-X range. Add notes about this parameters into ice devlink documentation. Tested-by: Pucha Himasekhar Reddy <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: add Tx hang devlink health reporterPrzemek Kitszel2024-12-171-0/+2
| | | | | | | | | | | | | | Add Tx hang devlink health reporter, see struct ice_tx_hang_event to see what exactly is reported. For now dump descriptors with little metadata and skb diagnostic information. Reviewed-by: Igor Bagnucki <[email protected]> Reviewed-by: Wojciech Drewek <[email protected]> Co-developed-by: Mateusz Polchlopek <[email protected]> Signed-off-by: Mateusz Polchlopek <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Przemek Kitszel <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: only allow Tx promiscuous for multicastBrett Creeley2024-11-131-4/+2
| | | | | | | | | | | | | | | | Currently when any VF is trusted and true promiscuous mode is enabled on the PF, the VF will receive all unicast traffic directed to the device's internal switch. This includes traffic external to the NIC and also from other VSI (i.e. VFs). This does not match the expected behavior as unicast traffic should only be visible from external sources in this case. Disable the Tx promiscuous mode bits for unicast promiscuous mode. Reviewed-by: Mateusz Polchlopek <[email protected]> Signed-off-by: Brett Creeley <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* Merge branch 'net-introduce-tx-h-w-shaping-api'Jakub Kicinski2024-10-101-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Paolo Abeni says: ==================== net: introduce TX H/W shaping API We have a plurality of shaping-related drivers API, but none flexible enough to meet existing demand from vendors[1]. This series introduces new device APIs to configure in a flexible way TX H/W shaping. The new functionalities are exposed via a newly defined generic netlink interface and include introspection capabilities. Some self-tests are included, on top of a dummy netdevsim implementation. Finally a basic implementation for the iavf driver is provided. Some usage examples: * Configure shaping on a given queue: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/shaper.yaml \ --do set --json '{"ifindex": '$IFINDEX', "shaper": {"handle": {"scope": "queue", "id":'$QUEUEID'}, "bw-max": 2000000}}' * Container B/W sharing The orchestration infrastructure wants to group the container-related queues under a RR scheduling and limit the aggregate bandwidth: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/shaper.yaml \ --do group --json '{"ifindex": '$IFINDEX', "leaves": [ {"handle": {"scope": "queue", "id":'$QID1'}, "weight": '$W1'}, {"handle": {"scope": "queue", "id":'$QID2'}, "weight": '$W2'}], {"handle": {"scope": "queue", "id":'$QID3'}, "weight": '$W3'}], "handle": {"scope":"node"}, "bw-max": 10000000}' {'ifindex': $IFINDEX, 'handle': {'scope': 'node', 'id': 0}} Q1 \ \ Q2 -- node 0 ------- netdev / (bw-max: 10M) Q3 / * Delegation A containers wants to limit the aggregate B/W bandwidth of 2 of the 3 queues it owns - the starting configuration is the one from the previous point: SPEC=Documentation/netlink/specs/net_shaper.yaml ./tools/net/ynl/cli.py --spec $SPEC \ --do group --json '{"ifindex": '$IFINDEX', "leaves": [ {"handle": {"scope": "queue", "id":'$QID1'}, "weight": '$W1'}, {"handle": {"scope": "queue", "id":'$QID2'}, "weight": '$W2'}], "handle": {"scope": "node"}, "bw-max": 5000000 }' {'ifindex': $IFINDEX, 'handle': {'scope': 'node', 'id': 1}} Q1 -- node 1 --------\ / (bw-max: 5M) \ Q2 / node 0 ------- netdev /(bw-max: 10M) Q3 ------------------/ In a group operation, when prior to the op itself, the leaves have different parents, the user must specify the parent handle for the group. I.e., starting from the previous config: ./tools/net/ynl/cli.py --spec $SPEC \ --do group --json '{"ifindex": '$IFINDEX', "leaves": [ {"handle": {"scope": "queue", "id":'$QID1'}, "weight": '$W1'}, {"handle": {"scope": "queue", "id":'$QID3'}, "weight": '$W3'}], "handle": {"scope": "node"}, "bw-max": 3000000 }' Netlink error: Invalid argument nl_len = 96 (80) nl_flags = 0x300 nl_type = 2 error: -22 extack: {'msg': 'All the leaves shapers must have the same old parent'} ./tools/net/ynl/cli.py --spec $SPEC \ --do group --json '{"ifindex": '$IFINDEX', "leaves": [ {"handle": {"scope": "queue", "id":'$QID1'}, "weight": '$W1'}, {"handle": {"scope": "queue", "id":'$QID3'}, "weight": '$W3'}], "handle": {"scope": "node"}, "parent": {"scope": "node", "id": 1}, "bw-max": 3000000 } {'ifindex': $IFINDEX, 'handle': {'scope': 'node', 'id': 2}} Q1 -- node 2 --- /(bw-max:3M)\ Q3 / \ ---- node 1 \ / (bw-max: 5M)\ Q2 node 0 ------- netdev (bw-max: 10M) * Cleanup: Still starting from config 1To delete a single queue shaper ./tools/net/ynl/cli.py --spec $SPEC --do delete --json \ '{"ifindex": '$IFINDEX', "handle": {"scope": "queue", "id":'$QID3'}}' Q1 -- node 2 --- (bw-max:3M)\ \ ---- node 1 \ / (bw-max: 5M)\ Q2 node 0 ------- netdev (bw-max: 10M) Deleting a node shaper relinks all its leaves to the node's parent: ./tools/net/ynl/cli.py --spec $SPEC --do delete --json \ '{"ifindex": '$IFINDEX', "handle": {"scope": "node", "id":2}}' Q1 ---\ \ node 1----- \ / (bw-max: 5M)\ Q2----/ node 0 ------- netdev (bw-max: 10M) Deleting the last shaper under a node shaper deletes the node, too: ./tools/net/ynl/cli.py --spec $SPEC --do delete --json \ '{"ifindex": '$IFINDEX', "handle": {"scope": "queue", "id":'$QID1'}}' ./tools/net/ynl/cli.py --spec $SPEC --do delete --json \ '{"ifindex": '$IFINDEX', "handle": {"scope": "queue", "id":'$QID2'}}' ./tools/net/ynl/cli.py --spec $SPEC --do get --json \ '{"ifindex": '$IFINDEX', "handle": {"scope": "node", "id": 1}}' Netlink error: No such file or directory nl_len = 44 (28) nl_flags = 0x300 nl_type = 2 error: -2 extack: {'bad-attr': '.handle'} Such delete recurses on parents that are left over with no leaves: ./tools/net/ynl/cli.py --spec $SPEC --do get --json \ '{"ifindex": '$IFINDEX', "handle": {"scope": "node", "id": 0}}' Netlink error: No such file or directory nl_len = 44 (28) nl_flags = 0x300 nl_type = 2 error: -2 extack: {'bad-attr': '.handle'} v8: https://lore.kernel.org/[email protected] v7: https://lore.kernel.org/[email protected] v6: https://lore.kernel.org/[email protected] v5: https://lore.kernel.org/[email protected] v4: https://lore.kernel.org/[email protected] v3: https://lore.kernel.org/[email protected] RFC v2: https://lore.kernel.org/[email protected] RFC v1: https://lore.kernel.org/[email protected] ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
| * ice: Support VF queue rate limit and quanta size configurationWenjun Wu2024-10-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support to configure VF queue rate limit and quanta size. For quanta size configuration, the quanta profiles are divided evenly by PF numbers. For each port, the first quanta profile is reserved for default. When VF is asked to set queue quanta size, PF will search for an available profile, change the fields and assigned this profile to the queue. Reviewed-by: Jiri Pirko <[email protected]> Signed-off-by: Wenjun Wu <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> Link: https://patch.msgid.link/fddefc2c1ec3ab32b241ce444af401da19e834dd.1728460186.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <[email protected]>
* | ice: store max_frame and rx_buf_len only in ice_rx_ringJacob Keller2024-10-081-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The max_frame and rx_buf_len fields of the VSI set the maximum frame size for packets on the wire, and configure the size of the Rx buffer. In the hardware, these are per-queue configuration. Most VSI types use a simple method to determine the size of the buffers for all queues. However, VFs may potentially configure different values for each queue. While the Linux iAVF driver does not do this, it is allowed by the virtchnl interface. The current virtchnl code simply sets the per-VSI fields inbetween calls to ice_vsi_cfg_single_rxq(). This technically works, as these fields are only ever used when programming the Rx ring, and otherwise not checked again. However, it is confusing to maintain. The Rx ring also already has an rx_buf_len field in order to access the buffer length in the hotpath. It also has extra unused bytes in the ring structure which we can make use of to store the maximum frame size. Drop the VSI max_frame and rx_buf_len fields. Add max_frame to the Rx ring, and slightly re-order rx_buf_len to better fit into the gaps in the structure layout. Change the ice_vsi_cfg_frame_size function so that it writes to the ring fields. Call this function once per ring in ice_vsi_cfg_rxqs(). This is done over calling it inside the ice_vsi_cfg_rxq(), because ice_vsi_cfg_rxq() is called in the virtchnl flow where the max_frame and rx_buf_len have already been configured. Change the accesses for rx_buf_len and max_frame to all point to the ring structure. This has the added benefit that ice_vsi_cfg_rxq() no longer has the surprise side effect of updating ring->rx_buf_len based on the VSI field. Update the virtchnl ice_vc_cfg_qs_msg() function to set the ring values directly, and drop references to the removed VSI fields. This now makes the VF logic clear, as the ring fields are obviously per-queue. This reduces the required cognitive load when reasoning about this logic. Note that removing the VSI fields does leave a 4 byte gap, but the ice_vsi structure has many gaps, and its layout is not as critical in the hot path. The structure may benefit from a more thorough repacking, but no attempt was made in this change. Signed-off-by: Jacob Keller <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* | ice: add E830 HW VF mailbox message limit supportPaul Greenwalt2024-10-081-0/+1
|/ | | | | | | | | | | | | | | | | E830 adds hardware support to prevent the VF from overflowing the PF mailbox with VIRTCHNL messages. E830 will use the hardware feature (ICE_F_MBX_LIMIT) instead of the software solution ice_is_malicious_vf(). To prevent a VF from overflowing the PF, the PF sets the number of messages per VF that can be in the PF's mailbox queue (ICE_MBX_OVERFLOW_WATERMARK). When the PF processes a message from a VF, the PF decrements the per VF message count using the E830_MBX_VF_DEC_TRIG register. Signed-off-by: Paul Greenwalt <[email protected]> Reviewed-by: Alexander Lobakin <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: Introduce ice_get_phy_model() wrapperSergey Temerkhanov2024-10-011-0/+5
| | | | | | | | | | Introduce ice_get_phy_model() to improve code readability Signed-off-by: Sergey Temerkhanov <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* ice: base subfunction aux driverPiotr Raczynski2024-09-061-1/+6
| | | | | | | | | | | | | | | | | | | Implement subfunction driver. It is probe when subfunction port is activated. VSI is already created. During the probe VSI is being configured. MAC unicast and broadcast filter is added to allow traffic to pass. Store subfunction pointer in VSI struct. The same is done for VF pointer. Make union of subfunction and VF pointer as only one of them can be set with one VSI. Reviewed-by: Simon Horman <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Signed-off-by: Piotr Raczynski <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: add basic devlink subfunctions supportPiotr Raczynski2024-09-061-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement devlink port handlers responsible for ethernet type devlink subfunctions. Create subfunction devlink port and setup all resources needed for a subfunction netdev to operate. Configure new VSI for each new subfunction, initialize and configure interrupts and Tx/Rx resources. Set correct MAC filters and create new netdev. For now, subfunction is limited to only one Tx/Rx queue pair. Only allocate new subfunction VSI with devlink port new command. Allocate and free subfunction MSIX interrupt vectors using new API calls with pci_msix_alloc_irq_at and pci_msix_free_irq. Support both automatic and manual subfunction numbers. If no subfunction number is provided, use xa_alloc to pick a number automatically. This will find the first free index and use that as the number. This reduces burden on users in the simple case where a specific number is not required. It may also be slightly faster to check that a number exists since xarray lookup should be faster than a linear scan of the dyn_ports xarray. Co-developed-by: Jacob Keller <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Signed-off-by: Piotr Raczynski <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: export ice ndo_ops functionsPiotr Raczynski2024-09-061-0/+8
| | | | | | | | | | | | | Make some of the netdevice_ops functions visible from outside for another VSI type created netdev. Reviewed-by: Simon Horman <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Reviewed-by: Wojciech Drewek <[email protected]> Signed-off-by: Piotr Raczynski <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: protect XDP configuration with a mutexLarysa Zaremba2024-09-031-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The main threat to data consistency in ice_xdp() is a possible asynchronous PF reset. It can be triggered by a user or by TX timeout handler. XDP setup and PF reset code access the same resources in the following sections: * ice_vsi_close() in ice_prepare_for_reset() - already rtnl-locked * ice_vsi_rebuild() for the PF VSI - not protected * ice_vsi_open() - already rtnl-locked With an unfortunate timing, such accesses can result in a crash such as the one below: [ +1.999878] ice 0000:b1:00.0: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 14 [ +2.002992] ice 0000:b1:00.0: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 18 [Mar15 18:17] ice 0000:b1:00.0 ens801f0np0: NETDEV WATCHDOG: CPU: 38: transmit queue 14 timed out 80692736 ms [ +0.000093] ice 0000:b1:00.0 ens801f0np0: tx_timeout: VSI_num: 6, Q 14, NTC: 0x0, HW_HEAD: 0x0, NTU: 0x0, INT: 0x4000001 [ +0.000012] ice 0000:b1:00.0 ens801f0np0: tx_timeout recovery level 1, txqueue 14 [ +0.394718] ice 0000:b1:00.0: PTP reset successful [ +0.006184] BUG: kernel NULL pointer dereference, address: 0000000000000098 [ +0.000045] #PF: supervisor read access in kernel mode [ +0.000023] #PF: error_code(0x0000) - not-present page [ +0.000023] PGD 0 P4D 0 [ +0.000018] Oops: 0000 [#1] PREEMPT SMP NOPTI [ +0.000023] CPU: 38 PID: 7540 Comm: kworker/38:1 Not tainted 6.8.0-rc7 #1 [ +0.000031] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021 [ +0.000036] Workqueue: ice ice_service_task [ice] [ +0.000183] RIP: 0010:ice_clean_tx_ring+0xa/0xd0 [ice] [...] [ +0.000013] Call Trace: [ +0.000016] <TASK> [ +0.000014] ? __die+0x1f/0x70 [ +0.000029] ? page_fault_oops+0x171/0x4f0 [ +0.000029] ? schedule+0x3b/0xd0 [ +0.000027] ? exc_page_fault+0x7b/0x180 [ +0.000022] ? asm_exc_page_fault+0x22/0x30 [ +0.000031] ? ice_clean_tx_ring+0xa/0xd0 [ice] [ +0.000194] ice_free_tx_ring+0xe/0x60 [ice] [ +0.000186] ice_destroy_xdp_rings+0x157/0x310 [ice] [ +0.000151] ice_vsi_decfg+0x53/0xe0 [ice] [ +0.000180] ice_vsi_rebuild+0x239/0x540 [ice] [ +0.000186] ice_vsi_rebuild_by_type+0x76/0x180 [ice] [ +0.000145] ice_rebuild+0x18c/0x840 [ice] [ +0.000145] ? delay_tsc+0x4a/0xc0 [ +0.000022] ? delay_tsc+0x92/0xc0 [ +0.000020] ice_do_reset+0x140/0x180 [ice] [ +0.000886] ice_service_task+0x404/0x1030 [ice] [ +0.000824] process_one_work+0x171/0x340 [ +0.000685] worker_thread+0x277/0x3a0 [ +0.000675] ? preempt_count_add+0x6a/0xa0 [ +0.000677] ? _raw_spin_lock_irqsave+0x23/0x50 [ +0.000679] ? __pfx_worker_thread+0x10/0x10 [ +0.000653] kthread+0xf0/0x120 [ +0.000635] ? __pfx_kthread+0x10/0x10 [ +0.000616] ret_from_fork+0x2d/0x50 [ +0.000612] ? __pfx_kthread+0x10/0x10 [ +0.000604] ret_from_fork_asm+0x1b/0x30 [ +0.000604] </TASK> The previous way of handling this through returning -EBUSY is not viable, particularly when destroying AF_XDP socket, because the kernel proceeds with removal anyway. There is plenty of code between those calls and there is no need to create a large critical section that covers all of them, same as there is no need to protect ice_vsi_rebuild() with rtnl_lock(). Add xdp_state_lock mutex to protect ice_vsi_rebuild() and ice_xdp(). Leaving unprotected sections in between would result in two states that have to be considered: 1. when the VSI is closed, but not yet rebuild 2. when VSI is already rebuild, but not yet open The latter case is actually already handled through !netif_running() case, we just need to adjust flag checking a little. The former one is not as trivial, because between ice_vsi_close() and ice_vsi_rebuild(), a lot of hardware interaction happens, this can make adding/deleting rings exit with an error. Luckily, VSI rebuild is pending and can apply new configuration for us in a managed fashion. Therefore, add an additional VSI state flag ICE_VSI_REBUILD_PENDING to indicate that ice_xdp() can just hot-swap the program. Also, as ice_vsi_rebuild() flow is touched in this patch, make it more consistent by deconfiguring VSI when coalesce allocation fails. Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Fixes: efc2214b6047 ("ice: Add support for XDP") Reviewed-by: Wojciech Drewek <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> Signed-off-by: Larysa Zaremba <[email protected]> Reviewed-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: improve updating ice_{t,r}x_ring::xsk_poolMaciej Fijalkowski2024-07-291-6/+5
| | | | | | | | | | | | | | xsk_buff_pool pointers that ice ring structs hold are updated via ndo_bpf that is executed in process context while it can be read by remote CPU at the same time within NAPI poll. Use synchronize_net() after pointer update and {READ,WRITE}_ONCE() when working with mentioned pointer. Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Reviewed-by: Shannon Nelson <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> (A Contingent Worker at Intel) Signed-off-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: map XDP queues to vectors in ice_vsi_map_rings_to_vectors()Larysa Zaremba2024-06-061-0/+1
| | | | | | | | | | | | | | | | | | | ice_pf_dcb_recfg() re-maps queues to vectors with ice_vsi_map_rings_to_vectors(), which does not restore the previous state for XDP queues. This leads to no AF_XDP traffic after rebuild. Map XDP queues to vectors in ice_vsi_map_rings_to_vectors(). Also, move the code around, so XDP queues are mapped independently only through .ndo_bpf(). Fixes: 6624e780a577 ("ice: split ice_vsi_setup into smaller functions") Reviewed-by: Przemek Kitszel <[email protected]> Signed-off-by: Larysa Zaremba <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-5-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <[email protected]>
* ice: add flag to distinguish reset from .ndo_bpf in XDP rings configLarysa Zaremba2024-06-061-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 6624e780a577 ("ice: split ice_vsi_setup into smaller functions") has placed ice_vsi_free_q_vectors() after ice_destroy_xdp_rings() in the rebuild process. The behaviour of the XDP rings config functions is context-dependent, so the change of order has led to ice_destroy_xdp_rings() doing additional work and removing XDP prog, when it was supposed to be preserved. Also, dependency on the PF state reset flags creates an additional, fortunately less common problem: * PFR is requested e.g. by tx_timeout handler * .ndo_bpf() is asked to delete the program, calls ice_destroy_xdp_rings(), but reset flag is set, so rings are destroyed without deleting the program * ice_vsi_rebuild tries to delete non-existent XDP rings, because the program is still on the VSI * system crashes With a similar race, when requested to attach a program, ice_prepare_xdp_rings() can actually skip setting the program in the VSI and nevertheless report success. Instead of reverting to the old order of function calls, add an enum argument to both ice_prepare_xdp_rings() and ice_destroy_xdp_rings() in order to distinguish between calls from rebuild and .ndo_bpf(). Fixes: efc2214b6047 ("ice: Add support for XDP") Reviewed-by: Igor Bagnucki <[email protected]> Signed-off-by: Larysa Zaremba <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-4-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <[email protected]>
* ice: remove af_xdp_zc_qps bitmapLarysa Zaremba2024-06-061-11/+21
| | | | | | | | | | | | | | | | | | | | | | | Referenced commit has introduced a bitmap to distinguish between ZC and copy-mode AF_XDP queues, because xsk_get_pool_from_qid() does not do this for us. The bitmap would be especially useful when restoring previous state after rebuild, if only it was not reallocated in the process. This leads to e.g. xdpsock dying after changing number of queues. Instead of preserving the bitmap during the rebuild, remove it completely and distinguish between ZC and copy-mode queues based on the presence of a device associated with the pool. Fixes: e102db780e1c ("ice: track AF_XDP ZC enabled queues in bitmap") Reviewed-by: Przemek Kitszel <[email protected]> Signed-off-by: Larysa Zaremba <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Chandan Kumar Rout <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Link: https://lore.kernel.org/r/20240603-net-2024-05-30-intel-net-fixes-v2-3-e3563aa89b0c@intel.com Signed-off-by: Jakub Kicinski <[email protected]>
* ice: refactor struct ice_vsi_cfg_params to be inside of struct ice_vsiMateusz Polchlopek2024-05-061-6/+8
| | | | | | | | | | | | | | | Refactor struct ice_vsi_cfg_params to be embedded into struct ice_vsi. Prior to that the members of the struct were scattered around ice_vsi, and were copy-pasted for purposes of reinit. Now we have struct handy, and it is easier to have something sticky in the flags field. Suggested-by: Przemek Kitszel <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Reviewed-by: Vaishnavi Tipireddy <[email protected]> Signed-off-by: Mateusz Polchlopek <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* ice: store VF relative MSI-X index in q_vector->vf_reg_idxJacob Keller2024-04-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ice physical function driver needs to configure the association of queues and interrupts on behalf of its virtual functions. This is done over virtchnl by the VF sending messages during its initialization phase. These messages contain a vector_id which the VF wants to associate with a given queue. This ID is relative to the VF space, where 0 indicates the control IRQ for non-queue interrupts. When programming the mapping, the PF driver currently passes this vector_id directly to the low level functions for programming. This works for SR-IOV, because the hardware uses the VF-based indexing for interrupts. This won't work for Scalable IOV, which uses PF-based indexing for programming its VSIs. To handle this, the driver needs to be able to look up the proper index to use for programming. For typical IRQs, this would be the q_vector->reg_idx field. The q_vector->reg_idx can't be set to a VF relative value, because it is used when the PF needs to control the interrupt, such as when triggering a software interrupt on stopping the Tx queue. Thus, introduce a new q_vector->vf_reg_idx which can store the VF relative index for registers which expect this. Use this in ice_cfg_interrupt to look up the VF index from the q_vector. This allows removing the vector ID parameter of ice_cfg_interrupt. Also notice that this function returns an int, but then is cast to the virtchnl error enumeration, virtchnl_status_code. Update the return type to indicate it does not return an integer error code. We can't use normal error codes here because the return values are passed across the virtchnl interface. This will allow the future Scalable IOV VFs to correctly look up the index needed for programming the VF queues without breaking SR-IOV. Signed-off-by: Jacob Keller <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: add ice_adapter for shared data across PFs on the same NICMichal Schmidt2024-04-011-0/+2
| | | | | | | | | | | | | | | | | There is a need for synchronization between ice PFs on the same physical adapter. Add a "struct ice_adapter" for holding data shared between PFs of the same multifunction PCI device. The struct is refcounted - each ice_pf holds a reference to it. Its first use will be for PTP. I expect it will be useful also to improve the ugliness that is ice_prot_id_tbl. Reviewed-by: Przemek Kitszel <[email protected]> Signed-off-by: Michal Schmidt <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* ice: remove switchdev control plane VSIMichal Swiatkowski2024-03-251-1/+0
| | | | | | | | | | | | | For slow-path Rx and Tx PF VSI is used. There is no need to have control plane VSI. Remove all code related to it. Eswitch rebuild can't fail without rebuilding control plane VSI. Return void from ice_eswitch_rebuild(). Reviewed-by: Marcin Szycik <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Tested-by: Sujai Buvaneswaran <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: remove eswitch changing queues algorithmMichal Swiatkowski2024-03-251-6/+0
| | | | | | | | | | | | Changing queues used by eswitch will be done through PF netdev. There is no need to reserve queues if the number of used queues is known. Reviewed-by: Wojciech Drewek <[email protected]> Reviewed-by: Marcin Szycik <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Tested-by: Sujai Buvaneswaran <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: Fix debugfs with devlink reloadWojciech Drewek2024-02-121-0/+1
| | | | | | | | | | | | | | | | | During devlink reload it is needed to remove debugfs entries correlated with only one PF. ice_debugfs_exit() removes all entries created by ice driver so we can't use it. Introduce ice_debugfs_pf_deinit() in order to release PF's debugfs entries. Move ice_debugfs_exit() call to ice_module_exit(), it makes more sense since ice_debugfs_init() is called in ice_module_init() and not in ice_probe(). Signed-off-by: Wojciech Drewek <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Reviewed-by: Brett Creeley <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: Remove and readd netdev during devlink reloadWojciech Drewek2024-02-121-0/+2
| | | | | | | | | | | | | | | | | | | | | | | Recent changes to the devlink reload (commit 9b2348e2d6c9 ("devlink: warn about existing entities during reload-reinit")) force the drivers to destroy devlink ports during reinit. Adjust ice driver to this requirement, unregister netdvice, destroy devlink port. ice_init_eth() was removed and all the common code between probe and reload was moved to ice_load(). During devlink reload we can't take devl_lock (it's already taken) and in ice_probe() we have to lock it. Use devl_* variant of the API which does not acquire and release devl_lock. Guard ice_load() with devl_lock only in case of probe. Suggested-by: Jiri Pirko <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Reviewed-by: Vadim Fedorenko <[email protected]> Reviewed-by: Simon Horman <[email protected]> Reviewed-by: Brett Creeley <[email protected]> Signed-off-by: Wojciech Drewek <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* ice: Add a new counter for Rx EIPE errorsAniruddha Paul2024-02-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | HW incorrectly reports EIPE errors on encapsulated packets with L2 padding inside inner packet. HW shows outer UDP/IPV4 packet checksum errors as part of the EIPE flags of the Rx descriptor. These are reported only if checksum offload is enabled and L3/L4 parsed flag is valid in Rx descriptor. When that error is reported by HW, we don't act on it instead of incrementing main Rx errors statistic as it would normally happen. Add a new statistic to count these errors since we still want to print them. Signed-off-by: Aniruddha Paul <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Reviewed-by: Jan Glaza <[email protected]> Signed-off-by: Jan Sokolowski <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* ice: introduce PTP state machineJacob Keller2024-01-301-1/+0
| | | | | | | | | | | | | | | | | Add PTP state machine so that the driver can correctly identify PTP state around resets. When the driver got information about ungraceful reset, PTP was not prepared for reset and it returned error. When this situation occurs, prepare PTP before rebuilding its structures. Signed-off-by: Jacob Keller <[email protected]> Co-developed-by: Karol Kolacinski <[email protected]> Signed-off-by: Karol Kolacinski <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
* ice: Enable SW interrupt from FW for LL TSKarol Kolacinski2024-01-021-0/+2
| | | | | | | | | | | | | | | | | | | | Introduce new capability - Low Latency Timestamping with Interrupt. On supported devices, driver can request a single timestamp from FW without polling the register afterwards. Instead, FW can issue a dedicated interrupt when the timestamp was read from the PHY register and its value is available to read from the register. This eliminates the need of bottom half scheduling, which results in minimal delay for timestamping. For this mode, allocate TS indices sequentially, so that timestamps are always completed in FIFO manner. Co-developed-by: Yochai Hagvi <[email protected]> Signed-off-by: Yochai Hagvi <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Signed-off-by: Karol Kolacinski <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* ice: Schedule service task in IRQ top halfKarol Kolacinski2024-01-021-1/+0
| | | | | | | | | | | | Schedule service task and EXTTS in the top half to avoid bottom half scheduling if possible, which significantly reduces timestamping delay. Co-developed-by: Michal Michalik <[email protected]> Signed-off-by: Michal Michalik <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Signed-off-by: Karol Kolacinski <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* Merge tag 'for-netdev' of ↵Jakub Kicinski2023-12-191-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2023-12-18 This PR is larger than usual and contains changes in various parts of the kernel. The main changes are: 1) Fix kCFI bugs in BPF, from Peter Zijlstra. End result: all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y. 2) Introduce BPF token object, from Andrii Nakryiko. It adds an ability to delegate a subset of BPF features from privileged daemon (e.g., systemd) through special mount options for userns-bound BPF FS to a trusted unprivileged application. The design accommodates suggestions from Christian Brauner and Paul Moore. Example: $ sudo mkdir -p /sys/fs/bpf/token $ sudo mount -t bpf bpffs /sys/fs/bpf/token \ -o delegate_cmds=prog_load:MAP_CREATE \ -o delegate_progs=kprobe \ -o delegate_attachs=xdp 3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei. - Complete precision tracking support for register spills - Fix verification of possibly-zero-sized stack accesses - Fix access to uninit stack slots - Track aligned STACK_ZERO cases as imprecise spilled registers. It improves the verifier "instructions processed" metric from single digit to 50-60% for some programs. - Fix verifier retval logic 4) Support for VLAN tag in XDP hints, from Larysa Zaremba. 5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu. End result: better memory utilization and lower I$ miss for calls to BPF via BPF trampoline. 6) Fix race between BPF prog accessing inner map and parallel delete, from Hou Tao. 7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu. It allows BPF interact with IPSEC infra. The intent is to support software RSS (via XDP) for the upcoming ipsec pcpu work. Experiments on AWS demonstrate single tunnel pcpu ipsec reaching line rate on 100G ENA nics. 8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao. 9) BPF file verification via fsverity, from Song Liu. It allows BPF progs get fsverity digest. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (164 commits) bpf: Ensure precise is reset to false in __mark_reg_const_zero() selftests/bpf: Add more uprobe multi fail tests bpf: Fail uprobe multi link with negative offset selftests/bpf: Test the release of map btf s390/bpf: Fix indirect trampoline generation selftests/bpf: Temporarily disable dummy_struct_ops test on s390 x86/cfi,bpf: Fix bpf_exception_cb() signature bpf: Fix dtor CFI cfi: Add CFI_NOSEAL() x86/cfi,bpf: Fix bpf_struct_ops CFI x86/cfi,bpf: Fix bpf_callback_t CFI x86/cfi,bpf: Fix BPF JIT call cfi: Flip headers selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment bpf: Limit the number of kprobes when attaching program to multiple kprobes bpf: Limit the number of uprobes when attaching program to multiple uprobes bpf: xdp: Register generic_kfunc_set with XDP programs selftests/bpf: utilize string values for delegate_xxx mount options ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
| * ice: Support HW timestamp hintLarysa Zaremba2023-12-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use previously refactored code and create a function that allows XDP code to read HW timestamp. Also, introduce packet context, where hints-related data will be stored. ice_xdp_buff contains only a pointer to this structure, to avoid copying it in ZC mode later in the series. HW timestamp is the first supported hint in the driver, so also add xdp_metadata_ops. Reviewed-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Larysa Zaremba <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
* | ice: configure FW loggingPaul M Stillwell Jr2023-12-141-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Users want the ability to debug FW issues by retrieving the FW logs from the E8xx devices. Use debugfs to allow the user to configure the log level and number of messages for FW logging. If FW logging is supported on the E8xx then the file 'fwlog' will be created under the PCI device ID for the ice driver. If the file does not exist then either the E8xx doesn't support FW logging or debugfs is not enabled on the system. One thing users want to do is control which events are reported. The user can read and write the 'fwlog/modules/<module name>' to get/set the log levels. Each module in the FW that supports logging ht as a file under 'fwlog/modules' that supports reading (to see what the current log level is) and writing (to change the log level). The format to set the log levels for a module are: # echo <log level> > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/<module> The supported log levels are: * none * error * warning * normal * verbose Each level includes the messages from the previous/lower level The modules that are supported are: * general * ctrl * link * link_topo * dnl * i2c * sdp * mdio * adminq * hdma * lldp * dcbx * dcb * xlr * nvm * auth * vpd * iosf * parser * sw * scheduler * txq * rsvd * post * watchdog * task_dispatch * mng * synce * health * tsdrv * pfreg * mdlver * all The module 'all' is a special module which allows the user to read or write to all of the modules. The following example command would set the DCB module to the 'normal' log level: # echo normal > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/dcb If the user wants to set the DCB, Link, and the AdminQ modules to 'verbose' then the commands are: # echo verbose > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/dcb # echo verbose > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/link # echo verbose > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/adminq If the user wants to set all modules to the 'warning' level then the command is: # echo warning > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/all If the user wants to disable logging for a module then they can set the level to 'none'. An example setting the 'watchdog' module is: # echo none > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/watchdog If the user wants to see what the log level is for a specific module then the command is: # cat /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/dcb This will return the log level for the DCB module. If the user wants to see the log level for all the modules then the command is: # cat /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/all Writing to the module file will update the configuration, but NOT enable the configuration (that is a separate command). In addition to configuring the modules, the user can also configure the number of log messages (nr_messages) to include in a single Admin Receive Queue (ARQ) event.The range is 1-128 (1 means push every log message, 128 means push only when the max AQ command buffer is full). The suggested value is 10. To see/change the resolution the user can read/write the 'fwlog/nr_messages' file. An example changing the value to 50 is # echo 50 > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/nr_messages To see the current value of 'nr_messages' then the command is: # cat /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/nr_messages Signed-off-by: Paul M Stillwell Jr <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
* | ice: enable symmetric-xor RSS for Toeplitz hash functionJeff Guo2023-12-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow the user to set the symmetric Toeplitz hash function via: # ethtool -X eth0 hfunc toeplitz symmetric-xor All existing RSS configurations will be converted to symmetric unless they have a non-symmetric field (other than IP src/dst and L4 src/dst ports) used for hashing. The driver will reject a new RSS configuration if such a field is requested. The hash function in the E800 NICs is set per-VSI and a specific AQ command is needed to modify the hash function. Use the AQ command to enable setting the symmetric Toeplitz RSS hash function for any VSI in the new ice_set_rss_hfunc(). When the Symmetric Toeplitz hash function is used, the hardware sets the input set of the RSS (Toeplitz) algorithm to be the XOR of the fields index by HSYMM and the fields index by the INSET registers. We use this to create a symmetric hash by setting the HSYMM registers to point to their counterparts in the INSET registers: HSYMM [src_fv] = dst_fv; HSYMM [dst_fv] = src_fv; where src_fv and dst_fv are the indexes of the protocol's src and dst fields. Reviewed-by: Wojciech Drewek <[email protected]> Signed-off-by: Jeff Guo <[email protected]> Signed-off-by: Jesse Brandeburg <[email protected]> Co-developed-by: Ahmed Zaki <[email protected]> Signed-off-by: Ahmed Zaki <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
* | ice: read internal temperature sensorKonrad Knitter2023-12-051-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since 4.30 firmware exposes internal thermal sensor reading via admin queue commands. Expose those readouts via hwmon API when supported. Datasheet: Get Sensor Reading Command (Opcode: 0x0632) +--------------------+--------+--------------------+-------------------------+ | Name | Bytes | Value | Remarks | +--------------------+--------+--------------------+-------------------------+ | Flags | 1-0 | | | | Opcode | 2-3 | 0x0632 | Command opcode | | Datalen | 4-5 | 0 | No external buffer. | | Return value | 6-7 | | Return value. | | Cookie High | 8-11 | Cookie | | | Cookie Low | 12-15 | Cookie | | | Sensor | 16 | | 0x00: Internal temp | | | | | 0x01-0xFF: Reserved. | | Format | 17 | Requested response | Only 0x00 is supported. | | | | format | 0x01-0xFF: Reserved. | | Reserved | 18-23 | | | | Data Address high | 24-27 | Response buffer | | | | | address | | | Data Address low | 28-31 | Response buffer | | | | | address | | +--------------------+--------+--------------------+-------------------------+ Get Sensor Reading Response (Opcode: 0x0632) +--------------------+--------+--------------------+-------------------------+ | Name | Bytes | Value | Remarks | +--------------------+--------+--------------------+-------------------------+ | Flags | 1-0 | | | | Opcode | 2-3 | 0x0632 | Command opcode | | Datalen | 4-5 | 0 | No external buffer | | Return value | 6-7 | | Return value. | | | | | EINVAL: Invalid | | | | | parameters | | | | | ENOENT: Unsupported | | | | | sensor | | | | | EIO: Sensor access | | | | | error | | Cookie High | 8-11 | Cookie | | | Cookie Low | 12-15 | Cookie | | | Sensor Reading | 16-23 | | Format of the reading | | | | | is dependent on request | | Data Address high | 24-27 | Response buffer | | | | | address | | | Data Address low | 28-31 | Response buffer | | | | | address | | +--------------------+--------+--------------------+-------------------------+ Sensor Reading for Sensor 0x00 (Internal Chip Temperature): +--------------------+--------+--------------------+-------------------------+ | Name | Bytes | Value | Remarks | +--------------------+--------+--------------------+-------------------------+ | Thermal Sensor | 0 | | Reading in degrees | | reading | | | Celsius. Signed int8 | | Warning High | 1 | | Warning High threshold | | threshold | | | in degrees Celsius. | | | | | Unsigned int8. | | | | | 0xFF when unsupported | | Critical High | 2 | | Critical High threshold | | threshold | | | in degrees Celsius. | | | | | Unsigned int8. | | | | | 0xFF when unsupported | | Fatal High | 3 | | Fatal High threshold | | threshold | | | in degrees Celsius. | | | | | Unsigned int8. | | | | | 0xFF when unsupported | | Reserved | 4-7 | | | +--------------------+--------+--------------------+-------------------------+ Driver provides current reading from HW as well as device specific thresholds for thermal alarm (Warning, Critical, Fatal) events. $ sensors Output ========================================================= ice-pci-b100 Adapter: PCI adapter temp1: +62.0°C (high = +95.0°C, crit = +105.0°C) (emerg = +115.0°C) Tested on Intel Corporation Ethernet Controller E810-C for SFP Co-developed-by: Marcin Domagala <[email protected]> Signed-off-by: Marcin Domagala <[email protected]> Co-developed-by: Eric Joyner <[email protected]> Signed-off-by: Eric Joyner <[email protected]> Reviewed-by: Marcin Szycik <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Signed-off-by: Konrad Knitter <[email protected]> Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
* ice: reserve number of CP queuesMichal Swiatkowski2023-11-131-0/+6
| | | | | | | | | | | | | | | | | Rebuilding CP VSI each time the PR is created drastically increase the time of maximum VFs creation. Add function to reserve number of CP queues to deal with this problem. Use the same function to decrease number of queues in case of removing VFs. Assume that caller of ice_eswitch_reserve_cp_queues() will also call ice_eswitch_attach/detach() correct number of times. Still one by one PR adding is handy for VF resetting routine. Reviewed-by: Wojciech Drewek <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Tested-by: Sujai Buvaneswaran <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
* ice: track port representors in xarrayMichal Swiatkowski2023-11-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Instead of assuming that each VF has pointer to port representor save it in xarray. It will allow adding port representor for other device types. Drop reference to VF where it is use only to get port representor. Get it from xarray instead. The functions will no longer by specific for VF, rename them. Track id assigned by xarray in port representor structure. The id can't be used as ::q_id, because it is fixed during port representor lifetime. ::q_id can change after adding / removing other port representors. Side effect of removing VF pointer is that we are losing VF MAC information used in unrolling. Store it in port representor as parent MAC. Reviewed-by: Piotr Raczynski <[email protected]> Reviewed-by: Wojciech Drewek <[email protected]> Signed-off-by: Michal Swiatkowski <[email protected]> Tested-by: Sujai Buvaneswaran <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>