path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
Commit message | Author | Age | Files | Lines
* drm/amdgpu: Add xgmi API to set max speed/width | Lijo Lazar | 2025-06-18 | 1 | -0/+7
  Add an API to set the max possible xgmi speed/width.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Asad Kamal <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Deprecate xgmi_link_speed enum | Lijo Lazar | 2025-06-18 | 1 | -2/+4
  xgmi doesn't have discrete max speeds defined. Speed numbers can be arbitrary based on SOC. Deprecate the enum.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Asad Kamal <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/pm: Use external link order for xgmi data | Lijo Lazar | 2025-05-22 | 1 | -0/+17
  xgmi_port_num interface reports external link number for port number. To be consistent, use the external link number for reporting other XGMI link data also.
  v2: For invalid link number return -EINVAL (Kevin)
  Signed-off-by: Lijo Lazar <[email protected]>
  Acked-by: Yang Wang <[email protected]>
  Acked-by: Alex Deucher <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix query order of XGMI v6.4.1 status | Lijo Lazar | 2025-04-30 | 1 | -2/+2
  Keep the register offsets as per link order for querying XGMI v6.4.1 link status.
  Signed-off-by: Lijo Lazar <[email protected]>
  Acked-by: Alex Deucher <[email protected]>
  Tested-by: Mangesh Gadre <[email protected]>
  Fixes: 6dee64e765c4 ("drm/amdgpu: Fix xgmi v6.4.1 link status reporting")
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix xgmi v6.4.1 link status reporting | Lijo Lazar | 2025-04-07 | 1 | -6/+18
  Use the right register offsets for getting link status.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Asad Kamal <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Parse all deferred errors with UMC aca handle | Xiang Liu | 2025-03-26 | 1 | -1/+1
  We should only increase the deferred errors in UMC block.
  Signed-off-by: Xiang Liu <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Calculate IP specific xgmi bandwidth | Lijo Lazar | 2025-03-14 | 1 | -1/+3
  Use IP version specific xgmi speed/width for bandwidth calculation.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Jonathan Kim <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/amdgpu: Add support for xgmi_v6_4_1 | Asad Kamal | 2025-02-27 | 1 | -1/+9
  Add support for xgmi_v6_4_1 and use it in the appropriate places.
  Signed-off-by: Asad Kamal <[email protected]>
  Reviewed-by: Lijo Lazar <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add xgmi speed/width related info | Lijo Lazar | 2025-02-27 | 1 | -0/+23
  Add APIs to initialize XGMI speed and width details and to get the max bandwidth supported. It is assumed that a device only supports the same generation of XGMI links with uniform width.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Move xgmi definitions to xgmi header | Lijo Lazar | 2025-02-27 | 1 | -0/+8
  Move definitions related to xgmi to amdgpu_xgmi header
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Decode deferred error type in aca bank parser | Xiang Liu | 2025-02-27 | 1 | -2/+2
  In the case of poison inband log, the error type needs to be specified by checking the deferred or poison bit of the status register.
  v2: check both deferred and poison bit
  Signed-off-by: Xiang Liu <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: simplify xgmi peer info calls | Jonathan Kim | 2025-02-25 | 1 | -10/+51
  Deprecate KFD XGMI peer info calls in favour of calling directly from simplified XGMI peer info functions.
  Signed-off-by: Jonathan Kim <[email protected]>
  Reviewed-by: Lijo Lazar <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Include ACA error type in aca bank | Hawking Zhang | 2025-02-17 | 1 | -0/+2
  ACA error types managed by the driver do not have a direct 1:1 correspondence with those managed by firmware. To address this, for each ACA bank, include both the ACA error type and the ACA SMU type. This addition is useful for creating CPER records.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Yang Wang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/pm: Get xgmi link status for XGMI_v_6_4_0 | Asad Kamal | 2024-11-20 | 1 | -0/+41
  Get XGMI_v_6_4_0 link status and populate it to metrics v1_7 for SMU_v_13_0_6
  v2: Get link status register value for each soc from separate function (Lijo)
  Signed-off-by: Asad Kamal <[email protected]>
  Reviewed-by: Lijo Lazar <[email protected]>
  Reviewed-by: Yang Wang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdkfd: sever xgmi io link if host driver has disable sharing | Jonathan Kim | 2024-10-24 | 1 | -0/+17
  Host drivers can create partial hives per guest by disabling xgmi sharing between certain peers in the main hive. Typically, these partial hives are fully connected per guest session. In the event that the host makes a mistake by adding a non-shared node to a guest session, have the KFD reflect sharing disabled by severing the IO link.
  Signed-off-by: Jonathan Kim <[email protected]>
  Tested-by: James Yao <[email protected]>
  Reviewed-by: Harish Kasiviswanathan <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix the logic for NPS request failure | Lijo Lazar | 2024-10-24 | 1 | -12/+16
  On a hive, NPS request is placed by the first one for all devices in the hive. If the request fails, mark the mode as UNKNOWN so that subsequent devices on unload don't request it. Also, fix the mutex double lock issue in error condition, should have been mutex_unlock.
  Fixes: ee52489d1210 ("drm/amdgpu: Place NPS mode request on unload")
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Rajneesh Bhardwaj <[email protected]>
  Acked-by: Alex Deucher <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Wait for reset on init completion | Lijo Lazar | 2024-10-15 | 1 | -1/+8
  When reset on initialization is requested, wait for the reset to finish. In cases where module is loaded after boot, this makes sure all initialization work is done after a successful return of modprobe.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Ramesh Errabolu <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Place NPS mode request on unload | Lijo Lazar | 2024-10-15 | 1 | -0/+38
  If a user has requested NPS mode switch, place the request through PSP during unload of the driver. For devices which are part of a hive, all requests are placed together. If one of them fails, revert back to the current NPS mode.
  Signed-off-by: Lijo Lazar <[email protected]>
  Signed-off-by: Rajneesh Bhardwaj <[email protected]>
  Reviewed-by: Feifei Xu <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add gmc interface to request NPS mode | Lijo Lazar | 2024-10-07 | 1 | -0/+1
  Add a common interface in GMC to request NPS mode through PSP. Also add a variable in hive and gmc control to track the last requested mode.
  Signed-off-by: Rajneesh Bhardwaj <[email protected]>
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Feifei Xu <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix spelling mistake "initializtion" -> "initialization" | Colin Ian King | 2024-10-07 | 1 | -1/+1
  There is a spelling mistake in a dev_err message. Fix it.
  Signed-off-by: Colin Ian King <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Refactor XGMI reset on init handling | Lijo Lazar | 2024-09-26 | 1 | -5/+68
  Use XGMI hive information to rely on resetting XGMI devices on initialization rather than using mgpu structure. mgpu structure may have other devices as well.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Feifei Xu <[email protected]>
  Acked-by: Rajneesh Bhardwaj <[email protected]>
  Tested-by: Rajneesh Bhardwaj <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use init level for pending_reset flag | Lijo Lazar | 2024-09-26 | 1 | -3/+3
  Drop pending_reset flag in gmc block. Instead use init level to determine which type of init is preferred - in this case MINIMAL.
  Signed-off-by: Lijo Lazar <[email protected]>
  Acked-by: Rajneesh Bhardwaj <[email protected]>
  Tested-by: Rajneesh Bhardwaj <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: remove RAS unused paramter 'err_addr' | Yang Wang | 2024-08-06 | 1 | -2/+2
  - amdgpu_ras_error_statistic_ue_count()
  - amdgpu_ras_error_statistic_ce_count()
  - amdgpu_ras_error_statistic_de_count()
  The parameter 'err_addr' is no longer used since following patch.
  Fixes: a7e8467fbeee ("drm/amdgpu: Remove unused code")
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/pm: Remove legacy interface for xgmi plpd | Lijo Lazar | 2024-05-17 | 1 | -2/+2
  Replace the legacy interface with amdgpu_dpm_set_pm_policy to set XGMI PLPD mode. Also, xgmi_plpd_policy sysfs node is not used by any client. Remove that as well.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Reviewed-by: Asad Kamal <[email protected]>
  Acked-by: Alex Deucher <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix uninitialized variable warning for amdgpu_xgmi | Tim Huang | 2024-05-08 | 1 | -0/+3
  Clear the warning about using the uninitialized variable current_node.
  Signed-off-by: Tim Huang <[email protected]>
  Reviewed-by: Alex Deucher <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: update check condition for XGMI ACA UE | Tao Zhou | 2024-04-10 | 1 | -1/+3
  Check more possible ext error codes.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Yang Wang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: retire unused aca_bank_report data structure | Yang Wang | 2024-03-20 | 1 | -3/+3
  retire unused aca_bank_report data structure.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refine aca error cache for xgmi v6.4.0 | Yang Wang | 2024-03-20 | 1 | -4/+8
  refine aca error cache for xgmi v6.4.0
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add new aca_smu_type support | Yang Wang | 2024-03-20 | 1 | -5/+13
  Add new types to distinguish between ACA error type and smu mca type.
  e.g.: ACA_ERROR_TYPE_DEFERRED does not match any valid smu mca bank channel, so add a new type 'aca_smu_type' to distinguish aca error type from smu mca type.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: replace MCA macro with ACA for XGMI | Yang Wang | 2024-01-15 | 1 | -12/+12
  Use the new ACA macro instead of the MCA one.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add xgmi v6.4.0 ACA support | Yang Wang | 2024-01-15 | 1 | -1/+60
  add xgmi v6.4.0 ACA driver support
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: MCA supports recording umc address information | YiPeng Chai | 2023-12-19 | 1 | -2/+2
  MCA supports recording umc address information.
  V2: Move err_addr variable from struct ras_err_node to struct ras_err_info.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: xgmi_fill_topology_info | Vignesh Chander | 2023-12-13 | 1 | -8/+50
  1. Use the mirrored topology info to fill links for VF. The new solution is required to simplify and optimize host driver logic. Only use the new solution for VFs that support full duplex and extended_peer_link_info, otherwise the info would be incomplete.
  2. Avoid calling extended_link_info on VF as it's not supported.
  Signed-off-by: Vignesh Chander <[email protected]>
  Reviewed-by: Zhigang Luo <[email protected]>
  Reviewed-by: Jonathan Kim <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: expose the connected port num info through sysfs | Shiwu Zhang | 2023-11-17 | 1 | -0/+44
  Catting the xgmi_port_num sysfs node prints out the info in the format of <src node id>:<src port num> -> <dst node id>:<dst port num> for one xgmi link. For example, in a setup with 4 sockets fully and evenly connected, it would look like the following for the first node in the hive.
    01:02 -> 02:03
    01:03 -> 02:02
    01:07 -> 03:04
    01:04 -> 03:07
    01:06 -> 04:05
    01:05 -> 04:06
  Given that there are two xgmi links between each socket pair, "01:02 -> 02:03" means that the current socket uses port 2 to connect with port 3 of the second node in the hive, and so on.
  v2: print out the src/dst node id for each xgmi link (lijo)
  v3: replace the current_node++ with +1 to align with dst node (le) and use the dev_err instead of pr_err (lijo)
  v4: fix checkpatch warning (alex)
  Signed-off-by: Shiwu Zhang <[email protected]>
  Acked-by: Lijo Lazar <[email protected]>
  Reviewed-by: Le Ma <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
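  A minimal user-space sketch of reading the xgmi_port_num format described in the commit above. It assumes one link per line and that node ids and port numbers are printed as hexadecimal; both are assumptions, not stated in the commit message. The names XgmiLink and parse_xgmi_port_num are hypothetical helpers, not part of the driver.

```python
import re
from collections import Counter
from typing import List, NamedTuple


class XgmiLink(NamedTuple):
    """One xgmi link as printed by the xgmi_port_num node (hypothetical type)."""
    src_node: int
    src_port: int
    dst_node: int
    dst_port: int


# "<src node id>:<src port num> -> <dst node id>:<dst port num>"
_LINK_RE = re.compile(
    r"^\s*([0-9a-fA-F]+):([0-9a-fA-F]+)\s*->\s*([0-9a-fA-F]+):([0-9a-fA-F]+)\s*$"
)


def parse_xgmi_port_num(text: str) -> List[XgmiLink]:
    """Parse the node's output, one link per non-empty line."""
    links = []
    for line in text.splitlines():
        if not line.strip():
            continue
        match = _LINK_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized xgmi_port_num line: {line!r}")
        # Assumption: the fields are hexadecimal.
        links.append(XgmiLink(*(int(field, 16) for field in match.groups())))
    return links


# Sample taken from the commit message (first node of a 4-socket hive).
SAMPLE = """\
01:02 -> 02:03
01:03 -> 02:02
01:07 -> 03:04
01:04 -> 03:07
01:06 -> 04:05
01:05 -> 04:06
"""

links = parse_xgmi_port_num(SAMPLE)
links_per_peer = Counter(link.dst_node for link in links)
```

  With the sample above, links_per_peer counts two links towards each of nodes 2, 3 and 4, which matches the "two xgmi links between each socket pair" description.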
* drm/amdgpu: add pcs xgmi v6.4.0 ras support | Yang Wang | 2023-11-09 | 1 | -3/+155
  add pcs xgmi v6.4.0 ras support
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Don't warn for unsupported set_xgmi_plpd_mode | Tao Zhou | 2023-11-09 | 1 | -6/+8
  set_xgmi_plpd_mode may be unsupported and this isn't an error, so there is no need to print a warning for it.
  v2: add ret2 to save the status of psp_ras_trigger_error.
  Suggested-by: [email protected]
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add RAS reset/query operations for XGMI v6_4 | Tao Zhou | 2023-11-07 | 1 | -3/+43
  Reset/query RAS error status and count.
  v2: use XGMI IP version instead of WAFL version.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: replace reset_error_count with amdgpu_ras_reset_error_count | Tao Zhou | 2023-10-20 | 1 | -2/+2
  Simplify the code.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/pm: deprecate allow_xgmi_power_down interface | Le Ma | 2023-09-28 | 1 | -2/+2
  Replace with set_plpd_mode uniformly for places to use.
  Signed-off-by: Le Ma <[email protected]>
  Reviewed-by: Lijo Lazar <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu:Expose physical id of device in XGMI hive | Mangesh Gadre | 2023-09-26 | 1 | -0/+20
  This identifies the physical ordering of devices in the hive
  v2: fix compilation issue
  Signed-off-by: Mangesh Gadre <[email protected]>
  Reviewed-by: Lijo Lazar <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use function for IP version check | Lijo Lazar | 2023-09-20 | 1 | -1/+2
  Use an inline function for version check. Gives more flexibility to handle any format changes.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Alex Deucher <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add -ENOMEM error handling when there is no memory | Srinivasan Shanmugam | 2023-07-25 | 1 | -0/+1
  Return -ENOMEM when there is not sufficient memory for a dynamic allocation.
  Cc: Christian König <[email protected]>
  Cc: Alex Deucher <[email protected]>
  Signed-off-by: Srinivasan Shanmugam <[email protected]>
  Reviewed-by: Alex Deucher <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: expose num_hops and num_links xgmi info through dev attr | Shiwu Zhang | 2023-06-15 | 1 | -0/+46
  Add these two dev attrs for xgmi info details, which are helpful for developers checking the xgmi topology by catting the sys file directly.
  Take 4 cards with xgmi connection as an example: get the num_hops for each device or node through the xgmi_hive_info dir like,
    cat /sys/bus/pci/devices/0000:41:00.0/xgmi_hive_info/node1/num_hops
  which will return "00 41 41 41", where "00" stands for the hops to node1 itself and "41" is the hops in hex format to every other node in the same hive. There are node1/node2/node3/node4 representing 4 cards in the hive.
  The same for num_links dev attr.
  Signed-off-by: Shiwu Zhang <[email protected]>
  Acked-by: Le Ma <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
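  A small sketch of reading the num_hops / num_links attributes described in the commit above from user space. It relies only on what the message states: each node directory holds a row of space-separated hex values. The helper names parse_hex_row and read_hive_attr are hypothetical, and the sketch does not attempt to decode what the bits inside a value such as 0x41 mean.

```python
from pathlib import Path
from typing import Dict, List


def parse_hex_row(line: str) -> List[int]:
    """Turn a row such as "00 41 41 41" into a list of integers."""
    return [int(token, 16) for token in line.split()]


def read_hive_attr(hive_dir: Path, attr: str = "num_hops") -> Dict[str, List[int]]:
    """Read one attribute from every node* directory under xgmi_hive_info.

    hive_dir is expected to look like
    /sys/bus/pci/devices/<bdf>/xgmi_hive_info (the layout from the commit message).
    """
    rows = {}
    for node_dir in sorted(hive_dir.glob("node*")):
        attr_file = node_dir / attr
        if attr_file.is_file():
            rows[node_dir.name] = parse_hex_row(attr_file.read_text())
    return rows


# Row quoted in the commit message for node1 of a 4-card hive.
row = parse_hex_row("00 41 41 41")
```

  On a machine with an XGMI hive, read_hive_attr(Path("/sys/bus/pci/devices/0000:41:00.0/xgmi_hive_info")) would be the intended call; the PCI address is the one from the commit message and will differ per system.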
* drm/amdgpu: add instance mask for RAS inject | Tao Zhou | 2023-06-09 | 1 | -2/+3
  User can specify injected instances by the mask. For backward compatibility, the mask value is incorporated into sub block index without interface change of RAS TA. User uses logical mask and driver should convert it to physical value before sending it to RAS TA.
  v2: update parameter name.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Reviewed-by: Stanley.Yang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Correct xgmi_wafl block name | Hawking Zhang | 2023-03-31 | 1 | -1/+1
  Fix backward compatibility issue to stay with the old name of xgmi_wafl node.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Rework xgmi_wafl_pcs ras sw_init | Hawking Zhang | 2023-03-15 | 1 | -5/+23
  To align with other IP blocks.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Stanley Yang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: make kobj_type structures constant | Thomas Weißschuh | 2023-02-23 | 1 | -1/+1
  Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definitions to prevent modification at runtime.
  Reviewed-by: Christian König <[email protected]>
  Signed-off-by: Thomas Weißschuh <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: correct query xgmi3x16 pcs error status | Stanley.Yang | 2023-01-17 | 1 | -3/+69
  Aldebaran has an xgmi3x16 pcs error status, so the driver should check the xgmi3x16 pcs error status field instead of the gopx16 pcs error status field.
  Signed-off-by: Stanley.Yang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: support check xgmi/walf error mask bit for aldebaran | Stanley.Yang | 2023-01-17 | 1 | -38/+64
  The pcs error count should be determined by the PCS ERROR status and PCS ERROR MASK registers; the PCS ERROR status register alone cannot reflect error counts accurately.
  Changed from V1:
    remove clean noncorrectable mask registers
    optimize query pcs error status
  Changed from V2:
    remove check mask_value bits
    correct set value corresponding bit
  Signed-off-by: Stanley.Yang <[email protected]>
  Reviewed-by: Lijo Lazar <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix potential double free and null pointer dereference | Liang He | 2022-11-29 | 1 | -2/+0
  In amdgpu_get_xgmi_hive(), we should not call kfree() after kobject_put() as the PUT will call kfree().
  In amdgpu_device_ip_init(), we need to check the returned *hive* which can be NULL before we dereference it.
  Signed-off-by: Liang He <[email protected]>
  Reviewed-by: Luben Tuikov <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>