path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
Commit message | Author | Age | Files | Lines
* drm/amdgpu: check recovery status of xgmi hive in ras_reset_error_count | Tao Zhou | 2023-11-03 | 1 | -1/+10
  Handle xgmi hive case.
  Suggested-by: Hawking Zhang <[email protected]>
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Stanley.Yang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: check RAS supported first in ras_reset_error_count | Tao Zhou | 2023-10-31 | 1 | -4/+4
  Not all platforms support RAS.
  Fixes: 73582be11ac8 ("drm/amdgpu: bypass RAS error reset in some conditions")
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Yang Wang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: bypass RAS error reset in some conditions | Tao Zhou | 2023-10-26 | 1 | -1/+9
  PMFW is responsible for RAS error reset in some conditions, so the driver can skip the operation.
  v2: add check for ras->in_recovery, it's set earlier than amdgpu_in_reset.
  v3: fix error in gpu reset check.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
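Note (illustrative sketch, not part of the commit): the skip decision described above might look roughly like the following. The commit message does not spell out which conditions hand the reset to PMFW, so that part is modeled here as a plain flag. The struct, its fields, and the function name are all hypothetical and do not come from the driver.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state the driver would consult. */
struct reset_ctx_sketch {
	bool pmfw_owns_reset; /* assumption: PMFW handles the reset on this platform */
	bool in_gpu_reset;    /* stands in for the amdgpu_in_reset check */
	bool in_recovery;     /* stands in for ras->in_recovery, which is set earlier */
};

/* Returns true when the driver itself should reset the error count. */
static bool driver_should_reset_sketch(const struct reset_ctx_sketch *c)
{
	/* Either flag means a reset is under way; in_recovery covers the
	 * window before the in-reset state becomes visible. */
	if (c->pmfw_owns_reset && (c->in_gpu_reset || c->in_recovery))
		return false;

	return true;
}
```

Checking both flags reflects the v2 note: in_recovery is set before the in-reset state, so testing only the latter would miss the early part of a recovery.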
* drm/amdgpu: enable RAS poison mode for APU | Tao Zhou | 2023-10-26 | 1 | -1/+2
  Enable it by default on APU platform.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Stanley.Yang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refine ras error kernel log print | Yang Wang | 2023-10-26 | 1 | -35/+81
  Refine the RAS error kernel log to avoid ambiguity for users.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix find ras error node error | Yang Wang | 2023-10-26 | 1 | -4/+3
  The original function might return the wrong node.
  Fixes: 5b1270beb380 ("drm/amdgpu: add ras_err_info to identify RAS error source")
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add set/get mca debug mode operations | Tao Zhou | 2023-10-20 | 1 | -0/+21
  Record the debug mode status in RAS.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Enable RAS feature by default for APU | Stanley.Yang | 2023-10-20 | 1 | -12/+2
  Enable RAS feature by default for aqua vanjaram on apu platform.
  Signed-off-by: Stanley.Yang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix typo for amdgpu ras error data print | Yang Wang | 2023-10-20 | 1 | -2/+2
  typo fix.
  Fixes: 5b1270beb380 ("drm/amdgpu: add ras_err_info to identify RAS error source")
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Candice Li <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix delete nodes that have been relesed | Stanley.Yang | 2023-10-20 | 1 | -3/+1
  Fix deletion of nodes that have already been freed.
  Fixes: 5b1270beb380 ("drm/amdgpu: add ras_err_info to identify RAS error source")
  Signed-off-by: Stanley.Yang <[email protected]>
  Reviewed-by: Yang Wang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Enable software RAS in vcn v4_0_3 | Hawking Zhang | 2023-10-20 | 1 | -1/+3
  Set VCN/JPEG RAS masks to enable software RAS for VCN and JPEG.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: define ras_reset_error_count function | Tao Zhou | 2023-10-20 | 1 | -4/+15
  Make the code architecture more simple.
  v2: reuse ras_reset_error_count in ras_reset_error_status.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu : Add hive ras recovery check | Asad Kamal | 2023-10-19 | 1 | -2/+7
  If one of the devices in the hive detects a fatal error, the driver needs to send a ras recovery reset message to the PMFW of all devices in the hive. For that, add a flag in the hive to indicate that it is undergoing ras recovery.
  Signed-off-by: Asad Kamal <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Reviewed-by: Lijo Lazar <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add ras_err_info to identify RAS error source | Yang Wang | 2023-10-13 | 1 | -45/+249
  introduced "ras_err_info" to better identify a RAS ERROR source.
  NOTE: For legacy chips, keep the original RAS error print format.
  v1: RAS errors may come from different dies during a RAS error query, therefore, need a new data structure to identify the source of RAS ERROR.
  v2:
  - use new data structure 'amdgpu_smuio_mcm_config_info' instead of ras_err_id (in v1 patch)
  - refine ras error dump function name
  - refine ras error dump log format
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Expose ras version & schema info | Asad Kamal | 2023-10-13 | 1 | -3/+48
  Expose ras table version & schema info to sysfs
  v2: Updated schema to get poison support info from ras context, removed asic specific checks
  Signed-off-by: Asad Kamal <[email protected]>
  Reviewed-by: Lijo Lazar <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix false positive error log | Stanley.Yang | 2023-09-20 | 1 | -5/+5
  First check whether the block ras obj is set, and return 0 directly if the block ras obj or hw_ops is not set. If the block doesn't support RAS, returning 0 is fine.
  Changed from V1: return 0 directly if block ras obj or hw ops is not set
  Signed-off-by: Stanley.Yang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix a memory leak in amdgpu_ras_feature_enable | Cong Liu | 2023-09-20 | 1 | -0/+1
  This patch fixes a memory leak in the amdgpu_ras_feature_enable() function. The leak occurs when the function sends a command to the firmware to enable or disable a RAS feature for a GFX block. If the command fails, the kfree() function is not called to free the info memory.
  Fixes: 9f051d6ff13f ("drm/amdgpu: Free ras cmd input buffer properly")
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Cong Liu <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
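Note (illustrative sketch, not part of the commit): the leak pattern described above can be reproduced in user space with calloc/free standing in for the kernel allocator. All names below are hypothetical. The sketch uses a single exit path so the buffer is released whether or not the command succeeds, which is one way to avoid this kind of leak; the actual patch is a one-line addition according to its diffstat.

```c
#include <stdlib.h>

/* Hypothetical command buffer handed to firmware. */
struct ras_cmd_sketch {
	int block;
	int enable;
};

/* Counts buffers that are currently allocated, for demonstration only. */
static int live_buffers_sketch;

/* Hypothetical firmware call: returns nonzero when asked to fail. */
static int send_cmd_sketch(const struct ras_cmd_sketch *cmd, int force_fail)
{
	(void)cmd;
	return force_fail ? -1 : 0;
}

static int feature_enable_sketch(int block, int enable, int force_fail)
{
	struct ras_cmd_sketch *cmd = calloc(1, sizeof(*cmd));
	int ret;

	if (!cmd)
		return -1;
	live_buffers_sketch++;

	cmd->block = block;
	cmd->enable = enable;
	ret = send_cmd_sketch(cmd, force_fail);

	/* Free on success and on failure alike; returning early on
	 * failure without this step is the leak being described. */
	free(cmd);
	live_buffers_sketch--;

	return ret;
}
```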
* drm/amdgpu: add amdgpu mca debug sysfs support | Yang Wang | 2023-09-20 | 1 | -0/+2
  add amdgpu mca debug sysfs support.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use function for IP version check | Lijo Lazar | 2023-09-20 | 1 | -16/+22
  Use an inline function for version check. Gives more flexibility to handle any format changes.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Alex Deucher <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fallback to old RAS error message for aqua_vanjaram | Hawking Zhang | 2023-09-11 | 1 | -2/+4
  So driver doesn't generate incorrect message until the new format is settled down for aqua_vanjaram
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Yang Wang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Free ras cmd input buffer properly | Hawking Zhang | 2023-08-30 | 1 | -7/+7
  Do not access the pointer for the ras input cmd buffer if it is not even allocated.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Stanley Yang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Allow issue disable gfx ras cmd to firmware | Hawking Zhang | 2023-08-30 | 1 | -3/+4
  Disable gfx ras command is needed in some use cases
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Enable ras for mp0 v13_0_6 sriov | YiPeng Chai | 2023-08-30 | 1 | -0/+1
  Enable ras for mp0 v13_0_6 sriov
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Stanley.Yang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: mode1 reset needs to recover mp1 for mp0 v13_0_10 | YiPeng Chai | 2023-08-15 | 1 | -0/+2
  Mode1 reset needs to recover mp1 in fatal error case for mp0 v13_0_10.
  v2: Define a macro to wrap psp function calls.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Remove unnecessary ras cap check | Hawking Zhang | 2023-08-15 | 1 | -4/+0
  RAS global isr will only be invoked by hardware interrupt. There is no need to query ras capability in the isr.
  In addition, amdgpu_ras_interrupt_fatal_error_handler ensures the isr won't be called from the guest linux side by accident.
  The RAS cap check in the isr that was introduced to fix an sriov crash is not needed any more.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Extend poison mode check to SDMA/VCN/JPEG | Candice Li | 2023-08-09 | 1 | -1/+4
  Treat SDMA/VCN/JPEG as RAS capable IP blocks in poison mode.
  Signed-off-by: Candice Li <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add RAS fatal error handler for NBIO v7.9 | Tao Zhou | 2023-08-09 | 1 | -0/+5
  Register RAS fatal error interrupt and add handler.
  v2: only register NBIO RAS for dGPU platform. change nbio_v7_9_set_ras_controller_irq_state and nbio_v7_9_set_ras_err_event_athub_irq_state to dummy functions.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Issue ras enable_feature for gfx ip only | Hawking Zhang | 2023-08-07 | 1 | -20/+10
  For non-GFX IP blocks, set up ras obj if ras feature is allowed.
  For GFX IP blocks, force issue ras enable_feature command to firmware and only set up ras obj if ras feature is allowed
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
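Note (illustrative sketch, not part of the commit): the two rules above can be written as a small decision function. The enum, struct, and function names are hypothetical, and the sketch only records which actions would be taken rather than performing them.

```c
#include <stdbool.h>

/* Hypothetical block identifiers; only the GFX / non-GFX split matters here. */
enum ras_block_sketch {
	BLOCK_GFX_SKETCH,
	BLOCK_SDMA_SKETCH,
	BLOCK_UMC_SKETCH
};

/* What the enable path would do for one block. */
struct enable_plan_sketch {
	bool send_enable_cmd; /* issue enable_feature to firmware */
	bool setup_ras_obj;   /* create the driver-side ras obj */
};

static struct enable_plan_sketch
plan_enable_sketch(enum ras_block_sketch block, bool feature_allowed)
{
	struct enable_plan_sketch plan;

	/* Only GFX forces the firmware command, allowed or not. */
	plan.send_enable_cmd = (block == BLOCK_GFX_SKETCH);

	/* Every block gets a ras obj only when the feature is allowed. */
	plan.setup_ras_obj = feature_allowed;

	return plan;
}
```

For example, a GFX block with the feature disallowed still gets the firmware command but no ras obj, while a non-GFX block in the same situation gets neither.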
* drm/amdgpu: Apply poison mode check to GFX IP only | Hawking Zhang | 2023-08-07 | 1 | -0/+1
  For GFX IP that only supports poison consumption, GFX RAS won't be marked as enabled. i.e., hardware doesn't support gfx sram ecc. But driver still needs to issue firmware to enable poison consumption mode for GFX IP. In such case, check poison mode and treat GFX IP as RAS capable IP block.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Only create err_count sysfs when hw_op is supported | Hawking Zhang | 2023-08-07 | 1 | -13/+18
  Some IP blocks only support partial ras feature and don't have ras counter and/or ras error status register at all. Driver should not create err_count sysfs node for those IP blocks.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Check APU flag to disable RAS | Stanley.Yang | 2023-07-25 | 1 | -1/+2
  Only disable RAS by default for aqua vanjaram on APU platform.
  Signed-off-by: Stanley.Yang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amd: Avoid reading the VBIOS part number twice | Mario Limonciello | 2023-07-21 | 1 | -4/+4
  The VBIOS part number is read both in amdgpu_atom_parse() as well as in atom_get_vbios_pn() and stored twice in the `struct atom_context` structure. Remove the first unnecessary read and move the `pr_info` line from that read into the second.
  v2: squash in unused variable removal
  Signed-off-by: Mario Limonciello <[email protected]>
  Reviewed-by: Alex Deucher <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Disable RAS by default on APU flatform | Stanley.Yang | 2023-07-18 | 1 | -2/+11
  Disable RAS feature by default for aqua vanjaram on APU platform.
  Changed from V1: Split "Disable RAS by default on APU platform" into a separate patch.
  Changed from V2: Avoid modifying global variable amdgpu_ras_enable.
  Signed-off-by: Stanley.Yang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Enable aqua vanjaram RAS | Stanley.Yang | 2023-07-18 | 1 | -0/+1
  Enable RAS for aqua vanjaram.
  Changed from V1: Split the change in amdgpu_ras_asic_supported into a separate patch.
  Changed from V2: Avoid modifying global variable amdgpu_ras_enable.
  Signed-off-by: Stanley.Yang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: skip address adjustment for GFX RAS injection | Tao Zhou | 2023-07-07 | 1 | -1/+2
  The address parameter of GFX RAS injection isn't related to XGMI node number, keep it unchanged.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Candice Li <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: gpu recovers from fatal error in poison mode | YiPeng Chai | 2023-06-30 | 1 | -0/+11
  When a fatal error occurs in ras poison mode, mode1 reset is used to recover the gpu.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add checking mc_vram_size | Stanley.Yang | 2023-06-15 | 1 | -1/+2
  Do not compare injection address with mc_vram_size if mc_vram_size is zero.
  Signed-off-by: Stanley.Yang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Optimize checking ras supported | Stanley.Yang | 2023-06-15 | 1 | -5/+3
  Use "is_app_apu" to identify whether the device is in native APU mode or carveout mode.
  Signed-off-by: Stanley.Yang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix usage of UMC fill record in RAS | Luben Tuikov | 2023-06-15 | 1 | -2/+1
  The fixed commit listed in the Fixes tag below introduced a bug in amdgpu_ras.c::amdgpu_reserve_page_direct(). It added the new amdgpu_umc_fill_error_record(), and internally in that new function the physical address (argument "uint64_t retired_page"--wrong name) is right-shifted by AMDGPU_GPU_PAGE_SHIFT. Thus, in amdgpu_reserve_page_direct() when we pass "address" to that new function, we should NOT right-shift it, since this results, erroneously, in the page address being 0 for the first 2^(2*AMDGPU_GPU_PAGE_SHIFT) memory addresses.
  This commit fixes this bug.
  Cc: Tao Zhou <[email protected]>
  Cc: Hawking Zhang <[email protected]>
  Cc: Alex Deucher <[email protected]>
  Fixes: 400013b268cb ("drm/amdgpu: add umc_fill_error_record to make code more simple")
  Signed-off-by: Luben Tuikov <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
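Note (illustrative sketch, not part of the commit): the double shift described above is easy to see with plain integers. The function names are hypothetical stand-ins for the caller and the fill helper, and the shift value of 12 is an assumption made for the example.

```c
#include <stdint.h>

/* Assumed page shift for illustration (4 KiB pages). */
#define PAGE_SHIFT_SKETCH 12

/* Stand-in for the fill helper: it converts an address to a page
 * number itself, so callers must pass the unshifted address. */
static uint64_t fill_record_page_sketch(uint64_t address)
{
	return address >> PAGE_SHIFT_SKETCH;
}

/* Buggy caller: shifts before calling, so the shift is applied twice. */
static uint64_t reserve_page_buggy_sketch(uint64_t address)
{
	return fill_record_page_sketch(address >> PAGE_SHIFT_SKETCH);
}

/* Fixed caller: passes the physical address through unchanged. */
static uint64_t reserve_page_fixed_sketch(uint64_t address)
{
	return fill_record_page_sketch(address);
}
```

With a shift of 12, any address below 2^24 yields page 0 through the buggy path: 0xFFF000 becomes 0xFFF after the first shift and 0 after the second, whereas the fixed path yields page 0xFFF.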
* drm/amdgpu: Report ras_num_recs in debugfs | Luben Tuikov | 2023-06-15 | 1 | -0/+2
  Report the number of records stored in the RAS EEPROM table in debugfs.
  This can be used by user-space to calculate the capacity of the RAS EEPROM table since "bad_page_cnt_threshold" is also reported in the same place in debugfs. See commit 7fb640714547 ("drm/amdgpu: Add bad_page_cnt_threshold to debugfs").
  ras_num_recs can already be inferred by dumping the RAS EEPROM table, also in the same debugfs location, see commit reference c65b0805e77919 (drm/amdgpu: RAS EEPROM table is now in debugfs, 2021-04-08). This commit makes it an integer value easily shown in a single file.
  Cc: Alex Deucher <[email protected]>
  Cc: Hawking Zhang <[email protected]>
  Cc: Tao Zhou <[email protected]>
  Cc: Stanley Yang <[email protected]>
  Cc: John Clements <[email protected]>
  Signed-off-by: Luben Tuikov <[email protected]>
  Link: https://lore.kernel.org/r/[email protected]
  Acked-by: Alex Deucher <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add support EEPROM table v2.1 | Stanley.Yang | 2023-06-09 | 1 | -1/+1
  Add ras info to the EEPROM table, so an app can analyse device ECC status through the EEPROM table ras info without the GPU driver.
  Signed-off-by: Stanley.Yang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: support check vcn jpeg block mask | Stanley.Yang | 2023-06-09 | 1 | -1/+5
  Support VCN/JPEG instance mask checking, pass logical mask directly except GFX/SDMA/VCN/JPEG blocks.
  Changed from V1: correct a typo
  Signed-off-by: Stanley.Yang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: perform mode2 reset for sdma fed error on gfx v11_0_3 | YiPeng Chai | 2023-06-09 | 1 | -1/+7
  perform mode2 reset for sdma fed error on gfx v11_0_3.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add check for RAS instance mask | Tao Zhou | 2023-06-09 | 1 | -0/+38
  The mask only needs to be set when the RAS block instance number is more than 1, and invalid bits should also be masked out.
  We only check valid bits for GFX and SDMA block for now, and will add check for other RAS blocks in the future.
  v2: move the check under injection operation since the mask is only used by RAS error inject.
  v3: add valid bits handling for SDMA.
  v4: print message if the mask is adjusted.
  Signed-off-by: Tao Zhou <[email protected]>
  Hawking Zhang <[email protected]>
  Reviewed-by: Stanley.Yang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
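Note (illustrative sketch, not part of the commit): a mask check of the kind described above could be written as follows. The function name and the "adjusted" out-parameter are hypothetical, and clearing the mask entirely for a single-instance block is an assumption drawn from the statement that the mask is only needed when there is more than one instance.

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Returns the mask with invalid bits cleared. *adjusted is set when
 * the result differs from the input, which is where a driver could
 * print a message (as the v4 note mentions).
 */
static uint32_t check_instance_mask_sketch(uint32_t mask,
					   unsigned int num_inst,
					   bool *adjusted)
{
	uint32_t valid;
	uint32_t out;

	if (num_inst <= 1)
		valid = 0;                 /* mask is not needed at all */
	else if (num_inst >= 32)
		valid = UINT32_MAX;        /* avoid shifting by the full width */
	else
		valid = (UINT32_C(1) << num_inst) - 1;

	out = mask & valid;
	*adjusted = (out != mask);
	return out;
}
```

For a block with 4 instances, a requested mask of 0xFF is trimmed to 0xF and flagged as adjusted, while 0x3 passes through unchanged.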
* drm/amdgpu: reorganize RAS injection flow | Tao Zhou | 2023-06-09 | 1 | -7/+6
  So GFX RAS injection could use default function if it doesn't define its own injection interface.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Reviewed-by: Stanley.Yang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add instance mask for RAS inject | Tao Zhou | 2023-06-09 | 1 | -7/+16
  User can specify injected instances by the mask. For backward compatibility, the mask value is incorporated into sub block index without interface change of RAS TA.
  User uses logical mask and driver should convert it to physical value before sending it to RAS TA.
  v2: update parameter name.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Reviewed-by: Stanley.Yang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Adjust the sequence to query ras error info | Hawking Zhang | 2023-06-09 | 1 | -6/+7
  It turns out STATUS_VALID_FLAG needs to be checked ahead of any other fields. ADDRESS_VALID_FLAG and ERR_INFO_VALID_FLAG only manage the ADDRESS and ERR_INFO fields respectively. The driver should continue to poll the ERR CNT field even if ERR_INFO_VALID_FLAG is not set.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
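Note (illustrative sketch, not part of the commit): the query order described above can be modeled with a plain struct in place of hardware registers. The bit positions, struct layouts, and function name are all hypothetical; only the ordering of the checks is taken from the commit message.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical flag bits; real register layouts differ per IP block. */
#define STATUS_VALID_SKETCH   (1u << 0)
#define ADDRESS_VALID_SKETCH  (1u << 1)
#define ERR_INFO_VALID_SKETCH (1u << 2)

/* Stand-in for one error status register set. */
struct err_reg_sketch {
	uint32_t flags;
	uint64_t address;
	uint32_t err_info;
	uint32_t err_cnt;
};

struct err_result_sketch {
	bool has_address;
	bool has_err_info;
	uint64_t address;
	uint32_t err_info;
	uint32_t err_cnt;
};

/* Returns false when the status as a whole is not valid. */
static bool query_err_sketch(const struct err_reg_sketch *reg,
			     struct err_result_sketch *out)
{
	*out = (struct err_result_sketch){ 0 };

	/* STATUS_VALID gates everything else. */
	if (!(reg->flags & STATUS_VALID_SKETCH))
		return false;

	/* Each of these flags governs only its own field. */
	if (reg->flags & ADDRESS_VALID_SKETCH) {
		out->has_address = true;
		out->address = reg->address;
	}
	if (reg->flags & ERR_INFO_VALID_SKETCH) {
		out->has_err_info = true;
		out->err_info = reg->err_info;
	}

	/* The error count is read even when ERR_INFO is not valid. */
	out->err_cnt = reg->err_cnt;
	return true;
}
```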
* drm/amdgpu: Enable persistent edc harvesting in APP APU | Hawking Zhang | 2023-06-09 | 1 | -1/+2
  Persistent edc harvesting is supported in APP APU
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add common helper to reset ras error | Hawking Zhang | 2023-06-09 | 1 | -0/+20
  Add common helper to reset ras error status. It applies to IP blocks that follow the new ras error logging register design, and need to write 0 to reset the error status. For IP blocks that don't support the new design, please still implement an ip specific helper.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Reviewed-by: Lijo Lazar <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add common helper to query ras error (v2) | Hawking Zhang | 2023-06-09 | 1 | -0/+119
  Add common helper to query ras error status and log error information, including memory block id and error count. The helpers are applicable to IP blocks that follow the new ras error logging design. For IP blocks that don't support the new design, please still implement an ip specific helper to query ras error.
  v2: optimize struct amdgpu_ras_err_status_reg_entry and the implementation in helper (Lijo/Tao)
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Reviewed-by: Lijo Lazar <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>