path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
Commit message | Author | Age | Files | Lines
* drm/amdgpu: Fix error handling in amdgpu_ras_add_bad_pages | Srinivasan Shanmugam | 2025-01-06 | 1 | -5/+16
  Ensure that appropriate error codes are returned when an error condition is detected.
  Fixes the following warnings:
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2849 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_umc_pages_in_a_row()' failed.
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2884 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_ras_mca2pa()' failed.
  v2: s/-EIO/-EINVAL, retained the use of -EINVAL from amdgpu_umc_pages_in_a_row and amdgpu_ras_mca2pa_by_idx, when the RAS context is not initialized or the convert_ras_err_addr function is unavailable. (Thomas)
  v3: Returning 0 as the absence of eh_data is acceptable. (Tao)
  Fixes: a8d133e625ce ("drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes")
  Reported-by: Dan Carpenter <[email protected]>
  Cc: YiPeng Chai <[email protected]>
  Cc: Tao Zhou <[email protected]>
  Cc: Hawking Zhang <[email protected]>
  Cc: Christian König <[email protected]>
  Cc: Alex Deucher <[email protected]>
  Signed-off-by: Srinivasan Shanmugam <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
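The error-handling convention described in this commit message can be illustrated with a minimal sketch. The type `ras_ctx` and the functions `convert_addr` and `add_bad_pages` are hypothetical stand-ins, not the driver's code; the sketch only shows the pattern of returning 0 when `eh_data` is absent and propagating a helper's `-EINVAL` instead of silently dropping it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the driver's real types or helpers. */
struct ras_ctx {
	const int *eh_data;   /* NULL when no error-handler data exists */
	int convert_ok;       /* nonzero when address conversion succeeds */
};

/* Hypothetical helper that fails with -EINVAL, like the helpers named above. */
static int convert_addr(const struct ras_ctx *ctx)
{
	return ctx->convert_ok ? 0 : -EINVAL;
}

int add_bad_pages(const struct ras_ctx *ctx)
{
	int ret;

	if (!ctx)
		return -EINVAL;
	if (!ctx->eh_data)
		return 0;             /* nothing to do; absence is acceptable */

	ret = convert_addr(ctx);
	if (ret)
		goto out;             /* keep the helper's error code */

	/* ... record the bad pages here ... */
out:
	assert(ret <= 0);         /* 0 on success, negative errno on failure */
	return ret;
}
```

A caller can then tell "nothing to do" (0) apart from a failed conversion (a negative errno), which is the distinction the static checker warning was about.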
* drm/amdgpu: Enable psp v14_0_3 RAS support for non-SRIOV configurations. | Candice Li | 2024-12-18 | 1 | -1/+1
  Enable psp v14_0_3 RAS support for non-SRIOV configurations.
  Signed-off-by: Candice Li <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Support nbif v6_3_1 fatal error handling | Candice Li | 2024-12-10 | 1 | -0/+12
  Add nbif v6_3_1 fatal error handling support.
  Signed-off-by: Candice Li <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add psp v14_0_3 ras support | Candice Li | 2024-12-10 | 1 | -0/+1
  Add psp v14_0_3 ras support.
  Signed-off-by: Candice Li <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Enable RAS for psp v13_0_12 | Hawking Zhang | 2024-12-10 | 1 | -0/+5
  Enable RAS Cap check and initialize RAS funcs for psp v13_0_12.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: correct the calculation of RAS bad page | Tao Zhou | 2024-12-10 | 1 | -8/+2
  After the introduction of NPS RAS, one bad page record on eeprom may correspond to 1 or 16 bad pages, so a bad page record and a bad page are two different concepts. Define a new variable to store the number of bad pages.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
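To make the record/page distinction concrete, here is a small sketch of counting bad pages separately from eeprom records. The struct `eeprom_record`, its `covers_row` field, and `count_bad_pages` are assumptions made for illustration; only the "1 or 16 pages per record" ratio comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* From the commit message: one record may stand for 1 or 16 bad pages. */
#define PAGES_PER_ROW_RECORD 16u

/* Hypothetical record layout; the real eeprom record has other fields. */
struct eeprom_record {
	int covers_row;   /* nonzero: record represents a whole row of pages */
};

/* Count bad pages, which can exceed the number of records. */
unsigned int count_bad_pages(const struct eeprom_record *recs, size_t n)
{
	unsigned int pages = 0;
	size_t i;

	assert(recs != NULL || n == 0);
	for (i = 0; i < n; i++)
		pages += recs[i].covers_row ? PAGES_PER_ROW_RECORD : 1u;
	return pages;
}
```

With three records of which two cover a full row, the record count is 3 while the bad page count is 33, which is why a single counter cannot serve both purposes.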
* drm/amdgpu: split ras_eeprom_init into init and check functions | Tao Zhou | 2024-12-10 | 1 | -4/+11
  The init function reads the ras table header and the check function validates the header. Call them in different stages.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: remove is_mca_add for ras_add_bad_pages | Tao Zhou | 2024-12-10 | 1 | -11/+5
  Remove unnecessary variable and simplify the logic.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes | Tao Zhou | 2024-12-10 | 1 | -15/+81
  All legacy RAS bad pages are generated in NPS1 mode, but new bad pages can be generated in any NPS mode, so we can't use retired_page stored on eeprom directly in non-nps1 mode even for legacy data. We need to take different actions for different data; new data can be distinguished from old data by the UMC_CHANNEL_IDX_V2 flag.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: support to find RAS bad pages via old TA | Tao Zhou | 2024-12-10 | 1 | -3/+25
  Old versions of the RAS TA don't support converting an MCA address stored on eeprom to a physical address (PA). Support finding all bad pages in one memory row by PA with an old RAS TA. This approach is only suitable for nps1 mode.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: store only one RAS bad page record for all pages in one row | Tao Zhou | 2024-12-10 | 1 | -8/+27
  This saves eeprom space and stays compatible with the legacy way.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Prefer RAS recovery for scheduler hang | Lijo Lazar | 2024-12-10 | 1 | -2/+53
  Before scheduling a recovery due to scheduler/job hang, check if a RAS error is detected. If so, choose RAS recovery to handle the situation. A scheduler/job hang could be the side effect of a RAS error. In such cases, it is required to go through the RAS error recovery process. A RAS error recovery process could also avoid a full device reset in certain cases.
  An error state is maintained in RAS context to detect the block affected. Fatal Error state uses unused block id. Set the block id when error is detected. If the interrupt handler detected a poison error, it's not required to look for a fatal error. Skip fatal error checking in such cases.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: do RAS MCA2PA conversion in device init phase | Tao Zhou | 2024-12-10 | 1 | -12/+82
  With the introduction of NPS mode, the memory physical address (PA) related to an MCA address varies per nps mode. We need to rely on the MCA address and convert it into a PA according to the nps mode.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add flag to indicate the type of RAS eeprom record | Tao Zhou | 2024-12-10 | 1 | -7/+26
  One UMC MCA address could map to multiple physical addresses (PA):
  AMDGPU_RAS_EEPROM_REC_PA: one record stores one PA
  AMDGPU_RAS_EEPROM_REC_MCA: one record stores one MCA address, PA is not cared about
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use reset recovery state checks | Lijo Lazar | 2024-11-20 | 1 | -5/+5
  Some in_reset checks are in fact checking whether the state is reinitialization after reset. Replace with reset_in_recovery calls to identify that it's really checking for recovery stage after reset.
  Signed-off-by: Lijo Lazar <[email protected]>
  Acked-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Implement virt req_ras_err_count | Victor Skvortsov | 2024-11-11 | 1 | -7/+65
  Enable RAS late init if VF RAS Telemetry is supported.
  When enabled, the VF can use this interface to query total RAS error counts from the host.
  The VF FB access may abruptly end due to a fatal error, therefore the VF must cache and sanitize the input.
  The Host allows 15 Telemetry messages every 60 seconds, after which the host will ignore any more incoming telemetry messages. The VF will rate limit its msg calling to once every 5 seconds (12 times in 60 seconds). While the VF is rate limited, it will continue to report the last good cached data.
  v2: Flip generate report & update statistics order for VF
  Signed-off-by: Victor Skvortsov <[email protected]>
  Acked-by: Tao Zhou <[email protected]>
  Reviewed-by: Zhigang Luo <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: VF Query RAS Caps from Host if supported | Victor Skvortsov | 2024-11-11 | 1 | -0/+5
  If VF RAS Capability support is enabled, guest is able to retrieve the real RAS support from the host.
  Signed-off-by: Victor Skvortsov <[email protected]>
  Reviewed-by: Zhigang Luo <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Skip IP coredump for RAS errors | Lijo Lazar | 2024-11-04 | 1 | -0/+1
  For RAS errors, source of error is known. Skip the core dump of IP states.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Refactor XGMI reset on init handling | Lijo Lazar | 2024-09-26 | 1 | -6/+0
  Use XGMI hive information to rely on resetting XGMI devices on initialization rather than using mgpu structure. mgpu structure may have other devices as well.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Feifei Xu <[email protected]>
  Acked-by: Rajneesh Bhardwaj <[email protected]>
  Tested-by: Rajneesh Bhardwaj <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add helper to initialize badpage info | Lijo Lazar | 2024-09-26 | 1 | -18/+38
  Add a separate function to read badpage data during initialization. Reading bad pages will need hardware access and cannot be done during reset. Hence in cases where device needs a full reset during init itself, attempting to read will cause a deadlock.
  Signed-off-by: Lijo Lazar <[email protected]>
  Reviewed-by: Feifei Xu <[email protected]>
  Reviewed-by: Alex Deucher <[email protected]>
  Acked-by: Rajneesh Bhardwaj <[email protected]>
  Tested-by: Rajneesh Bhardwaj <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use init level for pending_reset flag | Lijo Lazar | 2024-09-26 | 1 | -1/+1
  Drop pending_reset flag in gmc block. Instead use init level to determine which type of init is preferred - in this case MINIMAL.
  Signed-off-by: Lijo Lazar <[email protected]>
  Acked-by: Rajneesh Bhardwaj <[email protected]>
  Tested-by: Rajneesh Bhardwaj <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* amd/amdgpu: Reduce unnecessary repetitive GPU resets | YiPeng Chai | 2024-09-26 | 1 | -1/+20
  In multiple GPUs case, after a GPU has started resetting all GPUs on hive, other GPUs do not need to trigger GPU reset again.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix typo in the comment | Yan Zhen | 2024-09-18 | 1 | -1/+1
  Correctly spelled comments make it easier for the reader to understand the code.
  Replace 'udpate' with 'update' in the comment &
  replace 'recieved' with 'received' in the comment &
  replace 'dsiable' with 'disable' in the comment &
  replace 'Initiailize' with 'Initialize' in the comment &
  replace 'disble' with 'disable' in the comment &
  replace 'Disbale' with 'Disable' in the comment &
  replace 'enogh' with 'enough' in the comment &
  replace 'availabe' with 'available' in the comment.
  Acked-by: Christian König <[email protected]>
  Signed-off-by: Yan Zhen <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: disable GPU RAS bad page feature for specific ASIC | Tao Zhou | 2024-09-17 | 1 | -0/+5
  The feature is not applicable to specific app platform.
  v2: update the disablement condition and commit description
  v3: move the setting to amdgpu_ras_check_supported
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: remove RAS unused paramter 'err_addr' | Yang Wang | 2024-08-06 | 1 | -9/+9
  - amdgpu_ras_error_statistic_ue_count()
  - amdgpu_ras_error_statistic_ce_count()
  - amdgpu_ras_error_statistic_de_count()
  The parameter 'err_addr' is no longer used since following patch.
  Fixes: a7e8467fbeee ("drm/amdgpu: Remove unused code")
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: create function to check RAS RMA status | Tao Zhou | 2024-08-06 | 1 | -6/+16
  For the convenience of calling it globally.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add more types for boot time error reporting | Hawking Zhang | 2024-08-06 | 1 | -0/+10
  Data abort exception and unknown errors are supported.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Remove unused code | YiPeng Chai | 2024-07-23 | 1 | -23/+0
  Remove unused code.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed | YiPeng Chai | 2024-07-10 | 1 | -1/+5
  The problem case is as follows:
  1. GPU A triggers a gpu ras reset, and GPU A drives GPU B to also perform a gpu ras reset.
  2. After gpu B ras reset started, gpu B queried a DE data. Since the DE data was queried in the ras reset thread instead of the page retirement thread, bad page retirement work would not be triggered. Then even if all gpu resets are completed, the bad pages will be cached in RAM until GPU B's bad page retirement work is triggered again and then saved to eeprom.
  This patch can save the bad pages to eeprom in time after gpu ras reset is completed.
  v2:
  1. Add the above description to code comments.
  2. Reuse existing function.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: flush all cached ras bad pages to eeprom | YiPeng Chai | 2024-07-10 | 1 | -6/+29
  Before uninstalling gpu driver, flush all cached ras bad pages to eeprom.
  v2: Put the same code into a function and reuse the function.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add ras event state device attribute support | Yang Wang | 2024-07-08 | 1 | -4/+52
  add amdgpu ras 'event_state' sysfs device attribute support
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add ras POSION_CONSUMPTION event id support | Yang Wang | 2024-07-08 | 1 | -3/+13
  add amdgpu ras POSION_CONSUMPTION event id support.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add ras POSION_CREATION event id support | Yang Wang | 2024-07-08 | 1 | -3/+14
  add amdgpu ras POSION_CREATION event id support.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refine amdgpu ras event id core code | Yang Wang | 2024-07-08 | 1 | -18/+84
  v1:
  - use unified event id to manage ras events
  - add a new function amdgpu_ras_query_error_status_with_event() to accept event type as parameter.
  v2: add a warn log to show the location of function failure when calling amdgpu_ras_mark_event(). (Tao Zhou)
  v3: change RAS_EVENT_TYPE_ISR to RAS_EVENT_TYPE_FATAL.
  v4: rename amdgpu_ras_get_recovery_event() to amdgpu_ras_get_fatal_error_event().
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: sysfs node disable query error count during gpu reset | YiPeng Chai | 2024-07-08 | 1 | -0/+3
  Disable querying the error count through the sysfs node during gpu reset.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Stanley.Yang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix hbm stack id in boot error report | Hawking Zhang | 2024-07-01 | 1 | -1/+1
  To align with firmware, hbm id field 0x1 refers to hbm stack 0, 0x2 refers to hbm stack 1.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add gpu reset check and exception handling | YiPeng Chai | 2024-06-27 | 1 | -0/+53
  Add gpu reset check and exception handling for page retirement.
  v2: Clear poison consumption messages cached in fifo after non mode-1 reset.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refine poison consumption interrupt handler | YiPeng Chai | 2024-06-27 | 1 | -18/+37
  1. The poison fifo is only used for poison consumption requests.
  2. Merge reset requests when poison fifo caches multiple poison consumption messages
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refine poison creation interrupt handler | YiPeng Chai | 2024-06-27 | 1 | -22/+17
  To handle the case of a large number of ras poison interrupts:
  1. Change to use variable to record poison creation requests to avoid fifo full.
  2. Prioritize handling poison creation requests instead of following the order of requests received by the driver.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
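The split described in the two poison handler commits above (a counter for creation requests, a fifo reserved for consumption requests, creation handled first) can be sketched as below. The struct `poison_state`, the fifo size, and every function name are hypothetical; this is an illustration of the idea under those assumptions, not the driver's implementation.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_SIZE 4u   /* arbitrary size chosen for the sketch */

enum handled { HANDLED_NONE, HANDLED_CREATION, HANDLED_CONSUMPTION };

/* Hypothetical state; the driver's real bookkeeping differs. */
struct poison_state {
	unsigned int creation_pending;   /* plain counter, cannot fill a fifo */
	unsigned int fifo[FIFO_SIZE];    /* consumption messages only */
	size_t head;
	size_t count;
};

/* Record a poison creation request by bumping the counter. */
void note_creation(struct poison_state *s)
{
	s->creation_pending++;
}

/* Queue a consumption message; returns -1 when the fifo is full. */
int push_consumption(struct poison_state *s, unsigned int msg)
{
	if (s->count == FIFO_SIZE)
		return -1;
	s->fifo[(s->head + s->count) % FIFO_SIZE] = msg;
	s->count++;
	return 0;
}

/* Handle one request, giving creation requests priority over the fifo. */
enum handled handle_next(struct poison_state *s)
{
	assert(s != NULL);
	if (s->creation_pending) {
		s->creation_pending--;
		return HANDLED_CREATION;
	}
	if (s->count) {
		s->head = (s->head + 1) % FIFO_SIZE;
		s->count--;
		return HANDLED_CONSUMPTION;
	}
	return HANDLED_NONE;
}
```

Even if a consumption message arrives before a creation request, `handle_next` services the creation request first, and a burst of creation interrupts only increments the counter instead of exhausting fifo slots.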
* drm/amdgpu: add variable to record the deferred error number read by driver | YiPeng Chai | 2024-06-27 | 1 | -18/+44
  Add variable to record the deferred error number read by driver.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: set RAS fed status for more cases | Tao Zhou | 2024-06-14 | 1 | -0/+1
  Indicate fatal error for each RAS block and NBIO.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: create amdgpu_ras_in_recovery to simplify code | Tao Zhou | 2024-06-14 | 1 | -12/+19
  Reduce redundant code so that users don't need to pay attention to RAS details.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: trigger mode1 reset for RAS RMA status | Tao Zhou | 2024-06-14 | 1 | -6/+22
  Check RMA status in bad page retirement flow.
  v2: fix coding bugs in v1.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: move aca/mca init functions into ras_init() stage | Yang Wang | 2024-06-14 | 1 | -23/+50
  adjust the function position to better match aca/mca fini code in ras_fini().
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add reset source in various cases | Eric Huang | 2024-06-14 | 1 | -0/+1
  To fulfill the reset event description.
  Suggested-by: Lijo Lazar <[email protected]>
  Signed-off-by: Eric Huang <[email protected]>
  Reviewed-by: Christian König <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add RAS is_rma flag | Tao Zhou | 2024-06-05 | 1 | -5/+4
  Set the flag to true if bad page number reaches threshold.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Update programming for boot error reporting | Hawking Zhang | 2024-06-05 | 1 | -54/+45
  AMDGPU_RAS_GPU_ERR_BOOT_STATUS field is no longer valid. The polling sequence is also simplified according to the latest firmware change.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Estimate RAS reservation when report capacity v2 | Hawking Zhang | 2024-06-05 | 1 | -0/+20
  Add estimate of how much vram we need to reserve for RAS when calculating the total available vram.
  v2: apply the change to MP0 v13_0_2 and v13_0_14
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix typo in amdgpu_ras_aca_sysfs_read() function | Yang Wang | 2024-05-29 | 1 | -1/+1
  fix typo "info.ue_count" in amdgpu_ras_aca_sysfs_read() function.
  Fixes: 865d3397630b ("drm/amdgpu: add aca deferred error type support")
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: skip to create ras xxx_err_count node when ACA is enabled | Yang Wang | 2024-05-23 | 1 | -0/+6
  skip to create 'xxx_err_count' node when ACA is enabled.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>