aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
Commit message (Collapse)AuthorAgeFilesLines
* drm/amdgpu: add reset source in various casesEric Huang2024-06-141-0/+1
| | | | | | | | | To fullfill the reset event description. Suggested-by: Lijo Lazar <[email protected]> Signed-off-by: Eric Huang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add RAS is_rma flagTao Zhou2024-06-051-5/+4
| | | | | | | | Set the flag to true if bad page number reaches threshold. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Update programming for boot error reportingHawking Zhang2024-06-051-54/+45
| | | | | | | | | | AMDGPU_RAS_GPU_ERR_BOOT_STATUS field is no longer valid. The polling sequence is also simplifed according to the latest firmware change. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Estimate RAS reservation when report capacity v2Hawking Zhang2024-06-051-0/+20
| | | | | | | | | | | Add estimate of how much vram we need to reserve for RAS when caculating the total available vram. v2: apply the change to MP0 v13_0_2 and v13_0_14 Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix typo in amdgpu_ras_aca_sysfs_read() functionYang Wang2024-05-291-1/+1
| | | | | | | | | fix typo "info.ue_count" in amdgpu_ras_aca_sysfs_read() function. Fixes: 865d3397630b ("drm/amdgpu: add aca deferred error type support") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: skip to create ras xxx_err_count node when ACA is enabledYang Wang2024-05-231-0/+6
| | | | | | | | skip to create 'xxx_err_count' node when ACA is enabled. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix ACA no query result after gpu resetYang Wang2024-05-171-5/+4
| | | | | | | | fix ACA no query result after gpu reset. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix compiler 'side-effect' check issue for RAS_EVENT_LOG()Yang Wang2024-05-171-0/+18
| | | | | | | | | create a new helper function to avoid compiler 'side-effect' check about RAS_EVENT_LOG() macro. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix the null pointer dereference to ras_managerMa Jun2024-05-171-2/+5
| | | | | | | | Check ras_manager before using it Signed-off-by: Ma Jun <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Remove dead code in amdgpu_ras_add_mca_err_addrMa Jun2024-05-171-13/+0
| | | | | | | | Remove dead code in amdgpu_ras_add_mca_err_addr Signed-off-by: Ma Jun <[email protected]> Reviewed-by: YiPeng Chai <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: change log levelYiPeng Chai2024-05-081-1/+1
| | | | | | | | Change log level. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix RAS unload driver issue in SRIOVYang Wang2024-05-081-6/+8
| | | | | | | | | | | Fix null pointer issue when unload driver in SRIOV mode. Adjust the function position to ensure that the amdgpu_mca/aca_xxx_init() related functions can be initialized properly. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add psp v13_0_14 ip blockHawking Zhang2024-05-021-0/+2
| | | | | | | | Add psp v13_0_14 ip block support. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Remove redundant function callYiPeng Chai2024-04-301-16/+6
| | | | | | | | Remove redundant function call. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add MCA smu cache supportYang Wang2024-04-301-0/+9
| | | | | | | | | | | | | | | | | v1: because SMU CE valid mca bank will be cleared after reading, this patch adds mca cache at the driver level to ensure that the mca bank is not lost. v2: refine amdgpu_mca_init/fini/reset() function name. v3: add mca_cache.lock support only add CE bank to mca bank cache. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix ras mode2 reset failure in ras aca modeYiPeng Chai2024-04-261-0/+4
| | | | | | | | Fix ras mode2 reset failure in ras aca mode. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use new interface to reserve bad pageYiPeng Chai2024-04-261-3/+1
| | | | | | | | Use new interface to reserve bad page. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix address translation defectYiPeng Chai2024-04-261-1/+1
| | | | | | | | | retired_page is page frame and should be expanded to the full address when querying status. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add poison consumption handlerYiPeng Chai2024-04-261-4/+39
| | | | | | | | Add poison consumption handler. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add delay work to retire bad pagesYiPeng Chai2024-04-261-1/+35
| | | | | | | | Add delay work to retire bad pages. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add interface to update umc v12_0 ecc statusYiPeng Chai2024-04-261-0/+2
| | | | | | | | Add interface to update umc v12_0 ecc status. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add poison creation handlerYiPeng Chai2024-04-261-7/+69
| | | | | | | | Add poison creation handler. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: prepare for logging ecc errorsYiPeng Chai2024-04-261-0/+32
| | | | | | | | Prepare for logging ecc errors. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add message fifo to handle RAS poison eventsYiPeng Chai2024-04-261-0/+35
| | | | | | | | Add message fifo to handle RAS poison events. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add interface to reserve bad pageYiPeng Chai2024-04-261-0/+19
| | | | | | | | | Add interface to reserve bad page. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Set fatal errror detected flag earlierLijo Lazar2024-04-101-13/+28
| | | | | | | | | | In case of fatal errors, set FED status when interrupt is received. Set the flag on other devices in the hive before RAS recovery work. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add ras event id support for ACAYang Wang2024-03-221-5/+6
| | | | | | | | | add ras event id support for ACA. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add aca deferred error type supportYang Wang2024-03-201-2/+6
| | | | | | | | add aca deferred error type support Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: make reset method configurable for RAS poisonTao Zhou2024-03-201-2/+2
| | | | | | | | | | | | Each RAS block has different requirement for gpu reset in poison consumption handling. Add support for mmhub RAS poison consumption handling. v2: remove the mmhub poison support for kfd int v10. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add ras event id supportYang Wang2024-03-201-71/+136
| | | | | | | | | | | | | | | | add amdgpu ras event id support to better distinguish different error information sources in dmesg logs. the following log will be identify by event id: {event_id} interrupt to inform RAS event {event_id} ACA logs {event_id} errors statistic since from current injection/error query {event_id} errors statistic since from gpu load Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix ineffective ras_mask settingsStanley.Yang2024-02-261-0/+1
| | | | | | | | | | Check amdgpu_ras_mask to fix ineffective ras_mask setting due to special asic without sram ecc enable but with poison supported. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add fatal error detected flagLijo Lazar2024-02-261-0/+32
| | | | | | | | | For a RAS error that needs a full reset to recover, set the fatal error status. Clear the status once the device is reset. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: disable RAS feature when finiTao Zhou2024-01-311-1/+1
| | | | | | | | Send RAS disable feature command in fini. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Update boot time errors polling sequenceHawking Zhang2024-01-311-1/+13
| | | | | | | | | Update boot time errors polling sequence to align with the latest firmware change. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Frank Min <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Support passing poison consumption ras block to SRIOVYiPeng Chai2024-01-251-1/+1
| | | | | | | | | Support passing poison consumption ras blocks to SRIOV. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: adjust aca init/fini sequence to match gpu resetYang Wang2024-01-251-2/+13
| | | | | | | | | | - move aca init/fini function into ras init/fini to adapt gpu reset sequence. - add new function amdgpu_aca_reset() Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix module unload hang with RAS enabledMukul Joshi2024-01-251-0/+4
| | | | | | | | | | | | | The driver unload hangs because the page retirement kthread cannot be stopped as it is sleeping and waiting on page retirement event to occur. Add kthread_should_stop() to the event condition to wake up the kthread when kthread stop is called during driver unload. Fixes: 3fdcd0a31d7a ("drm/amdgpu: Prepare for asynchronous processing of umc page retirement") Signed-off-by: Mukul Joshi <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: skip call ras_late_init if ras block is not supportedYang Wang2024-01-221-2/+5
| | | | | | | | skip call ras_late_init callback if ras block is not supported. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu:Support retiring multiple MCA error address pagesYiPeng Chai2024-01-221-8/+35
| | | | | | | | | Support retiring multiple MCA error address pages in one in-band query for umc v12_0. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use asynchronous polling to handle umc_v12_0 poisoningYiPeng Chai2024-01-221-0/+5
| | | | | | | | | | | | Use asynchronous polling to handle umc_v12_0 poisoning. v2: 1. Change function name. 2. Change the debugging information content. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix ras features value calltraceStanley.Yang2024-01-221-5/+6
| | | | | | | | | | The high three bits of ras features mask indicate socket id, it should skip to check high three bits of ras features mask before disable all ras features. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Prepare for asynchronous processing of umc page retirementYiPeng Chai2024-01-221-0/+34
| | | | | | | | Preparing for asynchronous processing of umc page retirement. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Show deferred error count for UMCStanley.Yang2024-01-181-2/+6
| | | | | | | | | Show deferred error count for UMC syfs node Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix UBSAN array-index-out-of-bounds for ras_block_string[]Yang Wang2024-01-181-1/+4
| | | | | | | | | fix array index out of bounds issue for ras_block_string[] array. Fixes: 30df05fb74f6 ("drm/amdgpu: Align ras block enum with firmware") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Centralize ras cap query to amdgpu_ras_check_supportedHawking Zhang2024-01-151-77/+93
| | | | | | | | | | Move ras capablity check to amdgpu_ras_check_supported. Driver will query ras capablity through psp interace, or vbios interface, or specific ip callbacks. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Log deferred error separatelyCandice Li2024-01-151-20/+96
| | | | | | | | | Separate deferred error from UE and CE and log it individually. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add aca sysfs supportYang Wang2024-01-151-0/+15
| | | | | | | | add aca sysfs node support Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add amdgpu ras aca query interfaceYang Wang2024-01-151-15/+90
| | | | | | | | | | | | v1: add ACA error query interface v2: Add a new helper function to determine whether to use ACA or MCA. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add ACA bank dump debugfs supportYang Wang2024-01-151-0/+14
| | | | | | | | add ACA bank dump debugfs support Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add ras helper to query boot errors v2Hawking Zhang2024-01-151-0/+95
| | | | | | | | | | | Add ras helper function to query boot time gpu errors. v2: use aqua_vanjaram smn addressing pattern Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>