| | Commit message | Author | Age | Files | Lines |
|---|---|---|---|---|---|
| * | drm/amdgpu: add reset source in various cases | Eric Huang | 2024-06-14 | 1 | -0/+1 |
| | To fulfill the reset event description. Suggested-by: Lijo Lazar <[email protected]> Signed-off-by: Eric Huang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: add RAS is_rma flag | Tao Zhou | 2024-06-05 | 1 | -5/+4 |
| | Set the flag to true if bad page number reaches threshold. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Update programming for boot error reporting | Hawking Zhang | 2024-06-05 | 1 | -54/+45 |
| | AMDGPU_RAS_GPU_ERR_BOOT_STATUS field is no longer valid. The polling sequence is also simplified according to the latest firmware change. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Estimate RAS reservation when report capacity v2 | Hawking Zhang | 2024-06-05 | 1 | -0/+20 |
| | Add estimate of how much vram we need to reserve for RAS when calculating the total available vram. v2: apply the change to MP0 v13_0_2 and v13_0_14 Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: fix typo in amdgpu_ras_aca_sysfs_read() function | Yang Wang | 2024-05-29 | 1 | -1/+1 |
| | fix typo "info.ue_count" in amdgpu_ras_aca_sysfs_read() function. Fixes: 865d3397630b ("drm/amdgpu: add aca deferred error type support") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: skip to create ras xxx_err_count node when ACA is enabled | Yang Wang | 2024-05-23 | 1 | -0/+6 |
| | skip to create 'xxx_err_count' node when ACA is enabled. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: fix ACA no query result after gpu reset | Yang Wang | 2024-05-17 | 1 | -5/+4 |
| | fix ACA no query result after gpu reset. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: fix compiler 'side-effect' check issue for RAS_EVENT_LOG() | Yang Wang | 2024-05-17 | 1 | -0/+18 |
| | create a new helper function to avoid compiler 'side-effect' check about RAS_EVENT_LOG() macro. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Fix the null pointer dereference to ras_manager | Ma Jun | 2024-05-17 | 1 | -2/+5 |
| | Check ras_manager before using it Signed-off-by: Ma Jun <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Remove dead code in amdgpu_ras_add_mca_err_addr | Ma Jun | 2024-05-17 | 1 | -13/+0 |
| | Remove dead code in amdgpu_ras_add_mca_err_addr Signed-off-by: Ma Jun <[email protected]> Reviewed-by: YiPeng Chai <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: change log level | YiPeng Chai | 2024-05-08 | 1 | -1/+1 |
| | Change log level. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: fix RAS unload driver issue in SRIOV | Yang Wang | 2024-05-08 | 1 | -6/+8 |
| | Fix null pointer issue when unload driver in SRIOV mode. Adjust the function position to ensure that the amdgpu_mca/aca_xxx_init() related functions can be initialized properly. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Add psp v13_0_14 ip block | Hawking Zhang | 2024-05-02 | 1 | -0/+2 |
| | Add psp v13_0_14 ip block support. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Remove redundant function call | YiPeng Chai | 2024-04-30 | 1 | -16/+6 |
| | Remove redundant function call. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: add MCA smu cache support | Yang Wang | 2024-04-30 | 1 | -0/+9 |
| | v1: because SMU CE valid mca bank will be cleared after reading, this patch adds mca cache at the driver level to ensure that the mca bank is not lost. v2: refine amdgpu_mca_init/fini/reset() function name. v3: add mca_cache.lock support only add CE bank to mca bank cache. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Fix ras mode2 reset failure in ras aca mode | YiPeng Chai | 2024-04-26 | 1 | -0/+4 |
| | Fix ras mode2 reset failure in ras aca mode. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Use new interface to reserve bad page | YiPeng Chai | 2024-04-26 | 1 | -3/+1 |
| | Use new interface to reserve bad page. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Fix address translation defect | YiPeng Chai | 2024-04-26 | 1 | -1/+1 |
| | retired_page is page frame and should be expanded to the full address when querying status. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: add poison consumption handler | YiPeng Chai | 2024-04-26 | 1 | -4/+39 |
| | Add poison consumption handler. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Add delay work to retire bad pages | YiPeng Chai | 2024-04-26 | 1 | -1/+35 |
| | Add delay work to retire bad pages. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: add interface to update umc v12_0 ecc status | YiPeng Chai | 2024-04-26 | 1 | -0/+2 |
| | Add interface to update umc v12_0 ecc status. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: add poison creation handler | YiPeng Chai | 2024-04-26 | 1 | -7/+69 |
| | Add poison creation handler. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: prepare for logging ecc errors | YiPeng Chai | 2024-04-26 | 1 | -0/+32 |
| | Prepare for logging ecc errors. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: add message fifo to handle RAS poison events | YiPeng Chai | 2024-04-26 | 1 | -0/+35 |
| | Add message fifo to handle RAS poison events. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Add interface to reserve bad page | YiPeng Chai | 2024-04-26 | 1 | -0/+19 |
| | Add interface to reserve bad page. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Set fatal errror detected flag earlier | Lijo Lazar | 2024-04-10 | 1 | -13/+28 |
| | In case of fatal errors, set FED status when interrupt is received. Set the flag on other devices in the hive before RAS recovery work. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: add ras event id support for ACA | Yang Wang | 2024-03-22 | 1 | -5/+6 |
| | add ras event id support for ACA. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: add aca deferred error type support | Yang Wang | 2024-03-20 | 1 | -2/+6 |
| | add aca deferred error type support Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: make reset method configurable for RAS poison | Tao Zhou | 2024-03-20 | 1 | -2/+2 |
| | Each RAS block has different requirement for gpu reset in poison consumption handling. Add support for mmhub RAS poison consumption handling. v2: remove the mmhub poison support for kfd int v10. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: add ras event id support | Yang Wang | 2024-03-20 | 1 | -71/+136 |
| | add amdgpu ras event id support to better distinguish different error information sources in dmesg logs. the following log will be identify by event id: {event_id} interrupt to inform RAS event; {event_id} ACA logs; {event_id} errors statistic since from current injection/error query; {event_id} errors statistic since from gpu load. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Fix ineffective ras_mask settings | Stanley.Yang | 2024-02-26 | 1 | -0/+1 |
| | Check amdgpu_ras_mask to fix ineffective ras_mask setting due to special asic without sram ecc enable but with poison supported. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Add fatal error detected flag | Lijo Lazar | 2024-02-26 | 1 | -0/+32 |
| | For a RAS error that needs a full reset to recover, set the fatal error status. Clear the status once the device is reset. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: disable RAS feature when fini | Tao Zhou | 2024-01-31 | 1 | -1/+1 |
| | Send RAS disable feature command in fini. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Update boot time errors polling sequence | Hawking Zhang | 2024-01-31 | 1 | -1/+13 |
| | Update boot time errors polling sequence to align with the latest firmware change. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Frank Min <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Support passing poison consumption ras block to SRIOV | YiPeng Chai | 2024-01-25 | 1 | -1/+1 |
| | Support passing poison consumption ras blocks to SRIOV. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: adjust aca init/fini sequence to match gpu reset | Yang Wang | 2024-01-25 | 1 | -2/+13 |
| | - move aca init/fini function into ras init/fini to adapt gpu reset sequence. - add new function amdgpu_aca_reset() Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Fix module unload hang with RAS enabled | Mukul Joshi | 2024-01-25 | 1 | -0/+4 |
| | The driver unload hangs because the page retirement kthread cannot be stopped as it is sleeping and waiting on page retirement event to occur. Add kthread_should_stop() to the event condition to wake up the kthread when kthread stop is called during driver unload. Fixes: 3fdcd0a31d7a ("drm/amdgpu: Prepare for asynchronous processing of umc page retirement") Signed-off-by: Mukul Joshi <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: skip call ras_late_init if ras block is not supported | Yang Wang | 2024-01-22 | 1 | -2/+5 |
| | skip call ras_late_init callback if ras block is not supported. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu:Support retiring multiple MCA error address pages | YiPeng Chai | 2024-01-22 | 1 | -8/+35 |
| | Support retiring multiple MCA error address pages in one in-band query for umc v12_0. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Use asynchronous polling to handle umc_v12_0 poisoning | YiPeng Chai | 2024-01-22 | 1 | -0/+5 |
| | Use asynchronous polling to handle umc_v12_0 poisoning. v2: 1. Change function name. 2. Change the debugging information content. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Fix ras features value calltrace | Stanley.Yang | 2024-01-22 | 1 | -5/+6 |
| | The high three bits of ras features mask indicate socket id, it should skip to check high three bits of ras features mask before disable all ras features. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Prepare for asynchronous processing of umc page retirement | YiPeng Chai | 2024-01-22 | 1 | -0/+34 |
| | Preparing for asynchronous processing of umc page retirement. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Show deferred error count for UMC | Stanley.Yang | 2024-01-18 | 1 | -2/+6 |
| | Show deferred error count for UMC sysfs node Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: fix UBSAN array-index-out-of-bounds for ras_block_string[] | Yang Wang | 2024-01-18 | 1 | -1/+4 |
| | fix array index out of bounds issue for ras_block_string[] array. Fixes: 30df05fb74f6 ("drm/amdgpu: Align ras block enum with firmware") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Centralize ras cap query to amdgpu_ras_check_supported | Hawking Zhang | 2024-01-15 | 1 | -77/+93 |
| | Move ras capability check to amdgpu_ras_check_supported. Driver will query ras capability through psp interface, or vbios interface, or specific ip callbacks. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Log deferred error separately | Candice Li | 2024-01-15 | 1 | -20/+96 |
| | Separate deferred error from UE and CE and log it individually. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: add aca sysfs support | Yang Wang | 2024-01-15 | 1 | -0/+15 |
| | add aca sysfs node support Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: add amdgpu ras aca query interface | Yang Wang | 2024-01-15 | 1 | -15/+90 |
| | v1: add ACA error query interface v2: Add a new helper function to determine whether to use ACA or MCA. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: add ACA bank dump debugfs support | Yang Wang | 2024-01-15 | 1 | -0/+14 |
| | add ACA bank dump debugfs support Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
| * | drm/amdgpu: Add ras helper to query boot errors v2 | Hawking Zhang | 2024-01-15 | 1 | -0/+95 |
| | Add ras helper function to query boot time gpu errors. v2: use aqua_vanjaram smn addressing pattern Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]> | | | | |
