path: root/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
Commit message | Author | Age | Files | Lines
* drm/amdgpu: set flip bits for RAS bad pages | Tao Zhou | 2025-05-13 | 1 | -13/+7
  Make the code more general, so the user doesn't need to pay attention to the details of setting flip bits.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Modify the count method of defer error | Ce Sun | 2025-05-13 | 1 | -2/+6
  Keep the number of newly added DE counts consistent with the number of newly added error addresses.
  Signed-off-by: Ce Sun <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgu: get RAS retire flip bits for new type of HBM | Tao Zhou | 2025-05-13 | 1 | -10/+25
  Get RAS retire flip bits for HBM with different types in various NPS modes. Also set flip row bit and MCA R13 bit in PA in different NPS modes.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: implement get_retire_flip_bits for UMC v12 | Tao Zhou | 2025-05-13 | 1 | -29/+53
  The RAS bad page retire flip bits can be set per vram type, vram vendor and NPS mode.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: adjust high bits for RAS retired page | Tao Zhou | 2025-05-13 | 1 | -2/+11
  Per the UMC address conversion algorithm, the high row bits of a UMC MCA address are changed when the address is converted into a normalized address on specific ASICs.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add loop bits for NPS2 page retirement | Tao Zhou | 2025-04-08 | 1 | -0/+10
  Support NPS2 RAS.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Parse all deferred errors with UMC aca handle | Xiang Liu | 2025-03-26 | 1 | -1/+2
  We should only increase the deferred error count in the UMC block.
  Signed-off-by: Xiang Liu <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Include ACA error type in aca bank | Hawking Zhang | 2025-02-17 | 1 | -0/+1
  ACA error types managed by the driver do not have a direct 1:1 correspondence with those managed by firmware. To address this, for each ACA bank, include both the ACA error type and the ACA SMU type. This addition is useful for creating CPER records.
  Signed-off-by: Hawking Zhang <[email protected]>
  Reviewed-by: Yang Wang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: set UMC PA per NPS mode when PA is 0 | Tao Zhou | 2024-12-10 | 1 | -1/+8
  The shift bit of PA varies according to NPS mode due to the different address formats.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
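The idea of an NPS-dependent shift can be pictured with a small user-space sketch. Everything in it is an assumption made for illustration: the `sketch_nps_mode` enum, the shift values, and the `sketch_fixup_zero_pa` helper are hypothetical and are not taken from the driver.

```c
#include <stdint.h>

/* Hypothetical NPS modes; illustrative names, not the driver's enum. */
enum sketch_nps_mode { SKETCH_NPS1, SKETCH_NPS2, SKETCH_NPS4 };

/* Assumed shift amounts, chosen only to show that the shift depends on mode. */
unsigned int sketch_pa_shift(enum sketch_nps_mode mode)
{
	switch (mode) {
	case SKETCH_NPS2:
		return 11;
	case SKETCH_NPS4:
		return 10;
	case SKETCH_NPS1:
	default:
		return 12;
	}
}

/*
 * If the reported PA is 0, derive a placeholder PA from a hypothetical
 * index using the mode-specific shift; otherwise keep the PA unchanged.
 */
uint64_t sketch_fixup_zero_pa(uint64_t pa, uint64_t index,
			      enum sketch_nps_mode mode)
{
	if (pa != 0)
		return pa;
	return index << sketch_pa_shift(mode);
}
```

Keeping the shift lookup in one helper means callers never hard-code a mode-specific constant.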
* drm/amdgpu: add interface to get die id from memory address | Tao Zhou | 2024-12-10 | 1 | -0/+26
  And implement it for UMC v12_0. The die id is calculated from the IPID register in the bad page retirement flow, but we don't store it on eeprom, and it can also be derived from the physical address.
  v2: get PA_C4 and PA_R13 from MCA address since they may be cleared in retired page.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: save UMC global channel index to eeprom | Tao Zhou | 2024-12-10 | 1 | -5/+8
  Save the global channel index returned by RAS TA to eeprom. The memory physical address can then be obtained from the MCA address and the channel index.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: retire RAS bad pages in different NPS modes | Tao Zhou | 2024-12-10 | 1 | -23/+41
  The format of the memory normalized address changes per NPS mode, so the bit mapping needs to be adjusted according to NPS mode.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add return value for convert_ras_err_addr | Tao Zhou | 2024-12-10 | 1 | -4/+8
  So the upper layer can return failure directly if address conversion fails.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: make convert_ras_err_addr visible outside UMC block | Tao Zhou | 2024-12-10 | 1 | -58/+4
  And change some UMC v12 specific functions to generic versions, so the code can be shared.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: store PA with column bits cleared for RAS bad page | Tao Zhou | 2024-12-10 | 1 | -1/+3
  So the code can be simplified, and there is no need to expose the details of the PA format outside address conversion.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
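As a rough illustration of storing an address with its column bits cleared, the following user-space sketch masks out a contiguous bit field. The field position and width (`SKETCH_COL_LSB`, `SKETCH_COL_WIDTH`) are invented for the example; the real column bits depend on the hardware address format and may not be contiguous.

```c
#include <stdint.h>

/* Hypothetical layout: column bits occupy PA bits [8, 13). */
#define SKETCH_COL_LSB   8u
#define SKETCH_COL_WIDTH 5u
#define SKETCH_COL_MASK \
	(((UINT64_C(1) << SKETCH_COL_WIDTH) - 1) << SKETCH_COL_LSB)

/* Return the PA with every (assumed) column bit cleared. */
uint64_t sketch_pa_clear_column_bits(uint64_t pa)
{
	return pa & ~SKETCH_COL_MASK;
}

/* Two PAs map to the same stored record if they differ only in column bits. */
int sketch_same_stored_record(uint64_t pa_a, uint64_t pa_b)
{
	return sketch_pa_clear_column_bits(pa_a) ==
	       sketch_pa_clear_column_bits(pa_b);
}
```

With the mask applied once at store time, code that later reads the record does not need to know where the column bits live.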
* drm/amdgpu: remove redundant RAS error address coversion code | Tao Zhou | 2024-12-10 | 1 | -76/+58
  Only one interface is responsible for the conversion.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: simplify RAS page retirement in one memory row | Tao Zhou | 2024-12-10 | 1 | -34/+23
  Take R13 and column bits as a whole for UMC v12.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
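Treating a set of bits "as a whole" can be sketched as a single loop over every combination of those bits, rather than nested loops per bit group. The helper below is a generic user-space illustration under that assumption; `sketch_expand_flip_bits` and the bit positions passed to it are hypothetical and do not reproduce the driver's code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Expand one base address into every address reachable by toggling the
 * given bit positions. Returns the number of addresses written (2^nbits),
 * or 0 if out[] is too small or nbits is unreasonably large.
 */
size_t sketch_expand_flip_bits(uint64_t base, const unsigned int *bit_pos,
			       size_t nbits, uint64_t *out, size_t out_len)
{
	size_t count, i, b;

	if (nbits >= 16)	/* keep the sketch bounded */
		return 0;
	count = (size_t)1 << nbits;
	if (count > out_len)
		return 0;

	/* Start from the base with all flip bits cleared. */
	for (b = 0; b < nbits; b++)
		base &= ~(UINT64_C(1) << bit_pos[b]);

	/* Bit b of the counter i decides whether flip bit b is set. */
	for (i = 0; i < count; i++) {
		uint64_t addr = base;

		for (b = 0; b < nbits; b++)
			if (i & ((size_t)1 << b))
				addr |= UINT64_C(1) << bit_pos[b];
		out[i] = addr;
	}
	return count;
}
```

Adding or removing a flip bit then only changes the contents of `bit_pos[]`, not the shape of the loop.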
* drm/amdgpu: remove RAS unused paramter 'err_addr' | Yang Wang | 2024-08-06 | 1 | -3/+3
  - amdgpu_ras_error_statistic_ue_count()
  - amdgpu_ras_error_statistic_ce_count()
  - amdgpu_ras_error_statistic_de_count()
  The parameter 'err_addr' is no longer used since the following patch.
  Fixes: a7e8467fbeee ("drm/amdgpu: Remove unused code")
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: optimize logging deferred error info | YiPeng Chai | 2024-07-23 | 1 | -34/+31
  1. Use pa_pfn as the radix-tree key index to log deferred error info.
  2. Use local array to store a row of bad pages.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
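Keying the log by page frame number means two errors in the same page collapse into one entry. The sketch below shows that effect in user space with a small linear table standing in for the kernel radix tree; the page shift, table size, and all `sketch_` names are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT  12u	/* assumed 4 KiB pages */
#define SKETCH_MAX_RECORDS 64u

/* Stand-in for a radix tree: a flat table of already-logged PFNs. */
struct sketch_de_log {
	uint64_t pfn[SKETCH_MAX_RECORDS];
	size_t count;
};

/* Returns 1 if newly logged, 0 if the PFN was already present, -1 if full. */
int sketch_de_log_insert(struct sketch_de_log *log, uint64_t pa)
{
	uint64_t pfn = pa >> SKETCH_PAGE_SHIFT;
	size_t i;

	for (i = 0; i < log->count; i++)
		if (log->pfn[i] == pfn)
			return 0;
	if (log->count == SKETCH_MAX_RECORDS)
		return -1;
	log->pfn[log->count++] = pfn;
	return 1;
}
```

A real radix tree gives the same deduplication with better lookup cost; the linear scan here only keeps the example short.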
* drm/amdgpu: optimize umc v12 address conversion function | YiPeng Chai | 2024-07-23 | 1 | -39/+77
  Split into 3 parts:
  1. Convert soc physical address via ras ta.
  2. Expand bad pages from soc physical address.
  3. Dump bad address info.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed | YiPeng Chai | 2024-07-10 | 1 | -0/+18
  The problem case is as follows:
  1. GPU A triggers a gpu ras reset, and GPU A drives GPU B to also perform a gpu ras reset.
  2. After GPU B's ras reset started, GPU B queried DE data. Since the DE data was queried in the ras reset thread instead of the page retirement thread, bad page retirement work would not be triggered. Then, even if all gpu resets are completed, the bad pages will be cached in RAM until GPU B's bad page retirement work is triggered again and then saved to eeprom.
  This patch saves the bad pages to eeprom in time after the gpu ras reset is completed.
  v2:
  1. Add the above description to code comments.
  2. Reuse existing function.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add variable to record the deferred error number read by driver | YiPeng Chai | 2024-06-27 | 1 | -2/+2
  Add variable to record the deferred error number read by driver.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: change log level | YiPeng Chai | 2024-05-08 | 1 | -2/+2
  Change log level.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Yang Wang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Remove unused code | YiPeng Chai | 2024-04-30 | 1 | -71/+0
  Remove unused code.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: support ACA logging ecc errors | YiPeng Chai | 2024-04-26 | 1 | -0/+5
  Support ACA logging ecc errors.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: retire bad pages for umc v12_0 | YiPeng Chai | 2024-04-26 | 1 | -2/+57
  Retire bad pages for umc v12_0.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: umc v12_0 logs ecc errors | YiPeng Chai | 2024-04-26 | 1 | -1/+40
  1. umc v12_0 logs ecc errors.
  2. Reserve newly detected ecc error pages.
  3. Add tag for bad pages, so that they can be retired later.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: umc v12_0 converts error address | YiPeng Chai | 2024-04-26 | 1 | -1/+93
  Umc v12_0 converts error address.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add interface to update umc v12_0 ecc status | YiPeng Chai | 2024-04-26 | 1 | -0/+24
  Add interface to update umc v12_0 ecc status.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: retire UMC v12 mca_addr_to_pa | Tao Zhou | 2024-04-10 | 1 | -99/+6
  RAS TA will handle it, so the function is no longer needed.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: simplify convert_error_address interface for UMC v12 | Tao Zhou | 2024-03-22 | 1 | -27/+30
  Replace separate parameters with struct ta_ras_query_address_input.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Stanley.Yang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add socket id parameter for psp query address cmd | Tao Zhou | 2024-03-22 | 1 | -3/+11
  And set the socket id.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Stanley.Yang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: retrieve umc odecc error count for aca umc v12.0 | Yang Wang | 2024-03-22 | 1 | -2/+7
  Retrieve umc odecc error count for aca umc v12.0.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add umc v12.0.0 deferred error support | Yang Wang | 2024-03-20 | 1 | -24/+13
  Add umc v12.0.0 deferred error support.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: retire unused aca_bank_report data structure | Yang Wang | 2024-03-20 | 1 | -3/+3
  Retire unused aca_bank_report data structure.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refine aca error cache for umc v12.0 | Yang Wang | 2024-03-20 | 1 | -3/+10
  Refine aca error cache for umc v12.0.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add new aca_smu_type support | Yang Wang | 2024-03-20 | 1 | -5/+5
  Add new types to distinguish between ACA error type and smu mca type.
  e.g.: ACA_ERROR_TYPE_DEFERRED does not match any valid smu mca bank channel, so add the new type 'aca_smu_type' to distinguish the aca error type from the smu mca type.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
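The reason for keeping two separate enumerations can be sketched as follows: one type has a member with no counterpart in the other, so a shared enum would force an invalid value somewhere. The enums and the mapping function below are simplified, hypothetical stand-ins (all prefixed `sketch_`/`SKETCH_`); they are not the kernel's definitions, and the choice to report "no match" for the deferred type is an assumption drawn only from the description above.

```c
#include <stddef.h>

/* Simplified stand-in for the driver-side error classification. */
enum sketch_aca_error_type {
	SKETCH_ACA_ERROR_UE,
	SKETCH_ACA_ERROR_CE,
	SKETCH_ACA_ERROR_DEFERRED,
};

/* Simplified stand-in for the firmware-side bank channel type. */
enum sketch_aca_smu_type {
	SKETCH_ACA_SMU_UE,
	SKETCH_ACA_SMU_CE,
};

/*
 * Map a driver error type to a firmware channel type.
 * Returns 0 and sets *smu on a direct match, -1 when there is none.
 */
int sketch_error_to_smu_type(enum sketch_aca_error_type err,
			     enum sketch_aca_smu_type *smu)
{
	switch (err) {
	case SKETCH_ACA_ERROR_UE:
		*smu = SKETCH_ACA_SMU_UE;
		return 0;
	case SKETCH_ACA_ERROR_CE:
		*smu = SKETCH_ACA_SMU_CE;
		return 0;
	case SKETCH_ACA_ERROR_DEFERRED:
	default:
		return -1;	/* no directly matching channel in this sketch */
	}
}
```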
* drm/amdgpu: add ras event id support | Yang Wang | 2024-03-20 | 1 | -2/+8
  Add amdgpu ras event id support to better distinguish different error information sources in dmesg logs.
  The following logs are identified by event id:
  {event_id} interrupt to inform RAS event
  {event_id} ACA logs
  {event_id} errors statistic since from current injection/error query
  {event_id} errors statistic since from gpu load
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Reviewed-by: Tao Zhou <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
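A minimal user-space sketch of prefixing a log line with an event id is shown below. The `sketch_format_ras_event` helper and the exact `{id}` rendering are assumptions based on the `{event_id}` placeholder in the message above; the driver's actual logging macro is not reproduced here.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Write "{<event_id>} <msg>" into buf. Returns the number of characters
 * that would have been written, as snprintf() does.
 */
int sketch_format_ras_event(char *buf, size_t len, uint64_t event_id,
			    const char *msg)
{
	return snprintf(buf, len, "{%" PRIu64 "} %s", event_id, msg);
}
```

With a shared prefix, lines that belong to the same RAS event can be grouped by searching dmesg for one id.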
* drm/amdgpu: add deferred error check for UMC v12 address query | Tao Zhou | 2024-03-01 | 1 | -1/+2
  Both RAS UE and deferred errors need page retirement.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: use PSP address query command | Tao Zhou | 2024-01-31 | 1 | -7/+39
  Get UMC physical address from PSP in RAS error address conversion.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu:Support retiring multiple MCA error address pages | YiPeng Chai | 2024-01-22 | 1 | -26/+30
  Support retiring multiple MCA error address pages in one in-band query for umc v12_0.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add interface to check mca umc status | YiPeng Chai | 2024-01-22 | 1 | -0/+20
  Add interface to check mca umc status.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add log info for umc_v12_0 | YiPeng Chai | 2024-01-22 | 1 | -0/+11
  Add log info for umc_v12_0.
  v2: Delete redundant logs.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: update error condition check for umc_v12_0_query_error_address | Tao Zhou | 2024-01-18 | 1 | -4/+1
  Deferred errors are also taken into account.
  Signed-off-by: Tao Zhou <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Log deferred error separately | Candice Li | 2024-01-15 | 1 | -35/+25
  Separate deferred errors from UE and CE and log them individually.
  Signed-off-by: Candice Li <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add umc v12.0 ACA support | Yang Wang | 2024-01-15 | 1 | -0/+58
  Add umc v12.0 ACA driver support.
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add umc page retirement for umc v12_0 | YiPeng Chai | 2023-12-19 | 1 | -0/+56
  Add umc page retirement for umc v12_0.
  V2:
  1. Changed umc page retirement check condition to call umc_v12_0_is_uncorrectable_error.
  2. Use memset to clear the contents of the umc error address structure.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add poison mode check error condition for umc v12_0 | YiPeng Chai | 2023-12-19 | 1 | -5/+15
  Add poison mode check error condition for umc v12_0.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: MCA supports recording umc address information | YiPeng Chai | 2023-12-19 | 1 | -2/+2
  MCA supports recording umc address information.
  V2: Move err_addr variable from struct ras_err_node to struct ras_err_info.
  Signed-off-by: YiPeng Chai <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: correct smu v13.0.6 umc ras error check | Yang Wang | 2023-11-09 | 1 | -2/+2
  correct smu v13.0.0 umc ras error check
  Signed-off-by: Yang Wang <[email protected]>
  Reviewed-by: Hawking Zhang <[email protected]>
  Signed-off-by: Alex Deucher <[email protected]>