aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
Commit message (Collapse)AuthorAgeFilesLines
* drm/amdgpu: Modify the count method of defer errorCe Sun2025-05-131-0/+1
| | | | | | | | | The number of newly added de counts and the number of newly added error addresses remain consistent Signed-off-by: Ce Sun <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use active umc info from discoveryLijo Lazar2025-02-131-0/+42
| | | | | | | | | | | | | | There could be configs where some UMC instances are harvested. This information is obtained through discovery data and populated in umc.active_mask. Avoid reassigning this as AID mask, instead use the mask directly while iterating through umc instances. This is to avoid accesses to harvested UMC instances. v2: fix warning (Alex) Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: correct the calculation of RAS bad pageTao Zhou2024-12-101-1/+2
| | | | | | | | | | After the introduction of NPS RAS, one bad page record on eeprom may be related to 1 or 16 bad pages, so the bad page record and bad page are two different concepts, define a new variable to store bad page number. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modesTao Zhou2024-12-101-2/+2
| | | | | | | | | | | | All legacy RAS bad pages are generated in NPS1 mode, but new bad page can be generated in any NPS mode, so we can't use retired_page stored on eeprom directly in non-nps1 mode even for legacy data. We need to take different actions for different data, new data can be identified from old data by UMC_CHANNEL_IDX_V2 flag. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: save UMC global channel index to eepromTao Zhou2024-12-101-5/+2
| | | | | | | | | Save the global channel index returned by RAS TA to eeprom. We can get memory physical address by MCA address and channel index. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add function to find all memory pages in one physical rowTao Zhou2024-12-101-15/+23
| | | | | | | | And the function can be reused across amdgpu driver. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add return value for convert_ras_err_addrTao Zhou2024-12-101-6/+13
| | | | | | | | So upper layer can return failure directly if address conversion fails. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: reduce memory usage for umc_lookup_bad_pages_in_a_rowTao Zhou2024-12-101-3/+2
| | | | | | | | | The function handles one page in one time, allocating umc.retire_unit bad page records is enough. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: make convert_ras_err_addr visible outside UMC blockTao Zhou2024-12-101-0/+63
| | | | | | | | | And change some UMC v12 specific functions to generic version, so the code can be shared. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Implement virt req_ras_err_countVictor Skvortsov2024-11-111-0/+3
| | | | | | | | | | | | | | | | | | | | | | | Enable RAS late init if VF RAS Telemetry is supported. When enabled, the VF can use this interface to query total RAS error counts from the host. The VF FB access may abruptly end due to a fatal error, therefore the VF must cache and sanitize the input. The Host allows 15 Telemetry messages every 60 seconds, afterwhich the host will ignore any more in-coming telemetry messages. The VF will rate limit its msg calling to once every 5 seconds (12 times in 60 seconds). While the VF is rate limited, it will continue to report the last good cached data. v2: Flip generate report & update statistics order for VF Signed-off-by: Victor Skvortsov <[email protected]> Acked-by: Tao Zhou <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: create function to check RAS RMA statusTao Zhou2024-08-061-1/+1
| | | | | | | | In the convenience of calling it globally. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Remove unused codeYiPeng Chai2024-07-231-86/+0
| | | | | | | | Remove unused code. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: optimize logging deferred error infoYiPeng Chai2024-07-231-11/+3
| | | | | | | | | | 1. Use pa_pfn as the radix-tree key index to log deferred error info. 2. Use local array to store a row of bad pages. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refine poison consumption interrupt handlerYiPeng Chai2024-06-271-5/+6
| | | | | | | | | | | 1. The poison fifo is only used for poison consumption requests. 2. Merge reset requests when poison fifo caches multiple poison consumption messages Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: trigger mode1 reset for RAS RMA statusTao Zhou2024-06-141-4/+4
| | | | | | | | | | Check RMA status in bad page retirement flow. v2: fix coding bugs in v1. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix double free err_addr pointer warningsBob Zhou2024-04-261-0/+1
| | | | | | | | | | In amdgpu_umc_bad_page_polling_timeout, the amdgpu_umc_handle_bad_pages will be run many times so that double free err_addr in some special case. So set the err_addr to NULL to avoid the warnings. Signed-off-by: Bob Zhou <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: prepare to handle pasid poison consumptionYiPeng Chai2024-04-261-7/+13
| | | | | | | | Prepare to handle pasid poison consumption. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add condition check for amdgpu_umc_fill_error_recordYiPeng Chai2024-04-261-3/+17
| | | | | | | | Add condition check for amdgpu_umc_fill_error_record. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add delay work to retire bad pagesYiPeng Chai2024-04-261-1/+1
| | | | | | | | Add delay work to retire bad pages. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: umc v12_0 logs ecc errorsYiPeng Chai2024-04-261-0/+67
| | | | | | | | | | | 1. umc v12_0 logs ecc errors. 2. Reserve newly detected ecc error pages. 3. Add tag for bad pages, so that they can be retired later. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add interface to update umc v12_0 ecc statusYiPeng Chai2024-04-261-0/+9
| | | | | | | | Add interface to update umc v12_0 ecc status. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: make reset method configurable for RAS poisonTao Zhou2024-03-201-9/+7
| | | | | | | | | | | | Each RAS block has different requirement for gpu reset in poison consumption handling. Add support for mmhub RAS poison consumption handling. v2: remove the mmhub poison support for kfd int v10. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Support passing poison consumption ras block to SRIOVYiPeng Chai2024-01-251-2/+3
| | | | | | | | | Support passing poison consumption ras blocks to SRIOV. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: update check condition of query for ras page retireTao Zhou2024-01-221-2/+8
| | | | | | | | | | Support page retirement handling in debug mode. v2: revert smu_v13_0_6_get_ecc_info directly. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use asynchronous polling to handle umc_v12_0 poisoningYiPeng Chai2024-01-221-28/+105
| | | | | | | | | | | | Use asynchronous polling to handle umc_v12_0 poisoning. v2: 1. Change function name. 2. Change the debugging information content. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Log deferred error separatelyCandice Li2024-01-151-0/+1
| | | | | | | | | Separate deferred error from UE and CE and log it individually. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Do bad page retirement for deferred errorsCandice Li2024-01-151-6/+4
| | | | | | | | | | | Needs to do bad page retirement for deferred errors. v2: Drop unused dev_info. Signed-off-by: YiPeng Chai <[email protected]> Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: use mode-2 reset for RAS poison consumptionTao Zhou2023-11-031-1/+5
| | | | | | | | Switch from mode-1 reset to mode-2 for poison consumption. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add ras_err_info to identify RAS error sourceYang Wang2023-10-131-6/+21
| | | | | | | | | | | | | | | | | | | | | | introduced "ras_err_info" to better identify a RAS ERROR source. NOTE: For legacy chips, keep the original RAS error print format. v1: RAS errors may come from different dies during a RAS error query, therefore, need a new data structure to identify the source of RAS ERROR. v2: - use new data structure 'amdgpu_smuio_mcm_config_info' instead of ras_err_id (in v1 patch) - refine ras error dump function name - refine ras error dump log format Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use function for IP version checkLijo Lazar2023-09-201-1/+1
| | | | | | | | | Use an inline function for version check. Gives more flexibility to handle any format changes. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Optimize checking ras supportedStanley.Yang2023-06-151-15/+19
| | | | | | | | | Using "is_app_apu" to identify device in the native APU mode or carveout mode. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: optimize redundant code in umc_v8_10YiPeng Chai2023-04-111-0/+31
| | | | | | | | Optimize redundant code in umc_v8_10 Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Initialize umc ras callbackHawking Zhang2023-03-221-1/+1
| | | | | | | | | Fix a coding error which results to null interrupt handler for umc ras. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Stanley Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Move umc ras block init to gmc ras sw_initHawking Zhang2023-03-131-0/+30
| | | | | | | | | | | | Initialize umc ras block only when umc ip block supports ras. Driver queries ras capabilities after early_init, ras block init needs to be moved to sw_init. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Stanley Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: exclude duplicate pages from UMC RAS UE countTao Zhou2023-02-231-2/+2
| | | | | | | | | | | | | If a UMC bad page is reserved but not freed by an application, the application may trigger uncorrectable error repeatly by accessing the page. v2: add specific function to do the check. v3: remove duplicate pages, calculate new added bad page number. v4: reuse save_bad_pages to calculate new added bad page number. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add RAS poison consumption handler for SRIOVTao Zhou2022-12-151-16/+24
| | | | | | | | Send message to PF if VF receives RAS poison consumption interrupt. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: remove ras_error_status parameter for UMC poison handlerTao Zhou2022-10-271-8/+5
| | | | | | | | Make the code simpler. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add RAS poison handling for MCATao Zhou2022-10-271-11/+20
| | | | | | | | | | | For MCA poison, if unmap queue fails, only gpu reset should be triggered without page retirement handling, MCA notifier will do it. v2: handle MCA poison consumption in umc_poison_handler directly. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add RAS page retirement functions for MCATao Zhou2022-10-271-0/+53
| | | | | | | | | | | | | Define page retirement functions for MCA platform. v2: remove page retirement handling from MCA poison handler, let MCA notifier do page retirement. v3: remove specific poison handler for MCA to simplify code. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: message smu to update bad channel infoStanley.Yang2022-03-151-0/+5
| | | | | | | | | It should notice SMU to update bad channel info when detected uncorrectable error in UMC block Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Remove redundant calls of amdgpu_ras_block_late_fini in umc ras ↵yipechai2022-03-021-7/+0
| | | | | | | | | | block Remove redundant calls of amdgpu_ras_block_late_fini in umc ras block. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Optimize xxx_ras_fini function of each ras blockyipechai2022-03-021-2/+2
| | | | | | | | | | | | 1. Move the variables of ras block instance members from specific xxx_ras_fini to general ras_fini call. 2. Function calls inside the modules only use parameters passed from xxx_ras_fini instead of ras block instance members. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Modify .ras_fini function pointer parameteryipechai2022-03-021-1/+1
| | | | | | | | | | Modify .ras_fini function pointer parameter so that we can remove redundant intermediate calls in some ras blocks. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Optimize xxx_ras_late_init function of each ras blockyipechai2022-02-171-3/+3
| | | | | | | | | | | 1. Move calling ras block instance members from module internal function to the top calling xxx_ras_late_init. 2. Module internal function calls can only use parameter variables of xxx_ras_late_init instead of ras block instance members. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Modify .ras_late_init function pointer parameteryipechai2022-02-171-1/+1
| | | | | | | | | Modify .ras_late_init function pointer parameter so that it can remove redundant intermediate calls in some ras blocks. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Optimize amdgpu_umc_ras_late_init/amdgpu_umc_ras_fini function codeyipechai2022-02-141-38/+6
| | | | | | | | Optimize amdgpu_umc_ras_late_init/amdgpu_umc_ras_fini function code. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add umc_fill_error_record to make code more simpleTao Zhou2022-01-271-0/+21
| | | | | | | | | Create common amdgpu_umc_fill_error_record function for all versions of UMC and clean up related codes. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Modify umc block to fit for the unified ras block data and opsyipechai2022-01-141-16/+16
| | | | | | | | | | | | | | | 1.Modify umc block to fit for the unified ras block data and ops. 2.Change amdgpu_umc_ras_funcs to amdgpu_umc_ras, and the corresponding variable name remove _funcs suffix. 3.Remove the const flag of umc ras variable so that umc ras block can be able to be inserted into amdgpu device ras block link list. 4.Invoke amdgpu_ras_register_ras_block function to register umc ras block into amdgpu device ras block link list. 5.Remove the redundant code about umc in amdgpu_ras.c after using the unified ras block. 6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of umc versions. If .ras_late_init and .ras_fini had been defined by the selected umc version, the defined functions will take effect; if not defined, default fill them with amdgpu_umc_ras_late_init and amdgpu_umc_ras_fini. Signed-off-by: yipechai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: John Clements <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Modify the compilation failed problem when other ras blocks' .h ↵yipechai2022-01-141-1/+1
| | | | | | | | | | | | | | include amdgpu_ras.h Modify the compilation failed problem when other ras blocks' .h include amdgpu_ras.h. v2: squash in forward declaration warning fix (Alex) Signed-off-by: yipechai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: John Clements <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/pm: do not expose implementation details to other blocks out of powerEvan Quan2022-01-141-3/+2
| | | | | | | | | | Those implementation details(whether swsmu supported, some ppt_funcs supported, accessing internal statistics ...)should be kept internally. It's not a good practice and even error prone to expose implementation details. Signed-off-by: Evan Quan <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>