aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
Commit message (Collapse)AuthorAgeFilesLines
* drm/amdgpu: Get mca address for old eeprom recordsganglxie2025-05-291-0/+9
| | | | | | | | | after getting mca address for old eeprom records with 'address==0', it can be correctly parsed under none-nps1, or it will be dropped. Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: handle old RAS eeprom data in non-nps1 modeganglxie2025-05-291-2/+14
| | | | | | | | | Get MCA address from PA in nps1, then convert MCA address to PA in specific nps mode. Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: update ras support checkMangesh Gadre2025-05-221-1/+2
| | | | | | | | | update ras support check for vcn 5.0.1 Signed-off-by: Mangesh Gadre <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Log RAS errors during loadLijo Lazar2025-05-131-1/+4
| | | | | | | | | | During driver load, RAS event manager may not be initialized. This will cause any ATHUB event during driver load to be skipped in dmesg log. Log the error in dmesg log for easier diagnosis. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add get_retire_flip_bits for UMCTao Zhou2025-05-131-0/+4
| | | | | | | | Add the general interface to get flip bits for RAS bad page retirement. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* Refine RAS bad page records counting and parsing in eeprom V3ganglxie2025-05-131-26/+36
| | | | | | | | | | there is only MCA records in V3, no need to care about PA records. recalculate the value of ras_num_bad_pages when parsing failed and go on with the left records instead of quit. Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* Merge tag 'amd-drm-next-6.16-2025-05-09' of ↵Dave Airlie2025-05-111-5/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.16-2025-05-09: amdgpu: - IPS fixes - DSC cleanup - DC Scaling updates - DC FP fixes - Fused I2C-over-AUX updates - SubVP fixes - Freesync fix - DMUB AUX fixes - VCN fix - Hibernation fixes - HDP fixes - DCN 2.1 fixes - DPIA fixes - DMUB updates - Use drm_file_err in amdgpu - Enforce isolation updates - Use new dma_fence helpers - USERQ fixes - Documentation updates - Misc code cleanups - SR-IOV updates - RAS updates - PSP 12 cleanups amdkfd: - Update error messages for SDMA - Userptr updates drm: - Add drm_file_err function dma-buf: - Add a helper to sort and deduplicate dma_fence arrays From: Alex Deucher <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Dave Airlie <[email protected]>
| * drm/amdgpu: Fix comment styleLijo Lazar2025-04-301-1/+1
| | | | | | | | | | | | | | | | | | | | Fix code comment style Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Direct ret in ras_reset_err_cnt on VFEllen Pan2025-04-111-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | With adding sriov_vf check, we directly return EOPNOTSUPP in ras_reset_error_count as we should not do anything on VF to reset RAS error count. This also fixes the issue that loading guest driver causes register violations. Reviewed-by: Ahmad Rehman <[email protected]> Signed-off-by: Ellen Pan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Disable ACA on VFsVictor Skvortsov2025-04-081-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | VFs query RAS error counts directly from host with AMDGPU_RAS_VIRT_ERROR_COUNT_QUERY. When ACA is enabled, an unusable aca_sysfs is created rather than amdgpu_ras_sysfs_create() Likewise, VFs depend on host support to query CPERs, rather than ACA component. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* | Merge tag 'drm-next-2025-04-05' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds2025-04-051-0/+1
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull drm fixes from Dave Airlie: "Weekly fixes, mostly from the end of last week, this week was very quiet, maybe you scared everyone away. It's mostly amdgpu, and xe, with some i915, adp and bridge bits, since I think this is overly quiet I'd expect rc2 to be a bit more lively. bridge: - tda998x: Select CONFIG_DRM_KMS_HELPER amdgpu: - Guard against potential division by 0 in fan code - Zero RPM support for SMU 14.0.2 - Properly handle SI and CIK support being disabled - PSR fixes - DML2 fixes - DP Link training fix - Vblank fixes - RAS fixes - Partitioning fix - SDMA fix - SMU 13.0.x fixes - Rom fetching fix - MES fixes - Queue reset fix xe: - Fix NULL pointer dereference on error path - Add missing HW workaround for BMG - Fix survivability mode not triggering - Fix build warning when DRM_FBDEV_EMULATION is not set i915: - Bounds check for scalers in DSC prefill latency computation - Fix build by adding a missing include adp: - Fix error handling in plane setup" # -----BEGIN PGP SIGNATURE----- * tag 'drm-next-2025-04-05' of https://gitlab.freedesktop.org/drm/kernel: (34 commits) drm/i2c: tda998x: select CONFIG_DRM_KMS_HELPER drm/amdgpu/gfx12: fix num_mec drm/amdgpu/gfx11: fix num_mec drm/amd/pm: Add gpu_metrics_v1_8 drm/amdgpu: Prefer shadow rom when available drm/amd/pm: Update smu metrics table for smu_v13_0_6 drm/amd/pm: Remove host limit metrics support Remove unnecessary firmware version check for gc v9_4_2 drm/amdgpu: stop unmapping MQD for kernel queues v3 Revert "drm/amdgpu/sdma_v4_4_2: update VM flush implementation for SDMA" drm/amdgpu: Parse all deferred errors with UMC aca handle drm/amdgpu: Update ta ras block drm/amdgpu: Add NPS2 to DPX compatible mode drm/amdgpu: Use correct gfx deferred error count drm/amd/display: Actually do immediate vblank disable drm/amd/display: prevent hang on link training fail Revert "drm/amd/display: dml2 soc dscclk use DPM table clk setting" drm/amd/display: Increase vblank offdelay for PSR panels drm/amd: Handle being compiled without SI or CIK support better drm/amd/pm: Add zero RPM enabled OD setting support for SMU14.0.2 ...
| * drm/amdgpu: Update ta ras blockStanley.Yang2025-03-261-0/+1
| | | | | | | | | | | | | | | | Update ta ra block to keep sync with RAS TA. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* | Merge tag 'driver-core-6.15-rc1' of ↵Linus Torvalds2025-04-011-7/+6
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updatesk from Greg KH: "Here is the big set of driver core updates for 6.15-rc1. Lots of stuff happened this development cycle, including: - kernfs scaling changes to make it even faster thanks to rcu - bin_attribute constify work in many subsystems - faux bus minor tweaks for the rust bindings - rust binding updates for driver core, pci, and platform busses, making more functionaliy available to rust drivers. These are all due to people actually trying to use the bindings that were in 6.14. - make Rafael and Danilo full co-maintainers of the driver core codebase - other minor fixes and updates" * tag 'driver-core-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (52 commits) rust: platform: require Send for Driver trait implementers rust: pci: require Send for Driver trait implementers rust: platform: impl Send + Sync for platform::Device rust: pci: impl Send + Sync for pci::Device rust: platform: fix unrestricted &mut platform::Device rust: pci: fix unrestricted &mut pci::Device rust: device: implement device context marker rust: pci: use to_result() in enable_device_mem() MAINTAINERS: driver core: mark Rafael and Danilo as co-maintainers rust/kernel/faux: mark Registration methods inline driver core: faux: only create the device if probe() succeeds rust/faux: Add missing parent argument to Registration::new() rust/faux: Drop #[repr(transparent)] from faux::Registration rust: io: fix devres test with new io accessor functions rust: io: rename `io::Io` accessors kernfs: Move dput() outside of the RCU section. efi: rci2: mark bin_attribute as __ro_after_init rapidio: constify 'struct bin_attribute' firmware: qemu_fw_cfg: constify 'struct bin_attribute' powerpc/perf/hv-24x7: Constify 'struct bin_attribute' ...
| * drm/amdgpu: Constify 'struct bin_attribute'Thomas Weißschuh2025-02-211-7/+6
| | | | | | | | | | | | | | | | | | | | | | The sysfs core now allows instances of 'struct bin_attribute' to be moved into read-only memory. Make use of that to protect them against accidental or malicious modifications. Signed-off-by: Thomas Weißschuh <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Link: https://lore.kernel.org/r/20241216-sysfs-const-bin_attr-drm-v1-4-210f2b36b9bf@weissschuh.net Signed-off-by: Greg Kroah-Hartman <[email protected]>
* | drm/amdgpu: format old RAS eeprom data into V3 versionTao Zhou2025-03-181-0/+7
| | | | | | | | | | | | | | | | | | | | Clear old data and save it in V3 format. v2: only format eeprom data for new ASICs. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* | drm/amdgpu: Enable ACA by default for psp v13_0_6/v13_0_14Xiang Liu2025-03-141-2/+4
| | | | | | | | | | | | | | | | Enable ACA by default for psp v13_0_6/v13_0_14. Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* | drm/amdgpu: Save PA of bad pages for old asicsganglxie2025-03-141-3/+21
| | | | | | | | | | | | | | | | | | for old asics that do not support mca translating, we just save PA for them Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* | drm/amdgpu: Report generic instead of unknown boot time errorsXiang Liu2025-02-271-2/+2
| | | | | | | | | | | | | | | | | | | | Change the DMESG reporting of unknown errors to "Boot Controller Generic Error" to align with the RAS SPEC and provide more clarity to customers. Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* | drm/amdgpu: Change page/record number calculation based on npsganglxie2025-02-251-27/+22
| | | | | | | | | | | | | | | | | | save only one record to save eeprom space,and bad_page_num = pa_rec_num + mca_rec_num*16 Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* | drm/amdgpu: Refine bad page addingganglxie2025-02-251-91/+103
| | | | | | | | | | | | | | | | bad page adding can be simpler with nps info Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* | drm/amdgpu: Enable ACA by default for psp v13_0_12Candice Li2025-02-171-2/+3
| | | | | | | | | | | | | | | | | | Enable ACA by default for psp v13_0_12. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* | drm/amdgpu: Skip err_count sysfs creation on VF unsupported RAS blocksVictor Skvortsov2025-02-131-0/+3
| | | | | | | | | | | | | | | | | | | | VFs are not able to query error counts for all RAS blocks. Rather than returning error for queries on these blocks, skip sysfs the creation all together. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* | drm/amdgpu: Update usage for bad page thresholdHawking Zhang2025-02-131-20/+20
|/ | | | | | | | | The driver's behavior varies based on the configuration of amdgpu_bad_page_threshold setting Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix error handling in amdgpu_ras_add_bad_pagesSrinivasan Shanmugam2025-01-061-5/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | It ensures that appropriate error codes are returned when an error condition is detected Fixes the below; drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2849 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_umc_pages_in_a_row()' failed. drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2884 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_ras_mca2pa()' failed. v2: s/-EIO/-EINVAL, retained the use of -EINVAL from amdgpu_umc_pages_in_a_row & and amdgpu_ras_mca2pa_by_idx, when the RAS context is not initialized or the convert_ras_err_addr function is unavailable. (Thomas) V3: Returning 0 as the absence of eh_data is acceptable. (Tao) Fixes: a8d133e625ce ("drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes") Reported-by: Dan Carpenter <[email protected]> Cc: YiPeng Chai <[email protected]> Cc: Tao Zhou <[email protected]> Cc: Hawking Zhang <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Enable psp v14_0_3 RAS support for non-SRIOV configurations.Candice Li2024-12-181-1/+1
| | | | | | | | Enable psp v14_0_3 RAS support for non-SRIOV configurations. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Support nbif v6_3_1 fatal error handlingCandice Li2024-12-101-0/+12
| | | | | | | | Add nbif v6_3_1 fatal error handling support. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add psp v14_0_3 ras supportCandice Li2024-12-101-0/+1
| | | | | | | | Add psp v14_0_3 ras support. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Enable RAS for psp v13_0_12Hawking Zhang2024-12-101-0/+5
| | | | | | | | | Enable RAS Cap check and initialize RAS funcs for psp v13_0_12 Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: correct the calculation of RAS bad pageTao Zhou2024-12-101-8/+2
| | | | | | | | | | After the introduction of NPS RAS, one bad page record on eeprom may be related to 1 or 16 bad pages, so the bad page record and bad page are two different concepts, define a new variable to store bad page number. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: split ras_eeprom_init into init and check functionsTao Zhou2024-12-101-4/+11
| | | | | | | | | | Init function is for ras table header read and check function is responsible for the validation of the header. Call them in different stages. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: remove is_mca_add for ras_add_bad_pagesTao Zhou2024-12-101-11/+5
| | | | | | | | Remove unnecessary variable and simplify the logic. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modesTao Zhou2024-12-101-15/+81
| | | | | | | | | | | | All legacy RAS bad pages are generated in NPS1 mode, but new bad page can be generated in any NPS mode, so we can't use retired_page stored on eeprom directly in non-nps1 mode even for legacy data. We need to take different actions for different data, new data can be identified from old data by UMC_CHANNEL_IDX_V2 flag. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: support to find RAS bad pages via old TATao Zhou2024-12-101-3/+25
| | | | | | | | | | | Old version of RAS TA doesn't support to convert MCA address stored on eeprom to physical address (PA), support to find all bad pages in one memory row by PA with old RAS TA. This approach is only suitable for nps1 mode. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: store only one RAS bad page record for all pages in one rowTao Zhou2024-12-101-8/+27
| | | | | | | | So eeprom space can be saved, compatible with legacy way. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Prefer RAS recovery for scheduler hangLijo Lazar2024-12-101-2/+53
| | | | | | | | | | | | | | | | | | | Before scheduling a recovery due to scheduler/job hang, check if a RAS error is detected. If so, choose RAS recovery to handle the situation. A scheduler/job hang could be the side effect of a RAS error. In such cases, it is required to go through the RAS error recovery process. A RAS error recovery process in certains cases also could avoid a full device device reset. An error state is maintained in RAS context to detect the block affected. Fatal Error state uses unused block id. Set the block id when error is detected. If the interrupt handler detected a poison error, it's not required to look for a fatal error. Skip fatal error checking in such cases. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: do RAS MCA2PA conversion in device init phaseTao Zhou2024-12-101-12/+82
| | | | | | | | | | NPS mode is introduced, the value of memory physical address (PA) related to a MCA address varies per nps mode. We need to rely on MCA address and convert it into PA accroding to nps mode. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add flag to indicate the type of RAS eeprom recordTao Zhou2024-12-101-7/+26
| | | | | | | | | | | | One UMC MCA address could map to multiply physical address (PA): AMDGPU_RAS_EEPROM_REC_PA: one record store one PA AMDGPU_RAS_EEPROM_REC_MCA: one record store one MCA address, PA is not cared about Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use reset recovery state checksLijo Lazar2024-11-201-5/+5
| | | | | | | | | | Some in_reset checks are infact checking whether the state is reinitialization after reset. Replace with reset_in_recovery calls to identify that it's really checking for recovery stage after reset. Signed-off-by: Lijo Lazar <[email protected]> Acked-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Implement virt req_ras_err_countVictor Skvortsov2024-11-111-7/+65
| | | | | | | | | | | | | | | | | | | | | | | Enable RAS late init if VF RAS Telemetry is supported. When enabled, the VF can use this interface to query total RAS error counts from the host. The VF FB access may abruptly end due to a fatal error, therefore the VF must cache and sanitize the input. The Host allows 15 Telemetry messages every 60 seconds, afterwhich the host will ignore any more in-coming telemetry messages. The VF will rate limit its msg calling to once every 5 seconds (12 times in 60 seconds). While the VF is rate limited, it will continue to report the last good cached data. v2: Flip generate report & update statistics order for VF Signed-off-by: Victor Skvortsov <[email protected]> Acked-by: Tao Zhou <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: VF Query RAS Caps from Host if supportedVictor Skvortsov2024-11-111-0/+5
| | | | | | | | | If VF RAS Capability support is enabled, guest is able to retrieve the real RAS support from the host. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Skip IP coredump for RAS errorsLijo Lazar2024-11-041-0/+1
| | | | | | | | | For RAS errors, source of error is known. Skip the core dump of IP states. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Refactor XGMI reset on init handlingLijo Lazar2024-09-261-6/+0
| | | | | | | | | | | | Use XGMI hive information to rely on resetting XGMI devices on initialization rather than using mgpu structure. mgpu structure may have other devices as well. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Acked-by: Rajneesh Bhardwaj <[email protected]> Tested-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add helper to initialize badpage infoLijo Lazar2024-09-261-18/+38
| | | | | | | | | | | | | | Add a separate function to read badpage data during initialization. Reading bad pages will need hardware access and cannot be done during reset. Hence in cases where device needs a full reset during init itself, attempting to read will cause a deadlock. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Acked-by: Rajneesh Bhardwaj <[email protected]> Tested-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use init level for pending_reset flagLijo Lazar2024-09-261-1/+1
| | | | | | | | | | Drop pending_reset flag in gmc block. Instead use init level to determine which type of init is preferred - in this case MINIMAL. Signed-off-by: Lijo Lazar <[email protected]> Acked-by: Rajneesh Bhardwaj <[email protected]> Tested-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* amd/amdgpu: Reduce unnecessary repetitive GPU resetsYiPeng Chai2024-09-261-1/+20
| | | | | | | | | | In multiple GPUs case, after a GPU has started resetting all GPUs on hive, other GPUs do not need to trigger GPU reset again. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix typo in the commentYan Zhen2024-09-181-1/+1
| | | | | | | | | | | | | | | | | | Correctly spelled comments make it easier for the reader to understand the code. Replace 'udpate' with 'update' in the comment & replace 'recieved' with 'received' in the comment & replace 'dsiable' with 'disable' in the comment & replace 'Initiailize' with 'Initialize' in the comment & replace 'disble' with 'disable' in the comment & replace 'Disbale' with 'Disable' in the comment & replace 'enogh' with 'enough' in the comment & replace 'availabe' with 'available' in the comment. Acked-by: Christian König <[email protected]> Signed-off-by: Yan Zhen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: disable GPU RAS bad page feature for specific ASICTao Zhou2024-09-171-0/+5
| | | | | | | | | | | The feature is not applicable to specific app platform. v2: update the disablement condition and commit description v3: move the setting to amdgpu_ras_check_supported Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: remove RAS unused paramter 'err_addr'Yang Wang2024-08-061-9/+9
| | | | | | | | | | | | | - amdgpu_ras_error_statistic_ue_count() - amdgpu_ras_error_statistic_ce_count() - amdgpu_ras_error_statistic_de_count() The parameter 'err_addr' is no longer used since following patch. Fixes: a7e8467fbeee ("drm/amdgpu: Remove unused code") Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: create function to check RAS RMA statusTao Zhou2024-08-061-6/+16
| | | | | | | | In the convenience of calling it globally. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add more types for boot time error reportingHawking Zhang2024-08-061-0/+10
| | | | | | | | Data abort exception and unknown errors are supported. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>