aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
Commit message (Collapse)AuthorAgeFilesLines
* drm/amdgpu: sysfs node disable query error count during gpu resetYiPeng Chai2024-07-081-1/+2
| | | | | | | | Sysfs node disable query error count during gpu reset. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: process RAS fatal error MB notificationVignesh Chander2024-06-271-1/+8
| | | | | | | | | | | | | | | | For RAS error scenario, VF guest driver will check mailbox and set fed flag to avoid unnecessary HW accesses. additionally, poll for reset completion message first to avoid accidentally spamming multiple reset requests to host. v2: add another mailbox check for handling case where kfd detects timeout first v3: set host_flr bit and use wait_for_reset Signed-off-by: Vignesh Chander <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix pci state save during mode-1 resetLijo Lazar2024-06-271-2/+5
| | | | | | | | | | | | Cache the PCI state before bus master is disabled. The saved state is later used for other cases like restoring config space after mode-2 reset. Fixes: 5c03e5843e6b ("drm/amdgpu:add smu mode1/2 support for aldebaran") Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix using the reserved VMID with gang submitChristian König2024-06-191-4/+17
| | | | | | | | | We need to ensure that even when using a reserved VMID that the gang members can still run in parallel. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: create amdgpu_ras_in_recovery to simplify codeTao Zhou2024-06-141-11/+2
| | | | | | | | | Reduce redundant code and user doesn't need to pay attention to RAS details. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refine gpu_info firmware loadingYang Wang2024-06-141-5/+4
| | | | | | | | refine gpu_info firmware loading Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix sriov host flr handlerYunxiang Li2024-06-141-0/+2
| | | | | | | | | | | | | | | | | We send back the ready to reset message before we stop anything. This is wrong. Move it to when we are actually ready for the FLR to happen. In the current state since we take tens of seconds to stop everything, it is very likely that host would give up waiting and reset the GPU before we send ready, so it would be the same as before. But this gets rid of the hack with reset_domain locking and also let us tell how slow ready to reset actually is from the host. The ready to reset speed can be improved later. Signed-off-by: Yunxiang Li <[email protected]> Acked-by: Christian König <[email protected]> Reviewed-by: Emily Deng <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdkfd: add reset cause in gpu pre-reset smi eventEric Huang2024-06-051-1/+1
| | | | | | | | | | | reset cause is requested by customer as additional info for gpu reset smi event. v2: integerate reset sources suggested by Lijo Lazar Signed-off-by: Eric Huang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add lock around VF RLCG interfaceVictor Skvortsov2024-05-291-0/+1
| | | | | | | | | | | | | flush_gpu_tlb may be called from another thread while device_gpu_recover is running. Both of these threads access registers through the VF RLCG interface during VF Full Access. Add a lock around this interface to prevent race conditions between these threads. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Adjust logic in amdgpu_device_partner_bandwidth()Alex Deucher2024-05-291-7/+12
| | | | | | | | | Use current speed/width on devices which don't support dynamic PCIe switching. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3289 Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add prints in IP State dumpSunil Khatri2024-05-231-0/+2
| | | | | | | | | | | add prints before and after ip state is dumped. It avoids user to think of system being stuck/hung as dump could take some time after a gpu hang. Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Sunil Khatri <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/amdgpu: fix the inst passed to amdgpu_virt_rlcg_reg_rwVictor Zhao2024-05-231-2/+2
| | | | | | | | | the inst passed to amdgpu_virt_rlcg_reg_rw should be physical instance. Fix the miss matched code. Signed-off-by: Victor Zhao <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Check if NBIO funcs are NULL in amdgpu_device_baco_exitFriedrich Vock2024-05-171-1/+1
| | | | | | | | | | | | | The special case for VM passthrough doesn't check adev->nbio.funcs before dereferencing it. If GPUs that don't have an NBIO block are passed through, this leads to a NULL pointer dereference on startup. Signed-off-by: Friedrich Vock <[email protected]> Fixes: 1bece222eabe ("drm/amdgpu: Clear doorbell interrupt status for Sienna Cichlid") Cc: Alex Deucher <[email protected]> Cc: Christian König <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix dereference after null checkJesse Zhang2024-05-131-1/+1
| | | | | | | | check the pointer hive before use. Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Tim Huang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/amdgpu: Check tbo resource pointerAsad Kamal2024-05-021-1/+2
| | | | | | | | | Validate tbo resource pointer, skip if NULL Signed-off-by: Asad Kamal <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add gfx v9_4_4 ip blockHawking Zhang2024-05-021-2/+5
| | | | | | | | Add gfx v9_4_4 ip block support Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Move ras resume into SRIOV functionYunxiang Li2024-05-021-7/+5
| | | | | | | | | This is part of the reset, move it into the reset function. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Emily Deng <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix amdgpu_device_reset_sriov retry logicYunxiang Li2024-05-021-25/+22
| | | | | | | | | | | | | | | | | | | | | The retry loop for SRIOV reset have refcount and memory leak issue. Depending on which function call fails it can potentially call amdgpu_amdkfd_pre/post_reset different number of times and causes kfd_locked count to be wrong. This will block all future attempts at opening /dev/kfd. The retry loop also leakes resources by calling amdgpu_virt_init_data_exchange multiple times without calling the corresponding fini function. Align with the bare-metal reset path which doesn't have these issues. This means taking the amdgpu_amdkfd_pre/post_reset functions out of the reset loop and calling amdgpu_device_pre_asic_reset each retry which properly free the resources from previous try by calling amdgpu_virt_fini_data_exchange. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Emily Deng <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add reset_context flag for host FLRYunxiang Li2024-05-021-5/+8
| | | | | | | | | | | | | | There are other reset sources that pass NULL as the job pointer, such as amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if the FLR comes from the host does not work. Add a flag in reset_context to explicitly mark host triggered reset, and set this flag when we receive host reset notification. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Emily Deng <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix two reset triggered in a rowYunxiang Li2024-05-021-9/+10
| | | | | | | | | | | | | | | | | Some times a hang GPU causes multiple reset sources to schedule resets. The second source will be able to trigger an unnecessary reset if they schedule after we call amdgpu_device_stop_pending_resets. Move amdgpu_device_stop_pending_resets to after the reset is done. Since at this point the GPU is supposedly in a good state, any reset scheduled after this point would be a legitimate reset. Remove unnecessary and incorrect checks for amdgpu_in_reset that was kinda serving this purpose. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: skip ip dump if devcoredump flag is setSunil Khatri2024-04-261-6/+7
| | | | | | | | | | Do not dump the ip registers during driver reload in passthrough environment. Signed-off-by: Sunil Khatri <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add a spinlock to wb allocationAlex Deucher2024-04-261-1/+10
| | | | | | | | As we use wb slots more dynamically, we need to lock access to avoid racing on allocation or free. Reviewed-by: Shaoyun.liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: dump ip state before reset for each ipSunil Khatri2024-04-261-0/+7
| | | | | | | | | | | Invoke the dump_ip_state function for each ip before the asic resets and save the register values for debugging via devcoredump. Signed-off-by: Sunil Khatri <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Skip the coredump collection on reset during driver reloadAhmad Rehman2024-04-191-2/+5
| | | | | | | | | | | | | | In passthrough environment, the driver triggers the mode-1 reset on reload. The reset causes the core dump collection which is delayed task and prevents driver from unloading until it is completed. Since we do not need to collect data on "reset on reload" case, we can skip core dump collection. v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well. Signed-off-by: Ahmad Rehman <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refactoring the runtime pm mode detection codeMa Jun2024-04-171-0/+75
| | | | | | | | | | refactor the code of runtime pm mode detection to support amdgpu_runtime_pm =2 and 1 two cases Signed-off-by: Ma Jun <[email protected]> Reviewed-by: Yang Wang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add missing space to DRM_WARN() messageThorsten Blum2024-04-171-1/+1
| | | | | | | | s/,please/, please/ Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Thorsten Blum <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* amd/amdgpu: improve VF recover timeZhigang Luo2024-04-101-0/+1
| | | | | | | | | 1. change AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT from 30 to 5. 2. set fatel error detected flag. Signed-off-by: Zhigang Luo <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu/pm: Add support for MACO flag checkingMa Jun2024-04-101-3/+5
| | | | | | | | | Add support for MACO flag checking. MACO mode only works if BACO is supported. Signed-off-by: Ma Jun <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/pm: fix the high voltage issue after unloadKenneth Feng2024-04-101-11/+15
| | | | | | | | fix the high voltage issue after unload on smu 13.0.10 Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: use vm_update_mode=0 as default in sriov for gfx10.3 onwardsDanijel Slivka2024-04-101-0/+7
| | | | | | | | | | | | | | Apply this rule to all newer asics in sriov case. For asic with VF MMIO access protection avoid using CPU for VM table updates. CPU pagetable updates have issues with HDP flush as VF MMIO access protection blocks write to BIF_BX_DEV0_EPF0_VF0_HDP_MEM_COHERENCY_FLUSH_CNTL register during sriov runtime. Moved the check to amdgpu_device_init() to ensure it is done after amdgpu_device_ip_early_init() where the IP versions are discovered. Signed-off-by: Danijel Slivka <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd: Flush GFXOFF requests in prepare stageMario Limonciello2024-03-221-0/+2
| | | | | | | | | | | | | | | | | | | | If the system hasn't entered GFXOFF when suspend starts it can cause hangs accessing GC and RLC during the suspend stage. Cc: <[email protected]> # 6.1.y: 5095d5418193 ("drm/amd: Evict resources during PM ops prepare() callback") Cc: <[email protected]> # 6.1.y: cb11ca3233aa ("drm/amd: Add concept of running prepare_suspend() sequence for IP blocks") Cc: <[email protected]> # 6.1.y: 2ceec37b0e3d ("drm/amd: Add missing kernel doc for prepare_suspend()") Cc: <[email protected]> # 6.1.y: 3a9626c816db ("drm/amd: Stop evicting resources on APUs in suspend") Cc: <[email protected]> # 6.6.y: 5095d5418193 ("drm/amd: Evict resources during PM ops prepare() callback") Cc: <[email protected]> # 6.6.y: cb11ca3233aa ("drm/amd: Add concept of running prepare_suspend() sequence for IP blocks") Cc: <[email protected]> # 6.6.y: 2ceec37b0e3d ("drm/amd: Add missing kernel doc for prepare_suspend()") Cc: <[email protected]> # 6.6.y: 3a9626c816db ("drm/amd: Stop evicting resources on APUs in suspend") Cc: <[email protected]> # 6.1+ Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3132 Fixes: ab4750332dbe ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks") Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refactor code to split devcoredump codeSunil Khatri2024-03-221-0/+1
| | | | | | | | | | | | | | | | Refractor devcoredump code into new files since its functionality is expanded further and better to slit and devcoredump to have its own file. v2: Fix the build failure caught by arm compiler of implicit function declaration with #ifdef v3: squash in fix for implicit declaration error Cc: Ivan Lipski <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Sunil Khatri <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: trigger flr_work if reading pf2vf data failedZhigang Luo2024-03-201-5/+10
| | | | | | | | | | | | | if reading pf2vf data failed 30 times continuously, it means something is wrong. Need to trigger flr_work to recover the issue. also use dev_err to print the error message to get which device has issue and add warning message if waiting IDH_FLR_NOTIFICATION_CMPL timeout. Signed-off-by: Zhigang Luo <[email protected]> Acked-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Do a basic health check before resetLijo Lazar2024-03-201-0/+24
| | | | | | | | | | Check if the device is present in the bus before trying to recover. It could be that device itself is lost from the bus in some hang situations. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* Revert "drm/amd/amdgpu: Fix potential ioremap() memory leaks in ↵Ma Jun2024-03-201-10/+6
| | | | | | | | | | | | | | | | | amdgpu_device_init()" This patch causes the following iounmap erorr and calltrace iounmap: bad address 00000000d0b3631f The original patch was unjustified because amdgpu_device_fini_sw() will always cleanup the rmmio mapping. This reverts commit eb4f139888f636614dab3bcce97ff61cefc4b3a7. Signed-off-by: Ma Jun <[email protected]> Suggested-by: Christian König <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: disable ring_muxer if mcbp is offPierre-Eric Pelloux-Prayer2024-03-061-2/+2
| | | | | | | | | | | | | | | | | | | | Using the ring_muxer without preemption adds overhead for no reason since mcbp cannot be triggered. Moving back to a single queue in this case also helps when high priority app are used: in this case the gpu_scheduler priority handling will work as expected - much better than ring_muxer with its 2 independant schedulers competing for the same hardware queue. This change requires moving amdgpu_device_set_mcbp above amdgpu_device_ip_early_init because we use adev->gfx.mcbp. Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Acked-by: Christian König <[email protected]> Acked-by: Jiadong Zhu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/amdgpu: Fix potential ioremap() memory leaks in amdgpu_device_init()Srinivasan Shanmugam2024-02-271-6/+10
| | | | | | | | | | | | | | | | This ensures that the memory mapped by ioremap for adev->rmmio, is properly handled in amdgpu_device_init(). If the function exits early due to an error, the memory is unmapped. If the function completes successfully, the memory remains mapped. Reported by smatch: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4337 amdgpu_device_init() warn: 'adev->rmmio' from ioremap() not released on lines: 4035,4045,4051,4058,4068,4337 Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add fatal error detected flagLijo Lazar2024-02-261-0/+1
| | | | | | | | | For a RAS error that needs a full reset to recover, set the fatal error status. Clear the status once the device is reset. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* Revert "drm/amd: flush any delayed gfxoff on suspend entry"Mario Limonciello2024-02-131-1/+0
| | | | | | | | | | | | | | | | | commit ab4750332dbe ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks") caused GFXOFF control to be used more heavily and the codepath that was removed from commit 0dee72639533 ("drm/amd: flush any delayed gfxoff on suspend entry") now can be exercised at suspend again. Users report that by using GNOME to suspend the lockscreen trigger will cause SDMA traffic and the system can deadlock. This reverts commit 0dee726395333fea833eaaf838bc80962df886c8. Acked-by: Alex Deucher <[email protected]> Fixes: ab4750332dbe ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks") Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd: Stop evicting resources on APUs in suspendMario Limonciello2024-02-131-2/+9
| | | | | | | | | | | | | | | | | | | | commit 5095d5418193 ("drm/amd: Evict resources during PM ops prepare() callback") intentionally moved the eviction of resources to earlier in the suspend process, but this introduced a subtle change that it occurs before adev->in_s0ix or adev->in_s3 are set. This meant that APUs actually started to evict resources at suspend time as well. Explicitly set s0ix or s3 in the prepare() stage, and unset them if the prepare() stage failed. v2: squash in warning fix from Stephen Rothwell Reported-by: Jürg Billeter <[email protected]> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3132#note_2271038 Fixes: 5095d5418193 ("drm/amd: Evict resources during PM ops prepare() callback") Acked-by: Alex Deucher <[email protected]> Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Need to resume ras during gpu reset for gfx v9_4_3 sriovYiPeng Chai2024-01-311-0/+1
| | | | | | | | | Need to resume ras during gpu reset for gfx v9_4_3 sriov Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix the warning info in mode1 resetMa Jun2024-01-311-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the warning info below during mode1 reset. [ +0.000004] Call Trace: [ +0.000004] <TASK> [ +0.000006] ? show_regs+0x6e/0x80 [ +0.000011] ? __flush_work.isra.0+0x2e8/0x390 [ +0.000005] ? __warn+0x91/0x150 [ +0.000009] ? __flush_work.isra.0+0x2e8/0x390 [ +0.000006] ? report_bug+0x19d/0x1b0 [ +0.000013] ? handle_bug+0x46/0x80 [ +0.000012] ? exc_invalid_op+0x1d/0x80 [ +0.000011] ? asm_exc_invalid_op+0x1f/0x30 [ +0.000014] ? __flush_work.isra.0+0x2e8/0x390 [ +0.000007] ? __flush_work.isra.0+0x208/0x390 [ +0.000007] ? _prb_read_valid+0x216/0x290 [ +0.000008] __cancel_work_timer+0x11d/0x1a0 [ +0.000007] ? try_to_grab_pending+0xe8/0x190 [ +0.000012] cancel_work_sync+0x14/0x20 [ +0.000008] amddrm_sched_stop+0x3c/0x1d0 [amd_sched] [ +0.000032] amdgpu_device_gpu_recover+0x29a/0xe90 [amdgpu] This warning info was printed after applying the patch "drm/sched: Convert drm scheduler to use a work queue rather than kthread". The root cause is that amdgpu driver tries to use the uninitialized work_struct in the struct drm_gpu_scheduler v2: - Rename the function to amdgpu_ring_sched_ready and move it to amdgpu_ring.c (Alex) v3: - Fix a few more checks based on Vitaly's patch (Alex) v4: - squash in fix noticed by Bert in https://gitlab.freedesktop.org/drm/amd/-/issues/3139 Fixes: 11b3b9f461c5 ("drm/sched: Check scheduler ready before calling timeout handling") Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Vitaly Prosyak <[email protected]> Signed-off-by: Ma Jun <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: adjust aca init/fini sequence to match gpu resetYang Wang2024-01-251-6/+0
| | | | | | | | | | - move aca init/fini function into ras init/fini to adapt gpu reset sequence. - add new function amdgpu_aca_reset() Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* Revert "drm/amd/pm: fix the high voltage and temperature issue"Mario Limonciello2024-01-221-17/+7
| | | | | | | | | | | | | This reverts commit 5f38ac54e60562323ea4abb1bfb37d043ee23357. This causes issues with rebooting and the 7800XT. Cc: Kenneth Feng <[email protected]> Cc: [email protected] Fixes: 5f38ac54e605 ("drm/amd/pm: fix the high voltage and temperature issue") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3062 Signed-off-by: Mario Limonciello <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Skip do PCI error slot reset during RAS recoveryStanley.Yang2024-01-181-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Why: The PCI error slot reset maybe triggered after inject ue to UMC multi times, this caused system hang. [ 557.371857] amdgpu 0000:af:00.0: amdgpu: GPU reset succeeded, trying to resume [ 557.373718] [drm] PCIE GART of 512M enabled. [ 557.373722] [drm] PTB located at 0x0000031FED700000 [ 557.373788] [drm] VRAM is lost due to GPU reset! [ 557.373789] [drm] PSP is resuming... [ 557.547012] mlx5_core 0000:55:00.0: mlx5_pci_err_detected Device state = 1 pci_status: 0. Exit, result = 3, need reset [ 557.547067] [drm] PCI error: detected callback, state(1)!! [ 557.547069] [drm] No support for XGMI hive yet... [ 557.548125] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 0. Enter [ 557.607763] mlx5_core 0000:55:00.0: wait vital counter value 0x16b5b after 1 iterations [ 557.607777] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 1. Exit, err = 0, result = 5, recovered [ 557.610492] [drm] PCI error: slot reset callback!! ... [ 560.689382] amdgpu 0000:3f:00.0: amdgpu: GPU reset(2) succeeded! [ 560.689546] amdgpu 0000:5a:00.0: amdgpu: GPU reset(2) succeeded! [ 560.689562] general protection fault, probably for non-canonical address 0x5f080b54534f611f: 0000 [#1] SMP NOPTI [ 560.701008] CPU: 16 PID: 2361 Comm: kworker/u448:9 Tainted: G OE 5.15.0-91-generic #101-Ubuntu [ 560.712057] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C11.AG.1 11/08/2023 [ 560.720959] Workqueue: amdgpu-reset-hive amdgpu_ras_do_recovery [amdgpu] [ 560.728887] RIP: 0010:amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu] [ 560.736891] Code: ff 41 89 c6 e9 1b ff ff ff 44 0f b6 45 b0 e9 4f ff ff ff be 01 00 00 00 4c 89 e7 e8 76 c9 8b ff 44 0f b6 45 b0 e9 3c fd ff ff <48> 83 ba 18 02 00 00 00 0f 84 6a f8 ff ff 48 8d 7a 78 be 01 00 00 [ 560.757967] RSP: 0018:ffa0000032e53d80 EFLAGS: 00010202 [ 560.763848] RAX: ffa00000001dfd10 RBX: ffa0000000197090 RCX: ffa0000032e53db0 [ 560.771856] RDX: 5f080b54534f5f07 RSI: 0000000000000000 RDI: ff11000128100010 [ 560.779867] RBP: ffa0000032e53df0 R08: 0000000000000000 R09: ffffffffffe77f08 [ 560.787879] R10: 0000000000ffff0a R11: 0000000000000001 R12: 0000000000000000 [ 560.795889] R13: ffa0000032e53e00 R14: 0000000000000000 R15: 0000000000000000 [ 560.803889] FS: 0000000000000000(0000) GS:ff11007e7e800000(0000) knlGS:0000000000000000 [ 560.812973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 560.819422] CR2: 000055a04c118e68 CR3: 0000000007410005 CR4: 0000000000771ee0 [ 560.827433] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 560.835433] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 560.843444] PKRU: 55555554 [ 560.846480] Call Trace: [ 560.849225] <TASK> [ 560.851580] ? show_trace_log_lvl+0x1d6/0x2ea [ 560.856488] ? show_trace_log_lvl+0x1d6/0x2ea [ 560.861379] ? amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu] [ 560.867778] ? show_regs.part.0+0x23/0x29 [ 560.872293] ? __die_body.cold+0x8/0xd [ 560.876502] ? die_addr+0x3e/0x60 [ 560.880238] ? exc_general_protection+0x1c5/0x410 [ 560.885532] ? asm_exc_general_protection+0x27/0x30 [ 560.891025] ? amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu] [ 560.898323] amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu] [ 560.904520] process_one_work+0x228/0x3d0 How: In RAS recovery, mode-1 reset is issued from RAS fatal error handling and expected all the nodes in a hive to be reset. no need to issue another mode-1 during this procedure. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: revert "Adjust removal control flow for smu v13_0_2"Christian König2024-01-181-31/+1
| | | | | | | | | | | | | | | | | | | | Calling amdgpu_device_ip_resume_phase1() during shutdown leaves the HW in an active state and is an unbalanced use of the IP callbacks. Using the IP callbacks like this can lead to memory leaks, double free and imbalanced reference counters. Leaving the HW in an active state can lead to DMA accesses to memory now freed by the driver. Both is a complete no-go for driver unload so completely revert the workaround for now. This reverts commit f5c7e7797060255dbc8160734ccc5ad6183c5e04. Signed-off-by: Christian König <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Check extended configuration space register when system uses ↵Ma Jun2024-01-151-0/+4
| | | | | | | | | | | | | | | | | | | | | large bar Some customer platforms do not enable mmconfig for various reasons, such as bios bug, and therefore cannot access the GPU extend configuration space through mmio. When the system enters the d3cold state and resumes, the amdgpu driver fails to resume because the extend configuration space registers of GPU can't be restored. At this point, Usually we only see some failure dmesg log printed by amdgpu driver, it is difficult to find the root cause. Therefor print a warnning message if the system can't access the extended configuration space register when using large bar. Signed-off-by: Ma Jun <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: implement RAS ACA driver frameworkYang Wang2024-01-151-0/+6
| | | | | | | | | | | | | | | | v1: implement new RAS ACA driver code framework. v2: - rename aca_bank_set to aca_banks. - rename aca_source_xxx to aca_handle_xxx. v3: Optimize some function implementation details. (from Hawking's suggestion) Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Init pcie_index/data address as fallback (v2)Hawking Zhang2024-01-151-5/+18
| | | | | | | | | | | | | To allow using this helper for indirect access when nbio funcs is not available. For instance, in ip discovery phase. v2: define macro for pcie_index/data/index_hi fallback. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: drop psp v13 query_boot_status implementationHawking Zhang2024-01-151-2/+0
| | | | | | | | | | Will replace it with new implementation to cover boot fails in ip discovery phase. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>