aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
Commit message (Collapse)AuthorAgeFilesLines
* drm/amdgpu: rename enforce isolation variablesAlex Deucher2025-04-211-1/+1
| | | | | | | | | Since they will be used for both KFD and KGD user queues, rename them from kfd to userq. No intended functional change. Acked-by: Sunil Khatri <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu/userq: handle system suspend and resumeAlex Deucher2025-04-211-1/+13
| | | | | | | Unmap user queues on suspend and map them on resume. Reviewed-by: Sunil Khatri <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: adjust enforce_isolation handlingAlex Deucher2025-04-111-2/+20
| | | | | | | | | | | | Switch from a bool to an enum and allow more options for enforce isolation. There are now 3 modes of operation: - Disabled (0) - Enabled (serialization and cleaner shader) (1) - Enabled in legacy mode (no serialization or cleaner shader) (2) This provides better flexibility for more use cases. Acked-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Replace tmp_adev with hive in amdgpu_pci_slot_resetCe Sun2025-04-111-3/+3
| | | | | | | | | | | | | Checking hive is more readable. The following smatch warning: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6820 amdgpu_pci_slot_reset() warn: iterator used outside loop: 'tmp_adev' Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Ce Sun <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix warning of drm_mm_cleanZhenGuo Yin2025-04-111-1/+1
| | | | | | | | | | Kernel doorbell BOs needs to be freed before ttm_fini. Fixes: 54c30d2a8def ("drm/amdgpu: create kernel doorbell pages") Acked-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: ZhenGuo Yin <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/amdgpu: disable ASPM in some situationsKenneth Feng2025-04-111-0/+32
| | | | | | | | | disable ASPM with some ASICs on some specific platforms. required from PCIe controller owner. Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: store userq_managers in a list in adevAlex Deucher2025-04-081-0/+3
| | | | | | | | | | So we can iterate across them when we need to manage all user queues. v2: add uq_mgr to adev list in amdgpu_userq_mgr_init Reviewed-by: Arunpravin Paneer Selvam <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Remove the MES self testArunpravin Paneer Selvam2025-04-081-3/+0
| | | | | | | | | | | | Remove MES self test as this conflicts the userqueue fence interrupts. v2:(Christian) - remove the amdgpu_mes_self_test() function and any now unused code. Signed-off-by: Arunpravin Paneer Selvam <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Enable userq fence interrupt supportArunpravin Paneer Selvam2025-04-081-0/+2
| | | | | | | | | | | | | | | | | | | | Add support to handle the userqueue protected fence signal hardware interrupt. Create a xarray which maps the doorbell index to the fence driver address. This would help to retrieve the fence driver information when an userq fence interrupt is triggered. Firmware sends the doorbell offset value and this info is compared with the queue's mqd doorbell offset value. If they are same, we process the userq fence interrupt. v1:(Christian): - use xa_load to extract the fence driver. - move the amdgpu_userq_fence_driver_process call within the xa_lock as there is a chance that fence_drv might be freed. Signed-off-by: Arunpravin Paneer Selvam <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/amdgpu: decouple ASPM with pcie dpmKenneth Feng2025-04-081-2/+0
| | | | | | | | | ASPM doesn't need to be disabled if pcie dpm is disabled. So ASPM can be independantly enabled. Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: cancel gfx idle work in device suspend for s0ixAlex Deucher2025-04-081-0/+7
| | | | | | | | | | | | This is normally handled in the gfx IP suspend callbacks, but for S0ix, those are skipped because we don't want to touch gfx. So handle it in device suspend. Fixes: b9467983b774 ("drm/amdgpu: add dynamic workload profile switching for gfx10") Fixes: 963537ca2325 ("drm/amdgpu: add dynamic workload profile switching for gfx11") Fixes: 5f95a1549555 ("drm/amdgpu: add dynamic workload profile switching for gfx12") Reviewed-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdkfd: Drop workaround for GC v9.4.3 revID 0Apurv Mishra2025-04-071-1/+7
| | | | | | | | | Remove workaround code for the early engineering samples GC v9.4.3 SOCs with revID 0 Reviewed-by: Amber Lin <[email protected]> Signed-off-by: Apurv Mishra <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add rebar parameterAlex Deucher2025-04-071-0/+3
| | | | | | | | | | | | | Add a new parameter to disable BAR resizing. Note that this only disables the driver from attempting to resize the BAR, The BIOS may have resized the BAR at boot. Some teams have found this useful in debugging P2P DMA issues on systems where the available MMIO space did not allow for all of the GPUs present to resize their BARs. Reviewed-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Multi-GPU DPC recovery supportCe Sun2025-04-071-55/+113
| | | | | | | | Add support for DPC recover based on refactored code Signed-off-by: Ce Sun <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refactor amdgpu_device_gpu_recoverCe Sun2025-04-071-90/+144
| | | | | | | | | | | Split amdgpu_device_gpu_recover into the following stages: halt activities,asic reset,schedule resume and amdgpu resume. The reason is that the subsequent addition of dpc recover code will have a high similarity with gpu reset Signed-off-by: Ce Sun <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add isolation trace pointChristian König2025-03-211-0/+1
| | | | | | | | Note when we switch from one isolation owner to another. Signed-off-by: Christian König <[email protected]> Acked-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: rework how isolation is enforced v2Christian König2025-03-211-1/+97
| | | | | | | | | | | | | | | | | | Limiting the number of available VMIDs to enforce isolation causes some issues with gang submit and applying certain HW workarounds which require multiple VMIDs to work correctly. So instead start to track all submissions to the relevant engines in a per partition data structure and use the dma_fences of the submissions to enforce isolation similar to what a VMID limit does. v2: use ~0l for jobs without isolation to distinct it from kernel submissions which uses NULL for the owner. Add some warning when we are OOM. Signed-off-by: Christian König <[email protected]> Acked-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Enable amdgpu_ras_resume for gfx 9.5.0Ellen Pan2025-03-211-0/+1
| | | | | | | | This enables ras to be resumed after gpu recovery on mi350 sriov. Signed-off-by: Ellen Pan <[email protected]> Reviewed-by: Ahmad Rehman <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Skip pcie_replay_count sysfs creation for VFVictor Skvortsov2025-03-191-7/+20
| | | | | | | | | VFs cannot read the NAK_COUNTER register. This information is only available through PMFW metrics. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu/vcn: fix ref counting for ring based profile handlingAlex Deucher2025-03-191-0/+1
| | | | | | | | | | | | | | | | | We need to make sure the workload profile ref counts are balanced. This isn't currently the case because we can increment the count on submissions, but the decrement may be delayed as work comes in. Track when we enable the workload profile so the references are balanced. v2: switch to a mutex and active flag v3: fix mutex init Fixes: 1443dd3c67f6 ("drm/amd/pm: fix and simplify workload handling") Cc: Yang Wang <[email protected]> Cc: Kenneth Feng <[email protected]> Reviewed-by: Kenneth Feng <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu/gfx: fix ref counting for ring based profile handlingAlex Deucher2025-03-191-0/+1
| | | | | | | | | | | | | | | | | | We need to make sure the workload profile ref counts are balanced. This isn't currently the case because we can increment the count on submissions, but the decrement may be delayed as work comes in. Track when we enable the workload profile so the references are balanced. v2: switch to a mutex and active flag v3: fix mutex init Fixes: 8fdb3958e396 ("drm/amdgpu/gfx: add ring helpers for setting workload profile") Cc: Yang Wang <[email protected]> Cc: Kenneth Feng <[email protected]> Tested-by: Kenneth Feng <[email protected]> Reviewed-by: Kenneth Feng <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: grab an additional reference on the gang fence v2Christian König2025-03-191-1/+9
| | | | | | | | | | | We keep the gang submission fence around in adev, make sure that it stays alive. v2: fix memory leak on retry Signed-off-by: Christian König <[email protected]> Acked-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: release xcp_mgr on exitFlora Cui2025-03-181-0/+3
| | | | | | | | Free on driver cleanup. Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Flora Cui <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: don't free conflicting apertures for non-display devicesAlex Deucher2025-03-181-4/+11
| | | | | | | | | PCI_CLASS_ACCELERATOR_PROCESSING devices won't ever be the sysfb, so there is no need to free conflicting apertures. Reviewed-by: Kent Russell <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Calculate IP specific xgmi bandwidthLijo Lazar2025-03-141-0/+3
| | | | | | | | Use IP version specific xgmi speed/width for bandwidth calculation. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Jonathan Kim <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* Merge tag 'amd-drm-next-6.15-2025-03-07' of ↵Dave Airlie2025-03-091-4/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://gitlab.freedesktop.org/agd5f/linux into drm-next amdgpu: - Fix spelling typos - RAS updates - VCN 5.0.1 updates - SubVP fixes - DCN 4.0.1 fixes - MSO DPCD fixes - DIO encoder refactor - PCON fixes - Misc cleanups - DMCUB fixes - USB4 DP fixes - DM cleanups - Backlight cleanups and fixes - Support platform backlight curves - Misc code cleanups - SMU 14 fixes - JPEG 4.0.3 reset updates - SR-IOV fixes - SVM fixes - GC 12 DCC fixes - DC DCE 6.x fix - Hiberation fix amdkfd: - Fix possible NULL pointer in queue validation - Remove unnecessary CP domain validation - SDMA queue reset support - Add per process flags radeon: - Fix spelling typos - RS400 hyperZ fix UAPI: - Add KFD per process flags for setting precision Proposed user space: https://github.com/ROCm/ROCR-Runtime/commit/2a64fa5e06e80e0af36df4ce0c76ae52eeec0a9d Signed-off-by: Dave Airlie <[email protected]> From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
| * drm/amdgpu: Add support for CPERs on virtualizationTony Yi2025-03-051-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for CPERs on VFs. VFs do not receive PMFW messages directly; as such, they need to query them from the host. To avoid hitting host event guard, CPER queries need to be rate limited. CPER queries share the same RAS telemetry buffer as error count query, so a mutex protecting the shared buffer was added as well. For readability, the amdgpu_detect_virtualization was refactored into multiple individual functions. Signed-off-by: Tony Yi <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: disable BAR resize on Dell G5 SEAlex Deucher2025-02-251-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There was a quirk added to add a workaround for a Sapphire RX 5600 XT Pulse that didn't allow BAR resizing. However, the quirk caused a regression with runtime pm on Dell laptops using those chips, rather than narrowing the scope of the resizing quirk, add a quirk to prevent amdgpu from resizing the BAR on those Dell platforms unless runtime pm is disabled. v2: update commit message, add runpm check Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/1707 Fixes: 907830b0fc9e ("PCI: Add a REBAR size quirk for Sapphire RX 5600 XT Pulse") Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* | Merge tag 'amd-drm-next-6.15-2025-02-21' of ↵Dave Airlie2025-02-261-31/+71
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.15-2025-02-21: amdgpu: - Add OEM i2c support for RGB lights, etc. - Add support for GC 11.5.3 - Add support for GC 11.5.2 - Add support for SDMA 6.1.3 - Add support for NBIO 7.11.2 - Add support for NBIO 7.9.1 - Add support for MMHUB 3.3.2 - Add support for MMHUB 1.8.1 - Add support for SMU 14.0.5 - Add support for SMUIO 13.0.11 - Add support for PSP 14.0.5 - Add support for UMC 12.5.0 - Add support for DCN 3.6.0 - JPEG 4.0.3 updates - Add dynamic workload profile switching for GC 10-12 - support larger vbios sizes - GC 9.5.0 updates - SMU 13.0.12 updates - SMU 13.0.6 updates - IP discovery updates - GC 10 queue reset updates - DCN 4.0.1 updates - UHBR link rate fixes - Aborted suspend fix - Mark gttsize parameter as deprecated - GC 10 cleaner shader updates - PSR-SU fixes - Clean up PM4 headers - Cursor fixes - Enable devcoredump for JPEG - Misc cleanups - Runpm cleanups - MES updates - GC 9 gfxoff fixes - Vbios fetching cleanups - Documentation updates - Update secondary plane handling - DML2 updates - SDMA fixes for MI - Cleaner shader fixes for GC 11/12 - ACA updates - Initial JPEG queue reset support - RAS updates - Initial RAS CPER support - DCN/DCE panic screen handling cleanup - BT2020 fixes - SR-IOV fixes amdkfd: - synchronize pasid values between KGD and KFD - Misc cleanups - Improve GTT/VRAM handling for APUs - Topology updates - Fix user queue validation on GC 7/8 UAPI: - Enable "Broadcast RGB" drm property - Add INFO IOCTL query for virtualization mode Proposed userspace: https://github.com/ROCm/amdsmi/commit/e663bed7d6b3df79f5959e73981749b1f22ec698 From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Dave Airlie <[email protected]>
| * drm/amdgpu: update the handle ptr in get_clockgating_stateSunil Khatri2025-02-191-1/+2
| | | | | | | | | | | | | | | | | | Update the *handle to amdgpu_ip_block ptr for all functions pointers of get_clockgating_state. Signed-off-by: Sunil Khatri <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Check aca enabled inside cper init/fini funcXiang Liu2025-02-191-4/+2
| | | | | | | | | | | | | | | | | | Move code about checking aca enabled to the cper init/fini function to make code clean. Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Replace Mutex with Spinlock for RLCG register access to avoid ↵Srinivasan Shanmugam2025-02-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Priority Inversion in SRIOV RLCG Register Access is a way for virtual functions to safely access GPU registers in a virtualized environment., including TLB flushes and register reads. When multiple threads or VFs try to access the same registers simultaneously, it can lead to race conditions. By using the RLCG interface, the driver can serialize access to the registers. This means that only one thread can access the registers at a time, preventing conflicts and ensuring that operations are performed correctly. Additionally, when a low-priority task holds a mutex that a high-priority task needs, ie., If a thread holding a spinlock tries to acquire a mutex, it can lead to priority inversion. register access in amdgpu_virt_rlcg_reg_rw especially in a fast code path is critical. The call stack shows that the function amdgpu_virt_rlcg_reg_rw is being called, which attempts to acquire the mutex. This function is invoked from amdgpu_sriov_wreg, which in turn is called from gmc_v11_0_flush_gpu_tlb. The [ BUG: Invalid wait context ] indicates that a thread is trying to acquire a mutex while it is in a context that does not allow it to sleep (like holding a spinlock). Fixes the below: [ 253.013423] ============================= [ 253.013434] [ BUG: Invalid wait context ] [ 253.013446] 6.12.0-amdstaging-drm-next-lol-050225 #14 Tainted: G U OE [ 253.013464] ----------------------------- [ 253.013475] kworker/0:1/10 is trying to lock: [ 253.013487] ffff9f30542e3cf8 (&adev->virt.rlcg_reg_lock){+.+.}-{3:3}, at: amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu] [ 253.013815] other info that might help us debug this: [ 253.013827] context-{4:4} [ 253.013835] 3 locks held by kworker/0:1/10: [ 253.013847] #0: ffff9f3040050f58 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680 [ 253.013877] #1: ffffb789c008be40 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680 [ 253.013905] #2: ffff9f3054281838 (&adev->gmc.invalidate_lock){+.+.}-{2:2}, at: gmc_v11_0_flush_gpu_tlb+0x198/0x4f0 [amdgpu] [ 253.014154] stack backtrace: [ 253.014164] CPU: 0 UID: 0 PID: 10 Comm: kworker/0:1 Tainted: G U OE 6.12.0-amdstaging-drm-next-lol-050225 #14 [ 253.014189] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 253.014203] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/18/2024 [ 253.014224] Workqueue: events work_for_cpu_fn [ 253.014241] Call Trace: [ 253.014250] <TASK> [ 253.014260] dump_stack_lvl+0x9b/0xf0 [ 253.014275] dump_stack+0x10/0x20 [ 253.014287] __lock_acquire+0xa47/0x2810 [ 253.014303] ? srso_alias_return_thunk+0x5/0xfbef5 [ 253.014321] lock_acquire+0xd1/0x300 [ 253.014333] ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu] [ 253.014562] ? __lock_acquire+0xa6b/0x2810 [ 253.014578] __mutex_lock+0x85/0xe20 [ 253.014591] ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu] [ 253.014782] ? sched_clock_noinstr+0x9/0x10 [ 253.014795] ? srso_alias_return_thunk+0x5/0xfbef5 [ 253.014808] ? local_clock_noinstr+0xe/0xc0 [ 253.014822] ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu] [ 253.015012] ? srso_alias_return_thunk+0x5/0xfbef5 [ 253.015029] mutex_lock_nested+0x1b/0x30 [ 253.015044] ? mutex_lock_nested+0x1b/0x30 [ 253.015057] amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu] [ 253.015249] amdgpu_sriov_wreg+0xc5/0xd0 [amdgpu] [ 253.015435] gmc_v11_0_flush_gpu_tlb+0x44b/0x4f0 [amdgpu] [ 253.015667] gfx_v11_0_hw_init+0x499/0x29c0 [amdgpu] [ 253.015901] ? __pfx_smu_v13_0_update_pcie_parameters+0x10/0x10 [amdgpu] [ 253.016159] ? srso_alias_return_thunk+0x5/0xfbef5 [ 253.016173] ? smu_hw_init+0x18d/0x300 [amdgpu] [ 253.016403] amdgpu_device_init+0x29ad/0x36a0 [amdgpu] [ 253.016614] amdgpu_driver_load_kms+0x1a/0xc0 [amdgpu] [ 253.017057] amdgpu_pci_probe+0x1c2/0x660 [amdgpu] [ 253.017493] local_pci_probe+0x4b/0xb0 [ 253.017746] work_for_cpu_fn+0x1a/0x30 [ 253.017995] process_one_work+0x21e/0x680 [ 253.018248] worker_thread+0x190/0x330 [ 253.018500] ? __pfx_worker_thread+0x10/0x10 [ 253.018746] kthread+0xe7/0x120 [ 253.018988] ? __pfx_kthread+0x10/0x10 [ 253.019231] ret_from_fork+0x3c/0x60 [ 253.019468] ? __pfx_kthread+0x10/0x10 [ 253.019701] ret_from_fork_asm+0x1a/0x30 [ 253.019939] </TASK> v2: s/spin_trylock/spin_lock_irqsave to be safe (Christian). Fixes: e864180ee49b ("drm/amdgpu: Add lock around VF RLCG interface") Cc: lin cao <[email protected]> Cc: Jingwen Chen <[email protected]> Cc: Victor Skvortsov <[email protected]> Cc: Zhigang Luo <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Suggested-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: add RAS CPER ring bufferTao Zhou2025-02-171-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | And initialize it, this is a pure software ring to store RAS CPER data. v2: change ring size to 0x100000 v2: update the initialization of count_dw of cper ring, it's dword variable v3: skip VM inv eng for cper v3: init/fini when aca enabled Signed-off-by: Tao Zhou <[email protected]> Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Introduce funcs for populating CPERHawking Zhang2025-02-171-0/+4
| | | | | | | | | | | | | | | | | | | | | | Introduce utility functions designed to assist in populating CPER records. v2: call cper_init/fini in device_ip_init/fini. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Make VBIOS image read optionalLijo Lazar2025-02-131-0/+3
| | | | | | | | | | | | | | | | | | Keep VBIOS image read optional for select SOCs in passthrough mode. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Add flag to make VBIOS read optionalLijo Lazar2025-02-131-20/+49
| | | | | | | | | | | | | | | | | | | | | | Certain SOCs may not need much data from VBIOS. Some data like VBIOS version used will be missed but it doesn't affect functionality. Add a flag to make VBIOS image optional. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Add VBIOS flagsLijo Lazar2025-02-131-7/+13
| | | | | | | | | | | | | | | | | | | | Instead of read_bios, use get_bios_flags to get various options around reading VBIOS. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Add wrapper for freeing vbios memoryLijo Lazar2025-02-131-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Use bios_release wrapper to release memory allocated for vbios image and reset the variables. v2: Use the same wrapper for clean up in sw_fini (Alex Deucher) Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: rework i2c init and finiAlex Deucher2025-02-131-4/+2
| | | | | | | | | | | | | | | | | | No functional change. Rework the code to allow for adding some additional i2c buses in conjunction with DC in the future. Reviewed-by: Harry Wentland <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* | drm/amdgpu: Use device wedged eventAndré Almeida2025-02-131-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use DRM's device wedged event to notify userspace that a reset had happened. For now, only use `none` method meant for telemetry capture. In the future we might want to report a recovery method if the reset didn't succeed. Acked-by: Shashank Sharma <[email protected]> Signed-off-by: André Almeida <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Rodrigo Vivi <[email protected]>
* | drm/sched: Use struct for drm_sched_init() paramsPhilipp Stanner2025-02-121-6/+12
|/ | | | | | | | | | | | | | | | | | | | | | | | | drm_sched_init() has a great many parameters and upcoming new functionality for the scheduler might add even more. Generally, the great number of parameters reduces readability and has already caused one missnaming, addressed in: commit 6f1cacf4eba7 ("drm/nouveau: Improve variable name in nouveau_sched_init()"). Introduce a new struct for the scheduler init parameters and port all users. Reviewed-by: Liviu Dudau <[email protected]> Acked-by: Matthew Brost <[email protected]> # for Xe Reviewed-by: Boris Brezillon <[email protected]> # for Panfrost and Panthor Reviewed-by: Christian Gmeiner <[email protected]> # for Etnaviv Reviewed-by: Frank Binns <[email protected]> # for Imagination Reviewed-by: Tvrtko Ursulin <[email protected]> # for Sched Reviewed-by: Maíra Canal <[email protected]> # for v3d Reviewed-by: Danilo Krummrich <[email protected]> Reviewed-by: Lizhi Hou <[email protected]> # for amdxdna Signed-off-by: Philipp Stanner <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
* drm/amd/amdgpu: Prevent null pointer dereference in GPU bandwidth calculationSrinivasan Shanmugam2025-01-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the parent is NULL, adev->pdev is used to retrieve the PCIe speed and width, ensuring that the function can still determine these capabilities from the device itself. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6193 amdgpu_device_gpu_bandwidth() error: we previously assumed 'parent' could be null (see line 6180) drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 6170 static void amdgpu_device_gpu_bandwidth(struct amdgpu_device *adev, 6171 enum pci_bus_speed *speed, 6172 enum pcie_link_width *width) 6173 { 6174 struct pci_dev *parent = adev->pdev; 6175 6176 if (!speed || !width) 6177 return; 6178 6179 parent = pci_upstream_bridge(parent); 6180 if (parent && parent->vendor == PCI_VENDOR_ID_ATI) { ^^^^^^ If parent is NULL 6181 /* use the upstream/downstream switches internal to dGPU */ 6182 *speed = pcie_get_speed_cap(parent); 6183 *width = pcie_get_width_cap(parent); 6184 while ((parent = pci_upstream_bridge(parent))) { 6185 if (parent->vendor == PCI_VENDOR_ID_ATI) { 6186 /* use the upstream/downstream switches internal to dGPU */ 6187 *speed = pcie_get_speed_cap(parent); 6188 *width = pcie_get_width_cap(parent); 6189 } 6190 } 6191 } else { 6192 /* use the device itself */ --> 6193 *speed = pcie_get_speed_cap(parent); ^^^^^^ Then we are toasted here. 6194 *width = pcie_get_width_cap(parent); 6195 } 6196 } Fixes: 757e8b951ce2 ("drm/amdgpu: cache gpu pcie link width") Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Suggested-by: Lijo Lazar <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: cache gpu pcie link widthAlex Deucher2025-01-241-32/+120
| | | | | | | | | | | Get the PCIe link with of the device itself (or it's integrated upstream bridge) and cache that. v2: fix typo Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3820 Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Refine ip detection log messageLijo Lazar2025-01-241-2/+2
| | | | | | | | | | 'add ip block' causes a confusion if the blocks are disabled later with ip_block_mask. Instead change to 'detected' and also add device context. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Suggested-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd: Require CONFIG_HOTPLUG_PCI_PCIE for BOCOMario Limonciello2024-12-181-0/+3
| | | | | | | | | | | | | | | If the kernel hasn't been compiled with PCIe hotplug support this can lead to problems with dGPUs that use BOCO because they effectively drop off the bus. To prevent issues, disable BOCO support when compiled without PCIe hotplug. Reported-by: Gabriel Marcano <[email protected]> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/1707#note_2696862 Acked-by: Alex Deucher <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Avoid VF for RAS recovery source checkLijo Lazar2024-12-111-0/+1
| | | | | | | | | | | | | | VF device sets the RAS flag when mailbox data can't be read properly. There is no conclusive way to tell if the real source is RAS error. Therefore VF schedules a KFD based reset which doesn't set RAS source. SKip checking RAS source for any VF scheduled recovery. Signed-off-by: Lijo Lazar <[email protected]> Reported-by: Vojislav Tomasevic <[email protected]> Reviewed-by: Yiqing Yao <[email protected]> Tested-by: Yiqing Yao <[email protected]> Fixes: e1ee2111ca48 ("drm/amdgpu: Prefer RAS recovery for scheduler hang") Signed-off-by: Alex Deucher <[email protected]>
* drm/amd: Add the capability to mark certain firmware as "required"Mario Limonciello2024-12-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Some of the firmware that is loaded by amdgpu is not actually required. For example the ISP firmware on some SoCs is optional, and if it's not present the ISP IP block just won't be initialized. The firmware loader core however will show a warning when this happens like this: ``` Direct firmware load for amdgpu/isp_4_1_0.bin failed with error -2 ``` To avoid confusion for non-required firmware, adjust the amd-ucode helper to take an extra argument indicating if the firmware is required or optional. On optional firmware use firmware_request_nowarn() instead of request_firmware() to avoid the warnings. Reviewed-by: Alex Deucher <[email protected]> Link: https://lore.kernel.org/amd-gfx/[email protected]/T/#t Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add initial support for gfx950Le Ma2024-12-101-0/+1
| | | | | | | | add gfx950 basic support Signed-off-by: Le Ma <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: device: fix spellos and punctuationRandy Dunlap2024-12-101-15/+15
| | | | | | | | | | | | | Make spelling and punctuation changes to ease reading of the comments. Signed-off-by: Randy Dunlap <[email protected]> Cc: Alex Deucher <[email protected]> Cc: Christian König <[email protected]> Cc: Xinhui Pan <[email protected]> Cc: [email protected] Cc: David Airlie <[email protected]> Cc: Simona Vetter <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd: Add Suspend/Hibernate notification callback supportMario Limonciello2024-12-101-1/+45
| | | | | | | | | | | | | | | | | | | | | | | As part of the suspend sequence VRAM needs to be evicted on dGPUs. In order to make suspend/resume more reliable we moved this into the pmops prepare() callback so that the suspend sequence would fail but the system could remain operational under high memory usage suspend. Another class of issues exist though where due to memory fragementation there isn't a large enough contiguous space and swap isn't accessible. Add support for a suspend/hibernate notification callback that could evict VRAM before tasks are frozen. This should allow paging out to swap if necessary. Link: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174 Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3476 Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2362 Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3781 Reviewed-by: Lijo Lazar <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>