aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu
Commit message (Collapse)AuthorAgeFilesLines
* drm/amdgpu: init return value in amdgpu_ttm_clear_bufferPierre-Eric Pelloux-Prayer2025-02-251-1/+1
| | | | | | | | | | | | | | | Otherwise an uninitialized value can be returned if amdgpu_res_cleared returns true for all regions. Possibly closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3812 Fixes: a68c7eaa7a8f ("drm/amdgpu: Enable clear page functionality") Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]> Acked-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 7c62aacc3b452f73a1284198c81551035fac6d71) Cc: [email protected]
* drm/amdgpu/mes: keep enforce isolation up to dateAlex Deucher2025-02-255-11/+32
| | | | | | | | | | | | | Re-send the mes message on resume to make sure the mes state is up to date. Fixes: 8521e3c5f058 ("drm/amd/amdgpu: limit single process inside MES") Acked-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: Shaoyun Liu <[email protected]> Cc: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 27b791514789844e80da990c456c2465325e0851)
* drm/amdgpu/gfx: only call mes for enforce isolation if supportedAlex Deucher2025-02-251-2/+4
| | | | | | | | | | | | This should not be called on chips without MES so check if MES is enabled and if the cleaner shader is supported. Fixes: 8521e3c5f058 ("drm/amd/amdgpu: limit single process inside MES") Reviewed-by: Srinivasan Shanmugam <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: Shaoyun Liu <[email protected]> Cc: Srinivasan Shanmugam <[email protected]> (cherry picked from commit 80513e389765c8f9543b26d8fa4bbdf0e59ff8bc)
* drm/amdgpu: disable BAR resize on Dell G5 SEAlex Deucher2025-02-251-0/+7
| | | | | | | | | | | | | | | | | | There was a quirk added to add a workaround for a Sapphire RX 5600 XT Pulse that didn't allow BAR resizing. However, the quirk caused a regression with runtime pm on Dell laptops using those chips, rather than narrowing the scope of the resizing quirk, add a quirk to prevent amdgpu from resizing the BAR on those Dell platforms unless runtime pm is disabled. v2: update commit message, add runpm check Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/1707 Fixes: 907830b0fc9e ("PCI: Add a REBAR size quirk for Sapphire RX 5600 XT Pulse") Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 5235053f443cef4210606e5fb71f99b915a9723d) Cc: [email protected]
* drm/amdgpu: bail out when failed to load fw in psp_init_cap_microcode()Jiang Liu2025-02-131-2/+3
| | | | | | | | | | In function psp_init_cap_microcode(), it should bail out when failed to load firmware, otherwise it may cause invalid memory access. Fixes: 07dbfc6b102e ("drm/amd: Use `amdgpu_ucode_*` helpers for PSP") Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Jiang Liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: bump version for RV/PCO compute fixAlex Deucher2025-02-131-1/+2
| | | | | | | | | | Bump the driver version for RV/PCO compute stability fix so mesa can use this check to enable compute queues on RV/PCO. Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected] # 6.12.x
* drm/amdgpu/gfx9: manually control gfxoff for CS on RVAlex Deucher2025-02-131-2/+34
| | | | | | | | | | | | | | | | | | | | | | | | When mesa started using compute queues more often we started seeing additional hangs with compute queues. Disabling gfxoff seems to mitigate that. Manually control gfxoff and gfx pg with command submissions to avoid any issues related to gfxoff. KFD already does the same thing for these chips. v2: limit to compute v3: limit to APUs v4: limit to Raven/PCO v5: only update the compute ring_funcs v6: Disable GFX PG v7: adjust order Reviewed-by: Lijo Lazar <[email protected]> Suggested-by: Błażej Szczygieł <[email protected]> Suggested-by: Sergey Kovalenko <[email protected]> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3861 Link: https://lists.freedesktop.org/archives/amd-gfx/2025-January/119116.html Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected] # 6.12.x
* drm/amdgpu: add a BO metadata flag to disable write compression for VulkanMarek Olšák2025-02-034-5/+13
| | | | | | | | | | | | | Vulkan can't support DCC and Z/S compression on GFX12 without WRITE_COMPRESS_DISABLE in this commit or a completely different DCC interface. AMDGPU_TILING_GFX12_SCANOUT is added because it's already used by userspace. Cc: [email protected] # 6.12.x Signed-off-by: Marek Olšák <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/amdgpu: change the config of cgcg on gfx12Kenneth Feng2025-01-281-11/+0
| | | | | | | | | change the config of cgcg on gfx12 Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected] # 6.12.x
* drm/amd/amdgpu: Enable scratch data dump for mes 12Shaoyun Liu2025-01-242-14/+37
| | | | | | | | | | MES internal will check CP_MES_MSCRATCH_LO/HI register to set scratch data location during ucode start, driver side need to start the MES one by one with different setting for each pipe Signed-off-by: Shaoyun Liu <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd: Clarify kdoc for amdgpu.gttsizeMario Limonciello2025-01-241-1/+1
| | | | | | | | | Effectively amdgpu.gttsize gets set to ~1/2 of RAM, but that's controlled by what the TTM page limit is set to. Clarify the kdoc. Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd/amdgpu: Prevent null pointer dereference in GPU bandwidth calculationSrinivasan Shanmugam2025-01-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the parent is NULL, adev->pdev is used to retrieve the PCIe speed and width, ensuring that the function can still determine these capabilities from the device itself. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6193 amdgpu_device_gpu_bandwidth() error: we previously assumed 'parent' could be null (see line 6180) drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 6170 static void amdgpu_device_gpu_bandwidth(struct amdgpu_device *adev, 6171 enum pci_bus_speed *speed, 6172 enum pcie_link_width *width) 6173 { 6174 struct pci_dev *parent = adev->pdev; 6175 6176 if (!speed || !width) 6177 return; 6178 6179 parent = pci_upstream_bridge(parent); 6180 if (parent && parent->vendor == PCI_VENDOR_ID_ATI) { ^^^^^^ If parent is NULL 6181 /* use the upstream/downstream switches internal to dGPU */ 6182 *speed = pcie_get_speed_cap(parent); 6183 *width = pcie_get_width_cap(parent); 6184 while ((parent = pci_upstream_bridge(parent))) { 6185 if (parent->vendor == PCI_VENDOR_ID_ATI) { 6186 /* use the upstream/downstream switches internal to dGPU */ 6187 *speed = pcie_get_speed_cap(parent); 6188 *width = pcie_get_width_cap(parent); 6189 } 6190 } 6191 } else { 6192 /* use the device itself */ --> 6193 *speed = pcie_get_speed_cap(parent); ^^^^^^ Then we are toasted here. 6194 *width = pcie_get_width_cap(parent); 6195 } 6196 } Fixes: 757e8b951ce2 ("drm/amdgpu: cache gpu pcie link width") Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Suggested-by: Lijo Lazar <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix ring timeout issue in gfx10 sr-iov environmentLin.Cao2025-01-241-1/+1
| | | | | | | | | | | | | commit 26c95e838e63 ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare") set job->vm as NULL if there is no fence. It will cause emit switch buffer be skippen if job->vm set as NULL. Check job rather than vm could solve this problem. Fixes: 26c95e838e63 ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare") Signed-off-by: Lin.Cao <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix the PCIe lanes reporting in the INFO IOCTLAlex Deucher2025-01-241-8/+11
| | | | | | | | | | Combine the platform and GPU caps like we do for PCIe Gen. This aligns properly with expectations and documentation for the interface. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3820 Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: cache gpu pcie link widthAlex Deucher2025-01-241-32/+120
| | | | | | | | | | | Get the PCIe link with of the device itself (or it's integrated upstream bridge) and cache that. v2: fix typo Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3820 Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Refine ip detection log messageLijo Lazar2025-01-241-2/+2
| | | | | | | | | | 'add ip block' causes a confusion if the blocks are disabled later with ip_block_mask. Instead change to 'detected' and also add device context. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Suggested-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add handler for SDMA context emptyLijo Lazar2025-01-242-0/+23
| | | | | | | | | | | Context empty interrupt is enabled for SDMA 4.4.2. Add a handler for context empty interrupt so that it is disposed of fast, and not propagated to KFD layer. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Suggested-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu/gfx12: Add Cleaner Shader Support for GFX12.0 GPUsSrinivasan Shanmugam2025-01-141-0/+8
| | | | | | | | | | | | | | | | | | | This commit enables the cleaner shader feature for GFX12.0 and GFX12.0.1 GPUs. The cleaner shader is important for clearing GPU resources such as Local Data Share (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs) between workloads. - This feature ensures that GPU resources are reset between workloads, preventing data leaks and ensuring accurate computation. By enabling the cleaner shader, this update enhances the security and reliability of GPU operations on GFX12.0 hardware. Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Mark debug KFD module params as unsafeKent Russell2025-01-141-5/+5
| | | | | | | | | | Mark options only meant to be used for debugging as unsafe so that the kernel is tainted when they are used. Signed-off-by: Kent Russell <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: fix fw attestation for MP0_14_0_{2/3}Gui Chengming2025-01-141-0/+4
| | | | | | | | | | | | | | FW attestation was disabled on MP0_14_0_{2/3}. V2: Move check into is_fw_attestation_support func. (Frank) Remove DRM_WARN log info. (Alex) Fix format. (Christian) Signed-off-by: Gui Chengming <[email protected]> Reviewed-by: Frank.Min <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: always sync the GFX pipe on ctx switchChristian König2025-01-141-2/+2
| | | | | | | | That is needed to enforce isolation between contexts. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: mark a bunch of module parameters unsafeChristian König2025-01-141-7/+7
| | | | | | | | | | | | | We sometimes have people trying to use debugging options in production environments. Mark options only meant to be used for debugging as unsafe so that the kernel is tainted when they are used. Signed-off-by: Christian König <[email protected]> Acked-by: Felix Kuehling <[email protected]> Acked-by: Simona Vetter <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use DRM scheduler API in amdgpu_xcp_release_schedTvrtko Ursulin2025-01-141-1/+1
| | | | | | | | | | | | | Lets use the existing helper instead of peeking into the structure directly. Reviewed-by: Christian König <[email protected]> Signed-off-by: Tvrtko Ursulin <[email protected]> Cc: Christian König <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Philipp Stanner <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: disable gfxoff with the compute workload on gfx12Kenneth Feng2025-01-141-2/+3
| | | | | | | | | Disable gfxoff with the compute workload on gfx12. This is a workaround for the opencl test failure. Signed-off-by: Kenneth Feng <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix Circular Locking Dependency in AMDGPU GFX IsolationSrinivasan Shanmugam2025-01-141-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit addresses a circular locking dependency issue within the GFX isolation mechanism. The problem was identified by a warning indicating a potential deadlock due to inconsistent lock acquisition order. - The `amdgpu_gfx_enforce_isolation_ring_begin_use` and `amdgpu_gfx_enforce_isolation_ring_end_use` functions previously acquired `enforce_isolation_mutex` and called `amdgpu_gfx_kfd_sch_ctrl`, leading to potential deadlocks. ie., If `amdgpu_gfx_kfd_sch_ctrl` is called while `enforce_isolation_mutex` is held, and `amdgpu_gfx_enforce_isolation_handler` is called while `kfd_sch_mutex` is held, it can create a circular dependency. By ensuring consistent lock usage, this fix resolves the issue: [ 606.297333] ====================================================== [ 606.297343] WARNING: possible circular locking dependency detected [ 606.297353] 6.10.0-amd-mlkd-610-311224-lof #19 Tainted: G OE [ 606.297365] ------------------------------------------------------ [ 606.297375] kworker/u96:3/3825 is trying to acquire lock: [ 606.297385] ffff9aa64e431cb8 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}, at: __flush_work+0x232/0x610 [ 606.297413] but task is already holding lock: [ 606.297423] ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.297725] which lock already depends on the new lock. [ 606.297738] the existing dependency chain (in reverse order) is: [ 606.297749] -> #2 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}: [ 606.297765] __mutex_lock+0x85/0x930 [ 606.297776] mutex_lock_nested+0x1b/0x30 [ 606.297786] amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.298007] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.298225] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.298412] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.298603] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.298866] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.298880] process_one_work+0x21e/0x680 [ 606.298890] worker_thread+0x190/0x350 [ 606.298899] kthread+0xe7/0x120 [ 606.298908] ret_from_fork+0x3c/0x60 [ 606.298919] ret_from_fork_asm+0x1a/0x30 [ 606.298929] -> #1 (&adev->enforce_isolation_mutex){+.+.}-{3:3}: [ 606.298947] __mutex_lock+0x85/0x930 [ 606.298956] mutex_lock_nested+0x1b/0x30 [ 606.298966] amdgpu_gfx_enforce_isolation_handler+0x87/0x370 [amdgpu] [ 606.299190] process_one_work+0x21e/0x680 [ 606.299199] worker_thread+0x190/0x350 [ 606.299208] kthread+0xe7/0x120 [ 606.299217] ret_from_fork+0x3c/0x60 [ 606.299227] ret_from_fork_asm+0x1a/0x30 [ 606.299236] -> #0 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}: [ 606.299257] __lock_acquire+0x16f9/0x2810 [ 606.299267] lock_acquire+0xd1/0x300 [ 606.299276] __flush_work+0x250/0x610 [ 606.299286] cancel_delayed_work_sync+0x71/0x80 [ 606.299296] amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu] [ 606.299509] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.299723] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.299909] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.300101] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.300355] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.300369] process_one_work+0x21e/0x680 [ 606.300378] worker_thread+0x190/0x350 [ 606.300387] kthread+0xe7/0x120 [ 606.300396] ret_from_fork+0x3c/0x60 [ 606.300406] ret_from_fork_asm+0x1a/0x30 [ 606.300416] other info that might help us debug this: [ 606.300428] Chain exists of: (work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work) --> &adev->enforce_isolation_mutex --> &adev->gfx.kfd_sch_mutex [ 606.300458] Possible unsafe locking scenario: [ 606.300468] CPU0 CPU1 [ 606.300476] ---- ---- [ 606.300484] lock(&adev->gfx.kfd_sch_mutex); [ 606.300494] lock(&adev->enforce_isolation_mutex); [ 606.300508] lock(&adev->gfx.kfd_sch_mutex); [ 606.300521] lock((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)); [ 606.300536] *** DEADLOCK *** [ 606.300546] 5 locks held by kworker/u96:3/3825: [ 606.300555] #0: ffff9aa5aa1f5d58 ((wq_completion)comp_1.1.0){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680 [ 606.300577] #1: ffffaa53c3c97e40 ((work_completion)(&sched->work_run_job)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680 [ 606.300600] #2: ffff9aa64e463c98 (&adev->enforce_isolation_mutex){+.+.}-{3:3}, at: amdgpu_gfx_enforce_isolation_ring_begin_use+0x1c3/0x5d0 [amdgpu] [ 606.300837] #3: ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.301062] #4: ffffffff8c1a5660 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x70/0x610 [ 606.301083] stack backtrace: [ 606.301092] CPU: 14 PID: 3825 Comm: kworker/u96:3 Tainted: G OE 6.10.0-amd-mlkd-610-311224-lof #19 [ 606.301109] Hardware name: Gigabyte Technology Co., Ltd. X570S GAMING X/X570S GAMING X, BIOS F7 03/22/2024 [ 606.301124] Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched] [ 606.301140] Call Trace: [ 606.301146] <TASK> [ 606.301154] dump_stack_lvl+0x9b/0xf0 [ 606.301166] dump_stack+0x10/0x20 [ 606.301175] print_circular_bug+0x26c/0x340 [ 606.301187] check_noncircular+0x157/0x170 [ 606.301197] ? register_lock_class+0x48/0x490 [ 606.301213] __lock_acquire+0x16f9/0x2810 [ 606.301230] lock_acquire+0xd1/0x300 [ 606.301239] ? __flush_work+0x232/0x610 [ 606.301250] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.301261] ? mark_held_locks+0x54/0x90 [ 606.301274] ? __flush_work+0x232/0x610 [ 606.301284] __flush_work+0x250/0x610 [ 606.301293] ? __flush_work+0x232/0x610 [ 606.301305] ? __pfx_wq_barrier_func+0x10/0x10 [ 606.301318] ? mark_held_locks+0x54/0x90 [ 606.301331] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.301345] cancel_delayed_work_sync+0x71/0x80 [ 606.301356] amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu] [ 606.301661] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.302050] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.302069] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.302452] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.302862] ? drm_sched_entity_error+0x82/0x190 [gpu_sched] [ 606.302890] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.303366] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.303388] process_one_work+0x21e/0x680 [ 606.303409] worker_thread+0x190/0x350 [ 606.303424] ? __pfx_worker_thread+0x10/0x10 [ 606.303437] kthread+0xe7/0x120 [ 606.303449] ? __pfx_kthread+0x10/0x10 [ 606.303463] ret_from_fork+0x3c/0x60 [ 606.303476] ? __pfx_kthread+0x10/0x10 [ 606.303489] ret_from_fork_asm+0x1a/0x30 [ 606.303512] </TASK> v2: Refactor lock handling to resolve circular dependency (Alex) - Introduced a `sched_work` flag to defer the call to `amdgpu_gfx_kfd_sch_ctrl` until after releasing `enforce_isolation_mutex`. - This change ensures that `amdgpu_gfx_kfd_sch_ctrl` is called outside the critical section, preventing the circular dependency and deadlock. - The `sched_work` flag is set within the mutex-protected section if conditions are met, and the actual function call is made afterward. - This approach ensures consistent lock acquisition order. Fixes: afefd6f24502 ("drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization") Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Suggested-by: Alex Deucher <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* Merge tag 'amd-drm-next-6.14-2025-01-10' of ↵Dave Airlie2025-01-1316-45/+269
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.14-2025-01-10: amdgpu: - Fix max surface handling in DC - clang fixes - DCN 3.5 fixes - DCN 4.0.1 fixes - DC CRC fixes - DML updates - DSC fixes - PSR fixes - DC add some divide by 0 checks - SMU13 updates - SR-IOV fixes - RAS fixes - Cleaner shader support for gfx10.3 dGPUs - fix drm buddy trim handling - SDMA engine reset updates _ Fix RB bitmap setup - Fix doorbell ttm cleanup - Add CEC notifier support - DPIA updates - MST fixes amdkfd: - Shader debugger fixes - Trap handler cleanup - Cleanup includes - Eviction fence wq fix Signed-off-by: Dave Airlie <[email protected]> From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
| * drm/amdgpu: fill the ucode bo during psp resume for SRIOVVictor Zhao2025-01-091-4/+5
| | | | | | | | | | | | | | | | | | refill the ucode bo during psp resume for SRIOV, otherwise ucode load will fail after VM hibernation and fb clean. Signed-off-by: Victor Zhao <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu/gfx10: Enable cleaner shader for GFX10.3.2/10.3.4/10.3.5 GPUsSrinivasan Shanmugam2025-01-091-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable the cleaner shader for GFX10.3.2/10.3.4/10.3.5 GPUs to provide data isolation between GPU workloads. The cleaner shader is responsible for clearing the Local Data Store (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs), which helps prevent data leakage and ensures accurate computation results. This update extends cleaner shader support to GFX10.3.2/10.3.4/10.3.5 GPUs, previously available for GFX10.3.0. It enhances security by clearing GPU memory between processes and maintains a consistent GPU state across KGD and KFD workloads. Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: fix gpu recovery disable with per queue resetJonathan Kim2025-01-091-0/+6
| | | | | | | | | | | | | | | | | | | | Per queue reset should be bypassed when gpu recovery is disabled with module parameter. Fixes: ee0a469cf917 ("drm/amdkfd: support per-queue reset on gfx9") Signed-off-by: Jonathan Kim <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: wrong array index to get ip block for PSPJiang Liu2025-01-091-2/+6
| | | | | | | | | | | | | | | | | | | | | | The adev->ip_blocks array is not indexed by AMD_IP_BLOCK_TYPE_xxx, instead we should call amdgpu_device_ip_get_ip_block() to get the corresponding IP block oject. Fix some checkpatch issues (Alex) Signed-off-by: Jiang Liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: tear down ttm range manager for doorbell in amdgpu_ttm_fini()Jiang Liu2025-01-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | Tear down ttm range manager for doorbell in function amdgpu_ttm_fini(), to avoid memory leakage. Fixes: 792b84fb9038 ("drm/amdgpu: initialize ttm for doorbells") Signed-off-by: Jiang Liu <[email protected]> Signed-off-by: Kent Russell <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: fix incorrect number of active RBs for gfx12Tim Huang2025-01-091-2/+5
| | | | | | | | | | | | | | | | | | The RB bitmap should be global active RB bitmap & active RB bitmap based on active SA. Signed-off-by: Tim Huang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: fix incorrect active RB bitmap in setup RBsTim Huang2025-01-091-1/+4
| | | | | | | | | | | | | | | | | | | | The RB bitmap width per SA may be 0x1 for some ASICs. Use the actual bitmap of SA instead of 0x3 to determine the active RB bitmap. Signed-off-by: Tim Huang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Fix shift type in amdgpu_debugfs_sdma_sched_mask_set()Dan Carpenter2025-01-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "mask" and "val" variables are type u64. The problem is that the BIT() macros are type unsigned long which is just 32 bits on 32bit systems. It's unlikely that people will be using this driver on 32bit kernels and even if they did we only use the lower AMDGPU_MAX_SDMA_INSTANCES (16) bits. So this bug does not affect anything in real life. Still, for correctness sake, u64 bit masks should use BIT_ULL(). Fixes: d2e3961ae371 ("drm/amdgpu: add amdgpu_sdma_sched_mask debugfs") Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: Mario Limonciello <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: enable gfx12 queue reset flagJesse Zhang2025-01-091-1/+9
| | | | | | | | | | | | | | | | Enable the kgq and kcq queue reset flag Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Tim Huang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu/sdma4.4.2: add apu support in sdma queue resetJesse Zhang2025-01-091-1/+1
| | | | | | | | | | | | | | | | Remove apu check in sdma queue reset. Signed-off-by: Jesse Zhang <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Remove unnecessary NULL checkKent Russell2025-01-061-4/+2
| | | | | | | | | | | | | | | | | | container_of cannot return NULL, so it is unnecessary to check for NULL after gem_to_amdgpu_bo, which is just a container_of call Signed-off-by: Kent Russell <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Add a lock when accessing the buddy trim functionArunpravin Paneer Selvam2025-01-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running YouTube videos and Steam games simultaneously, the tester found a system hang / race condition issue with the multi-display configuration setting. Adding a lock to the buddy allocator's trim function would be the solution. <log snip> [ 7197.250436] general protection fault, probably for non-canonical address 0xdead000000000108 [ 7197.250447] RIP: 0010:__alloc_range+0x8b/0x340 [amddrm_buddy] [ 7197.250470] Call Trace: [ 7197.250472] <TASK> [ 7197.250475] ? show_regs+0x6d/0x80 [ 7197.250481] ? die_addr+0x37/0xa0 [ 7197.250483] ? exc_general_protection+0x1db/0x480 [ 7197.250488] ? drm_suballoc_new+0x13c/0x93d [drm_suballoc_helper] [ 7197.250493] ? asm_exc_general_protection+0x27/0x30 [ 7197.250498] ? __alloc_range+0x8b/0x340 [amddrm_buddy] [ 7197.250501] ? __alloc_range+0x109/0x340 [amddrm_buddy] [ 7197.250506] amddrm_buddy_block_trim+0x1b5/0x260 [amddrm_buddy] [ 7197.250511] amdgpu_vram_mgr_new+0x4f5/0x590 [amdgpu] [ 7197.250682] amdttm_resource_alloc+0x46/0xb0 [amdttm] [ 7197.250689] ttm_bo_alloc_resource+0xe4/0x370 [amdttm] [ 7197.250696] amdttm_bo_validate+0x9d/0x180 [amdttm] [ 7197.250701] amdgpu_bo_pin+0x15a/0x2f0 [amdgpu] [ 7197.250831] amdgpu_dm_plane_helper_prepare_fb+0xb2/0x360 [amdgpu] [ 7197.251025] ? try_wait_for_completion+0x59/0x70 [ 7197.251030] drm_atomic_helper_prepare_planes.part.0+0x2f/0x1e0 [ 7197.251035] drm_atomic_helper_prepare_planes+0x5d/0x70 [ 7197.251037] drm_atomic_helper_commit+0x84/0x160 [ 7197.251040] drm_atomic_nonblocking_commit+0x59/0x70 [ 7197.251043] drm_mode_atomic_ioctl+0x720/0x850 [ 7197.251047] ? __pfx_drm_mode_atomic_ioctl+0x10/0x10 [ 7197.251049] drm_ioctl_kernel+0xb9/0x120 [ 7197.251053] ? srso_alias_return_thunk+0x5/0xfbef5 [ 7197.251056] drm_ioctl+0x2d4/0x550 [ 7197.251058] ? __pfx_drm_mode_atomic_ioctl+0x10/0x10 [ 7197.251063] amdgpu_drm_ioctl+0x4e/0x90 [amdgpu] [ 7197.251186] __x64_sys_ioctl+0xa0/0xf0 [ 7197.251190] x64_sys_call+0x143b/0x25c0 [ 7197.251193] do_syscall_64+0x7f/0x180 [ 7197.251197] ? srso_alias_return_thunk+0x5/0xfbef5 [ 7197.251199] ? amdgpu_display_user_framebuffer_create+0x215/0x320 [amdgpu] [ 7197.251329] ? drm_internal_framebuffer_create+0xb7/0x1a0 [ 7197.251332] ? srso_alias_return_thunk+0x5/0xfbef5 Signed-off-by: Arunpravin Paneer Selvam <[email protected]> Fixes: 4a5ad08f5377 ("drm/amdgpu: Add address alignment support to DCC buffers") Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: reduce RLC safe mode request for gfx clock gatingPrike Liang2025-01-062-20/+6
| | | | | | | | | | | | | | | | | | | | | | The driver can only request one time for the power safe mode instead of polling and disabling the power feature each time prior to program the GFX clock gating control registers. This update will reduce the latency on the GFX clock gating entry. Signed-off-by: Prike Liang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu/gfx10: Add cleaner shader for GFX10.3.0Srinivasan Shanmugam2025-01-063-0/+195
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds the cleaner shader microcode for GFX10.3.0 GPUs. The cleaner shader is a piece of GPU code that is used to clear or initialize certain GPU resources, such as Local Data Share (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs). Clearing these resources is important for ensuring data isolation between different workloads running on the GPU. Without the cleaner shader, residual data from a previous workload could potentially be accessed by a subsequent workload, leading to data leaks and incorrect computation results. The cleaner shader microcode is represented as an array of 32-bit words (`gfx_10_3_0_cleaner_shader_hex`). This array is the binary representation of the cleaner shader code, which is written in a low-level GPU instruction set. When the cleaner shader feature is enabled, the AMDGPU driver loads this array into a specific location in the GPU memory. The GPU then reads this memory location to fetch and execute the cleaner shader instructions. The cleaner shader is executed automatically by the GPU at the end of each workload, before the next workload starts. This ensures that all GPU resources are in a clean state before the start of each workload. This addition is part of the cleaner shader feature implementation. The cleaner shader feature helps resource utilization by cleaning up GPU resources after they are used. It also enhances security and reliability by preventing data leaks between workloads. Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Fix error handling in amdgpu_ras_add_bad_pagesSrinivasan Shanmugam2025-01-061-5/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It ensures that appropriate error codes are returned when an error condition is detected Fixes the below; drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2849 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_umc_pages_in_a_row()' failed. drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2884 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_ras_mca2pa()' failed. v2: s/-EIO/-EINVAL, retained the use of -EINVAL from amdgpu_umc_pages_in_a_row & and amdgpu_ras_mca2pa_by_idx, when the RAS context is not initialized or the convert_ras_err_addr function is unavailable. (Thomas) V3: Returning 0 as the absence of eh_data is acceptable. (Tao) Fixes: a8d133e625ce ("drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes") Reported-by: Dan Carpenter <[email protected]> Cc: YiPeng Chai <[email protected]> Cc: Tao Zhou <[email protected]> Cc: Hawking Zhang <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * drm/amdgpu: Fix for MEC SJT FW Load Fail on VFyfeng12025-01-061-2/+7
| | | | | | | | | | | | | | | | | | Users might switch to ROCM build does not include MEC SJT FW and driver needs to consider this case.w Signed-off-by: yfeng1 <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* | Merge tag 'drm-misc-next-2025-01-06' of ↵Dave Airlie2025-01-0910-149/+238
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for 6.14: UAPI Changes: - Clarify drm memory stats documentation Cross-subsystem Changes: Core Changes: - sched: Documentation fixes, Driver Changes: - amdgpu: Track BO memory stats at runtime - amdxdna: Various fixes - hisilicon: New HIBMC driver - bridges: - Provide default implementation of atomic_check for HDMI bridges - it605: HDCP improvements, MCCS Support Signed-off-by: Dave Airlie <[email protected]> From: Maxime Ripard <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/20250106-augmented-kakapo-of-action-0cf000@houat
| * | drm/amdgpu: track bo memory stats at runtimeYunxiang Li2024-12-199-139/+232
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before, every time fdinfo is queried we try to lock all the BOs in the VM and calculate memory usage from scratch. This works okay if the fdinfo is rarely read and the VMs don't have a ton of BOs. If either of these conditions is not true, we get a massive performance hit. In this new revision, we track the BOs as they change states. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. With this new approach however, we would no longer be able to track active buffers. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Christian König <[email protected]>
| * | drm/amdgpu: remove unused function parameterYunxiang Li2024-12-196-12/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | amdgpu_vm_bo_invalidate doesn't use the adev parameter and not all callers have a reference to adev handy, so remove it for cleanliness. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Christian König <[email protected]>
| * | drm: make drm-active- stats optionalYunxiang Li2024-12-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When memory stats is generated fresh everytime by going though all the BOs, their active information is quite easy to get. But if the stats are tracked with BO's state this becomes harder since the job scheduling part doesn't really deal with individual buffers. Make drm-active- optional to enable amdgpu to switch to the second method. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Tvrtko Ursulin <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Christian König <[email protected]>
| * | Merge remote-tracking branch 'drm/drm-next' into drm-misc-nextMaarten Lankhorst2024-12-098-40/+91
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The v6.13-rc2 release included a bunch of breaking changes, specifically the MODULE_IMPORT_NS commit. Backmerge in order to fix them before the next pull-request. Include the fix from Stephen Roswell. Caused by commit 25c3fd1183c0 ("drm/virtio: Add a helper to map and note the dma addrs and lengths") Interacting with commit cdd30ebb1b9f ("module: Convert symbol namespace to string literal") Reported-by: Stephen Rothwell <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Maarten Lankhorst <[email protected]>
* | \ \ Merge tag 'amd-drm-next-6.14-2024-12-18' of ↵Dave Airlie2024-12-19150-1028/+4840
|\ \ \ \ | | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.14-2024-12-18: amdgpu: - RAS updates - ISP updates - SDMA queue reset support - Rework DPM powergating interfaces - Documentation updates and cleanups - Panel replay fixes - DCN 3.5 updates - DP tunneling fixes - Use a pm notifier to more gracefully handle VRAM eviction on suspend or hibernate - Add debugfs interfaces for forcing scheduling to specific engine instances - GG 9.5 updates - IH 4.4 updates - Make missing optional firmware less noisy - PSP 13.x updates - SMU 13.x updates - VCN 5.x updates - JPEG 5.x updates - Misc cleanups - GC 12.x updates - DRM panic support - DC FAMS updates - DSC fixes - job handling fixes amdkfd: - GG 9.5 updates - Logging improvements - Misc cleanups - Various Optimizations Signed-off-by: Dave Airlie <[email protected]> From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
| * | | drm/amdgpu: Handle NULL bo->tbo.resource (again) in amdgpu_vm_bo_updateMichel Dänzer2024-12-181-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Third time's the charm, I hope? Fixes: d3116756a710 ("drm/ttm: rename bo->mem and make it a pointer") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3837 Reviewed-by: Christian König <[email protected]> Signed-off-by: Michel Dänzer <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * | | drm/admgpu: replace kmalloc() and memcpy() with kmemdup()Mirsad Todorovac2024-12-181-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The static analyser tool gave the following advice: ./drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c:1266:7-14: WARNING opportunity for kmemdup → 1266 tmp = kmalloc(used_size, GFP_KERNEL); 1267 if (!tmp) 1268 return -ENOMEM; 1269 → 1270 memcpy(tmp, &host_telemetry->body.error_count, used_size); Replacing kmalloc() + memcpy() with kmemdump() doesn't change semantics. Original code works without fault, so this is not a bug fix but proposed improvement. Link: https://lwn.net/Articles/198928/ Fixes: 84a2947ecc85 ("drm/amdgpu: Implement virt req_ras_err_count") Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: Xinhui Pan <[email protected]> Cc: David Airlie <[email protected]> Cc: Simona Vetter <[email protected]> Cc: Zhigang Luo <[email protected]> Cc: Victor Skvortsov <[email protected]> Cc: Hawking Zhang <[email protected]> Cc: Lijo Lazar <[email protected]> Cc: Yunxiang Li <[email protected]> Cc: Jack Xiao <[email protected]> Cc: Vignesh Chander <[email protected]> Cc: Danijel Slivka <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Mirsad Todorovac <[email protected]> Signed-off-by: Alex Deucher <[email protected]>