kernel - saturneric's kernel source tree

	Commit message (Collapse)	Author	Age	Files	Lines
*	drm/amdgpu: fix a memory leak in fence cleanup when unloading	Alex Deucher	2025-09-09	1	-2/+0
	Commit b61badd20b44 ("drm/amdgpu: fix usage slab after free") reordered when amdgpu_fence_driver_sw_fini() was called. After that patch, amdgpu_fence_driver_sw_fini() effectively became a no-op, as the sched entities were never freed because the ring pointers were already set to NULL. Remove the NULL setting.

	Reported-by: Lin.Cao <[email protected]>
	Cc: Vitaly Prosyak <[email protected]>
	Cc: Christian König <[email protected]>
	Fixes: b61badd20b44 ("drm/amdgpu: fix usage slab after free")
	Reviewed-by: Christian König <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
	(cherry picked from commit a525fa37aac36c4591cc8b07ae8957862415fbd5)
	Cc: [email protected]
*	drm/amdgpu: track whether a queue is a kernel queue in amdgpu_mqd_prop	Alex Deucher	2025-07-28	1	-0/+1
	Used to set the MQD appropriately for each queue type. Kernel queues have additional privileges.

	Acked-by: Christian König <[email protected]>
	Reviewed-by: Lijo Lazar <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
	Cc: [email protected] # 6.16.x
*	drm/amdgpu: move reset support type checks into the caller	Alex Deucher	2025-07-17	1	-0/+31
\| \| \| \| \| \| \| \|	Rather than checking in the callbacks, check if the reset type is supported in the caller. Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: Increase reset counter only on success	Lijo Lazar	2025-07-16	1	-2/+7
\| \| \| \| \| \| \| \| \| \| \|	Increment the reset counter only if soft recovery succeeded. This is consistent with a ring hard reset behaviour where counter gets incremented only if hard reset succeeded. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
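A minimal sketch of the "count only on success" rule described above. The struct and function names are hypothetical stand-ins for illustration, not the driver's real types or API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for device state; not the real struct amdgpu_device. */
struct toy_device {
	unsigned int reset_counter;
};

/* Hypothetical recovery hook: returns true when the ring came back. */
typedef bool (*toy_recover_fn)(void *ctx);

static bool toy_soft_recover(struct toy_device *dev, toy_recover_fn recover,
			     void *ctx)
{
	bool ok = recover(ctx);

	/* Count only the recoveries that actually succeeded. */
	if (ok)
		dev->reset_counter++;
	return ok;
}

/* Two trivial hooks used to exercise both outcomes. */
static bool toy_recover_ok(void *ctx)   { (void)ctx; return true; }
static bool toy_recover_fail(void *ctx) { (void)ctx; return false; }
```

With this shape, a failed soft recovery leaves the counter untouched, which matches the hard-reset behaviour the message refers to.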
*	drm/amdgpu: track ring state associated with a fence	Alex Deucher	2025-07-16	1	-0/+67
\| \| \| \| \| \| \| \| \| \| \| \|	We need to know the wptr and sequence number associated with a fence so that we can re-emit the unprocessed state after a ring reset. Pre-allocate storage space for the ring buffer contents and add helpers to save off and re-emit the unprocessed state so that it can be re-emitted after the queue is reset. Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: remove is_mes_queue flag	Alex Deucher	2025-04-08	1	-74/+38
\| \| \| \| \| \| \| \| \|	This was leftover from MES bring up when we had MES user queues in the kernel. It's no longer used so remove it. Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: stop unmapping MQD for kernel queues v3	Christian König	2025-03-26	1	-50/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This looks unnecessary and actually extremely harmful since using kmap() is not possible while inside the ring reset. Remove all the extra mapping and unmapping of the MQDs. v2: also fix debugfs v3: fix coding style typo Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: Add support for CPERs on virtualization	Tony Yi	2025-03-05	1	-3/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add support for CPERs on VFs. VFs do not receive PMFW messages directly; as such, they need to query them from the host. To avoid hitting host event guard, CPER queries need to be rate limited. CPER queries share the same RAS telemetry buffer as error count query, so a mutex protecting the shared buffer was added as well. For readability, the amdgpu_detect_virtualization was refactored into multiple individual functions. Signed-off-by: Tony Yi <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: Introduce cached_rptr and is_guilty callback in amdgpu_ring	[email protected]	2025-02-25	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \|	This patch introduces the following changes: - Add `cached_rptr` to the `amdgpu_ring` structure to store the read pointer before a reset. - Add `is_guilty` callback to the `amdgpu_ring_funcs` structure to check if a ring is guilty of causing a timeout. Suggested-by: Alex Deucher <[email protected]> Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: add mutex lock for cper ring	Tao Zhou	2025-02-17	1	-5/+16
	Avoid conflicts between reads and writes of the ring buffer.

	Signed-off-by: Tao Zhou <[email protected]>
	Reviewed-by: Hawking Zhang <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: read CPER ring via debugfs	Tao Zhou	2025-02-17	1	-11/+36
\| \| \| \| \| \| \| \| \|	We read CPER data from read pointer to write pointer without changing the pointers. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: add RAS CPER ring buffer	Tao Zhou	2025-02-17	1	-11/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	And initialize it, this is a pure software ring to store RAS CPER data. v2: change ring size to 0x100000 v2: update the initialization of count_dw of cper ring, it's dword variable v3: skip VM inv eng for cper v3: init/fini when aca enabled Signed-off-by: Tao Zhou <[email protected]> Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: drop volatile from ring buffer	Christian König	2024-10-28	1	-7/+3
\| \| \| \| \| \| \| \| \| \| \| \| \|	Volatile only prevents the compiler from re-ordering reads and writes. Since we always only modify the ring buffer from one CPU thread and have an explicit barrier before signaling the HW this should have no effect at all and just prevents compiler optimisations. While at it drop the local variables as well. Signed-off-by: Christian König <[email protected]> Reviewed-by: Sunil Khatri <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: optimize insert_nop using multi dwords	Sunil Khatri	2024-10-15	1	-3/+19
	Optimize the ring_insert_nop function to write n dwords in one step rather than calling amdgpu_ring_write for each NOP packet. This avoids a function call for each NOP packet, and wptr is updated only once.

	Signed-off-by: Sunil Khatri <[email protected]>
	Suggested-by: Christian König <[email protected]>
	Reviewed-by: Christian König <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
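The bulk NOP fill can be sketched roughly as follows. This is a user-space toy under stated assumptions: `struct toy_ring`, `fill32()` and `toy_ring_insert_nop()` are hypothetical names, the buffer size is assumed to be a power of two, and the write pointer is left unmasked for simplicity. It is not the driver's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ring; field names are illustrative only. */
struct toy_ring {
	uint32_t *buf;
	uint32_t buf_mask;	/* size in dwords - 1 (size is a power of two) */
	uint64_t wptr;		/* write pointer in dwords, never masked here */
	uint32_t count_dw;	/* dwords still reserved for this submission */
	uint32_t nop;		/* NOP packet value */
};

/* Fill n dwords with the same value. */
static void fill32(uint32_t *dst, uint32_t v, size_t n)
{
	while (n--)
		*dst++ = v;
}

static void toy_ring_insert_nop(struct toy_ring *ring, uint32_t count)
{
	uint32_t occupied, chunk1, chunk2;

	if (count == 0)		/* nothing to write, skip the work entirely */
		return;

	occupied = (uint32_t)(ring->wptr & ring->buf_mask);
	chunk1 = ring->buf_mask + 1 - occupied;	/* room before the wrap */
	if (chunk1 > count)
		chunk1 = count;
	chunk2 = count - chunk1;		/* remainder after the wrap */

	fill32(&ring->buf[occupied], ring->nop, chunk1);
	if (chunk2)
		fill32(ring->buf, ring->nop, chunk2);

	/* Update the bookkeeping once instead of once per dword. */
	ring->wptr += count;
	ring->count_dw -= count;
}
```

The point of the design is that the wrap-around is handled by at most two contiguous fills, so the per-dword call and pointer update disappear.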
*	drm/amdgpu: move error log from ring write to commit	Sunil Khatri	2024-10-08	1	-0/+3
	Move the error message out of ring write as an optimization: instead of printing that message on every write, print it once during commit if the writes exceeded the allocated size, i.e. ring->count_dw. Also, we do not want to log the error message between a ring write and its completion, as the overflow is mostly not harmful: it only overwrites stale data, since the GPU reads from the ring faster than the CPU writes to it. This reduces the size of the amdgpu.ko module by around 600 Kb, as write is a very frequently used function and hence so is the print.

	Signed-off-by: Sunil Khatri <[email protected]>
	Suggested-by: Christian König <[email protected]>
	Reviewed-by: Christian König <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
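A rough sketch of the idea, assuming a signed budget that is allowed to go negative on the hot path and is checked once at commit time. All names here (`struct toy_log_ring`, `toy_alloc()`, `toy_write()`, `toy_commit()`) are hypothetical; the real driver's structures and logging calls differ.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical toy ring used only to illustrate where the check lives. */
struct toy_log_ring {
	uint32_t buf[16];
	uint32_t wptr;
	int count_dw;			/* signed: may go negative on overflow */
	unsigned int overflow_reports;	/* how many times we complained */
};

static void toy_alloc(struct toy_log_ring *r, int ndw)
{
	r->count_dw = ndw;
}

/* Hot path: no check, no print, just write and account. */
static void toy_write(struct toy_log_ring *r, uint32_t v)
{
	r->buf[r->wptr++ & 15] = v;
	r->count_dw--;
}

/* Cold path: report an over-long submission once, at commit time. */
static bool toy_commit(struct toy_log_ring *r)
{
	if (r->count_dw < 0) {
		r->overflow_reports++;
		fprintf(stderr, "wrote more dwords than allocated\n");
		return false;
	}
	return true;
}
```

Keeping the format string and the branch out of the write helper is what lets an inlined write stay small at every call site.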
*	drm/amdgpu/mes: fix mes ring buffer overflow	Jack Xiao	2024-08-13	1	-0/+2
	Wait until enough room is available in the ring before writing MES packets, to avoid ring buffer overflow.

	v2: squash in sched_hw_submission fix

	Fixes: de3246254156 ("drm/amdgpu: cleanup MES11 command submission")
	Fixes: fffe347e1478 ("drm/amdgpu: cleanup MES12 command submission")
	Signed-off-by: Jack Xiao <[email protected]>
	Acked-by: Alex Deucher <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdpgu: Micro-optimise amdgpu_ring_commit	Tvrtko Ursulin	2024-08-06	1	-1/+1
	For some value of optimisation we can replace the division with a bitwise AND. And it even shrinks the code.

	Before:

	 6c9:  53                    push   %rbx
	 6ca:  4c 8b 47 08           mov    0x8(%rdi),%r8
	 6ce:  31 d2                 xor    %edx,%edx
	 6d0:  48 89 fb              mov    %rdi,%rbx
	 6d3:  8b 87 c8 05 00 00     mov    0x5c8(%rdi),%eax
	 6d9:  41 8b 48 04           mov    0x4(%r8),%ecx
	 6dd:  f7 d0                 not    %eax
	 6df:  21 c8                 and    %ecx,%eax
	 6e1:  83 c1 01              add    $0x1,%ecx
	 6e4:  83 c0 01              add    $0x1,%eax
	 6e7:  f7 f1                 div    %ecx
	 6e9:  89 d6                 mov    %edx,%esi
	 6eb:  41 ff 90 88 00 00 00  call   0x88(%r8)

	After:

	 6c9:  53                    push   %rbx
	 6ca:  48 8b 57 08           mov    0x8(%rdi),%rdx
	 6ce:  48 89 fb              mov    %rdi,%rbx
	 6d1:  8b 87 c8 05 00 00     mov    0x5c8(%rdi),%eax
	 6d7:  8b 72 04              mov    0x4(%rdx),%esi
	 6da:  f7 d0                 not    %eax
	 6dc:  21 f0                 and    %esi,%eax
	 6de:  83 c0 01              add    $0x1,%eax
	 6e1:  21 c6                 and    %eax,%esi
	 6e3:  ff 92 88 00 00 00     call   0x88(%rdx)

	Reviewed-by: Christian König <[email protected]>
	Reviewed-by: Sunil Khatri <[email protected]>
	Signed-off-by: Tvrtko Ursulin <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
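The arithmetic behind the two listings can be sketched in C as follows. The function names are hypothetical and the expressions are read off the disassembly rather than copied from the driver source; the equivalence holds only when `align_mask + 1` is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Padding needed to reach the next aligned boundary, using a modulo
 * (this is what the "div" in the "Before" listing computes). */
static uint32_t pad_count_mod(uint32_t wptr, uint32_t align_mask)
{
	return ((~wptr & align_mask) + 1) % (align_mask + 1);
}

/* Same result using a bitwise AND, as in the "After" listing. */
static uint32_t pad_count_and(uint32_t wptr, uint32_t align_mask)
{
	/* Only valid when align_mask + 1 is a power of two. */
	assert(((align_mask + 1) & align_mask) == 0);
	return ((~wptr & align_mask) + 1) & align_mask;
}
```

The intermediate value `(~wptr & align_mask) + 1` always lies in the range 1 to `align_mask + 1`, so the modulo only ever maps the top value to zero, and the mask does exactly the same.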
*	drm/amdgpu: do not call insert_nop fn for zero count	Sunil Khatri	2024-08-06	1	-1/+3
	Do not make a function call for a zero-size NOP, as it does not add anything to the ring and is an unnecessary function call.

	Reviewed-by: Christian König <[email protected]>
	Signed-off-by: Sunil Khatri <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: Fix out-of-bounds write warning	Ma Jun	2024-05-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Check the ring type value to fix the out-of-bounds write warning Signed-off-by: Ma Jun <[email protected]> Suggested-by: Christian König <[email protected]> Reviewed-by: Tim Huang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: fix overflowed array index read warning	Tim Huang	2024-04-30	1	-1/+2
\| \| \| \| \| \| \| \| \|	Clear overflowed array index read warning by cast operation. Signed-off-by: Tim Huang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: fix deadlock while reading mqd from debugfs	Johannes Weiner	2024-03-27	1	-17/+29
	An errant disk backup on my desktop got into debugfs and triggered the following deadlock scenario in the amdgpu debugfs files. The machine also hard-resets immediately after those lines are printed (although I wasn't able to reproduce that part when reading by hand):

	[ 1318.016074][ T1082] ======================================================
	[ 1318.016607][ T1082] WARNING: possible circular locking dependency detected
	[ 1318.017107][ T1082] 6.8.0-rc7-00015-ge0c8221b72c0 #17 Not tainted
	[ 1318.017598][ T1082] ------------------------------------------------------
	[ 1318.018096][ T1082] tar/1082 is trying to acquire lock:
	[ 1318.018585][ T1082] ffff98c44175d6a0 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x40/0x80
	[ 1318.019084][ T1082]
	[ 1318.019084][ T1082] but task is already holding lock:
	[ 1318.020052][ T1082] ffff98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
	[ 1318.020607][ T1082]
	[ 1318.020607][ T1082] which lock already depends on the new lock.
	[ 1318.020607][ T1082]
	[ 1318.022081][ T1082]
	[ 1318.022081][ T1082] the existing dependency chain (in reverse order) is:
	[ 1318.023083][ T1082]
	[ 1318.023083][ T1082] -> #2 (reservation_ww_class_mutex){+.+.}-{3:3}:
	[ 1318.024114][ T1082]        __ww_mutex_lock.constprop.0+0xe0/0x12f0
	[ 1318.024639][ T1082]        ww_mutex_lock+0x32/0x90
	[ 1318.025161][ T1082]        dma_resv_lockdep+0x18a/0x330
	[ 1318.025683][ T1082]        do_one_initcall+0x6a/0x350
	[ 1318.026210][ T1082]        kernel_init_freeable+0x1a3/0x310
	[ 1318.026728][ T1082]        kernel_init+0x15/0x1a0
	[ 1318.027242][ T1082]        ret_from_fork+0x2c/0x40
	[ 1318.027759][ T1082]        ret_from_fork_asm+0x11/0x20
	[ 1318.028281][ T1082]
	[ 1318.028281][ T1082] -> #1 (reservation_ww_class_acquire){+.+.}-{0:0}:
	[ 1318.029297][ T1082]        dma_resv_lockdep+0x16c/0x330
	[ 1318.029790][ T1082]        do_one_initcall+0x6a/0x350
	[ 1318.030263][ T1082]        kernel_init_freeable+0x1a3/0x310
	[ 1318.030722][ T1082]        kernel_init+0x15/0x1a0
	[ 1318.031168][ T1082]        ret_from_fork+0x2c/0x40
	[ 1318.031598][ T1082]        ret_from_fork_asm+0x11/0x20
	[ 1318.032011][ T1082]
	[ 1318.032011][ T1082] -> #0 (&mm->mmap_lock){++++}-{3:3}:
	[ 1318.032778][ T1082]        __lock_acquire+0x14bf/0x2680
	[ 1318.033141][ T1082]        lock_acquire+0xcd/0x2c0
	[ 1318.033487][ T1082]        __might_fault+0x58/0x80
	[ 1318.033814][ T1082]        amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu]
	[ 1318.034181][ T1082]        full_proxy_read+0x55/0x80
	[ 1318.034487][ T1082]        vfs_read+0xa7/0x360
	[ 1318.034788][ T1082]        ksys_read+0x70/0xf0
	[ 1318.035085][ T1082]        do_syscall_64+0x94/0x180
	[ 1318.035375][ T1082]        entry_SYSCALL_64_after_hwframe+0x46/0x4e
	[ 1318.035664][ T1082]
	[ 1318.035664][ T1082] other info that might help us debug this:
	[ 1318.035664][ T1082]
	[ 1318.036487][ T1082] Chain exists of:
	[ 1318.036487][ T1082]   &mm->mmap_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex
	[ 1318.036487][ T1082]
	[ 1318.037310][ T1082]  Possible unsafe locking scenario:
	[ 1318.037310][ T1082]
	[ 1318.037838][ T1082]        CPU0                    CPU1
	[ 1318.038101][ T1082]        ----                    ----
	[ 1318.038350][ T1082]   lock(reservation_ww_class_mutex);
	[ 1318.038590][ T1082]                                lock(reservation_ww_class_acquire);
	[ 1318.038839][ T1082]                                lock(reservation_ww_class_mutex);
	[ 1318.039083][ T1082]   rlock(&mm->mmap_lock);
	[ 1318.039328][ T1082]
	[ 1318.039328][ T1082]  *** DEADLOCK ***
	[ 1318.039328][ T1082]
	[ 1318.040029][ T1082] 1 lock held by tar/1082:
	[ 1318.040259][ T1082]  #0: ffff98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
	[ 1318.040560][ T1082]
	[ 1318.040560][ T1082] stack backtrace:
	[ 1318.041053][ T1082] CPU: 22 PID: 1082 Comm: tar Not tainted 6.8.0-rc7-00015-ge0c8221b72c0 #17 3316c85d50e282c5643b075d1f01a4f6365e39c2
	[ 1318.041329][ T1082] Hardware name: Gigabyte Technology Co., Ltd. B650 AORUS PRO AX/B650 AORUS PRO AX, BIOS F20 12/14/2023
	[ 1318.041614][ T1082] Call Trace:
	[ 1318.041895][ T1082]  <TASK>
	[ 1318.042175][ T1082]  dump_stack_lvl+0x4a/0x80
	[ 1318.042460][ T1082]  check_noncircular+0x145/0x160
	[ 1318.042743][ T1082]  __lock_acquire+0x14bf/0x2680
	[ 1318.043022][ T1082]  lock_acquire+0xcd/0x2c0
	[ 1318.043301][ T1082]  ? __might_fault+0x40/0x80
	[ 1318.043580][ T1082]  ? __might_fault+0x40/0x80
	[ 1318.043856][ T1082]  __might_fault+0x58/0x80
	[ 1318.044131][ T1082]  ? __might_fault+0x40/0x80
	[ 1318.044408][ T1082]  amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu 8fe2afaa910cbd7654c8cab23563a94d6caebaab]
	[ 1318.044749][ T1082]  full_proxy_read+0x55/0x80
	[ 1318.045042][ T1082]  vfs_read+0xa7/0x360
	[ 1318.045333][ T1082]  ksys_read+0x70/0xf0
	[ 1318.045623][ T1082]  do_syscall_64+0x94/0x180
	[ 1318.045913][ T1082]  ? do_syscall_64+0xa0/0x180
	[ 1318.046201][ T1082]  ? lockdep_hardirqs_on+0x7d/0x100
	[ 1318.046487][ T1082]  ? do_syscall_64+0xa0/0x180
	[ 1318.046773][ T1082]  ? do_syscall_64+0xa0/0x180
	[ 1318.047057][ T1082]  ? do_syscall_64+0xa0/0x180
	[ 1318.047337][ T1082]  ? do_syscall_64+0xa0/0x180
	[ 1318.047611][ T1082]  entry_SYSCALL_64_after_hwframe+0x46/0x4e
	[ 1318.047887][ T1082] RIP: 0033:0x7f480b70a39d
	[ 1318.048162][ T1082] Code: 91 ba 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb b2 e8 18 a3 01 00 0f 1f 84 00 00 00 00 00 80 3d a9 3c 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 53 48 83
	[ 1318.048769][ T1082] RSP: 002b:00007ffde77f5c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
	[ 1318.049083][ T1082] RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f480b70a39d
	[ 1318.049392][ T1082] RDX: 0000000000000800 RSI: 000055c9f2120c00 RDI: 0000000000000008
	[ 1318.049703][ T1082] RBP: 0000000000000800 R08: 000055c9f2120a94 R09: 0000000000000007
	[ 1318.050011][ T1082] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c9f2120c00
	[ 1318.050324][ T1082] R13: 0000000000000008 R14: 0000000000000008 R15: 0000000000000800
	[ 1318.050638][ T1082]  </TASK>

	amdgpu_debugfs_mqd_read() holds a reservation when it calls put_user(), which may fault and acquire the mmap_sem. This violates the established locking order.

	Bounce the mqd data through a kernel buffer to get put_user() out of the illegal section.

	Fixes: 445d85e3c1df ("drm/amdgpu: add debugfs interface for reading MQDs")
	Cc: [email protected] # v6.5+
	Reviewed-by: Shashank Sharma <[email protected]>
	Signed-off-by: Johannes Weiner <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
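The bounce-buffer pattern described in this message can be illustrated with a small user-space analogue. Everything here is a hypothetical stand-in: `toy_reserve()`/`toy_unreserve()` model the reservation with a plain flag, and `toy_copy_out()` plays the role of the user copy, asserting that it is never reached while the "lock" is held. It is a sketch of the technique, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy object; "reserved" models the held reservation. */
struct toy_mqd {
	bool reserved;
	unsigned char data[64];
};

static void toy_reserve(struct toy_mqd *m)
{
	assert(!m->reserved);
	m->reserved = true;
}

static void toy_unreserve(struct toy_mqd *m)
{
	assert(m->reserved);
	m->reserved = false;
}

/* Stand-in for the user copy: it may fault, so it must run unlocked. */
static int toy_copy_out(const struct toy_mqd *m, void *dst, const void *src,
			size_t n)
{
	assert(!m->reserved);
	memcpy(dst, src, n);
	return 0;
}

static long toy_mqd_read(struct toy_mqd *m, void *out, size_t size, size_t pos)
{
	unsigned char *bounce;
	size_t n;

	if (size == 0 || pos >= sizeof(m->data))
		return 0;
	n = sizeof(m->data) - pos;
	if (n > size)
		n = size;

	bounce = malloc(n);
	if (!bounce)
		return -1;

	/* Snapshot the data while holding the reservation... */
	toy_reserve(m);
	memcpy(bounce, m->data + pos, n);
	toy_unreserve(m);

	/* ...and only touch the destination after dropping it. */
	toy_copy_out(m, out, bounce, n);
	free(bounce);
	return (long)n;
}
```

The cost is one extra allocation and copy per read, which is usually acceptable for a debugging interface and removes the lock-order inversion entirely.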
*	drm/amdgpu: Fix the warning info in mode1 reset	Ma Jun	2024-01-31	1	-0/+12
	Fix the warning info below during mode1 reset.

	[ +0.000004] Call Trace:
	[ +0.000004]  <TASK>
	[ +0.000006]  ? show_regs+0x6e/0x80
	[ +0.000011]  ? __flush_work.isra.0+0x2e8/0x390
	[ +0.000005]  ? __warn+0x91/0x150
	[ +0.000009]  ? __flush_work.isra.0+0x2e8/0x390
	[ +0.000006]  ? report_bug+0x19d/0x1b0
	[ +0.000013]  ? handle_bug+0x46/0x80
	[ +0.000012]  ? exc_invalid_op+0x1d/0x80
	[ +0.000011]  ? asm_exc_invalid_op+0x1f/0x30
	[ +0.000014]  ? __flush_work.isra.0+0x2e8/0x390
	[ +0.000007]  ? __flush_work.isra.0+0x208/0x390
	[ +0.000007]  ? _prb_read_valid+0x216/0x290
	[ +0.000008]  __cancel_work_timer+0x11d/0x1a0
	[ +0.000007]  ? try_to_grab_pending+0xe8/0x190
	[ +0.000012]  cancel_work_sync+0x14/0x20
	[ +0.000008]  amddrm_sched_stop+0x3c/0x1d0 [amd_sched]
	[ +0.000032]  amdgpu_device_gpu_recover+0x29a/0xe90 [amdgpu]

	This warning info was printed after applying the patch "drm/sched: Convert drm scheduler to use a work queue rather than kthread". The root cause is that the amdgpu driver tries to use the uninitialized work_struct in the struct drm_gpu_scheduler.

	v2:
	 - Rename the function to amdgpu_ring_sched_ready and move it to amdgpu_ring.c (Alex)
	v3:
	 - Fix a few more checks based on Vitaly's patch (Alex)
	v4:
	 - squash in fix noticed by Bert in https://gitlab.freedesktop.org/drm/amd/-/issues/3139

	Fixes: 11b3b9f461c5 ("drm/sched: Check scheduler ready before calling timeout handling")
	Reviewed-by: Alex Deucher <[email protected]>
	Signed-off-by: Vitaly Prosyak <[email protected]>
	Signed-off-by: Ma Jun <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: Enable tunneling on high-priority compute queues	Friedrich Vock	2023-12-13	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This improves latency if the GPU is already busy with other work. This is useful for VR compositors that submit highly latency-sensitive compositing work on high-priority compute queues while the GPU is busy rendering the next frame. Userspace merge request: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26462 v2: bump driver version (Alex) Reviewed-by: Marek Olšák <[email protected]> Signed-off-by: Friedrich Vock <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: Create an option to disable soft recovery	André Almeida	2023-09-11	1	-1/+5
	Create a module option to disable soft recoveries on amdgpu, making every recovery go through the device reset path. This option makes it easier to force device resets for testing and debugging purposes.

	Signed-off-by: André Almeida <[email protected]>
	Reviewed-by: Christian König <[email protected]>
	Signed-off-by: Hamza Mahfooz <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: mark soft recovered fences with -ENODATA	Christian König	2023-06-15	1	-0/+7
\| \| \| \| \| \| \| \| \| \|	Set the fence error code before trying to soft-recover it. It gets overwritten when a hard recovery is required. Signed-off-by: Christian König <[email protected]> Reviewed-by: Luben Tuikov <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: add amdgpu_error_* debugfs file	Christian König	2023-06-15	1	-0/+16
\| \| \| \| \| \| \| \| \|	This allows us to insert some error codes into the bottom of the pipeline on an engine. Signed-off-by: Christian König <[email protected]> Reviewed-by: Luben Tuikov <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: Modify indirect buffer packages for resubmission	Jiadong Zhu	2023-06-09	1	-0/+18
	When the preempted IB frame is resubmitted to the CP, we need to modify the frame data, including:
	1. Set PRE_RESUME to 1 in CONTEXT_CONTROL.
	2. Use the metadata (DE and CE) read from the CSA in WRITE_DATA.

	Add functions to save the location the first time IBs are emitted, and a callback to patch the package when resubmission happens.

	Signed-off-by: Jiadong Zhu <[email protected]>
	Acked-by: Alex Deucher <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
*	drm/amd/amdgpu: Fix warnings in amdgpu _object, _ring.c	Srinivasan Shanmugam	2023-06-09	1	-5/+4
	Fix the following warnings reported by checkpatch:

	WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
	WARNING: static const char * array should probably be static const char * const
	WARNING: space prohibited between function name and open parenthesis '('
	WARNING: braces {} are not necessary for single statement blocks
	WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'.

	Cc: Christian König <[email protected]>
	Cc: Alex Deucher <[email protected]>
	Signed-off-by: Srinivasan Shanmugam <[email protected]>
	Acked-by: Alex Deucher <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: add debugfs interface for reading MQDs	Alex Deucher	2023-04-24	1	-0/+59
\| \| \| \| \| \| \| \| \| \|	Provide a debugfs interface to access the MQD. Useful for debugging issues with the CP and MES hardware scheduler. v2: fix missing unreserve/unmap when pos >= size (Alex) Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: fix memory leak in mes self test	Jack Xiao	2023-04-24	1	-0/+2
\| \| \| \| \| \| \| \| \|	The fences associated with mes queue have to be freed up during amdgpu_ring_fini. Signed-off-by: Jack Xiao <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: Add a max ibs per submission limit.	Bas Nieuwenhuizen	2023-04-18	1	-0/+29
	And ensure each ring supports that many submissions. This makes sure that we don't get surprises after the submission has been scheduled where the ring allocation actually gets rejected.

	My calculations on the existing limits:

	COMPUTE v10: 128
	COMPUTE v11: 128
	COMPUTE v6: 157
	COMPUTE v7: 133
	COMPUTE v8: 130
	COMPUTE v9: 125
	GFX v10: 208
	GFX v11: 213
	GFX v6: 154 (doubling this in the previous patch)
	GFX v7: 226
	GFX v8: 213
	GFX v9: 208
	GFX v9 (SW): 208
	SDMA CIK: 87
	SDMA SI: 97
	SDMA v2.4: 74
	SDMA v3.0: 74
	SDMA v4.0: 72
	SDMA v5.0: 51
	SDMA v6.0: 52
	UVD ENC v6.0: 98
	UVD ENC v7.0: 92
	UVD v3.1: 124
	UVD v4.2: 124
	UVD v5.0: 83
	UVD v6.0 (VM): 55
	UVD v7.0: 51
	VCE v2.0: 126
	VCE v3.0 (VM): 98
	VCE v4.0: 93
	VCN DEC v1.0: 49
	VCN DEC v2.0: 51
	VCN DEC v3.0: 51
	VCN ENC v1.0: 58
	VCN ENC v2.0: 93
	VCN ENC v3.0: 93
	VCN ENC v4.0: 93
	VCN JPEG v1.0: 17
	VCN JPEG v2.0: 16
	VCN JPEG v2.5: 17
	VCN JPEG v3.0: 17
	VCN JPEG v4.0: 17

	Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2498
	Reviewed-by: Christian König <[email protected]>
	Signed-off-by: Bas Nieuwenhuizen <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: MCBP based on DRM scheduler (v9)	Jiadong.Zhu	2022-12-02	1	-0/+12
	Trigger Mid-Command Buffer Preemption according to the priority of the software rings and the hw fence signalling condition.

	The muxer saves the locations of the indirect buffer frames from the software ring together with the fence sequence number in its fifo queue, and pops out those records when the fences are signalled. The locations are used to resubmit packages in preemption scenarios by copying the chunks from the software ring.

	v2: Update comment style.
	v3: Fix conflict caused by previous modifications.
	v4: Remove unnecessary prints.
	v5: Fix corner cases for resubmission cases.
	v6: Refactor functions for resubmission, calling fence_process in irq handler.
	v7: Solve conflict for removing amdgpu_sw_ring.c.
	v8: Add time threshold to judge if preemption request is needed.
	v9: Correct comment spelling. Set fence emit timestamp before rsu assignment.

	Cc: Christian Koenig <[email protected]>
	Cc: Luben Tuikov <[email protected]>
	Cc: Andrey Grodzovsky <[email protected]>
	Cc: Michel Dänzer <[email protected]>
	Signed-off-by: Jiadong.Zhu <[email protected]>
	Acked-by: Luben Tuikov <[email protected]>
	Acked-by: Huang Rui <[email protected]>
	Acked-by: Christian König <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
*	Revert "drm/amdgpu: add debugfs amdgpu_reset_level"	Victor Zhao	2022-10-19	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \|	This reverts commit 5bd8d53f6fa53eab5433698d1362dae2aa53c1cc. This commit breaks the reset logic for aldebaran, revert it for now. Will move the mask inside the reset handler. Fixes: 5bd8d53f6fa53e ("drm/amdgpu: add debugfs amdgpu_reset_level") Signed-off-by: Victor Zhao <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: add debugfs amdgpu_reset_level	Victor Zhao	2022-08-16	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce amdgpu_reset_level debugfs in order to help debug and test specific type of reset. Also helps blocking unwanted type of resets. By default, mode2 reset will not be enabled v2: make this debugfs in adev and use debugfs_create_u32 Signed-off-by: Victor Zhao <[email protected]> Acked-by: Andrey Grodzovsky <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amd/amdgpu: Fix alignment issue	Arunpravin Paneer Selvam	2022-06-08	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	Fix alignment problems reported by zuul for the commit b07d1d73b09e ("drm/amd/amdgpu: Enable high priority gfx queue") Fixes: b07d1d73b09e ("drm/amd/amdgpu: Enable high priority gfx queue") Signed-off-by: Arunpravin Paneer Selvam <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amd/amdgpu: Enable high priority gfx queue	Arunpravin Paneer Selvam	2022-06-06	1	-6/+6
	Starting from SIENNA CICHLID, the ASIC supports two gfx pipes, enabling two graphics queues, one on each pipe: pipe0 queue0 would be the normal priority queue and pipe1 queue0 would be the high priority queue.

	Only one queue per pipe is visible to the SPI. The SPI looks at the priority value assigned to CP_GFX_HQD_QUEUE_PRIORITY from each of the queue's HQD/MQD.

	Create contexts applying AMDGPU_CTX_PRIORITY_HIGH, which submit jobs to the high priority queue on GFX pipe1. There would be starvation of LP workload if HP workload is always available.

	v2:
	 - remove unnecessary check (Nirmoy)
	 - make pipe1 hardware support a separate patch (Nirmoy)
	 - remove duplicate code (Shashank)
	 - add CSA support for second gfx pipe (Alex)
	v3 (Christian):
	 - fix incorrect indentation
	 - merge COMPUTE and GFX switch cases as both call the same function.
	v4:
	 - rebase w/ latest code base

	Signed-off-by: Arunpravin Paneer Selvam <[email protected]>
	Acked-by: Christian König <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: initialize/finalize the ring for mes queue	Jack Xiao	2022-05-04	1	-41/+104
	Initialize/finalize the ring for the MES queue, which submits the command stream to the MES-managed hardware queue.

	Signed-off-by: Jack Xiao <[email protected]>
	Acked-by: Christian König <[email protected]>
	Reviewed-by: Hawking Zhang <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: add helper function to initialize mqd from ring v4	Jack Xiao	2022-05-04	1	-0/+48
\| \| \| \| \| \| \| \| \| \| \| \| \|	Add the helper function to initialize mqd from ring configuration. v2: use if/else pair instead of ?/: pair v3: use simpler way to judge hqd_active v4: fix parameters to amdgpu_gfx_is_high_priority_compute_queue Signed-off-by: Jack Xiao <[email protected]> Acked-by: Christian König <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: initialize the vmid_wait with the stub fence	Christian König	2022-03-04	1	-0/+1
\| \| \| \| \| \| \| \|	This way we don't need to check for NULL any more. Signed-off-by: Christian König <[email protected]> Reviewed-by: Andrey Grodzovsky <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: Move scheduler init to after XGMI is ready	Andrey Grodzovsky	2022-02-09	1	-2/+3
	Before we initialize schedulers we must know which reset domain we are in: for a single device there is a single domain per device and so a single wq per device. For XGMI the reset domain spans the entire XGMI hive and so the reset wq is per hive.

	Signed-off-by: Andrey Grodzovsky <[email protected]>
	Reviewed-by: Christian König <[email protected]>
	Link: https://www.spinics.net/lists/amd-gfx/msg74112.html
*	drm/amdgpu: cleanup debugfs for amdgpu rings	Nirmoy Das	2021-09-14	1	-12/+6
	Use the debugfs_create_file_size API for creating ring debugfs files, and as it returns void, change the return type of the amdgpu_debugfs_ring_init API as well. Also clean up the surrounding code.

	Signed-off-by: Nirmoy Das <[email protected]>
	Reviewed-by: Shashank Sharma <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: use IS_ERR for debugfs APIs	Nirmoy Das	2021-09-14	1	-2/+2
	debugfs APIs return an encoded error, so use IS_ERR for checking the return value.

	v2: return PTR_ERR(ent)

	References: https://gitlab.freedesktop.org/drm/amd/-/issues/1686
	Signed-off-by: Nirmoy Das <[email protected]>
	Reviewed-by: Christian König <[email protected]>
	Reviewed-By: Shashank Sharma <[email protected]>
	Signed-off-by: Alex Deucher <[email protected]>
*	drm/amd/amdgpu/amdgpu_ring: Provide description for 'sched_score'	Lee Jones	2021-04-21	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:169: warning: Function parameter or member 'sched_score' not described in 'amdgpu_ring_init' Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Reviewed-by: Christian König <[email protected]> Signed-off-by: Lee Jones <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: add the sched_score to amdgpu_ring_init	Christian König	2021-04-09	1	-2/+4
\| \| \| \| \| \| \| \| \| \|	Allow separate ring to share the same scheduler score. No functional change. Signed-off-by: Christian König <[email protected]> Reviewed-and-Tested-by: Leo Liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: cleanup struct amdgpu_ring	Nirmoy Das	2021-02-09	1	-6/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch consists of the following related changes: 1. Rename ring->priority to ring->hw_prio. 2. Assign the correct hardware ring priority. 3. Remove ring->priority_mutex, as the ring priority remains unchanged after initialization. 4. Remove the unused ring->num_jobs. v3: remove ring->num_jobs. v2: remove ring->priority_mutex. Fixes: 33abcb1f5a17 ("drm/amdgpu: set compute queue priority at mqd_init") Signed-off-by: Nirmoy Das <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amd/amdgpu/amdgpu_ring: Fix misnaming of param 'max_dw'	Lee Jones	2020-11-13	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:168: warning: Function parameter or member 'max_dw' not described in 'amdgpu_ring_init' drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:168: warning: Excess function parameter 'max_ndw' description in 'amdgpu_ring_init' Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Lee Jones <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amd/amdgpu/amdgpu_ring: Fix a bunch of function misdocumentation	Lee Jones	2020-11-13	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fixes the following W=1 kernel build warning(s): drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:63: warning: Excess function parameter 'adev' description in 'amdgpu_ring_alloc' drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:122: warning: Excess function parameter 'adev' description in 'amdgpu_ring_commit' drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:167: warning: Function parameter or member 'max_dw' not described in 'amdgpu_ring_init' drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:167: warning: Function parameter or member 'irq_src' not described in 'amdgpu_ring_init' drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:167: warning: Function parameter or member 'irq_type' not described in 'amdgpu_ring_init' drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:167: warning: Function parameter or member 'hw_prio' not described in 'amdgpu_ring_init' drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:167: warning: Excess function parameter 'max_ndw' description in 'amdgpu_ring_init' drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:167: warning: Excess function parameter 'nop' description in 'amdgpu_ring_init' drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:285: warning: Excess function parameter 'adev' description in 'amdgpu_ring_fini' drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:325: warning: Function parameter or member 'ring' not described in 'amdgpu_ring_emit_reg_write_reg_wait_helper' drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:325: warning: Excess function parameter 'adev' description in 'amdgpu_ring_emit_reg_write_reg_wait_helper' Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: David Airlie <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Lee Jones <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu/amdgpu: use "*" adjacent to data name	Deepak R Varma	2020-11-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	When declaring pointer data, the "*" symbol should be used adjacent to the data name as per the coding standards. This resolves the following issues reported by the checkpatch script: ERROR: "foo * bar" should be "foo *bar" ERROR: "foo * bar" should be "foo *bar" ERROR: "foo* bar" should be "foo *bar" ERROR: "(foo*)" should be "(foo *)" Signed-off-by: Deepak R Varma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/amdgpu: Get DRM dev from adev by inline-f	Luben Tuikov	2020-08-24	1	-1/+1
\| \| \| \| \| \| \| \| \|	Add a static inline adev_to_drm() to obtain the DRM device pointer from an amdgpu_device pointer. Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
*	drm/scheduler: Scheduler priority fixes (v2)	Luben Tuikov	2020-08-18	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove DRM_SCHED_PRIORITY_LOW, as it was used in only one place. Rename and separate by a line DRM_SCHED_PRIORITY_MAX to DRM_SCHED_PRIORITY_COUNT as it represents a (total) count of said priorities and it is used as such in loops throughout the code. (0-based indexing is the count number.) Remove redundant word HIGH in priority names, and rename KERNEL to HIGH, as it really means that, high. v2: Add back KERNEL and remove SW and HW, in lieu of a single HIGH between NORMAL and KERNEL. Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>