aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu
Commit message (Collapse)AuthorAgeFilesLines
...
| | | * drm/amdgpu: Convert query_memory_partition into common helpersHawking Zhang2025-06-243-45/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The query_memory_partition does not need to remain as soc specific callbacks. They can be shared across multiple products Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Move MAX_MEM_RANGES to amdgpu_gmc.hHawking Zhang2025-06-242-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This relocation allows MAX_MEM_RANGES to be shared across multiple products Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Convert pre|post_partition_switch into common helpersHawking Zhang2025-06-243-31/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So they can be reused for future products Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Likun Gao <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Convert update_supported_modes into a common helperHawking Zhang2025-06-243-40/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So it can be used for future products Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Likun Gao <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Convert update_partition_sched_list into a common helper v3Hawking Zhang2025-06-244-126/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The update_partition_sched_list function does not need to remain as a soc specific callback. It can be reused for future products. v2: bypass the function if xcp_mgr is not available (Likun) v3: Let caller check the availability of xcp_mgr (Lijo) Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Convert select_sched into a common helper v3Hawking Zhang2025-06-243-49/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The xcp select_sched function does not need to remain as a soc specific callback. It can be reused for future products v2: bypass the function if xcp_mgr is not available (Likun) v3: Let caller check the availability of xcp mgr (Lijo) Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: use common function to map ip for aqua_vanjaramLikun Gao2025-06-242-73/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Transfer to use function amdgpu_ip_map_init to map ip instance for aqua_vanjaram instead of operation on different ASIC. Signed-off-by: Likun Gao <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: make ip map init to common functionLikun Gao2025-06-243-1/+126
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IP instance map init function can be an common function instead of operation on different ASIC. V2: Create amdgpu_ip.[ch] file for ip related functions. Signed-off-by: Likun Gao <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amd/amdgpu: Refine isp_v4_1_1 loggingPratap Nirujogi2025-06-241-12/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace DRM_ERROR with drm_err function and update log messages to drop __func__ and print return value. Reviewed-by: Mario Limonciello <[email protected]> Signed-off-by: Pratap Nirujogi <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amd/amdgpu: Add ISP Generic PM Domain (genpd) supportPratap Nirujogi2025-06-242-1/+157
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AMDISP I2C device requires to power on ISP HW to probe the sensor device. Instead of using the exported symbols from ISP driver to control the power and clocks remotely,added Generic PM Domain (genpd) support in amdgpu_isp device for its child devices (amd_isp_capture, amd_isp_i2c_designware) to set power and clocks using PM methods. Co-developed-by: Bin Du <[email protected]> Signed-off-by: Bin Du <[email protected]> Reviewed-by: Mario Limonciello <[email protected]> Signed-off-by: Pratap Nirujogi <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini+0x70cVitaly Prosyak2025-06-242-15/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The issue was reproduced on NV10 using IGT pci_unplug test. It is expected that `amdgpu_driver_postclose_kms()` is called prior to `amdgpu_drm_release()`. However, the bug is that `amdgpu_fpriv` was freed in `amdgpu_driver_postclose_kms()`, and then later accessed in `amdgpu_drm_release()` via a call to `amdgpu_userq_mgr_fini()`. As a result, KASAN detected a use-after-free condition, as shown in the log below. The proposed fix is to move the calls to `amdgpu_eviction_fence_destroy()` and `amdgpu_userq_mgr_fini()` into `amdgpu_driver_postclose_kms()`, so they are invoked before `amdgpu_fpriv` is freed. This also ensures symmetry with the initialization path in `amdgpu_driver_open_kms()`, where the following components are initialized: - `amdgpu_userq_mgr_init()` - `amdgpu_eviction_fence_init()` - `amdgpu_ctx_mgr_init()` Correspondingly, in `amdgpu_driver_postclose_kms()` we should clean up using: - `amdgpu_userq_mgr_fini()` - `amdgpu_eviction_fence_destroy()` - `amdgpu_ctx_mgr_fini()` This change eliminates the use-after-free and improves consistency in resource management between open and close paths. [ +0.094367] ================================================================== [ +0.000026] BUG: KASAN: slab-use-after-free in amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000866] Write of size 8 at addr ffff88811c068c60 by task amd_pci_unplug/1737 [ +0.000026] CPU: 3 UID: 0 PID: 1737 Comm: amd_pci_unplug Not tainted 6.14.0+ #2 [ +0.000008] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020 [ +0.000004] Call Trace: [ +0.000004] <TASK> [ +0.000003] dump_stack_lvl+0x76/0xa0 [ +0.000010] print_report+0xce/0x600 [ +0.000009] ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000790] ? srso_return_thunk+0x5/0x5f [ +0.000007] ? kasan_complete_mode_report_info+0x76/0x200 [ +0.000008] ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000684] kasan_report+0xbe/0x110 [ +0.000007] ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000601] __asan_report_store8_noabort+0x17/0x30 [ +0.000007] amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000801] ? __pfx_amdgpu_userq_mgr_fini+0x10/0x10 [amdgpu] [ +0.000819] ? srso_return_thunk+0x5/0x5f [ +0.000008] amdgpu_drm_release+0xa3/0xe0 [amdgpu] [ +0.000604] __fput+0x354/0xa90 [ +0.000010] __fput_sync+0x59/0x80 [ +0.000005] __x64_sys_close+0x7d/0xe0 [ +0.000006] x64_sys_call+0x2505/0x26f0 [ +0.000006] do_syscall_64+0x7c/0x170 [ +0.000004] ? kasan_record_aux_stack+0xae/0xd0 [ +0.000005] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? kmem_cache_free+0x398/0x580 [ +0.000006] ? __fput+0x543/0xa90 [ +0.000006] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? __fput+0x543/0xa90 [ +0.000004] ? __kasan_check_read+0x11/0x20 [ +0.000007] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? __kasan_check_read+0x11/0x20 [ +0.000003] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? fpregs_assert_state_consistent+0x21/0xb0 [ +0.000006] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? syscall_exit_to_user_mode+0x4e/0x240 [ +0.000005] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? do_syscall_64+0x88/0x170 [ +0.000003] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? do_syscall_64+0x88/0x170 [ +0.000004] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? irqentry_exit+0x43/0x50 [ +0.000004] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? exc_page_fault+0x7c/0x110 [ +0.000006] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000005] RIP: 0033:0x7ffff7b14f67 [ +0.000005] Code: ff e8 0d 16 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 73 ba f7 ff [ +0.000004] RSP: 002b:00007fffffffe358 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 [ +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffff7b14f67 [ +0.000003] RDX: 0000000000000000 RSI: 00007ffff7f5755a RDI: 0000000000000003 [ +0.000003] RBP: 00007fffffffe380 R08: 0000555555568170 R09: 0000000000000000 [ +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffffffe5c8 [ +0.000003] R13: 00005555555552a9 R14: 0000555555557d48 R15: 00007ffff7ffd040 [ +0.000007] </TASK> [ +0.000286] Allocated by task 425 on cpu 11 at 29.751192s: [ +0.000013] kasan_save_stack+0x28/0x60 [ +0.000008] kasan_save_track+0x18/0x70 [ +0.000006] kasan_save_alloc_info+0x38/0x60 [ +0.000006] __kasan_kmalloc+0xc1/0xd0 [ +0.000005] __kmalloc_cache_noprof+0x1bd/0x430 [ +0.000006] amdgpu_driver_open_kms+0x172/0x760 [amdgpu] [ +0.000521] drm_file_alloc+0x569/0x9a0 [ +0.000008] drm_client_init+0x1b7/0x410 [ +0.000007] drm_fbdev_client_setup+0x174/0x470 [ +0.000007] drm_client_setup+0x8a/0xf0 [ +0.000006] amdgpu_pci_probe+0x50b/0x10d0 [amdgpu] [ +0.000482] local_pci_probe+0xe7/0x1b0 [ +0.000008] pci_device_probe+0x5bf/0x890 [ +0.000005] really_probe+0x1fd/0x950 [ +0.000007] __driver_probe_device+0x307/0x410 [ +0.000005] driver_probe_device+0x4e/0x150 [ +0.000006] __driver_attach+0x223/0x510 [ +0.000005] bus_for_each_dev+0x102/0x1a0 [ +0.000006] driver_attach+0x3d/0x60 [ +0.000005] bus_add_driver+0x309/0x650 [ +0.000005] driver_register+0x13d/0x490 [ +0.000006] __pci_register_driver+0x1ee/0x2b0 [ +0.000006] xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo] [ +0.000008] do_one_initcall+0x9c/0x3e0 [ +0.000007] do_init_module+0x29e/0x7f0 [ +0.000006] load_module+0x5c75/0x7c80 [ +0.000006] init_module_from_file+0x106/0x180 [ +0.000007] idempotent_init_module+0x377/0x740 [ +0.000006] __x64_sys_finit_module+0xd7/0x180 [ +0.000006] x64_sys_call+0x1f0b/0x26f0 [ +0.000006] do_syscall_64+0x7c/0x170 [ +0.000005] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000013] Freed by task 1737 on cpu 9 at 76.455063s: [ +0.000010] kasan_save_stack+0x28/0x60 [ +0.000006] kasan_save_track+0x18/0x70 [ +0.000005] kasan_save_free_info+0x3b/0x60 [ +0.000006] __kasan_slab_free+0x54/0x80 [ +0.000005] kfree+0x127/0x470 [ +0.000006] amdgpu_driver_postclose_kms+0x455/0x760 [amdgpu] [ +0.000485] drm_file_free.part.0+0x5b1/0xba0 [ +0.000007] drm_file_free+0x13/0x30 [ +0.000006] drm_client_release+0x1c4/0x2b0 [ +0.000006] drm_fbdev_ttm_fb_destroy+0xd2/0x120 [drm_ttm_helper] [ +0.000007] put_fb_info+0x97/0xe0 [ +0.000006] unregister_framebuffer+0x197/0x380 [ +0.000005] drm_fb_helper_unregister_info+0x94/0x100 [ +0.000005] drm_fbdev_client_unregister+0x3c/0x80 [ +0.000007] drm_client_dev_unregister+0x144/0x330 [ +0.000006] drm_dev_unregister+0x49/0x1b0 [ +0.000006] drm_dev_unplug+0x4c/0xd0 [ +0.000006] amdgpu_pci_remove+0x58/0x130 [amdgpu] [ +0.000482] pci_device_remove+0xae/0x1e0 [ +0.000006] device_remove+0xc7/0x180 [ +0.000006] device_release_driver_internal+0x3d4/0x5a0 [ +0.000007] device_release_driver+0x12/0x20 [ +0.000006] pci_stop_bus_device+0x104/0x150 [ +0.000006] pci_stop_and_remove_bus_device_locked+0x1b/0x40 [ +0.000005] remove_store+0xd7/0xf0 [ +0.000007] dev_attr_store+0x3f/0x80 [ +0.000006] sysfs_kf_write+0x125/0x1d0 [ +0.000005] kernfs_fop_write_iter+0x2ea/0x490 [ +0.000007] vfs_write+0x90d/0xe70 [ +0.000006] ksys_write+0x119/0x220 [ +0.000006] __x64_sys_write+0x72/0xc0 [ +0.000006] x64_sys_call+0x18ab/0x26f0 [ +0.000005] do_syscall_64+0x7c/0x170 [ +0.000005] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000013] The buggy address belongs to the object at ffff88811c068000 which belongs to the cache kmalloc-rnd-01-4k of size 4096 [ +0.000016] The buggy address is located 3168 bytes inside of freed 4096-byte region [ffff88811c068000, ffff88811c069000) [ +0.000022] The buggy address belongs to the physical page: [ +0.000010] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88811c06e000 pfn:0x11c068 [ +0.000006] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ +0.000006] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) [ +0.000007] page_type: f5(slab) [ +0.000007] raw: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000 [ +0.000005] raw: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000 [ +0.000006] head: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000 [ +0.000005] head: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000 [ +0.000006] head: 0017ffffc0000003 ffffea0004701a01 ffffffffffffffff 0000000000000000 [ +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 [ +0.000004] page dumped because: kasan: bad access detected [ +0.000011] Memory state around the buggy address: [ +0.000009] ffff88811c068b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000012] ffff88811c068b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000011] >ffff88811c068c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000011] ^ [ +0.000010] ffff88811c068c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000011] ffff88811c068d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000011] ================================================================== Cc: Alex Deucher <[email protected]> Cc: Christian König <[email protected]> Cc: Lijo Lazar <[email protected]> Cc: Jesse Zhang <[email protected]> Cc: Arvind Yadav <[email protected]> v2: drop amdgpu_drm_release() and assign drm_release() as the callback directly.(Alex) Fixes: adba0929736a ("drm/amdgpu: Fix Illegal opcode in command stream Error") Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Vitaly Prosyak <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: remove fence slabAlex Deucher2025-06-243-27/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just use kmalloc for the fences in the rare case we need an independent fence. Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amd: Adjust output for discovery error handlingMario Limonciello2025-06-241-15/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 017fbb6690c2 ("drm/amdgpu/discovery: check ip_discovery fw file available") added support for reading an amdgpu IP discovery bin file for some specific products. If it's not found then it will fallback to hardcoded values. However if it's not found there is also a lot of noise about missing files and errors. Adjust the error handling to decrease most messages to DEBUG and to show users less about missing files. Reviewed-by: Lijo Lazar <[email protected]> Reported-by: Marcus Seyfarth <[email protected]> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4312 Tested-by: Marcus Seyfarth <[email protected]> Fixes: 017fbb6690c2 ("drm/amdgpu/discovery: check ip_discovery fw file available") Acked-by: Alex Deucher <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/mes: add compatibility checks for set_hw_resource_1Alex Deucher2025-06-242-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Seems some older MES firmware versions do not properly support this packet. Add back some the compatibility checks. v2: switch to fw version check (Shaoyun) Fixes: f81cd793119e ("drm/amd/amdgpu: Fix MES init sequence") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4295 Cc: Shaoyun Liu <[email protected]> Reviewed-by: shaoyun.liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/gfx9: Add Cleaner Shader Support for GFX9.x GPUsSrinivasan Shanmugam2025-06-241-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable the cleaner shader for other GFX9.x series of GPUs to provide data isolation between GPU workloads. The cleaner shader is responsible for clearing the Local Data Store (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs), which helps prevent data leakage and ensures accurate computation results. This update extends cleaner shader support to GFX9.x GPUs, previously available for GFX9.4.2. It enhances security by clearing GPU memory between processes and maintains a consistent GPU state across KGD and KFD workloads. Cc: Manu Rastogi <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/sdma5.2: init engine reset mutexAlex Deucher2025-06-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Missing the mutex init. Fixes: 47454f2dc0bf ("drm/amdgpu: Register the new sdma function pointers for sdma_v5_2") Reviewed-by: Jesse Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/sdma5: init engine reset mutexAlex Deucher2025-06-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Missing the mutex init. Fixes: e56d4bf57fab ("drm/amdgpu/: drm/amdgpu: Register the new sdma function pointers for sdma_v5_0") Reviewed-by: Jesse Zhang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: switch job hw_fence to amdgpu_fenceAlex Deucher2025-06-186-32/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the amdgpu fence container so we can store additional data in the fence. This also fixes the start_time handling for MCBP since we were casting the fence to an amdgpu_fence and it wasn't. Fixes: 3f4c175d62d8 ("drm/amdgpu: MCBP based on DRM scheduler (v9)") Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Add xgmi API to set max speed/widthLijo Lazar2025-06-182-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an API to set the max possible xgmi speed/width. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Deprecate xgmi_link_speed enumLijo Lazar2025-06-182-9/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xgmi doesn't have discrete max speeds defined. Speed numbers can be arbitrary based on SOC. Deprecate the enum. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Extend bus status check to more casesLijo Lazar2025-06-183-7/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of unexpected errors, check if device is alive on the bus. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: reclaim psp fw reservation memory regionFrank Min2025-06-185-2/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PSP v14 fw update introduced changes on memory reservation region, according to the change driver reclaim some non-reserved region. 1. introduce 2 new psp commands to query fw reservation regions 2. add a new reservation region for psp 3. reclaim psp non-used region Signed-off-by: Frank Min <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: refine usage of amdgpu_bad_page_thresholdganglxie2025-06-181-10/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | when amdgpu_bad_page_threshold == -1 or -2, driver will issue a warning message when threshold is reached and continue runtime services. Signed-off-by: ganglxie <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Fix SDMA UTC_L1 handling during start/stop sequencesJesse Zhang2025-06-181-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit makes two key fixes to SDMA v4.4.2 handling: 1. disable UTC_L1 in sdma_cntl register when stopping SDMA engines by reading the current value before modifying UTC_L1_ENABLE bit. 2. Ensure UTC_L1_ENABLE is consistently managed by: - Adding the missing register write when enabling UTC_L1 during start - Keeping UTC_L1 enabled by default as per hardware requirements v2: Correct SDMA_CNTL setting (Philip) Suggested-by: Jonathan Kim <[email protected]> Signed-off-by: Jesse Zhang <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Release reset locks during failuresLijo Lazar2025-06-181-25/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure to release reset domain lock in case of failures. Signed-off-by: Lijo Lazar <[email protected]> Signed-off-by: Ce Sun <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Fixes: 11bb33766f66 ("drm/amdgpu: refactor amdgpu_device_gpu_recover") Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/sdma: handle paging queues in amdgpu_sdma_reset_engine()Alex Deucher2025-06-181-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Need to properly start and stop paging queues if they are present. This is not an issue today since we don't support a paging queue on any chips with queue reset. Fixes: b22659d5d352 ("drm/amdgpu: switch amdgpu_sdma_reset_engine to use the new sdma function pointers") Reviewed-by: Jesse Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdkfd: Move the process suspend and resume out of full accessEmily Deng2025-06-185-17/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the suspend and resume process, exclusive access is not required. Therefore, it can be moved out of the full access section to reduce the duration of exclusive access. v3: Move suspend processes before hardware fini. Remove twice call for bare metal. v4: Refine code Signed-off-by: Emily Deng <[email protected]> Acked-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: VCN v5_0_1 to prevent FW checking RB during DPG pauseSonny Jiang2025-06-181-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a protection to ensure programming are all complete prior VCPU starting. This is a WA for an unintended VCPU running. Signed-off-by: Sonny Jiang <[email protected]> Acked-by: Leo Liu <[email protected]> Reviewed-by: Ruijing Dong <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Add soft reset callback to SDMA v4.4.xJesse Zhang2025-06-182-29/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement soft reset engine callback for SDMA 4.4.x IPs. This avoids IP version check in generic implementation. V2: Correct physical instance ID calculation in soft_reset_engine (Jesse) Signed-off-by: Lijo Lazar <[email protected]> Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Use logical instance ID for SDMA v4_4_2 queue operationsJesse Zhang2025-06-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simplify SDMA v4_4_2 queue reset and stop operations by: 1. Removing GET_INST(SDMA0) conversion for ring->me 2. Using the logical instance ID (ring->me) directly 3. Maintaining consistent behavior with other SDMA queue operations This change aligns with the existing queue handling logic where ring->me already represents the correct instance identifier. Signed-off-by: Lijo Lazar <[email protected]> Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Fix SDMA engine reset with logical instance IDJesse Zhang2025-06-181-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit makes the following improvements to SDMA engine reset handling: 1. Clarifies in the function documentation that instance_id refers to a logical ID 2. Adds conversion from logical to physical instance ID before performing reset using GET_INST(SDMA0, instance_id) macro 3. Improves error messaging to indicate when a logical instance reset fails 4. Adds better code organization with blank lines for readability The change ensures proper SDMA engine reset by using the correct physical instance ID while maintaining the logical ID interface for callers. V2: Remove harvest_config check and convert directly to physical instance (Lijo) Suggested-by: Jonathan Kim <[email protected]> Signed-off-by: Jesse Zhang <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Suspend IH during mode-2 resetLijo Lazar2025-06-181-4/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On multi-aid SOCs, there could be a continuous stream of interrupts from GC after poison consumption. Suspend IH to disable them before doing mode-2 reset. This avoids conflicts in hardware accesses during interrupt handlers while a reset is ongoing. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amd: Add support for a complete pmops actionMario Limonciello2025-06-183-1/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | complete() callbacks are supposed to handle reversing anything that occurred during prepare() callbacks. They'll be called on every power state transition, and will also be called if the sequence is failed (such as an aborted suspend). Add support for IP blocks to support this action. Reviewed-by: Alex Hung <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Add debug mask to disable CE logsXiang Liu2025-06-184-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add debug mask to disable kernel logs of RAS correctable errors, including both ACA and CE error counter kernel messages. Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: add kicker fws loading for gfx11/smu13/psp13Frank Min2025-06-184-6/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. Add kicker firmwares loading for gfx11/smu13/psp13 2. Register additional MODULE_FIRMWARE entries for kicker fws - gc_11_0_0_rlc_kicker.bin - gc_11_0_0_imu_kicker.bin - psp_13_0_0_sos_kicker.bin - psp_13_0_0_ta_kicker.bin - smu_13_0_0_kicker.bin Signed-off-by: Frank Min <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Add kicker device detectionFrank Min2025-06-182-0/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. add kicker device list 2. add kicker device checking helper function Signed-off-by: Frank Min <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Clear reset flags from ras contextLijo Lazar2025-06-181-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Once RAS errors are cleared with appropriate recovery mechanism, clear reset flags also from RAS context. Otherwise, stale flag values could affect the subsequent RAS reset handling on the device. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/gfx9: drop reset_kgqAlex Deucher2025-06-181-46/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It doesn't work reliably and we have soft recover and full adapter reset so drop this. Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/gfx8: drop reset_kgqAlex Deucher2025-06-181-71/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It doesn't work reliably and we have soft recover and full adapter reset so drop this. Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/gfx7: drop reset_kgqAlex Deucher2025-06-181-71/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It doesn't work reliably and we have soft recover and full adapter reset so drop this. Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdkfd: allow compute partition mode switch with cgroup exclusionsJonathan Kim2025-06-182-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The KFD currently bars a compute partition mode switch while a KFD process exists. Since cgroup excluded devices remain excluded for the lifetime of a KFD process and user space is able to mode switch single devices, allow users to mode switch a device with any running process that has been cgroup excluded from this device. Signed-off-by: Jonathan Kim <[email protected]> Reviewed-by: Harish Kasiviswanathan <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: clear pa and mca record counter when resetting eepromganglxie2025-06-181-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | clear pa and mca record counter when resetting eeprom, so that ras_num_bad_pages can be calculated correctly Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: fix fence fallback timer expired errorSamuel Zhang2025-06-183-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IH is not working after switching a new gpu index for the first time. During VM resume, QEMU programming of VF MSIX table (register GFXMSIX_VECT0_ADDR_LO) may not work.The access could be blocked by nBIF protection as VF isn't in exclusive access mode. Exclusive access is enabled now, disable/enable MSIX so that QEMU reprograms MSIX table. call amdgpu_restore_msix on resume to restore msix table. Signed-off-by: Samuel Zhang <[email protected]> Acked-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: enable pdb0 for hibernation on SRIOVSamuel Zhang2025-06-185-16/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When switching to new GPU index after hibernation and then resume, VRAM offset of each VRAM BO will be changed, and the cached gpu addresses needed to updated. This is to enable pdb0 and switch to use pdb0-based virtual gpu address by default in amdgpu_bo_create_reserved(). since the virtual addresses do not change, this can avoid the need to update all cached gpu addresses all over the codebase. Signed-off-by: Emily Deng <[email protected]> Signed-off-by: Samuel Zhang <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: update GPU addresses for SMU and PSPSamuel Zhang2025-06-184-0/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | add amdgpu_bo_fb_aper_addr() and update the cached GPU addresses to use the FB aperture address for SMU and PSP. 2 reasons for this change: 1. when pdb0 is enabled, gpu addr from amdgpu_bo_create_kernel() is GART aperture address, it is not compatible with SMU and PSP, it need to be updated to use FB aperture address. 2. Since FB aperture address will change after switching to new GPU index after hibernation, it need to be updated on resume. Signed-off-by: Jiang Liu <[email protected]> Signed-off-by: Samuel Zhang <[email protected]> Acked-by: Christian König <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Remove nbiov7.9 replay count reportingLijo Lazar2025-06-181-20/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Direct pcie replay count reporting is not available on nbio v7.9. Reporting is done through firmware. Signed-off-by: Lijo Lazar <[email protected]> Acked-by: Mangesh Gadre <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Fixes: 50709d18f4a6 ("drm/amdgpu: Add pci replay count to nbio v7.9") Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Check pcie replays reporting supportLijo Lazar2025-06-183-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Check if pcie replay count reporting is supported before creating sysfs attribute. Signed-off-by: Lijo Lazar <[email protected]> Acked-by: Mangesh Gadre <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Enable IFWI update support for PSPv14.0.2 and v14.0.3Shiwu Zhang2025-06-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make the psp_vbflash and psp_vbflash_status available in sysfs. v2: make it available for v14.0.2 as well (hawking) Signed-off-by: Shiwu Zhang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: update xgmi info and vram_base_offset on resumeSamuel Zhang2025-06-182-0/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For SRIOV VM env with XGMI enabled systems, XGMI physical node id may change when hibernate and resume with different VF. Update XGMI info and vram_base_offset on resume for gfx444 SRIOV env. Add amdgpu_virt_xgmi_migrate_enabled() as the feature flag. Signed-off-by: Jiang Liu <[email protected]> Signed-off-by: Samuel Zhang <[email protected]> Reviewed-by: Christian König <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| * | | drm/amdgpu: add support of debugfs for mqd informationSunil Khatri2025-07-042-0/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add debugfs support for mqd for each queue of the client. The address exposed to debugfs could be used to dump the mqd. Signed-off-by: Sunil Khatri <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian König <[email protected]>