aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu
Commit message (Collapse)AuthorAgeFilesLines
...
| | | * drm/amdgpu: Use correct severity for BP threshold exceed eventXiang Liu2025-06-301-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The severity of CPER for BP threshold exceed event should be set as CPER_SEV_FATAL to match the OOB implementation. Signed-off-by: Xiang Liu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: move scheduler wqueue handling into callbacksAlex Deucher2025-06-3019-21/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the scheduler wqueue stopping and starting into the ring reset callbacks. On some IPs we have to reset an engine which may have multiple queues. Move the wqueue handling into the backend so we can handle them as needed based on the type of reset available. Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: move guilty handling into ring resetsAlex Deucher2025-06-303-51/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move guilty logic into the ring reset callbacks. This allows each ring reset callback to better handle fence errors and force completions in line with the reset behavior for each IP. It also allows us to remove the ring guilty callback since that logic now lives in the reset callback. Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: move force completion into ring resetsAlex Deucher2025-06-3021-30/+152
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the force completion handling into each ring reset function so that each engine can determine whether or not it needs to force completion on the jobs in the ring. Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: rework queue reset scheduler interactionChristian König2025-06-301-15/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Stopping the scheduler for queue reset is generally a good idea because it prevents any worker from touching the ring buffer. Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: update ring reset function signatureAlex Deucher2025-06-3022-26/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Going forward, we'll need more than just the vmid. Add the guilty amdgpu_fence. Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: remove job parameter from amdgpu_fence_emit()Alex Deucher2025-06-303-24/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | What we actually care about is the amdgpu_fence object so pass that in explicitly to avoid possible mistakes in the future. The job_run_counter handling can be safely removed at this point as we no longer support job resubmission. Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdkfd: add hqd_sdma_get_doorbell callbacks for gfx7/8Alex Deucher2025-06-302-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These were missed when support was added for other generations. The callbacks are called unconditionally so we need to make sure all generations have them. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4304 Link: https://github.com/ROCm/ROCm/issues/4965 Fixes: bac38ca8c475 ("drm/amdkfd: implement per queue sdma reset for gfx 9.4+") Cc: Jonathan Kim <[email protected]> Reported-by: Johl Brown <[email protected]> Reviewed-by: Jonathan Kim <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Fix memory leak in amdgpu_ctx_mgr_entity_finiLin.Cao2025-06-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | patch dd64956685fa ("drm/amdgpu: Remove duplicated "context still alive" check") removed ctx put, which will cause amdgpu_ctx_fini() cannot be called and then cause some finished fence that added by amdgpu_ctx_add_fence() cannot be released and cause memleak. Fixes: dd64956685fa ("drm/amdgpu: Remove duplicated "context still alive" check") Signed-off-by: Lin.Cao <[email protected]> Reviewed-by: Tvrtko Ursulin <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: indent an if statementDan Carpenter2025-06-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "return true;" line wasn't indented. Also checkpatch likes when we align the && conditions. Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Use dma_buf from GEM object instanceThomas Zimmermann2025-06-303-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid dereferencing struct drm_gem_object.import_attach for the imported dma-buf. The dma_buf field in the GEM object instance refers to the same buffer. Prepares to make import_attach an implementation detail of PRIME. Reviewed-by: Christian König <[email protected]> Signed-off-by: Thomas Zimmermann <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Test for imported buffers with drm_gem_is_imported()Thomas Zimmermann2025-06-306-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of testing import_attach for imported GEM buffers, invoke drm_gem_is_imported() to do the test. v2: - keep amdgpu_bo_print_info() as-is (Christian) Reviewed-by: Christian König <[email protected]> Signed-off-by: Thomas Zimmermann <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Convert from DRM_* to dev_*Lijo Lazar2025-06-3012-219/+320
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert from generic DRM_* to dev_* calls to have device context info. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Include sdma_4_4_4.binKent Russell2025-06-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This got missed during SDMA 4.4.4 support. Signed-off-by: Kent Russell <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amd: Include <linux/export.h> when neededAndré Almeida2025-06-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following compile time warning when building with W=1: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing Acked-by: Christian König <[email protected]> Signed-off-by: André Almeida <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amd: Do not include <linux/export.h> when unusedAndré Almeida2025-06-301-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following compile time warning when building with W=1: warning: EXPORT_SYMBOL() is not used, but #include <linux/export.h> is present Acked-by: Christian König <[email protected]> Signed-off-by: André Almeida <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/sdma5.x: suspend KFD queues in ring resetAlex Deucher2025-06-302-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SDMA 5.x only supports engine soft reset which resets all queues on the engine. As such, we need to suspend KFD queues around resets like we do for SDMA 4.x. Reviewed-by: Jesse Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amd: Fix spelling mistake "correctalbe" -> "correctable"Colin Ian King2025-06-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a spelling mistake in a pr_info message. Fix it. Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/sdma7: add ucode version checks for userq supportAlex Deucher2025-06-241-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | SDMA 7.0.0/1: 7836028 Reviewed-by: Jesse Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/sdma6: add ucode version checks for userq supportAlex Deucher2025-06-241-3/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SDMA 6.0.0 version 24 SDMA 6.0.2 version 21 SDMA 6.0.3 version 25 Reviewed-by: Jesse Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Add more checks to PSP mailboxLijo Lazar2025-06-249-61/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of checking the response flag, use status mask also to check against any unexpected failures like a device drop. Also, log error if waiting on a psp response fails/times out. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Convert init_mem_ranges into common helpersHawking Zhang2025-06-243-184/+191
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | They can be shared across multiple products Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Generalize is_multi_chiplet with a common helper v2Hawking Zhang2025-06-242-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is not necessary to be ip generation specific v2: rename the helper to is_multi_aid (Lijo) Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Convert query_memory_partition into common helpersHawking Zhang2025-06-243-45/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The query_memory_partition does not need to remain as soc specific callbacks. They can be shared across multiple products Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Move MAX_MEM_RANGES to amdgpu_gmc.hHawking Zhang2025-06-242-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This relocation allows MAX_MEM_RANGES to be shared across multiple products Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Convert pre|post_partition_switch into common helpersHawking Zhang2025-06-243-31/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So they can be reused for future products Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Likun Gao <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Convert update_supported_modes into a common helperHawking Zhang2025-06-243-40/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So it can be used for future products Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Likun Gao <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Convert update_partition_sched_list into a common helper v3Hawking Zhang2025-06-244-126/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The update_partition_sched_list function does not need to remain as a soc specific callback. It can be reused for future products. v2: bypass the function if xcp_mgr is not available (Likun) v3: Let caller check the availability of xcp_mgr (Lijo) Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Convert select_sched into a common helper v3Hawking Zhang2025-06-243-49/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The xcp select_sched function does not need to remain as a soc specific callback. It can be reused for future products v2: bypass the function if xcp_mgr is not available (Likun) v3: Let caller check the availability of xcp mgr (Lijo) Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: use common function to map ip for aqua_vanjaramLikun Gao2025-06-242-73/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Transfer to use function amdgpu_ip_map_init to map ip instance for aqua_vanjaram instead of operation on different ASIC. Signed-off-by: Likun Gao <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: make ip map init to common functionLikun Gao2025-06-243-1/+126
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IP instance map init function can be an common function instead of operation on different ASIC. V2: Create amdgpu_ip.[ch] file for ip related functions. Signed-off-by: Likun Gao <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amd/amdgpu: Refine isp_v4_1_1 loggingPratap Nirujogi2025-06-241-12/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace DRM_ERROR with drm_err function and update log messages to drop __func__ and print return value. Reviewed-by: Mario Limonciello <[email protected]> Signed-off-by: Pratap Nirujogi <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amd/amdgpu: Add ISP Generic PM Domain (genpd) supportPratap Nirujogi2025-06-242-1/+157
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AMDISP I2C device requires to power on ISP HW to probe the sensor device. Instead of using the exported symbols from ISP driver to control the power and clocks remotely,added Generic PM Domain (genpd) support in amdgpu_isp device for its child devices (amd_isp_capture, amd_isp_i2c_designware) to set power and clocks using PM methods. Co-developed-by: Bin Du <[email protected]> Signed-off-by: Bin Du <[email protected]> Reviewed-by: Mario Limonciello <[email protected]> Signed-off-by: Pratap Nirujogi <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini+0x70cVitaly Prosyak2025-06-242-15/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The issue was reproduced on NV10 using IGT pci_unplug test. It is expected that `amdgpu_driver_postclose_kms()` is called prior to `amdgpu_drm_release()`. However, the bug is that `amdgpu_fpriv` was freed in `amdgpu_driver_postclose_kms()`, and then later accessed in `amdgpu_drm_release()` via a call to `amdgpu_userq_mgr_fini()`. As a result, KASAN detected a use-after-free condition, as shown in the log below. The proposed fix is to move the calls to `amdgpu_eviction_fence_destroy()` and `amdgpu_userq_mgr_fini()` into `amdgpu_driver_postclose_kms()`, so they are invoked before `amdgpu_fpriv` is freed. This also ensures symmetry with the initialization path in `amdgpu_driver_open_kms()`, where the following components are initialized: - `amdgpu_userq_mgr_init()` - `amdgpu_eviction_fence_init()` - `amdgpu_ctx_mgr_init()` Correspondingly, in `amdgpu_driver_postclose_kms()` we should clean up using: - `amdgpu_userq_mgr_fini()` - `amdgpu_eviction_fence_destroy()` - `amdgpu_ctx_mgr_fini()` This change eliminates the use-after-free and improves consistency in resource management between open and close paths. [ +0.094367] ================================================================== [ +0.000026] BUG: KASAN: slab-use-after-free in amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000866] Write of size 8 at addr ffff88811c068c60 by task amd_pci_unplug/1737 [ +0.000026] CPU: 3 UID: 0 PID: 1737 Comm: amd_pci_unplug Not tainted 6.14.0+ #2 [ +0.000008] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020 [ +0.000004] Call Trace: [ +0.000004] <TASK> [ +0.000003] dump_stack_lvl+0x76/0xa0 [ +0.000010] print_report+0xce/0x600 [ +0.000009] ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000790] ? srso_return_thunk+0x5/0x5f [ +0.000007] ? kasan_complete_mode_report_info+0x76/0x200 [ +0.000008] ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000684] kasan_report+0xbe/0x110 [ +0.000007] ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000601] __asan_report_store8_noabort+0x17/0x30 [ +0.000007] amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu] [ +0.000801] ? __pfx_amdgpu_userq_mgr_fini+0x10/0x10 [amdgpu] [ +0.000819] ? srso_return_thunk+0x5/0x5f [ +0.000008] amdgpu_drm_release+0xa3/0xe0 [amdgpu] [ +0.000604] __fput+0x354/0xa90 [ +0.000010] __fput_sync+0x59/0x80 [ +0.000005] __x64_sys_close+0x7d/0xe0 [ +0.000006] x64_sys_call+0x2505/0x26f0 [ +0.000006] do_syscall_64+0x7c/0x170 [ +0.000004] ? kasan_record_aux_stack+0xae/0xd0 [ +0.000005] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? kmem_cache_free+0x398/0x580 [ +0.000006] ? __fput+0x543/0xa90 [ +0.000006] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? __fput+0x543/0xa90 [ +0.000004] ? __kasan_check_read+0x11/0x20 [ +0.000007] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? __kasan_check_read+0x11/0x20 [ +0.000003] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? fpregs_assert_state_consistent+0x21/0xb0 [ +0.000006] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? syscall_exit_to_user_mode+0x4e/0x240 [ +0.000005] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? do_syscall_64+0x88/0x170 [ +0.000003] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? do_syscall_64+0x88/0x170 [ +0.000004] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? irqentry_exit+0x43/0x50 [ +0.000004] ? srso_return_thunk+0x5/0x5f [ +0.000004] ? exc_page_fault+0x7c/0x110 [ +0.000006] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000005] RIP: 0033:0x7ffff7b14f67 [ +0.000005] Code: ff e8 0d 16 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 73 ba f7 ff [ +0.000004] RSP: 002b:00007fffffffe358 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 [ +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffff7b14f67 [ +0.000003] RDX: 0000000000000000 RSI: 00007ffff7f5755a RDI: 0000000000000003 [ +0.000003] RBP: 00007fffffffe380 R08: 0000555555568170 R09: 0000000000000000 [ +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffffffe5c8 [ +0.000003] R13: 00005555555552a9 R14: 0000555555557d48 R15: 00007ffff7ffd040 [ +0.000007] </TASK> [ +0.000286] Allocated by task 425 on cpu 11 at 29.751192s: [ +0.000013] kasan_save_stack+0x28/0x60 [ +0.000008] kasan_save_track+0x18/0x70 [ +0.000006] kasan_save_alloc_info+0x38/0x60 [ +0.000006] __kasan_kmalloc+0xc1/0xd0 [ +0.000005] __kmalloc_cache_noprof+0x1bd/0x430 [ +0.000006] amdgpu_driver_open_kms+0x172/0x760 [amdgpu] [ +0.000521] drm_file_alloc+0x569/0x9a0 [ +0.000008] drm_client_init+0x1b7/0x410 [ +0.000007] drm_fbdev_client_setup+0x174/0x470 [ +0.000007] drm_client_setup+0x8a/0xf0 [ +0.000006] amdgpu_pci_probe+0x50b/0x10d0 [amdgpu] [ +0.000482] local_pci_probe+0xe7/0x1b0 [ +0.000008] pci_device_probe+0x5bf/0x890 [ +0.000005] really_probe+0x1fd/0x950 [ +0.000007] __driver_probe_device+0x307/0x410 [ +0.000005] driver_probe_device+0x4e/0x150 [ +0.000006] __driver_attach+0x223/0x510 [ +0.000005] bus_for_each_dev+0x102/0x1a0 [ +0.000006] driver_attach+0x3d/0x60 [ +0.000005] bus_add_driver+0x309/0x650 [ +0.000005] driver_register+0x13d/0x490 [ +0.000006] __pci_register_driver+0x1ee/0x2b0 [ +0.000006] xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo] [ +0.000008] do_one_initcall+0x9c/0x3e0 [ +0.000007] do_init_module+0x29e/0x7f0 [ +0.000006] load_module+0x5c75/0x7c80 [ +0.000006] init_module_from_file+0x106/0x180 [ +0.000007] idempotent_init_module+0x377/0x740 [ +0.000006] __x64_sys_finit_module+0xd7/0x180 [ +0.000006] x64_sys_call+0x1f0b/0x26f0 [ +0.000006] do_syscall_64+0x7c/0x170 [ +0.000005] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000013] Freed by task 1737 on cpu 9 at 76.455063s: [ +0.000010] kasan_save_stack+0x28/0x60 [ +0.000006] kasan_save_track+0x18/0x70 [ +0.000005] kasan_save_free_info+0x3b/0x60 [ +0.000006] __kasan_slab_free+0x54/0x80 [ +0.000005] kfree+0x127/0x470 [ +0.000006] amdgpu_driver_postclose_kms+0x455/0x760 [amdgpu] [ +0.000485] drm_file_free.part.0+0x5b1/0xba0 [ +0.000007] drm_file_free+0x13/0x30 [ +0.000006] drm_client_release+0x1c4/0x2b0 [ +0.000006] drm_fbdev_ttm_fb_destroy+0xd2/0x120 [drm_ttm_helper] [ +0.000007] put_fb_info+0x97/0xe0 [ +0.000006] unregister_framebuffer+0x197/0x380 [ +0.000005] drm_fb_helper_unregister_info+0x94/0x100 [ +0.000005] drm_fbdev_client_unregister+0x3c/0x80 [ +0.000007] drm_client_dev_unregister+0x144/0x330 [ +0.000006] drm_dev_unregister+0x49/0x1b0 [ +0.000006] drm_dev_unplug+0x4c/0xd0 [ +0.000006] amdgpu_pci_remove+0x58/0x130 [amdgpu] [ +0.000482] pci_device_remove+0xae/0x1e0 [ +0.000006] device_remove+0xc7/0x180 [ +0.000006] device_release_driver_internal+0x3d4/0x5a0 [ +0.000007] device_release_driver+0x12/0x20 [ +0.000006] pci_stop_bus_device+0x104/0x150 [ +0.000006] pci_stop_and_remove_bus_device_locked+0x1b/0x40 [ +0.000005] remove_store+0xd7/0xf0 [ +0.000007] dev_attr_store+0x3f/0x80 [ +0.000006] sysfs_kf_write+0x125/0x1d0 [ +0.000005] kernfs_fop_write_iter+0x2ea/0x490 [ +0.000007] vfs_write+0x90d/0xe70 [ +0.000006] ksys_write+0x119/0x220 [ +0.000006] __x64_sys_write+0x72/0xc0 [ +0.000006] x64_sys_call+0x18ab/0x26f0 [ +0.000005] do_syscall_64+0x7c/0x170 [ +0.000005] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ +0.000013] The buggy address belongs to the object at ffff88811c068000 which belongs to the cache kmalloc-rnd-01-4k of size 4096 [ +0.000016] The buggy address is located 3168 bytes inside of freed 4096-byte region [ffff88811c068000, ffff88811c069000) [ +0.000022] The buggy address belongs to the physical page: [ +0.000010] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88811c06e000 pfn:0x11c068 [ +0.000006] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ +0.000006] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) [ +0.000007] page_type: f5(slab) [ +0.000007] raw: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000 [ +0.000005] raw: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000 [ +0.000006] head: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000 [ +0.000005] head: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000 [ +0.000006] head: 0017ffffc0000003 ffffea0004701a01 ffffffffffffffff 0000000000000000 [ +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 [ +0.000004] page dumped because: kasan: bad access detected [ +0.000011] Memory state around the buggy address: [ +0.000009] ffff88811c068b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000012] ffff88811c068b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000011] >ffff88811c068c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000011] ^ [ +0.000010] ffff88811c068c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000011] ffff88811c068d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000011] ================================================================== Cc: Alex Deucher <[email protected]> Cc: Christian König <[email protected]> Cc: Lijo Lazar <[email protected]> Cc: Jesse Zhang <[email protected]> Cc: Arvind Yadav <[email protected]> v2: drop amdgpu_drm_release() and assign drm_release() as the callback directly.(Alex) Fixes: adba0929736a ("drm/amdgpu: Fix Illegal opcode in command stream Error") Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Vitaly Prosyak <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: remove fence slabAlex Deucher2025-06-243-27/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just use kmalloc for the fences in the rare case we need an independent fence. Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amd: Adjust output for discovery error handlingMario Limonciello2025-06-241-15/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 017fbb6690c2 ("drm/amdgpu/discovery: check ip_discovery fw file available") added support for reading an amdgpu IP discovery bin file for some specific products. If it's not found then it will fallback to hardcoded values. However if it's not found there is also a lot of noise about missing files and errors. Adjust the error handling to decrease most messages to DEBUG and to show users less about missing files. Reviewed-by: Lijo Lazar <[email protected]> Reported-by: Marcus Seyfarth <[email protected]> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4312 Tested-by: Marcus Seyfarth <[email protected]> Fixes: 017fbb6690c2 ("drm/amdgpu/discovery: check ip_discovery fw file available") Acked-by: Alex Deucher <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/mes: add compatibility checks for set_hw_resource_1Alex Deucher2025-06-242-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Seems some older MES firmware versions do not properly support this packet. Add back some the compatibility checks. v2: switch to fw version check (Shaoyun) Fixes: f81cd793119e ("drm/amd/amdgpu: Fix MES init sequence") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4295 Cc: Shaoyun Liu <[email protected]> Reviewed-by: shaoyun.liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/gfx9: Add Cleaner Shader Support for GFX9.x GPUsSrinivasan Shanmugam2025-06-241-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable the cleaner shader for other GFX9.x series of GPUs to provide data isolation between GPU workloads. The cleaner shader is responsible for clearing the Local Data Store (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs), which helps prevent data leakage and ensures accurate computation results. This update extends cleaner shader support to GFX9.x GPUs, previously available for GFX9.4.2. It enhances security by clearing GPU memory between processes and maintains a consistent GPU state across KGD and KFD workloads. Cc: Manu Rastogi <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/sdma5.2: init engine reset mutexAlex Deucher2025-06-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Missing the mutex init. Fixes: 47454f2dc0bf ("drm/amdgpu: Register the new sdma function pointers for sdma_v5_2") Reviewed-by: Jesse Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/sdma5: init engine reset mutexAlex Deucher2025-06-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Missing the mutex init. Fixes: e56d4bf57fab ("drm/amdgpu/: drm/amdgpu: Register the new sdma function pointers for sdma_v5_0") Reviewed-by: Jesse Zhang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: switch job hw_fence to amdgpu_fenceAlex Deucher2025-06-186-32/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the amdgpu fence container so we can store additional data in the fence. This also fixes the start_time handling for MCBP since we were casting the fence to an amdgpu_fence and it wasn't. Fixes: 3f4c175d62d8 ("drm/amdgpu: MCBP based on DRM scheduler (v9)") Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Add xgmi API to set max speed/widthLijo Lazar2025-06-182-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an API to set the max possible xgmi speed/width. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Deprecate xgmi_link_speed enumLijo Lazar2025-06-182-9/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xgmi doesn't have discrete max speeds defined. Speed numbers can be arbitrary based on SOC. Deprecate the enum. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Extend bus status check to more casesLijo Lazar2025-06-183-7/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of unexpected errors, check if device is alive on the bus. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: reclaim psp fw reservation memory regionFrank Min2025-06-185-2/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PSP v14 fw update introduced changes on memory reservation region, according to the change driver reclaim some non-reserved region. 1. introduce 2 new psp commands to query fw reservation regions 2. add a new reservation region for psp 3. reclaim psp non-used region Signed-off-by: Frank Min <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: refine usage of amdgpu_bad_page_thresholdganglxie2025-06-181-10/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | when amdgpu_bad_page_threshold == -1 or -2, driver will issue a warning message when threshold is reached and continue runtime services. Signed-off-by: ganglxie <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Fix SDMA UTC_L1 handling during start/stop sequencesJesse Zhang2025-06-181-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit makes two key fixes to SDMA v4.4.2 handling: 1. disable UTC_L1 in sdma_cntl register when stopping SDMA engines by reading the current value before modifying UTC_L1_ENABLE bit. 2. Ensure UTC_L1_ENABLE is consistently managed by: - Adding the missing register write when enabling UTC_L1 during start - Keeping UTC_L1 enabled by default as per hardware requirements v2: Correct SDMA_CNTL setting (Philip) Suggested-by: Jonathan Kim <[email protected]> Signed-off-by: Jesse Zhang <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu: Release reset locks during failuresLijo Lazar2025-06-181-25/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure to release reset domain lock in case of failures. Signed-off-by: Lijo Lazar <[email protected]> Signed-off-by: Ce Sun <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Fixes: 11bb33766f66 ("drm/amdgpu: refactor amdgpu_device_gpu_recover") Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdgpu/sdma: handle paging queues in amdgpu_sdma_reset_engine()Alex Deucher2025-06-181-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Need to properly start and stop paging queues if they are present. This is not an issue today since we don't support a paging queue on any chips with queue reset. Fixes: b22659d5d352 ("drm/amdgpu: switch amdgpu_sdma_reset_engine to use the new sdma function pointers") Reviewed-by: Jesse Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
| | | * drm/amdkfd: Move the process suspend and resume out of full accessEmily Deng2025-06-185-17/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the suspend and resume process, exclusive access is not required. Therefore, it can be moved out of the full access section to reduce the duration of exclusive access. v3: Move suspend processes before hardware fini. Remove twice call for bare metal. v4: Refine code Signed-off-by: Emily Deng <[email protected]> Acked-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>