aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
Commit message (Collapse)AuthorAgeFilesLines
* drm/amdgpu: Add init level for post reset reinitLijo Lazar2024-11-201-0/+2
| | | | | | | | | | | | | | | When device needs to be reset before initialization, it's not required for all IPs to be initialized before a reset. In such cases, it needs to identify whether the IP/feature is initialized for the first time or whether it's reinitialized after a reset. Add RESET_RECOVERY init level to identify post reset reinitialization phase. This only provides a device level identification, IP/features may choose to track their state independently also. Signed-off-by: Lijo Lazar <[email protected]> Acked-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add reset on init handler for XGMILijo Lazar2024-09-261-0/+5
| | | | | | | | | | | In some cases, device needs to be reset before first use. Add handlers for doing device reset during driver init sequence. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Acked-by: Rajneesh Bhardwaj <[email protected]> Tested-by: Rajneesh Bhardwaj <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: abort KIQ waits when there is a pending resetVictor Skvortsov2024-08-161-0/+6
| | | | | | | | | | | | Stop waiting for the KIQ to return back when there is a reset pending. It's quite likely that the KIQ will never response. Signed-off-by: Koenig Christian <[email protected]> Suggested-by: Lazar Lijo <[email protected]> Tested-by: Victor Skvortsov <[email protected]> Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add reset sources in gpu reset contextEric Huang2024-06-051-0/+13
| | | | | | | | | | reset source or reset cause is very useful info for reset context, it will be used by events API. Suggested-by: Lijo Lazar <[email protected]> Signed-off-by: Eric Huang <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add reset_context flag for host FLRYunxiang Li2024-05-021-0/+1
| | | | | | | | | | | | | | There are other reset sources that pass NULL as the job pointer, such as amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if the FLR comes from the host does not work. Add a flag in reset_context to explicitly mark host triggered reset, and set this flag when we receive host reset notification. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Emily Deng <[email protected]> Reviewed-by: Zhigang Luo <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Skip the coredump collection on reset during driver reloadAhmad Rehman2024-04-191-0/+1
| | | | | | | | | | | | | | In passthrough environment, the driver triggers the mode-1 reset on reload. The reset causes the core dump collection which is delayed task and prevents driver from unloading until it is completed. Since we do not need to collect data on "reset on reload" case, we can skip core dump collection. v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well. Signed-off-by: Ahmad Rehman <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refactor code to split devcoredump codeSunil Khatri2024-03-221-16/+0
| | | | | | | | | | | | | | | | Refractor devcoredump code into new files since its functionality is expanded further and better to slit and devcoredump to have its own file. v2: Fix the build failure caught by arm compiler of implicit function declaration with #ifdef v3: squash in fix for implicit declaration error Cc: Ivan Lipski <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: Sunil Khatri <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add ring timeout information in devcoredumpSunil Khatri2024-03-061-0/+1
| | | | | | | | | | | | | | Add ring timeout related information in the amdgpu devcoredump file for debugging purposes. During the gpu recovery process the registered call is triggered and add the debug information in data file created by devcoredump framework under the directory /sys/class/devcoredump/devcdx/ Signed-off-by: Sunil Khatri <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: revert "Adjust removal control flow for smu v13_0_2"Christian König2024-01-181-1/+0
| | | | | | | | | | | | | | | | | | | | Calling amdgpu_device_ip_resume_phase1() during shutdown leaves the HW in an active state and is an unbalanced use of the IP callbacks. Using the IP callbacks like this can lead to memory leaks, double free and imbalanced reference counters. Leaving the HW in an active state can lead to DMA accesses to memory now freed by the driver. Both is a complete no-go for driver unload so completely revert the workaround for now. This reverts commit f5c7e7797060255dbc8160734ccc5ad6183c5e04. Signed-off-by: Christian König <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Create version number for coredumpsAndré Almeida2023-10-201-0/+3
| | | | | | | | | | | | | Even if there's nothing currently parsing amdgpu's coredump files, if we eventually have such tools they will be glad to find a version field to properly read the file. Create a version number to be displayed on top of coredump file, to be incremented when the file format or content get changed. Signed-off-by: André Almeida <[email protected]> Reviewed-by: Shashank Sharma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Move coredump code to amdgpu_reset fileAndré Almeida2023-10-201-0/+11
| | | | | | | | | | Giving that we use codedump just for device resets, move it's functions and structs to a more semantic file, the amdgpu_reset.{c, h}. Signed-off-by: André Almeida <[email protected]> Signed-off-by: Shashank Sharma <[email protected]> Reviewed-by: Shashank Sharma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Keep reset handlers sharedLijo Lazar2023-08-301-4/+12
| | | | | | | | | | | | Instead of maintaining a list per device, keep the reset handlers common per ASIC family. A pointer to the list of handlers is maintained in reset control. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Le Ma <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Tested-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* Revert "drm/amdgpu: let mode2 reset fallback to default when failure"Victor Zhao2022-10-191-2/+1
| | | | | | | | | | | | This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957. This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with the original design of reset handler. Will redesign it. Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Skip put_reset_domain if it doesn't existVignesh Chander2022-09-291-1/+2
| | | | | | | | | For xgmi sriov, the reset is handled by host driver and hive->reset_domain is not initialized so need to check if it exists before doing a put. Signed-off-by: Vignesh Chander <[email protected]> Reviewed-by: Shaoyun Liu <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Adjust removal control flow for smu v13_0_2YiPeng Chai2022-09-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adjust removal control flow for smu v13_0_2: During amdgpu uninstallation, when removing the first device, the kernel needs to first send a mode1reset message to all gpu devices. Otherwise, smu initialization will fail the next time amdgpu is installed. V2: 1. Update commit comments. 2. Remove the global variable amdgpu_device_remove_cnt and add a variable to the structure amdgpu_hive_info. 3. Use hive to detect the first removed device instead of a global variable. V3: 1. Update commit comments. 2. Split a patch into multiple patches. 3. The current patch does: a. Add a work mode of AMDGPU_RESET_FOR_DEVICE_REMOVE into the existing gpu recover path, which make all devices in hive list only have HW reset but no resume (except the base IP). b. Call AMDGPU_RESET_FOR_DEVICE_REMOVE and AMDGPU_NEED_FULL_RESET mode of amdgpu_device_gpu_recover in amdgpu_pci_remove when removing the first device in hive list. c. When removing the first device, the IP blocks keyword function call sequence is as follows: .suspend->mode1reset->.resume(basic ip)->.hw_fini->.early_fini->.sw_fini. ^ | |-<----------<---------<----| The first three sequences are because of a call to amdgpu_device_gpu_recover. The three sequences will be executed in a loop until all devices in the hive list are iterated. The sequences starting from .hw_fini only apply to the first device. Since .suspend has been called before, except the resumed phase1 basic ip blocks, all other ip blocks .hw_fini of current device will do nothing. d. When removing other devices, the calling sequences is the same as legacy: .hw_fini -> .early_fini -> .sw_fini. Since .suspend has been called when removing the first device, except the resumed phase1 basic ip blocks, all of other ip blocks .hw_fini of current device will do nothing. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: let mode2 reset fallback to default when failureVictor Zhao2022-08-161-0/+1
| | | | | | | | | | | - introduce AMDGPU_SKIP_MODE2_RESET flag - let mode2 reset fallback to default reset method if failed v2: move this part out from the asic specific part Signed-off-by: Victor Zhao <[email protected]> Acked-by: Andrey Grodzovsky <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Avoid another list of reset devicesLijo Lazar2022-08-101-0/+1
| | | | | | | | | | | | A list of devices to be reset is already created in amdgpu_device_gpu_recover function. Creating another list with the same nodes is incorrect and not supported in list_head. Instead, pass the device list as part of reset context. Fixes: 9e08564727fc (drm/amdgpu: Refactor mode2 reset logic for v13.0.2) Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Cache result of last reset at reset domain level.Andrey Grodzovsky2022-06-101-0/+1
| | | | | | | | Will be read by executors of async reset like debugfs. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix compile error.Andrey Grodzovsky2022-02-101-2/+1
| | | | | | | | | | | Seems I forgot to add this to the relevant commit when submitting. Signed-off-by: Andrey Grodzovsky <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Christian König <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
* drm/amdgpu: Rework amdgpu_device_lock_adevAndrey Grodzovsky2022-02-091-0/+4
| | | | | | | | | | | | This functions needs to be split into 2 parts where one is called only once for locking single instance of reset_domain's sem and reset flag and the other part which handles MP1 states should still be called for each device in XGMI hive. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74118.html
* drm/amdgpu: Move in_gpu_reset into reset_domainAndrey Grodzovsky2022-02-091-0/+1
| | | | | | | | | We should have a single instance per entrire reset domain. Signed-off-by: Andrey Grodzovsky <[email protected]> Suggested-by: Lijo Lazar <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74116.html
* drm/amdgpu: Move reset sem into reset_domainAndrey Grodzovsky2022-02-091-0/+1
| | | | | | | | | | | We want single instance of reset sem across all reset clients because in case of XGMI we should stop access cross device MMIO because any of them could be in a reset in the moment. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74117.html
* drm/amdgpu: Rework reset domain to be refcounted.Andrey Grodzovsky2022-02-091-0/+35
| | | | | | | | | | | | | | | | | | | | | | | The reset domain contains register access semaphor now and so needs to be present as long as each device in a hive needs it and so it cannot be binded to XGMI hive life cycle. Adress this by making reset domain refcounted and pointed by each member of the hive and the hive itself. v4: Fix crash on boot witrh XGMI hive by adding type to reset_domain. XGMI will only create a new reset_domain if prevoius was of single device type meaning it's first boot. Otherwsie it will take a refocunt to exsiting reset_domain from the amdgou device. Add a wrapper around reset_domain->refcount get/put and a wrapper around send to reset wq (Lijo) Signed-off-by: Andrey Grodzovsky <[email protected]> Acked-by: Christian König <[email protected]> Link: https://www.spinics.net/lists/amd-gfx/msg74121.html
* drm/amdgpu: Fix build warningsLijo Lazar2021-04-091-1/+1
| | | | | | | | | | | | | Fix header guard and make internal functions static. Fixes the below warnings: drivers/gpu/drm/amd/amdgpu/../amdgpu/amdgpu_reset.h:24:9: warning: '__AMDUGPU_RESET_H__' is used as a header guard here, followed by #define of a different macro [-Wheader-guard] drivers/gpu/drm/amd/amdgpu/aldebaran.c:110:6: warning: no previous prototype for function 'aldebaran_async_reset' [-Wmissing-prototypes] drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu13/aldebaran_ppt.c:1435:5: warning: no previous prototype for function 'aldebaran_mode2_reset' [-Wmissing-prototypes] Signed-off-by: Lijo Lazar <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add reset control to amdgpu_deviceLijo Lazar2021-04-091-0/+85
v1: Add generic amdgpu_reset_control to handle different types of resets. It may be added at device, hive or ip level. Each reset control has a list of handlers associated with it to handle different types of reset. Reset control is responsible for choosing the right handler given a particular reset context. Handler objects may implement a set of functions on how to handle a particular type of reset. prepare_env = Prepare environment/software context (not used currently). prepare_hwcontext = Prepare hardware context for the reset. perform_reset = Perform the type of reset. restore_hwcontext = Restore the hw context after reset. restore_env = Restore the environment after reset (not used currently). Reset context carries the context of reset, as of now this is based on the parameters used for current set of resets. v2: Fix coding style Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>