| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
|
|
|
|
|
| |
With adding sriov_vf check, we directly return EOPNOTSUPP in
ras_reset_error_count as we should not do anything on VF to reset RAS error
count.
This also fixes the issue that loading guest driver causes register
violations.
Reviewed-by: Ahmad Rehman <[email protected]>
Signed-off-by: Ellen Pan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
| |
VFs query RAS error counts directly from host with
AMDGPU_RAS_VIRT_ERROR_COUNT_QUERY. When ACA is enabled,
an unusable aca_sysfs is created rather than amdgpu_ras_sysfs_create()
Likewise, VFs depend on host support to query CPERs, rather than ACA component.
Signed-off-by: Victor Skvortsov <[email protected]>
Reviewed-by: Zhigang Luo <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
| |
Update ta ra block to keep sync with RAS TA.
Signed-off-by: Stanley.Yang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
| |
Clear old data and save it in V3 format.
v2: only format eeprom data for new ASICs.
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
| |
Enable ACA by default for psp v13_0_6/v13_0_14.
Signed-off-by: Xiang Liu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
| |
for old asics that do not support mca translating, we
just save PA for them
Signed-off-by: ganglxie <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
| |
Change the DMESG reporting of unknown errors to "Boot Controller
Generic Error" to align with the RAS SPEC and provide more clarity
to customers.
Signed-off-by: Xiang Liu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
| |
save only one record to save eeprom space,and
bad_page_num = pa_rec_num + mca_rec_num*16
Signed-off-by: ganglxie <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
| |
bad page adding can be simpler with nps info
Signed-off-by: ganglxie <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
| |
Enable ACA by default for psp v13_0_12.
Signed-off-by: Candice Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
| |
VFs are not able to query error counts for all RAS blocks. Rather than
returning error for queries on these blocks, skip sysfs the creation
all together.
Signed-off-by: Victor Skvortsov <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
| |
The driver's behavior varies based on
the configuration of amdgpu_bad_page_threshold setting
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It ensures that appropriate error codes are returned when an error
condition is detected
Fixes the below;
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2849 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_umc_pages_in_a_row()' failed.
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2884 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_ras_mca2pa()' failed.
v2: s/-EIO/-EINVAL, retained the use of -EINVAL from
amdgpu_umc_pages_in_a_row & and amdgpu_ras_mca2pa_by_idx, when the
RAS context is not initialized or the convert_ras_err_addr function is
unavailable. (Thomas)
V3: Returning 0 as the absence of eh_data is acceptable. (Tao)
Fixes: a8d133e625ce ("drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes")
Reported-by: Dan Carpenter <[email protected]>
Cc: YiPeng Chai <[email protected]>
Cc: Tao Zhou <[email protected]>
Cc: Hawking Zhang <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
| |
Enable psp v14_0_3 RAS support for non-SRIOV configurations.
Signed-off-by: Candice Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
| |
Add nbif v6_3_1 fatal error handling support.
Signed-off-by: Candice Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
| |
Add psp v14_0_3 ras support.
Signed-off-by: Candice Li <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
| |
Enable RAS Cap check and initialize RAS funcs
for psp v13_0_12
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
| |
After the introduction of NPS RAS, one bad page record on eeprom may be
related to 1 or 16 bad pages, so the bad page record and bad page are
two different concepts, define a new variable to store bad page number.
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
| |
Init function is for ras table header read and check function is
responsible for the validation of the header. Call them in different
stages.
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
| |
Remove unnecessary variable and simplify the logic.
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
| |
All legacy RAS bad pages are generated in NPS1 mode, but new bad page
can be generated in any NPS mode, so we can't use retired_page stored
on eeprom directly in non-nps1 mode even for legacy data. We need to
take different actions for different data, new data can be identified
from old data by UMC_CHANNEL_IDX_V2 flag.
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
| |
Old version of RAS TA doesn't support to convert MCA address stored on
eeprom to physical address (PA), support to find all bad pages in one
memory row by PA with old RAS TA. This approach is only suitable for
nps1 mode.
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
| |
So eeprom space can be saved, compatible with legacy way.
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Before scheduling a recovery due to scheduler/job hang, check if a RAS
error is detected. If so, choose RAS recovery to handle the situation. A
scheduler/job hang could be the side effect of a RAS error. In such
cases, it is required to go through the RAS error recovery process. A
RAS error recovery process in certains cases also could avoid a full
device device reset.
An error state is maintained in RAS context to detect the block
affected. Fatal Error state uses unused block id. Set the block id when
error is detected. If the interrupt handler detected a poison error,
it's not required to look for a fatal error. Skip fatal error checking
in such cases.
Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
| |
NPS mode is introduced, the value of memory physical address (PA)
related to a MCA address varies per nps mode. We need to rely on
MCA address and convert it into PA accroding to nps mode.
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
| |
One UMC MCA address could map to multiply physical address (PA):
AMDGPU_RAS_EEPROM_REC_PA: one record store one PA
AMDGPU_RAS_EEPROM_REC_MCA: one record store one MCA address, PA
is not cared about
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
| |
Some in_reset checks are infact checking whether the state is
reinitialization after reset. Replace with reset_in_recovery calls to
identify that it's really checking for recovery stage after reset.
Signed-off-by: Lijo Lazar <[email protected]>
Acked-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Enable RAS late init if VF RAS Telemetry is supported.
When enabled, the VF can use this interface to query total
RAS error counts from the host.
The VF FB access may abruptly end due to a fatal error,
therefore the VF must cache and sanitize the input.
The Host allows 15 Telemetry messages every 60 seconds, afterwhich
the host will ignore any more in-coming telemetry messages. The VF will
rate limit its msg calling to once every 5 seconds (12 times in 60 seconds).
While the VF is rate limited, it will continue to report the last
good cached data.
v2: Flip generate report & update statistics order for VF
Signed-off-by: Victor Skvortsov <[email protected]>
Acked-by: Tao Zhou <[email protected]>
Reviewed-by: Zhigang Luo <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
| |
If VF RAS Capability support is enabled, guest is able to
retrieve the real RAS support from the host.
Signed-off-by: Victor Skvortsov <[email protected]>
Reviewed-by: Zhigang Luo <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
| |
For RAS errors, source of error is known. Skip the core dump of IP
states.
Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Use XGMI hive information to rely on resetting XGMI devices on
initialization rather than using mgpu structure. mgpu structure may have
other devices as well.
Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Acked-by: Rajneesh Bhardwaj <[email protected]>
Tested-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a separate function to read badpage data during initialization.
Reading bad pages will need hardware access and cannot be done during
reset. Hence in cases where device needs a full reset during
init itself, attempting to read will cause a deadlock.
Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Acked-by: Rajneesh Bhardwaj <[email protected]>
Tested-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
| |
Drop pending_reset flag in gmc block. Instead use init level to
determine which type of init is preferred - in this case MINIMAL.
Signed-off-by: Lijo Lazar <[email protected]>
Acked-by: Rajneesh Bhardwaj <[email protected]>
Tested-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
| |
In multiple GPUs case, after a GPU has started
resetting all GPUs on hive, other GPUs do not
need to trigger GPU reset again.
Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Correctly spelled comments make it easier for the reader to understand
the code.
Replace 'udpate' with 'update' in the comment &
replace 'recieved' with 'received' in the comment &
replace 'dsiable' with 'disable' in the comment &
replace 'Initiailize' with 'Initialize' in the comment &
replace 'disble' with 'disable' in the comment &
replace 'Disbale' with 'Disable' in the comment &
replace 'enogh' with 'enough' in the comment &
replace 'availabe' with 'available' in the comment.
Acked-by: Christian König <[email protected]>
Signed-off-by: Yan Zhen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
| |
The feature is not applicable to specific app platform.
v2: update the disablement condition and commit description
v3: move the setting to amdgpu_ras_check_supported
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
- amdgpu_ras_error_statistic_ue_count()
- amdgpu_ras_error_statistic_ce_count()
- amdgpu_ras_error_statistic_de_count()
The parameter 'err_addr' is no longer used since following patch.
Fixes: a7e8467fbeee ("drm/amdgpu: Remove unused code")
Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
| |
In the convenience of calling it globally.
Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
| |
Data abort exception and unknown errors are supported.
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
| |
Remove unused code.
Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The problem case is as follows:
1. GPU A triggers a gpu ras reset, and GPU A drives
GPU B to also perform a gpu ras reset.
2. After gpu B ras reset started, gpu B queried a DE
data. Since the DE data was queried in the ras reset
thread instead of the page retirement thread, bad
page retirement work would not be triggered. Then
even if all gpu resets are completed, the bad pages
will be cached in RAM until GPU B's bad page retirement
work is triggered again and then saved to eeprom.
This patch can save the bad pages to eeprom in time after gpu
ras reset is completed.
v2:
1. Add the above description to code comments.
2. Reuse existing function.
Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Before uninstalling gpu driver, flush all cached ras
bad pages to eeprom.
v2:
Put the same code into a function and reuse the function.
Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
| |
add amdgpu ras 'event_state' sysfs device attribute support
Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
| |
add amdgpu ras POSION_CONSUMPTION event id support.
Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
| |
add amdgpu ras POSION_CREATION event id support.
Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
v1:
- use unified event id to manage ras events
- add a new function amdgpu_ras_query_error_status_with_event() to accept
event type as parameter.
v2:
add a warn log to show the location of function failure
when calling amdgpu_ras_mark_event(). (Tao Zhou)
v3:
change RAS_EVENT_TYPE_ISR to RAS_EVENT_TYPE_FATAL.
v4:
rename amdgpu_ras_get_recovery_event() to
amdgpu_ras_get_fatal_error_event().
Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
| |
Sysfs node disable query error count during gpu reset.
Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Stanley.Yang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
| |
To align with firmware, hbm id field 0x1 refers to
hbm stack 0, 0x2 refers to hbm statck 1.
Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
Add gpu reset check and exception handling for
page retirement.
v2:
Clear poison consumption messages cached in fifo after
non mode-1 reset.
Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
| |
1. The poison fifo is only used for poison consumption
requests.
2. Merge reset requests when poison fifo caches multiple
poison consumption messages
Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|