aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
Commit message (Collapse)AuthorAgeFilesLines
* drm/amdgpu: add ras event id supportYang Wang2024-03-201-14/+18
| | | | | | | | | | | | | | | | add amdgpu ras event id support to better distinguish different error information sources in dmesg logs. the following log will be identify by event id: {event_id} interrupt to inform RAS event {event_id} ACA logs {event_id} errors statistic since from current injection/error query {event_id} errors statistic since from gpu load Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: use helper macro HW_ERR instead of Hardware error stringYang Wang2024-01-291-6/+6
| | | | | | | | use helper macro HW_ERR to instead of Hardware error string. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add interface to check mca umc statusYiPeng Chai2024-01-221-1/+11
| | | | | | | | Add interface to check mca umc status. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Log deferred error separatelyCandice Li2024-01-151-3/+8
| | | | | | | | | Separate deferred error from UE and CE and log it individually. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix variable 'mca_funcs' dereferenced before NULL check in ↵Srinivasan Shanmugam2024-01-051-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'amdgpu_mca_smu_get_mca_entry()' Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c:377 amdgpu_mca_smu_get_mca_entry() warn: variable dereferenced before check 'mca_funcs' (see line 368) 357 int amdgpu_mca_smu_get_mca_entry(struct amdgpu_device *adev, enum amdgpu_mca_error_type type, 358 int idx, struct mca_bank_entry *entry) 359 { 360 const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs; 361 int count; 362 363 switch (type) { 364 case AMDGPU_MCA_ERROR_TYPE_UE: 365 count = mca_funcs->max_ue_count; mca_funcs is dereferenced here. 366 break; 367 case AMDGPU_MCA_ERROR_TYPE_CE: 368 count = mca_funcs->max_ce_count; mca_funcs is dereferenced here. 369 break; 370 default: 371 return -EINVAL; 372 } 373 374 if (idx >= count) 375 return -EINVAL; 376 377 if (mca_funcs && mca_funcs->mca_get_mca_entry) ^^^^^^^^^ Checked too late! Cc: Yang Wang <[email protected]> Cc: Hawking Zhang <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: MCA supports recording umc address informationYiPeng Chai2023-12-191-2/+11
| | | | | | | | | | | | MCA supports recording umc address information. V2: Move err_addr variable from struct ras_err_node to struct ras_err_info. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix missing mca debugfs nodeYang Wang2023-11-301-1/+1
| | | | | | | | | | | Use amdgpu_ip_version() helper function to check ip version. The ip version contains other information, use the helper function to avoid reading wrong value. Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Move mca debug mode decision to rasLijo Lazar2023-11-291-1/+1
| | | | | | | | | | | | | Refactor code such that ras block decides the default mca debug mode, and not swsmu block. By default mca debug mode is set to false. v2: squash in uninitialized value fix (Alex) Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: correct mca debugfs dump reg listYang Wang2023-11-091-2/+9
| | | | | | | | avoid driver to touch invalid mca reg. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: correct acclerator check architecutre dumpHawking Zhang2023-11-091-4/+11
| | | | | | | | So driver doesn't touch invalid aca entries. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refine smu v13.0.6 mca dump driverYang Wang2023-11-091-6/+158
| | | | | | | | | | | | | refine smu mca driver to support query ras error from pmfw path. - correct gfx smu bank hwid (from mp5 to smu bank) - retire unused callback function in amdgpu_mca_smu_funcs{} - add new mca_bank_set{} structure to collect mca bank - move enum mca_reg_idx into amdgpu_mca.h header - add mca status register field decode macro Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add amdgpu mca debug sysfs supportYang Wang2023-09-201-0/+116
| | | | | | | | add amdgpu mca debug sysfs support. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add amdgpu smu mca dump feature supportYang Wang2023-09-201-0/+70
| | | | | | | | add amdgpu smu mca dump feature support. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Rework mca ras sw_initHawking Zhang2023-03-151-0/+72
| | | | | | | | | To align with other IP blocks Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Stanley Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Remove redundant calls of amdgpu_ras_block_late_fini in mca ras ↵yipechai2022-03-021-6/+0
| | | | | | | | | | block Remove redundant calls of amdgpu_ras_block_late_fini in mca ras block. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Remove redundant calls of ras_late_init in mca ras blockyipechai2022-02-171-6/+0
| | | | | | | | Remove redundant calls of ras_late_init in mca ras block. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Optimize amdgpu_mca_ras_late_init/amdgpu_mca_ras_fini function codeyipechai2022-02-141-39/+2
| | | | | | | | Optimize amdgpu_mca_ras_late_init/amdgpu_mca_ras_fini function code. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Optimize xxx_ras_late_init/xxx_ras_late_fini for each ras blockyipechai2022-02-141-3/+4
| | | | | | | | | | | | | | | | | | 1. Define amdgpu_ras_block_late_init to create sysfs nodes and interrupt handles. 2. Define amdgpu_ras_block_late_fini to remove sysfs nodes and interrupt handles. 3. Replace ras block variable members in struct amdgpu_ras_block_object with struct ras_common_if, which can make it easy to associate each ras block instance with each ras block functional interface. 4. Add .ras_cb to struct amdgpu_ras_block_object. 5. Change each ras block to fit for the changement of struct amdgpu_ras_block_object. Signed-off-by: yipechai <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Modify mca block to fit for the unified ras block data and opsyipechai2022-01-141-4/+7
| | | | | | | | | | | | | | | 1.Modify mca block to fit for the unified ras block data and ops. 2.Define special .ras_block_match function for mca block to identify itself. 3.Change amdgpu_mca_ras_funcs to amdgpu_mca_ras_block(amdgpu_mca_ras had been used), and the corresponding variable name remove _funcs suffix. 4.Remove the const flag of cma ras variable so that cma ras block can be able to be inserted into amdgpu device ras block link list. 5.Invoke amdgpu_ras_register_ras_block function to register cma ras block into amdgpu device ras block link list. 6.Remove the redundant code about cma in amdgpu_ras.c after using the unified ras block. Signed-off-by: yipechai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: John Clements <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Updated RAS infrastructureJohn Clements2021-09-231-4/+4
| | | | | | | | Update RAS infrastructure to support RAS query for MCA subblocks Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add driver infrastructure for MCA RASJohn Clements2021-08-241-0/+117
Add MCA specific IP blocks targetting RAS features Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>