aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
Commit message (Collapse)AuthorAgeFilesLines
* drm/amdgpu: refine eeprom data checkganglxie2025-07-151-0/+28
| | | | | | | | | | add eeprom data checksum check before driver unload. reset eeprom and save correct data to eeprom when check failed Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Convert from DRM_* to dev_*Lijo Lazar2025-06-301-34/+43
| | | | | | | | Convert from generic DRM_* to dev_* calls to have device context info. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refine usage of amdgpu_bad_page_thresholdganglxie2025-06-181-10/+9
| | | | | | | | | | when amdgpu_bad_page_threshold == -1 or -2, driver will issue a warning message when threshold is reached and continue runtime services. Signed-off-by: ganglxie <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: clear pa and mca record counter when resetting eepromganglxie2025-06-181-0/+2
| | | | | | | | | clear pa and mca record counter when resetting eeprom, so that ras_num_bad_pages can be calculated correctly Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Set RAS EEPROM table version to v3 for umc v12_5Candice Li2025-04-111-0/+1
| | | | | | | | Set RAS EEPROM table version to v3 for umc v12_5. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Reset RAS table if header is invalidLijo Lazar2025-04-081-5/+7
| | | | | | | | | If a valid header is not found during RAS eeprom init, consider it as new and reset RAS table info. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add basic validation for RAS headerLijo Lazar2025-04-071-3/+19
| | | | | | | | | | If RAS header read from EEPROM is corrupted, it could result in trying to allocate huge memory for reading the records. Add some validation to header fields. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add EEPROM I2C address support for smu v13_0_12Candice Li2025-03-191-0/+2
| | | | | | | | Add EEPROM I2C address support for smu v13_0_12. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: format old RAS eeprom data into V3 versionTao Zhou2025-03-181-12/+14
| | | | | | | | | | Clear old data and save it in V3 format. v2: only format eeprom data for new ASICs. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Save PA of bad pages for old asicsganglxie2025-03-141-2/+7
| | | | | | | | | for old asics that do not support mca translating, we just save PA for them Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: increase RAS bad page thresholdTao Zhou2025-03-071-3/+3
| | | | | | | | | | | For default policy, driver will issue an RMA event when the number of bad pages is greater than 8 physical rows, rather than reaches 8 physical rows, don't rely on threshold configurable parameters in default mode. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Change page/record number calculation based on npsganglxie2025-02-251-10/+7
| | | | | | | | | save only one record to save eeprom space,and bad_page_num = pa_rec_num + mca_rec_num*16 Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Save nps to eepromganglxie2025-02-251-2/+6
| | | | | | | | nps info saved together with bad page makes bad page parsing more efficient Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add flag to make VBIOS read optionalLijo Lazar2025-02-131-1/+1
| | | | | | | | | | | Certain SOCs may not need much data from VBIOS. Some data like VBIOS version used will be missed but it doesn't affect functionality. Add a flag to make VBIOS image optional. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Update usage for bad page thresholdHawking Zhang2025-02-131-18/+23
| | | | | | | | | The driver's behavior varies based on the configuration of amdgpu_bad_page_threshold setting Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: simplify return statement in amdgpu_ras_eeprom_initDheeraj Reddy Jonnalagadda2024-12-181-1/+1
| | | | | | | | | | | | Remove the logically dead code in the last return statement of amdgpu_ras_eeprom_init. The condition res < 0 is redundant since res is already checked for a negative value earlier. Replace return res < 0 ? res : 0; with return 0 to improve clarity. Fixes: 63d4c081a556 ("drm/amdgpu: Optimize EEPROM RAS table I/O") Closes: https://scan7.scan.coverity.com/#/project-view/52337/11354?selectedIssue=1602413 Signed-off-by: Dheeraj Reddy Jonnalagadda <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: correct the calculation of RAS bad pageTao Zhou2024-12-101-13/+27
| | | | | | | | | | After the introduction of NPS RAS, one bad page record on eeprom may be related to 1 or 16 bad pages, so the bad page record and bad page are two different concepts, define a new variable to store bad page number. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: split ras_eeprom_init into init and check functionsTao Zhou2024-12-101-0/+20
| | | | | | | | | | Init function is for ras table header read and check function is responsible for the validation of the header. Call them in different stages. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: return error when eeprom checksum failedJinzhou Su2024-12-101-2/+4
| | | | | | | | | | | Return eeprom table checksum error result, otherwise it might be overwritten by next call. V2: replace DRM_ERROR with dev_err Signed-off-by: Jinzhou Su <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add a flag to indicate UMC channel index versionTao Zhou2024-12-101-1/+10
| | | | | | | | | | | v1 (legacy way): store channel index within a UMC instance in eeprom v2: store global channel index in eeprom V2: only save the flag on eeprom, clear it after saving. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix a typoAndrew Kreimer2024-09-181-1/+1
| | | | | | | | Fix a typo in comments. Reported-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Kreimer <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix eeprom max record countStanley.Yang2024-07-241-0/+3
| | | | | | | | | | | | | The eeprom table is empty before initializing, set eeprom table version first before initializing. Changed from V1: Reuse amdgpu_ras_set_eeprom_table_version function Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 015b8a2fdf39a4c288ff24e7b715b8d9198e56dc)
* drm/amdgpu: add RAS is_rma flagTao Zhou2024-06-051-4/+6
| | | | | | | | Set the flag to true if bad page number reaches threshold. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add smu v13_0_14 ip blockHawking Zhang2024-05-021-0/+2
| | | | | | | | Add smu v13_0_14 ip block support Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Le Ma <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Update setting EEPROM table versionCandice Li2024-03-201-5/+17
| | | | | | | | | Use helper function instead of umc callback to set EEPROM table version. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: send smu rma reason event in ras eeprom driverYang Wang2024-02-121-0/+3
| | | | | | | | | send smu rma reason event to smu in ras eeprom driver. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Update EEPROM I2C address for smu v13_0_0Candice Li2023-11-291-0/+6
| | | | | | | | Check smu v13_0_0 SKU type to select EEPROM I2C address. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix complex macros errorSrinivasan Shanmugam2023-10-051-1/+1
| | | | | | | | | | | | | | | | Fixes the below: ERROR: Macros with complex values should be enclosed in parentheses WARNING: macros should not use a trailing semicolon +#define amdgpu_inc_vram_lost(adev) atomic_inc(&((adev)->vram_lost_counter)); Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Cc: "Pan, Xinhui" <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: change if condition for bad channel bitmap updateTao Zhou2023-09-261-2/+4
| | | | | | | | | | | | | | | The amdgpu_ras_eeprom_control.bad_channel_bitmap is u32 type, but the channel index could be larger than 32. For the ASICs whose channel number is more than 32, the amdgpu_dpm_send_hbm_bad_channel_flag interface is not supported, so we simply bypass channel bitmap update under this condition. v2: replace sizeof with BITS_PER_TYPE, we should check bit number instead of byte number. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use function for IP version checkLijo Lazar2023-09-201-2/+2
| | | | | | | | | Use an inline function for version check. Gives more flexibility to handle any format changes. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Only support RAS EEPROM on dGPU platformCandice Li2023-08-301-1/+2
| | | | | | | | RAS EEPROM device is only supported on dGPU platform for smu v13_0_6. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add I2C EEPROM support on smu v13_0_6Candice Li2023-08-151-0/+2
| | | | | | | | | | Support I2C EEPROM on smu v13_0_6. v2: Move IP_VERSION(13, 0, 6) ahead of IP_VERSION(13, 0, 10). Signed-off-by: Candice Li <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amd: Avoid reading the VBIOS part number twiceMario Limonciello2023-07-211-4/+4
| | | | | | | | | | | | | The VBIOS part number is read both in amdgpu_atom_parse() as well as in atom_get_vbios_pn() and stored twice in the `struct atom_context` structure. Remove the first unnecessary read and move the `pr_info` line from that read into the second. v2: squash in unused variable removal Signed-off-by: Mario Limonciello <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Fix kdoc warningSrinivasan Shanmugam2023-06-151-2/+2
| | | | | | | | | | | | | | | Fixes the following gcc with W=1: drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:76: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * EEPROM Table structure v1 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:98: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * EEPROM Table structrue v2.1 Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Set EEPROM ras infoStanley.Yang2023-06-091-0/+24
| | | | | | | | | Set EEPROM ras info: rma status, health percent and bad page threshold. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Calculate EEPROM table ras info bytes sumStanley.Yang2023-06-091-0/+19
| | | | | | | | It's more reasonable to check EEPROM table ras info bytes. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add support EEPROM table v2.1Stanley.Yang2023-06-091-13/+191
| | | | | | | | | Add ras info to EEPROM table, app can analyse device ECC status without GPU driver through EEPROM table ras info. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add RAS table v2.1 macro definitionStanley.Yang2023-06-091-0/+13
| | | | | | | | Add RAS EEPROM table version 2.1 macro definition. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Rename ras table versionStanley.Yang2023-06-091-3/+2
| | | | | | | | | Rename RAS_TABLE_VER to RAS_TABLE_VER_V1, move RAS_TABLE_VER_V1 from amdgpu_ras_eeprom.c to amdgpu_ras_eeprom.h. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: simplify amdgpu_ras_eeprom.cAlex Deucher2023-04-131-52/+20
| | | | | | | | | | | | | | All chips that support RAS also support IP discovery, so use the IP versions rather than a mix of IP versions and asic types. Checking the validity of the atom_ctx pointer is not required as the vbios is already fetched at this point. v2: add comments to id asic types based on feedback from Luben Reviewed-by: Luben Tuikov <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: Luben Tuikov <[email protected]>
* drm/amdgpu: Return from switch early for EEPROM I2C addressLuben Tuikov2023-03-311-5/+3
| | | | | | | | | | | | | | | | As soon as control->i2c_address is set, return; remove the "break;" from the switch--it is unnecessary. This mimics what happens when for some cases in the switch, we call helper functions with "return <helper function>". Remove final function "return true;" to indicate that the switch is final and terminal, and that there should be no code after the switch. Cc: Candice Li <[email protected]> Cc: Kent Russell <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Remove second moot switch to set EEPROM I2C addressLuben Tuikov2023-03-311-9/+0
| | | | | | | | | | | | | | | | Remove second switch since it already has its own function and case in the first switch. This also avoids requalifying the EEPROM I2C address for VEGA20, SIENNA CICHLID, and ALDEBARAN, as those have been set by the first switch and shouldn't match SMU v13.0.x. Cc: Candice Li <[email protected]> Cc: Kent Russell <[email protected]> Cc: Alex Deucher <[email protected]> Fixes: 158225294683 ("drm/amdgpu: Add EEPROM I2C address for smu v13_0_0") Fixes: c9bdc6c3cf39 ("drm/amdgpu: Add EEPROM I2C address support for ip discovery") Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_errTao Zhou2023-02-231-5/+14
| | | | | | | | | | | bad_page_threshold controls page retirement behavior and it should be also checked. v2: simplify the condition of bad page handling path. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: change default behavior of bad_page_threshold parameterTao Zhou2023-02-231-2/+2
| | | | | | | | | | | Ignore ras umc bad page threshold by default, GPU initialization won't be stopped in this mode. v2: refine the description of bad_page_threshold. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Stanley.Yang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add support for RAS table at 0x40000Luben Tuikov2022-11-171-1/+6
| | | | | | | | | | | | | Add support for RAS table at I2C EEPROM address of 0x40000, since on some ASICs it is not at 0, but at 0x40000. Cc: Alex Deucher <[email protected]> Cc: Kent Russell <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Tested-by: Kent Russell <[email protected]> Reviewed-by: Kent Russell <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Decouple RAS EEPROM addresses from chipsLuben Tuikov2022-11-091-12/+11
| | | | | | | | | | | | | | | | Abstract RAS I2C EEPROM addresses from chip names, and set their macro definition names to the address they set, not the chip they attach to. Since most chips either use I2C EEPROM address 0 or 40000h for the RAS table start offset, this leaves us with only two macro definitions as opposed to five, and removes the redundancy of four. Cc: Candice Li <[email protected]> Cc: Tao Zhou <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Kent Russell <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Remove redundant I2C EEPROM addressLuben Tuikov2022-11-091-3/+21
| | | | | | | | | | | | | | | | | | | Remove redundant EEPROM_I2C_MADDR_54H address, since we already have it represented (ARCTURUS), and since we don't include the I2C device type identifier in EEPROM memory addresses, i.e. that high up in the device abstraction--we only use EEPROM memory addresses, as memory is continuously represented by EEPROM device(s) on the I2C bus. Add a comment describing what these memory addresses are, how they come about and how they're usually extracted from the device address byte. Cc: Candice Li <[email protected]> Cc: Tao Zhou <[email protected]> Cc: Alex Deucher <[email protected]> Fixes: c9bdc6c3cf39df ("drm/amdgpu: Add EEPROM I2C address support for ip discovery") Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add EEPROM I2C address support for ip discoveryCandice Li2022-10-271-2/+18
| | | | | | | | | 1. Update EEPROM_I2C_MADDR_SMU_13_0_0 to EEPROM_I2C_MADDR_54H 2. Add EEPROM I2C address support for smu v13_0_0 and v13_0_10. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Update ras eeprom support for smu v13_0_0 and v13_0_10Candice Li2022-10-271-0/+10
| | | | | | | | Enable RAS EEPROM support for smu v13_0_0 and v13_0_10. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add EEPROM I2C address for smu v13_0_0Candice Li2022-09-131-0/+10
| | | | | | | | Set correct EEPROM I2C address for smu v13_0_0. Signed-off-by: Candice Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>