aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h
Commit message (Collapse)AuthorAgeFilesLines
* drm/amdgpu: refine eeprom data checkganglxie2025-07-151-0/+2
| | | | | | | | | | add eeprom data checksum check before driver unload. reset eeprom and save correct data to eeprom when check failed Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: refine ras error injection when eeprom initialization failedganglxie2025-06-301-0/+2
| | | | | | | | | when eeprom initialization failed, we still support ras error injection, and reserve bad pages, but do not save bad pages to eeprom Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: format old RAS eeprom data into V3 versionTao Zhou2025-03-181-0/+1
| | | | | | | | | | Clear old data and save it in V3 format. v2: only format eeprom data for new ASICs. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Change page/record number calculation based on npsganglxie2025-02-251-14/+6
| | | | | | | | | save only one record to save eeprom space,and bad_page_num = pa_rec_num + mca_rec_num*16 Signed-off-by: ganglxie <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: correct the calculation of RAS bad pageTao Zhou2024-12-101-0/+5
| | | | | | | | | | After the introduction of NPS RAS, one bad page record on eeprom may be related to 1 or 16 bad pages, so the bad page record and bad page are two different concepts, define a new variable to store bad page number. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: split ras_eeprom_init into init and check functionsTao Zhou2024-12-101-0/+2
| | | | | | | | | | Init function is for ras table header read and check function is responsible for the validation of the header. Call them in different stages. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add flag to indicate the type of RAS eeprom recordTao Zhou2024-12-101-0/+14
| | | | | | | | | | | | One UMC MCA address could map to multiply physical address (PA): AMDGPU_RAS_EEPROM_REC_PA: one record store one PA AMDGPU_RAS_EEPROM_REC_MCA: one record store one MCA address, PA is not cared about Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: add RAS is_rma flagTao Zhou2024-06-051-2/+1
| | | | | | | | Set the flag to true if bad page number reaches threshold. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Set EEPROM ras infoStanley.Yang2023-06-091-0/+5
| | | | | | | | | Set EEPROM ras info: rma status, health percent and bad page threshold. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add support EEPROM table v2.1Stanley.Yang2023-06-091-1/+11
| | | | | | | | | Add ras info to EEPROM table, app can analyse device ECC status without GPU driver through EEPROM table ras info. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add RAS table v2.1 macro definitionStanley.Yang2023-06-091-0/+1
| | | | | | | | Add RAS EEPROM table version 2.1 macro definition. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Rename ras table versionStanley.Yang2023-06-091-0/+2
| | | | | | | | | Rename RAS_TABLE_VER to RAS_TABLE_VER_V1, move RAS_TABLE_VER_V1 from amdgpu_ras_eeprom.c to amdgpu_ras_eeprom.h. Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: message smu to update bad channel infoStanley.Yang2022-03-151-0/+4
| | | | | | | | | It should notice SMU to update bad channel info when detected uncorrectable error in UMC block Signed-off-by: Stanley.Yang <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Drop inline from amdgpu_ras_eeprom_max_record_countMichel Dänzer2021-09-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | This was unusual; normally, inline functions are declared static as well, and defined in a header file if used by multiple compilation units. The latter would be more involved in this case, so just drop the inline declaration for now. Fixes compile failure building for ppc64le on RHEL 8: In file included from ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h:32, from ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:33: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c: In function ‘amdgpu_ras_recovery_init’: ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h:90:17: error: inlining failed in call to ‘always_inline’ ‘amdgpu_ras_eeprom_max_record_count’: function body not available 90 | inline uint32_t amdgpu_ras_eeprom_max_record_count(void); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1985:34: note: called from here 1985 | max_eeprom_records_len = amdgpu_ras_eeprom_max_record_count(); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: c84d46707ebb "drm/amdgpu: validate bad page threshold in ras(v3)" Reviewed-by: Lyude Paul <[email protected]> Signed-off-by: Michel Dänzer <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: RAS EEPROM table is now in debugfsLuben Tuikov2021-07-011-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add "ras_eeprom_size" file in debugfs, which reports the maximum size allocated to the RAS table in EEROM, as the number of bytes and the number of records it could store. For instance, $cat /sys/kernel/debug/dri/0/ras/ras_eeprom_size 262144 bytes or 10921 records $_ Add "ras_eeprom_table" file in debugfs, which dumps the RAS table stored EEPROM, in a formatted way. For instance, $cat ras_eeprom_table Signature Version FirstOffs Size Checksum 0x414D4452 0x00010000 0x00000014 0x000000EC 0x000000DA Index Offset ErrType Bank/CU TimeStamp Offs/Addr MemChl MCUMCID RetiredPage 0 0x00014 ue 0x00 0x00000000607608DC 0x000000000000 0x00 0x00 0x000000000000 1 0x0002C ue 0x00 0x00000000607608DC 0x000000001000 0x00 0x00 0x000000000001 2 0x00044 ue 0x00 0x00000000607608DC 0x000000002000 0x00 0x00 0x000000000002 3 0x0005C ue 0x00 0x00000000607608DC 0x000000003000 0x00 0x00 0x000000000003 4 0x00074 ue 0x00 0x00000000607608DC 0x000000004000 0x00 0x00 0x000000000004 5 0x0008C ue 0x00 0x00000000607608DC 0x000000005000 0x00 0x00 0x000000000005 6 0x000A4 ue 0x00 0x00000000607608DC 0x000000006000 0x00 0x00 0x000000000006 7 0x000BC ue 0x00 0x00000000607608DC 0x000000007000 0x00 0x00 0x000000000007 8 0x000D4 ue 0x00 0x00000000607608DD 0x000000008000 0x00 0x00 0x000000000008 $_ Cc: Alexander Deucher <[email protected]> Cc: Andrey Grodzovsky <[email protected]> Cc: John Clements <[email protected]> Cc: Hawking Zhang <[email protected]> Cc: Xinhui Pan <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Acked-by: Alexander Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Optimize EEPROM RAS table I/OLuben Tuikov2021-07-011-7/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split functionality between read and write, which simplifies the code and exposes areas of optimization and more or less complexity, and take advantage of that. Read and write the table in one go; use a separate stage to decode or encode the data, as opposed to on the fly, which keeps the I2C bus busy. Use a single read/write to read/write the table or at most two if the number of records we're reading/writing wraps around. Check the check-sum of a table in EEPROM on init. Update the checksum at the same time as when updating the table header signature, when the threshold was increased on boot. Take advantage of arithmetic modulo 256, that is, use a byte!, to greatly simplify checksum arithmetic. Cc: Alexander Deucher <[email protected]> Cc: Andrey Grodzovsky <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Acked-by: Alexander Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Get rid of test functionLuben Tuikov2021-07-011-2/+0
| | | | | | | | | | | The code is now tested from userspace. Remove already macroed out test function. Cc: Alexander Deucher <[email protected]> Cc: Andrey Grodzovsky <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Alexander Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Some renamesLuben Tuikov2021-07-011-6/+19
| | | | | | | | | | | Qualify with "ras_". Use kernel's own--don't redefine your own. Cc: Alexander Deucher <[email protected]> Cc: Andrey Grodzovsky <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Reviewed-by: Alexander Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Use explicit cardinality for clarityLuben Tuikov2021-07-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RAS_MAX_RECORD_NUM may mean the maximum record number, as in the maximum house number on your street, or it may mean the maximum number of records, as in the count of records, which is also a number. To make this distinction whether the number is ordinal (index) or cardinal (count), rename this macro to RAS_MAX_RECORD_COUNT. This makes it easy to understand what it refers to, especially when we compute quantities such as, how many records do we have left in the table, especially when there are so many other numbers, quantities and numerical macros around. Also rename the long, amdgpu_ras_eeprom_get_record_max_length() to the more succinct and clear, amdgpu_ras_eeprom_max_record_count(). When computing the threshold, which also deals with counts, i.e. "how many", use cardinal "max_eeprom_records_count", than the quantitative "max_eeprom_records_len". Simplify the logic here and there, as well. Cc: Guchun Chen <[email protected]> Cc: John Clements <[email protected]> Cc: Hawking Zhang <[email protected]> Cc: Alexander Deucher <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Acked-by: Alexander Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Simplify RAS EEPROM checksum calculationsLuben Tuikov2021-07-011-1/+1
| | | | | | | | | | | | | | | | | | | | | Rename update_table_header() to write_table_header() as this function is actually writing it to EEPROM. Use kernel types; use u8 to carry around the checksum, in order to take advantage of arithmetic modulo 8-bits (256). Tidy up to 80 columns. When updating the checksum, just recalculate the whole thing. Cc: Alexander Deucher <[email protected]> Cc: Andrey Grodzovsky <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Acked-by: Alexander Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: RAS xfer to read/writeLuben Tuikov2021-07-011-3/+5
| | | | | | | | | | | | | | | | | Wrap amdgpu_ras_eeprom_xfer(..., bool write), into amdgpu_ras_eeprom_read() and amdgpu_ras_eeprom_write(), as that makes reading and understanding the code clearer. Cc: Jean Delvare <[email protected]> Cc: Alexander Deucher <[email protected]> Cc: Andrey Grodzovsky <[email protected]> Cc: Lijo Lazar <[email protected]> Cc: Stanley Yang <[email protected]> Cc: Hawking Zhang <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Acked-by: Alexander Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Rename misspelled functionLuben Tuikov2021-07-011-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of fixing the spelling in amdgpu_ras_eeprom_process_recods(), rename it to, amdgpu_ras_eeprom_xfer(), to look similar to other I2C and protocol transfer (read/write) functions. Also to keep the column span to within reason by using a shorter name. Change the "num" function parameter from "int" to "const u32" since it is the number of items (records) to xfer, i.e. their count, which cannot be a negative number. Also swap the order of parameters, keeping the pointer to records and their number next to each other, while the direction now becomes the last parameter. Cc: Jean Delvare <[email protected]> Cc: Alexander Deucher <[email protected]> Cc: Andrey Grodzovsky <[email protected]> Cc: Lijo Lazar <[email protected]> Cc: Stanley Yang <[email protected]> Cc: Hawking Zhang <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Acked-by: Alexander Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: RAS and FRU now use 19-bit I2C addressLuben Tuikov2021-07-011-1/+1
| | | | | | | | | | | | | | | | | Convert RAS and FRU code to use the 19-bit I2C memory address and remove all "slave_addr", as this is now absolved into the 19-bit address. Cc: Jean Delvare <[email protected]> Cc: John Clements <[email protected]> Cc: Alexander Deucher <[email protected]> Cc: Andrey Grodzovsky <[email protected]> Cc: Lijo Lazar <[email protected]> Cc: Stanley Yang <[email protected]> Cc: Hawking Zhang <[email protected]> Signed-off-by: Luben Tuikov <[email protected]> Acked-by: Alexander Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: remove unnecessary reading for epprom headerDennis Li2021-02-261-3/+1
| | | | | | | | | | | | If the number of badpage records exceed the threshold, driver has updated both epprom header and control->tbl_hdr.header before gpu reset, therefore GPU recovery thread no need to read epprom header directly. v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold Signed-off-by: Dennis Li <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: break GPU recovery once it's in bad state(v4)Guchun Chen2020-08-041-0/+4
| | | | | | | | | | | | | | | | | When GPU executes recovery and retriving bad GPU tag from external eerpom device, the recovery will be broken and error message is printed as well for user's awareness. v2: Refine warning message in threshold reaching case, and fix spelling typo. v3: Fix explicit calling of bad gpu. v4: Rename function names. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: break driver init process when it's bad GPU(v5)Guchun Chen2020-08-041-1/+2
| | | | | | | | | | | | | | | | | | | | When retrieving bad gpu tag from eeprom, GPU init should fail as the GPU needs to be retired for further check. v2: Fix spelling typo, correct the condition to detect bad gpu tag and refine error message. v3: Refine function argument name. v4: Fix missing check of returning value of i2c initialization error case. v5: Use dev_err to print PCI information in dmesg instead of DRM_ERROR. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: validate bad page threshold in ras(v3)Guchun Chen2020-08-041-0/+2
| | | | | | | | | | | | | | | | | Bad page threshold value should be valid in the range between -1 and max records length of eeprom. It could determine when saved bad pages exceed threshold value, and proceed corresponding actions. v2: When using the default typical value, it should be min value between typical value and eeprom max records length. v3: drop the case of setting bad_page_cnt_threshold to be 0xFFFFFFFF, as it confuses user. Signed-off-by: Guchun Chen <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: move i2c bus lock out of ras structureAlex Deucher2020-07-211-1/+0
| | | | | | | | | It's not really ras related. It's just a lock for the bus in general. This removes the ras dependency from the smu i2c bus. Reviewed-by: Andrey Grodzovsky <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Move EEPROM I2C adapter to amdgpu_deviceAndrey Grodzovsky2020-03-161-2/+0
| | | | | | | | | | | | | | | Puts the i2c adapter in common place for sharing by RAS and upcoming data read from FRU EEPROM feature. v2: Move i2c adapter to amdgpu_pm and rename it. v3: Move i2c adapter init to ASIC specific code and get rid of the switch case in amdgpu_device Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Resolved offchip EEPROM I/O issueJohn Clements2019-11-261-0/+1
| | | | | | | | Updated target I2C address Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: John Clements <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add amdgpu_ras_eeprom_reset_tableAndrey Grodzovsky2019-09-161-0/+1
| | | | | | | | | This will allow to reset the table on the fly. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Tao Zhou <[email protected]> Reviewed-by: Guchun Chen <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
* drm/amdgpu: Add RAS EEPROM table.Andrey Grodzovsky2019-08-271-0/+90
Add RAS EEPROM table manager to eanble RAS errors to be stored upon appearance and retrived on driver load. v2: Fix some prints. v3: Fix checksum calculation. Make table record and header structs packed to do correct byte value sum. Fix record crossing EEPROM page boundry. v4: Fix byte sum val calculation for record - look at sizeof(record). Fix some style comments. v5: Add description to EEPROM_TABLE_RECORD_SIZE and syntax fixes. Signed-off-by: Andrey Grodzovsky <[email protected]> Reviewed-by: Luben Tuikov <[email protected]> Signed-off-by: Alex Deucher <[email protected]>