A Proposal of an ECC-based Adaptive Fault-Tolerant Mechanism for 16-bit data words

Keywords:

Adaptability, Error Correction Codes, Fault Tolerance, Multiple Bit Errors, Single Bit Errors, Reliability

Abstract

With the integration scale reached in CMOS technology, memory systems provide a large storage capacity, but at the price of an increase in their fault rate. As a result, the probability of experiencing Single Cell Upsets or Multiple Cell Upsets has risen.

Error Correction Codes (ECC) are widely used to protect memory systems. However, including an ECC in a computer system adds extra bits to each memory word, which are used to detect and/or correct errors. In addition, encoding and decoding circuitry must be added, introducing overheads in area, delay, and power consumption.
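As a rough illustration of the redundancy cost mentioned above, the following sketch computes how many check bits a classic Hamming single error correction (SEC) code needs for a given data width, using the standard bound 2^r >= k + r + 1. It is not taken from the article; the function name `hamming_check_bits` is hypothetical, and the codes evaluated in the paper may use a different number of redundant bits.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Return the smallest r such that 2**r >= data_bits + r + 1.

    This is the classic Hamming bound for correcting a single bit error
    in a word of `data_bits` data bits plus r check bits.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


if __name__ == "__main__":
    for k in (8, 16, 32, 64):
        r = hamming_check_bits(k)
        # One extra overall parity bit upgrades SEC to SEC-DED.
        print(f"{k:2d} data bits: {r} check bits (SEC), "
              f"{r + 1} (SEC-DED), SEC overhead {100 * r / k:.1f}%")
```

For 16-bit data words this gives 5 check bits for SEC, so shorter words pay a proportionally higher redundancy cost than 32-bit or 64-bit words.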

Usually, once an ECC-based fault tolerance mechanism has been designed, its fault tolerance properties cannot be modified. However, in some applications, memory systems can experience a variable fault rate during operation. It is therefore desirable for the mechanism to be able to adapt to these changing fault conditions.

This work proposes an Adaptive Fault-Tolerant mechanism based on ECC. The mechanism can adapt to different fault conditions, correcting and/or detecting single and multiple bit errors. The proposed mechanism uses a single encoder; that is, the data need not be re-encoded to change the error coverage.
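The idea of one encoder with selectable error coverage can be sketched with a textbook code. The example below is not the code proposed in the article: it assumes a standard extended Hamming (22,16) code, and the names `encode`, `decode`, and the mode strings `"SEC-DED"` and `"TED"` are hypothetical. The same stored word is decoded either with single error correction plus double error detection, or in a detect-only mode that flags up to three bit errors without attempting correction.

```python
# Assumption: textbook extended Hamming (22,16) code, used only to illustrate
# "one encoder, several decoding modes". Not the authors' code.
PARITY_POS = (1, 2, 4, 8, 16)                      # Hamming check bit positions
DATA_POS = [p for p in range(1, 22) if p not in PARITY_POS]  # 16 data positions


def encode(data: int) -> list[int]:
    """Encode a 16-bit value into 22 bits (index 0 is the overall parity)."""
    word = [0] * 22
    for i, pos in enumerate(DATA_POS):
        word[pos] = (data >> i) & 1
    for p in PARITY_POS:
        # Check bit p covers every position whose index has bit p set.
        word[p] = sum(word[i] for i in range(1, 22) if i & p) % 2
    word[0] = sum(word) % 2                        # make total parity even
    return word


def decode(word: list[int], mode: str) -> tuple[int, str]:
    """Decode in 'SEC-DED' or detect-only 'TED' mode; returns (data, status)."""
    word = list(word)                              # do not mutate the caller's word
    syndrome = 0
    for i in range(1, 22):
        if word[i]:
            syndrome ^= i
    odd = sum(word) % 2
    if syndrome == 0 and not odd:
        status = "ok"
    elif mode == "SEC-DED" and odd and syndrome <= 21:
        word[syndrome] ^= 1                        # syndrome 0 is the parity bit itself
        status = "corrected"
    else:
        status = "detected"                        # error flagged, no correction
    data = sum(word[pos] << i for i, pos in enumerate(DATA_POS))
    return data, status


if __name__ == "__main__":
    stored = encode(0xBEEF)
    single = list(stored); single[7] ^= 1
    triple = list(stored)
    for pos in (3, 5, 6):
        triple[pos] ^= 1
    print(decode(single, "SEC-DED"))               # single error is corrected
    print(decode(triple, "SEC-DED"))               # triple error is miscorrected
    print(decode(triple, "TED"))                   # triple error is detected
```

In this sketch, switching between modes changes only the decoder's interpretation of the syndrome, so words already stored in memory remain valid. The trade-off is visible in the triple-error case: the correcting mode silently returns wrong data, while the detect-only mode reports the error.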

Author Biographies

Joaquín Gracia-Morán, Instituto ITACA, Universitat Politècnica de València

Joaquín Gracia-Morán received the B.Sc., M.Sc., and Ph.D. degrees in computer engineering from the Universitat Politècnica de València (UPV), Spain, in 1995, 1997, and 2004, respectively, where he is currently an Associate Professor with the Department of Computer Engineering (DISCA). His research interests include design and implementation of digital systems, design and validation of fault-tolerant systems, and VHDL-based fault injection. He is a member of the Fault-Tolerant Systems (STF) research line within the Institute ITACA at the UPV.

Luis-J. Saiz-Adalid, Instituto ITACA, Universitat Politècnica de València

Luis-J. Saiz-Adalid received the M.Sc. and Ph.D. degrees in computer engineering from the Universitat Politècnica de València (UPV), Spain, in 1995 and 2015, respectively. After 15 years in the industry (IBM, 1995-Celestica, 1998), he is currently an Associate Professor with the DISCA, UPV. His research interests include design and implementation of digital systems, design and validation of fault-tolerant systems, and design of error control codes. He is a member of the Fault-Tolerant Systems (STF) research line within the Institute ITACA at the UPV.

J.-Carlos Baraza-Calvo, Instituto ITACA, Universitat Politècnica de València

J.-Carlos Baraza-Calvo received the B.Sc. and Ph.D. degrees in computer engineering from the Universitat Politècnica de València (UPV), Spain, in 1993 and 2003, respectively. He is currently an Associate Professor with the DISCA. His research interests include design and implementation of digital systems, design and validation of fault-tolerant systems, and VHDL-based fault injection. He is a member of the Fault-Tolerant Systems (STF) research line within the Institute ITACA at the UPV.

Daniel Gil-Tomas, Instituto ITACA, Universitat Politècnica de València

Daniel Gil-Tomás received the B.Sc. degree in electrical and electronic physics from the Universitat de València, Spain, in 1985, and the Ph.D. degree in computer engineering from the Universitat Politècnica de València (UPV), Spain, in 1999, where he is currently an Associate Professor with the DISCA. His research interests include design and validation of fault-tolerant systems, reliability physics, and reliability of emerging nanotechnologies. He is a member of the Fault-Tolerant Systems (STF) research line within the Institute ITACA at the UPV.

Pedro-J. Gil-Vicente, Instituto ITACA, Universitat Politècnica de València

Pedro-J. Gil-Vicente (M'93) received the B.Sc. degree in electronic engineering from the Universitat Politècnica de Catalunya (UPC), Tarragona, Spain, in 1979, and the M.Sc. degree in industrial engineering and the Ph.D. degree in computer engineering from the Universitat Politècnica de València (UPV), Spain, in 1985 and 1992, respectively, where he is currently a Professor with the DISCA. He has been the Head of the DISCA, and he is also the Head of the Fault-Tolerant Systems (STF) research line within the Institute ITACA. His research interests include the design and validation of real-time fault-tolerant distributed systems, dependability validation using fault injection, the design and verification of embedded systems, and dependability and security benchmarking. He has authored more than 150 research articles on these subjects.

Prof. Gil-Vicente has also served as a Program Committee member for the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), the European Dependable Computing Conference (EDCC), and the Latin American Symposium on Dependable Computing (LADC).

Published

2024-04-13

How to Cite

Gracia-Morán, J., Saiz-Adalid, L.-J., Baraza-Calvo, J.-C., Gil-Tomas, D., & Gil-Vicente, P.-J. (2024). A Proposal of an ECC-based Adaptive Fault-Tolerant Mechanism for 16-bit data words. IEEE Latin America Transactions, 22(5), 418–427. Retrieved from https://latamt.ieeer9.org/index.php/transactions/article/view/8334

Section

Electronics
