Modern supercomputers are composed of thousands of Graphics Processing Units (GPUs) that work in parallel. Titan, the world's second-fastest supercomputer for open science in 2015, consists of more than 18,000 GPUs used by scientists from domains such as astrophysics, fusion, climate, and combustion. Because of their large scale and long duration, these scientific applications may be interrupted by system failures or affected by Silent Data Corruptions (SDCs). Therefore, while the performance improvement achieved through the inherent parallelism of GPUs is necessary to expedite scientific discovery, it is equally critical that applications can cope with system failures during execution without losing all of their work.
In this talk we will show that the newest GPU cores are sensitive to radiation-induced errors, including those caused by the terrestrial neutron radiation environment. Experimental data obtained during three years of radiation experiments on current GPUs, together with an analysis of Titan field data, will be presented and discussed. A detailed analysis of the causes and effects of radiation-induced failures in supercomputers will be provided, using a wide set of parallel applications as case studies. Experimental data will be used to show the benefit of enabling ECC on the GPUs' main memory structures and to compare its efficiency with that of duplication and Algorithm-Based Fault Tolerance (ABFT). Finally, novel code optimizations that reduce the time-to-solution of specific parallel algorithms are continuously being implemented. As experimentally demonstrated, code optimizations increase code sensitivity but may reduce the execution time enough to increase overall system reliability.
Paolo Rech received his master's and Ph.D. degrees from Padova University, Padova, Italy, in 2006 and 2009, respectively. His studies included radiation tests and the effects of neutrons, protons, and alpha particles on programmable devices such as FPGAs and Systems-on-Chip. He was a postdoc at LIRMM, Montpellier, France, from 2010 to 2012, working on radiation effects on electronic devices at high altitudes. He is currently an associate professor at the Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil. He recently began collaborating with NVIDIA, AMD, and Los Alamos National Lab to evaluate and mitigate radiation-induced effects in devices designed for large-scale HPC centers.