Current Projects

Lossy Compression

Long-running HPC applications depend on checkpoint/restart to recover from failures and to span multiple time allocations. Memory bandwidth, and in particular file system bandwidth, continue to limit application performance and scalability. Compression techniques can reduce data size and so limit the impact of these bottlenecks.

Lossless compression fails to achieve high compression factors. Lossy compression achieves noticeably higher compression factors, at the expense of adding a small but controllable amount of error to the simulation.
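The trade-off above can be illustrated with a minimal sketch of error-bounded lossy compression. It is not the compressor used in this project: uniform quantization followed by zlib is an assumed stand-in, and the function names and the `error_bound` parameter are hypothetical. Each value is rounded to the nearest multiple of twice the error bound, so the reconstruction error stays within that bound.

```python
import math
import struct
import zlib


def lossy_compress(values, error_bound):
    """Quantize finite floats to multiples of 2*error_bound, then deflate."""
    step = 2.0 * error_bound
    # Rounding to the nearest multiple of `step` moves each value by at most step/2.
    codes = [round(v / step) for v in values]
    return zlib.compress(struct.pack(f"<{len(codes)}q", *codes))


def lossy_decompress(blob, error_bound):
    """Invert lossy_compress; each value is within error_bound of the original."""
    step = 2.0 * error_bound
    raw = zlib.decompress(blob)
    codes = struct.unpack(f"<{len(raw) // 8}q", raw)
    return [c * step for c in codes]


def lossless_compress(values):
    """Baseline: deflate the raw IEEE-754 doubles without any quantization."""
    return zlib.compress(struct.pack(f"<{len(values)}d", *values))


# Smooth synthetic data standing in for a checkpointed field.
data = [math.sin(0.01 * i) for i in range(1000)]
blob = lossy_compress(data, error_bound=1e-3)
restored = lossy_decompress(blob, error_bound=1e-3)
print("lossless bytes:", len(lossless_compress(data)))
print("lossy bytes:   ", len(blob))
print("max abs error: ", max(abs(a - b) for a, b in zip(data, restored)))
```

Loosening `error_bound` shrinks the output further, which is the controllable part of the trade-off.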

SDC Propagation

As simulation size and machine complexity grow on next-generation HPC systems, machine errors are expected to increase. In particular, silent data corruption (SDC), which occurs when cosmic radiation strikes hardware components and flips the state of a transistor, remains a key concern.

SDC in iterative methods can increase the number of iterations required to find a solution, cause convergence to an incorrect solution, or crash the application. Understanding how SDC propagates through iterative methods can lead to better mitigation systems that in turn increase effective system utilization.
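The first of these effects, extra iterations, can be reproduced with a small sketch. Everything here is an illustrative assumption rather than the setup studied in this project: the solver is plain Jacobi, the system is tridiag(-1, 4, -1) x = 1, and the corruption is modeled as a single additive perturbation of `x[0]` instead of a hardware bit-flip.

```python
def neighbor_sum(x, i):
    """Sum of the left and right neighbors of x[i], with zero boundary values."""
    left = x[i - 1] if i > 0 else 0.0
    right = x[i + 1] if i < len(x) - 1 else 0.0
    return left + right


def jacobi_iterations(n=16, tol=1e-10, max_iters=1000,
                      fault_iter=None, fault_size=1e6):
    """Solve tridiag(-1, 4, -1) x = 1 with Jacobi sweeps.

    If fault_iter is given, x[0] is corrupted once after that sweep.
    Returns (number of sweeps until the residual inf-norm < tol, solution).
    """
    x = [0.0] * n
    for k in range(1, max_iters + 1):
        # Jacobi update: x_i <- (b_i + x_{i-1} + x_{i+1}) / 4 with b_i = 1.
        x = [(1.0 + neighbor_sum(x, i)) / 4.0 for i in range(n)]
        if k == fault_iter:
            x[0] += fault_size  # hypothetical model of one silent corruption
        # Inf-norm of the residual b - A x.
        residual = max(abs(1.0 - (4.0 * x[i] - neighbor_sum(x, i)))
                       for i in range(n))
        if residual < tol:
            return k, x
    raise RuntimeError("Jacobi did not converge within max_iters")


clean_iters, _ = jacobi_iterations()
faulty_iters, _ = jacobi_iterations(fault_iter=10)
print("sweeps without fault:", clean_iters)
print("sweeps with one fault:", faulty_iters)
```

Because this matrix is strictly diagonally dominant, Jacobi recovers from the perturbation and only pays in extra sweeps; other solvers and fault locations can instead converge to a wrong answer or fail outright.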

Resilient AMG

Many scientific applications, from modeling blood flow to electromagnetics, depend upon sparse matrix structures, which are central to the underlying linear algebra computations. These types of computations comprise a sizable percentage of high performance computing (HPC) workloads. One crucial computation is solving a linear system. Much is known about the efficiency and scalability of linear solvers, but their behavior in the presence of faults is still unclear. As machines are built with more cores, the individual cores themselves are not becoming more reliable; therefore, as the number of cores in a system increases, the mean time between interrupts decreases. Because of this, fault tolerance and resilience are receiving increased attention. We need to elevate fault consideration to a first-class priority in order to utilize these resources efficiently. Resiliency techniques need to be developed and analyzed so that linear solvers remain efficient and scalable on emerging HPC architectures.

Single injection on a 2D Poisson problem. The figure shows the dramatic change in convergence that is possible when a single fault is injected. Here the fault occurs during the residual calculation before restriction on the finest level, where the first component of the residual vector is perturbed by a single bit-flip. The name of each trend gives the bit that was flipped.
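Why the flipped bit matters so much can be seen from a short sketch of a single bit-flip in an IEEE-754 double. The helper name `flip_bit` and the bit numbering (0 for the least significant mantissa bit, 52 to 62 for the exponent, 63 for the sign) are assumptions made for this example and may differ from the labels used in the figure.

```python
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its 64-bit IEEE-754 encoding flipped."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as an unsigned 64-bit integer, XOR, convert back.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


print(flip_bit(1.0, 0))   # 1.0000000000000002 (lowest mantissa bit)
print(flip_bit(1.0, 52))  # 0.5 (lowest exponent bit)
print(flip_bit(1.0, 62))  # inf (highest exponent bit)
print(flip_bit(1.0, 63))  # -1.0 (sign bit)
```

A flip in the low mantissa bits is far below solver tolerance, while a flip in the high exponent bits changes the value by many orders of magnitude.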

The increase in the iterations required to achieve convergence motivates the need for low-cost checks for silent data corruption in situations like the one shown here. These detectors can be algorithm-based or algorithm-agnostic. Algorithm-based detectors involve a trade-off: they provide the best chance of catching SDC, but at the expense of portability.
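As a minimal sketch of what a low-cost, algorithm-based check might look like, the function below flags an iteration whose residual norm is non-finite or grows far more than a convergent solver should allow. The function name, the `growth_limit` parameter, and its default value are hypothetical choices for illustration, not the detectors developed in this project.

```python
import math


def looks_corrupted(prev_residual_norm, residual_norm, growth_limit=10.0):
    """Return True if the new residual norm is suspicious for a convergent solver.

    The check costs one comparison per iteration because the solver already
    computes the residual norm for its own stopping test.
    """
    if not math.isfinite(residual_norm):
        return True  # NaN or Inf never appears in a healthy run
    return residual_norm > growth_limit * prev_residual_norm


print(looks_corrupted(1e-3, 5e-4))  # False: residual is shrinking
print(looks_corrupted(1e-3, 4e+2))  # True: sudden jump in the residual
```

The check relies on knowing how the residual of a specific solver should behave, which is what makes it effective and also what ties it to that solver.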

Fault Injection

To facilitate my research projects related to silent data corruption in HPC applications, I've created a fault injection framework, FlipIt, designed as an LLVM compiler pass. This compiler pass surrounds application code with code to facilitate fault injection. The possible injection sites are enumerated at compile time, but fault activation is done purely at runtime based on a provided fault distribution.
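The split between compile-time site enumeration and runtime activation can be modeled with a short sketch. This is not FlipIt's API or implementation: the `FaultInjector` class, its `visit` method, and the per-site probability table are hypothetical names invented for this example, and the sketch works on Python values rather than LLVM instructions.

```python
import random


class FaultInjector:
    """Toy model: fixed set of injection sites, activation decided at runtime."""

    def __init__(self, site_probabilities, seed=0):
        # Maps a site id (assumed to be enumerated ahead of time) to the
        # probability that a fault is activated each time the site executes.
        self.site_probabilities = dict(site_probabilities)
        self.rng = random.Random(seed)
        self.injected = []  # site ids where a fault was actually activated

    def visit(self, site_id, value, corrupt):
        """Pass `value` through the site, corrupting it with the site's probability."""
        probability = self.site_probabilities.get(site_id, 0.0)
        if self.rng.random() < probability:
            self.injected.append(site_id)
            return corrupt(value)
        return value


injector = FaultInjector({"residual_update": 0.01, "restriction": 0.0}, seed=42)
total = 0.0
for _ in range(1000):
    # Every execution of the instrumented operation goes through its site.
    total += injector.visit("residual_update", 1.0, corrupt=lambda v: -v)
print("faults activated:", len(injector.injected))
```

Changing the probability table or the seed changes which executions are corrupted without touching the instrumented code, which mirrors the runtime-only activation described above.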


[1]   Laguna, I., Schulz, M., Richards, D.F., Calhoun, J. & Olson, L. IPAS: Intelligent Protection Against Silent Output Corruption in Scientific Applications. In Proceedings of the 2016 International Symposium on Code Generation and Optimization, pages 227-238, ACM, 2016.
[2]   Calhoun, J., Olson, L., Snir, M. & Gropp, W.D. Towards a More Fault Resilient Multigrid Solver. In Proceedings of the Symposium on High Performance Computing, pages 1-8, Society for Computer Simulation International, 2015.
[3]   Beckwith, K., Veitzer, S., McCormick, S., Ruge, J., Olson, L. & Calhoun, J. Fully-implicit ultrascale physics solvers and application to ion source modelling. In Plasma Sciences (ICOPS) held with 2014 IEEE International Conference on High-Power Particle Beams (BEAMS), 2014 IEEE 41st International Conference on, pages 1-8, 2014.
[4]   Calhoun, J., Olson, L. & Snir, M. FlipIt: An LLVM Based Fault Injector for HPC. In Euro-Par 2014: Parallel Processing Workshops, 8805:547-558, Springer International Publishing, 2014.
[5]   Ahn, J. & Calhoun, J. Dynamic contact of viscoelastic bodies with two obstacles: mathematical and numerical approaches. Electronic Journal of Differential Equations, 2013(85):1-23, 2013.
[6]   Calhoun, J. & Jiang, H. Preemption of a CUDA Kernel Function. In Proceedings of the 2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, pages 247-252, IEEE Computer Society, 2012.
[7]   Calhoun, J., Graham, J., Zhou, H. & Jiang, H. Acceleration of Generalized Minimum Aberration Designs of Hadamard Matrices on Graphics Processing Units. In Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, pages 1294-1300, IEEE Computer Society, 2012.
[8]   Calhoun, J., Graham, J. & Jiang, H. On Using a Graphics Processing Unit to Solve the Closest Substring Problem. In Proceedings of 2011 International Conference on Parallel and Distributed Processing Techniques and Applications, pages ???-???, 2011.