Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo

dc.creator	Morán, Marina
dc.creator	Balladini, Javier
dc.creator	Rexachs, Dolores
dc.creator	Rucci, Enzo
dc.date	2023
dc.date.accessioned	2025-12-17T15:59:17Z
dc.date.available	2025-12-17T15:59:17Z
dc.identifier.uri	https://rdi.uncoma.edu.ar/handle/uncomaid/19175
dc.description.abstract	Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerance method, the opportunities to reduce energy consumption should be explored. In particular, with rollback-recovery methods that use uncoordinated checkpoints, not all processes have to re-execute when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work extends a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. Here, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that get blocked by communication that depends indirectly on the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios it was possible to achieve significant ones; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.	es_ES
dc.format	application/pdf	es_ES
dc.language	eng	es_ES
dc.publisher	arXiv	es_ES
dc.relation.uri	https://arxiv.org/abs/2311.06419	es_ES
dc.relation.uri	https://doi.org/10.1016/j.jpdc.2023.104797	es_ES
dc.rights	Attribution-NonCommercial-ShareAlike 4.0	es_ES
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/4.0/	es_ES
dc.source	Journal of Parallel and Distributed Computing, October 2023	es_ES
dc.subject	Energy saving	es_ES
dc.subject	Fault	es_ES
dc.subject	Tolerance	es_ES
dc.subject	Methods	es_ES
dc.subject	Checkpoint	es_ES
dc.subject	Parallel	es_ES
dc.subject	Applications	es_ES
dc.subject	ACPI	es_ES
dc.subject	DVFS	es_ES
dc.subject.other	Computer and Information Sciences	es_ES
dc.title	Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems	es_ES
dc.type	Articulo	es
dc.type	article	eu
dc.type	acceptedVersion	eu
dc.description.fil	Fil: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática; Argentina.	es_ES
dc.description.fil	Fil: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática; Argentina.	es_ES
dc.description.fil	Fil: Rexachs, Dolores. Universitat Autónoma de Barcelona. Departamento de Arquitectura de Computadores y Sistemas Operativos; Spain.	es_ES
dc.description.fil	Fil: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina.	es_ES
dc.subject.cole	Articles	es_ES