Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems


dc.creator Morán, Marina
dc.creator Balladini, Javier
dc.creator Rexachs, Dolores
dc.creator Rucci, Enzo
dc.date 2023
dc.date.accessioned 2025-12-17T15:59:17Z
dc.date.available 2025-12-17T15:59:17Z
dc.identifier.uri https://rdi.uncoma.edu.ar/handle/uncomaid/19175
dc.description.abstract Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints avoid re-executing every process when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work is an extension of a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that become blocked through indirect communication with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios it was possible to achieve significant ones; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure. es_ES
dc.format application/pdf es_ES
dc.language eng es_ES
dc.publisher arXiv es_ES
dc.relation.uri https://arxiv.org/abs/2311.06419 es_ES
dc.relation.uri https://doi.org/10.1016/j.jpdc.2023.104797 es_ES
dc.rights Attribution-NonCommercial-ShareAlike 4.0 es_ES
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/ es_ES
dc.source Journal of Parallel and Distributed Computing, October 2023 es_ES
dc.subject Energy saving es_ES
dc.subject Fault es_ES
dc.subject Tolerance es_ES
dc.subject Methods es_ES
dc.subject Checkpoint es_ES
dc.subject Parallel es_ES
dc.subject Applications es_ES
dc.subject ACPI es_ES
dc.subject DVFS es_ES
dc.subject.other Computer and Information Sciences es_ES
dc.title Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems es_ES
dc.type Articulo es
dc.type article eu
dc.type acceptedVersion eu
dc.description.fil Fil: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática; Argentina. es_ES
dc.description.fil Fil: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática; Argentina. es_ES
dc.description.fil Fil: Rexachs, Dolores. Universitat Autònoma de Barcelona. Departamento de Arquitectura de Computadores y Sistemas Operativos; Spain. es_ES
dc.description.fil Fil: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina. es_ES
dc.subject.cole Artículos es_ES



Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 4.0.
