Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

dc.creator Morán, Marina
dc.creator Balladini, Javier
dc.creator Rexachs, Dolores
dc.creator Rucci, Enzo
dc.date 2023
dc.date.accessioned 2025-12-17T15:59:17Z
dc.date.available 2025-12-17T15:59:17Z
dc.identifier.uri https://rdi.uncoma.edu.ar/handle/uncomaid/19175
dc.description.abstract Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints prevent all processes from re-executing when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work is an extension of a previous one, in which we proposed a series of strategies to manage energy consumption at failure-time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that get blocked by communication indirectly with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios, it was possible to achieve significant ones; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure. es_ES
dc.format application/pdf es_ES
dc.language eng es_ES
dc.publisher arXiv es_ES
dc.relation.uri https://arxiv.org/abs/2311.06419 es_ES
dc.relation.uri https://doi.org/10.1016/j.jpdc.2023.104797 es_ES
dc.rights Attribution-NonCommercial-ShareAlike 4.0 es_ES
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/ es_ES
dc.source Journal of Parallel and Distributed Computing, October 2023 es_ES
dc.subject Energy saving es_ES
dc.subject Fault es_ES
dc.subject Tolerance es_ES
dc.subject Methods es_ES
dc.subject Checkpoint es_ES
dc.subject Parallel es_ES
dc.subject Applications es_ES
dc.subject ACPI es_ES
dc.subject DVFS es_ES
dc.subject.other Computer and Information Sciences es_ES
dc.title Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems es_ES
dc.type Article es
dc.type article eu
dc.type acceptedVersion eu
dc.description.fil Fil: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática; Argentina. es_ES
dc.description.fil Fil: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática; Argentina. es_ES
dc.description.fil Fil: Rexachs, Dolores. Universitat Autónoma de Barcelona. Departamento de Arquitectura de Computadores y Sistemas Operativos; España. es_ES
dc.description.fil Fil: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina. es_ES
dc.subject.cole Articles es_ES


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 4.0