Data Management and Object storages in HPC
Juha Lento
2025-05-28

Data management

Different storage systems serve different purposes and are meant for different kinds of data. No single storage system is at the same time fast, cheap, large, long-lived, and broadly accessible.

The old saying that supercomputers turn a compute-bound problem into an I/O-bound problem is more true now than ever. Still, data management in many projects looks like an afterthought.

Old classics and a new one

Types of storage and where it is accessible from

How is it accessed?

Issues

Solutions

In the long run, I’d say web interfaces are the way to go. We already have the allas.csc.fi and pouta.csc.fi interfaces to the object storage, and “Cloud Storage Configuration” in puhti.csc.fi, mahti.csc.fi, and lumi.csc.fi. Object storage systems are web services, after all.

Meanwhile, in HPC we often still need to work with command-line tools.

Data-mover

Data-mover is a development idea/project that tries to solve the acute problem of the Puhti scratch disk being far too full by providing an easy-to-use tool for transferring problematic datasets to Allas. It does this using restic and batch jobs. Unfortunately, while the tool solves some problems, it also creates new ones.

I’d say the correct solution was already on the first slide of this set: plan the data management when writing the application and the workflow, and do not create difficult datasets (too many small files, huge files, etc.) in the first place.
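
To make "difficult dataset" a bit more concrete, here is a minimal Python sketch that walks a directory tree and counts files that look too small or too huge. It is only an illustration, not part of Data-mover or any CSC tool. The function name `summarize_dataset`, the `DatasetSummary` type, and both size thresholds are assumptions picked for the example; sensible limits depend on the file system and the project.

```python
import os
from dataclasses import dataclass


@dataclass
class DatasetSummary:
    n_files: int      # number of files found under the root
    total_bytes: int  # sum of the file sizes in bytes
    n_small: int      # files smaller than small_limit
    n_huge: int       # files of size huge_limit or larger


def summarize_dataset(root, small_limit=1024**2, huge_limit=100 * 1024**3):
    """Walk `root` and count small and huge files.

    The default thresholds (1 MiB and 100 GiB) are illustrative
    assumptions, not recommendations from any storage service.
    """
    n_files = total_bytes = n_small = n_huge = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symbolic links, so a link is
                # counted with its own (small) size, not its target's.
                size = os.lstat(path).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            n_files += 1
            total_bytes += size
            if size < small_limit:
                n_small += 1
            if size >= huge_limit:
                n_huge += 1
    return DatasetSummary(n_files, total_bytes, n_small, n_huge)
```

Calling `summarize_dataset(path)` on a workflow's output directory during development, and looking at `n_small` and `n_huge` relative to `n_files`, gives an early hint of whether the output layout needs rethinking before the dataset grows. Note that walking a very large tree is itself a metadata-heavy operation on a parallel file system, so this is better suited to test runs than to a full scratch directory.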