“I have a simple philosophy. Fill what’s empty. Empty what’s full. And scratch where it itches.” – Alice Roosevelt Longworth
When orchestrating HPC operations, scratch space often emerges as a limiting factor: too small, too big, or too slow.
Many types of jobs (seismic, life sciences, financial services, and so on) don't do all their work purely in memory. Each of many workers loads an initial configuration and reads and writes many intermediate results to scratch storage, which is often shared, before the total job completes. As the bottleneck shifts, huge I/O problems can easily replace huge computational problems, and scratch management becomes a common issue.
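To make that pattern concrete, here is a minimal Python sketch of such a worker: it loads a small initial configuration, then alternates compute steps with checkpoint writes to a scratch directory. The SCRATCH_DIR and WORKER_ID environment variables, the file names, and the compute step are all illustrative assumptions rather than any particular application's interface.

```python
import json
import os
import pickle

# Illustrative assumptions: a wrapper script or scheduler exports SCRATCH_DIR
# (often a shared location) and a per-worker ID; both names are hypothetical.
SCRATCH_DIR = os.environ.get("SCRATCH_DIR", "/tmp")
WORKER_ID = int(os.environ.get("WORKER_ID", "0"))


def load_initial_config(path):
    """Read the small starting configuration every worker begins with."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"grid": [128, 128, 128]}  # stand-in default for the sketch


def compute_step(state, step):
    """Stand-in for the real numerical work; returns the updated state."""
    state["step"] = step
    return state


def checkpoint(state, step):
    """Write an intermediate result to scratch between compute steps."""
    path = os.path.join(SCRATCH_DIR, f"worker{WORKER_ID}_step{step}.pkl")
    with open(path, "wb") as f:
        pickle.dump(state, f)


def main():
    state = load_initial_config("config.json")
    for step in range(100):
        state = compute_step(state, step)
        checkpoint(state, step)  # many small I/O operations per worker


if __name__ == "__main__":
    main()
```

Multiply this by thousands of workers, each checkpointing frequently, and the scratch tier, not the arithmetic, sets the pace of the job.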
Traditional HPC systems often have local storage on each node, typically solid state drives (SSDs), which are fast but expensive. If it's known in advance that a specific amount of scratch is constantly in use, this approach can be cost effective. That is rarely true, however, and populating every node in a large system with the maximum possible scratch is cost prohibitive. There's also no simple way to share node-local storage beyond the node it sits on.
Another option is storing scratch data on an external high-performance file system such as Lustre®. The storage itself is very cost effective, yet it has significant drawbacks for scratch use cases that involve many small, transactional accesses. Parallel file systems deliver extremely high bandwidth but also high latency, and external storage connections are often far narrower than a system's internal connectivity. There's often little choice, though, especially if the scratch files need to be shared between nodes, which is common in large HPC jobs.
A great example of this challenge can be found in depth migration of seismic data. Seismic data is recorded as time series at surface locations; to be useful, it must be relocated ("migrated") to its subsurface source. Building the earth models required for this is a long and iterative process.
Many methods for depth migration require the calculation of large tables that ideally are shared between nodes computing cells in associated regions. The obvious problem is that node-local storage tends to be limited in both size and shareability; sharing these tables through it often becomes an improvised hack that the rest of the job is forced to depend on. One can avoid this with external storage, but having a huge pool of nodes reach for the same tables at the same time creates a bottleneck for this job and for all other users of that storage.
There is a solution to this issue: make scratch space a global, internal supercomputer resource. Users can then take advantage of the machine's huge internal bandwidth instead of expensively overprovisioning local storage.
Cray has developed the DataWarp™ I/O accelerator to provide exactly this intermediate tier: high-bandwidth, low-latency, file-based storage connected directly to our supercomputers' internal high-speed Aries network. The same internal network that carries low-latency, high-performance messaging during computation also serves as the storage bus.
Unlike existing improvised solutions, the DataWarp accelerator is flexibly reconfigurable on a per-job basis. Each job can have its own allocation and usage pattern: a single node can get private scratch, the whole job can share global scratch, or any desired mix of the two. All DataWarp resources are potentially available to all nodes, with no capacity wasted as unreachable storage.
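On SLURM-managed Cray systems, a job typically requests its DataWarp allocation with a #DW jobdw directive in the batch script (for example, type=scratch with access_mode=striped for a job-wide shared space or private for per-node space), and the resulting mount point is exported to the job in environment variables such as DW_JOB_STRIPED. The exact directive syntax and variable names vary by site and software release, so treat the following Python sketch, which assumes those conventional names, as illustrative only.

```python
import os


def pick_scratch_dir():
    """Choose a scratch directory for this job.

    Assumes the burst-buffer plugin exports the job's DataWarp allocation as
    DW_JOB_STRIPED (shared across nodes) or DW_JOB_PRIVATE (per node); these
    names follow common SLURM/DataWarp convention and may differ by site.
    """
    for var in ("DW_JOB_STRIPED", "DW_JOB_PRIVATE"):
        path = os.environ.get(var)
        if path and os.path.isdir(path):
            return path
    # Fall back to parallel-file-system or node-local scratch when the job
    # did not request a DataWarp allocation.
    return os.environ.get("SCRATCH", "/tmp")


if __name__ == "__main__":
    print("Using scratch directory:", pick_scratch_dir())
```

The point is less the few lines of code than the operational model: the same application can run with per-node scratch, job-wide scratch, or no DataWarp allocation at all simply by changing what the job requests, rather than by re-plumbing its I/O paths.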
From personal experience, I’ve had to troubleshoot enough scratch allocation and misuse issues that this solution really resonates with me as a different, better way to look at the problem. Sure, one can just throw SSDs at it, but elegance has more rewards than just the existential.