“I have a simple philosophy. Fill what’s empty. Empty what’s full. And scratch where it itches.” – Alice Roosevelt Longworth
When orchestrating HPC operations, scratch space often emerges as a limiting factor: too small, too big, or too slow.
Many types of jobs (seismic, life sciences, financial services, and so on) don't do all their work purely in memory. Each of many workers loads an initial configuration and reads and writes many intermediate results to scratch storage, which is often shared, before the total job completes. As the bottleneck shifts, huge I/O problems can easily replace huge computational problems, and scratch management becomes a common issue.
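To make that pattern concrete, here is a minimal Python sketch of such a worker: it loads a small initial configuration, then alternates compute steps with checkpoint writes to a scratch directory. The SCRATCH_DIR and WORKER_ID environment variables, the file names, and the compute step are all illustrative assumptions rather than any particular application's interface.

```python
import json
import os
import pickle

# Illustrative assumptions: a wrapper script or scheduler exports SCRATCH_DIR
# (often a shared location) and a per-worker ID; both names are hypothetical.
SCRATCH_DIR = os.environ.get("SCRATCH_DIR", "/tmp")
WORKER_ID = int(os.environ.get("WORKER_ID", "0"))


def load_initial_config(path):
    """Read the small starting configuration every worker begins with."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"grid": [128, 128, 128]}  # stand-in default for the sketch


def compute_step(state, step):
    """Stand-in for the real numerical work; returns the updated state."""
    state["step"] = step
    return state


def checkpoint(state, step):
    """Write an intermediate result to scratch between compute steps."""
    path = os.path.join(SCRATCH_DIR, f"worker{WORKER_ID}_step{step}.pkl")
    with open(path, "wb") as f:
        pickle.dump(state, f)


def main():
    state = load_initial_config("config.json")
    for step in range(100):
        state = compute_step(state, step)
        checkpoint(state, step)  # many small I/O operations per worker


if __name__ == "__main__":
    main()
```

Multiply this by thousands of workers, each checkpointing frequently, and the scratch tier, not the arithmetic, sets the pace of the job.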
Traditional HPC systems often have local storage on each node, typically solid state drives (SSDs), which are fast but expensive. If it's known in advance that a specific amount of scratch is constantly in use, this approach can be cost effective. That is rarely true, however, and populating every node in a large system with the maximum possible scratch is cost prohibitive. There's also no simple way to share node-local storage beyond the node it sits on.
Another option is storing scratch data on an external high-performance file system such as Lustre®. The storage itself is very cost effective, yet it has significant drawbacks for scratch use cases that involve many small, transactional accesses. Parallel file systems deliver extremely high bandwidth but also high latency, and external storage connections are often far narrower than a system's internal connectivity. There's often little choice, though, especially if the scratch files need to be shared between nodes, which is common in large HPC jobs.
A great example of this challenge can be found in depth migration of seismic data. Seismic data is recorded as time series at surface locations; to be useful, it must be relocated ("migrated") to its subsurface source. Building the earth models required for this is a long and iterative process.
Many methods for depth migration require the calculation of large tables that ideally are shared between nodes computing cells in associated regions. The obvious problem is that node-local storage tends to be limited in both size and shareability; sharing these tables through it often becomes an improvised hack that the rest of the job is forced to depend on. One can avoid this with external storage, but having a huge pool of nodes reach for the same tables at the same time creates a bottleneck for this job and for all other users of that storage.
There is a solution to this issue: make scratch space a global, internal supercomputer resource. Users can then take advantage of the machine's huge internal bandwidth instead of expensively overprovisioning local storage.
Cray has developed the DataWarp™ I/O accelerator to provide exactly this intermediate tier: high-bandwidth, low-latency, file-based storage connected directly to our supercomputers' internal high-speed Aries network. The same internal network that carries low-latency, high-performance messaging during computation also serves as the storage bus.
Unlike existing improvised solutions, the DataWarp accelerator is flexibly reconfigurable on a per-job basis. Each job can have its own allocation and usage pattern: a single node can get private scratch, the whole job can share global scratch, or any desired mix of the two. All DataWarp resources are potentially available to all nodes, with no capacity wasted as unreachable storage.
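On SLURM-managed Cray systems, a job typically requests its DataWarp allocation with a #DW jobdw directive in the batch script (for example, type=scratch with access_mode=striped for a job-wide shared space or private for per-node space), and the resulting mount point is exported to the job in environment variables such as DW_JOB_STRIPED. The exact directive syntax and variable names vary by site and software release, so treat the following Python sketch, which assumes those conventional names, as illustrative only.

```python
import os


def pick_scratch_dir():
    """Choose a scratch directory for this job.

    Assumes the burst-buffer plugin exports the job's DataWarp allocation as
    DW_JOB_STRIPED (shared across nodes) or DW_JOB_PRIVATE (per node); these
    names follow common SLURM/DataWarp convention and may differ by site.
    """
    for var in ("DW_JOB_STRIPED", "DW_JOB_PRIVATE"):
        path = os.environ.get(var)
        if path and os.path.isdir(path):
            return path
    # Fall back to parallel-file-system or node-local scratch when the job
    # did not request a DataWarp allocation.
    return os.environ.get("SCRATCH", "/tmp")


if __name__ == "__main__":
    print("Using scratch directory:", pick_scratch_dir())
```

The point is less the few lines of code than the operational model: the same application can run with per-node scratch, job-wide scratch, or no DataWarp allocation at all simply by changing what the job requests, rather than by re-plumbing its I/O paths.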
From personal experience, I’ve had to troubleshoot enough scratch allocation and misuse issues that this solution really resonates with me as a different, better way to look at the problem. Sure, one can just throw SSDs at it, but elegance has more rewards than just the existential.