
It is suggested in this wiki post that the job manager may not get usage data fast enough to track a spike in memory usage for the sacct tool to give you a specific answer:

SLURM's accounting mechanism is polling based and doesn't always catch spikes in memory usage. FSL's implementation uses a Linux kernel feature called "cgroups" to control memory and CPU usage. SLURM sets up a cgroup for the job with the appropriate limits, which the Linux kernel strictly enforces. The problem is simple: the kernel killed a process from the offending job, and the SLURM accounting mechanism didn't poll at the right time to see the spike in usage that caused the kernel to kill the process.

That your sacct call shows 1.6 GB of usage just before the 3 GB job is cancelled might be suggestive of how your process is using memory. A data structure used by your process may require resizing as it grows. Depending on the implementation, a C++ std::vector, for instance, may allocate a new buffer that is twice (or some other multiple of) the current size once enough elements have been added, and then copy the data over from the old buffer. While that data is being reallocated, your process may temporarily ask for a chunk of memory larger than what Slurm has made available to the job. Speaking in general terms, without knowing any specifics about what you're running, the temporary creation of a data structure twice the 1.6 GB size, on top of the space already allocated, would seem to be enough to trigger the job cancellation in your example.
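As a minimal sketch of that effect (not your actual workload, and the growth factor is implementation-defined, commonly 1.5x to 2x), the following program just appends to a std::vector and reports each capacity jump; during each reallocation the old and new buffers coexist, so the transient peak is roughly their sum:

```cpp
// Sketch: observe std::vector capacity growth. During a reallocation the old
// buffer and the new, larger buffer exist at the same time, so peak usage
// briefly approaches old_size + new_size -- which can push a job past its
// cgroup memory limit even though steady-state usage looks well under it.
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> data;
    std::size_t last_capacity = 0;

    for (std::size_t i = 0; i < 20'000'000; ++i) {
        data.push_back(static_cast<double>(i));
        if (data.capacity() != last_capacity) {
            double old_mb = last_capacity * sizeof(double) / 1e6;
            double new_mb = data.capacity() * sizeof(double) / 1e6;
            std::cout << "realloc: " << old_mb << " MB -> " << new_mb
                      << " MB (transient peak ~" << old_mb + new_mb << " MB)\n";
            last_capacity = data.capacity();
        }
    }
    return 0;
}
```

If this is indeed what is happening, reserving the final size up front (data.reserve(n)) or requesting enough memory from Slurm to cover the reallocation peak, not just the steady-state size, would avoid the kill.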
