CPU Estimate

When This Applies

Use this guide when your workload runs primarily on CPUs. Typical cases include data preprocessing, classical machine learning, simulations that do not use GPU kernels, and post processing of results. If you are training or fine-tuning deep learning models, see GPU Estimate. Most realistic workflows mix both.

High Level Summary of Steps

  • Run a toy example

  • Run a scaled up but still modest example

  • Determine scaling modifiers such as problem size and number of cores

Baseline Run

Example Python Code

The first step is to run a baseline job to estimate resource usage, typically with a small problem size that completes in a few minutes. For example, if your program is a Python script mult.py that multiplies two matrices:

1import numpy as np
2N = 2**13
3a = np.random.rand(N, N)
4b = np.random.rand(N, N)
5# multiply two matrices
6c = a @ b
7print("Multiplying matrices of size", N)

Extracting Information From Shell Command

You can use this shell command to measure wall time, CPU time, and peak RAM for your Python program:

/usr/bin/time -v python mult.py > program_out.log 2> log.out

The file program_out.log captures your program printed output, while log.out records resource usage metrics. The relevant parts of log.out while running the job on an Intel i7-1185G7 with the number of threads set to 1:

1Command being timed: "python mult.py"
2User time (seconds): 27.52
3System time (seconds): 1.16
4Percent of CPU this job got: 99%
5Elapsed (wall clock) time (h:mm:ss or m:ss): 0:28.79
6Maximum resident set size (kbytes): 1630324

and with the number of threads set to 4:

1Command being timed: "python mult.py"
2User time (seconds): 59.21
3System time (seconds): 1.39
4Percent of CPU this job got: 365%
5Elapsed (wall clock) time (h:mm:ss or m:ss): 0:16.60
6Maximum resident set size (kbytes): 1631708

Here, the output parameters correspond to:

  • Wall clock time is the elapsed real-world time for running your code.

  • User time is the time the CPU spent executing your calculation.

  • System time is the time the kernel spent on I/O and other system calls on behalf of your process.

  • Peak RAM is the maximum amount of RAM consumed by your program. This typically sets the lower bound on the available RAM for the node you request.

  • Percent of CPU is the ratio of total CPU time, meaning user plus system, to wall clock time, expressed as a percentage. A value above 100 percent indicates that multiple cores were active simultaneously. For example, 365 percent means roughly 3.65 cores were busy on average.

  • If outputs are saved in the output_folder, then the command du -sh /path-to-output/output_folder after running the program gives you the estimated storage requirements.

Array Size

Cores/Threads

Wall time(s)

Peak RAM (GB)

Percent of CPU (%)

8192

1

28.79

1.63

99

8192

4

16.60

1.63

365

8192

8

12.47

1.63

686

Note

Regardless of CPU utilization, your job will consume CPU core-hours based on the number of cores you requested. It is your responsibility to maximize the efficiency of the program to avoid wasting allocated resources.

Example Slurm Script

If you want to run the command on a compute node, sometimes a specific compute node, instead of the login node, you can either request an interactive session or submit a batch job. You might also need to choose your workgroup with workgroup -g test before requesting a compute node.

Interactive Session

salloc --job-name=interactive --partition=standard --nodes=1 --mem=16G --time=02:00:00

Batch Job Submission

 1#!/bin/bash
 2#SBATCH --job-name=mult_py
 3#SBATCH --output=slurm_%j.out
 4#SBATCH --error=slurm_%j.err
 5#SBATCH --partition=standard
 6#SBATCH --time=01:00:00
 7#SBATCH --ntasks=1
 8#SBATCH --cpus-per-task=1
 9#SBATCH --mem=16G
10
11# optional: load modules or activate environment
12# module load python
13# source ~/.aiml/bin/activate
14
15/usr/bin/time -v python mult.py > program_out.log 2> log.out

After saving the above file as submit.slurm, you can submit the batch job using the command sbatch submit.slurm.

Limiting Threads

To get a consistent baseline on a single core, restrict NumPy internal threading using:

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

Set the environment variables that apply to your NumPy BLAS installation. A standard NumPy installation comes with OpenBLAS, unless it is configured to use MKL. You can check the NumPy linkage by executing:

import numpy as np
np.show_config()

You can also verify the number of threads being used with the command htop in your shell. Once you have your baseline run without any errors, you can scale up by varying the associated environment variable, as in the Python matrix multiplication example above, or by varying the number of processes in an MPI run for distributed computation:

mpirun -np 4 my_mpi_program

Note

For the Python matrix multiplication example above, it might be faster to run on a single thread compared to multiple threads unless you have a large enough matrix.

Note

When you have both MPI and multithreading, it is a good idea to ensure that the number of MPI processes times the number of threads does not exceed the number of available or requested CPU cores.

Limitations of /usr/bin/time

The /usr/bin/time utility provides a useful snapshot but does not capture everything you may need for a complete resource estimate:

  • Disk usage growth as a function of time

  • RAM consumption change as a function of time

  • GPU usage, covered in GPU Estimate

  • Concurrent process and thread breakdown

For time series profiling of memory or disk, consider tools such as psrecord, memory_profiler, or custom logging within your code.

Worked Example A: CPU Data Prep

Pilot: 10k images in 12 minutes on 8 threads; peak RAM = 6 GB.

Full data: 10M images, giving a scale factor of 1,000.

Core-hours: 12 minutes × 1,000 = 12,000 minutes of wall time on 8 cores, which is 12,000 minutes ÷ 60 minutes per hour × 8 cores = 1,600 core-hours.

Running 10 jobs in parallel reduces calendar time but does not change the total: still 1,600 core-hours.

Add a 20 percent buffer, giving 1,920 core-hours.

Storage: inputs 4 TB plus outputs 6 TB plus logs and artifacts 1 TB equals 11 TB; retain for 2 months for a total of 22 TB-months.