Energy Accounting on Derecho
As part of CISL's recent submission to the Green500 list, we have implemented a PBS hook (plugin) that allows PBS to record cumulative energy consumption data for each job individually.
This works by sampling, at both the start and end of the job, the cumulative energy each node has consumed since it last booted. The difference between these two readings is stored in PBS after the job ends. As a result, energy data displayed while a job is running is not meaningful; it becomes correct only after the job finishes.
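The accounting above can be sketched with hypothetical counter values. Both readings below are invented for illustration; on Derecho the sampling is performed by the PBS hook, not by the user:

```shell
# Each node exposes a cumulative energy counter (joules since last boot).
# The hook samples it at job start and job end; the job's energy is the
# difference between the two readings. Both values here are hypothetical.
start_reading=1500000   # counter at job start (J)
end_reading=1502527     # counter at job end (J)
echo "Job energy: $(( end_reading - start_reading )) J"
```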
To query this energy data for a recently finished job (currently within the last 72 hours), qstat -f -x can be used, for example:
matthews@derecho1:~> qstat -f -x 4238938 | grep 'x-ncar.*-energy'
resources_used.x-ncar-cpu-energy = 758
resources_used.x-ncar-energy = 2527
resources_used.x-ncar-gpu0-energy = 0
resources_used.x-ncar-gpu1-energy = 0
resources_used.x-ncar-gpu2-energy = 0
resources_used.x-ncar-gpu3-energy = 0
resources_used.x-ncar-memory-energy = 1131
Each of these fields is reported in joules (watt-seconds) and covers the period from immediately before the job script starts to immediately after it ends. Total energy is provided as "x-ncar-energy", and the energy the system attributes to the CPU, RAM, and individual GPUs is called out separately. Note that these major components do not account for all of the energy consumed by the node, so the total energy will generally be higher than the sum of the components. A small error may also be introduced by the delay between when each of these measurements is sampled. If a job spans multiple nodes, each field represents the total energy consumed by that device type across all nodes. For example, a three-node GPU job would report the sum of all GPU-ID 0 devices in the "x-ncar-gpu0-energy" field.
If the job ran on a node without GPUs (as above), the GPU energy will be listed as 0 joules. If the job ran on a shared node (the develop queue), it is not possible to completely isolate the energy consumption of your job from other jobs running on that node, so we recommend rerunning your job on an exclusive-use node before trusting this data.
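As a rough sketch of how this data can be used, the snippet below parses the total energy and walltime out of saved qstat -f -x output and estimates the job's average power draw (watts = joules / seconds). The abbreviated sample text is modeled on the example output above; the walltime value and the exact parsing patterns are assumptions for illustration:

```shell
# Parse total energy and walltime from abbreviated, partly hypothetical
# qstat -f -x output, then estimate average power: watts = joules / seconds.
qstat_out='resources_used.walltime = 00:05:00
resources_used.x-ncar-energy = 2527'

energy=$(echo "$qstat_out" | awk -F' = ' '/x-ncar-energy/ {print $2}')
wall=$(echo "$qstat_out" | awk -F' = ' '/walltime/ {split($2, t, ":"); print t[1]*3600 + t[2]*60 + t[3]}')
avg=$(awk -v e="$energy" -v s="$wall" 'BEGIN {printf "%.1f", e/s}')
echo "Average power: ${avg} W"
```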
For completed jobs which began after approximately April 2, 2024, qhist can instead be used. For example, to see all power metrics for a single job:
vanderwb@derecho2:~> qhist -j 4246835 -l
4246835.desched1
...
Node Eng (J) = 5787
CPU Eng (J) = 691
RAM Eng (J) = 1006
GPU0 Eng (J) = 626
GPU1 Eng (J) = 610
GPU2 Eng (J) = 623
GPU3 Eng (J) = 612
...
Using typical qhist flags, you can also show power metrics for all of your jobs for a particular time period. Here, we display jobs from April 20, 2024, sorted by node energy usage:
vanderwb@derecho2:~> qhist -p 20240420 -f '{energy-node}' -s energy-node | head -n 6
Job ID E-Node(J)
4190307 6005563096
4198428 5612065907
4194550 3745933887
4199951 3336310504
4197478 1259626184
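For intuition about the scale of these numbers, node energy can be converted to kilowatt-hours (1 kWh = 3.6 million joules). Using the top job from the listing above:

```shell
# Convert node energy from joules to kilowatt-hours: kWh = J / 3.6e6.
joules=6005563096   # E-Node(J) for job 4190307 in the listing above
awk -v j="$joules" 'BEGIN {printf "%.1f kWh\n", j / 3.6e6}'   # prints "1668.2 kWh"
```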
It should also be noted that this data includes only the energy consumed by the compute nodes. Energy consumed by the interconnect, control system, cooling, and storage is not included, as it would be very difficult to apportion to individual jobs. In CISL's experiments for the Green500, the GPU partition was observed to consume approximately 15% more energy, measured at the circuit breaker, than this method accounts for. Energy used for storage and for generating facility chilled water is additional to that 15%, which does, however, include the portion of the interconnect local to the racks in question and some of the control system. Nevertheless, this data should be accurate enough for comparing workloads on a science-throughput-per-joule basis.
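As a back-of-the-envelope correction, the approximately 15% figure above can be applied to a GPU job's reported energy to estimate consumption measured at the circuit breaker. This is only an approximation and still excludes storage and facility chilled water:

```shell
# Scale a GPU job's reported energy by the ~15% overhead CISL observed
# at the circuit breaker for the GPU partition (approximate estimate).
reported=2527   # resources_used.x-ncar-energy (J) from the example above
breaker=$(awk -v r="$reported" 'BEGIN {printf "%.0f", r * 1.15}')
echo "Estimated at-breaker energy: ${breaker} J"
```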
Due to the heterogeneous and shared nature of Casper, this feature is currently only available on Derecho; however, if you would find it useful on Casper, please let us know.