Tomoe Note 006: Running multiple pipelines in parallel

Michael Richmond
Apr 27, 2017

Executive Summary

Running 2 pipelines simultaneously on "pmdata" takes about the same time (6 hours) as running a single pipeline. But running 3 pipelines simultaneously takes more time (about 8 hours), so we may be starting to reach the limits of disk I/O.


The machine "pmdata"

Thanks to Ohsawa-san's work, we can now use the nice computer for processing Tomoe data. Here are some specs for the computer:

The lscpu command provides the following information:


pmdata

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Model name:            Intel(R) Xeon(R) CPU E3-1230 v5 @ 3.40GHz
Stepping:              3
CPU MHz:               3748.898
BogoMIPS:              6816.00


There is 1 socket holding 4 physical cores, and each core runs 2 threads, so the OS sees 8 logical CPUs.
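The lscpu fields combine as sockets x cores per socket x threads per core. A minimal sketch of that arithmetic, using a few lines copied from the output above (the helper name parse_lscpu is only illustrative):

```python
# A few fields copied from the lscpu output above.
LSCPU_TEXT = """\
CPU(s):                8
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
"""

def parse_lscpu(text):
    """Return a dict mapping each lscpu field name to its integer value."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = int(value.strip())
    return fields

info = parse_lscpu(LSCPU_TEXT)
physical_cores = info["Socket(s)"] * info["Core(s) per socket"]
logical_cpus = physical_cores * info["Thread(s) per core"]

print("physical cores:", physical_cores)
print("logical CPUs:  ", logical_cpus)
```

The product should agree with the "CPU(s)" line that lscpu reports.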

We can use the free command to examine the memory:



              total        used        free      shared  buff/cache   available
Mem:       16215816      950492      334672       10272    14930652    14531088
Swap:       8191996      740272     7451724

So, that is about 16 GB of memory, plus about 8 GB of swap space.
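As a quick check of the units, here is a small sketch that converts the totals above. It assumes the free command was run with its default unit of kibibytes (1 KiB = 1024 bytes); the function names are only illustrative.

```python
# Values copied from the "total" column of the free output above,
# assumed to be in kibibytes (the default unit for free).
MEM_TOTAL_KIB = 16215816
SWAP_TOTAL_KIB = 8191996

def kib_to_gib(kib):
    """Convert kibibytes to gibibytes (2**30 bytes)."""
    return kib / 1024**2

def kib_to_gb(kib):
    """Convert kibibytes to decimal gigabytes (10**9 bytes)."""
    return kib * 1024 / 1e9

mem_gib = kib_to_gib(MEM_TOTAL_KIB)
mem_gb = kib_to_gb(MEM_TOTAL_KIB)
swap_gib = kib_to_gib(SWAP_TOTAL_KIB)

print(f"memory: {mem_gib:.2f} GiB = {mem_gb:.2f} GB")
print(f"swap:   {swap_gib:.2f} GiB")
```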


Running pipelines in parallel

The data from the night of 20160411 = Apr 11, 2016, consists of measurements from 8 CMOS chips. Each chip generated 126 "chunks"; a "chunk" is 360 FITS images compressed into a single big file, representing about 3 minutes of data from the sky. The raw data set is about 1.8 TB in size and covers a span of about 6.5 hours of observations.
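The chunk counts above imply the following totals. This is only a back-of-the-envelope sketch; it treats "about 3 minutes" as exactly 3 minutes per chunk.

```python
CHIPS = 8
CHUNKS_PER_CHIP = 126
IMAGES_PER_CHUNK = 360
MINUTES_PER_CHUNK = 3  # "about 3 minutes"; treated as exact here

images_per_chip = CHUNKS_PER_CHIP * IMAGES_PER_CHUNK
images_total = CHIPS * images_per_chip
hours_per_chip = CHUNKS_PER_CHIP * MINUTES_PER_CHUNK / 60
seconds_per_image = MINUTES_PER_CHUNK * 60 / IMAGES_PER_CHUNK

print("images per chip:      ", images_per_chip)
print("images, all chips:    ", images_total)
print("hours of data per chip:", hours_per_chip)
print("seconds per image:    ", seconds_per_image)
```

So each pipeline run has to handle roughly 45,000 images, spanning a bit more than 6 hours on the sky.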

The pipeline breaks up the "chunk" into individual FITS images, cleans them, finds and measures stars in each image, then calibrates the stars astrometrically and photometrically. The output is a set of calibrated star lists.

How long does it take the pipeline to run, and what happens if we try to run multiple pipelines in parallel?

The Linux time command reports three values for a process:

real
The "wall clock time" taken, from start to finish.

user
CPU time spent running the pipeline's own code -- doing the work requested by the user: re-formatting FITS files, subtracting dark frames, finding stars, etc.

sys
CPU time spent by the Linux kernel to carry out tasks required to support the user's process: in our case, mostly reading and writing data to and from disk files.
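The same three values can be collected from a script. The sketch below is one way to do it on Linux or another Unix system, using the standard resource module; the function name time_command is made up for this example, and the command being timed is a short stand-in, not the real pipeline.

```python
import resource
import subprocess
import sys
import time

def time_command(argv):
    """Run argv and return (real, user, sys) in seconds, like the time command."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime    # CPU time in user mode
    system = after.ru_stime - before.ru_stime  # CPU time in the kernel
    return real, user, system

# Stand-in workload: a short CPU-bound Python loop, not the real pipeline.
demo = [sys.executable, "-c", "sum(i * i for i in range(200000))"]
real, user, system = time_command(demo)
print(f"real {real:.2f} s   user {user:.2f} s   sys {system:.2f} s")
```

Note that "real" can be much larger than "user" plus "sys" when a process spends time waiting, for example for data to arrive from disk.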


Numbers from the tests

I have run the pipeline on the machine "pmdata" in several configurations: one pipeline at a time, 2 pipelines simultaneously, and 3 pipelines simultaneously. Let's look at the results.

One pipeline at a time

It took about 6:08 (6 hours, 8 minutes) to process the entire set for chip 0.

Two pipelines simultaneously

I tried small tests in which I explicitly forced one pipeline to run on one particular CPU and the other pipeline to run on a different particular CPU. The time it took to run tasks in parallel appeared to be the same as when I allowed the OS to pick the CPUs for each job. So, the numbers below represent the times when the OS is choosing the CPUs/cores for every process and sub-process.
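For reference, forcing a process onto a particular CPU can be sketched as follows. This is not necessarily how the small tests above were done; it is one possible approach on Linux, using os.sched_setaffinity from the Python standard library. The function name run_pinned and the stand-in child command are made up for this example.

```python
import os
import subprocess
import sys

def run_pinned(argv, cpu):
    """Start argv with its CPU affinity restricted to one logical CPU (Linux only)."""
    return subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        text=True,
        # Runs in the child just before exec; pid 0 means "this process".
        preexec_fn=lambda: os.sched_setaffinity(0, {cpu}),
    )

# Stand-in for a pipeline: a child that reports the CPUs it may run on.
report = [sys.executable, "-c", "import os; print(sorted(os.sched_getaffinity(0)))"]

allowed = sorted(os.sched_getaffinity(0))
# Use two different CPUs when available, as in the two-pipeline test.
cpus = allowed[:2] if len(allowed) >= 2 else allowed * 2
procs = [run_pinned(report, cpu) for cpu in cpus]
outputs = [p.communicate()[0].strip() for p in procs]
print(outputs)
```

Any sub-processes started by a pinned process inherit the same restriction. On a machine with 2 threads per core, two different logical CPUs may still share one physical core.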

Overnight on Apr 25/26



                         chip 1               chip 2

  real   (HH:MM)         06:57                 06:54

  user   (seconds)      16,493                15,747

  sys    (seconds)       2,756                 2,787

  number of stars        4.6 M                 4.5 M


Overnight on Apr 26/27. There may have been small changes in some of the pipeline code since the previous results shown above.



                         chip 6               chip 7

  real   (HH:MM)         05:27                 05:29

  user   (seconds)      10,333                10,414

  sys    (seconds)       3,063                 3,050

  number of stars        4.8 M                 4.4 M

Three pipelines simultaneously

In this case again, the OS is choosing which processors to use for each pipeline process and sub-process.

Late afternoon and evening, Apr 27



                         chip 3             chip 4            chip 5

  real   (HH:MM)         07:48               07:47             08:06

  user   (seconds)      10,892              10,969            11,571

  sys    (seconds)       3,575               3,571             3,606

  number of stars        4.6 M               4.3 M             5.1 M


Conclusion

Running 2 pipelines simultaneously appears to be nearly as fast as running a single pipeline. That's good.

But running 3 pipelines simultaneously is considerably slower: each pipeline takes roughly 1.3 times as long as a single pipeline (about 7:47 to 8:06, compared to 6:08). Of course, the machine does process three entire chips during that time, so the net result is still faster than running 3 pipelines sequentially.
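The trade-off can be put in numbers with a short sketch, using the "real" times from the tables above. It assumes that a batch of simultaneous pipelines is finished when its slowest member finishes; the variable names are only illustrative.

```python
def minutes(hhmm):
    """Convert an 'HH:MM' string to minutes."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)

# "real" times copied from the tables above.
runs = {
    "1 pipeline  (chip 0)":        ["06:08"],
    "2 pipelines (chips 1, 2)":    ["06:57", "06:54"],
    "2 pipelines (chips 6, 7)":    ["05:27", "05:29"],
    "3 pipelines (chips 3, 4, 5)": ["07:48", "07:47", "08:06"],
}

single = minutes("06:08")
results = {}
for label, times in runs.items():
    wall = max(minutes(t) for t in times)  # batch ends with its slowest pipeline
    results[label] = {
        "chips_per_hour": len(times) / (wall / 60),
        "slowdown": wall / single,         # relative to one pipeline alone
    }
    r = results[label]
    print(f"{label:30s} {r['chips_per_hour']:.2f} chips/hour, "
          f"{r['slowdown']:.2f} x single-pipeline time")

sequential_hours = 3 * single / 60  # three chips, one after another
print(f"3 chips sequentially: about {sequential_hours:.1f} hours")
```

By this measure, three simultaneous pipelines process about 0.37 chips per hour, compared to about 0.16 chips per hour for a single pipeline, even though each individual pipeline runs more slowly.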

When running 3 pipelines simultaneously, the "sys" portion of the execution time increases. This is expected if the disk I/O is becoming a bottleneck for the processing. The system may have to wait for one pipeline to read data from a disk file before it can read data for a second pipeline, for example.
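One rough way to look for this effect in the numbers above is to compare the CPU time (user + sys) to the wall-clock time for each run. The sketch below does that with the values from the tables. It assumes the "user" value for chip 5 is 11,571 seconds, and it is only an indicator: a low ratio means the pipeline spent much of its time waiting, but does not by itself prove that the waiting was for disk I/O.

```python
def seconds(hhmm):
    """Convert an 'HH:MM' string to seconds."""
    hours, mins = hhmm.split(":")
    return (int(hours) * 60 + int(mins)) * 60

# (real, user seconds, sys seconds) copied from the tables above.
# Chip 5's user time is assumed to be 11,571 s.
runs = {
    "chip 1": ("06:57", 16493, 2756),  # 2 pipelines, Apr 25/26
    "chip 2": ("06:54", 15747, 2787),
    "chip 6": ("05:27", 10333, 3063),  # 2 pipelines, Apr 26/27
    "chip 7": ("05:29", 10414, 3050),
    "chip 3": ("07:48", 10892, 3575),  # 3 pipelines, Apr 27
    "chip 4": ("07:47", 10969, 3571),
    "chip 5": ("08:06", 11571, 3606),
}

util = {}
for chip, (real, user, sys_time) in runs.items():
    # Fraction of the wall-clock time accounted for by CPU time.
    util[chip] = (user + sys_time) / seconds(real)
    print(f"{chip}: sys = {sys_time} s, (user + sys) / real = {util[chip]:.2f}")

two = [runs[c][2] for c in ("chip 1", "chip 2", "chip 6", "chip 7")]
three = [runs[c][2] for c in ("chip 3", "chip 4", "chip 5")]
mean_sys_two = sum(two) / len(two)
mean_sys_three = sum(three) / len(three)
print(f"mean sys: {mean_sys_two:.0f} s (2 pipelines), {mean_sys_three:.0f} s (3 pipelines)")
```

With three pipelines, the ratio drops to about 0.5, compared to roughly 0.7 to 0.8 with two pipelines, while the mean "sys" time rises from about 2,900 to about 3,600 seconds.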

I will run additional tests over the next few days, including running 4 pipelines simultaneously.