I’m putting this stuff up here in case anybody finds it useful. There are still a few bugs, which I fix as I find them; however, it has been stable for some time now.
It is nice to monitor how much a cluster is being used. We can only really afford a small system (cash supply problem), so we have to make the best of what we have.
I have found that the best way to capture usage stats is directly from pbsnodes -av.
The images above don’t have the full month’s data: the head node was reimaged and my crontab wasn’t restored. The red line is the number of CPUs in the “online” state. The blue line is the aggregate number of CPUs requested through PBS, except that a node is counted as 100% utilised if more than 90% of its memory has been requested. We realise that you can still squeeze jobs onto a node with less than 10% of its memory free (on a 128GB node, that is still 12GB). Once all of our nodes have more than 48GB, I will raise this threshold to 95% or even 98%.
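For illustration, the utilisation rule above can be sketched in a few lines of shell and awk. This is NOT the code inside JHPCChart.jar; the field names follow PBS Pro’s pbsnodes -av output (resources_available.*, resources_assigned.*) and may differ on other PBS versions.

```shell
#!/bin/sh
# node_util reads a pbsnodes -av dump on stdin and prints one
# "node used/available cpus" line per node, applying the 90% memory
# rule described above. A hedged sketch, not the real implementation.
node_util() {
    awk '
        function flush() {
            if (node != "" && avail_cpu > 0) {
                # If more than 90% of memory is requested, count the
                # node as fully utilised
                if (avail_mem > 0 && used_mem / avail_mem > 0.9)
                    used_cpu = avail_cpu
                printf "%s %d/%d cpus\n", node, used_cpu, avail_cpu
            }
            node = ""; avail_cpu = used_cpu = avail_mem = used_mem = 0
        }
        # An unindented line starts a new node record
        /^[^ \t]/                   { flush(); node = $1 }
        /resources_available.ncpus/ { avail_cpu = $NF }
        /resources_assigned.ncpus/  { used_cpu  = $NF }
        /resources_available.mem/   { avail_mem = $NF + 0 }  # strips "kb"
        /resources_assigned.mem/    { used_mem  = $NF + 0 }
        END                         { flush() }
    '
}
# e.g. /opt/pbs/default/bin/pbsnodes -av | node_util
```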
My little utility gathers data using this script:
#!/bin/sh
# Database directory in which to store the files
DATABASE="/home/hpcdata/PBS/DataWarehouse"
# Grab the current time in unix format
time=`date +%s`
# Define the 'scratch' directory
TEMPDIR="/dev/shm"
# Define the 'scratch' file
TEMPFILE="pbs.$time"
# Current archive file
FILE=`date +"pbs_%Y_%m_%d.tar"`

# Dump the current node data to the temporary space
/opt/pbs/default/bin/pbsnodes -av > "$TEMPDIR/$TEMPFILE"
cd "$TEMPDIR"

# Check whether today's archive exists; if not, create it
if [ ! -e "$DATABASE/$FILE" ] ; then
    # Start the new tar file with the current data (tar -cf creates it)
    tar -cf "$DATABASE/$FILE" "$TEMPFILE"
    rm -f "$TEMPDIR/$TEMPFILE"
    # Zip up yesterday's tar file
    YESTERDAYFILE=`date --date="1 days ago" +"pbs_%Y_%m_%d.tar"`
    cd "$DATABASE"
    gzip "$DATABASE/$YESTERDAYFILE"
else
    # Still the same day, so append the new data to the tar file
    tar -rf "$DATABASE/$FILE" "$TEMPFILE"
    rm -f "$TEMPDIR/$TEMPFILE"
fi
For larger clusters, you can imagine that the amount of data collected would be quite large, since the script above is called every minute. This is why the script appends the data to a tar file, and at midnight the tar file is gzipped. The program below reads the appropriate gzip and parses it into a smaller file (if one is not already present) that contains only the information required to draw the graphs. These can be deleted should they grow too big, but their size is pretty small (by design). There is only one file per day, so even a year of files should be fairly small.
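If you ever want to inspect a day’s archive by hand, the layout is simple: each member of the tar is one pbs.&lt;epoch&gt; snapshot, one per minute. A hedged example (the date in the comment is only illustrative):

```shell
#!/bin/sh
# List the snapshots inside a day's archive. Archive names follow the
# pbs_YYYY_MM_DD.tar(.gz) pattern the capture script uses.
list_snapshots() {
    tar -tzf "$1"
}
# e.g. list_snapshots /home/hpcdata/PBS/DataWarehouse/pbs_2012_02_07.tar.gz | wc -l
# To dump one snapshot to stdout: tar -xzf <archive> -O <member> | less
```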
I also have this set up as a crontab:

crontab -l
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/crontab.XXXX4rWlMo installed on Tue Jan 17 08:34:26 2012)
# (Cron version V5.0 -- $Id: crontab.c,v 1.12 2004/01/23 18:56:42 vixie Exp $)
# Gather the stats
* * * * * sh /home/hpcdata/PBS/CapturePBSData.sh >> /dev/null 2>&1
# Generate the graphs every 15 minutes
*/15 * * * * sh /home/hpcdata/public_html/HPCGraphs/GenerateHour.sh >> /dev/null 2>&1
*/15 * * * * sh /home/hpcdata/public_html/HPCGraphs/GenerateDay.sh >> /dev/null 2>&1
# Generate the graphs daily
0 0 * * * sh /home/hpcdata/public_html/HPCGraphs/GenerateMonth.sh >> /dev/null 2>&1
0 0 * * * sh /home/hpcdata/public_html/HPCGraphs/GenerateYear.sh >> /dev/null 2>&1
You can see that I run the data capture every minute and generate the graphs every fifteen minutes or so; these graphs go onto a webpage for users. The graphing utility (below) will only parse a day’s tar.gz data if it has not already been processed, which cuts down on time and space requirements.
hpcdata@host:~/public_html/HPCGraphs> cat GenerateDay.sh
cd /home/hpcdata/public_html/HPCGraphs
java -jar /home/hpcdata/PBS/JHPCChart.jar Day
chmod a+r *_Day.png
When those “GenerateTIME.sh” scripts are run, there is another configuration file called HPCChart.conf that contains the following information:
# Directory of database
DATABASE /home/hpcdata/PBS/DataWarehouse
# Directory to store charts
IMAGESTORE /home/hpcdata/public_html/HPCGraphs
IMAGEWIDTH 640
IMAGEHEIGHT 360
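The jar does its own parsing of this file in Java, but as an illustration of the format, these whitespace-separated KEY VALUE settings are trivial to read from the shell as well (conf_get here is a hypothetical helper, not part of the tool):

```shell
#!/bin/sh
# Print the value for a given key from a KEY VALUE config file,
# ignoring # comment lines. A sketch for illustration only.
conf_get() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}
# e.g. WIDTH=$(conf_get /home/hpcdata/public_html/HPCGraphs/HPCChart.conf IMAGEWIDTH)
```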
The directory structure looks like this:
hpcdata@host:~/PBS> pwd
/home/hpcdata/PBS
hpcdata@host:~/PBS> ll
total 63
-rwx------ 1 hpcdata users  1050 2012-01-16 08:39 CapturePBSData.sh
-rw-r----- 1 hpcdata users   493 2012-01-17 08:35 CrontabEntry.txt
drwx------ 3 hpcdata users  8192 2012-02-08 00:00 DataWarehouse
-rw------- 1 hpcdata users 34143 2012-01-16 08:45 JHPCChart.jar
drwx------ 2 hpcdata users   251 2012-01-16 07:04 lib
And the public bit:
hpcdata@host:~/public_html/HPCGraphs> pwd
/home/hpcdata/public_html/HPCGraphs
hpcdata@host:~/public_html/HPCGraphs> ll
total 228
-rw-r--r-- 1 hpcdata users 10825 2012-02-08 10:00 Aggregate_Day.png
-rw-r--r-- 1 hpcdata users 11754 2012-02-08 10:00 Aggregate_Hour.png
-rw-r--r-- 1 hpcdata users 19467 2012-02-06 12:07 Aggregate_January_2012.png
-rw-r--r-- 1 hpcdata users 11331 2012-02-08 00:00 Aggregate_Month.png
-rw-r--r-- 1 hpcdata users 11685 2012-02-08 00:01 Aggregate_Year.png
-rw------- 1 hpcdata www     106 2012-01-16 09:53 GenerateDay.sh
-rw------- 1 hpcdata www     107 2012-01-16 09:53 GenerateHour.sh
-rw------- 1 hpcdata www     110 2012-01-16 09:53 GenerateMonth.sh
-rw------- 1 hpcdata www     108 2012-01-16 09:54 GenerateYear.sh
-rw------- 1 hpcdata www     172 2012-01-31 11:11 HPCChart.conf
-rw-r--r-- 1 hpcdata users 20509 2012-02-08 10:00 Utilisation_Day.png
-rw-r--r-- 1 hpcdata users 16985 2012-02-08 10:00 Utilisation_Hour.png
-rw-r--r-- 1 hpcdata users 44110 2012-02-06 12:07 Utilisation_January_2012.png
-rw-r--r-- 1 hpcdata users 22667 2012-02-08 00:00 Utilisation_Month.png
-rw-r--r-- 1 hpcdata users 28273 2012-02-08 00:01 Utilisation_Year.png
Downloads: Please note, my little server only has a 1Mbps upload, so it can take a little while 🙁
The jar file can be found here: JHPCChart.tar.gz
The code to generate the graphs can be found: Temporarily removed
Update: Here is an updated version of JHPCChart.jar
Reason: I used a version of jtar that would go into an infinite loop when given a corrupted tar file. I submitted a bug report, later fixed the bug myself, and am waiting for the upstream update to land. A corrupted tar file can occur when the cluster admin or head node goes down while the tar file is being written.
There are several ways to run the above program:
java -jar /home/hpcdata/PBS/JHPCChart.jar Hour
java -jar /home/hpcdata/PBS/JHPCChart.jar Day
java -jar /home/hpcdata/PBS/JHPCChart.jar Week
java -jar /home/hpcdata/PBS/JHPCChart.jar Month
java -jar /home/hpcdata/PBS/JHPCChart.jar Year
java -jar /home/hpcdata/PBS/JHPCChart.jar 20120101 20120131 "Utilisation for January, 2012"
That last one is in the format of:
JHPCChart.jar StartDate FinishDate “Title”
Where StartDate and FinishDate are of the format YYYYMMDD.
Sorry to those of you who use different date formats, I’m Australian 🙁
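If you want to automate the custom-range mode, the StartDate/FinishDate/Title arguments for “last month” can be generated from the shell. A hedged sketch (last_month_range is a hypothetical helper of mine, and GNU date extensions are assumed):

```shell
#!/bin/sh
# Print last month's StartDate, FinishDate and Title on one line,
# in the YYYYMMDD format the jar expects.
last_month_range() {
    # Anchor on the first of the current month so month arithmetic is safe
    first=$(date --date="$(date +%Y-%m-01) -1 month" +%Y%m%d)
    last=$(date --date="$(date +%Y-%m-01) -1 day" +%Y%m%d)
    title="Utilisation for $(date --date="$first" +'%B, %Y')"
    printf '%s %s %s\n' "$first" "$last" "$title"
}
# The title contains spaces, so quote each argument when calling the jar:
#   java -jar /home/hpcdata/PBS/JHPCChart.jar "$first" "$last" "$title"
```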
I am also working on a nice 3D graphing utility based on the same data. It is clearly a work in progress, but each “box” is a compute node: the colour of the box shows how many CPUs are being used, and the height of the ‘water’ shows how much memory is being consumed. I’m still going on this.
I’m doing the above for a bit of fun, really, and because I miss doing visualisation work. I have some good ideas on how to take this further. I’m motivated by some of Paul Bourke’s PBS visualisation work. If anybody is interested in the source so we can work on it together, just email me or get in touch via the comments.
I am trying to get full approval to open source everything on this page and get it up on github.