Feb 06 2012

I’m putting this stuff up here in case anybody finds it useful. There are still a few bugs in it, which I fix as I find them, but it has been stable for some time now.

It is nice to monitor how much a cluster is being used. We can only really afford a small system (cash supply problem), so we have to make the best of what we have.

I have found that the best way to capture usage stats is directly from pbsnodes -av.

The images above don’t have the full month’s data; the head node was reimaged and my crontab wasn’t restored. The red line is the number of cpus in the “online” state. The blue utilisation line is the aggregate number of cpus requested through PBS, except that if more than 90% of a node’s memory has been requested, the node is counted as 100% utilised. We realise that you can still squeeze jobs onto a node with less than 10% of its memory free (on a 128GB node, that is still over 12GB), so once all of our nodes have more than 48GB I will raise this threshold to 95% or even 98%.
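In case it helps, the per-node rule boils down to something like the awk sketch below. This is only a rough illustration: it assumes PBS Pro style pbsnodes -av output with resources_available.ncpus, resources_assigned.ncpus and the matching mem fields, and it ignores node state for brevity (the real tool checks state for the “online” count). Field names vary between PBS flavours, so check your own output first.

/opt/pbs/default/bin/pbsnodes -av | awk '
    # A line with no leading whitespace is a node name
    /^[^ \t]/                         { node = $1; next }
    $1 == "resources_available.ncpus" { avail[node] = $3 }
    $1 == "resources_assigned.ncpus"  { used[node] = $3 }
    $1 == "resources_available.mem"   { memavail[node] = $3 + 0 }
    $1 == "resources_assigned.mem"    { memused[node] = $3 + 0 }
    END {
        for (n in avail) {
            # Count the whole node as busy if more than 90% of its
            # memory is requested, otherwise count the assigned cpus
            if (memavail[n] > 0 && memused[n] / memavail[n] > 0.9)
                busy += avail[n]
            else
                busy += used[n]
            online += avail[n]
        }
        printf "online cpus: %d, busy cpus: %d\n", online, busy
    }'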

My little utility gathers data using this script:

#!/bin/sh

# Database directory to store the files
DATABASE="/home/hpcdata/PBS/DataWarehouse"

# Grab the current time in unix format
time=`date +%s`
# Define the 'scratch' directory
TEMPDIR="/dev/shm"
# Define the 'scratch' file
TEMPFILE="pbs.$time"
# Current archive file
FILE=`date +"pbs_%Y_%m_%d.tar"`

# Output the data to the temporary space
/opt/pbs/default/bin/pbsnodes -av > "$TEMPDIR/$TEMPFILE"

cd "$TEMPDIR"

# Check that the archive exists; if not, create it
if [ ! -e "$DATABASE/$FILE" ] ; then
    # Create a new tar file containing the current data
    tar -cf "$DATABASE/$FILE" "$TEMPFILE"
    rm -f "$TEMPDIR/$TEMPFILE"

    # gzip yesterday's tar file, if it is there
    YESTERDAYFILE=`date --date="1 days ago" +"pbs_%Y_%m_%d.tar"`
    if [ -e "$DATABASE/$YESTERDAYFILE" ] ; then
        gzip "$DATABASE/$YESTERDAYFILE"
    fi
else
    # Still the same day, append the new data to the tar file
    tar -rf "$DATABASE/$FILE" "$TEMPFILE"
    rm -f "$TEMPDIR/$TEMPFILE"
fi

For larger clusters, you can imagine that the amount of data would grow quite large, since the script above is called every minute. That is why the script appends each snapshot to a tar file, and at midnight (on the first capture of the new day) the previous day’s tar file is gzipped. The program below reads the appropriate gzip and parses it into a smaller file (if one is not already present) that contains only the information required to draw the graphs. These can be deleted should they grow too big, but their size is pretty small (by design). There is only one file per day, so even a year of files should be fairly small.
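Each member of a day’s archive is just one pbsnodes -av snapshot, named pbs.&lt;unixtime&gt; by the capture script above, so you can always pull the raw data back out by hand. For example (the date and timestamp here are illustrative only):

# List the snapshots inside a given day's archive
tar -tzf DataWarehouse/pbs_2012_01_15.tar.gz | head
# Extract a single snapshot to stdout
tar -xzOf DataWarehouse/pbs_2012_01_15.tar.gz pbs.1326585600 | less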

I also have this set up as a crontab. Here is the output of crontab -l:

# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/crontab.XXXX4rWlMo installed on Tue Jan 17 08:34:26 2012)
# (Cron version V5.0 -- $Id: crontab.c,v 1.12 2004/01/23 18:56:42 vixie Exp $)
# Gather the stats
* * * * * sh /home/hpcdata/PBS/CapturePBSData.sh >> /dev/null 2>&1
 
# Generate the graphs every 15 minutes
*/15 * * * * sh /home/hpcdata/public_html/HPCGraphs/GenerateHour.sh >> /dev/null 2>&1
*/15 * * * * sh /home/hpcdata/public_html/HPCGraphs/GenerateDay.sh >> /dev/null 2>&1
# Generate the graphs daily
0 0 * * * sh /home/hpcdata/public_html/HPCGraphs/GenerateMonth.sh >> /dev/null 2>&1
0 0 * * * sh /home/hpcdata/public_html/HPCGraphs/GenerateYear.sh >> /dev/null 2>&1

You can see that I run the data capture every minute and generate graphs every fifteen minutes. This is because these graphs go onto a webpage for users. The graphing utility (below) will only parse a day’s tar.gz data if it has not already been done, which cuts down the time and space requirements.

GenerateDay.sh

hpcdata@host:~/public_html/HPCGraphs> cat GenerateDay.sh 
cd /home/hpcdata/public_html/HPCGraphs
java -jar /home/hpcdata/PBS/JHPCChart.jar Day
chmod a+r *_Day.png

When those “GenerateTIME.sh” scripts are run, they read another configuration file, HPCChart.conf, which contains the following information:

# Directory of database
DATABASE /home/hpcdata/PBS/DataWarehouse
# Directory to store charts
IMAGESTORE /home/hpcdata/public_html/HPCGraphs
IMAGEWIDTH 640
IMAGEHEIGHT 360

The directory structure looks like this:

hpcdata@host:~/PBS> pwd
/home/hpcdata/PBS
hpcdata@host:~/PBS> ll
total 63
-rwx------ 1 hpcdata users  1050 2012-01-16 08:39 CapturePBSData.sh
-rw-r----- 1 hpcdata users   493 2012-01-17 08:35 CrontabEntry.txt
drwx------ 3 hpcdata users  8192 2012-02-08 00:00 DataWarehouse
-rw------- 1 hpcdata users 34143 2012-01-16 08:45 JHPCChart.jar
drwx------ 2 hpcdata users   251 2012-01-16 07:04 lib

And the public bit:

hpcdata@host:~/public_html/HPCGraphs> pwd
/home/hpcdata/public_html/HPCGraphs
hpcdata@host:~/public_html/HPCGraphs> ll
total 228
-rw-r--r-- 1 hpcdata users 10825 2012-02-08 10:00 Aggregate_Day.png
-rw-r--r-- 1 hpcdata users 11754 2012-02-08 10:00 Aggregate_Hour.png
-rw-r--r-- 1 hpcdata users 19467 2012-02-06 12:07 Aggregate_January_2012.png
-rw-r--r-- 1 hpcdata users 11331 2012-02-08 00:00 Aggregate_Month.png
-rw-r--r-- 1 hpcdata users 11685 2012-02-08 00:01 Aggregate_Year.png
-rw------- 1 hpcdata www     106 2012-01-16 09:53 GenerateDay.sh
-rw------- 1 hpcdata www     107 2012-01-16 09:53 GenerateHour.sh
-rw------- 1 hpcdata www     110 2012-01-16 09:53 GenerateMonth.sh
-rw------- 1 hpcdata www     108 2012-01-16 09:54 GenerateYear.sh
-rw------- 1 hpcdata www     172 2012-01-31 11:11 HPCChart.conf
-rw-r--r-- 1 hpcdata users 20509 2012-02-08 10:00 Utilisation_Day.png
-rw-r--r-- 1 hpcdata users 16985 2012-02-08 10:00 Utilisation_Hour.png
-rw-r--r-- 1 hpcdata users 44110 2012-02-06 12:07 Utilisation_January_2012.png
-rw-r--r-- 1 hpcdata users 22667 2012-02-08 00:00 Utilisation_Month.png
-rw-r--r-- 1 hpcdata users 28273 2012-02-08 00:01 Utilisation_Year.png

Downloads: Please note, my little server only has a 1Mbps upload, so it can take a little while 🙁
The jar file can be found here: JHPCChart.tar.gz
The code to generate the graphs: temporarily removed.

Update: Here is an updated version of JHPCChart.jar
Reason: I used a version of jtar that would go into an infinite loop when given a corrupted tar file. I submitted a bug report, later fixed the bug myself, and am waiting for the fix to be released. A corrupted tar file can occur when the cluster admin or head node goes down while the tar file is being written.

There are several ways to run the above program:

java -jar /home/hpcdata/PBS/JHPCChart.jar Hour
java -jar /home/hpcdata/PBS/JHPCChart.jar Day
java -jar /home/hpcdata/PBS/JHPCChart.jar Week
java -jar /home/hpcdata/PBS/JHPCChart.jar Month
java -jar /home/hpcdata/PBS/JHPCChart.jar Year
java -jar /home/hpcdata/PBS/JHPCChart.jar 20120101 20120131 "Utilisation for January, 2012"

That last one is in the format:
JHPCChart.jar StartDate FinishDate “Title”
where StartDate and FinishDate are of the format YYYYMMDD.
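As an aside, that explicit form makes it easy to automate the monthly named charts (like the Aggregate_January_2012.png above). A minimal sketch, assuming GNU date; this helper is hypothetical and not part of the download:

#!/bin/sh
# Render last month's chart using the StartDate/FinishDate form above
START=$(date --date="$(date +%Y-%m-01) -1 month" +%Y%m%d)   # first day of last month
FINISH=$(date --date="$(date +%Y-%m-01) -1 day" +%Y%m%d)    # last day of last month
TITLE=$(date --date="$START" +"Utilisation for %B, %Y")
java -jar /home/hpcdata/PBS/JHPCChart.jar "$START" "$FINISH" "$TITLE"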

Sorry to those of you who use different date formats, I’m Australian 🙁

Extra
I am also working on a nice 3D graphing utility based on the same data. It is clearly a work in progress, but each “box” is a compute node: the colour of the node shows how many cpus are being used, and the height of the ‘water’ shows how much memory is being consumed. I’m still working on this.

I’m doing the above for a bit of fun, really, and because I miss doing visualisation work. I have some good ideas about how to take this further. I’m motivated by some of Paul Bourke’s PBS visualisation work. If anybody is interested in the source so we can work on it together, just email me or get in touch via the comments.

I am trying to get full approval to open source everything on this page and get it up on github.
