Mar 13, 2012

Sometimes I do a lot of processing from the command line, and over the years I’ve really grown to love bash programming. However, one of the things I think needs to be improved is parallel processing.

Below, I highlight several ways in which I perform parallel scripting. Sometimes I have to use specific examples to get the point across.

The Generic example:

JOBS=1000
 
numRunning=0
numProcs=10
 
for i in $(seq 1 $JOBS)
do
  {
    # Do some task in here for this iteration of i
  } &
 
  #Count the number of jobs currently running.
  numRunning=`jobs -p | wc -l`	
  while [ "$numRunning" -ge "$numProcs" ]
  do
    # I have hit the maximum number of simultaneous jobs running
    # Sleep periodically, and recheck.
    sleep 1
    numRunning=`jobs -p | wc -l`
  done
done
 
# Wait for the remaining jobs to finish
numRunning=`jobs -p | wc -l`
echo "waiting for $numRunning jobs to finish"
wait

You can see that I have 1000 jobs that need to be done (JOBS=1000) and I have defined that I want to run at most 10 in parallel (numProcs=10). Whatever task I need to perform, I put the sequence of commands between the braces. This ‘task’ is then backgrounded, and the number of backgrounded jobs can be counted with the ‘jobs’ command. If each task takes very little time, I would recommend using a decimal value for the sleep command (e.g. sleep 0.1) rather than sleeping for a full second.
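A note on the iteration construct: brace expansion in bash runs before variable expansion, so {1..$JOBS} produces a single literal word rather than a numeric sequence, whereas $(seq 1 $JOBS) expands at run time. A quick sketch:

```shell
JOBS=3
# Brace expansion happens before $JOBS is substituted, so this loop
# sees the single literal word "{1..3}" instead of 1 2 3
for i in {1..$JOBS}; do echo "brace: $i"; done

# seq is expanded at run time and yields the expected 1 2 3
for i in $(seq 1 $JOBS); do echo "seq: $i"; done
```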

It should be noted that this approach does not work when numProcs=1.

The following example is me processing thousands of zip files that each contain potentially thousands of tiff files that need to be OCR’ed (Optical Character Recognition).

DATADIR="/media/Data/Data/wipo/publication/2004"
FINISHDIR="/media/Data/Data/wipo/WIPO_GOCR/2004"
 
# GOCR Version (0.49 - My tuned)
GOCR="/home/dwyer3/Public/gocr-0.49/src/gocr"
# Options for GOCR
GOCR_OPTIONS=" -d 12 -u \* -m 2 -C \"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890<>,.();/_=-\" -f XML "
 
# Parallel variables
numRunning=0
numProcs=10
 
skip=1
for z in $(find $DATADIR -iname "*.zip" -exec du -sk {} \; | sort -nr)
do
    # Every second iteration needs to be skipped - just the size of the file
    if [[ $skip == 1 ]]
    then
        let skip=0
        continue;
    else
        let skip=1
    fi
 
    # Define a work group (Inside the brackets is done in parallel)
    {
        # Strip the path and extension to get the bare filename
        currentdir="$(dirname "$z")"

        file_noext="${z%.*}"
        justFilename="${file_noext##*/}"

        # Extract the 001 of a name like 76547-001
        ext=${justFilename##*-}
 
        # Create the new output directory
        PROCESSDIR=${currentdir//$DATADIR/$FINISHDIR}
        PROCESSDIR=$PROCESSDIR-$ext
 
        # If directory does not exist, create it
        if [ ! -d "$PROCESSDIR" ]; then
            mkdir -p "$PROCESSDIR"
        fi
 
        # Unzip the file into PROCESSDIR
        unzip -q -o "$z" -d "$PROCESSDIR"

        # Delete all but the tif files
        for i in `ls "$PROCESSDIR" | grep -v '\.tif$'`; do rm "$PROCESSDIR/$i"; done

        numFiles=`ls "$PROCESSDIR" | wc -l`
        echo "Processing: $z $numFiles"
 
        # Get a list of all the tif files
        for f in $(find "$PROCESSDIR" -iname "*.tif")
        do
            file_noext="${f%.*}"
            justFilename="${file_noext##*/}"
 
            pbmfile="$PROCESSDIR/$justFilename.pbm"
            convert "$f" "$pbmfile"
            # Perform the GOCR (the options are inlined here rather than
            # passed via $GOCR_OPTIONS, since the -C character list needs
            # its own careful quoting)
            $GOCR -i "$pbmfile" -d 12 -u \* -m 2 -C "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890<>,.();/_=-" -f XML -o "$pbmfile.xml" 2> /dev/null

            rm "$pbmfile"
        done
    } &
    # Background this 'task'
 
    #Count the number of jobs currently running.
    numRunning=`jobs -p | wc -l`	
    while [ "$numRunning" -ge "$numProcs" ]
    do
      # I have hit the maximum number of simultaneous jobs running
      # Sleep periodically, and recheck.
      sleep 1
      numRunning=`jobs -p | wc -l`
    done
 
done
 
# Wait for the remaining jobs to finish
numRunning=`jobs -p | wc -l`
echo "waiting for the final $numRunning jobs"
wait

The interesting bit in the above example is the successful attempt at load balancing via the sorted find. Some of the zip files being processed can contain up to 30,000 tiff images, which would take a single processor a very long time. So, to load balance, I sort all of the zip files from biggest to smallest and process them in that order, which makes a good attempt at keeping every processor as busy as possible. The core handling the 30,000-tiff file was running for most of the total time; had that file turned up towards the end of an unsorted run, it would have been a disaster in terms of total time to complete the work.
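The skip flag in the loop exists only because du prints "size&lt;TAB&gt;path" and the unquoted command substitution word-splits both fields into loop items. Piping through awk to keep just the path column removes the need for it; a small sketch using throw-away stand-in files (the demo names are mine, not from the real data set):

```shell
# Build two throw-away zip stand-ins of very different sizes
mkdir -p demo
dd if=/dev/zero of=demo/big.zip bs=1024 count=200 2> /dev/null
printf 'x' > demo/small.zip

# du -sk emits "size<TAB>path"; sort -nr puts the biggest first and
# awk keeps only the path column, so no skip flag is needed
for z in $(find demo -iname '*.zip' -exec du -sk {} \; | sort -nr | awk '{print $2}')
do
    echo "would process: $z"
done
```

Like the original, this still assumes there is no whitespace in the paths.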

Makefile

Another way to perform parallel processing from the command line is via Make:

FILES=$(shell find . -type f -name '*.zip' -exec du -sk {} \; | sort -nr | awk '{print $$2}')
DIRS=$(basename $(FILES))

all: $(DIRS)

$(DIRS): %: %.zip
	mkdir -p $@
	unzip -q -o -d $@ $<

.PHONY: $(DIRS)

Invoking this with make -j12 unzips up to 12 archives in parallel, and the du/sort pipeline in FILES applies the same biggest-first load balancing as before.
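For simple one-command jobs, GNU xargs can also do the throttling itself via its -P option. A minimal sketch (the file names here are hypothetical stand-ins, and each job just echoes in place of the real unzip/OCR work):

```shell
mkdir -p xdemo
touch xdemo/a.zip xdemo/b.zip xdemo/c.zip

# -P 12 runs up to 12 jobs at once; -print0/-0 keeps odd file names safe.
# Each file name is handed to a small sh one-liner as $1.
find xdemo -name '*.zip' -print0 \
    | xargs -0 -P 12 -I{} sh -c 'echo "processing $1"' _ {} > xdemo/log.txt

cat xdemo/log.txt
```

The output order is not deterministic, since the jobs run concurrently.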
