Mex files are an excellent tool to have in the HPC toolchest. Matlab is a great tool to prototype in; You don’t have to worry too much about allocating variables and all of those things that can take time in a compiled language … you can just get on with the job.
But Matlab is slow. Particularly when you are dealing with large datasets. This is where mex files come in. You can take advantage of the ease of changing parameters, easy saving/loading of datasets, quick visualisation, etc but also get the computational backend grunt of specific hardware coupled with compilers that understand the hardware.
Enough talk.
The hardest thing with Mex file programming is getting your variables into and out of your C/C++/Fortran program. Matlab has plenty of examples for doing this, I’m just providing additional examples. You get data in and out of Matlab and C/C++/Fortran via the
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 | // Matlab includes // Note: These are located under MATLAB/extern/include #include "mex.h" #include "matrix.h" // Standard includes #include <math.h> #define GRAD_A_RAD(g) ((g)*0.0174532925199433) // Forward declaration bool ConvertLL2UTM(double Lat, double Long, double *northing, double *easting); /* The Matlab call is as follows: * [Easting, Northing] = LL2UTM(Longitude, Latitude); * * Lat or Long can be a single value or they can be arrays. * As a result, Easting/Northing can be scalars or vectors. * */ void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) { // Array size checking int numberOfElements_Lat; int numberOfElements_Long; int n, m; // Inputs arrays double *latitude; double *longitude; // Output arrays double *northing; double *easting; // First, verify the correct number of input/output parameters if(nrhs!=2) { mexErrMsgIdAndTxt("LL2UTM", "Two inputs required."); } if(nlhs!=2) { mexErrMsgIdAndTxt("LL2UTM","Two outputs required."); } ///////////////////////////// // Query the input parameters ///////////////////////////// // Get the size of the first input parameter : Latitude numberOfElements_Lat = mxGetNumberOfElements(prhs[0]); // Get the size of the second input parameter: Longitude numberOfElements_Long = mxGetNumberOfElements(prhs[1]); // m must equal n if ( numberOfElements_Lat != numberOfElements_Long ) { mexErrMsgIdAndTxt("LL2UTM", "Two inputs are not equal dimensions."); } // Get the latitude and longitude latitude = mxGetPr(prhs[0]); longitude = mxGetPr(prhs[1]); // Get the sizes of the input such that the output matches // Get the number of columns n = mxGetN(prhs[0]); // Get the number of rows m = mxGetM(prhs[0]); /////////////////////////////// // Setup the output parameters /////////////////////////////// plhs[0] = mxCreateDoubleMatrix(m, n, mxREAL); easting = mxGetPr(plhs[0]); plhs[1] = mxCreateDoubleMatrix(m, n, mxREAL); northing = mxGetPr(plhs[1]); // Perform the calculations for (int i = 0; i<numberOfElements_Lat; i++) { ConvertLL2UTM(latitude[i], longitude[i], &northing[i], &easting[i]); } } /* * Convert Latitude/Longitude to UTM's northing and easting * * I would not take this conversion for gospel - works for Australia * I'd test it before relying on it - just sayin' :P */ bool ConvertLL2UTM(double Lat, double Long, double *northing, double *easting) { const double a = 6378137.0; const double ee = 0.00669437999; const double k0 = 0.9996; const double e2 = ee / (1-ee); // -180.00 .. 179.9; double LongTemp = (Long+180)-int((Long+180)/360)*360-180; double LatRad = GRAD_A_RAD(Lat); double LongRad = GRAD_A_RAD(LongTemp); double LongOriginRad; double N, T, C, A, M; //Make sure the longitude is between -180.00 .. 179.9 int zona = int((LongTemp + 180)/6.0) + 1; if (Lat >= 56.0 && Lat < 64.0 && LongTemp >= 3.0 && LongTemp < 12.0) zona = 32; // Special zones for Svalbard if (Lat >= 72.0 && Lat < 84.0) { if (LongTemp>=0.0 && LongTemp<9.0) zona = 31; else if (LongTemp>=9.0 && LongTemp<21.0) zona = 33; else if (LongTemp>=21.0 && LongTemp<33.0) zona = 35; else if (LongTemp>=33.0 && LongTemp<42.0) zona = 37; } LongOriginRad = GRAD_A_RAD((zona-1)*6 - 180 + 3); N = a/sqrt(1-ee*sin(LatRad)*sin(LatRad)); T = tan(LatRad)*tan(LatRad); C = e2*cos(LatRad)*cos(LatRad); A = cos(LatRad)*(LongRad-LongOriginRad); M = a*((1 - ee/4 - 3*ee*ee/64 - 5*ee*ee*ee/256)*LatRad - (3*ee/8 + 3*ee*ee/32 + 45*ee*ee*ee/1024)*sin(2*LatRad) + (15*ee*ee/256 + 45*ee*ee*ee/1024)*sin(4*LatRad) - (35*ee*ee*ee/3072)*sin(6*LatRad)); *easting = (double)(k0*N*(A+(1-T+C)*A*A*A/6 + (5-18*T+T*T+72*C-58*e2)*A*A*A*A*A/120) + 500000.0); *northing = (double)(k0*(M+N*tan(LatRad)*(A*A/2+(5-T+9*C+4*C*C)*A*A*A*A/24 + (61-58*T+T*T+600*C-330*e2)*A*A*A*A*A*A/720))); if (Lat < 0) { //10000000 metre offset for southern hemisphere *northing += 10000000; } return true; } |
Note that prhs[x] denotes input variables and plhs[x] denotes output variables.
And the associated Makefile
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | #!/bin/make # Path to where Matlab is installed MATLAB = /home/dwyer3/Software/R2010b # The source file with the mexFunction SRC = Main.cpp # What is the name for the mex file? BINARY = LL2UTM # I want to generate the mex file for a linux x86_64 MEX_SUFFIX = mexa64 # What compiler to use? CC = icpc OPTS = -DMATLAB_MEX_FILE -fPIC -fno-omit-frame-pointer -DMX_COMPAT_32 -O3 OBJS = $(SRC:.cpp=.o) INC = -I$(MATLAB)/extern/include -I$(MATLAB)/simulink/include LIBS = -shared -Wl,--version-script,$(MATLAB)/extern/lib/glnxa64/mexFunction.map -Wl,--no-undefined LIBS += -Wl,-rpath-link,$(MATLAB)/bin/glnxa64 -L$(MATLAB)/bin/glnxa64 -lmx -lmex -lmat LIBS += -lpthread .SUFFIXES: .cpp .o .cpp.o: $(CC) $(OPTS) $(INC) -c $< -o $@ Go: $(OBJS) $(CC) $(OBJS) $(OPTS) -o $(BINARY).$(MEX_SUFFIX) $(LIBS) @echo Binary created!! clean: set nonomatch; rm -f $(BINARY).$(MEX_SUFFIX) *.o |
To compile:
[dwyer3@hpc-md LL2UTM]$ make
icpc -DMATLAB_MEX_FILE -fPIC -fno-omit-frame-pointer -DMX_COMPAT_32 -O3 -I/home/dwyer3/Software/R2010b/extern/include -I/home/dwyer3/Software/R2010b/simulink/include -c Main.cpp -o Main.o
icpc Main.o -DMATLAB_MEX_FILE -fPIC -fno-omit-frame-pointer -DMX_COMPAT_32 -O3 -openmp -o LL2UTM.mexa64 -shared -Wl,--version-script,/home/dwyer3/Software/R2010b/extern/lib/glnxa64/mexFunction.map -Wl,--no-undefined -Wl,-rpath-link,/home/dwyer3/Software/R2010b/bin/glnxa64 -L/home/dwyer3/Software/R2010b/bin/glnxa64 -lmx -lmex -lmat -lpthread
Binary created!!
You will see that there is now an executable mex file:
[dwyer3@hpc-md LL2UTM]$ ll
total 40
-rwxrwxr-x. 1 dwyer3 dwyer3 12407 Apr 12 12:38 LL2UTM.mexa64
In Matlab, I make sure that my “Current Folder” is the same as where the executable mex file just created (or add it to the Matlab path [a way to do this is to add a directory via /$MATLAB/toolbox/local/matlabrc.m] – I have a directory of Mex files for specific jobs … the above being one). I have a list of latitude and longitude coordinates
lat <14316603x1 double>
long <14316603x1 double>
>> tic; [Easting, Northing] = LL2UTM(lat, long); toc;
Elapsed time is 3.491809 seconds.
Now, I don’t really know whether this is good or bad since I don’t have a Matlab version of LL2UTM. However, I know I can be very happy if I parallelise the for loop in the Main.cpp
On line 81, just prior to the for loop, add
#pragma omp parallel for
In the Makefile, add
-openmp
to the end of line 19.
Recompile the mexfile and run the example above again in Matlab
>> tic; [Easting, Northing] = LL2UTM(lat, long); toc;
Elapsed time is 0.304706 seconds.
Now, you can see why I like Mex files. It’s not a perfect scaling, but I get results quicker. Note that the -openmp is an Intel Compiler thing. To see the same for GNU compilers, please see my OpenMP Introduction post.
Speed Tip:
Currently, the above uses math.h. I should be using mathimf.h (particularly given the number of times those math functions are called). If I use Intel’s -limf, note that these will have to be compiled statically so that there is no version mismatch between my version and Matlab’s version. To see an example, please see my MATLAB MKL MEX post
Thanks to observations by Anthony Rasmussen, the above code can be sped up. He was quite right to point these flaws out and in order to keep up the pretence that I know what I’m talking about, an alternative version of the above can be found here. 🙂
Serial:
>> tic; [Easting, Northing] = LL2UTM(lat, long); toc;
Elapsed time is 0.993255 seconds.
Parallel:
>> tic; [Easting, Northing] = LL2UTM(lat, long); toc;
Elapsed time is 0.222407 seconds.
I’ll put a cuda version up soon … stay tuned.
In ConvertLL2UTM(), lots of multiplication and division by numbers that are a factor of two. Does replacing these ops with bit-shifts help, or the function doesn’t get hit enough for it to matter? (DISCLAIMER: large glasses of sav blanc = 5. Anger with computers at large = high.)
Hey man, I see your sav. and raise a semillon.
You’re right, but personally I’d go with a const declaration (as opposed to a #define) of the inverse. I used to be a fan of bit shifting, but according to the Intel Architecture Optimisation Guide, the bit shift requires 2 cycles (1 latency, 1 throughput). Given that a multiply is also 1 latency, 1 throughput but also has the advantage of vectorisation (see here – admittedly I’m not using here) and bit shifting is not a vectorial operation (currently), it is probably much of a muchness. Probably worth a micro-benchmark. Hmmm, an hour of tomorrow just got booked solid.
I like code to be easily readable – like a book. Personally, me seeing bit shifting and other tricks makes you do a double take. The same goes with relying on obscure IEEE754 rules. Easily overlooked to the inexperienced eye.
Looking at it again, what I’d do to milk the flops would be to bring the sin, cos and tan operations together. Bust out some trig identities. Would probably register some of the ee and A combinations.
I get a sick pleasure out of these sort of things.
Mark, sober again. Point taken re bit-shifting. Makes for code that is hard to read.
Since ee is const, can’t you pre-evaluate things like (15*ee*ee/256 + 45*ee*ee*ee/1024) as another const ?
I’ve updated the code to something a little faster. Actually, now I look at it again, something could be done with those nested if’s
*sigh*