Additional optimization of CCM3.2
By Egil Støren and Arild Burud, Norwegian Service Centre for Climate Modelling
Background
The NoSerC team has already optimized a version of the CCM3.2 model based on source code provided by Jon Egill Kristjansson. This is documented in the report Optimising a modified ccm3.2 climate model. Later, new code was added to the source by Øyvind Seland and Alf Kirkevåg. An integrated version, containing the contributions from all three researchers, was provided by Øyvind Seland. On July 18, 2002, this source was copied from the directory /home/u4/oyvinds/ccm3.2/SGI.T42.spectral.std.indir on Gridur to our directory, and it has been the basis for further optimization.
Summary of results
All optimization tests were performed on the Gridur computer.
The optimization effort improved the CPU time for a 24-hour simulation run on 8 processors by about 18.5 %.
Summary of optimization methods
A major problem with the code was its use of large arrays and loops that scan whole arrays with little localized reuse of array elements. Tight loops also accessed parts of arrays positioned far apart in memory. Both patterns resulted in poor cache utilization. One remedy has been to reorder the indices of some arrays. One large array has been removed from the source completely, and the declarations of other arrays have been changed, reducing the storage needed per array element from 8 bytes to 4 bytes.
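The effect of index order on memory access can be illustrated with a small sketch. It assumes column-major (Fortran) storage, where the first index varies fastest, and computes the byte distance between two consecutive iterations of an inner loop. The array dimensions, the helper names and the choice of which index the inner loop runs over are illustrative assumptions, not taken from the CCM3.2 source.

```python
def column_major_offset(index, shape):
    """Linear element offset of a 0-based index in a column-major array.

    In column-major (Fortran) order the first index varies fastest.
    """
    offset = 0
    stride = 1
    for i, n in zip(index, shape):
        offset += i * stride
        stride *= n
    return offset


def inner_loop_stride_bytes(shape, inner_axis, element_bytes):
    """Byte distance between consecutive inner-loop iterations over inner_axis."""
    first = [0] * len(shape)
    second = list(first)
    second[inner_axis] = 1
    elements = column_major_offset(second, shape) - column_major_offset(first, shape)
    return elements * element_bytes


# Illustrative dimensions (assumed, not from the source)
NLON, NLEV, NLAT = 128, 18, 64

# Inner loop over longitude while longitude is the LAST index: a(lat, lev, lon)
stride_before = inner_loop_stride_bytes((NLAT, NLEV, NLON), inner_axis=2, element_bytes=8)

# Indices reordered so longitude is the FIRST index: a(lon, lev, lat),
# with elements stored in 4 bytes instead of 8
stride_after = inner_loop_stride_bytes((NLON, NLEV, NLAT), inner_axis=0, element_bytes=4)

print(stride_before)  # 9216 bytes between consecutive accesses
print(stride_after)   # 4 bytes: consecutive accesses are adjacent in memory
```

In this sketch the reordered array is traversed with unit stride, so every element of a fetched cache line is used before the line is evicted, and halving the element size doubles the number of elements per cache line.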
Another main improvement resulted from changing the processor scheduling in parallel loops. Two alternatives are available (SIMPLE and DYNAMIC). By default, all loops had SIMPLE scheduling. For one of the loops, this has been changed to DYNAMIC. This change alone accounted for almost half of the improvement in CPU time.
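Why the scheduling type can matter is easiest to see in a toy model. The sketch below assumes that SIMPLE gives each processor one contiguous block of iterations fixed in advance, and that DYNAMIC lets whichever processor becomes idle take the next iteration, paying a fixed cost per assignment. The workload, the overhead value and the function names are hypothetical. The model is not a measurement of CCM3.2.

```python
import heapq


def simple_makespan(costs, nproc):
    """Static scheduling: each processor gets one contiguous block of iterations."""
    n = len(costs)
    block = -(-n // nproc)  # ceiling division
    loads = [sum(costs[p * block:(p + 1) * block]) for p in range(nproc)]
    return max(loads)


def dynamic_makespan(costs, nproc, overhead=0.0):
    """Dynamic scheduling: the processor that becomes idle first takes the next iteration.

    overhead is a hypothetical fixed cost paid for every assignment.
    """
    finish_times = [0.0] * nproc
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + overhead + cost)
    return max(finish_times)


# Hypothetical unbalanced workload: iteration cost grows with the index
unbalanced = [1.0 + 0.1 * i for i in range(64)]
static_time = simple_makespan(unbalanced, 8)
dynamic_time = dynamic_makespan(unbalanced, 8, overhead=0.05)

# Hypothetical balanced workload: every iteration costs the same
balanced = [1.0] * 64
static_balanced = simple_makespan(balanced, 8)
dynamic_balanced = dynamic_makespan(balanced, 8, overhead=0.05)

print(static_time, dynamic_time)
print(static_balanced, dynamic_balanced)
```

In this model the dynamic variant finishes earlier when iteration costs are uneven, because no processor is left holding the most expensive block. When all iterations cost the same, the per-assignment overhead makes it slightly slower than the static variant.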
Additionally, minor improvements have been obtained by various techniques,
such as moving loop-invariant computations out of loops.
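As a minimal sketch of moving a loop-invariant computation out of a loop, the two hypothetical functions below compute the same result. The second evaluates the invariant factor once instead of once per iteration. The function names and the computation are invented for illustration and do not come from the model source.

```python
import math


def scale_naive(values, angle_deg):
    out = []
    for v in values:
        # The cosine does not depend on v, yet it is recomputed every iteration
        out.append(v * math.cos(math.radians(angle_deg)))
    return out


def scale_hoisted(values, angle_deg):
    # Loop-invariant factor computed once, before the loop
    factor = math.cos(math.radians(angle_deg))
    return [v * factor for v in values]


sample = [1.0, 2.0, 3.0]
print(scale_naive(sample, 60.0) == scale_hoisted(sample, 60.0))
```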
Details pertaining to individual files
Experiments with parallel scheduling
To find out how the scheduling options perform in runs using different numbers of processors, the following experiment was performed.
For all the files using the DOACROSS directive, scheduling was set to either SIMPLE (for all files) or DYNAMIC (for all files). For each of these two alternatives, runs were executed using 2, 4, 8, 12, 16, 20, 24 and 32 processors. Each run was executed three times, and the medians were used to produce the charts below. The first chart shows the CPU time and real time used, as measured by the time command. The continuous curve shows the CPU time for an idealized situation where performance scales perfectly with the number of processors.
The next chart shows the improvement in CPU performance gained by using DYNAMIC scheduling compared to SIMPLE scheduling.
Although this experiment did not take into account that different files
have varying benefits from using the DYNAMIC approach, it shows clearly
that DYNAMIC scheduling is best suited for runs using only up to about
30 processors. Beyond this, use of DYNAMIC scheduling could be counterproductive
due to large overhead time.
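The comparison described above, taking the median of three runs per configuration and then comparing DYNAMIC against SIMPLE, can be sketched as follows. The timing values are hypothetical placeholders, not the measured results. The definition of improvement as a relative reduction in median CPU time is an assumption, since the report does not state the formula used.

```python
from statistics import median


def relative_improvement(simple_runs, dynamic_runs):
    """Percentage reduction in median CPU time of DYNAMIC relative to SIMPLE."""
    t_simple = median(simple_runs)
    t_dynamic = median(dynamic_runs)
    return 100.0 * (t_simple - t_dynamic) / t_simple


# Hypothetical CPU times in seconds, three runs per configuration
simple_runs = [1000.0, 1040.0, 1010.0]
dynamic_runs = [900.0, 909.0, 950.0]

improvement = relative_improvement(simple_runs, dynamic_runs)
print(improvement)
```

Using the median rather than the mean keeps a single disturbed run, for example one that competed with other jobs for the machine, from distorting the comparison.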