A first test case was the vertex-cell incidence counting algorithm shown on page , using a simple array-based Fortran77 data structure for triangular grids as point of reference:
INTEGER ntri INTEGER til(1:3,1:ntri), ncells(1:nv) DO 20 c = 1,ntri DO 10 vc = 1,3 ncells(til(vc,c)) = ncells(til(vc,c))+1 10 CONTINUE 20 CONTINUE
Here ntri
is the number of triangles, and til(v,c)
is the index of vertex number v of cell c (
v).
It turns out that in this case, the KAI C++ compiler
(options used: +K3)
can completely eliminate
the overhead due to abstraction.
The algorithm involves indirect addressing, which is much more typical of unstructured grid algorithms than are plain vector loops in a BLAS style.
Another test case involving geometric primitives
still shows a non-vanishing overhead,
where the factor varies between 1.2 and 1.8.
For instance, the following loop for summing up
the facet normals of a cell has an overhead of about 1.8
if a generic grid geometry is used,
and of 1.2 if the geometry is specialized to the grid type
(that is, using low-level code inside the calculation of normal()):4
coord_type normal(0,0); for(CellIterator c(aGrid); c; ++c) for(FacetOnCellIterator fc(*c); fc; ++fc) normal += Geom.normal(fc);
The only difference between the two implementations of normal() is, that in the high-level case coordinate values are stored in a grid function and accessed via intermediate vertices (carrying an extra pointer to a grid), whereas in the low-level case, coordinate values are stored in an array and are accessed by vertex indices obtained as in the Fortran til cell-vertex incidence array above.
The options used for KCC were +K3 -abstract_float -abstract_pointer -restrict -inline_implicit_space_time=100. Omitting the last option increases run time by a factor of about 3! A reason is probably the deeper nesting of function calls inside normal() in the high-level version.
The examples presented are in some respect a worst-case scenario for the performance of generic implementations using the micro-kernel, because no real work is performed inside the loops. The performance gain obtained by using a better data layout often outweighs the difference between high-level and low-level data access.