 
 
 
 
 
   
A first test case was the vertex-cell incidence counting
algorithm shown on page [*],
using a simple array-based Fortran77 data
structure for triangular grids as point of reference:
   INTEGER ntri, nv
   INTEGER c, vc
   INTEGER til(1:3,1:ntri), ncells(1:nv)
   DO 20 c = 1,ntri
      DO 10 vc = 1,3
         ncells(til(vc,c)) = ncells(til(vc,c)) + 1
10    CONTINUE
20 CONTINUE
Here ntri is the number of triangles, nv the number of vertices,
and til(v,c) is the index of vertex number v of cell c
(1 ≤ v ≤ 3).
It turns out that in this case, the KAI C++ compiler
(options used: +K3) 
can completely eliminate 
the overhead due to abstraction.
The algorithm involves indirect addressing, which is much more typical of unstructured grid algorithms than are plain vector loops in a BLAS style.
Another test case, involving geometric primitives,
still shows a non-vanishing overhead,
with a factor varying between 1.2 and 1.8.
For instance, the following loop for summing up
the facet normals of a cell has an overhead of about 1.8
if a generic grid geometry is used,
and of 1.2 if the geometry is specialized to the grid type
(that is, using low-level code inside the calculation of normal()):
coord_type normal(0,0);
for(CellIterator c(aGrid); c; ++c)
  for(FacetOnCellIterator fc(*c); fc; ++fc) 
    normal += Geom.normal(fc);
The only difference between the two implementations of normal() is that in the high-level case, coordinate values are stored in a grid function and accessed via intermediate vertices (which carry an extra pointer to a grid), whereas in the low-level case, coordinate values are stored in an array and accessed by vertex indices obtained as in the Fortran til cell-vertex incidence array above.
The options used for KCC were +K3 -abstract_float -abstract_pointer -restrict -inline_implicit_space_time=100. Omitting the last option increases run time by a factor of about 3! A likely reason is the deeper nesting of function calls inside normal() in the high-level version.
The examples presented are in some respect a worst-case scenario for the performance of generic implementations using the micro-kernel, because no real work is performed inside the loops. The performance gain obtained by using a better data layout often outweighs the difference between high-level and low-level data access.
 
 
 
 
