Blocked Algorithms and Recursive Algorithms
 Blocked algorithms can be implemented using basically two
different approaches: using loop nests or using recursion (i.e.,
recursive divide and conquer algorithms).
     Modern high-performance libraries are loop-based
        implementations mostly because of:
	
	-  Code Legacy: The first high-performance libraries were
	     written in FORTRAN (e.g., recall that recursion is not
	     allowed in FORTRAN).
	     
	     -  Register allocations and scheduling algorithms: some
	          of the most effective register allocations and
	          instruction scheduling algorithms aim at the
	          optimizations of basic loop nests.
-  Overhead: a recursive algorithm can be often re-written
	     using loop nests without the overhead due to function
	     calls. 
-  Loop Tiling: loop tiling can be used to implement blocked
	     algorithms as effectively as recursion can.
        
 Blocked algorithms can be naturally translated into
        applications using recursive algorithms. This straightforward
        translation makes recursive algorithms a powerful tools in
        developing new applications or prototypes. With the
        presentation in the market of modern processors and compilers,
        some of the important reasons to opt for a pure
        loop-nest-based implementation are weakening.
	
	-  New architectures need new compilers and new
	     libraries. The life-spam of architectures and of
	     compilers, does not justify the investment for
	     hand-tuned high-performance libraries. 
-  In case the environment of choice is based on an
	     interpreter, an optimal register allocation may be a
	     quixotic attempt.
-  Some compiler are more and more aggressive in the
             allocation of function parameters into registers,
             reducing one of the function call overhead. 
-  Modern processors and architectures are based on deeper
	     and deeper memory hierarchy: to exploit the performance
	     potential, tiling must be used several times, tailored
	     and tuned. A recursive implementation will not be always
	     optimal but it will not be always the worst and it will
	     need little maintenance.
 
Compiler Optimizations for Divide and Conquer Applications
   Recursive implementations of divide and conquer algorithms have
  space for tremendous improvements, only if the compiler has
  available a model of the computation. 
  
 
  
   For example, the Fast Fourier Transform in the West (FFTW) is an
  application where multiple algorithms are available and the choice
  is postponed as long as a plan has been determined. The
  choice is based on the profiling of the performance for several
  versions of an algorithm (but for the same input). The plan is a
  model of the dynamic behavior of the computation.
  
   A compiler may optimize the solution for the leaves aggressively
  (because leaves are simple basic blocks) when the leaves have enough
  code to work on (and some pointer disambiguation are possible).
  
   More interestingly, a compiler can determine a model of the
  dynamic behavior and generate run-time support for the execution of
  the application:
  
     -  If a problem of size n is divided in 8 sub-problems 
     of which one has size n/2, the others 7 have size
     n/8, and each sub-problem can be executed independently,
     then the compiler may suggest the allocation of the largest
     problem to the fastest processor in an heterogeneous network,
     while the others are allocated to less powerful processors.
     
-  If a problem is too large to obtain useful information using
     profiling, it may be useful to have a static tool which is able
     to achieve an estimation of how many time a particular leaf is
     executed. A hybrid solution, where a partial execution is allowed
     can offer extremely valuable information as well as fast
     executions.
     
This project aims at the modeling of the decomposition process and
  it aims at the minimization of decomposition work.