Dear Paul,


The sizes of Blitz arrays are run-time rather than compile-time entities.  Thus, there is an inherent difference in the two cases you present below.  One way to recapture the efficiency of the statically allocated array is to use an array-type class in which the array sizes in each dimension are template parameters (e.g., the vestigial Matrix class of blitz, or something like the tvmet class library).  Another approach utilizing the existing blitz Array class involves supplying the Array constructor with memory that has been allocated statically.  Have you checked the performance of code such as shown below?


Regards, Julian C.



#include <iostream>
#include <blitz/array.h>

int main(void)
{
        double pA[128*128];

        blitz::Array<double, 2> A(pA, blitz::shape(128,128), blitz::neverDeleteData);

        A = 1;

        double sum = 0.e0;
        sum = blitz::sum(A);

        std::cout << "Sum : " << sum << std::endl;

        return 0;
}




From: Paul Hilscher []
Sent: Monday, January 24, 2011 6:11 PM
Subject: Re: [Blitz-support] Blitz-support Digest, Vol 53, Issue 1


Dear David,


indeed, the Intel compiler, for example, does vectorization automatically, but only sometimes. There are some requirements

for vectorization, e.g. proper memory alignment, suitable data types, etc. If the compiler can guarantee that these

are fulfilled, it will vectorize; otherwise it will not.


To check, I wrote some example code:





#include <iostream>

int main(void)
{
        double A[128][128];

        for(int i=0; i<128; i++) for(int j=0; j<128; j++) A[i][j] = 1.0e0;

        double sum = 0.e0;
        for(int i=0; i<128; i++) for(int j=0; j<128; j++) sum += A[i][j];

        std::cout << "Sum : " << sum << std::endl;

        return 0;
}






without using blitz. Compiled with icc 11.1 (the Intel compiler) at -O3 (vectorization enabled), it produces

an object file containing (gdb: disassemble main):



 xor    %eax,%eax

0x0000000000400e21 <main+49>:   pxor   %xmm2,%xmm2

0x0000000000400e25 <main+53>:   movaps %xmm2,%xmm1

0x0000000000400e28 <main+56>:   addpd  %xmm0,%xmm1

0x0000000000400e2c <main+60>:   addpd  %xmm0,%xmm2



pxor, movaps and addpd are SSE instructions, so vectorization is used. In the case of

blitz++, with




#include <iostream>
#include <blitz/array.h>

int main(void)
{
        blitz::Array<double, 2> A(blitz::Range(1,128), blitz::Range(1,128));

        A = 1;

        double sum = 0.e0;
        sum = blitz::sum(A);

        std::cout << "Sum : " << sum << std::endl;

        return 0;
}





icc (with the same flags) cannot guarantee proper alignment (the storage is allocated with new/malloc).

Thus vectorization is not used.



(gdb) disassemble _ZN5blitz28_bz_reduceWithIndexTraversalINS_17FastArrayIteratorIdLi2EEENS_9ReduceSumIddEEEENT0_12T_resulttypeET_S5_

Dump of assembler code for function _ZN5blitz28_bz_reduceWithIndexTraversalINS_17FastArrayIteratorIdLi2EEENS_9ReduceSumIddEEEENT0_12T_resulttypeET_S5_:


0x0000000000401f0a <_ZN5blitz28_bz_reduceWithIndexTraversalINS_17FastArrayIteratorIdLi2EEENS_9ReduceSumIddEEEENT0_12T_resulttypeET_S5_+58>:     add    %ebp,%r15d

0x0000000000401f0d <_ZN5blitz28_bz_reduceWithIndexTraversalINS_17FastArrayIteratorIdLi2EEENS_9ReduceSumIddEEEENT0_12T_resulttypeET_S5_+61>:     movslq %r15d,%r15

0x0000000000401f10 <_ZN5blitz28_bz_reduceWithIndexTraversalINS_17FastArrayIteratorIdLi2EEENS_9ReduceSumIddEEEENT0_12T_resulttypeET_S5_+64>:     addsd  (%r11,%r15,8),%xmm0

0x0000000000401f16 <_ZN5blitz28_bz_reduceWithIndexTraversalINS_17FastArrayIteratorIdLi2EEENS_9ReduceSumIddEEEENT0_12T_resulttypeET_S5_+70>:     lea    0x1(%r10,%rax,1),%r15d

0x0000000000401f1b <_ZN5blitz28_bz_reduceWithIndexTraversalINS_17FastArrayIteratorIdLi2EEENS_9ReduceSumIddEEEENT0_12T_resulttypeET_S5_+75>:     inc    %r10d



as can be seen by comparing addpd (packed) above with addsd (scalar, non-packed) here. I will take a closer look at this; perhaps with a few changes it would

be possible to get vectorization into blitz via the compiler's optimizations. SSE (and other SIMD instruction sets) also provide approximate math functions such as 1/sqrt, which

need to be enabled explicitly at compile time (they have different round-off errors).


Concerning your project on efficient iterators, I am very keen to hear more about it.









On Sat, Jan 22, 2011 at 8:19 AM, David Coote <> wrote:

>Message: 5
>Date: Fri, 21 Jan 2011 14:10:19 +0900
>From: Paul Hilscher <>
>Subject: [Blitz-support] blitz++ development
>         <>
>Content-Type: text/plain; charset="iso-8859-1"
>Dear blitz++ support team, Dear Todd, Dear blitz++ developers, Dear all
>I am writing out of concern for blitz++, which has not received much
>support in recent years; the last major
>update was around 5 years ago.
>Currently, I am using blitz++ for an academic simulation code, where it
>greatly enhances the readability and flexibility
>of my code. Unfortunately, in recent years the performance of blitz++ has
>lagged behind native Fortran code,
>mainly because blitz++ does not support vectorization (e.g. SSE3 or AVX).

>I would like to implement vectorization in blitz++, but in order not to
>reinvent the wheel, I would like
>to use some code snippets from eigen (
>which already supports vectorization and OpenMP
>parallelization. Unfortunately I cannot do that due to license restrictions.

This is something at least some compilers should do where
vectorisation is possible. For example, the Intel C++ compiler and
gcc both have extensive support for vectorisation on Intel CPUs. I'm
not sure why you would want to vectorise explicitly in Blitz when the
compiler vendors have already developed this capability.

Given the recent performance measurements I've been doing, one project
that would definitely benefit Blitz performance is efficient iterators.


