Aligned allocating

Developers
Mirage
2005-01-11
2012-10-31
  • Mirage

    Mirage - 2005-01-11

    It would be nice if the bitmaps were aligned to a 16-byte boundary. I don't mean the size but the starting address. This is especially important for SSE code, because the performance drop can be 40-500% when unaligned instructions have to be used.
    There are some functions to do this, but they are not available under all platforms (memalign, valloc).
    You can use this:

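    #include <stdint.h>
    #include <stdlib.h>

    /* alignment must be a power of two, at least sizeof(void *) */
    void *malloc_align(size_t amount, size_t alignment)
    {
        /* extra room for the alignment slack and the saved original pointer */
        void *mem_real = malloc(amount + alignment + sizeof(void *));
        if (mem_real == NULL) return NULL;
        uintptr_t start = (uintptr_t)mem_real + sizeof(void *);
        char *mem_align = (char *)((start + alignment - 1) & ~(uintptr_t)(alignment - 1));
        ((void **)mem_align)[-1] = mem_real; /* stored just below the aligned block */
        return mem_align;
    }
    void free_align(void *mem) { if (mem) free(((void **)mem)[-1]); }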

     
    • Hervé Drolon

      Hervé Drolon - 2005-01-12

      What makes you think that alignment to a 16-byte boundary would improve performance? Can you point to a document on the subject?
      Also, what would be faster: memory allocations, pixel access, or both?

       
      • Mirage

        Mirage - 2005-01-12

        There is a small article:
        http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/pentium4/optimization/52684.htm
        This would speed up the pixel access, which is always a bottleneck. Basically, it influences all the memory access.
        Don't think that this concerns only x86 with XMM registers. Many internal routines in FreeImage could benefit, because, for instance, the old-school instruction "rep movsd" should always be executed with the starting address aligned to 8 bytes.
        I cannot force the alignment unless I copy the entire bitmap into my own buffer, which is itself a memory access.
        Memory allocation of such a huge object probably only depends on the OS or the hardware.
        The allocation of some smaller objects could be improved by using a good STL. The solution is STLport. It's definitely better than GNU C's STL. Example: http://complement.sourceforge.net/compare.pdf

         
    • Hervé Drolon

      Hervé Drolon - 2005-01-13

      So where should the malloc_align function be used? Only for the allocation of a FIBITMAP, or everywhere we use malloc?

       
      • Mirage

        Mirage - 2005-01-14

        I wouldn't use it everywhere. Sometimes it's better to save memory. I think the palette should be aligned as well, and then perhaps some larger contiguous memory blocks or some very frequently accessed objects.

        regards

         
    • Ryan Rubley

      Ryan Rubley - 2005-01-13

      BTW, for Linux, you have a few choices:

      SYNOPSIS
      #include <stdlib.h>

         int posix_memalign(void **memptr, size_t alignment, size_t size);
         void *memalign(size_t boundary, size_t size);
         void *valloc(size_t size);
      

      DESCRIPTION
      The function posix_memalign() allocates size bytes and places the address of the allocated memory in *memptr. The address of the allocated memory will be a multiple of alignment, which must be a power of two and a multiple of sizeof(void *).

      The obsolete function memalign() allocates size bytes and returns a pointer to the allocated memory. The memory address will be a multiple of boundary, which must be a power of two.

      The obsolete function valloc() allocates size bytes and returns a pointer to the allocated memory. The memory address will be a multiple of the page size. It is equivalent to memalign(sysconf(_SC_PAGESIZE),size).
      

      For Mac OSX, there is only

       void *valloc(size_t size);
      
       The valloc() function allocates size bytes of memory and returns a
       pointer to the allocated memory.  The allocated memory is aligned on a
       page boundary.  valloc() returns a NULL pointer if there is an error.
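
A minimal sketch of how the posix_memalign() call from the synopsis above might be wrapped, assuming a POSIX system that provides it. The names alloc_aligned and is_aligned are hypothetical helpers introduced only for illustration; they are not part of FreeImage or of any library.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* ask for the posix_memalign() declaration */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns `size` bytes whose address is a multiple of
 * `alignment`, or NULL on failure. Release the block with plain free(). */
static void *alloc_aligned(size_t alignment, size_t size)
{
    void *p = NULL;
    /* posix_memalign() requires a power of two that is a multiple of sizeof(void *) */
    assert((alignment & (alignment - 1)) == 0 && alignment % sizeof(void *) == 0);
    if (posix_memalign(&p, alignment, size) != 0)
        return NULL;
    return p;
}

/* Hypothetical helper: returns 1 if `p` is a multiple of `alignment`, else 0. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

A caller could then write `void *bits = alloc_aligned(16, size);`, check `is_aligned(bits, 16)`, and later call `free(bits)`. Unlike a hand-rolled malloc_align, no matching custom free function is needed.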
      
       
    • Hervé Drolon

      Hervé Drolon - 2005-01-14

      If I understand what a malloc_align function does, then it shouldn't be used to allocate a FIBITMAP but only for the allocation of the palette and the pixel block. Is that right?
      Because of the way a FIBITMAP is allocated now, this would mean breaking the FIBITMAP into multiple pointers and selecting the right malloc function according to the type of pointer (header, palette, pixels, ...)?
      And what about the FreeImage_GetScanLine function? Suppose that FreeImage_GetBits returns a starting address aligned to 16 bytes. That doesn't mean that FreeImage_GetScanLine will also return a starting address aligned to 16 bytes, does it?

       
      • Floris van den Berg

        Hi Hervé,

        It would if each scanline takes up a multiple of 16 bytes.
        In theory it should be enough to allocate more memory for a bitmap than necessary, and put the palette and bitmap data at aligned addresses. Then modify the GetBits and GetPalette functions to read the data from the new aligned addresses.

        Personally I don't see much point in jumping through hoops to implement all this alignment. The speed issue is IMHO minimal, and we haven't seen speed complaints in the 4 years FreeImage has existed... except for rotation, and that has nothing to do with alignment.

        Floris
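
The scanline condition above can be checked with a little arithmetic. The following is only a sketch: it assumes the usual DIB convention of padding each scanline to a multiple of 4 bytes, and the helper names pitch_dword, pitch_aligned and scanline_offset are hypothetical, not FreeImage functions.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes per scanline, padded to a multiple of 4 (assumed DIB convention). */
static size_t pitch_dword(size_t width, size_t bpp)
{
    size_t line = (width * bpp + 7) / 8;   /* bytes of pixel data per row */
    return (line + 3) & ~(size_t)3;
}

/* Hypothetical alternative: pad each scanline to a multiple of `alignment`,
 * which must be a power of two. */
static size_t pitch_aligned(size_t width, size_t bpp, size_t alignment)
{
    size_t line = (width * bpp + 7) / 8;
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (line + alignment - 1) & ~(alignment - 1);
}

/* Byte offset of scanline y from the start of the pixel block. */
static size_t scanline_offset(size_t y, size_t pitch)
{
    return y * pitch;
}
```

With a 16-byte-aligned pixel block, scanline y starts at an aligned address exactly when `scanline_offset(y, pitch) % 16 == 0`. A 32 bpp image that is 5 pixels wide has a 4-byte-padded pitch of 20 bytes, so its second scanline is off by 4 bytes; padding the pitch to 16 bytes (32 bytes per row here) keeps every scanline aligned at the cost of some memory.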

         
      • Mirage

        Mirage - 2005-01-14

        Yep, you don't have to do this with FIBITMAP.
        I just studied the source.. and this..

        unsigned dib_size = sizeof(FREEIMAGEHEADER);
        dib_size += sizeof(BITMAPINFOHEADER);
        dib_size += sizeof(RGBQUAD) * CalculateUsedPaletteEntries(bpp);
        dib_size += CalculatePitch(CalculateLine(width, bpp)) * height;
        bitmap->data = (BYTE *)malloc(dib_size * sizeof(BYTE));

        .. is nice, but I need a 16-byte-aligned pointer returned by FreeImage_GetBits and perhaps by GetPalette, too. So you have to allocate these independently, because you cannot just insert some padding between the palette and the bitmap (you don't know yet what pointer malloc will return).

        FreeImage_GetScanLine: It's up to the developer to choose the right dimensions and bpp (32) so that it returns this correctly. It can be copied out of FreeImage for advanced usage.

        BTW: It takes about 30 ms to copy one 16-byte-aligned 32 MB block to another on a 3 GHz P4 by ANY method, and about 70 ms to copy the same block misaligned. So you can see it really isn't a very big jump unless you do this many times.

        PS: rotation: this has everything to do with alignment, since it accesses the same bitmap, right? Not to mention rescaling.
        Well, I don't know, I would just suggest reading this awesome document: http://agner.org/assem/pentopt.pdf

         
        • Floris van den Berg

          <quote>
          dib_size += CalculatePitch(CalculateLine(width, bpp)) * height;
          </quote>

          Nice completely unaligned scanlines.

          <quote>
          So, you have to allocate these independently, because you cannot just insert some padding between the palette and the bitmap
          </quote>

          Why not?

          <quote>
          FreeImage_GetScanLine: It's up to the developer to choose the right dimensions and bpp (32),
          </quote>

          A 32-bit bitmap doesn't ensure 16 bytes alignment...

          <quote>
          PS: rotation: this has everything to do with the alignment since it accesses the same bitmap, right?
          </quote>

          Wrong. Because the rotation process accesses the bitmap pixel by pixel, it accesses the bitmap at memory addresses that are not 16-byte aligned. The alignment would only come into play when you, for example, copy entire scanlines. Besides that, rotation is something you usually do not do repeatedly in a loop.

          Floris

           
          • Mirage

            Mirage - 2005-01-15

            Has the anti-performance club already finished? I don't have to prove anything. I am just trying to help. I can do all this stuff by myself and share nothing or share a modification of the library. This is basically how modifications arise (very sad, I know).
            > Nice completely unaligned scanlines.
            ??? this code comes from FreeImage ;]
            > Why not?
            read twice
            > A 32-bit bitmap doesn't ensure 16 bytes alignment...
            that one is actually good - if there is a better one, will you let me know? 32bits = 4bytes = the best available bpp - dimensions.width = 4 = a perfect 16B alignment
            > rotation
            ok, that one is the winner!! - I wouldn't use any of the code from FreeImageToolkit, because it is just not optimized and it won't be until it is vectorized.
            Now, it uses a scalar code optimized by a compiler = not optimized. Some people may think that a memory access doesn't concern the scalar code, but it does. A simple mov instruction is only a pure example. I must again refer to this: http://agner.org/assem/pentopt.pdf
            BTW: 16-byte alignment is not the only alignment. It is only required for SSE instructions. 8B is fine for scalar code and 4B is usually enough. (4B = 32b = 1 pixel of a 32bpp image = the pixel accessed by the rotation code)

            best regards

             
    • Hervé Drolon

      Hervé Drolon - 2005-01-14

      I also think this would need a lot of modifications that may be complex to handle ...
      About the rotation and rescale functions, they have been updated in the CVS and should be faster now :)

      Hervé

       
    • Mirage

      Mirage - 2005-01-14

      Ok, I'm stupid. It is possible to insert some padding:
      unsigned dib_size = sizeof(FREEIMAGEHEADER);
      dib_size += sizeof(BITMAPINFOHEADER);
      dib_size += (dib_size % 16 ? 16 - dib_size % 16 : 0);
      dib_size += sizeof(RGBQUAD) * CalculateUsedPaletteEntries(bpp);
      dib_size += (dib_size % 16 ? 16 - dib_size % 16 : 0);
      dib_size += CalculatePitch(CalculateLine(width, bpp)) * height;
      bitmap->data = (BYTE *)malloc_align(dib_size * sizeof(BYTE), 16);

      ... and change the other procedures accordingly..
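
The padding idea above can be sketched as a small offset calculation. This is an illustration only: align_up, dib_layout and compute_layout are hypothetical names, the sizes are passed in as plain byte counts rather than taken from the real FreeImage structures, and the offsets are only aligned addresses if the block itself starts on a 16-byte boundary (which is what malloc_align(..., 16) is for).

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of alignment (a power of two). */
static size_t align_up(size_t n, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical layout record; offsets are relative to the start of the block. */
typedef struct {
    size_t palette_offset;  /* where the palette would start */
    size_t pixels_offset;   /* where the pixel data would start */
    size_t total_size;      /* bytes to request from the allocator */
} dib_layout;

/* header_size, palette_size and pixels_size are in bytes. */
static dib_layout compute_layout(size_t header_size, size_t palette_size,
                                 size_t pixels_size)
{
    dib_layout l;
    l.palette_offset = align_up(header_size, 16);
    l.pixels_offset = align_up(l.palette_offset + palette_size, 16);
    l.total_size = l.pixels_offset + pixels_size;
    return l;
}
```

The accessor functions would then return the base pointer plus palette_offset or pixels_offset instead of assuming that the palette and pixels follow the headers directly.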

       
