It would be nice if the bitmaps were aligned to a 16-byte boundary. I don't mean the size but the starting address. This is especially important for SSE code, because the performance drop can be 40-500% when using unaligned instructions.
There are some functions to do this, but they are not available under all platforms (memalign, valloc).
You can use this:
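#include <stdint.h>
#include <stdlib.h>
/* alignment must be a power of two, at least sizeof(void *) */
void *malloc_align(size_t amount, size_t alignment)
{
    /* extra room for the padding and for the saved original pointer */
    void *mem_real = malloc(amount + alignment + sizeof(void *));
    if (mem_real == NULL) return NULL;
    uintptr_t addr = ((uintptr_t)mem_real + sizeof(void *) + alignment - 1) & ~(uintptr_t)(alignment - 1);
    ((void **)addr)[-1] = mem_real; /* remember what free() has to release */
    return (void *)addr;
}
void free_align(void *mem) { if (mem) free(((void **)mem)[-1]); }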
What makes you think it would improve the performance if we use a 16-byte boundary alignment? Can you give a pointer to a doc on the subject?
Also, what would be faster: memory allocations, pixel access, or both?
There is a small article:
http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/pentium4/optimization/52684.htm
This would speed up the pixel access, which is always a bottleneck. Basically, it influences all the memory access.
Don't think that it concerns only x86 with xmm registers. There are many internal routines in FreeImage that could gain that benefit, because, for instance, the old school instruction "rep movsd" always should be executed with an alignment of 8 Bytes (the starting address).
I cannot force the alignment unless I copy the entire bitmap to my own one, which is a memory access.
Memory allocation of such a huge object probably only depends on the OS or the hardware.
The allocation of some smaller objects could be improved by using a good STL. The solution is STLport. It's definitely better than GNU C's STL. Example: http://complement.sourceforge.net/compare.pdf
So where should the malloc_align function be used? Only for the allocation of a FIBITMAP, or everywhere we use malloc?
I wouldn't place it everywhere. Sometimes it's better to save memory. I think the palette should be aligned as well. Then perhaps some bigger contiguous memory blocks or some very frequently accessed objects.
regards
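BTW, for Linux, you have a few choices:
SYNOPSIS
#include <stdlib.h>
DESCRIPTION
The function posix_memalign() allocates size bytes and places the address of the allocated memory in *memptr. The address of the allocated memory will be a multiple of alignment, which must be a power of two and a multiple of sizeof(void *).
The obsolete function memalign() allocates size bytes and returns a pointer to the allocated memory. The memory address will be a multiple of boundary, which must be a power of two.
The obsolete function valloc() allocates size bytes and returns a pointer to the allocated memory. The memory address will be a multiple of the page size. It is equivalent to memalign(sysconf(_SC_PAGESIZE),size).
For Mac OSX, there is only
void *valloc(size_t size);
The valloc() function allocates size bytes of memory and returns a pointer to the allocated memory. The allocated memory is aligned on a page boundary. valloc() returns a NULL pointer if there is an error.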
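For what it's worth, the posix_memalign() call described above can be wrapped as in the following minimal sketch. The names alloc_aligned and is_aligned are hypothetical helpers made up for illustration; they are not part of FreeImage or of any libc.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns `size` bytes whose address is a multiple of
 * `alignment`, or NULL on failure. posix_memalign() requires `alignment` to
 * be a power of two and a multiple of sizeof(void *). */
void *alloc_aligned(size_t size, size_t alignment)
{
    void *ptr = NULL;
    if (posix_memalign(&ptr, alignment, size) != 0)
        return NULL;   /* invalid alignment or out of memory */
    return ptr;        /* release with plain free() */
}

/* Hypothetical helper: returns 1 if `ptr` is a multiple of `alignment`
 * (which must be a power of two), 0 otherwise. */
int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

Memory obtained this way is released with the ordinary free(), so no matching free_align is needed on platforms that provide posix_memalign().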
If I understand what a malloc_align function does, then it shouldn't be used to allocate a FIBITMAP but only for the allocation of the palette and the pixel block. Is that right?
Because of the way a FIBITMAP is allocated now, this would mean breaking the FIBITMAP into multiple pointers and selecting the right malloc function according to the type of pointer (header, palette, pixels, ...)?
And what about the FreeImage_GetScanLine function? Suppose that FreeImage_GetBits returns a starting address aligned to 16 bytes: that doesn't mean that FreeImage_GetScanLine will also return a starting address aligned to 16 bytes, does it?
Hi Hervé,
It would if each scanline takes up a multiple of 16 bytes.
In theory it should be enough to allocate more memory for a bitmap than necessary, and put the palette and bitmap data on aligned addresses. Then modify the GetBits and GetPalette functions to read the data from the new addresses.
Personally I don't see much point in jumping through hoops to implement all this alignment. The speed issue is IMHO minimal, and we haven't seen speed complaints in the 4 years FreeImage has existed... except for rotation, and that has nothing to do with alignment.
Floris
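The scanline condition above can be checked with a little arithmetic. This is only a sketch: it assumes that a row's pitch is its byte length rounded up to a multiple of some padding value (4 bytes is assumed here for the current behaviour of CalculatePitch), and the function names are made up for illustration.

```c
#include <stddef.h>

/* Bytes actually occupied by one row of `width` pixels at `bpp` bits per pixel. */
unsigned line_bytes(unsigned width, unsigned bpp)
{
    return (width * bpp + 7) / 8;
}

/* Row length rounded up to a multiple of `align` bytes (a power of two). */
unsigned pitch_bytes(unsigned line, unsigned align)
{
    return (line + align - 1) & ~(align - 1);
}

/* Returns 1 if every scanline starts on a 16-byte boundary, provided the
 * first scanline does; that holds exactly when the pitch is a multiple of 16. */
int scanlines_16_aligned(unsigned width, unsigned bpp, unsigned align)
{
    return pitch_bytes(line_bytes(width, bpp), align) % 16 == 0;
}
```

With 4-byte padding, a 640-pixel-wide 32-bit image passes the check (pitch 2560) while a 641-pixel-wide one does not (pitch 2564). Padding rows to 16 bytes instead makes every width pass, at the cost of up to 15 extra bytes per row.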
Yep, you don't have to do this with FIBITMAP.
I just studied the source.. and this..
unsigned dib_size = sizeof(FREEIMAGEHEADER);
dib_size += sizeof(BITMAPINFOHEADER);
dib_size += sizeof(RGBQUAD) * CalculateUsedPaletteEntries(bpp);
dib_size += CalculatePitch(CalculateLine(width, bpp)) * height;
bitmap->data = (BYTE *)malloc(dib_size * sizeof(BYTE));
.. is nice, but I need a 16B-aligned pointer returned by FreeImage_GetBits and perhaps by GetPalette, too. So, you have to allocate these independently, because you cannot just insert some padding between the palette and the bitmap (you don't know yet what pointer malloc will return).
FreeImage_GetScanLine: It's up to the developer to choose the right dimensions and bpp (32), so that it returns aligned addresses correctly. It can be copied out of FreeImage for advanced usage.
BTW: It takes about 30ms to copy one 16-byte-aligned 32MB block to another on a P4 3GHz by ANY method, and it takes 70ms to copy the same block misaligned. So, you can see it really isn't a very big jump unless you do this many times.
PS: rotation: this has everything to do with the alignment since it accesses the same bitmap, right? Not to mention the rescaling.
Well, I don't know, I would just suggest to read this awesome document: http://agner.org/assem/pentopt.pdf
<quote>
dib_size += CalculatePitch(CalculateLine(width, bpp)) * height;
</quote>
Nice completely unaligned scanlines.
<quote>
So, you have to allocate these independently, because you cannot just insert some padding between the palette and the bitmap
</quote>
Why not?
<quote>
FreeImage_GetScanLine: It's up to the developer to choose the right dimensions and bpp (32),
</quote>
A 32-bit bitmap doesn't ensure 16 bytes alignment...
<quote>
PS: rotation: this has everything to do with the alignment since it accesses the same bitmap, right?
</quote>
Wrong. Because the rotation process accesses the bitmap pixel by pixel, it accesses the bitmap at memory addresses that are not 16-byte aligned. The alignment would only come into play when you, for example, copy entire scanlines. Besides that: rotation is something you usually do not do repeatedly in a loop.
Floris
Has the anti-performance club already finished? I don't have to prove anything. I am just trying to help. I can do all this stuff by myself and share nothing or share a modification of the library. This is basically how modifications arise (very sad, I know).
> Nice completely unaligned scanlines.
??? this code comes from FreeImage ;]
> Why not?
read twice
> A 32-bit bitmap doesn't ensure 16 bytes alignment...
that one is actually good - if there is a better one, will you let me know? 32bits = 4bytes = the best available bpp - dimensions.width = 4 = a perfect 16B alignment
> rotation
ok, that one is the winner!! - I wouldn't use any of the code from FreeImageToolkit, because it is just not optimized and it won't be until it is vectorized.
Now, it uses a scalar code optimized by a compiler = not optimized. Some people may think that a memory access doesn't concern the scalar code, but it does. A simple mov instruction is only a pure example. I must again refer to this: http://agner.org/assem/pentopt.pdf
BTW: 16-byte alignment is not the only alignment that matters. It is only required for SSE instructions. 8B is fine for scalar code and 4B is usually enough. (4B = 32b = 1 pixel of a 32bpp image = the pixel accessed by the rotation code)
best regards
I also think this would need a lot of modifications that may be complex to handle ...
About the rotation and rescale functions, they have been updated in the CVS and should be faster now :)
Hervé
Ok, I'm stupid. It is possible to insert some padding:
unsigned dib_size = sizeof(FREEIMAGEHEADER);
dib_size += sizeof(BITMAPINFOHEADER);
dib_size += (dib_size % 16 ? 16 - dib_size % 16 : 0); /* pad so the palette starts on a 16-byte offset */
dib_size += sizeof(RGBQUAD) * CalculateUsedPaletteEntries(bpp);
dib_size += (dib_size % 16 ? 16 - dib_size % 16 : 0); /* pad so the pixels start on a 16-byte offset */
dib_size += CalculatePitch(CalculateLine(width, bpp)) * height;
bitmap->data = (BYTE *)malloc_align(dib_size * sizeof(BYTE), 16);
... and change the other procedures accordingly..
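As a rough illustration of what "the other procedures" would have to compute, here is a minimal sketch of the padded layout. The struct and function names are hypothetical and the sizes are passed in as plain numbers; the real FreeImage accessors are not reproduced here.

```c
#include <stddef.h>

/* Round `n` up to the next multiple of 16. */
size_t round_up_16(size_t n)
{
    return (n + 15) & ~(size_t)15;
}

/* Hypothetical description of one padded block: byte offsets of the palette
 * and the pixels from the start of the block, plus the total size to allocate. */
typedef struct {
    size_t palette_offset;
    size_t pixels_offset;
    size_t total_size;
} dib_layout;

/* header_size:  combined size of the headers in front of the palette
 * palette_size: bytes used by the palette (0 if there is none)
 * pixel_size:   pitch * height */
dib_layout compute_layout(size_t header_size, size_t palette_size, size_t pixel_size)
{
    dib_layout l;
    l.palette_offset = round_up_16(header_size);
    l.pixels_offset  = round_up_16(l.palette_offset + palette_size);
    l.total_size     = l.pixels_offset + pixel_size;
    return l;
}
```

If the block itself starts on a 16-byte boundary, adding these offsets to its base address gives 16-byte-aligned palette and pixel pointers, which is what accessors such as FreeImage_GetPalette and FreeImage_GetBits would then need to return.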