
Quality of Up/Down sampling implementations

2016-04-16
2021-05-12
  • maraskan-user

    maraskan-user - 2016-04-16

    Recently I created a few sample images to check the output of my TFTs, to see if they properly distinguish between very dark values like (1,1,1), (2,2,2) and so on, and black (0,0,0). Basically, I created black images of 1920x1200 size (which equals my screen resolution) in an image editor and wrote numbers on them in a very large font size (almost screen height) at different brightness levels. Watched in full screen in a dark room, one can still see the numbering on a good screen.

    But what rather surprised me is that when applying some up- or downsampling to the image, the large numbers seemed to become darker. So I made a screenshot, and indeed, the (2,2,2) RGB values had become (1,1,1), and (1,1,1) had become (0,0,0) black. This happens at the low end of the brightness scale. I only tried (1,1,1) to (20,20,20), and they all show darkening. When closely examined in an image editor, the slightly brighter tones also show something that looks like dithering when downscaling, and a slight grid-like pattern in areas that used to be solid colors when upscaling.

    So I tried IrfanView's default Lanczos filter for display resampling for comparison; it showed neither the darkening observed with JpegView nor the patterns. I also tried ImageMagick with its default Lanczos filter, resizing the source images slightly and checking them in my editor: again no darkening, dithering or patterns.

    This happens with both the HQ_SSE_MMX code path and the normal HQ one (but not with PointSampling, of course). It's actually even more pronounced on the non-SSE code path, with intensity becoming 2 RGB units darker instead of 1.

    Guess the filter kernel might need some rework...

     
  • David Kleiner

    David Kleiner - 2016-04-18

    What you see are rounding error artifacts caused by the integer math used to perform the filtering.
    If you have suggestions how to improve the precision of this integer math, your input is welcome.
    Maybe the filter normalization is incorrect, but I don't find anything wrong there.
    What I don't want to use is floating point math for the filtering; that is too slow.

     
  • David Kleiner

    David Kleiner - 2016-04-19

    I was able to improve the precision of the integer math somewhat. Solid areas now keep their exact value on up- and downsampling (tested with the values 1, 2, 4, 195, 250). Patterns inside solid areas do not appear anymore.
    What remains are slight dithering-like artifacts on gradients in the SSE code path, but that is less distracting.
    The non-SSE code path is completely correct now, as it uses 32 bit fixed point integer math, compared to the 16 bits used in the SSE path. 32 bit integer should result in equal (or even better) precision than using float.
    Consider that the default filter performs some sharpening, thus expect some halo around gradients - that is the expected result of this sharpening.
    You can use Lanczos for downsampling by editing the INI file. Upsampling is always bicubic.

     
  • maraskan-user

    maraskan-user - 2016-04-20

    Awesome, thank you. :)

     
  • maraskan-user

    maraskan-user - 2016-04-21

    Also, thanks for the tip to use Lanczos (i.e. Filter_Downsampling_No_Aliasing) for downsampling.
    I had been using Filter_Downsampling_Best_Quality for a long time (you know, since it says "Best_Quality"... :) ) and had actually forgotten about the existence of this setting, but lately I became aware that it gives results pretty similar to IM's "-scale" option, which uses a slightly improved box filter.
    Filter_Downsampling_No_Aliasing, on the other hand, is visually very similar to IM's Catrom filter, which has lately become my filter of choice for downsampling.

    On a different note, I notice that JpegView does its resizing without accounting for colorspace/gamma (see http://www.4p8.com/eric.brasseur/gamma.html and http://www.imagemagick.org/Usage/resize/#techniques ). This tends to make edges darker than they should be. Do you think JpegView's internal precision would be enough to handle resampling in linear RGB versus an image's (usually non-linear) native color space?

    Edit:
    The non-SSE path can indeed handle filtering in linear RGB. My own fork now uses a LUT in ApplyFilter() for conversion from 8-bit sRGB to 12-bit linear RGB before calling the filter kernels, and another LUT to convert back to 8-bit sRGB after them. 12 bits is the minimum precision needed to handle linear RGB without error. The output seems correct now, but the speed of the non-SSE path has decreased by around 20%. No idea if there's a better way to do it.

     

    Last edit: maraskan-user 2016-04-25
  • maraskan-user

    maraskan-user - 2016-04-25

    The changed code section in BasicProcessing.cpp's ApplyFilter() function for doing the filtering in linear RGB looks like this:

    for (int n = 0; n < pKernel->FilterLen; n++)
    {
        nPixelValue1 += pKernel->Kernel[n] * sRGB8_LinRGB12[pSourcePixel[0]];
        nPixelValue2 += pKernel->Kernel[n] * sRGB8_LinRGB12[pSourcePixel[1]];
        nPixelValue3 += pKernel->Kernel[n] * sRGB8_LinRGB12[pSourcePixel[2]];
        pSourcePixel += nSourceBytesPerPixel;
    }

    // 8191 is the rounding correction: in the filter, 1.0 is 16383, but we shift by 14, which is a division by 16384
    *pTargetPixel++ = (uint8)LinRGB12_sRGB8[max(0, min(4095, ((nPixelValue1 + 8191) >> 14)))];
    *pTargetPixel++ = (uint8)LinRGB12_sRGB8[max(0, min(4095, ((nPixelValue2 + 8191) >> 14)))];
    *pTargetPixel++ = (uint8)LinRGB12_sRGB8[max(0, min(4095, ((nPixelValue3 + 8191) >> 14)))];
    

    The LUTs were added in BasicProcessing.h and are included in the attachment. I created them with an AutoHotkey script which also allows customizing input and output bit-depth and formatting (also attached).

     

    Last edit: maraskan-user 2016-04-26
  • David Kleiner

    David Kleiner - 2016-04-27

    Thanks for your proposal. I made some experiments with your code and came to the following results:
    * There is a slight visual improvement from filtering in linear RGB, but on most images it is hardly visible.
    * The non-vector path can do the filtering with sufficient accuracy because it uses 32 bit integer.
    * The non-vector path, however, has a quite large error (blocking artifacts) because it uses only 8 bits between the two resampling steps in x and y. This can be observed also when filtering in sRGB. Luckily this path is almost never in use, because all modern CPUs support at least MMX.

    However, there is a problem with the vector (MMX, SSE, AVX) path when using linear RGB filtering: there the precision is not sufficient and artifacts appear. These are worse than the improvements, thus I leave the filtering as it is, providing exact results in 8 bit without any artifacts. The vector path has no blocking artifacts because it uses intermediate 16 bit images between the x- and y-resampling.

    Here is why 16 bit FP math is not sufficient:
    As you mentioned, linear filtering needs at least 12 exact bits in the filter calculation. The vector path calculates the filter with a signed 1.15 (pixel) x 2.14 (filter) fixed point multiplication, resulting in a 3.29 FP result; taking the upper 16 bits of the result gives 3.13. Thus each multiplication has only 13 bits of precision, with bit 13 being unstable, as always with the last bit. When doing 16 multiplications (the maximum filter length), the result can be off by 16 in the worst case; typically it is around 8 off. Thus 3-4 bits are lost here, making the result of the filter kernel exact to only 9 to 10 bits. That is enough for the 8 bit output, but it is not enough for filtering in linear RGB, which would require 12 bits or more.

     
  • maraskan-user

    maraskan-user - 2016-04-27

    Thank you for your explanation, though since I have no experience with MMX/SSE/AVX, I might not be able to grasp it in its entirety. I read on Wikipedia that SSE2 and AVX support higher precision floats and integers as well, so I was wondering why you chose the specific 16 bit FP format.

    Anyhow, filtering in linear RGB does indeed only show a difference when downsampling images with tiny, high-contrast details, like sharp lines and dots. Think stars in a dark night sky, or the lights of cities seen from space, like in the image http://eoimages.gsfc.nasa.gov/ve//1438/earth_lights_4800.tif.
    Without linear filtering, these details become darker than they should be and sometimes can't even be seen anymore.

    You mentioned blocking artefacts when using the non-vector code path. I haven't noticed anything like this yet and I'm not sure what I would need to look for. Could you provide me with some examples or images that show these versus the same image without them, please? I hadn't been aware of such problems until now: the output of my JpegView fork (when using the non-vector path + Filter_Downsampling_No_Aliasing + Sharpening 0 + linear RGB LUTs) is visually indistinguishable from the output of ImageMagick downsampling the same images to the same resolution with the Catrom filter. My only gripe at the moment is that processing speed without CPU extensions is now 5x lower than when I was still using SSE (200 ms vs 40 ms for a 5 MPixel image)...

     

    Last edit: maraskan-user 2016-04-27
  • David Kleiner

    David Kleiner - 2016-04-28

    16 bit integer is the only format supported by MMX, SSE and AVX alike, only with different parallelism (64 to 256 bits). Of course it is also at least twice as fast as 32 bit. And the precision is sufficient as long as you stay in sRGB.
    The blocking artifacts can be seen in the attached examples. Both are magnifications by a factor of 16, with a strong contrast enhancement to make the discrete number isolines (1, 2, 3, 4 RGB values) visible. As you can see, the vector path results in the expected, smooth bicubic isocurves. The non-vector path does not, because of the intermediate 8 bit precision. But the poor results of the non-vector path are sort of irrelevant (except for your branch...) because the vector path is way faster and modern CPUs all support SIMD.

     
  • maraskan-user

    maraskan-user - 2016-04-28

    Ah, now I see. I didn't realize it before, since I usually don't upscale images that much. Yes, that does need some improvement in the non-vector path. I need to think about that some more. Ideally, I'd much rather get the vector path working at twice the precision (and half the speed... which would still be 2.5 times the speed of the non-vector path) than try to fix up the non-vector path in various ways, which would probably make it even slower still.

    Concerning MMX, the Wikipedia entry mentions: "MMX defined eight registers [...]. Each register is 64 bits wide and can be used to hold either 64-bit integers, or multiple smaller integers in a "packed" format: a single instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once." That's why I thought that it would have been possible to choose other integer widths as well.
    About SSE, it states that it "used only a single data type for XMM registers", but that SSE2 would later expand the usage of the XMM registers to include:
    - two 64-bit double-precision floating point numbers, or
    - two 64-bit integers, or
    - four 32-bit integers, or
    - eight 16-bit short integers, or
    - sixteen 8-bit bytes or characters,
    which is why I'm a little confused why you said that 16 bit integer was the only format supported by MMX, SSE and AVX.

     

    Last edit: maraskan-user 2016-04-28
  • maraskan-user

    maraskan-user - 2016-05-01

    Ok, so I fixed the visible blocking in the non-vector path in my test branch. It uses int64 instead of int8, though, which makes it 4x slower. (Seriously... two independent (x/y) filtering loops without clamping blow up the numbers by 28 bits. Plus a 12 bit source makes for 40 bits.) Back to trying something else...

    Edit: Yes, directly using float is significantly faster than int64.

     

    Last edit: maraskan-user 2016-05-01
  • David Kleiner

    David Kleiner - 2016-05-01

    To fix the blocking artifacts, it should be sufficient to store the result of the x-resize filter in 16 bit instead of 8 and then do the y-resize on this 16 bit image. 64 bit should not be required.
    When doing experiments with the vector path, consider that a signed 32 bit integer multiplication is only available since SSE 4.1. It is required for filtering, however, because some filter elements are negative. Or you use float (available in SSE, but not MMX), but this needs additional conversions int -> float and back to int.

    PS: Only for the 16 bit format is a signed integer multiplication instruction available across all MMX, SSE and AVX versions. That is why it is very hard to provide a generic MMX/SSE/AVX vector path with more bits.
    But luckily 16 bit is enough if you stay in the source color space and the source has 8 bits - which is quite often the case when processing images. Intel also knew that; that is why it took many years until 32 bit integer was fully supported by the SIMD instructions.

     

    Last edit: David Kleiner 2016-05-01
  • maraskan-user

    maraskan-user - 2016-05-02

    Today I tried my luck (since I don't know very well what I'm doing...) at converting the vector path, using SSE4 intrinsics to handle 4x32 bit int instead of 8x16 bit int, but the image still looks almost like pure noise, even though some structure is recognizable. Speed would be pretty decent, though. Does the ApplyFilter_SSE() function seem reasonable? pSourceImg is already 32 bit wide. I'm using intrinsics for the 32 bit code path.

    [Edit: I put the code in attachment "CodeSample_Int32Vector.cpp"]

     

    Last edit: maraskan-user 2016-05-07
  • David Kleiner

    David Kleiner - 2016-05-02

    Hmm, it will not work like that.
    Let me explain the filter loop for one channel in 16 bit; then you will see how to do it in 32 bit:

    Notation: A.B FP means a fixed point format with A bits for the integer part and B bits for the fractional part.
    Multiplying two such numbers with the normal integer multiplication results in:
    A.B x C.D = (A + C).(B + D)

    xmm2: 8 source image pixels, each 10.6 FP, first 2 bits always zero, thus actually 8.6
    xmm7: 8 filter elements, each 2.14 FP
    xmm4: Sum(pixel[i] * filter[i]), 10.6 FP

    // Result of 16 x 16 bit multiplication is 12.20 FP.
    // Taking high 16 bit, would result in 12.4 FP.
    // To get back to 10.6 format, we need to shift left by 2, thus multiply by 4
    // Premultiplication is better for accuracy, but we have only one bit left for that in xmm2, not two.
    // Reason: xmm2 is multiplied signed.
    // Thus: Premultiply by 2, postmultiply by 2.
    // 2*a = a + a, thus we just add...

    // Premultiplication: xmm2 = 2*xmm2
    xmm2 = _mm_add_epi16(xmm2, xmm2);

    xmm2 = _mm_mulhi_epi16(xmm2, xmm7);

    // 2 * 10.6 x 2.14 = 12.20, taking upper 16 bits, thus 2 * 12.4

    // Postmultiplication: xmm2 = 2*xmm2, thus now 4 * 12.4, thus 10.6 again
    xmm2 = _mm_add_epi16(xmm2, xmm2);

    // Sum up, saturate to signed 16 bit integer
    xmm4 = _mm_adds_epi16(xmm4, xmm2);

    For 32 bit, there is unfortunately no _mm_mulhi_epi32, not even with SSE 4!
    Thus you cannot use the scheme above and must do the FP calculations as you did in your integer path, which gives a rather low resulting precision despite the 32 bits, i.e.
    Pixels: 8.4 FP
    Filter: 2.14 FP

    The multiplication is then: 10.18, leaving room for some adds. You need this room, thus you cannot use more fractional bits for the pixels or the filter; 12 bits is the maximum.
    Thus make sure you store the pixels in 8.4 FP in pSourceImage (but as 32 bit, thus 20 zero bits, then the 12 bits output from your LUT) and use the code shown below for each channel.
    The filter must be 2.14, but also stored as 32 bit per filter element!
    Don't use the emulated saturated add; it is not required in your non-vector path, thus also not here.

                xmm7 = *pFilter;
                xmm2 = *pSource;
                xmm2 = _mm_mullo_epi32(xmm2, xmm7);
                xmm4 = _mm_add_epi32(xmm4, xmm2);
    

    After the filter loop, you need to round, saturate and shift back to 8.4 FP:
    Round: Add 0.5 in 2.14, thus 2^13 = 8192
    Saturate: min(8.18, max(0, value)), 8.18 is 0x3FFFFFF
    Shift back: value >> 14

    Probably the 12 bits of precision you get are not enough to avoid artifacts. But more is not doable with integer SSE; otherwise you need to use SSE floating point (the float data type).

     
  • David Kleiner

    David Kleiner - 2016-05-02

    Post edit note:

    There are other bugs in your code, e.g.
    DECLARE_ALIGNED_DQWORD(ONE_XMM, 16383 - 42);

    You need to saturate against integer value 0x3FFFFFF now, thus declare 4 aligned integer values of 0x3FFFFFF
    DECLARE_ALIGNED_QUAD_INTEGER(ONE_XMM, 0x3FFFFFF);
    load it into xmm0 and use
    _mm_min_epi32(xmm, xmm0);

     
  • David Kleiner

    David Kleiner - 2016-05-02

    Post post edit note:

    You already use it in your code... There is the intrinsic
    __m128i xmm0 = _mm_set1_epi32(0x3FFFFFF);

    to set 4 signed integers. That is easier than using a macro.

     
  • maraskan-user

    maraskan-user - 2016-05-03

    Thanks for your detailed explanation. I'm gradually getting closer to understanding the workings of the SSE part of the code. Since it was more straightforward, I got the float version of the SSE path working first. Results look correct and speed is about 50% of the 16-bit int SSE path. I haven't implemented the linear RGB tables yet, and there is still at least one major bug that leads to an exception with a very few select images, maybe due to their exact dimensions. I have very little experience with debugging in Visual Studio, so it might take a while to figure that out. Weirdly though, using a value of "3" instead of "2" in this line of ApplyFilter_SSE()

    int nNumberOfBlocksX = (nEndXAligned - nStartXAligned) >> 2;
    

    prevents the crash, but the image then is garbled. I assume I need to use "2" to divide the image into 4 blocks instead of 8 as before, probably because I'm now only using 4 values per register instead of 8. The image only looks right with "2", but that also correlates with crashes on certain images, which makes me think I overlooked another hard-coded value somewhere else that implicitly depends on this and that I would need to adjust...

    Edit:
    I fixed the crashing by letting GetMemSize() return 1 byte more. I don't know yet why the allocated memory is too small in some width/height situations, though.

    int GetLineSize() const { return m_nPaddedWidth*4; }
    int GetMemSize() const { return ((GetLineSize()*3*m_nPaddedHeight)+1); }    
    

    Edit2:
    Added the LinearRGB LUTs. Image display is correct now, but contrary to the non-vector path, speed did not decrease at all. Interesting.

    Anyway, the code right now looks like this.
    [Edit: I put the code in attachment "CodeSample.cpp".]

    I also changed a line in CalculateXMMFilterKernels() from

    pCurKernelXMM->Kernel[j].valueRepeated[k] = m_kernels.Kernels[i].Kernel[j];
    

    to this

    pCurKernelXMM->Kernel[j].valueRepeated[k] = (((float)(m_kernels.Kernels[i].Kernel[j])) / 16383.0);
    

    so the kernel outputs a maximum value of 1.0. I suppose I'll create a modified version of CalculateFilterKernels() that outputs float instead of int directly.

     

    Last edit: maraskan-user 2016-05-07
  • maraskan-user

    maraskan-user - 2016-05-07

    I noticed another artifact that is present in both the vector and the non-vector path. It's slightly more visible when using linear color space for the resize, though. The attached "source.png" was upscaled to a size of 1208x1138, once by JpegView and once by ImageMagick. (For easier visibility, I magnified a region of that output by a factor of 6x in the attached result images.)
    It looks like there is a kind of stepping effect in JpegView's result, and there are also what look like duplicated consecutive rows and columns that are around 73 pixels apart in the original upscaled image (i.e. 438 pixels in the 6x zoomed-in version).

    [Edit]
    Used ImageMagick command line was
    convert.exe -colorspace RGB -filter catrom -resize x1138 -colorspace sRGB "%InputFullPath%" "%OutputFullPath%"

    JpegView in this case used SSE path with Float and linear RGB, but the non-vector and non-linear path shows the same artifacts as well.

     

    Last edit: maraskan-user 2016-05-07
  • David Kleiner

    David Kleiner - 2016-05-11

    The reason is that only 32 kernels are used for bicubic resizing, causing artifacts for large resize factors, as you have found. Except in the 'Resize image' dialog, upscaling factors in JPEGView are limited to x16, thus not yet showing the issue.
    I have increased the number of filter kernels to 128 now; then the artifacts go away, except for factors > x100, which is not very relevant IMHO.

    If you want to do the same in your linear version, change the following lines in ResizeFilter.cpp:

    // We cannot enlarge to a factor above the number of kernels without artefacts
    #define NUM_KERNELS_RESIZE 128
    #define NUM_KERNELS_RESIZE_LOG2 7
    
     
  • maraskan-user

    maraskan-user - 2016-05-11

    Ah, that explains it! In my branch, I haven't been limiting the zoom factor itself, just the resulting image width and height to <=65535 and >=1. With the above change, the zoomed source image looks as smooth as IM's output. Thank you!

     
  • maraskan-user

    maraskan-user - 2020-05-01

    Hmm, what is the specific reason for using
    MAX_IMAGE_PIXELS = 1024 * 1024 * 500;
    instead of
    MAX_IMAGE_PIXELS = 4294836225; // 65535 * 65535
    since MAX_IMAGE_DIMENSION is 65535 anyway?

    I'm asking because I'm getting crashes with an image with a size of 20848 x 29440, which amounts to 613 MPixels, but only in the HQ resampling step. And that's true for both my fork and the official one. I'm not sure why, but some calls to CProcessingThread::DoProcess() never complete with very large pictures, probably because of something happening downstream in the resampling stage.

     
  • SylikC

    SylikC - 2021-05-12

    @maraskan-user a fork you have, you say... I'm interested! Where can I find your fork? Does it have new features? I've started a fork myself and looking for help :D https://github.com/sylikc/jpegview

     
    • maraskan-user

      maraskan-user - 2021-05-12

      @SylikC It can't be found anywhere, I'm afraid. I took a sledgehammer to the original code and threw out anything I didn't use, in order to more easily understand the workings of, and experiment with, the remaining parts. My main concern was the inclusion of a more correct resampling for line art (e.g. comics and manga) in linear space (lines will always come out too dark when calculated in gamma space), since no image viewer in existence seemed to do that either. It's a minimalist viewer with no user interface, only controlled by keys. If you still insist, I could upload the source for you somewhere.

       
