Hi,
I've taken a cursory look through ways to speed up server operation,
and one thing that I thought relevant is that if the Opus libraries are
not compiled using FIXED_POINT (FIXED_POINT would seem like a bad idea
on typical general-purpose architectures and indeed does not seem
enabled), then natively the Opus libraries deal with floating point and
explicitly convert to fixed point when using the fixed point API.
Now it turns out that floating point is a lot better suited to handling
mixing (particularly once one involves AVX extensions) because it deals
a lot more gracefully with temporary and permanent overflow and also has
special SIMD instructions available that could greatly speed up
operation.
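To make the overflow point concrete, here is a minimal C++ sketch. The helper names `mix_saturating_i16` and `mix_float` are made up for this illustration and are not taken from the Jamulus sources; the sketch only assumes that float full scale is 1.0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sum n 16-bit samples, saturating after every
// addition so that the running sum always fits into int16_t.
std::int16_t mix_saturating_i16(const std::int16_t* samples, std::size_t n) {
    assert(samples != nullptr || n == 0);
    std::int16_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const int wide = static_cast<int>(acc) + static_cast<int>(samples[i]);
        acc = static_cast<std::int16_t>(std::max(-32768, std::min(32767, wide)));
    }
    return acc;
}

// Hypothetical helper: sum n float samples (full scale = 1.0) and clamp
// only once, after the last addition.
float mix_float(const float* samples, std::size_t n) {
    assert(samples != nullptr || n == 0);
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += samples[i];
    return std::max(-1.0f, std::min(1.0f, acc));
}
```

With the inputs 0.75, 0.75 and -0.875 (24576, 24576 and -28672 as 16-bit values), the float path returns 0.625, while the saturating 16-bit path returns 4095 instead of 20480: the temporary overflow after the second addition has already been clipped away.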
The obvious disadvantage is that a "natural" .wav dump when recording
would end up twice its current size. But letting the wav
recorder reduce to int would seem like a sensible option, assuming that
queuing up the floats does not end up slower than converting and queuing
up the shorts.
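A reduction step of that kind could look roughly like the following C++ sketch. The function name `floats_to_pcm16` and the scaling by 32767 are assumptions made for illustration; nothing here is taken from the actual recorder code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical recorder helper: convert n float samples (full scale = 1.0)
// to 16-bit PCM, clamping out-of-range values instead of letting them wrap.
void floats_to_pcm16(const float* in, std::int16_t* out, std::size_t n) {
    assert((in != nullptr && out != nullptr) || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        float scaled = in[i] * 32767.0f;
        if (scaled > 32767.0f) scaled = 32767.0f;
        if (scaled < -32768.0f) scaled = -32768.0f;
        // lrint rounds according to the current rounding mode
        // (round-to-nearest by default).
        out[i] = static_cast<std::int16_t>(std::lrint(scaled));
    }
}
```

Whether this conversion is cheaper than queuing the floats directly would have to be measured; the sketch only shows that the reduction itself is a short loop.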
At any rate, at least GCC (and there may be a reasonable expectation for
servers that they are compiled with GCC) offers a deluge of options for
compiling using AVX and similar intrinsics in a manner where the ELF
executables will pick the best version at runtime. So even in a binary
distribution, it's feasible to use stuff that may not be available for
all targeted platforms.
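One GCC mechanism of this kind is the `target_clones` function attribute, which builds several versions of a function and selects one when the program is loaded. The sketch below is only an illustration under assumptions: the kernel `mix_add` and the macro `MIX_CLONES` are hypothetical names, the attribute is enabled only for GCC on x86-64 with glibc (it relies on ifunc support), and the loop is only auto-vectorized when optimization is turned on.

```cpp
#include <cassert>
#include <cstddef>

// target_clones is a GCC extension that needs ifunc support (glibc on ELF).
// Elsewhere the macro expands to nothing and a single version is built.
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
#define MIX_CLONES __attribute__((target_clones("avx2", "avx", "default")))
#else
#define MIX_CLONES
#endif

// Hypothetical mixing kernel: acc[i] += gain * in[i] for n samples.
// With MIX_CLONES active, GCC emits one clone per listed target and a
// resolver that picks the best clone for the CPU at load time.
MIX_CLONES
void mix_add(float* acc, const float* in, float gain, std::size_t n) {
    assert((acc != nullptr && in != nullptr) || n == 0);
    for (std::size_t i = 0; i < n; ++i) acc[i] += gain * in[i];
}
```

The caller simply calls `mix_add`; no explicit CPU detection code is needed in the application.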
Has anybody experimented with converting at least the server-side
operation to floating point?
--
David Kastrup
I'm not sure if this is related?
https://github.com/jamulussoftware/jamulus/issues/544
It certainly is! It is presented as a matter of maintaining precision for sound cards delivering more than 16 bits, and doing so at a cost (1% of processing power). However, the way I see it, this provides an entry into significantly more efficient processing using the AVX extension for SIMD processing of floating-point values (namely eight 32-bit floats at a time). I'll have to look at the patch in question first, though. I don't see that there is a lot to be gained for client-side processing; the real (and reasonably low-hanging) payoff would be at the server side. I'll take a look at what the patch presented there purports to do and then come back to say how I think this would relate to what I propose.
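For readers unfamiliar with AVX, "eight 32-bit floats at a time" could be written with intrinsics roughly as in the C++ sketch below. The function name `add_block` is hypothetical, and the intrinsic path is only compiled when the compiler is told to target AVX (for example with `-mavx`); otherwise the plain scalar loop does all the work.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical kernel: acc[i] += in[i] for n samples.
void add_block(float* acc, const float* in, std::size_t n) {
    assert((acc != nullptr && in != nullptr) || n == 0);
    std::size_t i = 0;
#if defined(__AVX__)
    // One 256-bit register holds eight 32-bit floats.
    for (; i + 8 <= n; i += 8) {
        const __m256 a = _mm256_loadu_ps(acc + i);  // unaligned load
        const __m256 b = _mm256_loadu_ps(in + i);
        _mm256_storeu_ps(acc + i, _mm256_add_ps(a, b));
    }
#endif
    // Scalar tail (or the whole buffer when AVX is not enabled).
    for (; i < n; ++i) acc[i] += in[i];
}
```

In practice the compiler's auto-vectorizer often produces equivalent code from the scalar loop alone, so hand-written intrinsics are one option rather than a requirement.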
Actually, the related issue rather is https://github.com/jamulussoftware/jamulus/pull/535/commits/1d7dec739a4a7a06cfe70e4f76d85e577ae24f7f
And it would seem that the integrate_float2 branch is sort of supposed to do something similar? The last time master was merged into it was in October. Not sure what the idea with this branch is.
I seem to recall a discussion in which it was concluded that floats wouldn't make much difference, but I'm not sure where that is. Perhaps on another ticket?
I see https://github.com/jamulussoftware/jamulus/issues/544#issuecomment-753603959 but
a) I don't get the argument "someone else does it so it is unnecessary" (it's not like a good idea is exhausted once somebody followed through with it)
b) the biggest reason for me is that it would make the O(n^2) operation of mixing on the server amenable to SIMD via AVX (for x86-based servers) and thus could really speed up operations a lot with comparatively small code replacements (which I know how to do in GCC): a consideration which I really have not seen in the discussion
c) it also would make clipping behave a lot more gracefully than the integer variant of wrapping around
So essentially I don't share the conclusion.
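The O(n^2) cost in point b) comes from every one of the n connected clients receiving an individual mix of all n input channels. A minimal C++ sketch of that structure follows; the function name `mix_for_all_listeners` and the gain matrix layout `gain[listener][source]` are assumptions made for illustration and do not mirror the actual server code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical server mix: in[s] is the decoded frame of source s,
// gain[l][s] is the level listener l has set for source s.
// Work is n * n * frame length, i.e. quadratic in the number of clients.
std::vector<std::vector<float>> mix_for_all_listeners(
    const std::vector<std::vector<float>>& in,
    const std::vector<std::vector<float>>& gain) {
    const std::size_t n = in.size();
    assert(gain.size() == n);
    const std::size_t frame = n ? in[0].size() : 0;
    std::vector<std::vector<float>> out(n, std::vector<float>(frame, 0.0f));
    for (std::size_t l = 0; l < n; ++l) {
        assert(gain[l].size() == n);
        for (std::size_t s = 0; s < n; ++s) {
            assert(in[s].size() == frame);
            const float g = gain[l][s];
            // Innermost loop: a multiply-add over contiguous floats,
            // which is the part a SIMD unit can process in wide chunks.
            for (std::size_t i = 0; i < frame; ++i) out[l][i] += g * in[s][i];
        }
        // Clamp once per listener instead of wrapping around (point c).
        for (float& v : out[l]) v = std::max(-1.0f, std::min(1.0f, v));
    }
    return out;
}
```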
Actually I think that the focus of the FP patch is quite different: it appears to be intended as an architectural thing from driver to client to server, using float everywhere, and particularly enabling 24-bit operation in soundcards etc.
That's not my interest: I was interested just in optimising the server operation. Since the transport is done compressed (with Opus), there is not much of a point in using more than 16 bits of resolution on the client side. On the server side, however, using a float-only workflow allows using SIMD instructions for the mixing stage, and float is the "natural" format for the Opus decoder/encoder anyway.
I think that the problems with the FP patch's audio behavior were client-side: I don't have the experience particularly with Windows to help there, but if I had chosen to code from scratch, I'd not have touched the client code anyway so those problems would not have been an issue.
Ok, one thing I did not keep in mind: server and client share a whole lot of code, particularly so because they are the same executable and the server does not need to run headless (like my server always does). So just making the server operate in float is somewhat more tricky. One way to do this may be to template all the respective classes so that they deal either in short or in float.
That would cause code duplication and a bit of complication but it would have the advantage of making it rather easy to conduct comparisons and switch operation back and forth (like, when compiling for some headless client or server on an architecture weak on floating point).
It would also make it feasible to offer both float and short interfacing to the sound card without performance loss (though again at the cost of code duplication, most of which is done by the template engine).
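The templating idea could be sketched in C++ as follows. The names `SampleTraits` and `Mixer` are hypothetical and are not classes from the Jamulus sources; the sketch only shows how one class body can serve both short and float samples, with the per-type differences pushed into a small traits struct.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical traits: accumulator type and final clamping per sample type.
template <typename Sample> struct SampleTraits;

template <> struct SampleTraits<std::int16_t> {
    using Acc = std::int32_t;  // wider accumulator avoids intermediate overflow
    static std::int16_t finish(Acc a) {
        return static_cast<std::int16_t>(
            std::max<Acc>(-32768, std::min<Acc>(32767, a)));
    }
};

template <> struct SampleTraits<float> {
    using Acc = float;  // full scale assumed to be 1.0
    static float finish(Acc a) { return std::max(-1.0f, std::min(1.0f, a)); }
};

// Hypothetical mixer, written once and instantiated for short or float.
template <typename Sample>
class Mixer {
public:
    explicit Mixer(std::size_t frame_len) : acc_(frame_len) {}

    void add(const std::vector<Sample>& in) {
        assert(in.size() == acc_.size());
        for (std::size_t i = 0; i < in.size(); ++i) acc_[i] += in[i];
    }

    std::vector<Sample> result() const {
        std::vector<Sample> out(acc_.size());
        for (std::size_t i = 0; i < acc_.size(); ++i)
            out[i] = SampleTraits<Sample>::finish(acc_[i]);
        return out;
    }

private:
    std::vector<typename SampleTraits<Sample>::Acc> acc_;
};
```

Switching a build between `Mixer<std::int16_t>` and `Mixer<float>` then becomes a one-line type alias, which is what would make side-by-side comparisons easy.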
This feels like something to discuss on a new GitHub ticket, perhaps. Maybe reference https://github.com/jamulussoftware/jamulus/issues/544?