- switch from qsort() to built-in quicksort (about 2x as fast a sort, ~10% faster overall)
- use pool allocator for active-edge allocations (~10% faster overall)
- use new rasterizer (about 30% faster, ~10% faster overall)
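As a rough illustration of the pool-allocator item above, here is a minimal sketch of a fixed-size pool with a free list. The names (pool, pool_alloc, pool_free, and so on) are hypothetical and not taken from the actual code. It assumes items need no stricter alignment than double.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Chunk header; the union keeps the items that follow it aligned.
   Assumption: items need no stricter alignment than double. */
typedef union pool_chunk {
    union pool_chunk *next;
    double align;
} pool_chunk;

typedef struct {
    pool_chunk *chunks;    /* every chunk owned by the pool */
    void *free_list;       /* recycled items, linked through their first bytes */
    size_t item_size;      /* bytes per item, rounded up for alignment */
    int items_per_chunk;
    int left_in_chunk;     /* unused items left in the newest chunk */
} pool;

static void pool_init(pool *p, size_t item_size, int items_per_chunk)
{
    size_t a = sizeof(pool_chunk);
    assert(items_per_chunk > 0);
    if (item_size < sizeof(void *)) item_size = sizeof(void *);
    p->item_size = (item_size + a - 1) / a * a;
    p->chunks = NULL;
    p->free_list = NULL;
    p->items_per_chunk = items_per_chunk;
    p->left_in_chunk = 0;
}

static void *pool_alloc(pool *p)
{
    if (p->free_list) {                 /* reuse a freed item first */
        void *item = p->free_list;
        p->free_list = *(void **)item;
        return item;
    }
    if (p->left_in_chunk == 0) {        /* newest chunk exhausted: get another */
        pool_chunk *c = malloc(sizeof(pool_chunk)
                               + p->item_size * (size_t)p->items_per_chunk);
        if (!c) return NULL;
        c->next = p->chunks;
        p->chunks = c;
        p->left_in_chunk = p->items_per_chunk;
    }
    --p->left_in_chunk;
    return (char *)(p->chunks + 1) + p->item_size * (size_t)p->left_in_chunk;
}

static void pool_free(pool *p, void *item)
{
    *(void **)item = p->free_list;      /* push onto the free list */
    p->free_list = item;
}

static void pool_destroy(pool *p)
{
    while (p->chunks) {                 /* release all chunks at once */
        pool_chunk *next = p->chunks->next;
        free(p->chunks);
        p->chunks = next;
    }
    p->free_list = NULL;
    p->left_in_chunk = 0;
}
```

Most calls to pool_alloc() then cost a pointer pop or a subtraction instead of a trip through malloc(), which is the likely source of a speedup of this kind.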
When compiling with -O3, gcc warns that 'linear' might be used
uninitialized if comp is greater than 4.
Passing a value > 4 is an error anyway, but gcc does not know
that. I changed the switch statement to handle comp > 4. I don't think
it should affect performance.
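A minimal sketch of the idea, not the actual code: routing the larger channel counts through a default label means every path through the switch assigns 'linear', so the compiler has nothing to warn about. The function name and the channel layout are assumptions for illustration.

```c
#include <assert.h>

/* Hypothetical helper: expand one pixel with `comp` channels to linear RGB. */
static void pixel_to_linear(const float *pixel, int comp, float linear[3])
{
    assert(comp >= 1);
    switch (comp) {
    case 1:    /* grey */
    case 2:    /* grey + alpha */
        linear[0] = linear[1] = linear[2] = pixel[0];
        break;
    default:   /* 3, 4, and (invalid) larger counts: read the first three channels */
        linear[0] = pixel[0];
        linear[1] = pixel[1];
        linear[2] = pixel[2];
        break;
    }
}
```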
The original AC decoding logic handled ZRL (runs of 16 zeros)
incorrectly.
The original flow set r=16 and skipped the final coeff write when
s=0. This is not actually correct, because of the intervening
"refinement" bits.
With the original logic, even once we decrement r to 0, we keep
reading more refinement bits for subsequent coefficients until
we find the next currently-unsent AC in the current scan. That is,
it works as if it were trying to place 17 new AC values, and only
bails at the last minute from actually setting that 17th value.
This is wrong. Once we've found the 16th previously-unsent AC, we
need to stop reading refinement bits, otherwise we get out of sync
with the bit stream (which expects us to read a huffman code next).
The easiest fix is to just do what the JPEG standard implicitly
assumes anyway: treat ZRL as a run of 15 zeros followed by an
explicit magnitude-zero AC coeff. (That is, leave s=0 and actually
write s.) That is what this fix does.
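The behaviour described above can be sketched as follows. This is a simplified stand-in, not the decoder's real code: bit_reader, get_bit and apply_run are hypothetical names, coefficients are assumed to be already in scan order, and the sanity checks a real decoder would make on corrupt data are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit source: one bit per array element. */
typedef struct { const unsigned char *bits; size_t pos; } bit_reader;

static int get_bit(bit_reader *br) { return br->bits[br->pos++]; }

/* Apply one (r, s) symbol of an AC refinement scan.
   coef: coefficients in scan order; nonzero entries were sent in an earlier scan
   k:    first index to visit; end: one past the last index
   r:    number of still-zero coefficients to skip
   s:    value written to the next still-zero coefficient after the skip
         (ZRL is r = 15, s = 0)
   bit:  the magnitude bit being refined in this scan
   Returns the index following the last coefficient visited. */
static int apply_run(short *coef, int k, int end, int r, int s, int bit,
                     bit_reader *br)
{
    assert(k >= 0 && k <= end);
    while (k < end) {
        short *p = &coef[k++];
        if (*p != 0) {
            /* previously-sent coefficient: consume one refinement bit;
               a 1 bit grows the magnitude by `bit` */
            if (get_bit(br))
                *p = (short)(*p > 0 ? *p + bit : *p - bit);
        } else if (r == 0) {
            *p = (short)s;   /* for ZRL this writes 0, then stops reading bits */
            break;
        } else {
            --r;
        }
    }
    return k;
}
```

With r=15 and s=0 the loop stops at the 16th still-zero coefficient and leaves any following refinement bits for the next symbol, so the reader stays in sync with the bit stream.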
mingw defines its own __forceinline (though I don't think this applies
to all versions of mingw). Therefore, don't redefine it if mingw has
already defined it.
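A guard along these lines would express that. It is a sketch only: the fallback definitions are assumptions, not the library's actual ones. On MSVC __forceinline is a keyword, not a macro, so that compiler is excluded explicitly.

```c
/* Supply a fallback only when the toolchain has not already provided one. */
#if !defined(_MSC_VER) && !defined(__forceinline)
   #if defined(__GNUC__)
      #define __forceinline inline   /* assumed fallback for gcc-like compilers */
   #else
      #define __forceinline          /* assumed fallback: expand to nothing */
   #endif
#endif
```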