What you report is not a software bug. I have not seen any paper that generalizes the slicing approaches into a single framework as we do. The zlib implementation is a variation of slicing-by-4, not slicing-by-8. The design of slicing-by-8 was made possible by the generalization reported in our paper.
There's nothing novel about this algorithm. Anyone tasked with developing an efficient CRC implementation is almost inevitably going to reinvent it. At least one widely deployed library (zlib) has been using it for years. The reason the byte-by-byte algorithm is so widespread is not that it's the state of the art, but that it's simple and there are public-domain implementations. This code might...
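For readers following the thread, here is a rough sketch of the two approaches being compared: the byte-by-byte table lookup and a slicing-by-4 variant. It assumes the reflected CRC-32 polynomial 0xEDB88320 and little-endian word loads. The names `make_tables`, `crc32_bytewise`, and `crc32_slice4` are made up for illustration. This is neither zlib's code nor the paper's implementation, only a minimal version of the general idea.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial (assumed for this sketch)

def make_tables(n):
    """Build n 256-entry tables; tables[0] is the classic byte-wise table."""
    t0 = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ POLY if c & 1 else c >> 1
        t0.append(c)
    tables = [t0]
    for k in range(1, n):
        prev = tables[k - 1]
        # Entry k is entry k-1 advanced by one more zero byte.
        tables.append([(prev[i] >> 8) ^ t0[prev[i] & 0xFF] for i in range(256)])
    return tables

TABLES = make_tables(4)

def crc32_bytewise(data, crc=0):
    """One table lookup per input byte."""
    t0 = TABLES[0]
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ t0[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

def crc32_slice4(data, crc=0):
    """Four independent table lookups per 32-bit word of input."""
    t0, t1, t2, t3 = TABLES
    crc ^= 0xFFFFFFFF
    n = len(data) - len(data) % 4
    for i in range(0, n, 4):
        crc ^= int.from_bytes(data[i:i + 4], "little")
        crc = (t3[crc & 0xFF] ^ t2[(crc >> 8) & 0xFF]
               ^ t1[(crc >> 16) & 0xFF] ^ t0[crc >> 24])
    # Fall back to the byte-wise loop for the 0-3 trailing bytes.
    for b in data[n:]:
        crc = (crc >> 8) ^ t0[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

if __name__ == "__main__":
    msg = b"The quick brown fox jumps over the lazy dog"
    print(crc32_bytewise(msg) == crc32_slice4(msg) == zlib.crc32(msg))
```

Slicing-by-8 follows the same pattern with eight tables and 64 bits of input per iteration. In pure Python the difference is only illustrative; the speedup discussed here comes from compiled code, where the lookups within one iteration do not depend on each other.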