#78 File sorting fails on hexadecimal encoding

open
nobody
None
5
2010-09-30
2010-09-30
Anonymous
No

I use comix 4.0.4 on openSUSE 11.2

I have an archive with the following files in it:
xyz001
xyz002
...
xyz009
xyz00A
...
xyz00F
xyz010
...
xyz019
xyz01A

This file list is not handled correctly by comix's "alphanumeric_sort()" [filehandler.py]; the result is:
xyz00A
...
xyz00F
xyz001
xyz01A
xyz002
...
xyz009
xyz010
...
xyz019

I don't see any sensible way to make the alphanumeric sort handle this case correctly, so my suggestion is to introduce a GUI switch to choose between "alphanumeric" and "lexicographic" sorting.
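
For illustration, a minimal sketch that reproduces both orderings for the file list above. It is not the comix source: `natural_key` is a hypothetical stand-in that assumes the sort splits each name into digit and non-digit runs, compares digit runs as integers, and places numbers before text.

```python
import re

def natural_key(name):
    """Hypothetical natural-sort key: digit runs compare as integers."""
    key = []
    for chunk in re.findall(r"[0-9]+|[^0-9]+", name):
        if chunk[0] in "0123456789":
            key.append((0, int(chunk), ""))      # assumption: numbers sort before text
        else:
            key.append((1, 0, chunk.lower()))
    return key

names = ["xyz001", "xyz002", "xyz009", "xyz00A", "xyz00F", "xyz010", "xyz019", "xyz01A"]

natural = sorted(names, key=natural_key)
lexicographic = sorted(names, key=str.lower)

print(natural)
# ['xyz00A', 'xyz00F', 'xyz001', 'xyz01A', 'xyz002', 'xyz009', 'xyz010', 'xyz019']
print(lexicographic)
# ['xyz001', 'xyz002', 'xyz009', 'xyz00A', 'xyz00F', 'xyz010', 'xyz019', 'xyz01A']
```

Under these assumptions "00A" is read as the number 0 followed by "A", which is why the hexadecimal names land ahead of xyz001.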

Discussion

  • Alan Horkan
    2011-06-05

    If I were using a file manager like Nautilus or Thunar I'd just sort by date instead; it is usually the simplest answer to bad file names messing up the sort order.

    This choice of file names that will not sort easily seems broken to me. I wonder how you ended up with that naming scheme.
    I would unpack, rename, and repack. (Thunar includes a nice bulk rename tool.)

     
  • Charles M.
    2011-08-18

    I think this is actually a more important problem than the poster realizes.

    I think the most important criterion for the page sort order is consistency with other CBR readers. I've got about 4000 CBRs and almost without exception they were created expecting a lexicographic case-insensitive sort order, such as: page01.jpg, page0203.jpg, page04.jpg. Comix puts page0203.jpg at the end. I can obviously change the CBR file to correct it, but it's a pain. I'd say about 1 in 20 of my CBRs has this problem.

    Furthermore, when I do fix or create a CBR myself, I have to consider that if I want to give a copy to someone using one of those other CBR readers, or if I ever want to change to another CBR reader, I need to make the page order consistent in both sort orders. Not that hard, but again a bit of a pain, and for me undermines any benefit of the alpha numeric sort feature.

    Personally I consider the lexicographic case insensitive sorting unavoidably part of the de facto standard for the CBR/CBZ/CBT format. I can imagine that someone who creates their own CBRs and doesn't intend to ever use another CBR reader might view the alpha numeric sort order as a feature. But in my opinion it's a fairly minor one since the point of CBRs isn't to hack around inside the archive anyway. I'm sure a lot of Comix users aren't even aware of the internals of CBR files.

    To fix this, I would just change the sort order and not offer a configuration option to change it. Very easy code change too. If I did have a config option I'd suggest a checkbox labeled "Use old page sort order (deprecated)" that would default to unchecked and pop up a warning dialog when checked, explaining some of the issues, instead of radio buttons presenting it as a neutral choice.

     
  • Alan Horkan
    2011-10-14

    Charles, comix hasn't been updated since 2009, so changes might not happen.

    My comment is to suggest to anyone who might want to fix this that there is a potential workaround, and that a sort-by-date option would also be useful to anyone using Comix not just for comics but as a general image viewer.

    I tested the original CDisplay and CDisplayEx and they both put page0203.jpg at the end of the reading order. If you know of software that does handle badly named files I'd be interested to know; it is in effect breaking the de facto standard.
    It might be helpful for users to implement a more complicated file sorting algorithm, and that is a good thing, but I still think inconsistently naming files "page01.jpg, page0203.jpg, page04.jpg" is broken by design. Anyone packaging comics should know better, and it would be a bad thing to encourage further messing.

     
  • Charles M.
    2011-10-17

    I found an ugly but easy way to change to lexicographic page sorting, just comment out or delete lines 554 and 555 of src/filehandler.py in the comix-4.0.4 source code. They look like this:

    if s.isdigit():
        return int(s)

    It could be more efficient and neat; as is, it's just something to tide over the desperate like me.
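
To see what removing that check does, here is a rough sketch. Only the two quoted lines come from the comix source; the surrounding `sort_key` function, its `digits_as_numbers` flag, and the chunking regex are assumptions made for illustration.

```python
import re

def sort_key(name, digits_as_numbers=True):
    """Hypothetical key; digits_as_numbers=False models the two lines being removed."""
    key = []
    for chunk in re.findall(r"[0-9]+|[^0-9]+", name):
        if digits_as_numbers and chunk[0] in "0123456789":
            key.append((0, int(chunk), ""))    # digit run compared as an integer
        else:
            key.append((1, 0, chunk.lower()))  # everything compared as lowercase text
    return key

def patched_key(name):
    return sort_key(name, digits_as_numbers=False)

pages = ["page04.jpg", "page0203.jpg", "page01.jpg"]

print(sorted(pages, key=sort_key))
# ['page01.jpg', 'page04.jpg', 'page0203.jpg']
print(sorted(pages, key=patched_key))
# ['page01.jpg', 'page0203.jpg', 'page04.jpg']
```

One caveat with this sketch: comparing chunk by chunk is close to, but not always the same as, comparing whole lowercased names. For example "a-1.jpg" and "a1.jpg" come out in opposite orders under `patched_key` and under `key=str.lower`, so sorting directly with `key=str.lower` would be the neater route to a plain case-insensitive lexicographic order.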

    I tried out CDisplayEx and as Alan observed it sorted "naturally" as 01, 04, then 0203. However I went back to CDisplay and, for me, it's sorting lexicographically, i.e. 01, 0203, then 04. I also looked at evince 2.30.3 on Ubuntu 10.04 with locale set to en_US.UTF-8, and it's lexicographic too.

    For CDisplay, I'm using CDisplay 1.8.1.0 with this exe: CDisplay.exe 1,575,936 bytes, modified April 20, 2004 3:03:44 PM. I'm running it on Windows XP Media Center Edition SP 3, US English version. I suspect Alan may be using a later version of Windows. CDisplay appears to be calling CompareStringA in kernel32.dll, which seems to have added a mode to sort naturally in Windows 7. So maybe CDisplay is pulling its page ordering logic from there or from another shared library function that Microsoft has changed since XP.

    The only other thing I could think of to reproduce Alan's results with CDisplay was to see if perhaps it used the order the files were added to the archive to provide the sort order, but that seemed not to matter either for zip or rar.

    I wrote a script to scan my comic book collection and found that 624 out of 8752 of them have page order differences depending on the sort algorithm used. In the 10 or so files with ambiguous page ordering that I examined, the lexicographic ordering produced the appropriate page order. That agrees with my experience over the years with CDisplay.
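
The script itself is not shown in this thread. As a rough idea of how such a check could work, the sketch below flags an archive when the two sort orders disagree. It only opens zip-based CBZ files, since the Python standard library has no RAR reader; `natural_key`, `page_names`, `order_is_ambiguous`, and the extension list are all hypothetical.

```python
import re
import zipfile

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")  # assumed set of page types

def natural_key(name):
    """Hypothetical natural-sort key: digit runs compare as integers."""
    key = []
    for chunk in re.findall(r"[0-9]+|[^0-9]+", name):
        if chunk[0] in "0123456789":
            key.append((0, int(chunk), ""))
        else:
            key.append((1, 0, chunk.lower()))
    return key

def page_names(cbz_path):
    """Return the image file names stored in a zip-based comic archive."""
    with zipfile.ZipFile(cbz_path) as archive:
        return [n for n in archive.namelist() if n.lower().endswith(IMAGE_EXTENSIONS)]

def order_is_ambiguous(names):
    """True when lexicographic and natural sorting give different page orders."""
    return sorted(names, key=str.lower) != sorted(names, key=natural_key)

print(order_is_ambiguous(["page01.jpg", "page0203.jpg", "page04.jpg"]))  # True
print(order_is_ambiguous(["01.jpg", "02.jpg", "10.jpg"]))                # False
```

Running `order_is_ambiguous(page_names(path))` over every archive in a collection would give a count comparable to the one above, though the exact number depends on which natural-sort variant is used.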

    The lexicographic page ordering that CDisplay is producing on XP is pretty important since over the years a lot of CBR files were put together with that as the reference implementation. Now that that base of files has been widely propagated, it would be a lot of work to correct them.

    Since some reader programs are using natural sorting and others are using lexicographic, sadly I don't think we can really say there is a de facto standard any more. Additionally I'd guess at least some of these readers are using localized collation logic, so even the same reader might come up with a different page order in different locales.

    Fundamentally using file sorting to determine the page ordering is a bad idea. The implementations of sort algorithms change over time and vary between locales, character sets, libraries, and OSes. Recently, I've looked at the implementation of natural sorting in GLib (g_utf8_collate_key_for_filename), in Perl's Sort::Naturally library, and in comix and each one is slightly different and so changing between (hypothetical) CBR readers that used them could produce unexpected changes in page order. Reliance on sorting logic of any kind to generate page order is going to be an ongoing pain for the CBR format, and natural sorting is going to be even worse because each implementation makes slightly different choices, probably due to its greater complexity over lexicographic sorting.

    I'd personally rather see a CBR v2 format which would eliminate use of any sort algorithms and instead get the page order from an included manifest file, explicitly listing the image files in the order to display them. It would be more burden on the creator to produce the manifest, though likely that could be automated, but it would really clear up this issue.
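
As a sketch of how a reader might consume such a manifest: no manifest format is defined anywhere in this thread, so the layout assumed here (plain text, one image file name per line, in reading order) and the `page_order` function are purely hypothetical.

```python
def page_order(archive_names, manifest_text=None):
    """Return pages in manifest order, or fall back to a lexicographic sort."""
    if manifest_text is None:
        # No manifest in the archive: behave like a legacy reader.
        return sorted(archive_names, key=str.lower)
    available = set(archive_names)
    listed = [line.strip() for line in manifest_text.splitlines() if line.strip()]
    missing = [name for name in listed if name not in available]
    if missing:
        raise ValueError("manifest lists files not in the archive: " + ", ".join(missing))
    return listed

names = ["page04.jpg", "page0203.jpg", "cover.jpg", "page01.jpg"]
manifest = "cover.jpg\npage01.jpg\npage0203.jpg\npage04.jpg\n"

print(page_order(names, manifest))
# ['cover.jpg', 'page01.jpg', 'page0203.jpg', 'page04.jpg']
```

With a manifest present, no sorting happens at all, so the result cannot vary between readers, locales, or library versions.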

    In the meantime, I think the least bad choice is to change back to lexicographic sort order (and tell CDisplayEx that they should too). Primarily I say this because, of the two sort algos in use, lexicographic has gotten baked into more existing CBRs, and it's also the simplest to implement and the simplest for the many different CBR readers to coalesce on. I don't think allowing users to select between different sort modes is really going to be a solution. It isn't like Nautilus or Thunar, where you can rely on the fact that users know they are looking at files and can see the names and mod times and could thus take a good guess at what might be a useful sort order.

    In a CBR reader, the user likely doesn't know that the pages in the CBR files are actually archived files, and it's also likely they don't know what ordering scheme the creator of the CBR intended for it to be read in. When they start reading a CBR with the pages ordered incorrectly it can sometimes be difficult to realize what is going on. They're like "is this one of those stories that starts in the middle of the action, or has my CBR reader got the page ordering incorrect again?"

    Adding yet another user selectable sort order like mod date order is not a help either. It would generate a correct page order frequently assuming most scanners probably scan pages in order, but when there was a later page edit, again the order will be wrong. Instead of 2 schemes which are right 90% of the time, we'll have 3. Which isn't really better for users who are reading comics, but it is worse for CBR authors.
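
A small made-up example of the rescan problem. The file names and timestamps below are invented; the six-field tuples simply mirror the shape of `zipfile.ZipInfo.date_time` (year, month, day, hour, minute, second).

```python
# (file name, modification time) pairs; page 02 was rescanned a week later.
pages = [
    ("01.jpg", (2011, 3, 1, 10, 0, 0)),
    ("02.jpg", (2011, 3, 9, 18, 30, 0)),
    ("03.jpg", (2011, 3, 1, 10, 2, 0)),
]

by_date = [name for name, stamp in sorted(pages, key=lambda page: page[1])]
by_name = sorted(name for name, stamp in pages)

print(by_date)  # ['01.jpg', '03.jpg', '02.jpg']
print(by_name)  # ['01.jpg', '02.jpg', '03.jpg']
```

A single late edit is enough to push a page to the end of the date order, even though the names are unambiguous.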

    As it is, anyone creating a CBR file for distribution should try to accommodate the current set of readers by setting up the page names so they will display correctly in either lexicographic or any of the natural sort order variants. Typically they should stick with 01.jpg, 02.jpg, 03.jpg, etc. They should eliminate as much as they can from the filenames, never vary case, always use sufficient leading zeros, and try to keep it really simple.
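
Generating such names is easy to automate. A minimal sketch follows; `padded_names` is a hypothetical helper, and the minimum width of two digits is just a choice made here to match the 01.jpg style.

```python
def padded_names(page_count, extension=".jpg"):
    """Return zero-padded page names wide enough for the largest page number."""
    width = max(2, len(str(page_count)))
    return [str(number).zfill(width) + extension for number in range(1, page_count + 1)]

print(padded_names(3))        # ['01.jpg', '02.jpg', '03.jpg']
print(padded_names(120)[:2])  # ['001.jpg', '002.jpg']
```

Because every name has the same length and contains a single number, lexicographic and natural sorting should agree on the order.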

    Adding mod date order would just be one more sort order they ought to comply with, for those users who have their CBR reader in that mode. If a page in the middle of the CBR needs to be rescanned, the author would have to start playing around with date stamps, which is usually more cumbersome than changing filenames.

    Charles