Hello,
When non-Latin filenames (Cyrillic in my case) are processed by ddru_ntfsfindbad, they come out completely unreadable in the logfile, because as far as I can see in the sources, the NTFS name attributes (UTF-16 encoded) are processed by simply copying each even byte from the attribute... Is it possible to use iconv or something similar to get the name strings right? Moreover, I think an optional output-locale parameter could be useful.
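To illustrate: taking every even byte of UTF-16LE happens to work for plain ASCII names but produces garbage for Cyrillic ones. A small hypothetical demonstration (the byte values are standard UTF-16LE; the code is only an illustration, not the tool's source):

#include <stdio.h>

int main(void)
{
    /* "AB" in UTF-16LE: the low (even) bytes are the ASCII codes. */
    unsigned char ascii[] = { 0x41, 0x00, 0x42, 0x00 };
    /* Cyrillic "DA" in UTF-16LE: U+0414, U+0410. */
    unsigned char cyr[]   = { 0x14, 0x04, 0x10, 0x04 };

    /* Taking every even byte works here... */
    printf("%c%c\n", ascii[0], ascii[2]);      /* prints "AB" */
    /* ...but here it yields the control bytes 0x14 and 0x10: garbage. */
    printf("0x%02x 0x%02x\n", cyr[0], cyr[2]);
    return 0;
}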
Thanks in advance!
You are correct that I used each even character from the attribute file name, as I did not know how to convert it when I wrote that code. Now that I have looked at iconv, I am thinking you would like "UTF-16" converted to type "char", as "char" would be local to the machine. I would think this would convert it to a readable form, and would actually be the proper way to do it. I probably won't be able to get to this for a few days, but I will definitely look into it and see what I can do.
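Something along these lines is what I am picturing; a minimal sketch only (hypothetical names, not actual ddrutility code, and note it opens and closes iconv on every call):

#include <stddef.h>
#include <iconv.h>

/* Hypothetical sketch: convert one NTFS UTF-16LE name to the local
   encoding (UTF-8 here). Returns 0 on success, -1 on failure. */
int name_to_local(char *in, size_t in_bytes, char *out, size_t out_bytes)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");   /* to, from */
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = in, *outp = out;
    size_t inleft = in_bytes, outleft = out_bytes - 1;  /* room for '\0' */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;
    *outp = '\0';
    return 0;
}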
Scott
Thank you for the quick reply!
But it's impossible to convert UTF-16 to some "common" char because of the existence of codepages. Cyrillic converted from UTF-16 to CP866 (DOS Russian) is unreadable under Windows, where CP1251 is used instead. And many systems have multibyte locales (Ubuntu's locale is UTF-8, which is a variable-width encoding). iconv can deal with all of that (at least its console version can, and I think it is based on iconv() too), but I think it would be useful to determine the current locale at runtime, and to have an optional command-line parameter (say, in the literal form of iconv's encoding names) to override it.
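For the runtime part: on POSIX systems the current locale's codeset can be queried with nl_langinfo() and passed straight to iconv_open(). A minimal sketch:

#include <locale.h>
#include <langinfo.h>
#include <stdio.h>

int main(void)
{
    /* Adopt the user's environment locale, then ask for its codeset.
       The returned string (e.g. "UTF-8" or "CP1251") is acceptable
       as the "to" encoding argument of iconv_open(). */
    setlocale(LC_ALL, "");
    printf("local codeset: %s\n", nl_langinfo(CODESET));
    return 0;
}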
I have just uploaded ddrutility-2.3-beta1 in the "other and testing" section of the files area. Ddru_ntfsfindbad now properly converts the NTFS file names from UTF-16 to UTF-8, and has a new option, -e, --encoding, which allows manually changing the file name encoding. But I don't think you will have to use the option, as the default seems to work. I tested by manually editing an MFT entry to include UTF-16 Cyrillic characters, and the results were readable in the output file.
Please test and let me know if it works properly, and if it works for you without having to use the option manually.
Scott
There appears to be a major memory bug in my release of 2.3-beta1. Ddru_ntfsfindbad eats memory like crazy and will crash on a large MFT. It appears I have more work to do.
Scott
Hm, I've just tested 2.3-beta1 (maybe my MFT was not large enough, ~100K records): everything works OK with the default encoding to UTF-8. With the -e option set to another encoding (CP1251 or CP866) it throws an error during MFT record processing, for example:
priconv failed: in string ' ', length 56, out string '18 ��� ', length 1021
Invalid multibyte sequence.
processing inode 7719 of 10510
When UTF-8 (my system locale) is set with the -e parameter, all is OK.
But I found one more bug when there is more than one name attribute in a single FILE record: when an attribute with namespace 2 (DOS) is followed by one with namespace 1 (Win32), everything is OK and the right name (namespace 1) is used. But when namespace 1 comes first and namespace 2 second, the DOS namespace is used, which is wrong: the name looks like 45D61~1 or so. It seems that the last attribute is used in every case. I have attached three records with two name fields each.
I am glad the UTF-8 works.
The "invalid multibyte sequence" error is from iconv itself. It seems to happen when the conversion cannot be done properly. I don't consider this a bug; however, I am not sure I am handling it the right way yet, and that is why it exits. I have not had time to see what is really happening in the conversion. For instance, I can normally convert my small test disk to ASCII, but if I add a Cyrillic character I get the error, and I don't yet know whether the name was partially converted or not converted at all.
As for the DOS filenames, that is not supposed to happen. If there is more than one name, it is supposed to pick anything better than type 0x02 (DOS 8.3). Maybe I broke something, or don't have something right. Thanks for the MFT records; they could be very useful in my testing.
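For what it's worth, here is a hypothetical sketch of that selection rule (the struct and names are made up; only the namespace comparison matters; namespace byte: 0 = POSIX, 1 = Win32, 2 = DOS, 3 = Win32 & DOS):

#define NTFS_NAMESPACE_DOS 2

struct picked_name {
    int have_name;   /* set once any $FILE_NAME attribute has been seen */
    int ns;          /* namespace byte of the name currently stored */
    /* ... the stored name bytes would live here ... */
};

/* Let a later name attribute replace the stored one only when doing so
   upgrades away from the DOS 8.3 form, regardless of attribute order. */
static int should_replace(const struct picked_name *cur, int new_ns)
{
    if (!cur->have_name)
        return 1;                 /* first name seen always wins */
    return cur->ns == NTFS_NAMESPACE_DOS && new_ns != NTFS_NAMESPACE_DOS;
}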
It may take me several days or longer to work all of this out, but I will definitely address it as best I can. First I have to fix my major memory leak :)
Scott
I just uploaded 2.3-beta2. It fixes the memory leak, and I think it fixes the issue with DOS file names in the output. Please test and report whether it fixes the DOS file name issue for you. I did not spend much time on it, only enough to address these immediate issues without much testing. It is still beta for a reason.
Brilliant! Thank you, thank you, thank you :) It works like a charm. On a partition with ~650K records it consumes 1.5 GB of memory; is that right? Even if it's not, it is not annoying at all :)
I have found another memory leak. It appears that iconv itself has a major memory leak. In the beta2 release it causes at least 8 times more memory use than normal (and it was 4 times worse than that when I forgot to iconv_close in beta1). I am not sure yet how to avoid this problem, so this is going to stay beta for a while yet. This memory leak does not exist in the original stable version.
Iconv has been reported to have a memory leak in 2004 and in 2009 (and I even saw something from 1999). I am using Xubuntu 12.04, which is only 2 years old. One would think the f#(%!#@ bug would be fixed by now!
Hmm. Maybe it's possible to use something else for the conversion? For example, I googled this: http://site.icu-project.org/
===
What is ICU?
ICU is a cross-platform Unicode based globalization library. It includes support for locale-sensitive string comparison, date/time/number/currency/message formatting, text boundary detection, character set conversion and so on.
Well, as far as I can see there are at least two versions of iconv(): the glibc version, and libiconv, which has to be installed separately: https://www.gnu.org/software/libiconv/
Maybe the libiconv version doesn't have the memory issues?
First, since ddrutility is compiled by the user, it would be difficult to add ICU or to depend on a specific version of iconv. Either it has to be standard, or I have to be able to easily include it in the source.
Second, I think I am figuring out how to deal with the iconv memory leak! Iconv seems VERY picky about always using the EXACT same memory locations for input and output on every call, and it also only likes being opened and closed once during the program run. Actually, it may not even like being closed before the program ends under some conditions. I do not have a beta to release yet, as I want to work on a few things, but as far as I can tell I should be able to eliminate, or at least very much suppress, the memory leak :)
Yes, calling iconv_open() only once is a very good idea :) But how can it use the same memory location for input and output? When the output is wider than the input, that means the next input will be overwritten, or am I misunderstanding something?
To clarify, I mean the input needs to always point to the same memory location, and the output always needs to point to the same location; the input and output memory locations are separate. Iconv changes the input and output pointers, so they both have to be put back before the next call. The memory locations must only be set once in the program, and can never be free()d after the first call to iconv, or the program crashes. And as I also found out in longer runs, it is not a good idea to close iconv before program exit; instead, let it be closed by the program exiting (or it can crash). Iconv is buggy as hell, but I think I can work with it, although more testing may be needed to be sure.
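Putting those observations together, the workaround pattern might look roughly like this (a hypothetical sketch with made-up names, assuming glibc iconv and UTF-16LE input):

#include <iconv.h>
#include <string.h>

#define CONV_BUF 4096

/* One descriptor for the whole run, and fixed buffers that are set up
   once and never freed; only the working pointers are reset per call. */
static iconv_t cd = (iconv_t)-1;
static char inbuf[CONV_BUF];
static char outbuf[CONV_BUF];

int convert_name(const char *name, size_t bytes, const char *to_enc)
{
    if (cd == (iconv_t)-1) {
        cd = iconv_open(to_enc, "UTF-16LE");    /* opened exactly once */
        if (cd == (iconv_t)-1)
            return -1;
    }
    if (bytes > sizeof(inbuf))
        return -1;
    memcpy(inbuf, name, bytes);

    /* iconv() advances these, so put them back before every call. */
    char *inp = inbuf, *outp = outbuf;
    size_t inleft = bytes, outleft = sizeof(outbuf) - 1;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        return -1;
    *outp = '\0';
    return 0;   /* note: no iconv_close(); leave that to process exit */
}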
I just uploaded 2.3-beta3. Ddru_ntfsfindbad now has about the best possible fix/workaround for the iconv memory leak. It should use much less memory now, probably no more than the previous stable version. I have also improved how it handles unconvertible characters: it no longer exits, but instead skips the unconvertible characters in the name and reports on screen how many were skipped for each inode. I also found that you can append //TRANSLIT to the encoding type to make iconv attempt to convert such characters to the closest thing it can, or to a default character (usually "?"). For example, try the option --encoding CP866//TRANSLIT and see what happens. There is also //IGNORE, but that is basically what is now done by default, and using that suffix only suppresses all the messages in this case.
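For the curious, the skip-on-error behaviour can be sketched like this (hypothetical code, not the shipped source; glibc's iconv reports an unconvertible character as EILSEQ, and each skip steps over one UTF-16 code unit):

#include <errno.h>
#include <iconv.h>

/* Returns how many UTF-16 code units were skipped, or -1 on other errors. */
static int convert_skipping(iconv_t cd, char *inp, size_t inleft,
                            char *outp, size_t outleft)
{
    int skipped = 0;
    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;                       /* everything left was converted */
        if (errno == EILSEQ && inleft >= 2) {
            inp += 2;                    /* step over the bad code unit */
            inleft -= 2;
            skipped++;
        } else {
            return -1;                   /* E2BIG, EINVAL, ... */
        }
    }
    return skipped;
}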
Thank you for the nice work :) I tested v2.3 in several cases and had no issues with language support!