
Welcome to the CONVERTCP project!

You may have realized that I didn't start this project on SOURCEFORGE. If you're interested in the history of CONVERTCP (which began back in 2016) and in some further explanations, please visit this forum thread:
CONVERTCP.exe - Convert text from one code page to another

Steffen


Discussion

  • Steffen

    Steffen - 2018-04-22

    Replacing file content with the converted text

    UPDATE: Version 8.2 supports overwriting of the original file. Pass a minus sign along with option /o.

    Original text:
    I already wrote some examples in the "readme.txt" about how to convert a single file. You may ask why CONVERTCP doesn't support overwriting a file with the converted content. The reason is that the tool doesn't read the entire content of the file at once. Especially for large files, it's more memory-efficient to read chunk-wise in order to avoid running out of RAM. This also gives CONVERTCP pretty high performance, because while one chunk of text is being read and converted, the previous chunk can be written to the new file at the same time. You can imagine that writing to and reading from the same file at the same time would corrupt the file. Also, what if the conversion fails for whatever reason? You could lose your data. Good practice is to write to a temporary file and only overwrite the original file if the conversion succeeded.
    Example for a BAT or CMD script:

    convertcp 1 65001 /b /i "test.txt" /o "test.txt.temp~"
    if not errorlevel 1 move /y "test.txt.temp~" "test.txt"
    

    Don't worry about the MOVE command. No data is physically moved as long as the file stays on the same logical drive; only the file addressing will be updated.

     

    Last edit: Steffen 2021-07-25
  • Steffen

    Steffen - 2018-04-22

    Convert all files of a certain directory

    It's not the built-in options that make CONVERTCP flexible to use; it's rather the command-line interface. You don't need to drag every single file to a window, and you don't need to browse files or folders in a dialog window either. Just use the possibilities that the Windows command line already provides.
    Say you want to convert all .txt files in the current directory from your default OEM code page to UTF-8 (similar to the example above). Then you can just use a FOR loop in a Batch script.

    for %%i in ("*.txt") do (convertcp 1 65001 /b /i "%%~i" /o "%%~i.~tmp" && move /y "%%~i.~tmp" "%%~i")
    
     

    Last edit: Steffen 2018-04-22
  • Steffen

    Steffen - 2018-04-22

    CONVERTCP and non-ASCII file names

    The source code of a Batch script has to be ASCII, or at least encoded in the default OEM code page or the code page you have set using CHCP. However, you'll run into trouble writing file names that contain unsupported characters. This is a limitation of the command interpreter, not of CONVERTCP, as you may have observed while running my latest example above. So working with a wildcard character (like the asterisk) in a loop is already a good workaround.
    But what if you only want to convert one single file whose name you can't write in a command line? There are parameter variables available in a Batch script. E.g., the first parameter passed to the script can be found in %1. Exactly like the FOR variables in the other script, parameter variables support Unicode, too. Just write your CONVERTCP command line using parameter %1 like this:

    @convertcp 1 65001 /b /i "%~1" /o "%~1.~tmp" && move /y "%~1.~tmp" "%~1"
    

    Now you can drag and drop your text file onto the Batch script file even if it has such a weird name as "𤽜€ЭÈ음.txt".

     

    Last edit: Steffen 2018-04-22
  • Steffen

    Steffen - 2018-04-22

    About flushing the output stream buffer

    Maybe you are wondering what the /f option is for. I have to admit you won't find much about it in the "readme.txt".
    Flushing is basically not needed. Written data will be buffered by the file system because physical writing to the drive is slow. That way, performance can be improved enormously. There is no drawback as long as the file is not accessed concurrently. Appending new data to the buffer by another write operation will not corrupt the data in the buffer. That's why, by default, CONVERTCP does not flush the buffer automatically, not even before the file is closed. Converting multiple files in a loop stays very fast that way.
    That said, there is still a risk in case you access the new file immediately while there might be yet unwritten data in the buffer. E.g., if you convert a file using CONVERTCP in a script and you want to process the new file in the very next line of the script, you might run into trouble because physically writing the data to the drive may still take a few hundred milliseconds (even if the file was already closed). Option /f forces the buffer to be flushed before CONVERTCP terminates, which should protect you from unwritten data. Even so, this doesn't necessarily mean that the data has already reached the physical drive memory. Drives may have an additional buffer, and it depends on the driver settings whether or not the Windows request to flush that additional buffer will be ignored. The latter is nothing I'm able to influence, and this possible drive behavior would cause issues for any other program that writes data to the drive as well.
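    CONVERTCP itself is a native Win32 tool, but the general idea behind option /f can be sketched in Python (hypothetical file name; os.fsync plays the role of the flush request that /f sends to Windows):

```python
import os

# Write some "converted" text; the OS keeps it in the file system cache.
with open("out.txt", "w", encoding="utf-8") as f:
    f.write("converted text\n")
    f.flush()              # move Python's userspace buffer to the OS
    os.fsync(f.fileno())   # ask the OS to write its cache to the drive;
                           # roughly what option /f requests

# Without the fsync, a script that processes "out.txt" right away could,
# in theory, race against data still sitting in the OS cache.
```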

     

    Last edit: Steffen 2018-04-22
  • Steffen

    Steffen - 2018-04-26

    Threading in CONVERTCP explained

    As already written, CONVERTCP reads the incoming stream chunk-wise if possible. The advantage is that an already converted chunk can be written by an asynchronous thread while the next chunk of text is being read and converted. That leads to good performance. Furthermore, the memory usage is limited to the buffer size the chunks need. This size doesn't increase even when very large files are converted. Sounds like a good concept, doesn't it? That's the reason why threading should be used.
    Things get complicated if the end of the read chunk falls somewhere inside a sequence of multiple bytes that represents a single character. There are charsets that have rules for recognizing where a character ends. I already use these rules for the processing of UTF-8 and UTF-16 streams in order to adjust the chunk boundaries accordingly. But there are charsets without such rules. Such streams could get corrupted if their size exceeds 1 MB (which is the default chunk size). If you list the code pages using option /l, you'll find in the second column whether or not you can convert incoming streams greater than 1 MB chunk-wise without the risk of damaging their content. So the conclusion is that threading is not always as good as it seems.
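    The boundary adjustment for UTF-8 can be illustrated with a small Python sketch (a simplified illustration of the idea, not CONVERTCP's actual code). A UTF-8 continuation byte always has the bit pattern 10xxxxxx, so a chunk that ends in the middle of a character can be trimmed back to the last lead byte:

```python
def trim_to_utf8_boundary(chunk: bytes) -> bytes:
    """Drop a trailing incomplete UTF-8 sequence so the chunk can be
    decoded on its own; the dropped bytes would be prepended to the
    next chunk."""
    cut = len(chunk)
    # Step back over continuation bytes (0b10xxxxxx).
    while cut > 0 and (chunk[cut - 1] & 0xC0) == 0x80:
        cut -= 1
    if cut == 0:
        return chunk            # nothing but continuation bytes
    lead = chunk[cut - 1]
    if lead < 0x80:
        return chunk            # last byte is ASCII, chunk is complete
    # Sequence length announced by the lead byte (2, 3, or 4 bytes).
    expected = 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    if len(chunk) - (cut - 1) == expected:
        return chunk            # sequence is complete
    return chunk[:cut - 1]      # cut off the incomplete sequence

# "€" is 3 bytes in UTF-8; cutting after 2 of them simulates a bad
# chunk boundary:
data = "ab€".encode("utf-8")
assert trim_to_utf8_boundary(data[:4]) == b"ab"
assert trim_to_utf8_boundary(data) == data
```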
    That's where an automatic evaluation comes in. If an incoming stream has a charset where threading is risky, the whole file will be read into the buffer. Due to internal limits of the API functions used, the size of the incoming stream is still restricted, but now it's 511 MB rather than only 1 MB. The needed buffer size might increase tremendously. If you have only little RAM, the tool may crash (even if that should be quite unlikely on modern computers).
    Run CONVERTCP with option /l. If you find a "No" in the second column for the code page of your incoming stream, then you can only convert streams up to 511 MB. If you need to convert a stream greater than 511 MB from such a code page, then don't use CONVERTCP at all.

     

    Last edit: Steffen 2018-04-27
  • Steffen

    Steffen - 2018-05-02

    How to find the right Code Page ID and why CONVERTCP doesn't detect code pages automatically

    The supported code pages depend on the code pages installed on your computer. Option /l lists them along with a short description. A lot of these IDs are not self-explanatory though. If you're familiar with .NET or HTML, you may already know some MIME names that are listed on the Microsoft page
    https://msdn.microsoft.com/en-us/library/dd317756.aspx
    This table might already be helpful if you want to convert HTML source text. Usually the MIME name is given as the charset attribute in the source, and you can look up the related code page ID in the table.
    But what if you don't know the encoding of your source text? Well, the best option is to ask the person who sent you the file; they should tell you the encoding. Seriously, not even established text editors are able to reliably guess single-byte code pages (such as ANSI code pages) if they differ from your defaults.
    What else can you do? Some encodings might use a Byte Order Mark. In a HEX editor you'll see it as the first byte sequence in the file. On Wikipedia you can find a table of what you may see (it's in the second column):
    https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding
    The Unicode standard doesn't require Byte Order Marks though. Even in such cases a HEX editor can be useful. If you see an ASCII code followed by a 00 byte, then the next ASCII code followed by 00, and so on, you may assume the file is UTF-16 LE encoded. Vice versa (00 bytes first) would indicate UTF-16 BE. If you see a sequence of ASCII codes and the first non-ASCII character consists of more than one byte, that's an indication of UTF-8.
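    These checks can be sketched in Python (a simplified heuristic for illustration, not CONVERTCP's detection; note that the longer UTF-32 signatures must be tested before the UTF-16 ones because they share a prefix):

```python
def sniff_encoding(data: bytes) -> str:
    """Guess an encoding from the first bytes of a file."""
    boms = [
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),   # before utf-16-le!
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for sig, name in boms:
        if data.startswith(sig):
            return name
    # No BOM: ASCII bytes interleaved with 00 bytes hint at UTF-16.
    if len(data) >= 4 and data[1] == 0 and data[3] == 0:
        return "utf-16-le"
    if len(data) >= 4 and data[0] == 0 and data[2] == 0:
        return "utf-16-be"
    return "unknown"

assert sniff_encoding(b"\xef\xbb\xbfabc") == "utf-8-sig"
assert sniff_encoding("text".encode("utf-16-le")) == "utf-16-le"
```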
    The origin of a file might also be of interest. If you know the file was created on a Linux distribution, the encoding is most likely UTF-8. ANSI encoded files likely have the default encoding of the country where they were created.
    What if you get the name of the encoding but you don't know the related code page ID? I uploaded a list in "Code Page Aliases.pdf". In the first column you will find several aliases, such as MIME names, IATA numbers, names used on different operating systems, and names used in different programming languages. Not all of them may exactly represent the Windows code pages of the second column (that you pass to CONVERTCP) but they should be at least close to these encodings. You can also pass the alias directly rather than the related code page ID.

    This is all quite complicated. So why is there no auto-detection implemented?
    EDIT: As of version 8, auto-detection is supported. However, it's not foolproof, and the explanation below is still perfectly valid.
    It's because there is no real detection; it would rather be guessing. As written above, ANSI encodings can't be guessed. Unicode encodings might be easy if the files have a Byte Order Mark. But if not, thousands of characters may have to be read to evaluate the character values. Heuristic or statistical methods might also have to be applied for the guessing. That's how text editors do it. But you can imagine that those things are only good for a text editor, where it doesn't matter if it takes a second to guess an encoding (and may still fail). For a command line utility like CONVERTCP it would be a disaster. It would take ages if you called it for hundreds of files in a directory. And since guessing is still guessing, you'd eventually blame me for corrupted or lost data.

     

    Last edit: Steffen 2021-06-24
  • Steffen

    Steffen - 2019-04-13

    How to verify if the conversion to another encoding was correct?

    Beginning with version 6.0, CONVERTCP supports option /v, which influences the return value of the utility. If you pass /v, CONVERTCP verifies whether all characters from the input have been converted without using any replacement characters or approximated ASCII characters. Only in this case does CONVERTCP return 0. If one or more characters have been found that do not map to the same Unicode code point in the code page used for the output, the return value will be 1.
    Apart from that, CONVERTCP will still try to convert the text and will still silently replace invalid characters, as in versions older than 6.0.
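    The idea behind /v can be sketched in Python: encode the text to the target code page, decode it back, and compare. Any replacement character means the conversion was lossy. This mirrors the concept only, not CONVERTCP's implementation:

```python
def converts_losslessly(text: str, codepage: str) -> bool:
    """Return True if every character survives a round trip through
    the target encoding (roughly what option /v checks for)."""
    encoded = text.encode(codepage, errors="replace")
    return encoded.decode(codepage) == text

# "äöü" and "€" exist in Windows code page 1252, but "Я" does not:
assert converts_losslessly("äöü €", "cp1252") is True
assert converts_losslessly("Я", "cp1252") is False
```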

     
  • Steffen

    Steffen - 2019-06-10

    About Virtual Terminal processing for version 6.3 and newer

    Most current command line utilities don't support Virtual Terminal processing yet. In that case, ANSI escape sequences are not interpreted to control the console output; instead, their textual representations get printed to the screen.
    Even though I expect VT processing will only rarely be used along with CONVERTCP, it won't hurt to enable it now that Windows 10 provides this possibility.
    Example using CONVERTCP v. 6.3:

    >nul chcp 65001
    echo +ABsAWw-93;42m+JYgliCWIJZMlkyWTJZIlkiWSJZElkSWR-   +ABsAWw-1E+ABsAWw-4;94;41m VT Processing +ABsAWw-0m|convertcp "UTF-7" "UTF-8"
    

    Virtual Terminal processing affects the output to the console window only. Thus, CONVERTCP has to print to the window directly, by omitting option /o and any redirection of the standard output stream. It's supported on Windows version 10.0.10586 onwards if the new console host is used.
    The behavior on older Windows versions, and for writing to files as well as for redirections, remains the same as before.

     

    Last edit: Steffen 2019-06-12
