| encoding(3tcl) | Tcl Built-In Commands | encoding(3tcl) |
encoding - Manipulate encodings
encoding option ?arg arg ...?
Strings in Tcl are logically a sequence of Unicode characters. These strings are represented in memory as a sequence of bytes that may be in one of several encodings: modified UTF-8 (which uses 1 to 4 bytes per character), or a custom encoding stored as 8-bit binary data.
Different operating system interfaces or applications may generate strings in other encodings such as Shift-JIS. The encoding command helps to bridge the gap between Unicode and these other formats.
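As a minimal sketch of that bridging role, the following round-trips a string through UTF-8 with encoding convertto and encoding convertfrom. The sample string and the variable names are illustrative assumptions, not part of the command.

```tcl
# Illustrative values only; these variable names are not part of Tcl.
set text "caf\u00E9"                              ;# 4 characters, the last is U+00E9
set bytes [encoding convertto utf-8 $text]        ;# external form: UTF-8 bytes
set decoded [encoding convertfrom utf-8 $bytes]   ;# back to a Tcl string

puts "characters: [string length $text]"
puts "bytes:      [string length $bytes]"
puts "round trip: [string equal $text $decoded]"
```

U+00E9 occupies two bytes in UTF-8, so the byte sequence is one element longer than the character string.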
Performs one of several encoding-related operations, depending on option. The legal options are:
For encoding convertfrom, the -profile option determines the command's behavior in the presence of conversion errors. See the PROFILES section below for details. If the -failindex option is not specified, any premature termination of processing due to errors is reported through an exception.
If the -failindex option is specified, then instead of raising an exception on premature termination, the command returns the result of the conversion up to the point of the error. In addition, the index of the source byte that triggered the error is stored in var. If no errors are encountered, the entire result of the conversion is returned and the value -1 is stored in var.
For encoding convertto, the -profile and -failindex options have the same effect as described for the encoding convertfrom command.
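The -failindex behavior can be wrapped in a small helper that tells the caller whether the whole string was encoded. This is only a sketch: try_encode and its return format are hypothetical and not part of Tcl.

```tcl
# try_encode is a hypothetical helper, not a Tcl built-in.
# Returns {ok bytes} when everything was encoded, or
# {partial bytes index} when the character at index could not be encoded.
proc try_encode {enc str} {
    set bytes [encoding convertto -profile strict -failindex idx $enc $str]
    if {$idx == -1} {
        return [list ok $bytes]
    }
    return [list partial $bytes $idx]
}

puts [try_encode iso8859-1 "AB"]
puts [try_encode iso8859-1 "A\u0141"]
```

Checking the stored index against -1 is what separates a complete conversion from a truncated one, since both cases return normally.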
Operations involving encoding transforms may encounter several types of errors, such as invalid sequences in the source data or characters that cannot be encoded in the target encoding. A profile prescribes the strategy for dealing with such errors in one of two ways: terminating further processing of the source data, or continuing with a fallback strategy such as replacing the offending data.
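When processing is terminated and -failindex is not given, the error surfaces as an ordinary Tcl exception, so it can be caught with try. The sketch below assumes a hypothetical helper, decode_or_default, that substitutes a caller-supplied value when decoding fails.

```tcl
# decode_or_default is a hypothetical helper, not a Tcl built-in.
# It decodes bytes strictly and falls back to a default value on any error.
proc decode_or_default {enc bytes default} {
    try {
        return [encoding convertfrom -profile strict $enc $bytes]
    } on error {msg} {
        # msg holds the conversion error message; it is ignored here.
        return $default
    }
}

puts [decode_or_default ascii "AB" "<invalid>"]
puts [decode_or_default ascii "A\x80" "<invalid>"]
```

Catching the exception discards any partial result; use -failindex instead when the successfully converted prefix is needed.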
The following profiles are currently implemented: tcl8, strict and replace. The strict profile is the default if the -profile option is not specified.
Under the tcl8 profile, when converting from Tcl strings to an external encoding format using encoding convertto, characters that cannot be represented in the target encoding are replaced by an encoding-dependent character, usually the question mark ?.
Under the replace profile, when converting an encoded byte sequence to a Tcl string using encoding convertfrom, invalid bytes are replaced by the U+FFFD REPLACEMENT CHARACTER code point.
When encoding a Tcl string with encoding convertto under the replace profile, code points that cannot be represented in the target encoding are transformed to an encoding-specific fallback character: U+FFFD REPLACEMENT CHARACTER for UTF targets and generally `?` for other encodings.
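Because -profile replace marks every invalid byte with U+FFFD on decoding, counting that code point gives a rough measure of how damaged the input was. The following is a sketch under that assumption; count_replaced is a hypothetical helper, and it also counts any U+FFFD that was legitimately present in the source.

```tcl
# count_replaced is a hypothetical helper, not a Tcl built-in.
# Decodes leniently and counts U+FFFD code points in the result.
proc count_replaced {enc bytes} {
    set s [encoding convertfrom -profile replace $enc $bytes]
    return [regexp -all \uFFFD $s]
}

# Two bytes in this input are invalid in ASCII.
puts [count_replaced ascii "A\x80Z\x81"]
```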
These examples use the utility proc below that prints the Unicode code points comprising a Tcl string.
    proc codepoints s {
        join [lmap c [split $s {}] {
            string cat U+ [format %.6X [scan $c %c]]
        }]
    }
Example 1: Convert a byte sequence in the Japanese euc-jp encoding to a Tcl string:
    % codepoints [encoding convertfrom euc-jp "\xA4\xCF"]
    U+00306F
The result is the Unicode code point “\u306F”, which is the Hiragana letter HA.
Example 2: Error handling based on profiles:
The letter A is Unicode character U+0041 and the byte "\x80" is invalid in ASCII encoding.
    % codepoints [encoding convertfrom -profile tcl8 ascii A\x80]
    U+000041 U+000080
    % codepoints [encoding convertfrom -profile replace ascii A\x80]
    U+000041 U+00FFFD
    % codepoints [encoding convertfrom -profile strict ascii A\x80]
    unexpected byte sequence starting at index 1: '\x80'
Example 3: Get partial data and the error location:
    % codepoints [encoding convertfrom -profile strict -failindex idx ascii AB\x80]
    U+000041 U+000042
    % set idx
    2
Example 4: Encode a character that is not representable in ISO8859-1:
    % encoding convertto -profile tcl8 iso8859-1 A\u0141
    A?
    % encoding convertto -profile strict iso8859-1 A\u0141
    unexpected character at index 1: 'U+000141'
    % encoding convertto -profile strict -failindex idx iso8859-1 A\u0141
    A
    % set idx
    1
encoding, unicode
| 8.1 | Tcl |