Character Set Conversion (xm_charconv)
This module provides tools for converting strings between different character sets (code pages).
All the encodings available to iconv are supported.
On GNU/Linux systems, run iconv -l for a list of encoding names.
For the supported platforms, see the list of installer packages in the Available Modules chapter.
Configuration
The xm_charconv module accepts the following directives in addition to the common module directives.
- AutodetectCharsets
  This optional directive accepts a comma-separated list of character set names. When auto is specified as the source encoding for convert() or convert_fields(), the module tries these character sets for the conversion. This directive is not related to the LineReader directive or to the InputType that is registered when LineReader is specified.
- BigEndian
  This optional boolean directive specifies the endianness to use during the encoding conversion. If it is not specified, it defaults to the host’s endianness. This directive only affects the registered InputType, and only applies if the LineReader directive is set to a non-Unicode encoding and the CharBytes directive is set to 2 or 4.
- CharBytes
  This optional integer directive specifies the byte width of the encoding to use during conversion. Accepted values are 1 (the default), 2, and 4. Most variable-width encodings work with the default value. This directive only affects the registered InputType, and only applies if the LineReader directive is set to a non-Unicode encoding.
- LineReader
  If this optional directive is specified with an encoding, an InputType is registered under the name of the xm_charconv module instance. The following Unicode encodings are supported: UTF-8, UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, and UTF-7. For other encodings, it may be necessary to also set the BigEndian and/or CharBytes directives.
Examples
This configuration shows an example of character set auto-detection. The input file can contain lines with different encodings, and the module normalizes output to UTF-8.
<Extension converter>
    Module              xm_charconv
    AutodetectCharsets  utf-8, euc-jp, utf-16, utf-32, iso8859-2
</Extension>

<Input filein>
    Module  im_file
    File    "tmp/input"
    Exec    convert_fields("auto", "utf-8");
</Input>
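When only one field needs converting, the convert() function mentioned under AutodetectCharsets may be a better fit than convert_fields(). The following is a sketch rather than a tested configuration. It assumes that convert() takes the source string, the source encoding, and the destination encoding, in that order, and returns the converted string. That signature is not stated in this section, so check it against the module's function reference. The instance names and the file path are carried over from the example above.

```
<Extension converter>
    Module              xm_charconv
    AutodetectCharsets  utf-8, euc-jp, utf-16, utf-32, iso8859-2
</Extension>

<Input filein>
    Module  im_file
    File    "tmp/input"
    # Assumed signature: convert(source, srcencoding, dstencoding).
    # Only $raw_event is rewritten; other fields are left untouched.
    Exec    $raw_event = convert($raw_event, "auto", "utf-8");
</Input>
```

As in the previous example, "auto" relies on the character sets listed in AutodetectCharsets.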
This configuration uses the InputType registered via the LineReader directive to read a file with the ISO-8859-2 encoding.
<Extension converter>
    Module      xm_charconv
    LineReader  ISO-8859-2
</Extension>

<Input filein>
    Module     im_file
    File       "tmp/input/iso-8859-2.in"
    InputType  converter
</Input>
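For one of the Unicode encodings listed under LineReader, the same pattern should apply without setting BigEndian or CharBytes, since those directives only apply to non-Unicode encodings. The following sketch assumes a UTF-16LE input file. The instance name utf16reader and the file path are hypothetical and chosen only for illustration.

```
<Extension utf16reader>
    Module      xm_charconv
    # UTF-16LE is one of the Unicode encodings listed for LineReader
    LineReader  UTF-16LE
</Extension>

<Input filein>
    Module     im_file
    # Hypothetical path to a UTF-16LE encoded file
    File       "tmp/input/utf-16le.in"
    # The InputType is registered under the xm_charconv instance name
    InputType  utf16reader
</Input>
```

Each xm_charconv instance registers one InputType under its own name, so reading files in several encodings would presumably need one instance per encoding.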