Character Set Conversion (xm_charconv)
This module provides tools for converting strings between different character sets (code pages).
All the encodings available to iconv are supported.
On GNU/Linux systems, execute iconv -l for a list of encoding names.
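As a rough illustration of what a character set conversion does, the following Python sketch re-encodes ISO-8859-2 bytes as UTF-8. It uses Python's built-in codecs, not xm_charconv or iconv, and the helper name reencode is hypothetical. Python's codec names are not guaranteed to match the names printed by iconv -l.

```python
# Illustration only: Python's codec machinery stands in for iconv here.
def reencode(data: bytes, src: str, dst: str) -> bytes:
    """Decode raw bytes from the src character set and encode them as dst."""
    return data.decode(src).encode(dst)


# A Hungarian sample string; every character exists in ISO-8859-2.
sample_text = "Árvíztűrő tükörfúrógép"
latin2_bytes = sample_text.encode("iso-8859-2")
utf8_bytes = reencode(latin2_bytes, "iso-8859-2", "utf-8")
```

The text is the same before and after; only the byte representation changes. Accented characters take one byte each in ISO-8859-2 and two bytes each in UTF-8, so the UTF-8 form is longer.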
The functionality of xm_charconv can be combined with other modules providing data conversion such as xm_crypto or xm_zlib.
To examine the supported platforms, see the list of installation packages.
Configuration
The xm_charconv module accepts the following directives in addition to the common module directives.
Optional directives
- AutodetectCharsets
  This optional directive accepts a comma-separated list of character set names.
- InputEncoding
  If this optional directive is specified with an encoding, a data converter will be registered to convert from the specified encoding. If this directive is not specified, it defaults to UTF-8.
- OutputEncoding
  If this optional directive is specified with an encoding, a data converter will be registered to convert to the specified encoding. If this directive is not specified, it defaults to UTF-8.
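One way to picture a comma-separated list of candidate character sets is the Python sketch below. It assumes, purely for illustration, that detection tries each candidate in order and keeps the first one that decodes without error. The source does not describe how xm_charconv actually chooses among the candidates, and the name detect_and_convert is hypothetical.

```python
from typing import Iterable, Tuple


def detect_and_convert(
    data: bytes, candidates: Iterable[str], target: str = "utf-8"
) -> Tuple[str, bytes]:
    """Return (detected charset, data re-encoded as target).

    Assumption: the first candidate that decodes cleanly wins.
    """
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
        return name, text.encode(target)
    raise ValueError("no candidate character set could decode the input")


# Candidate order matters in this sketch: strict encodings come first.
candidates = ["utf-8", "euc-jp", "iso-8859-2"]
japanese = "日本語".encode("euc-jp")
detected, converted = detect_and_convert(japanese, candidates)
```

In this sketch a permissive single-byte character set such as ISO-8859-2 goes last, because Python decodes any byte sequence with it and it would otherwise mask the stricter candidates.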
Data conversion
The xm_charconv module implements a data converter to be used with the im_file module. It is specified in the InputType directive of the im_file module and is invoked using dot notation:
<InstanceName>.<DataConverterName>
Where <InstanceName> is the given name of the xm_charconv instance and <DataConverterName> is the name of the converter being invoked.
The following data converter is available:
- convert
  This data converter converts data from the encoding specified in InputEncoding to the encoding specified in OutputEncoding. The converter should be specified in the InputType directive before the input reader function.
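The dot notation can be read as an instance name followed by a converter name. The Python sketch below shows that split. The helper name split_converter_ref is hypothetical, and nothing here describes how NXLog parses the directive internally.

```python
from typing import Tuple


def split_converter_ref(ref: str) -> Tuple[str, str]:
    """Split '<InstanceName>.<DataConverterName>' into its two parts."""
    instance, dot, converter = ref.partition(".")
    if not dot or not instance or not converter:
        raise ValueError(f"expected <InstanceName>.<DataConverterName>, got {ref!r}")
    return instance, converter


# 'converter' is the xm_charconv instance name used in the examples below.
instance_name, converter_name = split_converter_ref("converter.convert")
```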
Examples
This configuration shows an example of character set auto-detection. The input file can contain lines with different encodings, and the module normalizes output to UTF-8.
<Extension converter>
    Module xm_charconv
    AutodetectCharsets utf-8, euc-jp, utf-16, utf-32, iso8859-2
</Extension>
<Input filein>
    Module im_file
    File "tmp/input"
    Exec convert_fields("auto", "utf-8");
</Input>
This configuration uses the data converter registered via the InputEncoding directive to read a file with the ISO-8859-2 encoding.
<Extension converter>
    Module xm_charconv
    InputEncoding ISO-8859-2
</Extension>
<Input filein>
    Module im_file
    File "tmp/input/iso-8859-2.in"
    InputType converter.convert
</Input>
This configuration uses a data converter with xm_multiline as an InputType to read a file with UCS-2BE encoding. Each log record in this file spans 3 lines.
<Extension converter>
    Module xm_charconv
    InputEncoding UCS-2BE
</Extension>
<Extension multiline>
    Module xm_multiline
    FixedLineCount 3
</Extension>
<Input filein>
    Module im_file
    File 'tmp/input/ucs-2be.in'
    InputType converter.convert, multiline
</Input>
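The order in this InputType chain, conversion first and record grouping second, can be sketched in Python as two stages. This is an illustration under stated assumptions, not NXLog's implementation: the function names are hypothetical, and Python's utf-16-be codec stands in for UCS-2BE, which it matches for characters in the Basic Multilingual Plane.

```python
from typing import Iterator


def convert_stage(raw: bytes, src: str = "utf-16-be") -> str:
    """Stage 1: turn the raw bytes into text before any line handling."""
    return raw.decode(src)


def fixed_line_records(text: str, lines_per_record: int = 3) -> Iterator[str]:
    """Stage 2: group consecutive lines into records of a fixed size.

    A trailing group shorter than lines_per_record is yielded as-is here.
    """
    lines = text.splitlines()
    for start in range(0, len(lines), lines_per_record):
        yield "\n".join(lines[start:start + lines_per_record])


# Two three-line records, stored as big-endian two-byte characters.
raw = "a1\na2\na3\nb1\nb2\nb3\n".encode("utf-16-be")
records = list(fixed_line_records(convert_stage(raw)))
```

Converting first matters in this sketch because each character, including the newline, occupies two bytes in the source encoding, so splitting the raw bytes on a single newline byte would cut characters in half.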
This configuration uses the data converter registered via the OutputEncoding directive to store log data into a file with the ISO-8859-2 encoding.
<Extension converter>
    Module xm_charconv
    OutputEncoding ISO-8859-2
</Extension>
<Input filein>
    Module im_file
    File "tmp/input/utf-8.in"
</Input>
<Output fileout>
    Module om_file
    File "tmp/iso-8859-2.out"
    OutputType converter.convert
</Output>