Character Set Conversion (xm_charconv)
This module provides tools for converting strings between different character sets (code pages).
All encodings available to iconv are supported. On GNU/Linux systems, run iconv -l for a list of encoding names.
Configuration
The xm_charconv module accepts the following directives in addition to the common module directives.
Optional directives
AutodetectCharsets: This optional directive accepts a comma-separated list of character set names. When auto is given as the source encoding, as in the convert_fields("auto", "utf-8") call shown under Examples, these character sets are tried for the conversion.
InputEncoding: If this optional directive is specified with an encoding, a data converter will be registered to convert from the specified encoding. If this directive is not specified, it defaults to UTF-8.
OutputEncoding: If this optional directive is specified with an encoding, a data converter will be registered to convert to the specified encoding. If this directive is not specified, it defaults to UTF-8.
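As a minimal sketch of how the two encoding directives combine, the hypothetical instance below would register a convert data converter that reads ISO-8859-2 and emits UTF-16LE. The instance name latin2_to_utf16 is an illustrative assumption, and both encoding names should be checked against the output of iconv -l on the target system.

```
<Extension latin2_to_utf16>
    Module          xm_charconv
    # Source encoding of the incoming data (UTF-8 if omitted)
    InputEncoding   ISO-8859-2
    # Target encoding of the converted data (UTF-8 if omitted)
    OutputEncoding  UTF-16LE
</Extension>
```

Under these assumptions, the converter would be referenced as latin2_to_utf16.convert in an InputType or OutputType directive.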
Data conversion
The xm_charconv module implements a data converter that can be used with the im_file module through its InputType directive or, as the last example shows, with om_file through OutputType. The converter is invoked using dot notation:
<InstanceName>.<DataConverterName>
Where <InstanceName> is the name given to the xm_charconv instance and <DataConverterName> is the name of the converter being invoked.
The following data converter is available:
- convert: This data converter converts data from the encoding specified in InputEncoding to the encoding specified in OutputEncoding. It should be listed in the InputType directive before the input reader function.
Examples
This configuration shows an example of character set auto-detection. The input file can contain lines with different encodings, and the module normalizes output to UTF-8.
<Extension converter>
    Module              xm_charconv
    AutodetectCharsets  utf-8, euc-jp, utf-16, utf-32, iso8859-2
</Extension>
<Input filein>
    Module  im_file
    File    "tmp/input"
    Exec    convert_fields("auto", "utf-8");
</Input>
This configuration uses the data converter registered via the InputEncoding directive to read a file with the ISO-8859-2 encoding.
<Extension converter>
    Module         xm_charconv
    InputEncoding  ISO-8859-2
</Extension>
<Input filein>
    Module     im_file
    File       "tmp/input/iso-8859-2.in"
    InputType  converter.convert
</Input>
This configuration uses the data converter together with an xm_multiline instance in InputType to read a file with the UCS-2BE encoding. Each log record in this file spans three lines.
<Extension converter>
    Module         xm_charconv
    InputEncoding  UCS-2BE
</Extension>
<Extension multiline>
    Module          xm_multiline
    FixedLineCount  3
</Extension>
<Input filein>
    Module     im_file
    File       'tmp/input/ucs-2be.in'
    InputType  converter.convert, multiline
</Input>
This configuration uses the data converter registered via the OutputEncoding directive to write log data to a file in the ISO-8859-2 encoding.
<Extension converter>
    Module          xm_charconv
    OutputEncoding  ISO-8859-2
</Extension>
<Input filein>
    Module  im_file
    File    "tmp/input/utf-8.in"
</Input>
<Output fileout>
    Module      om_file
    File        "tmp/iso-8859-2.out"
    OutputType  converter.convert
</Output>