gendict - Compiles word list into ICU string trie
dictionary
gendict [ --uchars |
--bytes --transform transform
] [ -h, -?, --help ]
[ -V, --version ] [ -c,
--copyright ] [ -v,
--verbose ] [ -i,
--icudatadir directory ]
input-file output-file
gendict reads the word list from dictionary-file and
creates a string trie dictionary file. Normally this data file has the
.dict extension.
Words begin at the beginning of a line and are terminated by the
first whitespace. Lines that begin with whitespace are ignored.
- -h, -?,
--help
- Print help about usage and exit.
- -V,
--version
- Print the version of gendict and exit.
- -c,
--copyright
- Embeds the standard ICU copyright into the output-file.
- -v,
--verbose
- Display extra informative messages during execution.
- -i,
--icudatadir directory
- Look for any necessary ICU data files in directory. For example,
the file pnames.icu must be located when ICU's data is not built as
a shared library. The default ICU data directory is specified by the
environment variable ICU_DATA. Most configurations of ICU do not
require this argument.
- --uchars
- Set the output trie type to UChar. Mutually exclusive with
--bytes.
- --bytes
- Set the output trie type to Bytes. Mutually exclusive with
--uchars.
- --transform
- Set the transform type. Should only be specified with --bytes.
Currently supported transforms are: offset-<hex-number>,
which specifies an offset to subtract from all input characters. It should
be noted that the offset transform also maps U+200D to 0xFF and U+200C to
0xFE, in order to offer compatibility to languages that require these
characters. A transform must be specified for a bytes trie, and when
applied to the non-value characters in the input-file must produce
output between 0x00 and 0xFF.
- input-file
- The source file to read.
-
output-file
- The file to write the output dictionary to.
The input-file is assumed to be encoded in UTF-8. The
integers in the input-file that are used as values must be made up of
ASCII digits. They may be specified either in hex, by using a 0x prefix, or
in decimal. Either --bytes or --uchars must be specified.
- ICU_DATA
- Specifies the directory containing ICU data. Defaults to
${prefix}/share/icu/72.1/. Some tools in ICU depend on the presence
of the trailing slash. It is thus important to make sure that it is
present if ICU_DATA is set.
Copyright (C) 2012 International Business Machines Corporation and
others
http://www.icu-project.org/userguide/boundaryAnalysis.html