General Commands Manual

NAME

ucto - Unicode Tokenizer

SYNOPSIS

ucto [options] [input-file] [output-file]

DESCRIPTION

ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. It is preconfigured with tokenization rules for several languages.
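As a rough illustration of the synopsis, the following Python sketch assembles a ucto command line and runs it only if the executable is found. The helper names build_ucto_command and run_ucto and the file names are hypothetical; only the -L and -n options and the argument order are taken from this page. It is also an assumption that the tokenized text appears on standard output when no output file is given.

```python
import shutil
import subprocess


def build_ucto_command(input_file, output_file=None, language=None,
                       sentence_per_line=False):
    """Assemble an argv list in the order given by SYNOPSIS (hypothetical helper)."""
    cmd = ["ucto"]
    if language is not None:
        cmd += ["-L", language]   # -L: select a configuration by language code
    if sentence_per_line:
        cmd.append("-n")          # -n: emit one sentence per line on output
    cmd.append(input_file)
    if output_file is not None:
        cmd.append(output_file)
    return cmd


def run_ucto(cmd):
    """Run the command if ucto is installed and return the captured standard output.

    Assumption: without an output file, the tokenized text is written to stdout.
    """
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("ucto executable not found on PATH")
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_ucto_command("input.txt", "output.txt",
                             language="fra", sentence_per_line=True))
```

Building the argument list separately from running it keeps the sketch usable on systems where ucto is not installed.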

OPTIONS

-c configfile

read settings from a file

-d value

set debug mode to 'value'

-e value

Set the input encoding (default: UTF8).

-N value

Set the UTF8 output normalization (default: NFC).

--filter=[YES|NO]

Enable or disable filtering of special characters (default: YES). These special characters can be specified in the [FILTER] block of the configuration file.

-f

Obsolete. Use --filter=NO instead.

-L language

Automatically select a configuration file by language code. The language code is generally a three-letter ISO 639-3 code. For example, 'fra' will select the file tokconfig-fra from the installation directory.

--detectlanguages=<lang1,lang2,..langn>

try to detect all the specified languages. The default language will be 'lang1'. (only useful for FoLiA output)

-l

Convert to all lowercase

-u

Convert to all uppercase

-n

Emit one sentence per line on output

-m

Assume one sentence per line on input

--normalize=class1,class2,..,classn

Map all occurrences of tokens with class1,...,classn to their generic names. For example, --normalize=DATE will map all dates to the word {{DATE}}. This is useful for normalizing tokens such as URLs, dates, e-mail addresses and so on.

--add-tokens="file"

Add additional tokens to the [TOKENS] block of the default language. The file should contain one TOKEN per line.

--passthru

Don't tokenize, but perform input decoding and simple token role detection

--filterpunct

Remove most of the punctuation from the output (but not from abbreviations or embedded punctuation, as in John's).

-P

Disable Paragraph Detection

-Q

Enable Quote Detection. (this is experimental and may lead to unexpected results)

-s <string>

Set End‐of‐sentence marker. (Default <utt>)

-V

Show version information

-v

set Verbose mode

-F

Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most other options: -nPQvs) For files with an '.xml' extension, -F is the default.

--inputclass="cls"

When tokenizing a FoLiA XML document, search for text nodes of class 'cls'. The default is "current".

--outputclass="cls"

When tokenizing a FoLiA XML document, output the tokenized text in text nodes with 'cls'. The default is "current". It is recommended to have different classes for input and output.

--textclass="cls"(obsolete)

Use 'cls' for both input and output of text from FoLiA. Equivalent to specifying both --inputclass='cls' and --outputclass='cls'.

This option is obsolete and NOT recommended. Please use the separate --inputclass= and --outputclass= options.

-X

Output FoLiA XML. (this disables usage of most other options: -nPQvs)

--id <DocId>

Use the specified Document ID for the FoLiA XML

-x <DocId> (obsolete)

Output FoLiA XML, use the specified Document ID. (this disables usage of most other options: -nPQvs).

Obsolete. Use -X and --id instead.
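To make the effect of the --normalize option described above more concrete, here is a conceptual Python sketch of the mapping it describes. This is not ucto's implementation: the function normalize_tokens, the (token, class) pairs and the WORD class name are hypothetical, and only the DATE class and the {{DATE}} output form come from the option description.

```python
def normalize_tokens(tokens, classes):
    """Replace every token whose class is listed in `classes` by {{CLASS}}.

    `tokens` is a list of (text, class) pairs. This only mimics the mapping
    described for --normalize; it is not how ucto works internally.
    """
    wanted = set(classes)
    return ["{{" + cls + "}}" if cls in wanted else text
            for text, cls in tokens]


# Hypothetical input; only the DATE class name comes from the option description.
tokens = [("Meeting", "WORD"), ("on", "WORD"), ("2018-11-13", "DATE")]
print(" ".join(normalize_tokens(tokens, ["DATE"])))  # prints: Meeting on {{DATE}}
```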

BUGS

likely

AUTHORS

Maarten van Gompel proycon@anaproy.nl

Ko van der Sloot Timbl@uvt.nl

2018 nov 13