Odeum - the inverted API of QDBM
#include <depot.h>
#include <cabin.h>
#include <odeum.h>
#include <stdlib.h>
typedef struct { int id; int score; } ODPAIR;
ODEUM *odopen(const char *name, int omode);
int odclose(ODEUM *odeum);
int odput(ODEUM *odeum, const ODDOC *doc, int wmax, int
over);
int odout(ODEUM *odeum, const char *uri);
int odoutbyid(ODEUM *odeum, int id);
ODDOC *odget(ODEUM *odeum, const char *uri);
ODDOC *odgetbyid(ODEUM *odeum, int id);
int odgetidbyuri(ODEUM *odeum, const char *uri);
int odcheck(ODEUM *odeum, int id);
ODPAIR *odsearch(ODEUM *odeum, const char *word, int max, int
*np);
int odsearchdnum(ODEUM *odeum, const char *word);
int oditerinit(ODEUM *odeum);
ODDOC *oditernext(ODEUM *odeum);
int odsync(ODEUM *odeum);
int odoptimize(ODEUM *odeum);
char *odname(ODEUM *odeum);
double odfsiz(ODEUM *odeum);
int odbnum(ODEUM *odeum);
int odbusenum(ODEUM *odeum);
int oddnum(ODEUM *odeum);
int odwnum(ODEUM *odeum);
int odwritable(ODEUM *odeum);
int odfatalerror(ODEUM *odeum);
int odinode(ODEUM *odeum);
time_t odmtime(ODEUM *odeum);
int odmerge(const char *name, const CBLIST *elemnames);
int odremove(const char *name);
ODDOC *oddocopen(const char *uri);
void oddocclose(ODDOC *doc);
void oddocaddattr(ODDOC *doc, const char *name, const char
*value);
void oddocaddword(ODDOC *doc, const char *normal, const char
*asis);
int oddocid(const ODDOC *doc);
const char *oddocuri(const ODDOC *doc);
const char *oddocgetattr(const ODDOC *doc, const char
*name);
const CBLIST *oddocnwords(const ODDOC *doc);
const CBLIST *oddocawords(const ODDOC *doc);
CBMAP *oddocscores(const ODDOC *doc, int max, ODEUM
*odeum);
CBLIST *odbreaktext(const char *text);
char *odnormalizeword(const char *asis);
ODPAIR *odpairsand(ODPAIR *apairs, int anum, ODPAIR *bpairs,
int bnum, int *np);
ODPAIR *odpairsor(ODPAIR *apairs, int anum, ODPAIR *bpairs, int
bnum, int *np);
ODPAIR *odpairsnotand(ODPAIR *apairs, int anum, ODPAIR *bpairs,
int bnum, int *np);
void odpairssort(ODPAIR *pairs, int pnum);
double odlogarithm(double x);
double odvectorcosine(const int *avec, const int *bvec, int
vnum);
void odsettuning(int ibnum, int idnum, int cbnum, int
csiz);
void odanalyzetext(ODEUM *odeum, const char *text, CBLIST
*awords, CBLIST *nwords);
void odsetcharclass(ODEUM *odeum, const char *spacechars, const
char *delimchars, const char *gluechars);
ODPAIR *odquery(ODEUM *odeum, const char *query, int *np,
CBLIST *errors);
Odeum is the API which handles an inverted index. An inverted
index is a data structure to retrieve a list of some documents that include
one of words which were extracted from a population of documents. It is easy
to realize a full-text search system with an inverted index. Odeum provides
an abstract data structure which consists of words and attributes of a
document. It is used when an application stores a document into a database
and when an application retrieves some documents from a database.
Odeum does not provide methods to extract the text from the
original data of a document. It should be implemented by applications.
Although Odeum provides utilities to extract words from a text, it is
oriented to such languages whose words are separated with space characters
as English. If an application handles such languages which need
morphological analysis or N-gram analysis as Japanese, or if an application
perform more such rarefied analysis of natural languages as stemming, its
own analyzing method can be adopted. Result of search is expressed as an
array contains elements which are structures composed of the ID number of
documents and its score. In order to search with two or more words, Odeum
provides utilities of set operations.
Odeum is implemented, based on Curia, Cabin, and Villa. Odeum
creates a database with a directory name. Some databases of Curia and Villa
are placed in the specified directory. For example, `casket/docs',
`casket/index', and `casket/rdocs' are created in the case that a database
directory named as `casket'. `docs' is a database directory of Curia. The
key of each record is the ID number of a document, and the value is such
attributes as URI. `index' is a database directory of Curia. The key of each
record is the normalized form of a word, and the value is an array whose
element is a pair of the ID number of a document including the word and its
score. `rdocs' is a database file of Villa. The key of each record is the
URI of a document, and the value is its ID number.
In order to use Odeum, you should include `depot.h', `cabin.h',
`odeum.h' and `stdlib.h' in the source files. Usually, the following
description will be near the beginning of a source file.
#include <depot.h>
#include <cabin.h>
#include <odeum.h>
#include <stdlib.h>
A pointer to `ODEUM' is used as a database handle. A database
handle is opened with the function `odopen' and closed with `odclose'. You
should not refer directly to any member of the handle. If a fatal error
occurs in a database, any access method via the handle except `odclose' will
not work and return error status. Although a process is allowed to use
multiple database handles at the same time, handles of the same database
file should not be used.
A pointer to `ODDOC' is used as a document handle. A document
handle is opened with the function `oddocopen' and closed with `oddocclose'.
You should not refer directly to any member of the handle. A document
consists of attributes and words. Each word is expressed as a pair of a
normalized form and a appearance form.
Odeum also assign the external variable `dpecode' with the error
code. The function `dperrmsg' is used in order to get the message of the
error code.
Structures of `ODPAIR' type is used in order to handle results of
search.
- typedef struct { int
id; int score; } ODPAIR;
- `id' specifies the ID number of a document. `score' specifies the score
calculated from the number of searching words in the document.
The function `odopen' is used in order to get a database
handle.
- ODEUM *odopen(const char
*name, int omode);
- `name' specifies the name of a database directory. `omode' specifies the
connection mode: `OD_OWRITER' as a writer, `OD_OREADER' as a reader. If
the mode is `OD_OWRITER', the following may be added by bitwise or:
`OD_OCREAT', which means it creates a new database if not exist,
`OD_OTRUNC', which means it creates a new database regardless if one
exists. Both of `OD_OREADER' and `OD_OWRITER' can be added to by bitwise
or: `OD_ONOLCK', which means it opens a database directory without file
locking, or `OD_OLCKNB', which means locking is performed without
blocking. The return value is the database handle or `NULL' if it is not
successful. While connecting as a writer, an exclusive lock is invoked to
the database directory. While connecting as a reader, a shared lock is
invoked to the database directory. The thread blocks until the lock is
achieved. If `OD_ONOLCK' is used, the application is responsible for
exclusion control.
The function `odclose' is used in order to close a database
handle.
- int odclose(ODEUM
*odeum);
- `odeum' specifies a database handle. If successful, the return value is
true, else, it is false. Because the region of a closed handle is
released, it becomes impossible to use the handle. Updating a database is
assured to be written when the handle is closed. If a writer opens a
database but does not close it appropriately, the database will be
broken.
The function `odput' is used in order to store a document.
- int odput(ODEUM *odeum,
const ODDOC *doc, int wmax, int over);
- `odeum' specifies a database handle connected as a writer. `doc' specifies
a document handle. `wmax' specifies the max number of words to be stored
in the document database. If it is negative, the number is unlimited.
`over' specifies whether the data of the duplicated document is
overwritten or not. If it is false and the URI of the document is
duplicated, the function returns as an error. If successful, the return
value is true, else, it is false.
The function `odout' is used in order to delete a document
specified by a URI.
- int odout(ODEUM *odeum,
const char *uri);
- `odeum' specifies a database handle connected as a writer. `uri' specifies
the string of the URI of a document. If successful, the return value is
true, else, it is false. False is returned when no document corresponds to
the specified URI.
The function `odoutbyid' is used in order to delete a document
specified by an ID number.
- int odoutbyid(ODEUM
*odeum, int id);
- `odeum' specifies a database handle connected as a writer. `id' specifies
the ID number of a document. If successful, the return value is true,
else, it is false. False is returned when no document corresponds to the
specified ID number.
The function `odget' is used in order to retrieve a document
specified by a URI.
- ODDOC *odget(ODEUM
*odeum, const char *uri);
- `odeum' specifies a database handle. `uri' specifies the string of the URI
of a document. If successful, the return value is the handle of the
corresponding document, else, it is `NULL'. `NULL' is returned when no
document corresponds to the specified URI. Because the handle of the
return value is opened with the function `oddocopen', it should be closed
with the function `oddocclose'.
The function `odgetbyid' is used in order to retrieve a document
by an ID number.
- ODDOC
*odgetbyid(ODEUM *odeum, int id);
- `odeum' specifies a database handle. `id' specifies the ID number of a
document. If successful, the return value is the handle of the
corresponding document, else, it is `NULL'. `NULL' is returned when no
document corresponds to the specified ID number. Because the handle of the
return value is opened with the function `oddocopen', it should be closed
with the function `oddocclose'.
The function `odgetidbyuri' is used in order to retrieve the ID of
the document specified by a URI.
- int odgetidbyuri(ODEUM
*odeum, const char *uri);
- `odeum' specifies a database handle. `uri' specifies the string the URI of
a document. If successful, the return value is the ID number of the
document, else, it is -1. -1 is returned when no document corresponds to
the specified URI.
The function `odcheck' is used in order to check whether the
document specified by an ID number exists.
- int odcheck(ODEUM *odeum,
int id);
- `odeum' specifies a database handle. `id' specifies the ID number of a
document. The return value is true if the document exists, else, it is
false.
The function `odsearch' is used in order to search the inverted
index for documents including a particular word.
- ODPAIR *odsearch(ODEUM
*odeum, const char *word, int max, int *np);
- `odeum' specifies a database handle. `word' specifies a searching word.
`max' specifies the max number of documents to be retrieve. `np' specifies
the pointer to a variable to which the number of the elements of the
return value is assigned. If successful, the return value is the pointer
to an array, else, it is `NULL'. Each element of the array is a pair of
the ID number and the score of a document, and sorted in descending order
of their scores. Even if no document corresponds to the specified word, it
is not error but returns an dummy array. Because the region of the return
value is allocated with the `malloc' call, it should be released with the
`free' call if it is no longer in use. Note that each element of the array
of the return value can be data of a deleted document.
The function `odsearchnum' is used in order to get the number of
documents including a word.
- int odsearchdnum(ODEUM
*odeum, const char *word);
- `odeum' specifies a database handle. `word' specifies a searching word. If
successful, the return value is the number of documents including the
word, else, it is -1. Because this function does not read the entity of
the inverted index, it is faster than `odsearch'.
The function `oditerinit' is used in order to initialize the
iterator of a database handle.
- int oditerinit(ODEUM
*odeum);
- `odeum' specifies a database handle. If successful, the return value is
true, else, it is false. The iterator is used in order to access every
document stored in a database.
The function `oditernext' is used in order to get the next key of
the iterator.
- ODDOC
*oditernext(ODEUM *odeum);
- `odeum' specifies a database handle. If successful, the return value is
the handle of the next document, else, it is `NULL'. `NULL' is returned
when no document is to be get out of the iterator. It is possible to
access every document by iteration of calling this function. However, it
is not assured if updating the database is occurred while the iteration.
Besides, the order of this traversal access method is arbitrary, so it is
not assured that the order of string matches the one of the traversal
access. Because the handle of the return value is opened with the function
`oddocopen', it should be closed with the function `oddocclose'.
The function `odsync' is used in order to synchronize updating
contents with the files and the devices.
- int odsync(ODEUM
*odeum);
- `odeum' specifies a database handle connected as a writer. If successful,
the return value is true, else, it is false. This function is useful when
another process uses the connected database directory.
The function `odoptimize' is used in order to optimize a
database.
- int odoptimize(ODEUM
*odeum);
- `odeum' specifies a database handle connected as a writer. If successful,
the return value is true, else, it is false. Elements of the deleted
documents in the inverted index are purged.
The function `odname' is used in order to get the name of a
database.
- char *odname(ODEUM
*odeum);
- `odeum' specifies a database handle. If successful, the return value is
the pointer to the region of the name of the database, else, it is `NULL'.
Because the region of the return value is allocated with the `malloc'
call, it should be released with the `free' call if it is no longer in
use.
The function `odfsiz' is used in order to get the total size of
database files.
- double odfsiz(ODEUM
*odeum);
- `odeum' specifies a database handle. If successful, the return value is
the total size of the database files, else, it is -1.0.
The function `odbnum' is used in order to get the total number of
the elements of the bucket arrays in the inverted index.
- int odbnum(ODEUM
*odeum);
- `odeum' specifies a database handle. If successful, the return value is
the total number of the elements of the bucket arrays, else, it is
-1.
The function `odbusenum' is used in order to get the total number
of the used elements of the bucket arrays in the inverted index.
- int odbusenum(ODEUM
*odeum);
- `odeum' specifies a database handle. If successful, the return value is
the total number of the used elements of the bucket arrays, else, it is
-1.
The function `oddnum' is used in order to get the number of the
documents stored in a database.
- int oddnum(ODEUM
*odeum);
- `odeum' specifies a database handle. If successful, the return value is
the number of the documents stored in the database, else, it is -1.
The function `odwnum' is used in order to get the number of the
words stored in a database.
- int odwnum(ODEUM
*odeum);
- `odeum' specifies a database handle. If successful, the return value is
the number of the words stored in the database, else, it is -1. Because of
the I/O buffer, the return value may be less than the hard number.
The function `odwritable' is used in order to check whether a
database handle is a writer or not.
- int odwritable(ODEUM
*odeum);
- `odeum' specifies a database handle. The return value is true if the
handle is a writer, false if not.
The function `odfatalerror' is used in order to check whether a
database has a fatal error or not.
- int odfatalerror(ODEUM
*odeum);
- `odeum' specifies a database handle. The return value is true if the
database has a fatal error, false if not.
The function `odinode' is used in order to get the inode number of
a database directory.
- int odinode(ODEUM
*odeum);
- `odeum' specifies a database handle. The return value is the inode number
of the database directory.
The function `odmtime' is used in order to get the last modified
time of a database.
- time_t odmtime(ODEUM
*odeum);
- `odeum' specifies a database handle. The return value is the last modified
time of the database.
The function `odmerge' is used in order to merge plural database
directories.
- int odmerge(const char
*name, const CBLIST *elemnames);
- `name' specifies the name of a database directory to create. `elemnames'
specifies a list of names of element databases. If successful, the return
value is true, else, it is false. If two or more documents which have the
same URL come in, the first one is adopted and the others are
ignored.
The function `odremove' is used in order to remove a database
directory.
- int odremove(const char
*name);
- `name' specifies the name of a database directory. If successful, the
return value is true, else, it is false. A database directory can contain
databases of other APIs of QDBM, they are also removed by this
function.
The function `oddocopen' is used in order to get a document
handle.
- ODDOC
*oddocopen(const char *uri);
- `uri' specifies the URI of a document. The return value is a document
handle. The ID number of a new document is not defined. It is defined when
the document is stored in a database.
The function `oddocclose' is used in order to close a document
handle.
- void oddocclose(ODDOC
*doc);
- `doc' specifies a document handle. Because the region of a closed handle
is released, it becomes impossible to use the handle.
The function `oddocaddattr' is used in order to add an attribute
to a document.
- void oddocaddattr(ODDOC
*doc, const char *name, const char *value);
- `doc' specifies a document handle. `name' specifies the string of the name
of an attribute. `value' specifies the string of the value of the
attribute.
The function `oddocaddword' is used in order to add a word to a
document.
- void oddocaddword(ODDOC
*doc, const char *normal, const char *asis);
- `doc' specifies a document handle. `normal' specifies the string of the
normalized form of a word. Normalized forms are treated as keys of the
inverted index. If the normalized form of a word is an empty string, the
word is not reflected in the inverted index. `asis' specifies the string
of the appearance form of the word. Appearance forms are used after the
document is retrieved by an application.
The function `oddocid' is used in order to get the ID number of a
document.
- int oddocid(const ODDOC
*doc);
- `doc' specifies a document handle. The return value is the ID number of a
document.
The function `oddocuri' is used in order to get the URI of a
document.
- const char
*oddocuri(const ODDOC *doc);
- `doc' specifies a document handle. The return value is the string of the
URI of a document.
The function `oddocgetattr' is used in order to get the value of
an attribute of a document.
- const char
*oddocgetattr(const ODDOC *doc, const char *name);
- `doc' specifies a document handle. `name' specifies the string of the name
of an attribute. The return value is the string of the value of the
attribute, or `NULL' if no attribute corresponds.
The function `oddocnwords' is used in order to get the list handle
contains words in normalized form of a document.
- const CBLIST
*oddocnwords(const ODDOC *doc);
- `doc' specifies a document handle. The return value is the list handle
contains words in normalized form.
The function `oddocawords' is used in order to get the list handle
contains words in appearance form of a document.
- const CBLIST
*oddocawords(const ODDOC *doc);
- `doc' specifies a document handle. The return value is the list handle
contains words in appearance form.
The function `oddocscores' is used in order to get the map handle
contains keywords in normalized form and their scores.
- CBMAP *oddocscores(const
ODDOC *doc, int max, ODEUM *odeum);
- `doc' specifies a document handle. `max' specifies the max number of
keywords to get. `odeum' specifies a database handle with which the IDF
for weighting is calculate. If it is `NULL', it is not used. The return
value is the map handle contains keywords and their scores. Scores are
expressed as decimal strings. Because the handle of the return value is
opened with the function `cbmapopen', it should be closed with the
function `cbmapclose' if it is no longer in use.
The function `odbreaktext' is used in order to break a text into
words in appearance form.
- CBLIST
*odbreaktext(const char *text);
- `text' specifies the string of a text. The return value is the list handle
contains words in appearance form. Words are separated with space
characters and such delimiters as period, comma and so on. Because the
handle of the return value is opened with the function `cblistopen', it
should be closed with the function `cblistclose' if it is no longer in
use.
The function `odnormalizeword' is used in order to make the
normalized form of a word.
- char
*odnormalizeword(const char *asis);
- `asis' specifies the string of the appearance form of a word. The return
value is is the string of the normalized form of the word. Alphabets of
the ASCII code are unified into lower cases. Words composed of only
delimiters are treated as empty strings. Because the region of the return
value is allocated with the `malloc' call, it should be released with the
`free' call if it is no longer in use.
The function `odpairsand' is used in order to get the common
elements of two sets of documents.
- ODPAIR
*odpairsand(ODPAIR *apairs, int anum, ODPAIR *bpairs, int bnum, int
*np);
- `apairs' specifies the pointer to the former document array. `anum'
specifies the number of the elements of the former document array.
`bpairs' specifies the pointer to the latter document array. `bnum'
specifies the number of the elements of the latter document array. `np'
specifies the pointer to a variable to which the number of the elements of
the return value is assigned. The return value is the pointer to a new
document array whose elements commonly belong to the specified two sets.
Elements of the array are sorted in descending order of their scores.
Because the region of the return value is allocated with the `malloc'
call, it should be released with the `free' call if it is no longer in
use.
The function `odpairsor' is used in order to get the sum of
elements of two sets of documents.
- ODPAIR
*odpairsor(ODPAIR *apairs, int anum, ODPAIR *bpairs, int bnum, int
*np);
- `apairs' specifies the pointer to the former document array. `anum'
specifies the number of the elements of the former document array.
`bpairs' specifies the pointer to the latter document array. `bnum'
specifies the number of the elements of the latter document array. `np'
specifies the pointer to a variable to which the number of the elements of
the return value is assigned. The return value is the pointer to a new
document array whose elements belong to both or either of the specified
two sets. Elements of the array are sorted in descending order of their
scores. Because the region of the return value is allocated with the
`malloc' call, it should be released with the `free' call if it is no
longer in use.
The function `odpairsnotand' is used in order to get the
difference set of documents.
- ODPAIR
*odpairsnotand(ODPAIR *apairs, int anum, ODPAIR *bpairs, int bnum, int
*np);
- `apairs' specifies the pointer to the former document array. `anum'
specifies the number of the elements of the former document array.
`bpairs' specifies the pointer to the latter document array of the sum of
elements. `bnum' specifies the number of the elements of the latter
document array. `np' specifies the pointer to a variable to which the
number of the elements of the return value is assigned. The return value
is the pointer to a new document array whose elements belong to the former
set but not to the latter set. Elements of the array are sorted in
descending order of their scores. Because the region of the return value
is allocated with the `malloc' call, it should be released with the `free'
call if it is no longer in use.
The function `odpairssort' is used in order to sort a set of
documents in descending order of scores.
- void odpairssort(ODPAIR
*pairs, int pnum);
- `pairs' specifies the pointer to a document array. `pnum' specifies the
number of the elements of the document array.
The function `odlogarithm' is used in order to get the natural
logarithm of a number.
- double
odlogarithm(double x);
- `x' specifies a number. The return value is the natural logarithm of the
number. If the number is equal to or less than 1.0, the return value is
0.0. This function is useful when an application calculates the IDF of
search results.
The function `odvectorcosine' is used in order to get the cosine
of the angle of two vectors.
- double
odvectorcosine(const int *avec, const int *bvec, int vnum);
- `avec' specifies the pointer to one array of numbers. `bvec' specifies the
pointer to the other array of numbers. `vnum' specifies the number of
elements of each array. The return value is the cosine of the angle of two
vectors. This function is useful when an application calculates similarity
of documents.
The function `odsettuning' is used in order to set the global
tuning parameters.
- void odsettuning(int
ibnum, int idnum, int cbnum, int csiz);
- `ibnum' specifies the number of buckets for inverted indexes. `idnum'
specifies the division number of inverted index. `cbnum' specifies the
number of buckets for dirty buffers. `csiz' specifies the maximum bytes to
use memory for dirty buffers. The default setting is equivalent to
`odsettuning(32749, 7, 262139, 8388608)'. This function should be called
before opening a handle.
The function `odanalyzetext' is used in order to break a text into
words and store appearance forms and normalized form into lists.
- void
odanalyzetext(ODEUM *odeum, const char *text, CBLIST *awords, CBLIST
*nwords);
- `odeum' specifies a database handle. `text' specifies the string of a
text. `awords' specifies a list handle into which appearance form is
store. `nwords' specifies a list handle into which normalized form is
store. If it is `NULL', it is ignored. Words are separated with space
characters and such delimiters as period, comma and so on.
The function `odsetcharclass' is used in order to set the classes
of characters used by `odanalyzetext'.
- void
odsetcharclass(ODEUM *odeum, const char *spacechars, const char *delimchars,
const char *gluechars);
- `odeum' specifies a database handle. `spacechars' spacifies a string
contains space characters. `delimchars' spacifies a string contains
delimiter characters. `gluechars' spacifies a string contains glue
characters.
The function `odquery' is used in order to query a database using
a small boolean query language.
- ODPAIR
*odquery(ODEUM *odeum, const char *query, int *np, CBLIST
*errors);
- `odeum' specifies a database handle. 'query' specifies the text of the
query. `np' specifies the pointer to a variable to which the number of the
elements of the return value is assigned. `errors' specifies a list handle
into which error messages are stored. If it is `NULL', it is ignored. If
successful, the return value is the pointer to an array, else, it is
`NULL'. Each element of the array is a pair of the ID number and the score
of a document, and sorted in descending order of their scores. Even if no
document corresponds to the specified condition, it is not error but
returns an dummy array. Because the region of the return value is
allocated with the `malloc' call, it should be released with the `free'
call if it is no longer in use. Note that each element of the array of the
return value can be data of a deleted document.
If QDBM was built with POSIX thread enabled, the global variable
`dpecode' is treated as thread specific data, and functions of Odeum are
reentrant. In that case, they are thread-safe as long as a handle is not
accessed by threads at the same time, on the assumption that `errno',
`malloc', and so on are thread-safe.
If QDBM was built with ZLIB enabled, records in the database for
document attributes are compressed. In that case, the size of the database
is reduced to 30% or less. Thus, you should enable ZLIB if you use Odeum. A
database of Odeum created without ZLIB enabled is not available on
environment with ZLIB enabled, and vice versa. If ZLIB was not enabled but
LZO, LZO is used instead.
The query language of the function `odquery' is a basic language
following this grammar:
expr ::= subexpr ( op subexpr )*
subexpr ::= WORD
subexpr ::= LPAREN expr RPAREN
Operators are "&" (AND), "|" (OR), and
"!" (NOTAND). You can use parenthesis to group sub-expressions
together in order to change order of operations. The given query is broken
up using the function `odanalyzetext', so if you want to specify different
text breaking rules, then make sure that you at least set "&",
"|", "!", "(", and ")" to be
delimiter characters. Consecutive words are treated as having an implicit
"&" operator between them, so "zed shaw" is actually
"zed & shaw".
The encoding of the query text should be the same with the
encoding of target documents. Moreover, each of space characters, delimiter
characters, and glue characters should be single byte.