btparse::doc::bt_split_names(3) | btparse | btparse::doc::bt_split_names(3) |
bt_split_names - splitting up BibTeX names and lists of names
   bt_stringlist * bt_split_list (char * string, char * delim, char * filename, int line, char * description);
   void bt_free_list (bt_stringlist *list);
   bt_name * bt_split_name (char * name, char * filename, int line, int name_num);
   void bt_free_name (bt_name * name);
When BibTeX files are used for their original purpose---bibliographic entries describing scholarly publications---processing lists of names (authors and editors mostly) becomes important. Although such name-processing is outside the general-purpose database domain of most of the btparse library, these splitting functions are provided as a concession to reality: most BibTeX data files use the BibTeX conventions for author names, and a library to process that data ought to be capable of processing the names.
Name-processing comes in two stages: first, split up a list of names into individual strings; second, split up each name into "parts" (first, von, last, and jr). The first is actually quite general: you could pick a delimiter (such as 'and', used for lists of names) and use it to divide any string into substrings. "bt_split_list()" could then be called to break up the original string and extract the substrings. "bt_split_name()", however, is quite specific to four-part author names written using BibTeX conventions. (These conventions are described informally in any BibTeX documentation; the description you will find here is more formal and algorithmic---and thus harder to understand.)
See bt_format_names for information on turning split-up names back into strings in a variety of ways.
bt_stringlist * bt_split_list (char * string, char * delim, char * filename, int line, char * description)
Splits "string" into substrings delimited by "delim" (a fixed string). The splitting is done according to the rules used by BibTeX for splitting up a list of names, in particular:
For instance, if the delimiter is "and", then the string
Candy and Apples AnD {Green Eggs and Ham}
splits into three substrings: "Candy", "Apples", and "{Green Eggs and Ham}".
If there are extra delimiters at the extremities of the string---say, an "and" at the beginning of the string---then they are included in the first/last string; no warning is currently printed, but this may change. Successive delimiters ("and and") result in a warning and a NULL string being added to the list of substrings. For instance, the string
and Joe Q. Blow and and Smith, Jr., John
would split into three substrings: "and Joe Q. Blow", NULL (a null pointer, not the four-character string), and "Smith, Jr., John".
(If these rules seem somewhat odd, don't blame me: I just implemented BibTeX's observed behaviour and added warning messages for one of the more obvious and easily-detected mistakes.)
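The following is a minimal sketch of the scanning idea behind these rules, not btparse's actual implementation. It assumes three things inferred from the "Candy and Apples" example: the delimiter is matched without regard to case, it only counts when it has whitespace on both sides (otherwise "Candy" itself would be split), and it only counts at brace-level zero. The function names are hypothetical, and the sketch only counts substrings; it prints no warnings.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return nonzero if `delim` occurs at s[pos], compared case-insensitively. */
int sketch_delim_matches_at(const char *s, size_t pos, const char *delim)
{
    size_t i;

    for (i = 0; delim[i] != '\0'; i++) {
        if (s[pos + i] == '\0')
            return 0;                       /* string ended before delimiter did */
        if (tolower((unsigned char) s[pos + i]) !=
            tolower((unsigned char) delim[i]))
            return 0;
    }
    return 1;
}

/*
 * Count the substrings `s` would be divided into (hypothetical helper).
 * A delimiter counts only at brace-level zero and only with whitespace
 * immediately before and after it, so a delimiter at the very start or
 * very end of the string never counts and stays in the first/last piece.
 */
int sketch_count_substrings(const char *s, const char *delim)
{
    size_t dlen, i;
    int depth = 0;
    int count = 1;

    assert(s != NULL && delim != NULL && delim[0] != '\0');
    dlen = strlen(delim);

    for (i = 0; s[i] != '\0'; i++) {
        if (s[i] == '{') {
            depth++;
        } else if (s[i] == '}') {
            if (depth > 0)
                depth--;
        } else if (depth == 0 && i > 0
                   && isspace((unsigned char) s[i - 1])
                   && sketch_delim_matches_at(s, i, delim)
                   && isspace((unsigned char) s[i + dlen])) {
            count++;
            i += dlen - 1;                  /* loop increment steps past the delimiter */
        }
    }
    return count;
}
```

Under those assumptions, both example strings above yield a count of three; in the second one the middle piece is the empty one that "bt_split_list()" reports as NULL.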
The substrings are returned as a "bt_stringlist" structure:
   typedef struct
   {
      char *  string;
      int     num_items;
      char ** items;
   } bt_stringlist;
There is currently no elegant interface to this structure: you just have to poke around in it yourself. The fields are:
"filename", "line", and "description" are all used for generating warning messages. "filename" and "line" simply describe where the string came from, and "description" is a brief (one word) description of the substrings. For instance, if you are splitting a list of names, supply "name" for "description"---that way, warnings will refer to "name X" rather than "substring x".
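Since the structure has to be inspected by hand, a short sketch of walking it may help. The typedef below is copied from the one shown above so that the example compiles on its own; real code would presumably include btparse.h instead. The helper name "sketch_print_stringlist" and the 1-based numbering in its output are illustrative choices, not part of btparse. Note the NULL check, which matters because successive delimiters produce NULL items.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Copied from the typedef in this page; normally supplied by the library header. */
typedef struct
{
    char *  string;
    int     num_items;
    char ** items;
} bt_stringlist;

/*
 * Print every item in `list`, labelled with `description`, and return the
 * number of non-NULL items.  (Hypothetical helper, not a btparse function.)
 */
int sketch_print_stringlist(const bt_stringlist *list, const char *description)
{
    int i;
    int present = 0;

    assert(list != NULL && description != NULL);

    for (i = 0; i < list->num_items; i++) {
        if (list->items[i] == NULL) {
            /* an empty slot, e.g. from "and and" in the input */
            printf("%s %d: (empty)\n", description, i + 1);
        } else {
            printf("%s %d: %s\n", description, i + 1, list->items[i]);
            present++;
        }
    }
    return present;
}
```

With a list obtained from "bt_split_list()", the call would look like sketch_print_stringlist (list, "name"), followed by bt_free_list (list) once the items are no longer needed.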
void bt_free_list (bt_stringlist *list)
Frees a "bt_stringlist" structure as returned by "bt_split_list()". That is, it frees the copy of the string you passed to "bt_split_list()", and then frees the structure itself.
bt_name * bt_split_name (char * name, char * filename, int line, int name_num)
Splits a single BibTeX-style author name into four parts: first, von, last, and jr. This can handle almost all names in the style of the major Western European languages, but not quite. (Alas!)
A name is split by first dividing into tokens; tokens are separated by whitespace or commas at brace-level zero. Thus the name
van der Graaf, Horace Q.
has five tokens, whereas the name
{Foo, Bar, and Sons}
consists of a single token.
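The tokenizing rule can be illustrated with a small counting sketch. This is only an approximation written for this page, with a hypothetical function name; it is not the code btparse uses, and it does nothing about unbalanced braces beyond refusing to let the depth go negative.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Count the tokens in `name`, where tokens are separated by whitespace or
 * commas at brace-level zero.  (Hypothetical helper for illustration.)
 */
int sketch_count_tokens(const char *name)
{
    int depth = 0;       /* current brace nesting level */
    int count = 0;       /* tokens seen so far */
    int in_token = 0;    /* nonzero while inside a token */
    size_t i;

    assert(name != NULL);

    for (i = 0; name[i] != '\0'; i++) {
        unsigned char c = (unsigned char) name[i];

        if (c == '{')
            depth++;
        else if (c == '}' && depth > 0)
            depth--;

        if (depth == 0 && (isspace(c) || c == ',')) {
            in_token = 0;                   /* separator at brace-level zero */
        } else if (!in_token) {
            in_token = 1;                   /* first character of a new token */
            count++;
        }
    }
    return count;
}
```

Applied to the two names above, the sketch counts five tokens for the first and one for the second.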
How tokens are divided into parts depends on the form of the name. If the name has no commas at brace-level zero (as in the second example), then it is assumed to be in either "first last" or "first von last" form. If there are no tokens that start with a lower-case letter, then "first last" form is assumed: the final token is the last name, and all other tokens form the first name. Otherwise, the earliest contiguous sequence of tokens with initial lower-case letters is taken as the `von' part; if this sequence includes the final token, then a warning is printed and the final token is forced to be the `last' part.
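As a rough sketch of the no-comma rule, the function below takes an already-tokenized name and reports where the `von' part begins and ends. It assumes that tokens before the `von' part form the first name and tokens after it form the last name, and it tests only the first character of each token with islower(). The type and function names are hypothetical, and no warning is printed when the final token is forced into the `last' part.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Result of the sketch: first = tokens [0, von_start),
 * von = tokens [von_start, von_end), last = tokens [von_end, n).
 */
typedef struct
{
    int von_start;
    int von_end;
} sketch_split;

/* Partition the n tokens of a name that has no commas at brace-level zero. */
sketch_split sketch_split_no_comma(const char *const *tokens, int n)
{
    sketch_split r;
    int i = 0;

    assert(tokens != NULL && n > 0);

    /* look for the earliest token with an initial lower-case letter */
    while (i < n && !islower((unsigned char) tokens[i][0]))
        i++;

    if (i == n) {
        /* "first last" form: no von part, final token is the last name */
        r.von_start = r.von_end = n - 1;
        return r;
    }

    /* "first von last" form: take the contiguous lower-case run as von */
    r.von_start = i;
    while (i < n && islower((unsigned char) tokens[i][0]))
        i++;
    r.von_end = i;

    /* the run reached the final token: force that token to be the last name */
    if (r.von_end == n)
        r.von_end = n - 1;

    return r;
}
```

For the illustrative token list "Ludwig", "van", "Beethoven" this gives a `von' part consisting of the single token at index 1.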
If a name has a single comma, then it is assumed to be in "von last, first" form. A leading sequence of tokens with initial lower-case letters, if any, forms the `von' part; tokens between the `von' and the comma form the `last' part; tokens following the comma form the `first' part. Again, if no tokens follow a leading sequence of lower-case tokens, a warning is printed and the token immediately preceding the comma is taken to be the `last' part.
If a name has more than two commas, a warning is printed and the name is treated as though only the first two commas were present.
Finally, if a name has two commas, it is assumed to be in "von last, jr, first" form. (This is the only way to represent a name with a `jr' part.) The parsing of the name is the same as for a one-comma name, except that tokens between the two commas are taken to be the `jr' part.
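The choice among the three forms depends only on the number of commas at brace-level zero, which the following sketch makes explicit. The enum and function names are hypothetical and exist only for this illustration; unlike "bt_split_name()", the sketch prints no warning when there are more than two commas.

```c
#include <assert.h>
#include <stddef.h>

/* The three name forms described above (names invented for this sketch). */
typedef enum
{
    SKETCH_FIRST_VON_LAST,       /* no commas: "first last" or "first von last" */
    SKETCH_VON_LAST_FIRST,       /* one comma */
    SKETCH_VON_LAST_JR_FIRST     /* two commas (extra commas are ignored) */
} sketch_form;

/* Count the commas in `name` that lie at brace-level zero. */
int sketch_top_level_commas(const char *name)
{
    int depth = 0;
    int commas = 0;
    size_t i;

    assert(name != NULL);

    for (i = 0; name[i] != '\0'; i++) {
        if (name[i] == '{')
            depth++;
        else if (name[i] == '}' && depth > 0)
            depth--;
        else if (name[i] == ',' && depth == 0)
            commas++;
    }
    return commas;
}

/* Decide which form a name is in, treating more than two commas as two. */
sketch_form sketch_name_form(const char *name)
{
    int commas = sketch_top_level_commas(name);

    if (commas == 0)
        return SKETCH_FIRST_VON_LAST;
    if (commas == 1)
        return SKETCH_VON_LAST_FIRST;
    return SKETCH_VON_LAST_JR_FIRST;
}
```

For example, "van der Graaf, Horace Q." is classified as "von last, first", while "{Foo, Bar, and Sons}" has no commas at brace-level zero and so falls under the no-comma forms.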
The one case not properly handled by BibTeX name conventions is a name with a 'jr' part not separated from the last name by a comma; for example:
   Henry Ford Jr.
   George Herbert Walker Bush III
Both of these would be incorrectly interpreted by both BibTeX and bt_split_name(): the "Jr." or "III" token would be taken as the last name, and the other tokens as a two- or four-part first name. The workaround is to shoehorn the 'jr' into the last name:
   Henry {Ford Jr.}
   George Herbert Walker {Bush III}
but this will make it impossible to extract the last name on its own, e.g. to generate "author-year" style citations. This design flaw may be fixed in a future version of btparse.
The split-up name is returned as a "bt_name" structure:
   typedef struct
   {
      bt_stringlist * tokens;
      char ** parts[BT_MAX_NAMEPARTS];
      int     part_len[BT_MAX_NAMEPARTS];
   } bt_name;
Again, there's no nice interface to this structure; you'll just have to access the fields individually. They are:
   for (i = 0; i < name->part_len[BTN_FIRST]; i++)
   {
      printf ("token %d of first name: %s\n", i, name->parts[BTN_FIRST][i]);
   }
void bt_free_name (bt_name * name)
Frees the "bt_name" structure created by "bt_split_name()" (including the "bt_stringlist" structure inside the "bt_name").
btparse, bt_format_names
Greg Ward <gward@python.net>
2023-01-30 | btparse, version 0.89 |