SGML markup of dictionaries with special reference to comparative and etymological data

Jeff Good and Ronald Sprouse
Comparative Bantu Online Dictionary and UC Berkeley

Paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.

Abstract: One of the main goals of The Comparative Bantu Online Dictionary (CBOLD) is to develop a comparative electronic database of Bantu dictionaries and word lists which will allow researchers to examine data quickly across a large number of languages. To that end, we have been developing standards which will allow our sources to be compared with each other even though they were not originally designed for such a purpose. These involve a system for maintaining a distinction between original data and data we have added to a source, the expansion of existing SGML DTDs for detailed linguistic markup, the use of a <preface> element to store reference information with each of our sources, and the creation of a language metadata element where we can specify which languages the source's data covers as well as state all major alternate names for the relevant languages.

1. Introduction

The Comparative Bantu Online Dictionary (CBOLD) has the goal of producing a lexicographic database that will be a community resource to support and enhance the theoretical, descriptive, and historical linguistic study of the Bantu languages. The major parts of this endeavor are: (i) collection of primary source data, mostly in the form of dictionaries and word lists, from both print and electronic sources; (ii) conversion of this data into machine-readable form if necessary; (iii) creation of systems in which different data sources can be usefully compared; (iv) analysis and annotation of the data CBOLD has collected. In addition, it is our goal that the database will be modularly expandable by other researchers.

The focus of this paper is on plans for implementing part (iii) of the project. The general issues we are attempting to resolve are not peculiar to Bantu linguistics, and any project with comparative goals will face similar problems. In particular our work has relevance for the following issues in electronic language documentation and description: (1) the problem of analyzing and annotating a pre-existing source while maintaining a clear distinction between the original source data and what has been added; (2) the SGML markup of dictionaries and lexicons; (3) the integration of a comparative word list within a pre-existing dictionary or lexicon; (4) the storage of reference data with a dictionary; and, finally, (5) the creation of a metadata standard that allows researchers to find data on the language they are interested in without having to ensure that the name they use for the language matches ours. We address each of these issues in the order listed above.

2. Adding data to a pre-existing source

CBOLD considers itself a linguistic resource. As such, in converting data into standardized formats, we place a high priority on retaining grammatical and lexical information; we place a lower priority on retaining typesetting and other stylistic information in the original texts. In addition to retaining source information, we often create new data from the original dictionary through our own linguistic analysis. These additions might include an uncontroversial parsing of a headword into its component morphemes, or they could involve the more complicated process of finding an etymological reconstruction for a particular dictionary entry. Whatever kind of new data we create, it is important to incorporate it into the original source in order to make all the information we have on a language as widely available as possible. At the same time, we want to maintain a distinction between our version of the original source and any data we might add to it.

Accordingly, we have developed two SGML attributes which can be used to modify any element in a document. The first is what we are calling the provenance attribute. We use this to specify where the information in our source came from. The default is that it was present in the original electronic version of the source. Other possible values are the names of any organizations which created the data--this is typically "CBOLD" in our sources, but the system is designed to allow data from any contributor to be distinguished from any other. The second attribute we use is the technique attribute. This is used to specify how the data that was added to the source was created. We currently use only two values for this attribute: "manual" and "automatic". These distinguish between data directly entered by a person and data created by a computational process. Entity definitions for these attributes can be found at
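The separation these two attributes make possible can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entry is a hypothetical, well-formed fragment (so that Python's standard XML parser can read it), the attribute spellings "provenance" and "technique" and the treatment of unmarked elements as original are assumed rather than taken from the CBOLD entity definitions, and the function name is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry for illustration only. Elements without a provenance
# attribute are assumed to come from the original source.
ENTRY = """<entry>
  <form><orth>headword</orth></form>
  <gramGrp><pos>n</pos></gramGrp>
  <def>gloss from the original dictionary</def>
  <etym provenance="CBOLD" technique="manual">etymology added by hand</etym>
  <note provenance="CBOLD" technique="automatic">morpheme parse</note>
</entry>"""


def split_by_provenance(entry_xml, default="original"):
    """Group the top-level elements of an entry by (provenance, technique)."""
    groups = {}
    for child in ET.fromstring(entry_xml):
        key = (child.get("provenance", default), child.get("technique"))
        groups.setdefault(key, []).append(child.tag)
    return groups


groups = split_by_provenance(ENTRY)
for key, tags in groups.items():
    print(key, tags)
```

A process that wants only the source as originally published would keep the group keyed by the default value and discard the rest.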

3. SGML markup of dictionaries and lexicons

3.1 The Text Encoding Initiative standards

The Guidelines for Electronic Text Encoding and Interchange (1999), known as TEI P3, offer a comprehensive set of SGML elements and attributes for encoding dictionary entries. These guidelines are designed to electronically encode a wide variety of lexical information across a range of possible dictionary formats. The guidelines were produced with standard types of Western dictionaries in mind, including bilingual dictionaries. The TEI P3 dictionary standards have been broadly sufficient for our needs: the majority of the needs we have had for dictionary markup were anticipated in the guidelines. However, we have extended them as we fully convert our dictionaries to SGML format. These modifications are discussed in section 3.2. A full description of the TEI P3 dictionary standards is beyond the scope of this paper. Here we will mention how two elements, <xr> and <etym>, have been particularly useful to us.

The <xr> element contains data which refers the reader to some other location within its own, or another, document. Similar tags, like <xref> and <xptr>, can only be used to refer to other locations within a document--therefore, they are not as valuable for creating comparative data as <xr>.

A second important element is <etym>. It is used for marking etymological information. A tag for etymological information would, of course, be a basic part of any scheme for dictionary markup. Its importance for the creation of electronic sources lies in how it can be combined with <xr> in order to "link" an entry in one source to an entry in a different source when the two entries are etymologically related. At CBOLD, the specific implementation of this is to put an <xr> reference in the <etym> field of a dictionary entry which points to an entry in a standard list of Bantu reconstructions. We could also use such a tag to link related entries in sources of attested languages, though we have not specifically implemented this.

An example of how this works can be seen by examining the two marked up entries located at The first is a marked up entry from a dictionary of Ganda (Snoxall 1967) and the second is a marked up entry from a list of Bantu reconstructions (Guthrie 1967). The <xr> field of the Ganda entry refers the reader to the entry for the reconstruction. In order to make this pointing as precise as possible, we have added two attributes to <xr> which are not defined in the TEI P3 guidelines. The first is a filename attribute, which specifies the name of the file with the marked up reconstructions, and the second is extid, which we use to point to a unique identifier in the document specified in filename. We have also put specific text in the <etym> portion of the entry which is intended for the human reader--our automatic systems refer to the filename and extid attributes when processing comparative information. (Note the use of the provenance and technique attributes, which were mentioned above, in the Ganda entry.) Although our use of <xr> is fairly limited, we believe it is a powerful element in electronic linguistic documentation since it can be used to refer any two documents to each other--including linking sound files to a text document. TEI P3 also has defined elements for marking the timing of the speech in texts. Thus, much of the architecture already exists to electronically document and link recordings.
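How an automatic system might follow such a pointer can be sketched as below. The sketch makes several assumptions that are not from the CBOLD files: both documents are hypothetical, well-formed fragments held in memory rather than on disk, the target entries carry their unique identifier in an attribute named "id", the forms are placeholders rather than real reconstructions, and the function name is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the file of reconstructions; forms are placeholders.
DOCUMENTS = {
    "reconstructions.xml": """<list>
  <entry id="r1"><form>(reconstructed form 1)</form><def>dog</def></entry>
  <entry id="r2"><form>(reconstructed form 2)</form><def>tree</def></entry>
</list>""",
}

# Hypothetical dictionary entry whose <etym> holds an <xr> pointer.
DICTIONARY_ENTRY = """<entry>
  <form><orth>headword</orth></form>
  <def>dog</def>
  <etym provenance="CBOLD" technique="manual">
    <xr filename="reconstructions.xml" extid="r1">text for the human reader</xr>
  </etym>
</entry>"""


def resolve_xr(entry_xml, documents):
    """Return the entries that the <xr> elements inside <etym> point to."""
    targets = []
    for xr in ET.fromstring(entry_xml).iterfind("./etym/xr"):
        # filename selects the external document, extid the entry within it.
        target_doc = ET.fromstring(documents[xr.get("filename")])
        for candidate in target_doc.iter("entry"):
            if candidate.get("id") == xr.get("extid"):
                targets.append(candidate)
    return targets


targets = resolve_xr(DICTIONARY_ENTRY, DOCUMENTS)
for target in targets:
    print(target.get("id"), target.findtext("form"), target.findtext("def"))
```

The text content of <xr> is ignored by the lookup, which mirrors the division of labor described above: the text is for the human reader, the attributes are for the machine.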

3.2 Modifications to the TEI P3 standards

3.2.1 Expanding and refining the labelling of grammatical categories

An area where the TEI P3 specifications need extension is in the types of grammatical information allowed to be specified in an entry. This is not surprising given TEI P3's apparent focus on encoding dictionaries of major Western languages. There is a basic <gram> element for grammatical information and a set of possible tags and attributes which cover basic grammatical categories. The tag set is inadequate for CBOLD's needs in two ways. First, the categories defined in the guidelines are limited to the following inflectional elements: <gen> (for gender), <number>, <case>, <per> (for person), <tns> (for tense), <mood>, and <itype> (for "inflectional class"). These categories do not cover important areas of Bantu morphosyntax, including subject- and object-marking of verbs and noun class, though gender closely approximates the latter (for more on this, see section 3.2.2). Clearly, researchers working on other language families would need to add categories of their own.

More seriously, there is a fundamental problem with the <gram> element in the TEI P3 standards for our purposes--it lacks any sense of internal structure to its grammatical categories. For example, nowhere in the TEI P3 standards is it encoded that <tns> and <mood> are more closely related grammatical categories than, say, <tns> and <case>. As CBOLD marks up data from more and more languages, it will be important to allow researchers to do complex searches sensitive to fine grammatical details. Their task will be made much easier if grammatical information is organized in a way that reflects the general typological properties of language. We have not yet tackled this problem. However, if the general linguistics community wants to take advantage of the TEI P3 standard, then this deficiency will have to be addressed.

3.2.2 The use of general and family-specific linguistic terminology

A similar, but more readily solvable, problem has to do with the labelling of grammatical categories. Here, CBOLD's work with Bantu provides a good example. Bantuists tend to use the term "noun class" to refer to groups of nouns which show similar behavior with respect to their morphosyntactic agreement patterns. This term is roughly equivalent to the more general term "gender". In marking our dictionaries, we prefer to use an element <nclass> rather than <gen> since the primary users of our sources will be Bantuists. However, we do not want our use of this family-specific term to make it difficult for non-Bantuists to search or interpret our data. Therefore, we mark the noun class/gender elements with the SGML entity <%nclass> instead of with <nclass> or <gen>. A given researcher can then use either of the two entity definitions which can be seen at depending on which notation is more convenient for them. The DTD definition for the nclass is given at (One advantage of this solution has been that we have been able to give a narrower definition to the <nclass> element than TEI P3 gives to the <gen> element--which is useful for us when we do error-checking.) We believe that a solution like this will be of use to any group which wants to mark its sources directly with some traditional terminology but which, at the same time, wants to identify the family-specific terms with more commonly used terms.
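The effect of swapping between the two entity definitions can be imitated in Python with plain string substitution. This is a stand-in for what an SGML parser does when it expands an entity, not a use of one, and everything in it is assumed for illustration: the placeholder syntax "%nclass;", the two definition tables, the sample class value, and the function name.

```python
import xml.etree.ElementTree as ET

# Stand-ins for the two alternative entity definitions: each maps the shared
# entity name to the element name that a given audience prefers.
BANTUIST = {"nclass": "nclass"}
GENERAL = {"nclass": "gen"}

# "%nclass;" marks where the entity reference would be expanded.
# The class value is illustrative only.
TEMPLATE = "<gramGrp><pos>n</pos><%nclass;>9/10</%nclass;></gramGrp>"


def expand(template, definitions):
    """Replace each %name; placeholder with the element name chosen for it."""
    for name, replacement in definitions.items():
        template = template.replace("%" + name + ";", replacement)
    return template


bantuist_view = expand(TEMPLATE, BANTUIST)
general_view = expand(TEMPLATE, GENERAL)
print(bantuist_view)
print(general_view)

# The same value is reachable under either name, depending on the definition.
print(ET.fromstring(bantuist_view).findtext("nclass"))
print(ET.fromstring(general_view).findtext("gen"))
```

The marked-up source itself stays the same in both cases; only the definition supplied at processing time changes.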

4. The creation of a comparative word list within a dictionary

In performing comparative work, we need to add two sorts of information to our dictionaries. The first was discussed in section 3.1, where we covered how we add etymological information. At the same time, however, we want basic "word-list" information. All of our sources contain only one direction of translation, from a Bantu language to a European language (usually English, but sometimes French or Dutch). This makes it very hard to look for "basic" terms in the language. Even though we can search through gloss fields for particular text matches, this is not always helpful. Basic words like "tree" or "run" usually appear in the glosses of many entries, which makes it nearly impossible, without some special markup, to examine basic lexical items across languages.

Accordingly, we have defined an element <basicterm> for dictionary entries which identifies a particular entry with one of a set of basic terms used in comparative Bantu research. The DTD for <basicterm> can be found at The <basicterm> element can contain an <xr> element to point to a word list file external to the document. It can also contain text stating what basic term the word should be identified with. We have been working on incorporating both kinds of data into our <basicterm> element as we mark our dictionaries. A sample entry marked with <basicterm>, again from Ganda (Snoxall 1967), can be seen at The entry is the basic word for "dog" in the language.
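The difference between searching gloss fields and using <basicterm> can be sketched as follows. The two sources, their language labels, and their forms are hypothetical placeholders, the fragments are assumed to be well formed, and only the text content of <basicterm> is used (the <xr> variant is left out); the function name is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sources. In each, two glosses contain "dog" but only one entry
# is marked as the basic term.
SOURCES = {
    "Language A": """<dictionary>
  <entry><form><orth>form-a1</orth></form><def>dog</def><basicterm>dog</basicterm></entry>
  <entry><form><orth>form-a2</orth></form><def>to follow like a dog</def></entry>
</dictionary>""",
    "Language B": """<dictionary>
  <entry><form><orth>form-b1</orth></form><def>dog, hound</def><basicterm>dog</basicterm></entry>
  <entry><form><orth>form-b2</orth></form><def>dog's kennel</def></entry>
</dictionary>""",
}


def comparative_word_list(sources):
    """Map each basic term to the headword that realizes it in each language."""
    table = {}
    for language, xml_text in sources.items():
        for entry in ET.fromstring(xml_text).iter("entry"):
            term = entry.findtext("basicterm")
            if term is not None:
                headword = entry.findtext("form/orth")
                table.setdefault(term.strip(), {})[language] = headword
    return table


table = comparative_word_list(SOURCES)
print(table)
```

A plain text match on "dog" in the gloss fields would have returned all four entries; the marked element returns one per language.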

5. The storage of reference data with a dictionary

Another part of being able to compare our sources to each other is making sure the data we are comparing is compatible--for example, the transcription systems either have to be the same or be normalized during processing. At present, we do not want to normalize the original data in our sources. Instead, we want critical information to be included with each source that will allow either a human user or a computational system to normalize when appropriate.

Since standard print dictionaries include important reference information in their prefaces, we have decided to use the preface element defined in the TEI P3 standards as the location of all reference information for our sources. Sometimes this reference information was with the original source and sometimes we have added it.

The addition of a preface makes our electronic sources nicely parallel to print ones. However, for linguistic research, a preface consisting simply of text marked for formatting will not be of particular use. Instead, a system of elements needs to be developed which identifies particular sorts of information stored in the preface. Perhaps the most important elements will be those required to mark up transcription systems in a way that is useful to both human users and computational processors.

An example of two Bantu-specific preface elements we have defined can be found at These elements specify whether a language has a five or seven vowel system and the language's Guthrie Number. The <vowelsystem> element is particularly useful in interpreting vowel transcription in Bantu. Thus, we add it to each source as we process it into SGML. We also put "internal" reference information in our prefaces so that a given source contains the information we need to maintain our database--we can extract these elements from the preface of each document to create an electronic "index card" describing each of our sources.
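Extracting an "index card" from a preface can be sketched as below. The source fragment is hypothetical and assumed to be well formed; <vowelsystem> is the element named above, while the element name <guthrienumber>, its placeholder value, the surrounding structure, and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical source with a preface that mixes structured reference elements
# and free text. "X.00" is a placeholder, not a real Guthrie number.
SOURCE = """<text>
  <front>
    <preface>
      <vowelsystem>5</vowelsystem>
      <guthrienumber>X.00</guthrienumber>
      <p>Free-text remarks carried over from the original preface.</p>
    </preface>
  </front>
  <body>
    <entry><form><orth>headword</orth></form><def>gloss</def></entry>
  </body>
</text>"""


def index_card(source_xml, fields=("vowelsystem", "guthrienumber")):
    """Collect the named reference elements from a source's preface."""
    preface = ET.fromstring(source_xml).find(".//preface")
    if preface is None:
        # A source without a preface yields an empty card.
        return {name: None for name in fields}
    return {name: preface.findtext(name) for name in fields}


card = index_card(SOURCE)
print(card)
```

A normalization step could consult the same card, for example to decide how vowel symbols in a given source should be interpreted, without altering the original data.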

An important advantage of putting basic grammatical information in the preface is that prefaces are not specific to dictionaries: the TEI P3 guidelines allow them to appear in any kind of document. Thus, the elements developed for the preface can be used to give important grammatical information for any kind of electronic source.

6. A metadata element for identifying the language(s) described in a given source

A final problem we have run across in creating a true comparative resource is that most, if not all, Bantu languages are known by more than one name. This problem is slightly less important for Bantu researchers than for researchers of other families since there is a common designation system for Bantu languages known as the "Guthrie number". Ideally, however, each of our resources would be accompanied by information that lists possible alternate names for the language it documents. This list would be what researchers access when querying a particular language. It does not solve every problem--in particular, it does not resolve how to deal with the same name being applied to two different languages. However, it is an important step in the right direction and should be sufficient for most cases. We are presently planning on linking a language metadata element to all of our SGML sources--in addition to using the Dublin Core Element Set from the Dublin Core Metadata Initiative. Our proposed structure for the metadata element can be found at Other projects have made similar proposals. What we believe is most important is for a list of language names to either directly accompany, or be readily accessible from, all of our sources.
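A lookup over such lists of names can be sketched as follows. The record layout, the source identifiers, and the second and third records are hypothetical and are not the proposed metadata structure; "Ganda" and "Luganda" are used only because both names appear in this paper. The last two records illustrate the unresolved case in which one name is applied to two languages.

```python
# Hypothetical metadata records: one per source, with a primary language name
# and any alternate names under which a researcher might search.
RECORDS = [
    {"source": "source-1", "primary": "Ganda", "alternates": ["Luganda"]},
    {"source": "source-2", "primary": "Language B", "alternates": ["Shared Name"]},
    {"source": "source-3", "primary": "Language C", "alternates": ["Shared Name"]},
]


def build_name_index(records):
    """Map every known name (case-insensitively) to the sources it may refer to."""
    index = {}
    for record in records:
        for name in [record["primary"], *record["alternates"]]:
            index.setdefault(name.casefold(), set()).add(record["source"])
    return index


def find_sources(index, query):
    """Return the sources whose language is known by the queried name."""
    return index.get(query.casefold(), set())


index = build_name_index(RECORDS)
print(find_sources(index, "luganda"))
print(find_sources(index, "Shared Name"))
```

A query under either name for the first language reaches the same source, while the shared name returns two candidates and leaves the choice to the researcher.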

7. Conclusion

The main goal of this paper was to discuss the issues that arise in making a truly comparative linguistic database and to mention some of the solutions CBOLD is implementing to resolve them. In general, the TEI P3 architecture is largely sufficient for CBOLD's basic markup needs, though we have extended its system of marking grammatical categories to increase its utility to the academic linguistic community. Many other issues, such as distinguishing between original data and data added to a source, making reference information available with a source, creating word lists from dictionaries, and ensuring that a researcher can find a language without having to precisely match CBOLD's naming conventions, can be readily solved with minor extensions to either the TEI P3 standards or other commonly used SGML standards.


Dublin Core Metadata Initiative.

Guthrie, Malcolm. 1967. Comparative Bantu: An Introduction to the Comparative Linguistics and Prehistory of the Bantu Languages. Greggs: London.

Sperberg-McQueen, C. M. and Lou Burnard. 1999. Guidelines for Electronic Text Encoding and Interchange.

Snoxall, R.A. 1967. Luganda-English Dictionary. Clarendon: Oxford.