Classification

In brief

Historical Thesaurus data consists of a fine-grained conceptual hierarchy containing almost all of the recorded words in English, arranged semantically. This hierarchy of semantic categories is unique in its depth and level of detail, consisting of almost a quarter of a million concepts. Each category is then nested within other, wider categories – so that, for example, the category of words meaning Profligacy, dissoluteness and debauchery is within Licentiousness, itself adjacent to Guilt and Wickedness and within the major category Morality.

This hierarchical structure differs from the organization of many other thesauri: Historical Thesaurus categories relate to one another not just linearly, but either horizontally (on the same hierarchical level) or vertically (on a higher or lower level, either containing or being contained by another category). In addition, each concept is able to contain a series of subcategories within itself, separate from the main sequence.

The three primary divisions of the Thesaurus are the External World, the Mental World, and the Social World. These in turn are divided into 377 major categories, such as Food, Thought, or War. Each category is given a nested reference code, or hierarchy number, such as “01.07.02.02.06.01 n.” for the category Whisky. Each new set of two digits indicates another level of depth in the hierarchy, from 01.07.02.02 n. Intoxicating liquor all the way upwards to 01 n. The world. A comprehensive series of changes has been made to the Thesaurus hierarchy since its first publication, and category numbers online may not correspond to those in the printed version – more information is available here.
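The structure of a hierarchy number can be illustrated with a short Python sketch. The function names `parse_code` and `ancestors` are hypothetical and are not part of any Thesaurus software; the sketch assumes only the format described above, namely two-digit levels separated by full stops and followed by a part-of-speech label.

```python
def parse_code(code):
    """Split a hierarchy number such as '01.07.02.02.06.01 n.'
    into its list of two-digit levels and its part-of-speech label."""
    number, _, pos = code.strip().partition(" ")
    return number.split("."), pos


def ancestors(code):
    """Return the codes of every wider category, nearest first.
    Assumption: a wider category keeps the same part-of-speech label."""
    levels, pos = parse_code(code)
    return [
        ".".join(levels[:depth]) + " " + pos
        for depth in range(len(levels) - 1, 0, -1)
    ]


if __name__ == "__main__":
    for wider in ancestors("01.07.02.02.06.01 n."):
        print(wider)
    # prints 01.07.02.02.06 n., 01.07.02.02 n., 01.07.02 n., 01.07 n., 01 n.
```

Dropping one pair of digits at a time moves from Whisky through 01.07.02.02 n. Intoxicating liquor up to 01 n. The world, mirroring the nesting described above.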

You can also download the first three levels of the Thesaurus hierarchy as a printable PDF here.

Classification: principles and practices

Much time was spent during the initial decades of the project on the immense task of devising a system of classification which would do justice to such a large and varied body of material as was present in the OED. Such a system had to be flexible enough to accommodate changes in the vocabulary over the years and the cultural changes they reflected. We were also keen to include a much greater degree of semantic discrimination than is found in most thesauri. While the resultant framework inevitably coincides with those of other thesauri at certain points, as a whole it offers a uniquely detailed system for semantic classification.

At the root of the Historical Thesaurus’ system of classification is the contention that, within certain limits, each section should be allowed to develop its own semantic profile. However, as the body of data grew, it was felt that a high-level overall structure should be developed into which the categories assigned to individual classifiers could eventually be slotted. Samuels and Kay undertook the task of identifying key components in the OED definitions which would form the basis of major sections.[1] Further work led to a set of major tier two categories, forming a framework for the Thesaurus as a whole. Within this framework there is provision for seven main category levels and five subcategories, in a taxonomy which begins with the most general ways of expressing a concept and moves hierarchically downwards to the most specific. In linguistic terms, we are applying an organizing principle of hyponymy, encapsulating relationships such as ‘type of’ or ‘part of’. All twelve of these levels are sometimes used in classifying the objects of the material world, where a good deal of detail can be specified, but are rarely needed in the broader divisions of abstract categories.[2]
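The limit of seven main levels and five subcategory levels can be expressed as a simple check, sketched here in Python. The function names are hypothetical. The sketch assumes that subcategory levels follow a vertical bar, as in the number 01.09|06 quoted elsewhere on this page, and, as a further assumption not confirmed by this page, that deeper subcategory levels are separated by full stops in the same way as main levels.

```python
MAX_MAIN_LEVELS = 7  # main category levels, as stated in the text
MAX_SUB_LEVELS = 5   # subcategory levels, as stated in the text


def depth(number):
    """Return (main_levels, sub_levels) for a number such as
    '02.01.12.10.01.01|02' (part-of-speech label already removed)."""
    main, _, sub = number.partition("|")
    main_levels = len(main.split("."))
    # Assumption: subcategory levels are dot-separated after the bar.
    sub_levels = len(sub.split(".")) if sub else 0
    return main_levels, sub_levels


def within_limits(number):
    """Check a number against the seven-plus-five framework."""
    main_levels, sub_levels = depth(number)
    return 1 <= main_levels <= MAX_MAIN_LEVELS and sub_levels <= MAX_SUB_LEVELS


if __name__ == "__main__":
    print(depth("02.01.12.10.01.01|02"))  # (6, 1)
    print(within_limits("01.09|06"))      # True
```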

There was much discussion of how parts of speech should be represented within this structure, with arguments put forward both for allowing semantics to override grammar, resulting in intermingled parts of speech, and for identifying a leading part of speech in each category, so that concrete categories would be presented with nouns first but more abstract ones, such as Thought or Goodness, might begin with verbs or adjectives respectively. In the end, in the interests of ease of use, it was decided to maintain a consistent pattern of nouns, adjectives, adverbs, verbs within each category and subcategory, followed by phrases, interjections, conjunctions, and prepositions.
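The fixed ordering of parts of speech amounts to a sort key, which the following Python sketch illustrates. The labels in `POS_ORDER` and the function name are hypothetical; the order itself is the one given above. The sample entries pair words mentioned on this page with their usual parts of speech purely for illustration.

```python
# Order of parts of speech within each category, as described in the text.
POS_ORDER = [
    "noun", "adjective", "adverb", "verb",
    "phrase", "interjection", "conjunction", "preposition",
]


def sort_by_part_of_speech(entries):
    """Sort (word, part_of_speech) pairs into the consistent pattern.
    sorted() is stable, so words keep their relative order within
    each part of speech."""
    rank = {pos: index for index, pos in enumerate(POS_ORDER)}
    return sorted(entries, key=lambda entry: rank[entry[1]])


if __name__ == "__main__":
    sample = [
        ("touch", "verb"),
        ("sensitive", "adjective"),
        ("whisky", "noun"),
        ("tender", "adjective"),
    ]
    print(sort_by_part_of_speech(sample))
    # [('whisky', 'noun'), ('sensitive', 'adjective'),
    #  ('tender', 'adjective'), ('touch', 'verb')]
```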

As it now stands, the Historical Thesaurus is presented in three major sections, reflecting the following broad semantic divisions: the External World, the Mental World, and the Social World. This tripartite division reflects the fact that we are dealing with a world-view, or set of world-views, in English which has been recorded over a period of about 1300 years and which often incorporates much earlier views.

Our historical perspective enabled us to tackle one of the key problems of any system of classification: where do you start? For Roget, the answer was to start with abstractions, notions such as relationships of similarity, comparability, etc., which would inform later sections. For the Historical Thesaurus, the solution was the opposite: to start in the External World with the most readily observable phenomena of the universe (the land, sea, sky, etc.), followed by living beings, their characteristics, and their physical needs. At the end of this section come attempts to quantify and interpret the world through systems such as space, time, and measurement. The Mental World presents the vocabulary of mental processes, such as perception, emotion, will, and (perhaps more marginally) possession. Here there is a logical progression, since much of the lexis of this section derives metaphorically from that of the External World.[3] The Social World deals with the vocabulary of people as they organize into social groups, develop systems such as law and morality, exploit the environment, communicate, and enjoy themselves. It contains the largest number of categories, reflecting huge changes in social organization and activities over the years; categories such as Leisure and Entertainment, which are tiny in A Thesaurus of Old English (TOE), have grown almost beyond recognition in the modern period, reflecting changing life-styles. The same, of course, applies to major scientific categories in the External World, such as Physics or Medicine.

Categories such as Law, Clothing, or Will, described loosely as semantic fields, formed the basic unit of classification, being of a size (between about 10,000 and 20,000 slips) which an individual could reasonably be expected to tackle, and which could be expected to yield interesting results. The decision to base the classification on such fields led to major departures from Roget’s framework, particularly in the transfer of categories from the External World to the Social World. Sometimes, the individual produced a detailed classification of a particular area, which could later be added to, as in the case of work done by postgraduate students[4] and research assistants, and, as the project became better known, by colleagues from other universities, notably Reinhard Gleissner of Regensburg University, Hans Peters, now at the University of Dusseldorf, and Angus Somerville of Brock University, Canada. In other cases, trainee lexicographers did preliminary sorting, which was later refined by an experienced editor.

It was acknowledged from the start that each section should be allowed to develop its own structure. Within the general taxonomic framework, classifiers were given a free hand, being told simply to ‘sort, sort, and sort again’ until an acceptable structure emerged; in other words, the classification was ‘bottom up’ from the data rather than imposed ‘top down’. Our aim was to produce a folk taxonomy, informed by what Hallig and von Wartburg describe as ‘naïve realism’, setting forth “the intelligent average individual’s view of the world, based on pre-scientific general concepts made available by language”.[5] However, since in some sections, such as Animals and Plants,[6] we found that an established scientific taxonomy was the best way of dealing with some of the data, we ended up with what we describe as a ‘modified folk taxonomy’, where the naïve view may be combined with one that is more informed. From this point of view, the ideal classifier is a person who combines linguistic sophistication with a degree of appropriate subject knowledge, and many classifications were assigned with this in mind. Using scientific terminology also helped to solve the problem of devising explanatory headings for complex scientific categories.

There is obviously no single way of structuring a thesaurus, as comparative studies of the semantic systems of the relatively few existing thesauri have shown.[7] Nor is any semantic category likely to be wholly clear-cut. Initially, it might seem that a field like Music or Faith, two of the earliest to be tackled, has well-defined content, but even here problems arise, such as where to assign religious music. A more serious problem for the Thesaurus was how, or indeed whether, to draw a boundary between Faith and cognate areas which might now be excluded from it, such as the Occult and the Paranormal. In the end a decision was made to set up two fields, one for all supernatural beings and manifestations, regardless of originating creed, and the other for organized religion; experimentation while compiling TOE confirmed that such a model was appropriate for the long historical sweep of the Historical Thesaurus. These two categories are quite widely separated in the thesaurus as a whole, appearing in the External World (as an attempt to explain the world) and the Social World (as a social construct) respectively.

The need to make such decisions recurred throughout the work. In some cases, the sheer weight of vocabulary simply overwhelmed the taxonomy, with a category that is properly a subcategory rising in status because its degree of lexicalization reflected its considerable degree of importance to speakers of the language. Thus, for example, historic and important sports, such as cricket, football, or baseball, have their own categories and arrays of subcategories. In all cases, it must be stressed that we are classifying the language used to discuss a topic, not the subject-matter itself, which may sometimes lead those who are knowledgeable about a topic to find a particular category defective. Our category of Named regions of earth, for example, reflects the priorities of a dictionary and would not satisfy a geographer; it indicates the various ways in which people have referred to parts of the world but is not an encyclopedic list of polities. A comparison of our category and such an encyclopedic list, however, would reveal the shifts in how things are in the world against how they have been discussed across time.

Within each heading in the Historical Thesaurus, meanings are grouped according to a loose principle of synonymy. There is no claim that these words are exactly synonymous, i.e. could replace one another in all contexts (if such a condition exists), but rather that they share enough of their meaning to be classified together. Although the project was started before the current cognitive semantics paradigm became dominant, that paradigm has retrospectively proved sympathetic to the problems involved in categorizing large quantities of lexical data. The development of prototype theory, which allows for fuzzy sets containing both good and less good examples of the central concept, challenges the either/or basis of Aristotelian category assignment and liberates semanticists from a narrow notion of synonymy as an organizing principle.[8] The Thesaurus’ synonym groupings are prototypical in nature, with a clear core of obvious members shading off into the less obvious, and ultimately into cognate categories.

Such flexibility is particularly important for historical semantics, allowing connections to be made and new systems identified. Thus, for instance, where a form occurs in category X, and also appears, usually with a later starting date, in category Y, there may be evidence of gradual semantic change through a polysemous chain of meaning. One example among many is the adjective sensitive, which first occurs in the fifteenth century in our subcategory 01.09|06 adj. Endowed with physical sensation :: having function of sensation. Several meanings related to physical sensation develop within that wider category, until in 1816 the word is recorded with reference to mental sensation, classified in a subcategory of 02.04.01.02 Emotional perception. If we examine the latter category and those surrounding it, we find other words with a similar transfer of meaning from concrete to abstract, such as soft, tender, sensible, and various forms of the verbs touch and impress. Interestingly, the expression sensitive plant has made a similar journey, this time from a botanical term for a literally irritable plant to a way of describing a sensitive person. However, the most recent meaning of sensitive, as in “sensitive information”, has made a lonely journey to Knowledge within 02.01.12.10.01.01|02 (adj.) Not made public :: not printed/published, where no other words of comparable origin are to be found.
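The search for forms that recur across categories, each with its own starting date, can be sketched in Python as follows. The record layout and the function name are hypothetical and do not describe the Thesaurus database. In the sample data the year 1450 is an arbitrary placeholder standing in for “the fifteenth century”, and the 1816 sense is filed under its parent category number because the exact subcategory is not given here.

```python
from collections import defaultdict


def forms_in_several_categories(records):
    """Group (form, category, first_date) records by form and return,
    for each form found in more than one category, its senses in
    chronological order as (first_date, category) pairs."""
    by_form = defaultdict(list)
    for form, category, first_date in records:
        by_form[form].append((first_date, category))
    return {
        form: sorted(senses)
        for form, senses in by_form.items()
        if len({category for _, category in senses}) > 1
    }


if __name__ == "__main__":
    sample = [
        ("sensitive", "02.04.01.02", 1816),  # mental sensation
        ("sensitive", "01.09|06", 1450),     # placeholder year, see lead-in
    ]
    print(forms_in_several_categories(sample))
    # {'sensitive': [(1450, '01.09|06'), (1816, '02.04.01.02')]}
```

A chronologically ordered list of this kind is only a starting point: whether it reflects a genuine polysemous chain still has to be judged from the categories and dates involved.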

The other common organizing principle of thesauri, antonymy or oppositeness of meaning, has proved less suitable for systematic use in the Historical Thesaurus, since oppositions vary in both nature and extent. Where there are substantial categories containing obvious oppositions, as in Love/Hatred or Pleasure/Suffering, they are generally placed together, but in other categories, such as Truth, there can be a progression of meaning covering several oppositions, in this case moving from Validity through Veracity, Sincerity, Exaggeration, Falsehood, and Error to Deceit. Where opposites are too few to merit a separate category, they are usually classified after the meaning they negate. The same principle applies to phrases. If there are only a few, they are classified with the appropriate part of speech; where there are many, they are grouped in a separate category of phrases.

Each category and subcategory in the Thesaurus has a Modern English gloss acting as a heading; wherever possible, the keyword in the heading is drawn from the category, but sometimes a word of more general scope is used.

Recurrent headings may be given in abbreviated form, as in ‘not’ (= not having the quality described above), ‘one who’ (= one who performs a certain action) or ‘instance of’ (= instance of a particular condition). Throughout the Thesaurus, headings are intended to link in this fashion, so that reading back from the lowest to the highest level will construct an approximate definition revealing the layers of taxonomic structure with which a meaning engages. Indeed, it would be possible to produce a novel kind of dictionary by this process, turning inside out the thesaurus which was constructed by turning the dictionary inside out, but that would be a project for another day.
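Reading the headings back from the lowest level to the highest can be illustrated with a minimal Python sketch. The function name and the separator are hypothetical choices. The sample chain uses only the headings named at the start of this page, so any intermediate levels between them are left out.

```python
def read_back(headings, separator=" < "):
    """Join a chain of headings, given from the most general to the
    most specific, in reverse order so that the most specific heading
    comes first."""
    return separator.join(reversed(headings))


if __name__ == "__main__":
    chain = [
        "Morality",
        "Licentiousness",
        "Profligacy, dissoluteness and debauchery",
    ]
    print(read_back(chain))
    # Profligacy, dissoluteness and debauchery < Licentiousness < Morality
```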

Notes on some categories

In creating the Thesaurus, we bore in mind as much as possible the mental world-picture not only of ourselves but also of previous generations, as it was represented in the wide diachronic scope of the data available. A case in point for the Thesaurus is the positioning of 01.01.10 The universe near the end of the list of categories inside The earth. To a modern mind, it would seem more logical to make The universe the main tier two category, with The earth subordinate to it, but the weight of the evidence suggested that the immediately-present world was far more salient in earlier periods.

One of the most problematic categories was 02.06 Possession. Hans-Jürgen Diller[9] queries why it is in the Mental World rather than in the Social World, since many aspects of possession, such as materials, trade and commerce, are to be found in 03.11 Occupation and work. Although to some extent we shared his doubts, our decision was influenced by a comment under have in OED2, describing have, alongside be and do, which both occur primarily in the Thesaurus’ External World, as “the most generalized representatives of the verbal classes”, predicating, in its weakened senses, “merely a static relation between the subject and object”. The presence of this relationship suggested a mental process. We therefore decided that this more abstract notion of possession, along with associated concepts such as giving, taking, wealth and poverty, should be separated from the huge body of material in the Social World. Somewhat similar issues were raised by the split between Language in the Mental World as an intellectual activity, and Communication as a social activity, involving information transmitted in books, journalism, correspondence, broadcasting, telecommunications, etc., in the Social World.

The changes in the Social World overall were more radical. Initially we tried to make a distinction between the External World as wholly natural, and the Social World for man’s social activities, including his operations upon the world, but this proved impractical. In categories such as 01.01.05.01.03 (n.) Channel of water and its subcategories, there was often no way of telling whether the item in question was man-made or not. Given this, it seemed unhelpful to separate channels from clearly man-made things in the same domain of meaning, such as locks and dams. It was therefore decided to leave activities connected with physical existence, such as Farming and Food, in the External World, while moving those with a more clearly social dimension, such as Kinship and Inhabiting and dwelling, including buildings, to the Social World. In some cases, there was a clear distinction between the physical and the social, leading to a category in each section, as in 01.14 Movement and 03.10 Travel and travelling. A similar kind of separation was made between 01.17 The supernatural, which occurs at the end of the External World as a way of explaining the universe, and covers supernatural creatures and practices of all kinds, from angels and witches to spiritualism, and 03.07 Faith in the Social World, which covers all aspects of organized religion. Such a division seemed best to encapsulate how the lexis of this area, and the attitudes it reflects, has developed over the years.

Sometimes, a concept might end up with a place in all three sections. One which got moved around a lot as classifiers claimed or rejected it was gemstones, which can be minerals (the External World), a means of adornment (the Mental World, under Beauty), or an industrial material (the Social World), and which were in fact put in all these categories along with the associated vocabulary for each. This sort of split was always preferable to illogical or overly-didactic schemes of arrangement which did not account for actual language use.

[1] C. Kay & M. L. Samuels, “Componential Analysis in Semantics: Its Validity and Applications”. Transactions of the Philological Society, 1975, 49-81.

[2] See Christian Kay & Irené Wotherspoon, “Semantic Relationships in the Historical Thesaurus of English”. Lexicographica 21, 2005, 47-57.

[3] For an extended examination of Thesaurus data in such terms, see the thesis by Reay cited in the Bibliography and Kathryn Allan, Metaphor and Metonymy: A Diachronic Approach. Oxford: Blackwell, 2008.

[4] See, for example, the theses listed under Chase, Coleman, Sylvester, and Thornton in the Bibliography, each of which contains a classification as well as analysis of various linguistic aspects of the data. The thesis by Chase also made a significant contribution to the development of the Thesaurus’s notation.

[5] Cited in Ullmann, Semantics. Oxford: Blackwell, 1962, 255.

[6] In the case of Plants, both types of taxonomy were attempted. See the study reported in Cerwyss O’Hare, “Folk Classification in the HTE Plants Category”. Categorization in the History of English, Christian J. Kay & Jeremy J. Smith (eds). Amsterdam: John Benjamins, 2004, 179-191.

[7] See, for example, Andreas Fischer, “The notional structure of thesauruses” in Kay & Smith (eds), op. cit., 2004, 41-58; Werner Hüllen, A History of Roget’s Thesaurus: Origins, Development, and Design. Oxford: Oxford University Press, 2004; Christian Kay and Marc Alexander, “Historical and Synchronic Thesauruses” in Philip Durkin (ed.) The Oxford Handbook of Lexicography. Oxford: Oxford University Press, 2015.

[8] See, for example, John Taylor, Linguistic Categorization. Third edition. Oxford: Oxford University Press, 2003.

[9] H.-J. Diller, “A Lexical Field Takes Shape: the Use of Corpora and Thesauri in Historical Semantics”. Anglistik 19 (1), 2008, 123-140.
