Our Second Edition

Summary

This site presents the second edition of the Historical Thesaurus of English. Our second edition was launched in 2020, twelve years after the first edition of the Thesaurus was completed (in 2008, with the print edition appearing in 2009), and incorporates a significant series of updates since then. Four major revisions of the Thesaurus database have been released in the interim, with the first release of our second edition constituting version 5 of our data.

There are four aspects to our revisions: the Thesaurus structure itself, updated dates of attestation, new words and concepts, and new tools to work with our data.

We gratefully acknowledge the assistance of our colleagues at the Oxford English Dictionary, who have shared the results of their work on revising the OED. The Thesaurus’ first edition is largely based on the second edition of the OED (1989) and its supplements, and there have long been close links between the OED and Thesaurus staff. We have greatly benefited from the work done since 1990, when rewriting of the OED began in order to create a third edition (presently underway online only). This rigorous revision has dramatically improved the dating of entries, led to re-evaluation of word sub-senses, and brought many new words into the dictionary. The results of this process are therefore of great value in improving the Thesaurus’ data, and so our team are working with the OED to incorporate such updated information into the Thesaurus, forming the core of the second edition’s improvements. Revision of the OED is an ongoing process and the same will therefore be true of the second edition of the Thesaurus, with periodic updates to our data reflecting those made to the dictionary.

In detail

1. Revisions to the Thesaurus and its structure

The nature of a thesaurus is that until the final word to be categorized is slotted into place it is very difficult to envisage the structure of the data in full context. Our guiding principle has always been one of ground-up categorisation, where the words themselves (along with their definitions and example citations, as well as other information we glean from expert sources) give rise to their interrelationships and so to their structure. These categories sit within a wider, principled three-part structure, which we describe on our classification page. Once the first edition of the Thesaurus was published, the editors worked on research projects which aimed to use the full database (not selected extracts), such as Mapping Metaphor and the SAMUELS tagger. This made us aware of areas of the Thesaurus which were inconsistent or less than ideally usable, and so we began a process of revising the Thesaurus structure, spread over the four major versions of our database released since the first edition in 2009.

The first set of structural changes was minor and had two aims. The first was to remove what were previously ‘00’ categories. For example, 01.01.05.00 Water contained different types of water, while later categories focused on bodies and collections of water, such as rivers, springs, and fountains; it was re-edited so that 01.01.05 itself provides space for what was previously placed in the .00 category. The second aim was to deal with small isolated categories. The most substantial of these was what was 02.06 Refusal/denial, which contained a surprisingly small number of words in English for the disproportionately significant place reserved for it high in the hierarchy, and which is now more appropriately placed within Language, at 02.07.06.23.

The team then worked on a much more substantial revision of the category system, in order to rationalize the higher levels of the hierarchy. This major reorganisation and re-editing of the hierarchy resulted in version 4 of the database. It moved the majority of the Thesaurus hierarchy, creating prominent spots to bring together concepts such as Health and disease, People, Space, Time, Movement, Attention and judgement, Goodness and badness, Law, and Trade and finance. With these changes, the so-called ‘tier 2’ categories (those which are two levels down in the hierarchy) bring together significant concepts in English at an appropriate and prominent level. A secondary consideration when undertaking this major reorganisation was to rationalize categories at this tier, as far as the language allows, to ensure they are of a broadly significant size. As well as enabling more coherent browsing and interpretation of the data, this set of changes brings together concepts at a prominent level for research (such as when the SAMUELS tagger calculates ‘distances’ between senses in two categories in order to compare how similar one meaning is to another).
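As a rough illustration of what a ‘distance’ between two categories could look like, the sketch below counts the steps between two category codes through their nearest shared ancestor in the hierarchy. This is an assumption made for illustration only: this page does not describe how the SAMUELS tagger actually measures distance, and the function names parse_code and tree_distance are hypothetical.

```python
def parse_code(code: str) -> tuple[str, ...]:
    """Split a dotted category code such as '02.07.06.23' into its levels."""
    return tuple(code.split("."))


def tree_distance(code_a: str, code_b: str) -> int:
    """Count steps up from code_a to the nearest shared ancestor, then down to code_b.

    Hypothetical measure for illustration; not the SAMUELS tagger's own method.
    """
    levels_a, levels_b = parse_code(code_a), parse_code(code_b)
    shared = 0
    for a, b in zip(levels_a, levels_b):
        if a != b:
            break
        shared += 1
    return (len(levels_a) - shared) + (len(levels_b) - shared)


if __name__ == "__main__":
    # Category codes taken from this page: Refusal/denial, its parent level, and Water.
    print(tree_distance("02.07.06.23", "02.07.06"))
    print(tree_distance("02.07.06.23", "01.01.05"))
```

Under this simple measure, categories gathered under the same prominent ancestor come out closer together than categories in different top-level sections, which is one reason the placement of concepts in the upper tiers matters for this kind of comparison.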

Categories have also been renamed throughout this process for clarity where relevant (such as Early man being renamed to Protohuman).

Overall, the second edition has 198,831 categories which have been re-edited in their location and hierarchy (out of 235,249 in the first edition, almost 85%). Revised categories are indicated by a second edition sign (a 2 in a circle) next to a category in the online Thesaurus.

We provide a converter on this site to help locate second edition categories from first edition category codes.

2. Revisions to dates of attestation

The comprehensive revision of the Oxford English Dictionary has resulted in updates to the evidence presented there as to dates of attestation. One of the significant pieces of work undertaken by the Thesaurus team over the last fifty years has been to develop a short statement of approximate dates of attestation of a word sense based on OED data, supplemented by further research and the use of modern dictionaries (for example, where OED1’s entries remain unrevised from the early twentieth century).

By using new OED attestation dates, together with algorithms that follow our own attestation dating policies, we have updated and revised our dates for over 160,000 words. This involves 91,364 antedatings, where OED3 now provides an earlier date of attestation than we had in the first edition; 37,575 postdatings, bringing attestations of use closer to the present day; and 119,581 verifications and extensions of a word’s currency, using more precise recent citations (ideally since 1945) to confirm that a word continues to be used in modern English.

Revised datings are indicated in the text which appears when you hover over the second edition sign (a 2 in a circle) next to a word in the online Thesaurus.

3. New words

The second edition of the Thesaurus contains 804,830 word forms, up from 793,733 in 2009. These 11,097 new words have been taken from OED3, and include both recent neologisms added to the OED and words previously unrecorded in either the OED or the Thesaurus. (For details on what has been added, the OED’s blogs give excellent accounts of the words included in their quarterly updates.) A new word can be identified by ‘New in our second edition’ appearing when you hover over the second edition sign (a 2 in a circle) next to a word in the online Thesaurus.

4. New tools

Alongside the data updates, our second edition contains new ways to work with Historical Thesaurus data.

Next to every word in the Thesaurus, a small ‘sparkline’ visualization shows the approximate dates of attestation of that entry. This displays the date information given in the text in visual form, and allows a user to scan down a category in order to quickly find a word with a particularly long (or short) period of attestation, or to see instantly which words are or are not in use in recent times.

Categories have powerful tools available to re-sort and visualize their contents. Within the menu icon for each main category, options enable re-sorting entries by first attested date (the default sort), alphabetically, or by length of attestation. The printed version offered only date-based sorting, which was ideal for many functions of a historical thesaurus, but the length of attestation sort gives researchers a powerful new tool. This sort displays the senses with the longest spans of attestation first, which often brings the primary or most common expression for a concept to the top of the list. For example, the category for words for War (03.03.01 n.) in the default sort begins with twelve words only used in Old English (some only found in poetry), whereas when sorted with the longest-lived words at the top, war (attested a1225 to present) comes to the head of the list, along with some other words which had long currency but have since dropped out of use. At the same time, word senses with short lifespans (including hapax legomena) are clustered at the end of the list.
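The three sort modes can be pictured with a minimal sketch like the one below. It is not the site's implementation: the Entry structure, the sort_entries function, the PRESENT stand-in year, and the two placeholder entries are all assumptions made for illustration. Only the dating of war (a1225 to present) comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional

PRESENT = 2020  # assumption: stand-in end year for words still in use


@dataclass(frozen=True)
class Entry:
    word: str
    first: int           # approximate year of first attestation
    last: Optional[int]  # year of last attestation; None means still current

    def span(self) -> int:
        """Length of attestation in years."""
        end = PRESENT if self.last is None else self.last
        return end - self.first


def sort_entries(entries: list[Entry], mode: str = "date") -> list[Entry]:
    """Return entries in one of three orders: 'date', 'alpha', or 'length'."""
    if mode == "date":    # earliest first attestation first (the default)
        return sorted(entries, key=lambda e: (e.first, e.word))
    if mode == "alpha":   # alphabetical by word form
        return sorted(entries, key=lambda e: e.word.lower())
    if mode == "length":  # longest attestation first
        return sorted(entries, key=lambda e: (-e.span(), e.first, e.word))
    raise ValueError(f"unknown sort mode: {mode}")


# 'war' uses the dates given above; the other two entries are invented placeholders.
entries = [
    Entry("placeholder-oe-word", 900, 1100),
    Entry("war", 1225, None),
    Entry("placeholder-hapax", 1650, 1650),
]

if __name__ == "__main__":
    for mode in ("date", "alpha", "length"):
        print(mode, [e.word for e in sort_entries(entries, mode)])
```

In this toy data the date sort starts with the early placeholder, while the length sort puts war first and the single-attestation placeholder last, mirroring the behaviour described for the War category.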

In addition, there is a fuller timeline visualization available for every category in the Thesaurus (for main categories, from the same menu which enables re-sorting, and from a dedicated button for subcategories). This visually displays a category (in any of the sort modes above) in more detail than the sparkline visualizations, and can be exported for teaching or research purposes.

Finally, we have a dedicated visualization section which allows users to view and explore changes in the size of categories. For advanced users, this permits research into the patterns and rate of lexicalization of a category (how many new words are coined within and around that category, acting as an indication of how popular or culturally significant a concept is). Users can limit their results to select categories which display particular patterns of growth or contraction such as bursts, slumps, peaks, and troughs of activity, and view these either as graphs of usage or as a heatmap of when categories lexicalize faster or slower than the rest of English does.
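To make the idea of a rate of lexicalization concrete, the following sketch counts how many words are first attested in each period and how many are in use in a given year. It is only one possible reading of the measure: the 50-year period, the function names new_words_per_period and category_size, and the sample dates are assumptions, and the calculations behind the site's visualizations are not described here.

```python
from collections import Counter
from typing import Optional

# Each entry is (first attestation year, last attestation year or None if current).
Dates = tuple[int, Optional[int]]


def new_words_per_period(entries: list[Dates], period: int = 50) -> dict[int, int]:
    """Count first attestations falling in each period, keyed by the period's start year."""
    counts = Counter((first // period) * period for first, _ in entries)
    return dict(sorted(counts.items()))


def category_size(entries: list[Dates], year: int) -> int:
    """Count the words attested as in use in the given year."""
    return sum(
        1
        for first, last in entries
        if first <= year and (last is None or year <= last)
    )


# Invented sample dates for a hypothetical category.
sample: list[Dates] = [(900, 1100), (1225, None), (1240, 1500), (1310, None), (1650, 1650)]

if __name__ == "__main__":
    print(new_words_per_period(sample))
    print([category_size(sample, y) for y in (1000, 1300, 1650, 2000)])
```

Comparing counts like these for one category against the same counts for the whole vocabulary is one way to see whether a category grows faster or slower than the rest of English in a given period.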
