Building a Large Annotated Corpus of
English: The Penn Treebank
Mitchell P. Marcus
University of Pennsylvania
Beatrice Santorini
Northwestern University
Mary Ann Marcinkiewicz
University of Pennsylvania
1. Introduction
There is a growing consensus that significant, rapid progress can be made in both text
understanding and spoken language understanding by investigating those phenom-
ena that occur most centrally in naturally occurring unconstrained materials and by
attempting to automatically extract information about language from very large cor-
pora. Such corpora are beginning to serve as important research tools for investigators
in natural language processing, speech recognition, and integrated spoken language
systems, as well as in theoretical linguistics. Annotated corpora promise to be valu-
able for enterprises as diverse as the automatic construction of statistical models for
the grammar of the written and the colloquial spoken language, the development of
explicit formal theories of the differing grammars of writing and speech, the investi-
gation of prosodic phenomena in speech, and the evaluation and comparison of the
adequacy of parsing models.
In this paper, we review our experience with constructing one such large annotated
corpus, the Penn Treebank, a corpus1 consisting of over 4.5 million words of American
English. During the first three-year phase of the Penn Treebank Project (1989-1992), this
corpus has been annotated for part-of-speech (POS) information. In addition, over half
of it has been annotated for skeletal syntactic structure. These materials are available
to members of the Linguistic Data Consortium; for details, see Section 5.1.
The paper is organized as follows. Section 2 discusses the POS tagging task. After
outlining the considerations that informed the design of our POS tagset and pre-
senting the tagset itself, we describe our two-stage tagging process, in which text
is first assigned POS tags automatically and then corrected by human annotators.
Section 3 briefly presents the results of a comparison between entirely manual and
semi-automated tagging, with the latter being shown to be superior on three counts:
speed, consistency, and accuracy. In Section 4, we turn to the bracketing task. Just as
with the tagging task, we have partially automated the bracketing task: the output of
1 A distinction is sometimes made between a corpus as a carefully structured set of materials gathered together to jointly meet some design principles, and a collection, which may be much more opportunistic in construction. We acknowledge that from this point of view, the raw materials of the Penn Treebank form a collection.
the POS tagging phase is automatically parsed and simplified to yield a skeletal syn-
tactic representation, which is then corrected by human annotators. After presenting
the set of syntactic tags that we use, we illustrate and discuss the bracketing process. In
particular, we will outline various factors that affect the speed with which annotators
are able to correct bracketed structures, a task that, not surprisingly, is considerably
more difficult than correcting POS-tagged text. Finally, Section 5 describes the com-
position and size of the current Treebank corpus, briefly reviews some of the research
projects that have relied on it to date, and indicates the directions that the project is
likely to take in the future.
2. Part-of-Speech Tagging
2.1 A Simplified POS Tagset for English
The POS tagsets used to annotate large corpora in the past have traditionally been
fairly extensive. The pioneering Brown Corpus distinguishes 87 simple tags (Francis
1964; Francis and Kucera 1982) and allows the formation of compound tags; thus, the
contraction I'm is tagged as PPSS+BEM (PPSS for "non-third person nominative per-
sonal pronoun" and BEM for "am, 'm".2 Subsequent projects have tended to elaborate
the Brown Corpus tagset. For instance, the Lancaster-Oslo/Bergen (LOB) Corpus uses
about 135 tags, the Lancaster UCREL group about 165 tags, and the London-Lund Cor-
pus of Spoken English 197 tags.3 The rationale behind developing such large, richly
articulated tagsets is to approach "the ideal of providing distinct codings for all classes
of words having distinct grammatical behaviour" (Garside, Leech, and Sampson 1987,
p. 167).
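As an illustration of the compound-tag mechanism (ours, not part of the Brown Corpus tools; the lexicon entries and the compound_tag helper below are purely hypothetical), the following Python sketch shows how the I'm example above yields PPSS+BEM by joining the tags of the contraction's parts with a plus sign.

# Illustrative sketch of Brown-style compound tagging for contractions.
# The lexicon contains only the tags needed for this example.
BROWN_TAGS = {"I": "PPSS", "'m": "BEM"}

def compound_tag(parts):
    """Join the tags of a contraction's parts with '+' to form a compound tag."""
    return "+".join(BROWN_TAGS[part] for part in parts)

print(compound_tag(["I", "'m"]))  # PPSS+BEM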
2.1.1 Recoverability. Like the tagsets just mentioned, the Penn Treebank tagset is based
on that of the Brown Corpus. However, the stochastic orientation of the Penn Tree-
bank and the resulting concern with sparse data led us to modify the Brown Corpus
tagset by paring it down considerably. A key strategy in reducing the tagset was to
eliminate redundancy by taking into account both lexical and syntactic information.
Thus, whereas many POS tags in the Brown Corpus tagset are unique to a particular
lexical item, the Penn Treebank tagset strives to eliminate such instances of lexical re-
dundancy. For instance, the Brown Corpus distinguishes five different forms for main
verbs: the base form is tagged VB, and forms with overt endings are indicated by
appending D for past tense, G for present participle/gerund, N for past participle,
and Z for third person singular present. Exactly the same paradigm is recognized for
have, but have (regardless of whether it is used as an auxiliary or a main verb) is as-
signed its own base tag HV. The Brown Corpus further distinguishes three forms of
do: the base form (DO), the past tense (DOD), and the third person singular present
(DOZ).4 It also distinguishes eight forms of be: the five forms distinguished for regular verbs as well
as the irregular forms am (BEM), are (BER), and was (BEDZ). By contrast, since the
distinctions between the forms of VB on the one hand and the forms of BE, DO, and
HV on the other are lexically recoverable, they are eliminated in the Penn Treebank,
as shown in Table 1.5
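To make the notion of lexical recoverability concrete, the sketch below (our illustration, not part of the Treebank tools; the tag inventories are simplified and the recover_brown_tag function is hypothetical) shows how a Brown-style tag can be reconstructed from a word, its lemma, and its collapsed Penn Treebank verb tag.

PENN_SUFFIX = {"VB": "", "VBD": "D", "VBG": "G", "VBN": "N", "VBZ": "Z"}
BROWN_BASE = {"be": "BE", "do": "DO", "have": "HV"}   # verbs with their own Brown paradigms
BE_IRREGULAR = {"am": "BEM", "are": "BER", "was": "BEDZ"}

def recover_brown_tag(word, lemma, penn_tag):
    """Approximate the Brown Corpus tag from a word plus its collapsed Penn tag."""
    if lemma == "be" and word.lower() in BE_IRREGULAR:
        return BE_IRREGULAR[word.lower()]          # fully irregular forms of 'be'
    base = BROWN_BASE.get(lemma, "VB")             # ordinary main verbs keep VB
    return base + PENN_SUFFIX.get(penn_tag, "")

print(recover_brown_tag("does", "do", "VBZ"))      # DOZ
print(recover_brown_tag("having", "have", "VBG"))  # HVG
print(recover_brown_tag("was", "be", "VBD"))       # BEDZ

Because such a mapping depends only on the word itself, collapsing these distinctions loses no information while reducing the number of tags whose statistics must be estimated, which is the sparse-data concern motivating the reduced tagset.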
2 Counting both simple and compound tags, the Brown Corpus tagset contains 187 tags.
3 An overview of the relation of these and other tagsets to each other and to the Brown Corpus is given in Appendix B of Garside, Leech, and Sampson (1987).
4 The gerund and past participle of do are tagged VBG and VBN in the Brown Corpus, respectively, presumably because these forms are never used as auxiliary verbs in American English.
5 The irregular present tense forms am and are are tagged as VBP in the Penn Treebank (see Section 2.1.3), just like any other non-third person singular present tense form.