Corpus-based lexicography
Corpus Linguistics and Lexicography
WOLFGANG TEUBERT
Corpus Linguistics—More Than a Slogan?
During the last decade, it has been common practice among the linguistic community in Europe—both on the continent and on the British Isles—to use corpus linguistics to verify the results of classical linguistics. In North America, however, the situation is different. There, the Philadelphia-based Linguistic Data Consortium, responsible for the dissemination of language resources, is addressing the commercially oriented market of language engineering rather than academic research, the latter often being more interested in universal grammar or semantic universals than in the idiosyncrasies of natural languages. American corpus linguists such as Doug Biber or Nancy Ide, and general linguists who are corpus users by conviction such as Charles Fillmore, are almost better known in Europe than in the United States, which is even more astonishing when we take into account that the first real corpus in the modern sense, the Brown Corpus, was compiled in Providence, R.I., during the sixties.
Meanwhile, European corpus linguistics is gradually becoming a subdiscipline in its own right. Unfortunately, during the last few years, this has led to a slight bias towards ‘self-centred’ issues such as the problems of corpus compilation, encoding, annotation and validation, the procedures needed for transforming raw corpus data into artificial intelligence applications and automatic language processing software, not to mention the problem of standardisation with regard to form and content (cf. the long-term project EAGLES [Expert Advisory Group on Language Engineering Standards] and the transatlantic TEI [Text Encoding Initiative]). Today, these issues often tend to prevail over the original gain that the analysis of corpora may contribute to our knowledge of language. But it was exactly this corpus-specific knowledge that the first generation of European corpus linguists such as Sture Allén, Vladislav Andrushenko, Stig Johansson, Ferenc Kiefer, Bernard Quemada, Helmut Schnelle or John Sinclair had in mind. In West Germany, the Institut für Deutsche Sprache was among the first institutions that considered the collection of corpus data as one of their permanent tasks; its corpora date back as early as the late sixties, although at that time most corpus data was still only used for the verification of research results gained from traditional methods. But has today's corpus linguistics really advanced from there?
The recent textbooks claiming to provide an introduction to corpus linguistics still do not add up to more than a dozen—all of them in English. Unfortunately, except for the commendable books of Stubbs 1996 and Biber, Conrad and Reppen 1998, they do deplorably little to establish corpus linguistics as a linguistic discipline in its own right. Instead, they focus on the use of corpora and corpus analysis in traditional linguistics (syntax, lexicology, stylistics, diachrony, variety research) and applied linguistics (language teaching, translation, language technology). Corpus Linguistics by Tony McEnery and Andrew Wilson (McEnery and Wilson 1996) may serve as an example of this kind. Forty pages describe the aspects of encoding; 20 pages deal with quantitative analysis; 25 pages describe the usefulness of corpus data for computational linguistics, with another 30 pages covering the use of corpora in speech, lexicology, grammar, semantics, pragmatics, discourse analysis, sociolinguistics, stylistics, language teaching, diachrony, dialectology, language variation studies, psycholinguistics, cultural anthropology and social psychology, and the final 20 pages contain a case study on sublanguages and closure. McEnery and Wilson's book reflects the current state of corpus linguistics. In fact, it more or less corresponds to the topics covered at the annual meetings held by the venerable ICAME, an association dealing with English-language corpora (cf. Renouf 1998). Semantics is mainly left aside.
Surprisingly, when judged by their commercial value, it is not the written language corpora that are most successful, but rather speech corpora that can claim the highest prices. Speech corpora are special collections of carefully selected text samples (words, phrases, sentences) spoken by numerous different speakers under various acoustic conditions. They caused the final breakthrough in automatic speech recognition that computer models based on cognitive linguistics had failed to achieve for many years. The recognition of speech patterns was only made possible by a combination of categorial and probabilistic approaches in a connectionist model trained on large speech corpora. Speech analysis can thus be seen as an early impetus for the establishment of corpus linguistics as an independent discipline with its own theoretical background.
Lexicography is the second major field where corpus linguistics not only introduced new methods but also extended the entire scope of research, though without putting too much emphasis on the theoretical aspects of corpus-based lexicography. Here again, it was John Sinclair who led the way as initiator of the first strictly corpus-based dictionary of general language (COBUILD 1987). Britain was also the site of the first corpus-based collocation dictionaries (such as Kjellmer 1994). Bilingual lexicography may also benefit from a corpus-oriented approach: a fact that is evident when comparing the traditional Le Robert & Collins English-French Dictionary edited by B.T.S. Atkins with Valerie Grundy and Marie-Hélène Corréard's Oxford-Hachette Dictionary, which covers the same language pair. Here, the use of (monolingual) corpora led to a remarkably greater number of multi-word translation units (collocations, set phrases) and to context profiles that had been written with the target language in mind. Wörter und Wortgebrauch in Ost und West [Words and Word Usage in East and West Germany] (1992) by Manfred W. Hellmann may serve as the only German example of that era, using the corpus for lemma selection rather than semantic description. Only recently, in 1997, did a true corpus-based dictionary appear: Schlüsselwörter der Wendezeit [Keywords during German Unification] by Dieter Herberg, Doris Steffens and Elke Tellenbach.
Thus, at least in the field of written language, corpus linguistics is still in its infancy as a discipline with its own theoretical background—a statement which holds true not only for Germany but also for most other European countries. In this orientation phase, where corpus linguistics is still in the process of defining its position, most publications are in English, the language that has become the interlingua of the modern world. But this does not mean that corpus linguistics is dominated mainly by English and American scholars: this can clearly be seen when browsing through any issue of the International Journal of Corpus Linguistics. Still, German linguistics appears somewhat underrepresented in this discussion. One exception is Hans Jürgen Heringer. His innovative study on ‘distributive semantics’ shows a growing reception of the programme for corpus linguistics which is outlined below. In his book Das höchste der Gefühle [The most sublime of feelings] (Heringer 1999), he describes the validation of semantic cohesion between adjacent words on the basis of larger corpora. Above all, it is this area between lexis and syntax where corpus linguistics offers new insights.
Corpus Linguistics—A Programme
Corpus linguistics believes in structuralism as defined by John R. Firth; therefore, it insists on the notion that language as a research object can only be observed in the form of written or spoken texts. Neither language-independent cognition nor propositional logic can provide information on the nature of natural languages. For these are, as stated in an apophthegm by Mario Wandruszka, characterised by a mixture of analogy and anomaly. The quest for a universal structure of grammar and lexicon which is typical of the followers of Chomsky or Lakoff cannot meet the demands of these two aspects. Instead, corpus linguistics is closer to the semantic concept inherent in the continental European structuralism of Ferdinand de Saussure, which regards meaning as inseparable from form, that is, from the word, the phrase, the text. In this theory, meaning does not exist per se. Corpus linguistics rejects the ubiquitous concept of meaning as ‘pure information’, encoded into language by the sender and decoded by the receiver. Corpus linguistics, instead, holds that content cannot be separated from form; rather, they constitute the two aspects under which texts can be analysed. The word, the phrase, the text is both form and meaning.
The above statement clearly outlines the programme of corpus linguistics. It is mainly interested in those phenomena on the fringe between syntax and lexicon, the two subjects of classical linguistics. It deals with the patterns and structures of semantic cohesion between text elements that are interpreted as compounds, multi-word units, collocations and set phrases. In these phenomena, the importance of the context for the meaning becomes evident.
Corpus linguistics extends our knowledge of language by combining three different approaches: the (procedural) identification of language data by categorial analysis, the correlation of language data by statistical methods, and finally the (intellectual) interpretation of the results. Whilst the first two steps should be automated as far as possible, the last step requires human intentionality, as any interpretation is an act involving consciousness and is therefore not transmutable into an algorithmic procedure. This is the main difference between corpus linguistics and computational linguistics, which reduces language to a set of procedures.
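To make this division of labour concrete, the following sketch runs the two automatable steps on a toy corpus: a naive whitespace tokeniser stands in for the categorial identification of language data, and pointwise mutual information over adjacent word pairs stands in for the statistical correlation step; the ranked pairs are then handed over to the human interpreter. The sample sentences, the restriction to adjacent pairs and the choice of pointwise mutual information are illustrative assumptions, not taken from Teubert's text.

```python
# Sketch only: toy corpus, naive tokeniser and pointwise mutual information
# (PMI) as one possible correlation measure; none of these choices is
# prescribed by the article.
import math
from collections import Counter

corpus = [
    "the corpus provides empirical evidence of word usage",
    "corpus linguistics describes word usage in context",
    "meaning is negotiated in the discourse of a language community",
]

# Step 1: (procedural) identification of language data.
sentences = [sentence.split() for sentence in corpus]
word_freq = Counter(word for sent in sentences for word in sent)
pair_freq = Counter()
for sent in sentences:
    for w1, w2 in zip(sent, sent[1:]):        # adjacent word pairs only
        pair_freq[(w1, w2)] += 1

total = sum(word_freq.values())

# Step 2: correlation of language data by a statistical measure.
def pmi(pair):
    w1, w2 = pair
    p_pair = pair_freq[pair] / total
    return math.log2(p_pair / ((word_freq[w1] / total) * (word_freq[w2] / total)))

# Step 3 begins here: the ranked list is handed over to the human interpreter.
for pair in sorted(pair_freq, key=pmi, reverse=True)[:5]:
    print(pair, round(pmi(pair), 2))
```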
Corpus linguistics assumes that language is a social phenomenon, to be observed and described above all in accessible empirical data, that is, in communication acts. Corpora are cross-sections through a universe of discourse which incorporates virtually all communication acts of any selected language community, be it monolingual (e.g., German or English), bilingual (e.g., South Tyrolean, Welsh) or multilingual (e.g., Western European). However, many of the texts that corpora preserve and make accessible have, in principle, only a limited life-span: most printed texts, such as newspaper texts, pass out of public reach within a very short time.
If we consider language as a social phenomenon, we do not know—and do not want to know—what is going on in the minds of people, how speakers or hearers understand the words, sentences and texts that they speak or hear. Language as a social phenomenon manifests itself only in texts that can be observed, recorded, described and analysed.
Most texts happen to be communication acts, that is, interactions between members of a language community. An ideal universe of discourse would be the sum of all communication acts ever uttered by members of a language community. Therefore, it has an inherent diachronic dimension. Of course, this ideal universe of discourse would be far too large for linguistics to explore in its entirety. It would have to be broken down into cross-sections with regard to the phenomena that we want to describe. There is no such thing as a ‘one-size-fits-all’ corpus. It is the responsibility of the linguist to limit the scope of the universe of discourse in such a way that it may be reduced to a manageable corpus, by means of parameters such as language (sociolect, terminology, jargon), time, region, situation, and external and internal textual characteristics, to mention just a few.
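As a minimal illustration of such a parameter-driven reduction, the sketch below filters a hypothetical text archive by language, time span and text type to obtain a manageable sub-corpus. The metadata fields and sample records are invented for the example and do not describe any actual corpus infrastructure.

```python
# Sketch only: the metadata fields and the sample records are hypothetical.
from dataclasses import dataclass

@dataclass
class Text:
    identifier: str
    language: str   # e.g. "de", "en"
    year: int
    region: str
    text_type: str  # e.g. "newspaper", "fiction"
    body: str

archive = [
    Text("t1", "de", 1990, "east", "newspaper", "..."),
    Text("t2", "de", 1995, "west", "newspaper", "..."),
    Text("t3", "en", 1995, "uk", "fiction", "..."),
]

def select_corpus(texts, language, years, text_types):
    """Cut one manageable cross-section out of the universe of discourse."""
    first, last = years
    return [
        t for t in texts
        if t.language == language
        and first <= t.year <= last
        and t.text_type in text_types
    ]

sub_corpus = select_corpus(archive, language="de", years=(1989, 1995),
                           text_types={"newspaper"})
print([t.identifier for t in sub_corpus])   # -> ['t1', 't2']
```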
When we look at language as a social phenomenon, we assume that meaning is expressed in texts. What a text element or text segment means is the result of negotiation among the members of a language community, and these negotiations are also part of the discourse. Thus, the language community sets the conventions on the formal correctness of sentences and on their meaning. Those conventions are both implicit and dynamic; they are not engraved in stone like commandments. Any communication act may utilise syntactic structures in a new way, create new collocations, introduce new words or redefine existing ones.
If those modifications are used in a sufficient number of other communication acts or texts, they may well result in the modification or amendment of an existing convention. One basic difference between natural and formal languages is the fact that natural language not only permits but actually integrates metalinguistic statements without explicitly marking the metalinguistic level. There is no separation between object language and metalanguage. Any convention may be discussed, questioned or even rejected in a text. Above all, discourses deal with meaning, and it is corpus linguistics that is best suited to deal with this dynamic aspect of meaning. We, as linguists, have no access to the cognitive encoding of the conventions of a language community. We only know what is expressed in texts.
Dictionaries, grammars, and language textbooks are also texts; therefore, they are part of the universe of discourse. As long as they represent socially accepted standards, we have to take account of their special status. Still, their contents are neither comprehensive nor always based on factual evidence. Corpus linguistics, on the other hand, aims to reveal the conventions of a given language community on the basis of a relevant corpus. In a corpus, words are embedded in their context. Corpus linguistics is therefore especially well suited to describing gradual changes in meaning: it is the context which determines the concrete meaning in most areas of the vocabulary.
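Because it is the context that determines the concrete meaning, the classic instrument for inspecting words embedded in their contexts is a keyword-in-context (KWIC) concordance. The bare-bones sketch below is only an illustration; the sample text and the window of three words on each side are assumptions, not part of the original discussion.

```python
# Sketch only: toy text and a fixed context window; a real concordancer would
# work on an indexed corpus rather than a single string.
def kwic(text, keyword, window=3):
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>25} | {token} | {right}")
    return lines

sample = ("the meaning of a word is negotiated in discourse and the meaning "
          "shifts as the word is used in new contexts")
for line in kwic(sample, "meaning"):
    print(line)
```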
