2000-2: Diversité du traitement automatique des langues
(Diversity in automatic language processing)
Antoine CONSIGNY (Liverpool, United Kingdom). Looking at Phrasal Verbs in a Data-Driven Perspective: A Case Study of 'Take Up', pp. 7-18
The aim of the paper is to present a case study of the phrasal verb (PV) 'take up' from a data-driven perspective. First, previous approaches are reviewed and shown to have shortcomings because their results rely solely on the linguist's intuition. Here, the PV is studied semantically in a computerised corpus of the British newspaper The Guardian, using Johns and Scott's (1993) Microconcord concordancing software. The occurrences of 'take up' are studied individually before a list of senses is established. Once the different senses are defined, the second step is to look into the PV itself, for the meanings of its parts (verb and postposition). Comparing the results with previous studies of postpositions in PVs (in particular Lindner, 1981; Side, 1990; Hampe, 1997; Hannan, 1998) and of verbs (in particular Consigny, 1995; Allen, 1998), it is argued that the postposition is relatively less important than some have claimed.
François MANIEZ (Lyon 2). Le repérage par traitement automatique du défigement lexical des proverbes dans la presse américaine (Automatic retrieval of intentionally modified proverbs in the American press), pp. 19-32
Reference to a well-known proverb or phrase by altering one of its components is a widespread phenomenon in the British and American press. Since such modifications can impede a non-native speaker's understanding of newspaper or magazine articles, a system that could identify them and refer learners of English as a second language to the original wording of such expressions might be useful in the design of an on-line comprehension assistant. Using a database of 10,500 titles from an American news magazine, we analyze the various types of modifications that come into play in the use of shared cultural references. By comparing our database with 800 English proverbs, we test various ways in which such modifications can be automatically detected.
J.G. KRUYT (Leiden, Netherlands). Towards the Integrated Language Database of 8th-21st Century Dutch, pp. 33-44
In the past decade, technology has had a major impact on the activities of the Institute for Dutch Lexicology (INL). The results include three electronic dictionaries, covering the period from 1200 up to 1976, and some linguistically annotated text corpora of historical and present-day Dutch. Three present-day corpora have been widely used not only for lexicography but also for many other purposes, since becoming accessible over the Internet in 1994. Advanced technology will have even more importance for a project recently started, the Integrated Language Database of 8th-21st Century Dutch, in which the dictionaries, lexica and a diachronic text corpus will be linked in a meaningful way. Parts of the database will be linked with comparable data collections at other institutes, thus creating a supra-institutional research instrument which will provide new opportunities for innovative research.
Pablo GAMALLO (Lisbon, Portugal). Bases lexicales et systèmes d'héritage conduits par la relation de méréonymie (Lexical databases and inheritance systems driven by the meronymy relation), pp. 45-56
The majority of lexical databases and computerized thesauri are organised around a system of lexical inheritance based on taxonomic relationships (IS_A). This relationship is perceived as the channel through which lexical information is passed. We claim, however, that the transfer of information in a lexical thesaurus can also take place through other kinds of relationships. In this respect, we analyse the inheritance mechanism based on the meronymic relationship COMPOSED_OF. The main object of this paper is to characterise the framework of a lexical system based on meronymic relationships, i.e. a system which allows a whole to inherit the information of its parts. Further, we want to demonstrate that this type of inheritance allows for a model of metonymic interpretation of polysemous nouns.
Manuel BARBERA & Carla MARELLO (Turin, Italy). Les lexies complexes et leur annotation morphosyntaxique dans le Corpus Taurinense (Complex lexical units and their morphosyntactic treatment in the Corpus Taurinense), pp. 57-70
Corpus Taurinense (CT) is the POS-tagged version of the ItalAnt Corpus, an electronic corpus of Old Italian texts (between 1251 and 1300). In this article we describe the approach followed in CT for the annotation of multiword units (MWU). An MWU in our work is a set of two or more graphic words which (also) receive an overall POS tag because the set is in paradigmatic relation with a one-word lexical unit with the same POS. Our POS tagging confirms that most of the Modern Italian compound conjunctions were not yet lexicalised at that time. The order of the components is already the Modern Italian order, but they can still be interrupted by occasional elements.
Thomas LEBARBÉ & François GIRAULT (Caen). TAPAS : Traitement et Analyse par Perception Augmentée en Syntaxe (TAPAS: Treatment and Analysis in Syntax by Augmented Perception), pp. 71-83
In this article, we present a novel approach to syntactic parsing. As opposed to usual methods that consider syntactic parsing as a series of processes, we suggest here an architecture based on hybrid agents whose task is robust deep syntactic parsing. After a short summary of the research on which we have based our work, we present the theoretical functioning of the architecture by means of an example. This then allows us to describe the APA architecture, developed by Girault, that we have used in this joint research. In conclusion, we present some perspectives for implementation.
Nathalie GARRIC & Denis MAUREL (Tours). Désambiguïsation des noms propres déterminés par l'utilisation des grammaires locales (Disambiguating proper nouns by use of local grammars), pp. 85-100
This paper is part of the PROLEX project on the automatic processing of proper nouns. Our objective is, with the help of the computer, not only to identify the various occurrences of the definite proper noun (modified or not modified), but also to tag them with a relevant type of interpretation: referential, denominative, model, metaphoric or split. After elaborating a typology of the various uses of the definite proper noun, we try to extract the formal and lexical cues that allow the referential and semantic functioning of the proper noun to be disambiguated. After isolating these differential units (e.g. determiners, adjectives, predicates of existence), we build local grammars intended for automatic recognition.
Denise MALRIEU & François RASTIER (CNRS-Paris). Genres et variations morphosyntaxiques ('Genres' and morphosyntactic variations), pp. 101-120
A differential statistical analysis of 2,600 complete texts from a French-language textual database, parsed and tagged by the parser CORDIAL, enabled us to test and exploit the notion of "textual genre". A prior "manual" classification of the texts enabled us to combine deductive and inductive approaches to test the existence of significant differences between discourses, generic fields and genres, attested on 250 morphosyntactic variables. The univariate analysis shows more and stronger differences between discourses and generic fields than between narrative genres. The ascending hierarchical classification confirms the differences between discourses and generic fields (legal vs others; theatre and poetry vs narrative genres), but it establishes mixed classes at the bottom of the hierarchy, with the detective novel contrasting more with the other narrative genres. These results confirm the interest of the notion of genre for textual linguistic analysis, strengthen Hjelmslev's hypothesis that syntax belongs to linguistic content, and show scale solidarities between the global text level and the local word level that have gone unnoticed until now.
Béatrice OSMONT (IUFM-Lille). Comment définir le genre hypertextuel d’un site d’établissement (Defining the hypertextual genre in school websites), pp. 121-136
The notion of genre is applied to socially located forms of textually produced web hypertext. A semantic analysis approach is taken with a view to accounting for the complexity of these hypertextual forms. School web sites are taken as examples to demonstrate certain recurrent properties specific to this group of sites.
Alberto DIAZ ESTEBAN & Pablo Gervas GOMEZ-NAVARRO (Madrid, Spain). Three Information Filtering Applications on the Internet driven by Linguistic Techniques, pp. 137-149
Sylvie NORMAND & Didier BOURIGAULT (CNRS-Rouen / Toulouse). Analyse des adjectifs d'un corpus médical à l'aide d'outils de traitement automatique des langues (Analysis of the adjectives of a medical corpus by means of automatic language processing), pp. 151-160
Divergent descriptions of histopathologic images induce inter- and intra-observer variability in diagnosis based on the observation of breast tumour images. The lack of reproducibility in identifying specific morphological features is partly due to varying levels of expertise among pathologists and to differences in subjective analysis and comprehension of pathological images. As linguists and developers of Natural Language Processing (NLP) systems, we started a collaboration with the Medical Informatics Department at the Broussais Hospital in order to explore a new approach to corpus-based medical glossary acquisition. We focused our analysis on adjectives because they are the main linguistic category involved in the evaluation process. The first results of this study show the relevance of a corpus-based approach in coping with the "subjective" interpretations given by pathologists when they analyse microscopic images.