Corpus exploitation

  • Anne ABEILLÉ (Paris 7)
    Corpora and syntax: the contribution of computational linguistics
    1996, Vol. I-2, pp. 7-23

    Automatic annotation of electronic texts has become a major activity in computational linguistics. We present here and comment on the "Penn Treebank", a large English corpus with complete syntactic annotations, and two publicly available taggers for French, one from the Xerox Research Center, the other from INaLF.


  • Martine ADDA-DECKER (Paris)
    Corpus for automatic transcription of spoken texts
    2007, Vol. XII-1, pp. 71-84

    This contribution gives an overview of automatic speech recognition research, highlighting the need for corpus development. As recognition systems rely largely on statistical approaches, large amounts of both spoken and written corpora are required. In order to fill the gap between written and spoken language, speech transcripts need to be produced manually using appropriate tools. Methods and resources accumulated over the years now allow us not only to tackle genuine oral genres, but also to envision large-scale corpus studies that increase our knowledge of spoken language and improve automatic processing.


  • Sophie ASLANIDES (Paris 8)
    Adapting a finely annotated corpus for linguistic research objectives
    1999, Vol. IV-1, pp. 97-99
  • Manuel BARBERA (Turin, Italie)
    Complex lexical units and their morphosyntactic treatment in the Corpus Taurinense
    2000, Vol. V-2, pp. 57-70

    Corpus Taurinense (CT) is the POS-tagged version of the ItalAnt Corpus, an electronic corpus of Old Italian texts (written between 1251 and 1300). In this article we describe the approach followed in CT for the annotation of multiword units (MWU). An MWU, in our work, is a set of two or more graphic words that (also) receives an overall POS tag, because this set of words stands in a paradigmatic relation with a one-word lexical unit of the same POS. Our POS tagging confirms that most Modern Italian compound conjunctions were not yet lexicalised at that time. The order of the components is already the Modern Italian order, but they can still be interrupted by occasional elements.


  • Olivier BAUDE (Orléans)
    Legal and ethical aspects of conserving and diffusing corpora of spoken texts
    2007, Vol. XII-1, pp. 85-97

    The digitization of spoken language corpora opens broad perspectives for linguistics. However, the archiving and exploitation of these spoken corpora raise new ethical and legal problems that the scientific community must take into account. This article presents the results of an interdisciplinary working group which wrote a guide of good practices for the constitution, exploitation, archiving and diffusion of spoken language corpora.


  • Valérie BEAUDOUIN (France Télécom R & D)
    Metrics in rules
    2004, Vol. IX-1, pp. 119-137

    Metrical and rhythmic aspects are examined in an 80,000-verse corpus analysed with computational linguistics tools. We propose a cumulative experimental approach consisting in building a verse pattern from a series of features (morpho-syntactic, stress, rhyme, etc.). Features may characterise units of different levels (syllable, hemistich, verse, etc.) and are evidenced by different tools, but all are integrated in a single database. We can thus verify classic metrical rules and hypotheses. We also document new regularities, for example in stress patterns, and we test new hypotheses about links between features and patterns. Beyond the verification of hypotheses, this empirical approach on a large corpus may lead to the construction of grounded theories.


  • Claire BLANCHE-BENVENISTE (Aix-en-Provence)
    Building and using a large corpus
    1999, Vol. IV-1, pp. 65-74

    This article aims to demonstrate how a corpus of spoken French, started in Aix-en-Provence around 1975, has developed over time in connection with the development of what has since been called 'corpus linguistics'. The story of that corpus is related here, while the possibilities for exploitation that it offers today are briefly traced.


  • Claire BLANCHE-BENVENISTE (Aix-en-Provence)
    On the usefulness of linguistic corpora
    1996, Vol. I-2, pp. 25-42
  • Mylène BLASCO-DULBECCO (Clermont-Ferrand)
    Proven relationships between data and analysis
    1999, Vol. IV-2, pp. 31-40

    More often than not, oral data differ from written data from both a frequential and a distributional point of view. They lead to a sharpening of the description, as they supply construction characteristics or contexts that do not exist in writing. Dislocations and the form 'il y a', known to be lavishly used in oral language, provide examples that are often predictable as regards their distributional characteristics as well as their function in textual dynamics. Although 'certains' as a subject is not much used in spoken language, it offers a variety of distributional facts which are also clearly divided and actually related to the kind of corpus observed. This article therefore presents three case studies that are representative of the relationship between data and analysis.


  • Christian BOITET (Grenoble 1)
    Corpus for the Machine Translation: types, sizes and connected problems, in relation to use and system type
    2007, Vol. XII-1, pp. 25-38

    It is important to realise that human translation is difficult and diverse, and that automation is needed not only by end users, but also by translators and interpreters. Automation itself also comes in many forms. After briefly describing computer tools for translators, we concentrate on the linguistic and computational approaches to the automation of translation proper. This survey yields an array of criteria for categorising existing CAT systems, with brief examples of the state of the art. Finally, we present perspectives for future research, development, and dissemination.


  • Paul CAPPEAU (Poitiers)
    The sociolinguistic exploitation of large corpora. Key-word and stone of wisdom
    2007, Vol. XII-1, pp. 99-110

    The desire to make use of large collections of oral data is now widely shared by linguists. At a time when such tools are becoming increasingly available for French, it is important to ensure sensitivity to all the factors which guarantee reliability at the different stages of obtaining data: clarification of the term 'corpus'; reflection on approaches to the field and to orality, and on representativeness (both in terms of genres and of numbers of speakers); data elicitation practices; and transcription.


  • Anne CONDAMINES (Toulouse 2-Le Mirail / CNRS)
    The role of interpretation in corpus semantics: building a terminology
    2007, Vol. XII-1, pp. 39-52

    The aim of this paper is to highlight the necessity of a twofold frame of reference when carrying out the semantic analysis of corpus data. The first frame lies in the situation in which texts are produced, the second in the interpretation of the texts. In both cases, the author suggests using the notion of 'genre' (textual 'genre' and interpretative 'genre') in order to classify and categorise situations. The issue is exemplified by the problem of building terminologies according to a particular interpretative 'genre'. The paper shows how textual 'genre' influences the functioning of conceptual relation patterns (e.g. the preposition avec used to spot a meronymic relation), and demonstrates that this kind of analysis may help to refine descriptions initially made by introspection.


  • Antoine CONSIGNY (Liverpool, Grande-Bretagne)
    Looking at Phrasal Verbs in a Data-Driven Perspective : A Case Study of 'Take Up'
    2000, Vol. V-2, pp. 7-18

    The aim of the paper is to present a case study of the phrasal verb (PV) 'take up' from a data-driven perspective. Previous approaches are first reviewed, and it is shown that they have shortcomings because their results rely solely on the linguist's intuition. Here the PV is studied semantically in a computerised corpus of the British newspaper The Guardian, using Johns and Scott's (1993) MicroConcord concordancing software. The occurrences of 'take up' are studied individually before a list of senses is established. Once the different senses are defined, a second step is to look into the PV itself, at the meanings of its parts (verb and postposition). Comparing the results of the study with previous studies of postpositions in PVs done elsewhere (in particular Lindner, 1981; Side, 1990; Hampe, 1997; Hannan, 1998) and of verbs (in particular Consigny, 1995; Allen, 1998), it is argued that the relative importance of the postposition is not as great as some would have it.


  • Maria de Lourdes CRISPIM (Lisbonne, Portugal)
    Building and using a corpus of medieval Portuguese
    1999, Vol. IV-1, pp. 41-45

    In this article, the authors first describe how the Corpus of Medieval Portuguese was constituted, in particular how it was coded; secondly, they demonstrate how it can be used for the construction of a dictionary of medieval Portuguese, more specifically of its verbs and its proper and common nouns.


  • Jean-Pierre DESCLÉS (Paris 4)
    Information retrieval from corpora of technical texts
    1997, Vol. II-2, pp. 19-33

    Technical texts present interesting and so far poorly researched linguistic characteristics. This article describes a research project, carried out by a multidisciplinary group of linguists and computer scientists, which aims at designing and realising prototype computer programmes for extracting information from technical texts. As illustrated by concrete examples, this research has led to programmes whose output takes the form either of networks of concepts or of phrases taken from the analysed texts, accompanied where necessary by automatically assigned semantic information.


  • Marie-Laure ELALOUF (Cergy-Pontoise)
    The building-up and exploitation of corpora of texts written in schools
    2007, Vol. XII-1, pp. 53-70

    The first part of this article explains which methodological issues need to be examined in order to establish and transcribe a large corpus of texts written by pupils, along with their school context. The second part states the various lines of epistemological questioning which led to a second research project: questions about how to define types of school writing, as well as corpus and context, and about the necessary links between those three elements. A variety of software programs was used to analyse corpora which do not conform to orthographic and stylistic standards. Such use proves possible when combined with qualitative analysis.


  • Pablo GAMALLO (Lisbonne, Portugal)
    Lexical databases and 'inheritance' systems based on meronymic relationships
    2000, Vol. V-2, pp. 45-56

    The majority of lexical databases and computerised thesauri are organised on the principle of a lexical 'inheritance' system based on taxonomic relationships (IS_A). This relationship is perceived as the channel through which lexical information is passed. We claim, however, that the transfer of information in a lexical thesaurus can also take place through other kinds of relationships. In this respect, we analyse the 'inheritance' mechanism based on the meronymic relationship COMPOSED_OF. The main object of this paper is to characterise the framework of a lexical system based on meronymic relationships, i.e. a system which allows a whole to inherit the information of its parts. Further, we want to demonstrate that this type of inheritance allows for a model of metonymic interpretation of polysemic nouns.


  • Nathalie GARRIC (Tours)
    Disambiguating proper nouns by use of local grammars
    2000, Vol. V-2, pp. 85-100

    This paper is part of the PROLEX project on the automatic processing of proper nouns. Our objective is, with the help of the computer, not only to identify the various occurrences of the definite proper noun (modified or not), but also to tag them with a relevant type of interpretation: referential, denominative, model, metaphoric or split. After elaborating a typology of the various uses of the definite proper noun, we extract the formal and lexical indications that allow the disambiguation of the referential and semantic functioning of the proper noun. After isolating these differential units (e.g. determiners, adjectives, predicates of existence), we build local grammars intended for automatic recognition.


  • Jacqueline GUILLEMIN-FLESCHER (Paris 7)
    Human translation: constraints and corpora
    1996, Vol. I-2, pp. 43-56

    This study is based on a corpus of texts and translations in French and English and aims to show that there are recurrent patterns in translators' choices. Three examples are examined: existential sentences, passive structures and attributive constructions. The criteria that condition the features observed in the corpus are underlined and an analysis of the constraints is proposed. In conclusion, the specific points dealt with are related to a more fundamental difference between the two languages.


  • Céline GUILLOT (ENS-LSH Lyon)
    Medieval French corpora: current state and perspectives
    2007, Vol. XII-1, pp. 125-128
  • Serge HEIDEN (ENS LSH Lyon)
    Electronic aids in studying medieval texts: methods and tools
    2004, Vol. IX-1, pp. 99-118

    Two approaches to the development of medieval text corpora can be distinguished among the projects carried out over the past few decades. The first consists of digitising modern critical editions; the second is concerned with producing precise diplomatic transcriptions of manuscripts, often directly linked to photographs of the originals. These approaches are in fact complementary rather than contradictory, as they allow scholars to choose between the quantity (representativeness) and the quality (accuracy and richness) of the data, depending on the goals of their research. For both types of corpora, the challenges of XML-TEI encoding are considered in relation to the tools used to process and analyse them. Many methodological problems which arise from creating and processing medieval text corpora also concern other types of linguistic corpora.


  • Thomas LEBARBÉ (Caen)
    TAPAS: Treatment and Analysis in Syntax by Augmented Perception
    2000, Vol. V-2, pp. 71-83

    In this article, we present a novel approach to syntactic parsing. As opposed to usual methods that treat syntactic parsing as a series of processes, we suggest an architecture based on hybrid agents whose task is robust deep syntactic parsing. After a short summary of the research on which we have based our work, we present, by means of an example, the theoretical functioning of the architecture. This then allows us to describe the APA architecture, developed by Girault, that we have used in this joint research. Finally, in conclusion, we present some perspectives for implementation.


  • Sarah LEROY (Paris X-Nanterre)
    Extraction on patterns: two-way traffic between linguistic analysis and automated identification
    2004, Vol. IX-1, pp. 25-43

    We present here an automated identification of proper name antonomasia in tagged texts. First, we compare manual and computed identification, describing the system's workings as well as the methods and tools we used; we point out that the automated process is more reliable. After explaining how the capabilities and limits of automated location can influence linguistic work, we compare this rather old (2000) work with new tools now usable by linguists, e.g. the ability to query a subset of tagged texts in the Frantext database.


  • Patrick LEROYER (Aarhus, Danemark)
    In terms of wine: lexicographisation of an on-line tourist guide for wine-lovers
    2009, Vol. XIV-2, pp. 99-116

    Online tourist guides are information tools communicating a destination image and specialised knowledge at the same time. They feature a large variety of lexicographic structures, including word lists, articles, conceptual schemes, indexes and registers, keyword search options, internal and external cross-references, etc. This is by no means surprising insofar as what is needed is effective data access in order to extract information, precisely as in lexicography. The functional thesis we defend in this article is that lexicographisation in a user perspective can improve the access process. Taking œnotouristic online guides as a case in point, we examine different user situations leading to consultation, in particular the need for experiential information, in which users simply wish to improve the conditions of their œnotouristic experience. We then formulate theoretical proposals aimed at ensuring better interaction of lexicographic functions, data presentation and access possibilities.


  • Stéphanie LOPEZ (Toulouse 2-Le Mirail / CNRS)
    An analysis of communication between pilots and air traffic control : between norms and realities of linguistic usage
    2014, Vol. XIX-1, pp. 87-101

    The domain of air traffic control is the perfect example for the analysis of a linguistic norm. In this field, phraseology is a specialised language created to cover the most common situations encountered in air navigation in order to secure and optimise radiotelephony communications. When phraseology proves inadequate, a more natural form of language, called plain language, is required: it was recently introduced in the field and is a difficult notion to implement. A comparative analysis between a reference corpus and a real-communication corpus allows the description and categorisation of different types of variations used on the radio frequency as well as a discussion on the notions of norm and usage in the field of air traffic control.


  • Christiane MARCHELLO-NIZIA (ENS)
    Diachronic Corpora
    1999, Vol. IV-1, pp. 31-39

    After recalling what distinguishes a database from a corpus, the author gives a rapid overview of the most important documentary sources that exist today in the domain of French diachrony. She then demonstrates, first, what use can be made of them thanks to a series of tools available today and, second, more importantly, how access to large corpora allows us to review our analysis of linguistic facts and invites us to a qualitative change in our linguistic reasoning.


  • Augusta MELA (Montpellier 3)
    Linguists and NLP specialists may work together: location and analysis of glosses
    2004, Vol. IX-1, pp. 63-82

    This paper relates to a collective linguistic research project about the word and its gloss. Just like definitions, glosses capture 'the spoken experience of meaning'. In French texts, this metalinguistic activity appears in words such as c'est-à-dire, ou, signifier. These signs can clarify the nature of the semantic relationship between two words: specification with au sens, equivalence with ou and c'est-à-dire, nomination with dit and baptisé, hyponymy with en particulier and comme, hyperonymy with et/ou autre(s), etc. Glosses can be located automatically thanks to both their marks and the features of their configurations. This paper describes the implementation of an automatic retriever of 'ou glosses' such as 'un magazine électronique, ou webzine', in a data-processing environment 'for linguists', namely the textual base Frantext and the interpreter of its Stella query language.


  • Morten PILEGAARD (Aarhus, Danemark)
    Collaborative repositories: An organisational and technological response to current challenges in specialised knowledge communication?
    2009, Vol. XIV-2, pp. 57-71

    This paper presents concepts and systems for multilingual terminological and textual knowledge codification, representation, validation, management and sharing, structured around the notion of genre. These systems operationalise the different stages of the 'virtuous knowledge cycle' within a dynamic, multilingual specialised web-dictionary and a multilingual, genre-based corpus of medical texts organised into genre hierarchies or systems. The knowledge-cycle approach mirrors 'real life' working processes and allows for repeated conversions of knowledge between its tacit and explicit forms, allowing knowledge to be codified and to spiral up from the individual to the collective level of the corporate 'community of practice'. The paper reports on the results of implementing these concepts and systems in general, and the web-dictionary in particular, within the Danish health care, pharmaceutical, medical device and translation sectors, which have technologically been fused into one collective 'knowledge cluster', and it discusses the opportunities for research and business that spring from the fusion of language and health technologies.


  • Claus D. PUSCH (Fribourg, Allemagne)
    Romance linguistics corpora in German-speaking countries: assessment and perspectives
    2007, Vol. XII-1, pp. 111-124
  • Caroline SCHAETZEN (DE) (Bruxelles, Belgique)
    Corpora and terminology: Building specialised corpora for making dictionaries
    1996, Vol. I-2, pp. 57-76

    The construction of dictionaries and specialised glossaries is increasingly based on large corpora. This article presents a state of the art of the numerous technical problems that arise in the construction and exploitation of such corpora, and of the computer programmes developed to help solve them.