Corpus construction

  • Manuel BARBERA (Turin, Italy)
    Les lexies complexes et leur annotation morphosyntaxique dans le Corpus Taurinense
    (Complex lexical units and their morphosyntactic treatment in the Corpus Taurinense)
    2000, Vol. V-2, pp. 57-70

    Corpus Taurinense (CT) is the POS-tagged version of the ItalAnt Corpus, an electronic corpus of Old Italian texts (written between 1251 and 1300). In this article we describe the approach followed in CT for the annotation of multiword units (MWU). In our work, an MWU is a set of two or more graphic words which (also) receives an overall POS tag because the set stands in paradigmatic relation with a one-word lexical unit with the same POS. Our POS tagging confirms that most of the Modern Italian compound conjunctions were not yet lexicalised at that time. The order of the components is already the Modern Italian order, but they can still be interrupted by occasional elements.

  • Rabia BELRHALI (INPG-Grenoble)
    BdPholex : une base de données phonétiques et lexicales du français parlé
    (BdPholex: a phonetical and lexical database of spoken French)
    1999, Vol. IV-1, pp. 75-78
  • Christian BOITET (Grenoble 1)
    Corpus pour la TA : types, tailles et problèmes associés, selon leur usage et le type de système
    (Corpora for machine translation: types, sizes and associated problems, in relation to use and system type)
    2007, Vol. XII-1, pp. 25-38

    It is important to realise that human translation is difficult and diverse, and that automation is needed not only by end users, but also by translators and interpreters. Also, automation itself comes in many forms. After briefly describing computer tools for translators, we will concentrate on the linguistic and computer approaches to the automation of translation proper. This survey will yield an array of criteria for categorizing existing CAT systems, with brief examples of the state of the art. Finally, we present perspectives for future research, development, and dissemination.

  • Louis-Jean BOË (Grenoble)
    La matérialité des structures sonores du langage
    (The material aspect of sound structures in language)
    1996, Vol. I-1, pp. 41-54

    Do the major tendencies of phonological systems of languages depend on constraints of production and perception? This problem has been studied in the framework of "substance oriented" linguistics, which was introduced simultaneously by Lindblom and Stevens in 1972. Various universal tendencies of phonological systems that might be explained by the characteristics of the sound structures, and that could be looked upon from an ontogenetic perspective, will be presented and discussed here. The characteristics and the predictability of vocalic and syllabic systems seem to be eminently suited for the study of this question on the basis of research carried out at the ICP.

  • Veerle BROSENS (Louvain, Belgium)
    Les projets ELILAP et LANCOM
    (The ELILAP and LANCOM projects)
    1999, Vol. IV-1, pp. 89-95
  • Henri BÉJOINT (Lyon 2)
    Informatique et lexicographie de corpus : les nouveaux dictionnaires
    (Computer science and corpus lexicography: the new dictionaries)
    2007, Vol. XII-1, pp. 7-23

    The dictionary evolved from medieval glosses that explained fragments of discourse in their contexts. Those fragments were later collected, then classified and reduced to their simplest forms, i.e. words. The most important aspect of that evolution from the gloss to the dictionary is that the fragment to be explained was decontextualized, extracted from discourse. The main objective of the dictionary is to give an image of the system. It is now possible to improve the dictionary in its role as a tool for explaining discourse. It cannot provide explanations that would be adapted to every single context, but it can give the user a huge quantity of discourse, and provide explanations that would be more closely adapted to every occurrence or type of occurrence. Lexicographers would be well advised to investigate those new possibilities.

  • Nicoletta CALZOLARI (CNR-Pisa, Italy)
    Standards for Linguistic Resources in Europe: the LE-EAGLES Project
    1999, Vol. IV-1, pp. 57-64

    The rapid growth of digitized linguistic information has raised the problem of its standardisation with a view to broader and better use, while at the same time a need was felt to test the various tools developed for this goal. On the initiative of the European Commission, these questions have led to several research projects aiming to propose useful standards for the whole of Europe, including the EAGLES Project presented here.

  • Marie-Laure ELALOUF (Cergy-Pontoise)
    Construction et exploitation de corpus d'écrits scolaires
    (The building-up and exploitation of corpora of texts written in schools)
    2007, Vol. XII-1, pp. 53-70

    The first part of this article explains which methodological issues need to be examined in order to establish and transcribe a large corpus of texts written by pupils, along with their school context. The second part states the various lines of epistemological questioning which led to a second research project, i.e. questions about how to define types of school writing as well as a corpus and context, and about the necessary links between those three elements. A variety of software programs was used to analyse corpora which did not conform to orthographic and stylistic standards. Such use seems feasible when combined with qualitative analysis.

  • Benoît HABERT (Paris X-Nanterre)
    Outiller la linguistique : de l'emprunt de techniques aux rencontres de savoirs
    (To tool up linguistics: from borrowing techniques to the meeting of knowledge)
    2004, Vol. IX-1, pp. 5-24

    As such, linguistic research does not imply specific devices. However, linguistic descriptions and models would benefit from relying more often on NLP (Natural Language Processing) tools and resources and on computer science methods. The possible outcome depends on the chosen type of interaction between NLP, computer science and linguistics. A synergy between paradigms and methodologies would be more fruitful than a mere import of techniques.

  • J.G. KRUYT (Leiden, Netherlands)
    Towards the Integrated Language Database of 8th-21st Century Dutch
    2000, Vol. V-2, pp. 33-44

    In the past decade, technology has had a major impact on the activities of the Institute for Dutch Lexicology (INL). The results include three electronic dictionaries, covering the period from 1200 up to 1976, and some linguistically annotated text corpora of historical and present-day Dutch. Three present-day corpora have been widely used not only for lexicography but also for many other purposes, since becoming accessible over the Internet in 1994. Advanced technology will have even more importance for a project recently started, the Integrated Language Database of 8th-21st Century Dutch, in which the dictionaries, lexica and a diachronic text corpus will be linked in a meaningful way. Parts of the database will be linked with comparable data collections at other institutes, thus creating a supra-institutional research instrument which will provide new opportunities for innovative research.

  • Jon LANDABURU (CNRS-Célia)
    La construction d'une base de données linguistiques pour les langues amérindiennes de Colombie : atlas, glossaires, sonothèques
    (Building a linguistic database for the Amerindian languages of Colombia: atlases, glossaries, sound archives)
    1997, Vol. II-1, pp. 83-90
  • Ann LAWSON (IDS-Mannheim, Germany)
    Corpus Linguistics at the Institut für deutsche Sprache
    1999, Vol. IV-1, pp. 79-82
  • Thomas Hun-tak LEE (Hong Kong)
    CANCORP - The Hong Kong Cantonese Child Language Corpus
    1999, Vol. IV-1, pp. 21-30

    This article presents CANCORP (the Hong Kong Cantonese Child Language Corpus), a corpus built in the spirit of the Child Language Data Exchange System (CHILDES, MacWhinney & Snow, 1985). After a brief description of the contents of CANCORP, the technical problems related to transcribing recordings of children in Chinese and in romanized characters are addressed. Next, a short assessment is made of the possibilities that CANCORP offers for the study of language development.

  • Isabelle LEROY-TURCAN (Lyon 3)
    La Base ACADEMIE et son hypertexte : les huit éditions du Dictionnaire de l'Académie française (1694-1935) et les données associées à chaque édition
    (The ACADEMIE database and its hypertext: the eight editions of the Dictionnaire de l'Académie française (1694-1935) and the data associated with each edition)
    1999, Vol. IV-1, pp. 47-54

    The ACADEMIE project aims at building an electronic database on the eight editions of the Dictionnaire de l'Académie française (DAF). As these eight editions cover the period from 1694 to 1932-35, this corpus presents interesting problems of diachrony and synchrony, and touches also on issues related to literature and culture. This way the DAF database is enriched by a whole range of hypertext links, allowing a dynamic dialogue between specialists and readers/consultants.

  • Patrick SAINT-DIZIER (CNRS-Toulouse)
    Quelques défis et éléments de méthode pour la construction de ressources lexicales sémantiques
    (Challenges and methods in building lexical semantic tools)
    2002, Vol. VII-1, pp. 39-51

    This paper deals with the construction of lexical semantic resources for predicates, verbs and prepositions. We first raise questions about the theoretical perspectives and the methods to be applied. Next, we describe our resources: alternations, thematic grids and lexical conceptual structure representations. We conclude with some indications on the use of these resources in applications.

  • Emmanuel SCHANG (Orléans)
    CreolData : une base de données lexicales sur les langues créoles
    (CreolData: a lexical database on creole languages)
    2005, Vol. X-1, pp. 65-76

    This paper presents CreolData, a multilingual lexical database concerning the Portuguese-based Creole languages of Africa. In section 2, we describe the goals of the project. Section 3 is devoted to a short description of the languages of the database. We then give an overview of XML and the standards for electronic dictionaries, and focus on the macrostructure and the microstructure (sections 4, 5 and 6). Finally, we outline future developments of this project (section 7).

  • José SOLER (EU)
    Projets lexiques de la Commission européenne
    (Lexical projects of the European Commission)
    1997, Vol. II-1, pp. 79-81
  • Céline VAGUER (Paris X-Nanterre)
    Constitution d'une base de données : les emplois de dans marquant la « coïncidence »
    (Creating a database: the different usages of 'dans' in marking simultaneity)
    2004, Vol. IX-1, pp. 83-97

    Setting up a database from which a corpus and associated information (syntactic, semantic, etc.) are derived is not a natural undertaking in non-computational linguistics. This article sets out to present how such a technique can be exploited within the context of a research project focussing on the French preposition dans.

  • André VALLI (Aix-en-Provence)
    Etiquetage grammatical des corpus de parole : problèmes et perspectives
    (Grammatical labeling of corpora of spoken language: problems and perspectives)
    1999, Vol. IV-2, pp. 113-133

    The use of transcription conventions that attempt to code the specific properties of speech, such as false starts, hesitations, and repetitions, and do not rely on the usual written punctuation, suggests that the grammatical tagging of transcribed oral corpora might be a very difficult undertaking. Developing speech-specific taggers, although desirable, would be a long-term project. In the experiment reported in this article, a spoken corpus was tagged using a system designed for written text, along with some appropriate pre-editing and post-editing programs. Quite unexpectedly, the results for speech were excellent, almost as good as those previously obtained for writing. This discovery allows us to foresee the rapid compilation of large tagged spoken corpora for French.

  • Nathalie VALLÉE (INPG-Grenoble)
    La base de données UPSID : objectif et utilisation
    (The UPSID database: its aims and its use)
    1999, Vol. IV-1, pp. 7-19

    The search for universal tendencies in the languages of the world is undoubtedly a necessary axis for any theoretical perspective in linguistics. We present here UPSID (UCLA Phonological Segment Inventory Database, Maddieson, 1986; Maddieson & Precoda, 1990). This database contains phonological data which are genetically balanced and whose description is harmonized. We implemented it at the ICP to enrich typological research on vowels, diphthongs and consonants. We have analysed UPSID with the help of an original methodology which not only confirms or refines some regularities already noted, but also brings new data to light.

  • Piek VOSSEN (Amsterdam, Netherlands)
    WordNet, EuroWordNet and Global WordNet
    2002, Vol. VII-1, pp. 27-38

    In this article we present the architecture of the WordNet database, which is organised to represent conceptual relations and was initially set up for the English language, as well as its extensions, made under the name of EuroWordNet, to seven other European languages.