Computer-assisted language processing

  • Martine ADDA-DECKER (Paris)
    Corpus for automatic transcription of spoken texts
    2007, Vol. XII-1, pp. 71-84

    This contribution gives an overview of automatic speech recognition research, highlighting the need for corpus development. As recognition systems largely rely on statistical approaches, large amounts of both spoken and written corpora are required. In order to bridge the gap between written and spoken language, speech transcripts need to be produced manually using appropriate tools. The methods and resources accumulated over the years now allow us not only to tackle genuine oral genres, but also to envision large-scale corpus studies that increase our knowledge of spoken language and improve automatic processing.

  • Martine ADDA-DECKER (Paris)
    French ‘liaison’ in casually spoken French, as investigated in a large corpus of casual French speech
    2012, Vol. XVII-1, pp. 113-128

    In this paper, the realisation of French liaison is investigated in a large corpus of casual speech. Considering that casual speech gives rise to a wide range of pronunciation variants and that overall temporal reduction increases, one may hypothesise that French liaison tends to be less productive in this speaking style. We made use of automatic processing, such as automatic speech alignment, to evaluate when liaison is realised in the NCCFr corpus. Realised liaisons were examined and measured for the most frequent liaison consonants (/z/, /n/ and /t/) as a function of liaison site, classified as mandatory, optional or forbidden. The relation between speech rate and liaison realisation is also examined.

  • Valérie BEAUDOUIN (France Télécom R & D)
    Metrics in rules
    2004, Vol. IX-1, pp. 119-137

    Metric and rhythmic aspects are examined in an 80,000-verse corpus analysed with computational linguistics tools. We propose a cumulative experimental approach consisting in building a verse pattern from a series of features (morpho-syntactic, stress, rhyme, etc.). Features may characterise units of different levels (syllables, hemistichs, verses, etc.) and are evidenced by different tools, but all are integrated in a single database. We can thus verify classic metric rules and hypotheses. We also document new regularities, for example in stress patterns, and test new hypotheses about links between features and patterns. Beyond the verification of hypotheses, this empirical approach on a large corpus may lead to the construction of grounded theories.

  • Christian BOITET (Grenoble 1)
    Automated Translation
    2003, Vol. VIII-2, pp. 99-121

    It is important to realise that human translation is difficult and diverse, and that automation is needed not only by end users, but also by translators and interpreters. Automation itself also comes in many forms. After briefly describing computer tools for translators, we concentrate on the linguistic and computational approaches to the automation of translation proper. This survey yields an array of criteria for categorising existing CAT systems, with brief examples of the state of the art. Finally, we present perspectives for future research, development, and dissemination.

  • Christian BOITET (Grenoble 1)
    Corpora for Machine Translation: types, sizes and related problems, in relation to use and system type
    2007, Vol. XII-1, pp. 25-38


  • Philippe BOULA DE MAREÜIL (Paris Sud)
    Diachronic variation in the prosody of French news announcer speech: changes in word initial accent
    2012, Vol. XVII-1, pp. 97-111

    This study addresses prosodic evolution in the French news announcer style, based on acoustic analysis of French audiovisual archives. A 10-hour corpus covering six decades of broadcast news is investigated automatically, focusing on word-initial stress, which may give an impression of emphatic style. Objective measurements suggest that the following features have decreased since the forties: mean pitch, the pitch rise associated with initial stress, and the vowel duration characterising an emphatic initial stress. The onsets of stressed initial syllables have become longer, while speech rate (measured at the phonemic level) has not changed. This puzzling outcome raises interesting questions for research on French prosody, suggesting that the durational correlates of word-initial stress in the French news announcer style have changed over time.

  • Antonia CRISTINOI-BURSUC (Orléans)
    Gender errors in automatic translation between English and French: typology, linguistic causes and solutions
    2009, Vol. XIV-1, pp. 93-107

    By means of the notions of behavioural classes, marking and morphosyntactic markers, this paper shows that all the translation (or machine translation) problems that arise when translating gender from French into English or vice versa can be predicted a priori at the lexical level, for all the linguistic items concerned. It also shows that systematic solutions to these problems can be found and implemented. The approach defended here for French and English can be applied to other languages or language pairs, and to other linguistic categories, and could thus contribute to the improvement of machine translation systems.

  • Nathalie GARRIC (Tours)
    Disambiguating proper nouns by use of local grammars
    2000, Vol. V-2, pp. 85-100

    This paper is part of the PROLEX project on the automatic processing of proper nouns. Our objective is not only to identify, with the help of the computer, the various occurrences of the definite proper noun (modified or not), but also to tag them with the relevant type of interpretation: referential, denominative, model, metaphoric or split. After elaborating a typology of the various uses of the definite proper noun, we try to extract the formal and lexical indications that allow the referential and semantic functioning of the proper noun to be disambiguated. After isolating these differential units (e.g. determiners, adjectives, predicates of existence), we build local grammars intended for automatic recognition.

  • Nathalie GASIGLIA (Lille 3)
    The co-operation of Cordial Analyseur and Unitex for optimising corpus extractions
    2004, Vol. IX-1, pp. 45-62

    A well-delimited linguistic study, a semantico-syntactic analysis of the use of the verbs donner 'to give' and passer 'to pass' in the language of soccer, provides a useful framework for reflecting on the documentary resources that can form an instructive, concentrated electronic corpus, and for introducing the notion of a 'thematic corpus of high efficiency'. To explore the corpus thus constructed, two tools that generate concordances and provide syntactic analyses, Cordial Analyseur and Unitex, are put to the test. The description of their shared functionalities, their specificities and their weak points led me to an original proposal: making the two tools work together, so that their strategically exploited complementarity allows searches of some complexity, with proven analysis reliability and the capacity to mark every identified element in the generated concordances, tagged in XML.

  • Gaston GROSS (Paris 13)
    Automatic processing of linguistic 'domains'
    1998, Vol. III-2, pp. 47-56

    The aim of this article is to present a practical implementation of the notion of a "domain" for the automatic processing of domain information and its application to information retrieval on the web. After a discussion of the problems that arise when assigning a text to a given domain, we define a domain as a set of hyperclasses (such as human, concrete, locative, action, etc.) and of object classes, which correspond to the structure of a simple sentence into predicates and arguments. This semantic-syntactic information is already encoded in general and technical language dictionaries, in which we distinguish between simple and compound words. On the basis of these dictionaries we tag web pages. A first application has enabled the search engine AltaVista to identify 29 languages. We have also found that the identification of compound words allows queries that lead to more precise and faster results than search algorithms that work exclusively with the simple words of the compound expression. Information retrieval can thus be considerably improved by taking compound words into account. An application of this research is the retrieval of texts in medical language that we are carrying out in the framework of the European Commission research project Webling.
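
    The compound-word tagging described above can be sketched as a simple longest-match pass over tokenized text. The dictionary below is a toy stand-in for the article's general and technical language dictionaries, not its actual data.

```python
# Minimal sketch (assumed data): a dictionary of compound terms, used to tag
# a token stream by longest match so that queries hit whole compounds rather
# than their component words.
COMPOUNDS = {("heart", "attack"), ("blood", "pressure"), ("machine", "translation")}
MAX_LEN = max(len(c) for c in COMPOUNDS)

def tag_compounds(tokens):
    """Greedy longest-match tagging: emit compounds as single units."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest possible compound first, down to length 2.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in COMPOUNDS:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])  # no compound starts here: keep simple word
            i += 1
    return out

print(tag_compounds("high blood pressure causes heart attack".split()))
# → ['high', 'blood_pressure', 'causes', 'heart_attack']
```

    A query for "blood_pressure" then retrieves only pages where the compound occurs as a unit, which is the precision gain the article reports over matching the simple words separately.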

  • Benoît HABERT (Paris X-Nanterre)
    Tooling up linguistics: from borrowing techniques to a meeting of knowledge
    2004, Vol. IX-1, pp. 5-24

    As such, linguistic research does not imply specific devices. However, linguistic descriptions and models would benefit from relying more often on NLP (Natural Language Processing) tools and resources and on computer science methods. The possible outcome depends on the chosen type of interaction between NLP, computer science and linguistics. A synergy between paradigms and methodologies would be more fruitful than a mere import of techniques.

  • Serge HEIDEN (ENS LSH Lyon)
    Electronic aids in studying medieval texts: methods and tools
    2004, Vol. IX-1, pp. 99-118

    Two approaches to the development of medieval text corpora can be distinguished among the projects carried out over the past few decades. The first consists of digitizing modern critical editions; the second is concerned with producing precise diplomatic transcriptions of manuscripts, often directly linked to photographs of the originals. These approaches are in fact complementary rather than contradictory, as they allow scholars to choose between the quantity (representativeness) and the quality (accuracy and richness) of the data, depending on the goals of their research. For both types of corpora, the challenges of XML-TEI encoding are considered in relation to the tools used to process and analyze them. Many methodological problems that arise from creating and processing medieval text corpora also concern other types of linguistic corpora.

  • Christine JACQUET-PFAU (Collège de France)
    Spelling and grammar checkers: which tool(s) are suitable for which author?
    2001, Vol. VI-2, pp. 81-94

    This article questions users' assessment of so-called correction tools for spelling and grammar checking. Several criteria should be considered, namely: (a) how correctors, whether integrated or autonomous, operate; (b) what their particular configuration should be in each case; and (c) which correction constraints should be precisely defined in connection with the user's checking process. Our purpose is firstly to show that the use of the word "error" in this context needs to be clarified; secondly to examine the main characteristics of these 'correcticiels' (correction software); and thirdly to propose a typology of users. Finally, we make a few suggestions as to how these tools can be used in the acquisition of the French language.

  • Hendrik J. KOCKAERT (Lessius)
    A tool for managing terminology in legal translation activities in Belgium: how it works and what it can do
    2011, Vol. XVI-1, pp. 93-104

    The Department of Applied Language Studies at Lessius and the Research Unit of Quantitative and Variational Linguistics at K.U. Leuven were invited by the translation department of the Belgian Ministry of Justice to develop a Terminology Management System (TMS) of legal phraseology and terminology, allowing translators to work with correct, coherent and expert-revised phraseology and terminology in the three national languages. This paper first investigates how terminology management has been carried out in the translation departments of the federal public services of justice in Belgium. Based on this survey, it proposes a TMS tool built on a new concept of phraseological terminology. To reach this goal, an extraction method for phraseological terminology based on usage-based models of language will serve as the basis of a customised experimental analysis method, allowing us to design a road map for developing terminology specifically engineered for the legal translation LSP.

  • Thomas LEBARBÉ (Caen)
    TAPAS: Treatment and Analysis in Syntax by Augmented Perception
    2000, Vol. V-2, pp. 71-83

    In this article, we present a novel approach to syntactic parsing. As opposed to usual methods that treat syntactic parsing as a series of processes, we propose an architecture based on hybrid agents whose task is robust deep syntactic parsing. After a short summary of the research on which our work is based, we present the theoretical functioning of the architecture by means of an example. This then allows us to describe the APA architecture, developed by Girault, that we have used in this joint research. Finally, in conclusion, we present some perspectives for implementation.

  • Sarah LEROY (Paris X-Nanterre)
    Pattern-based extraction: two-way traffic between linguistic analysis and automated identification
    2004, Vol. IX-1, pp. 25-43

    We present an automated identification of proper-name antonomasia in tagged texts. First, we compare manual and automatic identification, describing the system's workings as well as the methods and tools we used; we point out that the automated process is more reliable. After showing how the capabilities and limits of automated location can influence linguistic work, we compare this rather old (2000) work with new tools now available to linguists, e.g. the ability to query a subset of tagged texts in the Frantext database.

  • Denise MALRIEU (CNRS-Paris)
    'Genres' and morphosyntactic variations
    2000, Vol. V-2, pp. 101-120

    A differential statistical analysis of 2,600 complete texts from a French-language textual database, parsed and tagged by the parser CORDIAL, enabled us to test and exploit the notion of "textual genre". A prior manual classification of the texts enabled us to combine deductive and inductive approaches to test for significant differences between discourses, generic fields and genres, attested on 250 morphosyntactic variables. The univariate analysis shows more, and stronger, differences between discourses and generic fields than between narrative genres. Ascending hierarchical classification confirms the differences between discourses and generic fields (legal vs. others; theatre and poetry vs. narrative genres), but establishes mixed classes at the bottom of the hierarchy, the detective novel contrasting most with the other narrative genres. These results confirm the interest of the notion of genre for textual linguistic analysis, strengthen Hjelmslev's hypothesis that syntax belongs to linguistic content, and reveal hitherto unnoticed solidarities between the global text level and the local word level.

  • François MANIEZ (Lyon 2)
    Automatic retrieval of intentionally modified proverbs in the American press
    2000, Vol. V-2, pp. 19-32

    Reference to a well-known proverb or phrase by altering one of its components is a widespread phenomenon in the British and American press. Since such modifications can impede a non-native speaker's understanding of newspaper or magazine articles, a system that could identify them and refer learners of English as a second language to the original wording of such expressions might be useful in designing an on-line comprehension assistant. Using a database of 10,500 titles from an American news magazine, we analyze the various types of modification that come into play in the use of shared cultural references. By comparing our database with 800 English proverbs, we test various ways in which such modifications can be detected automatically.
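
    One plausible way to flag a headline as a modified proverb, sketched here with an illustrative proverb list and threshold rather than the authors' actual database, is to score each headline against every proverb by string similarity and keep the best match above a cut-off.

```python
# Hedged sketch: score a headline against a (toy) proverb list and return the
# closest proverb if the similarity clears a threshold. The proverbs and the
# 0.6 cut-off are illustrative assumptions, not the article's data.
import difflib

PROVERBS = [
    "a bird in the hand is worth two in the bush",
    "too many cooks spoil the broth",
    "the early bird catches the worm",
]

def find_source_proverb(headline, threshold=0.6):
    """Return the closest proverb if the headline looks like a variation of it."""
    best, best_ratio = None, 0.0
    for p in PROVERBS:
        ratio = difflib.SequenceMatcher(None, headline.lower(), p).ratio()
        if ratio > best_ratio:
            best, best_ratio = p, ratio
    return best if best_ratio >= threshold else None

print(find_source_proverb("Too many cooks spoil the budget"))
# → 'too many cooks spoil the broth'
```

    A learner-facing assistant could then display the recovered original wording next to the altered headline; the real detection methods tested in the article may of course differ.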

  • Taoufik MASSOUSSI (Paris 13)
    Automated processing of metonymies
    2009, Vol. XIV-2, pp. 43-56

    Metonymy plays an important though often neglected role in lexicalisation, both in general language and in language for special purposes. This article shows how principles set out to account automatically for metonymy in general language are directly applicable to LSP.

  • Augusta MELA (Montpellier 3)
    Linguists and NLP specialists can work together: the location and analysis of glosses
    2004, Vol. IX-1, pp. 63-82

    This paper relates to a collective linguistic research project on the word and its gloss. Just like definitions, glosses capture 'the spoken experience of meaning'. In French texts, this metalinguistic activity surfaces in words such as c'est-à-dire, ou, signifier. These signs can clarify the nature of the semantic relationship between two words: specification with au sens, equivalence with ou and c'est-à-dire, naming with dit and baptisé, hyponymy with en particulier and comme, hyperonymy with et/ou autre(s), etc. Glosses can be located automatically thanks to both their markers and the features of their configurations. This paper describes the implementation of an automatic retrieval tool for 'ou' glosses such as 'un magazine électronique, ou webzine', in a data-processing environment 'for linguists', namely the textual base Frantext and its Stella query-language interpreter.
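
    The marker-based location such glosses rely on can be illustrated with a deliberately crude regular expression over raw text; the pattern and example below are mine, not the paper's Stella queries, and a real system would exploit tagged text to delimit the glossed noun phrase.

```python
# Illustrative sketch: locate candidate 'ou' glosses of the form "X, ou Y",
# as in "un magazine électronique, ou webzine". Only the word immediately
# before the marker is captured; delimiting the full noun phrase would
# require the part-of-speech information the paper's environment provides.
import re

OU_GLOSS = re.compile(r"(\w+),\s+ou\s+(\w+)")

def find_ou_glosses(text):
    """Return (word before marker, gloss) pairs for 'X, ou Y' configurations."""
    return [m.groups() for m in OU_GLOSS.finditer(text)]

print(find_ou_glosses("Il lit un magazine électronique, ou webzine, chaque semaine."))
# → [('électronique', 'webzine')]
```

    Note that requiring the comma before ou already filters out plain disjunctions such as "Pierre ou Paul", which is the kind of configurational feature the paper exploits.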

  • Sylvie NORMAND (CNRS-Rouen)
    Analysis of the adjectives of a medical corpus by means of automatic language processing
    2000, Vol. V-2, pp. 151-160

    Divergent descriptions of histopathologic images induce inter- and intra-observer variability in diagnoses based on the observation of breast tumour images. The lack of reproducibility in identifying specific morphological features is partly due to varying levels of expertise among pathologists and to differences in the subjective analysis and comprehension of pathological images. As linguists and developers of Natural Language Processing (NLP) systems, we started a collaboration with the Medical Informatics Department at the Broussais Hospital in order to explore a new way of acquiring corpus-based medical glossaries. We focused our analysis on adjectives because they are the main linguistic category involved in the evaluation process. The first results of this study show the relevance of a corpus-based approach for coping with the "subjective" interpretations given by pathologists when they analyse microscopic images.

    Automatic processing of medical terminology
    2001, Vol. VI-2, pp. 47-62

    Specialized texts are characterized by a specific terminology. Medicine holds a particular position in this respect, both because of the impressive number of terms involved and of the amount of international effort devoted to build normalized terminologies. These terminologies play a key role in medical information and knowledge processing. A large part of the work performed on medical language processing is therefore centered on these terminologies, either as information targets or as knowledge sources. We present here, through examples drawn from our own work, various aspects of medical terminology processing.