Oral corpora

  • Martine ADDA-DECKER (Paris)
    French ‘liaison’ in casually spoken French, as investigated in a large corpus of casual French speech
    2012, Vol. XVII-1, pp. 113-128

    In this paper, the realisation of French liaison is investigated in a large corpus of casual speech. Considering that casual speech gives rise to a large range of pronunciation variants and that overall temporal reduction increases, one may hypothesize that French liaison tends to be less productive in this speaking style. We used automatic processing, such as automatic speech alignment, to evaluate when liaison is realized in the NCCFr corpus. Realized liaisons were examined and measured for the most frequent liaison consonants (/z/, /n/ and /t/) as a function of liaison sites classified as mandatory, optional or forbidden. The relation between speech rate and liaison realization is also examined.

  • Martine ADDA-DECKER (Paris)
    Corpus for automatic transcription of spoken texts
    2007, Vol. XII-1, pp. 71-84

    This contribution gives an overview of automatic speech recognition research, highlighting the need for corpus development. As recognition systems largely rely on statistical approaches, large amounts of both spoken and written corpora are required. In order to fill the gap between written and spoken language, speech transcripts need to be produced manually using appropriate tools. Methods and resources accumulated over the years now make it possible not only to tackle genuine oral genres, but also to envision large-scale corpus studies that increase our knowledge of spoken language and improve automatic processing.

  • Olivier BAUDE (Orléans)
    Legal and ethical aspects of conserving and diffusing corpora of spoken texts
    2007, Vol. XII-1, pp. 85-97

    The digitization of spoken language corpora opens up broad prospects for linguistics. However, the archiving and exploitation of these spoken corpora raise new ethical and legal problems that the scientific community must take into account. This article presents the results of an interdisciplinary working group which wrote a guide to good practices for the constitution, exploitation, archiving and diffusion of spoken language corpora.

  • Philippe BOULA DE MAREÜIL (Paris Sud)
    Diachronic variation in the prosody of French news announcer speech: changes in word initial accent
    2012, Vol. XVII-1, pp. 97-111

    This study addresses prosodic evolution in the French news announcer style, based on acoustic analysis of French audiovisual archives. A 10-hour corpus covering six decades of broadcast news is investigated automatically, focusing on word-initial stress, which may give an impression of emphatic style. Objective measurements suggest that the following features have decreased since the forties: mean pitch, pitch rise associated with initial stress, and vowel duration characterising an emphatic initial stress. The onsets of stressed initial syllables have become longer while speech rate (measured at the phonemic level) has not changed. This puzzling outcome raises interesting questions for research on French prosody, suggesting that the durational correlates of word-initial stress have changed over time in the French news announcer style.

  • Paul CAPPEAU (Poitiers)
    The sociolinguistic exploitation of large corpora. Key-word and stone of wisdom
    2007, Vol. XII-1, pp. 99-110

    The desire to make use of large collections of oral data is nowadays widely shared by linguists. At a time when such tools are becoming increasingly available for French, it is important to remain attentive to all the factors that guarantee reliability in the different stages of obtaining data: clarification of the term ‘corpus’; reflection on approaches to the field and to orality, and on representativeness (both in terms of genres and numbers of speakers); data elicitation practices and transcription.

  • Sylvain DETEY (Tokyo, Japan)
    Learners of French and pronunciation norms in the FL: what input do we need to reach what results
    2012, Vol. XVII-1, pp. 81-96

    In the field of French language education, developments in corpus linguistics have spurred a reassessment of the importance of pedagogical norms and linguistic variation in teaching curricula. In this article, we focus on the phonetic-phonological dimension of the teaching/learning process and, after a brief look at pronunciation models in French, we examine the impact of sociolinguistic descriptions of varieties of French on pronunciation education. Referring to the notions of 'errors' and 'accents' among non-native speakers, we point out the need for broad and systematic corpus-based studies, comparable with native databases. Finally, we introduce the InterPhonologie du français contemporain project and look at the notion of non-native norms from both theoretical and applied perspectives.

  • Jacques DURAND (Toulouse)
    Phonology of Contemporary English: usage, varieties and structures
    2012, Vol. XVII-1, pp. 25-37

    The PAC project (The Phonology of Contemporary English: usage, varieties, structure) aims at giving a better picture of spoken English in its unity and its geographical, social and stylistic diversity. Based on Labovian methods, the project seeks to describe both rhotic and non-rhotic accents of English, from traditional standards to more recent postcolonial varieties. This large corpus enables researchers to analyse and compare intervarietal features such as rhoticity as well as more specific phenomena such as vocalic length in Australian English or variable rhoticity in New Zealand English. Today LVTI, a collaborative project aiming at an interdisciplinary sociolinguistic survey of great urban centres such as Manchester and Toulouse, is being set up following the classical PAC/PFC protocol.

  • Julien EYCHENNE (Groningen, Netherlands)
    The Phonology of Contemporary French program: results and perspective
    2012, Vol. XVII-1, pp. 7-24

    This paper offers an overview of the work that has been done within the Phonologie du français contemporain : usages, variétés, structure (PFC) research programme. We first critically assess the relation between phonological research and data. We then move on to describe PFC's methodology and the coding schemes that have been devised for the analysis of schwa and liaison. We finish off by showing how the PFC programme makes a valuable contribution to our understanding of the phonology of French, by widening the scope and breadth of empirical descriptions and by offering new insights into theoretical problems such as the analysis of liaison or the role of usage frequency in grammar.

  • Françoise GADET (Paris Ouest)
    A large corpus of spoken French: CIEL-F. Epistemological choices and empirical outcome
    2012, Vol. XVII-1, pp. 39-54

    This article presents the structure of the Corpus International Ecologique de la Langue Française, an extensive corpus of spoken French that will soon be available on the Internet, from both an epistemological and an empirical perspective. Explanations are given with regard to the ideas that guided the data collection (ecological approach, comparability of the different areas of the Francophonie and communication situations) and to the choices made ("communicative spaces" and "activity types") with a view to relevant analyses in various research fields (variation, interaction, multimodality, French in contact, oral syntax). An attempt is also made to fill existing gaps in the current corpus. The article further addresses the issue of building up a network of experts, problems that had to be solved during fieldwork in the different areas, and questions concerning standardisation, archiving and publication of the collected data (audio and video recordings, transcriptions, metadata). Several examples are then presented for comparative analyses.

  • Michèle OLIVIÉRI (Université Nice Sophia Antipolis)
    All about the Thesaurus occitan
    2017, Vol. XXII-1, pp. 89-102

    The Thesaurus occitan (THESOC) is a multimedia database that aims at assembling all the dialectal data gathered in an oral form throughout the Occitan-speaking region. It has two parts: one deals with the lexicon, and the other, composed of sentences, is devoted to syntax. Different tools and functionalities are associated with the data in order to allow researchers to constitute bodies of work and to formulate and verify hypotheses. This article presents the most recently upgraded form of the THESOC, its modalities of construction, and its methods of consultation.

  • Shana POPLACK (Ottawa, Canada)
    The corpus of spoken French of Ottawa-Hull
    1996, Vol. I-2, pp. 95-97

  • Louise PÉRONNET (Moncton, Canada)
    Linguistic research on French spoken in Acadia
    1996, Vol. I-2, pp. 98-99

  • André VALLI (Aix-en-Provence)
    Grammatical labeling of corpora of spoken language: problems and perspectives
    1999, Vol. IV-2, pp. 113-133

    The use of transcription conventions that attempt to code the specific properties of speech, such as false starts, hesitations, and repetitions, and do not rely on the usual written punctuation, suggests that the grammatical tagging of transcribed oral corpora might be a very difficult undertaking. Developing speech-specific taggers, although desirable, would be a long-term project. In the experiment reported in this article, a spoken corpus was tagged using a system designed for written text, along with some appropriate pre-editing and post-editing programs. Quite unexpectedly, the results for speech were excellent, almost as good as those previously obtained for writing. This discovery allows us to foresee the rapid compilation of large tagged spoken corpora for French.