FreEM project

NLP resources and applications for Early Modern French

  • Lemmatised corpus: 7.26M+ tokens
  • NER corpus: 4.85M+ tokens
  • POS+morph corpus: 163k+ tokens
  • Normalised text: 25M+ lines

About

The FreEM project is dedicated to Early Modern French (16th–18th centuries). It provides corpora and NLP models for several core tasks, including lemmatisation, part-of-speech tagging, linguistic normalisation, and named entity recognition.

The project brings together specialists in NLP and philology in order to deliver accurate linguistic modelling, large-scale annotated corpora, state-of-the-art models, and linguistic studies of Early Modern French.

All corpora are freely available.

NLP tasks

Lemmas, Parts of speech, Morphology

The FreEMLPM ("Lemmas, Parts of speech, Morphology") dataset provides training data for:

  • Lemmatisation
  • Part-of-speech tagging
  • Full morphological information (tense, mood, gender, etc.)

Training data comes from several projects:

  • SETAF (16th c.)
  • PRESTO (16th-18th c.)
  • CORNMOL (17th c.)
  • FRANTEXT (19-20th c.)
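Training data of this kind is typically distributed with one token per line and tab-separated annotation columns. The following is a minimal sketch of reading such a file in Python; the column layout and the tag and feature names are illustrative assumptions, not the project's actual schema.

```python
# Sketch: reading token-level lemma/POS/morphology annotations in a
# tab-separated, one-token-per-line format (illustrative schema).
import csv
import io

SAMPLE = """\
token\tlemma\tPOS\tmorph
Les\tle\tDETdef\tNOMB.=p|GENRE=m
roys\troi\tNOMcom\tNOMB.=p|GENRE=m
"""

def read_annotations(text):
    """Parse one annotated token per row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        # Morphological features are packed as KEY=value pairs joined by "|".
        feats = (dict(f.split("=", 1) for f in row["morph"].split("|"))
                 if row["morph"] else {})
        rows.append({"token": row["token"], "lemma": row["lemma"],
                     "pos": row["POS"], "morph": feats})
    return rows

tokens = read_annotations(SAMPLE)
print(tokens[1]["lemma"])  # prints: roi
```

A loader along these lines is enough to feed the three annotation layers (lemma, POS, morphology) into a tagger's training pipeline.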

Normalisation

Text normalisation is crucial for processing noisy or non-orthographic text. Our system handles various normalisation challenges including spelling variants and abbreviations.

FreEMnorm consists of two parallel corpora covering a range of text genres from the 16th to the 20th century, at two levels of normalisation:

  • Semi-diplomatic normalisation
  • Full normalisation

Training data comes from several projects:

  • SETAF (16th c.)
  • FreEM original data (16-20th c.)
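A parallel normalisation corpus pairs each original line with its normalised counterpart. Purely as an illustration (this is not the project's own tooling), the differing "variant zones" between the two versions of a line can be extracted with a standard sequence alignment, here Python's difflib; the example sentence is invented.

```python
# Sketch: extracting variant zones from one parallel line pair
# (original vs. fully normalised) via word-level alignment.
import difflib

original   = "Il eſtoit une fois un roy & une royne"
normalised = "Il était une fois un roi et une reine"

def variant_zones(src, tgt):
    """Return (original, normalised) word-group pairs that differ."""
    a, b = src.split(), tgt.split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    zones = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            zones.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return zones

print(variant_zones(original, normalised))
```

Categorising and counting such zones across a corpus is the kind of quantitative study the parallel data makes possible.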

Named Entity Recognition

Our NER system identifies and classifies named entities in text with high accuracy. It recognises persons, organisations, locations, and other entity types across multiple domains and centuries.

  • Multi-class entity recognition
  • Nested entity detection

Training data comes from this project:

  • PRESTO (16th-18th c.)
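Nested entity detection means that one entity span can sit inside another. The sketch below shows one way such annotations can be represented as character spans; the labels and example text are invented for illustration and are not the corpus's actual tagset.

```python
# Sketch: multi-class, nested entity annotations as character spans
# (illustrative labels, not the corpus's actual tagset).
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str

text = "Louis XIV, roi de France, entre dans Paris."
entities = [
    Entity(0, 9, "PERSON"),       # "Louis XIV"
    Entity(11, 24, "FUNCTION"),   # "roi de France"
    Entity(18, 24, "LOCATION"),   # "France", nested inside the function
    Entity(37, 42, "LOCATION"),   # "Paris"
]

def nested_pairs(ents):
    """Return (inner, outer) pairs where one span lies inside another."""
    return [(a, b) for a in ents for b in ents
            if a is not b and b.start <= a.start and a.end <= b.end]

for inner, outer in nested_pairs(entities):
    print(f"{text[inner.start:inner.end]!r} inside {text[outer.start:outer.end]!r}")
```

Offset-based spans make nesting explicit, whereas flat BIO tagging cannot represent "France" and "roi de France" at the same time.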

Team

Principal Investigator

Simon Gabay

Philology, linguistics

University of Geneva

Participants

Lucence Ing

Philology

Inria Paris

Sonia Solfrini

Philology

University of Geneva

Thibault Clérice

NLP

Inria Paris

Former participants

Pedro Ortiz Suárez

NLP

Inria Paris

Rachel Bawden

NLP

Inria Paris

Benoît Sagot

NLP

Inria Paris

Philippe Gambette

Computer science

Université Gustave Eiffel

Alexandre Bartz

Research Engineer

Sorbonne Université

Publications

The birth of French orthography. A computational analysis of French spelling systems in diachrony
Simon Gabay, Thibault Clérice
Computational Humanities Research Conference (CHR), Dec 2024, Aarhus, Denmark.

The 17th c. is crucial for the French language: it sees the creation of a strict orthographic norm that largely persists to this day. Despite its significance, the history of spelling systems remains an overlooked area in linguistics, for two reasons. On the one hand, spelling change is made up of microchanges, which require a quantitative approach; on the other hand, no corpus is available, because editors have intervened in almost all the texts already accessible. In this paper, we therefore propose a new corpus allowing such a study, as well as the extraction and analysis tools necessary for our research. By comparing the text extracted with OCR against a version automatically aligned with contemporary French spelling, we extract variant zones, categorise these variants, and study their frequency to trace (ortho)graphic change during the 17th century.

Ancien ou moderne ? Pistes computationnelles pour l'analyse graphématique des textes écrits au XVIIe siècle
Simon Gabay, Philippe Gambette, Rachel Bawden, Benoît Sagot
Linx, 85, 2023.

The use of contemporary spelling rather than old graphic systems in the vast majority of current editions of 17th century French texts has the unfortunate effect of masking their graphematic richness. Such valuable information has remained concealed and therefore under-exploited, despite the potential it holds in terms of analysis. By favouring a practical corpus-based approach, rather than a theoretical one, and by relying on a recategorisation of the various competing systems at that time in French scriptae, we propose the foundations of a scriptometric study of the classical language, focusing on the analysis of specific documents, both manuscripts and old prints.

A Data-driven Approach to Named Entity Recognition for Early Modern French
Pedro Ortiz Suarez, Simon Gabay
29th International Conference on Computational Linguistics (COLING), Oct 2022, Gyeongju, South Korea.

Named entity recognition has become an increasingly useful tool for digital humanities research, especially when it comes to historical texts. However, historical texts pose a wide range of challenges to both named entity recognition and natural language processing in general that are still difficult to address even with modern neural methods. In this article we focus on named entity recognition for historical French, and in particular for Early Modern French (16th–18th c.), i.e. Ancien Régime French. Instead of developing a specialised architecture to tackle the particularities of this state of the language, we opt for a data-driven approach, developing a new corpus with fine-grained entity annotation covering three centuries of literature corresponding to the early modern period; we annotate as much data as possible, producing a corpus many times bigger than the most popular NER evaluation corpora for both Contemporary English and French. We then fine-tune existing state-of-the-art architectures for Early Modern and Contemporary French, obtaining results on par with those of the current state-of-the-art NER systems for Contemporary English. Both the corpus and the fine-tuned models are released.

Le changement linguistique au XVIIe siècle: nouvelles approches scriptométriques
Simon Gabay, Rachel Bawden, Philippe Gambette, Jonathan Poinhos, Eleni Kogkitsidou, Benoît Sagot
8e Congrès Mondial de Linguistique Française (CMLF), Jul 2022, Orléans, France.

The end of the 17th c. remains a blind spot of research on the French spelling system, despite the importance of this period, during which a strict norm, still (more or less) in place today, was created and imposed. Focusing on a practical rather than a theoretical approach, we propose to lay the foundations for a computational scriptometric study of Early Modern French and analyse the evolution of the spelling system over the 17th c. To do so, we measure and evaluate the distance between the early modern and the contemporary versions of the language, using two automatic normalisers: one rule-based and one neural.

Automatic Normalisation of Early Modern French
Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot, Simon Gabay
13th Language Resources and Evaluation Conference (LREC), Jun 2022, Marseille, France.

Spelling normalisation is a useful step in the study and analysis of historical language texts, whether for manual analysis by experts or automatic analysis using downstream natural language processing (NLP) tools. Not only does it help homogenise the variable spelling that often exists in historical texts, but it also facilitates the use of off-the-shelf contemporary NLP tools, if contemporary spelling conventions are used for normalisation. We present FreEMnorm, a new benchmark for the normalisation of Early Modern French (from the 17th century) into contemporary French, and provide a thorough comparison of three normalisation methods: ABA, an alignment-based approach, and MT approaches (both statistical and neural), including extensive parameter searching, which is often missing from the normalisation literature.

From FreEM to D’AlemBERT: a Large Corpus and a Language Model for Early Modern French
Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué, Rachel Bawden, Philippe Gambette, Benoît Sagot
13th Language Resources and Evaluation Conference (LREC), Jun 2022, Marseille, France.

Language models for historical states of language are becoming increasingly important to allow the optimal digitisation and analysis of old textual sources. Because these historical states are at the same time more complex to process and more scarce in the corpora available, specific efforts are necessary to train natural language processing (NLP) tools adapted to the data. In this paper, we present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries). We present the FreEMmax corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEMmax. We evaluate the usefulness of D'AlemBERT by fine-tuning it on a part-of-speech tagging task, outperforming previous work on the test set. Importantly, we find evidence for the transfer learning capacity of the language model, since its performance on lesser-resourced time periods appears to have been boosted by the more resourced ones. We release D'AlemBERT and the open-sourced subpart of the FreEMmax corpus.

Le projet FREEM : ressources, outils et enjeux pour l’étude du français d’Ancien Régime
Simon Gabay, Pedro Ortiz Suarez, Rachel Bawden, Alexandre Bartz, Philippe Gambette, Benoît Sagot
Traitement Automatique des Langues Naturelles (TALN), Jun 2022, Avignon, France.

Despite their undoubted quality, the resources and tools available for the analysis of Ancien Régime French are no longer able to meet the challenges of research in linguistics and literature for this period. After having precisely defined the chronological framework, we present the corpora made available and the results obtained with them for several NLP tasks, fundamental to the study of language and literature.

Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre
Jean-Baptiste Camps, Simon Gabay, Paul Fièvre, Thibault Clérice, Florian Cafiero
Journal of Data Mining & Digital Humanities, jdmdh:6485, 2021.

This paper describes the process of building an annotated corpus and training models for classical French literature, with a focus on theatre, and particularly comedies in verse. It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps [2019]. The use of a recent neural lemmatiser and a CRF tagger makes it possible to achieve accuracies beyond the current state of the art on the in-domain test, and proves robust in out-of-domain tests, i.e. up to 20th c. novels.

Expanding the content model of annotationBlock
Alexandre Bartz, Juliette Janès, Laurent Romary, Philippe Gambette, Rachel Bawden, Pedro Ortiz Suarez, Benoît Sagot, Simon Gabay
International Workshop on Digital Humanities 2023, Oct 2021, Virtual, United States

Linguistic annotation benefits from ISO specifications such as the Morphosyntactic Annotation Framework (MAF), whose recommendations have been added to the TEI P5. Relying on feature structures, these recommendations have however not been fully integrated into the TEI stand-off annotation model and, for instance, it is currently impossible to encode feature structures within the listAnnotation and annotationBlock elements.

Standardizing linguistic data: method and tools for annotating (pre-orthographic) French
Simon Gabay, Thibault Clérice, Jean-Baptiste Camps, Jean-Baptiste Tanguy, Matthias Gille-Levenson
Proceedings of the 2nd International Conference on Digital Tools & Uses Congress, Oct 2020, Tunisia

With the development of big corpora from various periods, it becomes crucial to standardise linguistic annotation (e.g. lemmas, POS tags, morphological annotation) to increase the interoperability of the data produced, despite diachronic variation. In the present paper, we describe both methodologically (by proposing annotation principles) and technically (by creating the required training data and the relevant models) the production of a linguistic tagger for (early) modern French (16th–18th c.), taking into account as much as possible existing standards for contemporary and, especially, medieval French.

Traduction automatique pour la normalisation du français du XVIIe siècle
Simon Gabay, Loïc Barrault
Traitement Automatique des Langues Naturelles (TALN), Jun 2020, Nancy, France.

The study of older states of language faces a double problem: on the one hand, the distance from contemporary spelling prevents scholars from using standard NLP solutions; on the other hand, the instability of the scriptae complicates the training of solutions directly on the original source text. Returning to this problem from a DH perspective, we start with the philological reasoning behind the creation of the training corpus, then use traditional NLP methods to compare two machine translation systems (statistical and neural) and offer a functional tool for the normalisation of 17th c. French answering the needs of philologists.

A Workflow For On The Fly Normalisation Of 17th c. French
Simon Gabay, Marine Riguet, Loïc Barrault
Digital Humanities (DH), Jul 2019, Utrecht, the Netherlands

While neural machine translation (NMT) has proven to be the most efficient solution for normalising pre-orthographic texts, the amount of training data required remains an obstacle. In this paper, we address for the first time the case of normalising modern French and propose a workflow to create the parallel corpus that an NMT solution requires.

Resources

  • FreEMLPM

    Corpora annotated with lemmas, POS, morphology, and named entities.

  • FreEMnorm

    Parallel corpus for linguistic normalisation.

  • FreEMmax (Open access)

    Corpus of historical texts for LLM training. A non-open version exists but cannot be distributed.

  • D'AlemBERT

    A RoBERTa-base model pre-trained on the FreEMmax corpus for Early Modern French. The model is cased and was trained on a mix of normalised and unnormalised data.

Contact

Email: simon.gabay@unige.ch

Address: Humanités numériques, Université de Genève, Geneva, Switzerland.

For inquiries about collaboration, data access, or other questions, please reach out via email.