FreEM_LPM

FreEM_LPM is a set of annotated corpora with lemmas, POS tags, morphology and named entities (cf. here for NER).

Online service

Data (GitHub and Zenodo):

Gabay, Simon, Thibault Clérice, Matthias Gille Levenson, Jean-Baptiste Camps, Jean-Baptiste Tanguy, FreEM-corpora/FreEMlpm: FreEM LPM (Lemma, POS-tags, Morphology) corpus (4.0.1), GitHub, 2022, https://github.com/FreEM-corpora/FreEMlpm.
Gabay, Simon, Thibault Clérice, Matthias Gille Levenson, Jean-Baptiste Camps, Jean-Baptiste Tanguy, FreEM-corpora/FreEMlpm: FreEM LPM (Lemma, POS-tags, Morphology) corpus (4.0.1), Zenodo, 2022, https://doi.org/10.5281/zenodo.6481300.

Papers:

Jean-Baptiste Camps, Simon Gabay, Paul Fièvre, Thibault Clérice, Florian Cafiero. Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre. Journal of Data Mining and Digital Humanities, Episciences.org, 2021, ⟨10.46298/jdmdh.6485⟩ and ⟨halshs-02591388v2⟩.
Simon Gabay, Thibault Clérice, Jean-Baptiste Camps, Jean-Baptiste Tanguy, Matthias Gille-Levenson, "Standardizing linguistic data: method and tools for annotating (pre-orthographic) French", Proceedings of the 2nd International Digital Tools & Uses Congress (DTUC '20), Oct 2020, Hammamet, Tunisia. ⟨10.1145/3423603.3423996⟩ and ⟨hal-03018381⟩.

Four different corpora are used:

The CornMol corpus consists of 41 comedies written in the 17th c., carefully sampled and proofread. Lemmas and POS tags have been manually corrected; morphology is reliable but not guaranteed.
The Frantext Démonstration corpus is composed of 32 texts, mainly written in the 18th (7 texts), 19th (24 texts) and 20th c. (4 texts). Because the corpus was already tagged and lemmatised (but not fully corrected) following guidelines other than ours, the lemmas have been aligned with our standards and authority lists.
The Presto corpus:
- A gold corpus of 60,000 tokens, taken from 5 texts written in the 16th (1 text), 17th (2 texts) and 18th c. (2 texts). Because the corpus was already tagged and lemmatised following guidelines other than ours, the lemmas have been aligned with our standards.
- The final Presto corpus comes in three versions: noyau (“core”), contrôlé (“controlled”) and étendu (“extended”). We have limited ourselves to a selection from the core version, whose lemmas have been corrected according to our standards and authority lists.

Lemmas are aligned with the LGeRM authority list. Additional tokens have been added and are kept separately.
POS tags and morphology are annotated following the CATTEX tagset.
Cf. here for NER annotation.
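
To make the annotation layers described above more concrete, here is a minimal Python sketch of how token, lemma, POS and morphology columns might be read from a tab-separated file. The column names (`token`, `lemma`, `POS`, `morph`), the `KEY=value|KEY=value` morphology syntax and the sample tag values are assumptions made for illustration only; check the actual files in the repository before relying on them.

```python
import csv
import io

# Hypothetical sample data: the column names and tag values below are
# assumptions for illustration, not copied from the corpus files.
SAMPLE = (
    "token\tlemma\tPOS\tmorph\n"
    "Il\til\tPROper\tPERS.=3|NOMB.=s|GENRE=m|CAS=n\n"
    "estoit\têtre\tVERcjg\tMODE=ind|TEMPS=ipf|PERS.=3|NOMB.=s\n"
    ".\t.\tPONfrt\tMORPH=empty\n"
)


def parse_morph(value):
    """Split a 'KEY=val|KEY=val' string into a dict, skipping parts without '='."""
    features = {}
    for part in value.split("|"):
        key, sep, val = part.partition("=")
        if sep:
            features[key] = val
    return features


def read_rows(handle):
    """Read tab-separated annotations into a list of dicts, one per token."""
    # QUOTE_NONE keeps quotation-mark tokens from being treated as CSV quoting.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {
            "token": row["token"],
            "lemma": row["lemma"],
            "pos": row["POS"],
            "morph": parse_morph(row["morph"]),
        }
        for row in reader
    ]


rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row["token"], row["lemma"], row["pos"], row["morph"])
```

To read a real file instead of the inline sample, pass an open file handle (for instance `open(path, encoding="utf-8", newline="")`) to `read_rows`, adjusting the column names to match the file header.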