Normalisation models

We provide several models for the automatic normalisation of Early Modern French (17th-century texts) into contemporary French spelling. The models are described in detail in our LREC 2022 paper, Automatic Normalisation of Early Modern French (see the citation below), and in our GitHub repository.

Use the models

Machine Translation (MT) approach

These are MT-style normalisation models trained on FreEM_norm: a statistical phrase-based model and two neural models (LSTM and transformer).

Quick use guide

For easy use, we provide a model through HuggingFace. It is based on the transformer MT model from the paper, ported to HuggingFace and fine-tuned, with some additional post-processing steps to avoid hallucinated words. The model scores will therefore differ slightly from the paper. More information is provided in the GitHub repository.

To use the model, you will need to download the pipeline.py file either from GitHub or from HuggingFace. You can use it on the command line as follows:
cat INPUT_FILE | python pipeline.py -k BATCH_SIZE -b BEAM_SIZE > OUTPUT_FILE
or call it from Python:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from pipeline import NormalisationPipeline  # N.B. local file

tokeniser = AutoTokenizer.from_pretrained("rbawden/modern_french_normalisation")
model = AutoModelForSeq2SeqLM.from_pretrained("rbawden/modern_french_normalisation")
norm_pipeline = NormalisationPipeline(model=model, tokenizer=tokeniser, batch_size=256, beam_size=5)

list_inputs = [
    "Elle haïſſoit particulierement le Cardinal de Lorraine;",
    "Adieu, i'iray chez vous tantoſt vous rendre grace.",
]
list_outputs = norm_pipeline(list_inputs)
print(list_outputs)
# Output:
# ["Elle haïssait particulièrement le Cardinal de Lorraine;", "Adieu, j'irai chez vous tantôt vous rendre grâce."]
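To normalise a whole file from Python rather than from the command line, the lines can be passed to the pipeline in chunks. The sketch below assumes only that the pipeline is a callable taking a list of strings and returning a list of strings, as in the example above. The helper name normalise_in_chunks and the stand-in fake_pipeline are hypothetical and are not part of pipeline.py.

```python
from typing import Callable, Iterable, Iterator, List


def normalise_in_chunks(
    lines: Iterable[str],
    normalise: Callable[[List[str]], List[str]],
    chunk_size: int = 256,
) -> Iterator[str]:
    """Yield normalised lines, calling `normalise` on at most `chunk_size` lines at a time."""
    chunk: List[str] = []
    for line in lines:
        chunk.append(line.rstrip("\n"))  # drop the trailing newline, if any
        if len(chunk) == chunk_size:
            yield from normalise(chunk)
            chunk = []
    if chunk:  # flush the final, possibly shorter, chunk
        yield from normalise(chunk)


def fake_pipeline(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for norm_pipeline so the sketch runs without the model."""
    return [text.replace("ſ", "s") for text in batch]


sample = [
    "Elle haïſſoit particulierement le Cardinal de Lorraine;\n",
    "Adieu, i'iray chez vous tantoſt vous rendre grace.\n",
]
for output_line in normalise_in_chunks(sample, fake_pipeline, chunk_size=1):
    print(output_line)
```

With the real model loaded, norm_pipeline would be passed in place of fake_pipeline, and an open file handle in place of sample.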

Detailed instructions and information

For detailed information on how to use our models and reproduce our results (including training), please visit the GitHub repository.

ABA

This alignment-based approach for 17th-century text normalisation is available on GitHub, and a demo (whose word transformation list was obtained from the train subcorpus of FreEM_norm) is provided here.
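As a rough illustration of what applying a word transformation list involves, the sketch below replaces each word that has an entry in a lookup table and leaves everything else unchanged. It is not the ABA implementation: the table entries, the tokenisation rule and the function name are assumptions made for this example only.

```python
import re
from typing import Dict

# Hypothetical entries, for illustration only; the real list is derived
# from the train subcorpus of FreEM_norm.
TRANSFORMATIONS: Dict[str, str] = {
    "tantoſt": "tantôt",
    "iray": "irai",
    "grace": "grâce",
}

# Split text into alternating runs of word and non-word characters, so that
# joining the pieces back together reproduces the input exactly.
TOKEN = re.compile(r"\w+|\W+")


def apply_transformations(text: str, table: Dict[str, str]) -> str:
    """Replace each word found in `table`; keep other words and punctuation as-is."""
    pieces = TOKEN.findall(text)
    return "".join(table.get(piece, piece) for piece in pieces)


print(apply_transformations("Adieu, i'iray chez vous tantoſt vous rendre grace.", TRANSFORMATIONS))
```

A pure word-level lookup like this one cannot handle changes that depend on context or that cross word boundaries, for example i' becoming j'.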

Publication

Please cite the following article:

Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. 2022. Automatic Normalisation of Early Modern French. In Proceedings of the 13th Language Resources and Evaluation Conference. European Language Resources Association. Marseille, France, pp. 3354-3366.

Bibtex:

@inproceedings{bawden-etal-2022-automatic,
  title = {{Automatic Normalisation of Early Modern French}},
  author = {Bawden, Rachel and Poinhos, Jonathan and Kogkitsidou, Eleni and Gambette, Philippe and Sagot, Beno{\^i}t and Gabay, Simon},
  url = {https://hal.inria.fr/hal-03540226},
  booktitle = {Proceedings of the 13th Language Resources and Evaluation Conference},
  publisher = {European Language Resources Association},
  year = {2022},
  address = {Marseille, France},
  pages = {3354-3366}
}

The MT models can be found on Zenodo:

Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. (2022). FreEM-corpora/FreEM-automatic-normalisation: normalisation models for Early Modern French (1.0). Zenodo. https://doi.org/10.5281/zenodo.6594765

@software{bawden_rachel_2022_6594765,
  author = {Bawden, Rachel and Poinhos, Jonathan and Kogkitsidou, Eleni and Gambette, Philippe and Sagot, Benoît and Gabay, Simon},
  title = {{FreEM-corpora/FreEM-automatic-normalisation: normalisation models for Early Modern French}},
  month = may,
  year = 2022,
  publisher = {Zenodo},
  version = {1.0},
  doi = {10.5281/zenodo.6594765},
  url = {https://doi.org/10.5281/zenodo.6594765}
}

Results

Full results are provided in the code repository and the paper. Here we provide results on the FreEM_norm test set: (i) overall accuracy (symmetrised word accuracy) and (ii) accuracy on out-of-vocabulary words.

Model	Accuracy %	OOV accuracy %
Identity function	72.73	43.00
ABA	95.14	69.50
SMT	97.10±0.02	75.64±0.18
LSTM	96.14±0.08	76.69±0.70
Transformer	95.89±0.07	75.73±0.38
Identity function + Lefff	86.12	64.84
ABA + Lefff	95.44	73.54
SMT + Lefff	97.24±0.02	78.37±0.20
LSTM + Lefff	96.25±0.10	78.35±0.79
Transformer + Lefff	96.01±0.09	77.51±1.00
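To give an idea of what a symmetrised word accuracy measures, the sketch below aligns the reference and hypothesis token sequences, computes the share of matched tokens in each direction, and averages the two. It rests on assumptions: difflib.SequenceMatcher stands in for the alignment procedure, tokens are split on whitespace, and the function names are hypothetical. The exact evaluation used for the table is the one in the code repository and the paper, and the OOV metric is not covered here.

```python
from difflib import SequenceMatcher
from typing import List


def directed_accuracy(source: List[str], target: List[str]) -> float:
    """Share of `source` tokens aligned to an identical token in `target`."""
    if not source:
        return 1.0 if not target else 0.0
    # autojunk=False disables the popularity heuristic, which would otherwise
    # ignore very frequent tokens in long sequences.
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(source)


def symmetrised_accuracy(reference: List[str], hypothesis: List[str]) -> float:
    """Mean of the accuracies computed in the two alignment directions."""
    forward = directed_accuracy(reference, hypothesis)
    backward = directed_accuracy(hypothesis, reference)
    return 0.5 * (forward + backward)


reference = "Elle haïssait particulièrement le Cardinal".split()
hypothesis = "Elle haïssoit particulièrement le Cardinal".split()
print(symmetrised_accuracy(reference, hypothesis))
```

Averaging the two directions keeps the score from favouring outputs that are systematically shorter or longer than the reference.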

Qualitative comparison of results

Using MEDITE, we aligned two automatically normalised versions of the dev subcorpus of FreEM_norm, the first one with the best statistical model (SMT + Lefff) and the second one with the best non-statistical approach (ABA + Lefff). The results of the comparison can be found here.