Normalisation models

We provide several models for the automatic normalisation of Early Modern French (17th-century texts) into contemporary French spelling norms. The models are described in detail in our LREC 2022 paper Automatic Normalisation of Early Modern French (see below for the citation) and in our GitHub repository.

Use the models

Machine Translation (MT) approach

MT-style normalisation models trained on FreEMnorm. We provide a statistical phrase-based model and two neural models (LSTM and transformer).

Quick use guide

For easy use, we provide a model through HuggingFace. It is based on the transformer MT model from the paper, ported to HuggingFace and fine-tuned, with some additional post-processing steps to avoid hallucinated words. The model scores will therefore differ slightly from the paper. More information is provided in the GitHub repository.

To use the model, you will need to download the pipeline.py file either from GitHub or from HuggingFace. You can use it on the command line as follows:
cat INPUT_FILE | python pipeline.py -k BATCH_SIZE -b BEAM_SIZE > OUTPUT_FILE
or use it directly from Python:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from pipeline import NormalisationPipeline # N.B. local file

tokeniser = AutoTokenizer.from_pretrained("rbawden/modern_french_normalisation")
model = AutoModelForSeq2SeqLM.from_pretrained("rbawden/modern_french_normalisation")
norm_pipeline = NormalisationPipeline(model=model,
                                      tokenizer=tokeniser,
                                      batch_size=256,
                                      beam_size=5)

list_inputs = ["Elle haïſſoit particulierement le Cardinal de Lorraine;", "Adieu, i'iray chez vous tantoſt vous rendre grace."]
list_outputs = norm_pipeline(list_inputs)
print(list_outputs)

>> ["Elle haïssait particulièrement le Cardinal de Lorraine;", "Adieu, j'irai chez vous tantôt vous rendre grâce."]

Detailed instructions and information

For detailed information on how to use our models and reproduce our results (including training), please visit the GitHub repository.

ABA

ABA is an alignment-based approach to the normalisation of 17th-century texts. It is available on GitHub, and a demo (whose word transformation list was learned from the train subcorpus of FreEMnorm) is provided here.
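
At its core, this kind of approach applies a learned list of word transformations to the input text. The sketch below illustrates the general idea only: the `normalise` function and the entries in `TRANSFORMATIONS` are illustrative examples, not the actual ABA implementation or its FreEMnorm-derived list.

```python
import re

# Illustrative word transformation list. The real ABA list is learned
# from alignments between original and normalised text in the FreEMnorm
# train subcorpus.
TRANSFORMATIONS = {
    "eſtoit": "était",
    "meſme": "même",
    "ſçavoir": "savoir",
}

def normalise(text):
    """Replace each token found in the transformation list, leaving
    unknown tokens and punctuation unchanged."""
    def repl(match):
        word = match.group(0)
        return TRANSFORMATIONS.get(word, word)
    return re.sub(r"\w+", repl, text)

print(normalise("il eſtoit meſme là"))  # → il était même là
```

Because unknown tokens pass through unchanged, such a lookup-based normaliser cannot generalise to out-of-vocabulary words, which is reflected in the OOV results below.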

Publication

Please cite the following article:

Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. 2022. Automatic Normalisation of Early Modern French. In Proceedings of the 13th Language Resources and Evaluation Conference, pages 3354–3366, Marseille, France. European Language Resources Association.

Bibtex:

@inproceedings{bawden-etal-2022-automatic,
  title = {{Automatic Normalisation of Early Modern French}},
  author = {Bawden, Rachel and Poinhos, Jonathan and Kogkitsidou, Eleni and Gambette, Philippe and Sagot, Beno{\^i}t and Gabay, Simon},
  url = {https://hal.inria.fr/hal-03540226},
  booktitle = {Proceedings of the 13th Language Resources and Evaluation Conference},
  publisher = {European Language Resources Association},
  year = {2022},
  address = {Marseille, France},
  pages = {3354--3366}
}

The MT models can be found on Zenodo:

Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. (2022). FreEM-corpora/FreEM-automatic-normalisation: normalisation models for Early Modern French (1.0). Zenodo. https://doi.org/10.5281/zenodo.6594765

@software{bawden_rachel_2022_6594765,
  author = {Bawden, Rachel and Poinhos, Jonathan and Kogkitsidou, Eleni and Gambette, Philippe and Sagot, Benoît and Gabay, Simon},
  title = {{FreEM-corpora/FreEM-automatic-normalisation: normalisation models for Early Modern French}},
  month = may,
  year = 2022,
  publisher = {Zenodo},
  version = {1.0},
  doi = {10.5281/zenodo.6594765},
  url = {https://doi.org/10.5281/zenodo.6594765}
}

Results

Full results are provided in the code repository and the paper. Here we provide results on the FreEMnorm test set: (i) overall accuracy (symmetrised word accuracy) and (ii) accuracy on out-of-vocabulary words.

Model                        Accuracy %     Accuracy on OOV words %
Identity function            72.73          43.00
ABA                          95.14          69.50
SMT                          97.10±0.02     75.64±0.18
LSTM                         96.14±0.08     76.69±0.70
Transformer                  95.89±0.07     75.73±0.38
Identity function + Lefff    86.12          64.84
ABA + Lefff                  95.44          73.54
SMT + Lefff                  97.24±0.02     78.37±0.20
LSTM + Lefff                 96.25±0.10     78.35±0.79
Transformer + Lefff          96.01±0.09     77.51±1.00
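
For reference, plain word accuracy over pre-aligned reference/hypothesis pairs can be computed as in the sketch below. The `word_accuracy` helper is hypothetical and simplified: the symmetrised word accuracy used in the paper additionally handles tokenisation mismatches between the two sides via alignment, which this sketch omits by assuming a one-to-one token correspondence.

```python
def word_accuracy(references, hypotheses):
    """Percentage of hypothesis tokens matching the reference token in
    the same position, assuming both sides are already token-aligned."""
    correct = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_toks, hyp_toks = ref.split(), hyp.split()
        assert len(ref_toks) == len(hyp_toks), "sketch assumes 1-1 alignment"
        total += len(ref_toks)
        correct += sum(r == h for r, h in zip(ref_toks, hyp_toks))
    return 100 * correct / total

# Two of three tokens match, so accuracy is roughly 66.67%.
print(word_accuracy(["le chat dort"], ["le chien dort"]))
```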

Qualitative comparison of results

Using MEDITE, we aligned two automatically normalised versions of the dev subcorpus of FreEMnorm: one produced with the best statistical model (SMT + Lefff) and the other with the best non-statistical approach (ABA + Lefff). The results of the comparison can be found here.