Amanuensis, noun. /əˌmænjuˈensɪs/. Early 17th cent.: Latin, from (servus) a manu ‘(slave) at hand(writing), secretary’ and -ensis ‘belonging to’.
- a person who writes down your words when you cannot write.
- a literary assistant, especially one who writes or types for somebody, or who copies text.
Amanuēsis is an application designed to accelerate normalization tasks in large historical corpora. It increases legibility by expanding abbreviations and replacing Unicode characters in a systematic and context-sensitive way. This type of pre-processing is instrumental to subsequent digital analyses and manipulations. Amanuēsis is designed to handle very large corpora swiftly and, as such, is optimized to use the full potential of your multicore CPU.
- Unicode Character Replacement: A powerful conversion tool to clean up text by removing and/or replacing undesirable characters.
- Dynamic Word Normalization: Expanding abbreviated words using Natural Language Processing, human input, and Large Language Models.
- Parallel Processing: Built with efficiency in mind, Amanuēsis uses parallel processing to make large normalization tasks more manageable.
- Comprehensive Logging: Every modification is tracked and stored in accessible JSON files, enabling further statistical analysis.
- Multilingual Support: Addition of French, Italian, Latin, and Spanish.
- Beyond OpenAI: Compatibility with competing APIs.
- Documentation: Basic documentation in English, French, and Spanish.
Feel free to suggest new features in the Issues section.
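As a rough illustration of the implemented features listed above (character replacement, abbreviation expansion, parallel processing, and JSON logging), here is a minimal Python sketch. It is not the actual Amanuēsis code: the tables CHAR_MAP and ABBREVIATIONS, the function names, and the log layout are all hypothetical, and a plain dictionary lookup stands in for the context-sensitive expansion that the application performs with NLP, human input, and LLMs.

```python
import json
import re
import unicodedata
from concurrent.futures import ProcessPoolExecutor

# Hypothetical replacement table: characters to swap or drop.
CHAR_MAP = {
    "\u017f": "s",   # long s
    "\ua75b": "r",   # r rotunda
    "\u00ad": "",    # soft hyphen, removed
}

# Hypothetical abbreviation dictionary (lowercase keys).
ABBREVIATIONS = {
    "d\u00f1s": "dominus",   # "dñs"
    "gr\u00e3": "gratia",    # "grã"
}


def normalize_text(text):
    """Return (normalized_text, changes) for a single document."""
    changes = []
    # Compose characters first so that "n" + combining tilde matches "ñ".
    text = unicodedata.normalize("NFC", text)

    # Step 1: character-level replacement, logging each substitution.
    pieces = []
    for ch in text:
        if ch in CHAR_MAP:
            changes.append({"type": "char", "old": ch, "new": CHAR_MAP[ch]})
            pieces.append(CHAR_MAP[ch])
        else:
            pieces.append(ch)
    text = "".join(pieces)

    # Step 2: word-level expansion by dictionary lookup.
    def expand(match):
        word = match.group(0)
        expansion = ABBREVIATIONS.get(word.lower())
        if expansion is None:
            return word
        changes.append({"type": "word", "old": word, "new": expansion})
        return expansion

    return re.sub(r"\w+", expand, text), changes


def normalize_corpus(texts, workers=None):
    """Normalize many documents in parallel, one task per document."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(normalize_text, texts))


def write_log(changes, path):
    """Store the list of changes as a JSON file for later analysis."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(changes, handle, ensure_ascii=False, indent=2)


# Sequential demo on one short string.
demo_text, demo_changes = normalize_text("d\u00f1s no\u017fter")
print(demo_text)
print(len(demo_changes), "changes recorded")
```

In a script, normalize_corpus should be called from under an if __name__ == "__main__": guard, since process pools re-import the main module on some platforms. Splitting work per document keeps each task independent, which is what lets the workload spread across CPU cores.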
To use Amanuēsis, clone this repository, navigate to the directory, and run ./run.sh. Alternatively, you can run the app directly from the modules/ folder with python main.py. Before running, make sure to set the input and destination paths in the config.toml file.
See requirements.txt for the list of dependencies.
Contributions are welcome! Please feel free to submit a pull request.
This project is licensed under the terms of the MIT license. For more details, see the LICENSE file.