# A quick-start Paralex package

## Presentation
This repository is an empty Paralex-compliant package model. If you plan to release a new Paralex package, you can fork this repository and start from this model. The Makefile and the preconfigured Continuous Integration (CI) pipeline will help you ensure the replicability of your research results.

Please note that this model is only ONE of the MANY possible ways to organize your Paralex repository. The requirements for a repository to be Paralex-compliant are described in the standard, and this example is not part of the standard. However, if you are starting your first Paralex package, you will probably benefit from using this model to avoid some repetitive steps.
## How to use this repository?

To start your new Paralex package, first read the quick start guide. Then follow these steps:
1. Fork this repository and replace `PackageName`/`packagename` everywhere with the name of your Paralex package.
2. Go through the file structure, understand it, and update the files according to your own needs. If you have read the quick start (step 1), most of the files mentioned here should make sense:

    - `README.md`: This file. Keep it; it will be useful to generate your website (see below).
    - `metadata.yml`: A crucial Paralex file containing all the custom metadata you want to put in your Paralex dataset. Update it!
    - `Makefile`: Using a Makefile will help you make this package replicable. Describe each step with classic command-line instructions or Python code from `format_lexicon.py`. The Makefile already contains a standard outline.
    - `requirements.txt`: The Python packages you might need to reproduce your steps. Update it!
    - `packagename_xxx.csv`: The Paralex data. Replace these files with your own Paralex tables! Keep the names, or update the `metadata.yml` file if you change them.
    - `/docs/datasheet.md`: The official datasheet describing your Paralex project. Complete it!
    - `/epitran`: Useful only if you want to do some grapheme-to-phoneme (G2P) conversion with the Epitran software. If so, you can write your own rules in these files. Otherwise, delete them and the corresponding instructions in the Makefile.
    - `/evaluation`: Put everything useful for quality control here. If you use Epitran (see above), keep the `dev_forms.csv` file and use it to check the quality of your G2P conversion, in combination with the `make evaluate` instruction. This folder can also store random samples of 100 items generated by the `make sample` instruction.
    - `/sources`: You might want to store some source files. This is the right place to do so.
    - `format_lexicon.py`: If you work with Python, a good place to store your code. It already contains some useful functions for G2P conversion and quality checks.
    - `LICENSE`: Put the license text here.
    - `mkdocs.yml`: This file describes how your website will be generated from the Paralex dataset. Please read the instructions below, then update the name and colour of the website in the configuration.
3. Once you have set up your configuration, the Makefile, and the Python scripts needed to produce your dataset, test your build process with make:

    ```
    make all
    ```
4. If everything went fine, go through this README.md and update everything according to your Makefile procedure. It will be a reference for future users who might want to replicate your dataset (see the sketch right after this list for a quick way to inspect a generated table).
5. Create a Zenodo repository and reserve a DOI, as described in the online documentation (step 1, again). You can now update the DOI field in the different files of this repository; instructions are in the online documentation.
6. Push this repository to your GitLab storage. The predefined `.gitlab-ci.yml` file will automatically generate a website if your configuration is correct. Instructions are also in the online documentation (step 1). This script will also publish releases at every new tag.
7. Ready? Delete this introduction from your README.md, make your GitLab repository and minisite public, and push to Zenodo. You're done!
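As a quick sanity check after step 3, you can inspect one of the generated tables from Python. The snippet below is only a minimal sketch, not part of the template: it assumes pandas is installed, and `packagename_forms.csv` is a hypothetical file name following the `packagename_xxx.csv` pattern.

```python
# Minimal sanity check of a generated Paralex table (sketch only).
# "packagename_forms.csv" is a hypothetical name following the
# packagename_xxx.csv pattern; adjust it to your own tables.
import pandas as pd

forms = pd.read_csv("packagename_forms.csv")
print(forms.columns.tolist())  # inspect the column inventory
print(forms.head())            # peek at the first few rows
```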
## About

PackageName is a collection of Livonian nominal paradigms, in phonemic and orthographic notation. They are suited for both computational and manual analysis.
The data is encoded in CSV files, and the metadata follows the Frictionless standard. The dataset conforms to the Paralex standard.
Please cite as:
- Authors. PackageName: Paradigms in Phonemic Notation [Data set]. Zenodo. https://doi.org/10.5281/zenodo.
The data can be downloaded from Zenodo or from the GitLab repository.
## How this lexicon was prepared

Short description of the procedure.
### Summary

You can add a nice Mermaid graph describing all of this:

```mermaid
flowchart LR
    A[(SOURCE1)]:::start ==>|JSON| B(Orthographic paradigms)
    B ==> X
    E[["🖋 G2P rules"]]:::add -.-> X{{Epitran}}:::start
    X ==> C(Phonemic paradigms)
    C ==> D[(Paralex dataset)]:::aim
    Z[(SOURCE2)] -...->|Token frequencies| F
    F[["🖋 Rich annotations"]]:::add --> D
    classDef start stroke:#f00
    classDef aim stroke:#090
    classDef add stroke:#ffa10a
```
## How to re-generate the data

To ensure replicability, we make it possible to rebuild the package from the sources by running the following commands:
```
$ git clone https://gitlab.com/<REPOSITORY>.git
$ cd packagename
$ make all
```
### Getting the sources

You should first clone the git repository:

```
$ git clone https://gitlab.com/<REPOSITORY>.git
$ cd packagename
```
Prepare the Python environment:

```
$ make venv
```
Some other steps:

```
$ make step
```
### Transcriptions

Evaluate the transcription on the dev forms:

```
$ make evaluate
```
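For illustration only, an exact-match evaluation could look like the following Python sketch. This is not the template's actual `make evaluate` recipe (which lives in `format_lexicon.py`), and the column names `phon_pred` and `phon_gold` are hypothetical:

```python
# Sketch of an exact-match check on the dev forms. The columns
# "phon_pred" and "phon_gold" (predicted vs. gold transcriptions)
# are hypothetical; adapt them to your dev_forms.csv.
import csv

with open("evaluation/dev_forms.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

correct = sum(row["phon_pred"] == row["phon_gold"] for row in rows)
print(f"Exact match: {correct}/{len(rows)} ({correct / len(rows):.1%})")
```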
Generate the phonological transcription:

```
$ make transcription
```
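If you use Epitran, the core of this step boils down to something like the minimal sketch below. `fra-Latn` is just an example language code; in practice you would use your own language, or the custom rules from `/epitran`:

```python
# Minimal G2P sketch with Epitran (illustration only; the actual step
# is driven by the Makefile and format_lexicon.py).
import epitran

# "fra-Latn" is an example code; substitute your own language/rules.
epi = epitran.Epitran("fra-Latn")
print(epi.transliterate("bonjour"))  # prints an IPA transcription
```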
### Packaging & Validation

Produce the Frictionless metadata:

```
$ make metadata
```
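Under the hood, generating such metadata with the `frictionless` Python library can be as simple as the following sketch. This is an assumption about one possible implementation, not necessarily what this target does, and `packagename_forms.csv` is again a hypothetical table name:

```python
# Sketch of Frictionless metadata generation (assumption: your Makefile
# target may instead use the paralex tooling or a custom script).
from frictionless import describe

# Infer a data package description from a (hypothetical) table.
package = describe("packagename_forms.csv", type="package")
package.to_yaml("packagename.package.yaml")
```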
Check conformity with the Paralex standard:

```
$ make validate
```
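Schema-level validation can also be run directly from Python with `frictionless`, as in this sketch (the actual target may additionally run the Paralex validator):

```python
# Sketch of a schema-level validation run (illustration only).
from frictionless import validate

report = validate("packagename.package.yaml")
print(report.valid)  # True when all tables pass
if not report.valid:
    # Print a compact summary of the errors found.
    print(report.flatten(["type", "message"]))
```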
It is possible to export a random sample (with a fixed seed) for manual verification:

```
$ make sample
```
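A fixed-seed sample of 100 items, like the one described above, can be drawn in a few lines of Python. This is a sketch of the idea (assuming pandas, a hypothetical table name, and an arbitrary seed), not the template's exact recipe:

```python
# Sketch: draw a reproducible random sample of 100 rows for manual checks.
import pandas as pd

forms = pd.read_csv("packagename_forms.csv")   # hypothetical table name
sample = forms.sample(n=100, random_state=42)  # fixed seed => reproducible
sample.to_csv("evaluation/sample.csv", index=False)
```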
## References

Quote references.