
A quick-start Paralex package

Presentation

This repository is an empty, Paralex-compliant package template. If you plan to release a new Paralex package, you can fork this repository as a starting point. The Makefile and the preconfigured Continuous Integration (CI) pipeline will help you ensure the replicability of your research results.

Please note that this template is only ONE of the MANY possible ways to organize your Paralex repository. The requirements for a repository to be Paralex-compatible are described in the standard, and this example is not part of the standard itself. However, if you are starting your first Paralex package, you will probably benefit from using this template to avoid some repetitive steps.

How to use this repository?

To start your new Paralex package, first read the quick start guide. Then follow these steps:

  1. Fork this repository and replace PackageName/packagename everywhere with the name of your Paralex package.

  2. Go through the file structure, understand it, and update the files according to your own needs. If you have read the quick start guide (step 1), most of the files mentioned here should make sense:

    • README.md : This file. Keep it; it will be useful for generating your website (see below).
    • metadata.yml : A crucial Paralex file, containing all the custom metadata you want to attach to your Paralex dataset. Update it!
    • Makefile : Using a Makefile helps make this package replicable. Describe each step with standard command-line instructions or with Python code from format_lexicon.py. The Makefile already contains a standard outline.
    • requirements.txt : Lists the Python packages you might need to reproduce your steps. Update it!
    • packagename_xxx.csv : The Paralex data. Replace these files with your own Paralex tables! Keep the names, or update the metadata.yml file if you change them.
    • /docs/datasheet.md : The official datasheet describing your Paralex project. Complete it!
    • /epitran : Useful only if you want to perform grapheme-to-phoneme (G2P) conversion with the Epitran software. If so, you can write your own rules in these files. Otherwise, you can delete them, along with the corresponding instructions in the Makefile.
    • /evaluation : Put everything useful for quality control here. If you use Epitran (see above), keep the dev_forms.csv file and use it to check the quality of your G2P conversion, in combination with the make evaluate instruction. This folder can also store the random samples of 100 items generated by the make sample instruction.
    • /sources : If you want to store source files, this is the right place to do so.
    • format_lexicon.py : If you work with Python, this is a good place to store your code. It already contains some useful functions for G2P conversion and quality checks.
    • LICENSE : Put the license text here.
    • mkdocs.yml : This file describes how your website will be generated from the Paralex dataset. Please read the instructions below, then update the name and colour of the website in the configuration.
  3. Once you have set up your configuration, the Makefile, and the Python scripts needed to produce your dataset, test your build process with make:

    make all
    
  4. If everything went well, go through this README.md and update everything according to your Makefile procedure. It will serve as a reference for future users who want to replicate your dataset.

  5. Create a Zenodo repository and reserve a DOI, as described in the online documentation (step 1, again). You can then update the DOI field in the different files of this repository.

  6. Push this repository to your GitLab storage. The predefined .gitlab-ci.yml file will automatically generate a website if your configuration is correct; instructions are also in the online documentation (step 1). This script will also publish releases at every new tag.

  7. Ready? Delete this introduction from your README.md, make your GitLab repository and minisite public, and push to Zenodo. You're done!



About

PackageName is a collection of Livonian nominal paradigms, in phonemic and orthographic notation. They are suited for both computational and manual analysis.

The data is encoded in CSV files, and the metadata follows the Frictionless standards. The dataset conforms to the Paralex standard.

Please cite as:

The data can be downloaded from Zenodo or from the GitLab repository.

How this lexicon was prepared

Short description of the procedure.

Summary

You can add a nice Mermaid graph describing this workflow:

flowchart LR
    A[(SOURCE1)]:::start ==>|JSON| B(Orthographic
                                    paradigms)
        B ==> X
        E[["🖋 G2P rules"]]:::add -.-> X{{Epitran}}:::start
        X ==> C(Phonemic
                paradigms)
        C ==> D[(Paralex
                dataset)]:::aim
        Z[(SOURCE2)] -...->|Token frequencies| F
        F[["🖋 Rich annotations"]]:::add --> D

classDef start stroke:#f00
classDef aim stroke:#090
classDef add stroke:#ffa10a

How to re-generate the data

To ensure replicability, we make it possible to rebuild the package from the sources by running the following commands:

$ git clone https://gitlab.com/<REPOSITORY>.git
$ cd packagename
$ make all

Getting the sources

You should first clone the git repository:

$ git clone https://gitlab.com/<REPOSITORY>.git
$ cd packagename

Preparing the Python environment:

$ make venv

Some other steps:

$ make step

Transcriptions

Evaluating the transcription on dev forms:

$ make evaluate
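
For orientation, here is a minimal Python sketch of the kind of check this step performs, assuming dev_forms.csv pairs automatically generated transcriptions with hand-checked gold ones (the column names below are hypothetical; adapt them to your own table):

    # A minimal sketch, not the template's actual evaluation code.
    # Assumes dev_forms.csv has a generated column "phon_form" and a
    # hand-checked column "gold_form" (both names are hypothetical).
    import pandas as pd

    dev = pd.read_csv("evaluation/dev_forms.csv")
    matches = dev["phon_form"] == dev["gold_form"]
    print(f"Exact-match accuracy on dev forms: {matches.mean():.2%}")
    print(dev[~matches])  # inspect the mismatches manually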

Phonological transcription:

$ make transcription
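
In Python, a G2P pass of this kind can be written with Epitran's transliterate method; a minimal sketch, where the file name, column names, and language code are placeholders (your custom rules from /epitran would plug into Epitran's configuration):

    # A minimal sketch of a G2P pass over a forms table, not the
    # actual pipeline. File name, column names, and the language code
    # are placeholders; adapt them to your own data.
    import epitran
    import pandas as pd

    epi = epitran.Epitran("liv-Latn")  # placeholder code; see /epitran for custom rules

    forms = pd.read_csv("packagename_forms.csv")
    forms["phon_form"] = forms["orth_form"].map(epi.transliterate)
    forms.to_csv("packagename_forms.csv", index=False)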

Packaging & Validation

We produce Frictionless metadata:

$ make metadata
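
If you generate the metadata from Python, the paralex package offers a paralex_factory helper that assembles a Frictionless package descriptor; a sketch under the assumption that your forms table keeps the default name (check the Paralex documentation for the exact signature):

    # A sketch using the paralex package's paralex_factory helper;
    # the title and table path are placeholders (see the Paralex docs).
    from paralex import paralex_factory

    package = paralex_factory(
        "PackageName: Livonian nominal paradigms",
        {"forms": {"path": "packagename_forms.csv"}},
    )
    package.to_yaml("packagename.package.yaml")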

Check conformity with the Paralex standard:

$ make validate
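
Frictionless validation can also be run directly from Python; a minimal sketch (the descriptor file name is a placeholder, and full Paralex conformity involves more checks than Frictionless alone):

    # A minimal sketch of validating the package descriptor with the
    # frictionless library; the descriptor name is a placeholder.
    from frictionless import validate

    report = validate("packagename.package.yaml")
    print(report.valid)  # True if all tables pass
    if not report.valid:
        print(report)    # detailed error listing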

It is possible to export a random sample (with a fixed seed) for manual verification:

$ make sample
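
The fixed seed makes the sample deterministic, so every run selects the same rows; a minimal pandas sketch (file name and seed value are placeholders; the sample size of 100 matches the /evaluation folder described above):

    # A minimal sketch of drawing a reproducible random sample;
    # the file name and seed value are placeholders.
    import pandas as pd

    forms = pd.read_csv("packagename_forms.csv")
    sample = forms.sample(n=100, random_state=42)  # fixed seed: same rows each run
    sample.to_csv("evaluation/sample.csv", index=False)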

References

Quote references.