sylvainloiseau/igtcorpus

Utilities for converting interlinear glossed text (IGT) corpora between the following formats:

  • EMELD (Cathy Bow, Baden Hughes and Steven Bird (2003), "Towards a general model of interlinear text", Proceedings of the EMELD Workshop 2003; available online). Used in particular by SIL FLEx.
  • CoNLL (output only, according to the igtc options)
  • ELAN (see the ELAN website), in a specific configuration that can be adapted to others (input only, according to the igtc options)
  • JSON representation of EMELD

Installation

pip install git+https://github.com/sylvainloiseau/igtcorpus.git#egg=igtcorpus

Usage

Two command-line utilities are installed.

emeld: info about an Emeld document

Print a summary of the fields used in the document and their number of occurrences.

$ emeld summary tests/data/EmeldByFlex.xml
Unit 'EmeldUnit.morph':
        503 occurrences
        fields:
                cf, tww (502 occurrences)
                gls, en (502 occurrences)
                glsAppend, en (4 occurrences)
                glsPrepend, en (4 occurrences)
                hn, tww (231 occurrences)
                msa, en (502 occurrences)
                txt, tww (503 occurrences)
                variantTypes, en (4 occurrences)
Unit 'EmeldUnit.word':
        366 occurrences
        fields:
                gls, en (220 occurrences)
                pos, en (262 occurrences)
                punct, tww (61 occurrences)
                txt, tww (273 occurrences)
Unit 'EmeldUnit.phrase':
        0 occurrences
        fields:
                gls, en (31 occurrences)
                gls, tpi (32 occurrences)
                gls, tww (1 occurrences)
                lit, en (32 occurrences)
                note, tww (10 occurrences)
                note, en (12 occurrences)
                segnum, en (32 occurrences)
Unit 'EmeldUnit.paragraph':
        6 occurrences
        fields:
Unit 'EmeldUnit.text':
        1 occurrences
        fields:
                title, en (1 occurrences)
                title-abbreviation, en (1 occurrences)
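
The summary above counts each unit and, per unit, each (field type, language) pair. The following is a minimal sketch of that idea using only the Python standard library. It is not the `emeld` tool's implementation: it assumes that fields are stored as `item` elements carrying `type` and `lang` attributes, as in FLEx interlinear exports, and the sample document, the `UNIT_TAGS` set and the `summarize` function are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature document, loosely modelled on a FLEx interlinear export.
SAMPLE = """<document>
  <interlinear-text>
    <item type="title" lang="en">Sample</item>
    <paragraphs><paragraph><phrases><phrase>
      <item type="gls" lang="en">the dog runs</item>
      <words>
        <word>
          <item type="txt" lang="tww">a</item>
          <morphemes>
            <morph>
              <item type="txt" lang="tww">a</item>
              <item type="gls" lang="en">dog</item>
            </morph>
          </morphemes>
        </word>
        <word>
          <item type="txt" lang="tww">b</item>
        </word>
      </words>
    </phrase></phrases></paragraph></paragraphs>
  </interlinear-text>
</document>"""

# Assumed element names for the five unit levels (text, paragraph, phrase, word, morph).
UNIT_TAGS = {"interlinear-text", "paragraph", "phrase", "word", "morph"}


def summarize(xml_text):
    """Count units and their (unit, field type, language) triples."""
    root = ET.fromstring(xml_text)
    units = Counter()
    fields = Counter()
    for element in root.iter():
        if element.tag not in UNIT_TAGS:
            continue
        units[element.tag] += 1
        # Only direct <item> children belong to this unit.
        for child in element.findall("item"):
            fields[(element.tag, child.get("type"), child.get("lang"))] += 1
    return units, fields


units, fields = summarize(SAMPLE)
for unit, count in units.items():
    print(f"Unit '{unit}': {count} occurrences")
    for (owner, field_type, lang), n in fields.items():
        if owner == unit:
            print(f"    {field_type}, {lang} ({n} occurrences)")
```

To try it on a real file, read the file contents and pass them to `summarize`; the counts will only be meaningful if the file follows the element layout assumed here.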

igtc: conversion between formats

Command line interface:

$ igtc -i input.xml -o output.json -f emeld -t json -l tww -m en

For the full list of options, see the built-in help:

$ igtc -h
usage: igtc [-h] [--verbose] --output OUTPUT --input INPUT --fromformat {json,emeld,elan} --toformat {json,emeld,conll} [--olanguage OLANGUAGE] [--mlanguage MLANGUAGE]

Utilities for converting between interlinear glossed texts formats.

options:
  -h, --help            show this help message and exit
  --verbose, -v         output detailled information
  --output OUTPUT, -o OUTPUT
                        output file
  --input INPUT, -i INPUT
                        input file
  --fromformat {json,emeld,elan}, -f {json,emeld,elan}
                        input file format
  --toformat {json,emeld,conll}, -t {json,emeld,conll}
                        output file format
  --olanguage OLANGUAGE, -l OLANGUAGE
                        Object language
  --mlanguage MLANGUAGE, -m MLANGUAGE
                        Meta language
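
When many files need converting, the command line above can be driven from a script. The following is a sketch only: it uses the flags and format choices listed in the help output, but the helper names `build_igtc_command` and `run_igtc` are hypothetical, and it assumes `igtc` is on the `PATH`.

```python
import shutil
import subprocess

# Format choices copied from the `igtc -h` output.
INPUT_FORMATS = {"json", "emeld", "elan"}
OUTPUT_FORMATS = {"json", "emeld", "conll"}


def build_igtc_command(input_path, output_path, from_format, to_format,
                       object_lang=None, meta_lang=None):
    """Build the argument list for one igtc invocation (hypothetical helper)."""
    if from_format not in INPUT_FORMATS:
        raise ValueError(f"unsupported input format: {from_format}")
    if to_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {to_format}")
    command = [
        "igtc",
        "--input", str(input_path),
        "--output", str(output_path),
        "--fromformat", from_format,
        "--toformat", to_format,
    ]
    # The language options are optional according to the usage line.
    if object_lang:
        command += ["--olanguage", object_lang]
    if meta_lang:
        command += ["--mlanguage", meta_lang]
    return command


def run_igtc(*args, **kwargs):
    """Run igtc and raise if it is missing or exits with an error."""
    if shutil.which("igtc") is None:
        raise RuntimeError("igtc not found on PATH; is igtcorpus installed?")
    subprocess.run(build_igtc_command(*args, **kwargs), check=True)


# Equivalent to: igtc -i input.xml -o output.json -f emeld -t json -l tww -m en
print(build_igtc_command("input.xml", "output.json", "emeld", "json", "tww", "en"))
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.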

API

from igtcorpus.elan import ElanCorpoAfr
from igtcorpus.igt import Corpus
from igtcorpus.emeld import Emeld
from igtcorpus.json import EmeldJson

# Read...
# - EAF (elan) file
corpus = ElanCorpoAfr.read("tests/data/BEJ_MV_CONV_01_RICH.EAF")
# - Emeld document
corpus = Emeld.read("tests/data/test.emeld.xml")
# - json
corpus = EmeldJson.read("tests/data/tiny.json")

# ...Write...
# - as emeld
Emeld.write(corpus, "corpus.emeld")
# - as JSON
EmeldJson.write(corpus, "corpus.json")
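
The read and write calls above can be combined into a single conversion helper. The following is a sketch under assumptions: the `convert` function and the two lookup tables are hypothetical, and only the readers and writers shown in the API example are wired in (no CoNLL writer appears there, so it is left out). The classes are imported lazily so that an unsupported format is rejected before igtcorpus is loaded.

```python
import importlib

# Module and class names taken from the API example; the tables are illustrative.
READERS = {
    "elan": ("igtcorpus.elan", "ElanCorpoAfr"),
    "emeld": ("igtcorpus.emeld", "Emeld"),
    "json": ("igtcorpus.json", "EmeldJson"),
}
WRITERS = {
    "emeld": ("igtcorpus.emeld", "Emeld"),
    "json": ("igtcorpus.json", "EmeldJson"),
}


def _load(module_name, class_name):
    """Import a class by name only when it is actually needed."""
    return getattr(importlib.import_module(module_name), class_name)


def convert(input_path, output_path, from_format, to_format):
    """Read a corpus in one format and write it in another (hypothetical helper)."""
    if from_format not in READERS:
        raise ValueError(f"no reader for format: {from_format}")
    if to_format not in WRITERS:
        raise ValueError(f"no writer for format: {to_format}")
    reader = _load(*READERS[from_format])
    writer = _load(*WRITERS[to_format])
    corpus = reader.read(input_path)
    writer.write(corpus, output_path)


# Example call (requires igtcorpus and the test data from the repository):
# convert("tests/data/test.emeld.xml", "corpus.json", "emeld", "json")
```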

About

A command-line tool for converting interlinear glossed text (IGT) corpora between popular formats.
