Tingmál

Tingmál is an unofficial structured dataset of documents, acts, and proceedings from the Parliament of the Faroe Islands (Løgtingið).

Note: This is not an official publication of Løgtingið.

Overview

What it is: A cleaned, structured compilation intended for research, analysis, and tooling.
What it isn't: An official publication of Løgtingið.
Typical uses: text mining, parliamentary analytics, search/indexing experiments, Faroese language data, dataset tooling.

Project Structure

The repository is organized into the following directories:

tingmal/
├── parliamentary-questions/   # Parliamentary questions organized by year (2008-2025)
│   └── YYYY/                     # Each year contains XML files named: 52-NNN-YYYY.xml
├── legislation/               # Laws, guidelines, and procedural rules
│   └── loegtingid/               # Parliament-specific regulations and procedures
├── proposals/                 # Legislative proposals organized by year
├── reports/                   # Committee reports organized by year
├── decisions/                 # Administrative decisions from various bodies
├── coalition-agreements/      # Government coalition agreements
├── misc/                      # Miscellaneous documents
├── utils/                     # Python utilities for data processing
│   ├── compute_stats.py          # Generate statistics from sentences.jsonl
│   ├── section52a_coverage.py    # Compute parliamentary question coverage
│   ├── detect_gaps.py            # Detect gaps in question numbering
│   ├── export_ids.py             # Export and manage sentence IDs
│   └── id_utils.py               # Generate unique base32 IDs
└── sentences.jsonl            # Deduplicated sentence-level dataset (23,664+ sentences)

Key files:

All documents are encoded in TEI/XML format (Text Encoding Initiative)
Parliamentary questions follow naming convention: 52-NNN-YYYY.xml where NNN is the question number
sentences.jsonl contains all extracted Faroese sentences with unique identifiers
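
The naming convention can be checked mechanically. The following is a minimal sketch, not code from the repository: the helper name parse_question_filename and the regular expression are assumptions for illustration, and the pattern accepts any number of digits for NNN.

```python
import re
from pathlib import Path

# Matches names such as 52-001-2024.xml; the digit width of NNN is not enforced.
PQ_NAME = re.compile(r"^52-(?P<number>\d+)-(?P<year>\d{4})\.xml$")


def parse_question_filename(path):
    """Return (question_number, year) for a 52-NNN-YYYY.xml path, else None."""
    match = PQ_NAME.match(Path(path).name)
    if match is None:
        return None
    return int(match["number"]), int(match["year"])


print(parse_question_filename("parliamentary-questions/2024/52-001-2024.xml"))
```

Files that do not follow the convention yield None, so they can be skipped or reported separately.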

Data Formats

TEI/XML Format

All documents in this repository use the TEI (Text Encoding Initiative) P5 XML standard, a widely used format for encoding structured texts in the humanities and social sciences.

Document Structure

Each TEI/XML document consists of two main parts:

1. TEI Header (<teiHeader>) - Contains metadata:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>52-1/2024: Miðfyrisitingin</title>
      </titleStmt>
      <publicationStmt>
        <publisher>Rani Høgnason Hansen</publisher>
        <idno type="url">https://github.com/hoegnason/tingmal</idno>
        <availability status="free">
          <licence target="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</licence>
        </availability>
      </publicationStmt>
      <sourceDesc>
        <bibl type="parliamentary_document">
          <publisher>Føroya Løgting</publisher>
          <author>
            <persName ref="https://tingdata.fo/person/...">Name</persName>
          </author>
          <ref target="https://www.logting.fo/documents/..."/>
        </bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>

The header includes:

Document title and identification
License information (CC BY 4.0)
Original source references with URLs to official Løgtingið documents
Author information with persistent identifiers
Editorial notes about text correction and segmentation

2. Text Body (<text>) - Contains the actual content:

  <text>
    <body>
      <div type="question">
        <div type="questioner">
          <head>Spyrjari: </head>
          <persName>Name, <roleName>løgtingsmaður</roleName></persName>
        </div>
        <div type="question-list">
          <s xml:id="woyjvu7qcg" xml:lang="fo">
            Hvør er árligi kostnaðurin av aðalráðunum árini 2019 – 2023?
          </s>
        </div>
        <div type="background">
          <s xml:id="uf6ofajlqn" xml:lang="fo">
            Hetta er ein uppfylging uppá spurning 52-133/2023...
          </s>
        </div>
      </div>
    </body>
  </text>
</TEI>

Structural elements:

<div type="question"> - Parliamentary question container
<div type="questioner"> - Person asking the question
<div type="respondent"> - Person who must respond (usually a minister)
<div type="subject"> - Subject/topic of the question
<div type="question-list"> - The actual questions
<div type="background"> - Background context and justification

Sentence-level markup:

Every sentence is wrapped in an <s> element
xml:id attribute: 10-character base32 identifier (globally unique)
xml:lang attribute: Language code (fo = Faroese, da = Danish)
IDs are generated from a cryptographically secure random source and serve as stable identifiers for citation

Example sentence:

<s xml:id="woyjvu7qcg" xml:lang="fo">Hvør er árligi kostnaðurin?</s>
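
Reading the sentence markup requires the namespace-qualified attribute names. The sketch below uses Python's standard xml.etree.ElementTree so that it runs without extra dependencies (the project's own scripts use lxml). The function name iter_sentences and the shortened SAMPLE document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Shortened sample; real files also carry a full <teiHeader>.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div type="question-list">
    <s xml:id="woyjvu7qcg" xml:lang="fo">Hvør er árligi kostnaðurin?</s>
  </div></body></text>
</TEI>"""


def iter_sentences(root):
    """Yield (xml:id, xml:lang, text) for every TEI <s> element under root."""
    for s in root.iter(f"{{{TEI_NS}}}s"):
        # Join nested text nodes, then collapse runs of whitespace.
        text = " ".join("".join(s.itertext()).split())
        yield s.get(f"{{{XML_NS}}}id"), s.get(f"{{{XML_NS}}}lang"), text


for sentence in iter_sentences(ET.fromstring(SAMPLE)):
    print(sentence)
```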

JSONL Format

The sentences.jsonl file uses JSONL (JSON Lines) format - a newline-delimited JSON format where each line is a complete, valid JSON object. This format is ideal for streaming large datasets and line-by-line processing.

Structure

Each line contains a single sentence object:

{"id": "clbkpo3l72", "text": "\"Teppið skrikt undan teimum\" við hesi skerjing, sum fór fram, og eftirfylgjandi hevur tað havt negativa ávirkan á teirra dagliga og lívsgóðsku og teirra sosialu møguleikar verða skerdir og sjálvbjargni minkar."}
{"id": "n42wesqrbx", "text": "\"Vit skulu ikki gloyma, at umhvørvisfelagsskapir eru líka ágangandi móti veiðu á djúphavinum, sum móti grindadrápi\"."}

Fields

id (string): The 10-character base32 identifier from xml:id in the TEI/XML source
- Format: lowercase letters and digits from base32 alphabet (a-z, 2-7)
- Always starts with a letter (XML requirement)
- Globally unique across all documents in the repository
text (string): The sentence text with normalized whitespace
- Multiple spaces/tabs collapsed to single space
- Leading and trailing whitespace removed
- Original Faroese orthography preserved (including diacritics: áíóúýæøð)
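
The field rules above can be expressed as a small validator. This is a sketch under stated assumptions: is_valid_record is a hypothetical helper, and the whitespace check treats any Unicode whitespace run as something that should already have been collapsed, which may be stricter than the project's own normalization.

```python
import json
import re

# First character a-z, followed by nine characters from the base32 alphabet.
ID_PATTERN = re.compile(r"[a-z][a-z2-7]{9}")


def is_valid_record(line):
    """Check one JSONL line against the documented id and text rules."""
    record = json.loads(line)
    sentence_id, text = record.get("id"), record.get("text")
    if not isinstance(sentence_id, str) or not isinstance(text, str):
        return False
    normalized = " ".join(text.split())
    return ID_PATTERN.fullmatch(sentence_id) is not None and text == normalized


print(is_valid_record('{"id": "woyjvu7qcg", "text": "Hvør er árligi kostnaðurin?"}'))
```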

Properties

Format: One JSON object per line (no comma between objects)
Encoding: UTF-8
Language: Only Faroese sentences (xml:lang="fo"); Danish excluded
Deduplication: Duplicate sentences removed (only first occurrence retained)
Sorting: Alphabetically sorted by text (case-insensitive)
Size: 23,664+ unique sentences

Reading JSONL in Python

import json

# Read line by line (memory efficient)
with open('sentences.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        obj = json.loads(line)
        print(f"ID: {obj['id']}, Text: {obj['text'][:50]}...")

Processing Pipeline

The following diagram shows how data flows from source documents to the final datasets:

PDF Documents (logting.fo)
         ↓
   Manual extraction & OCR correction
         ↓
   TEI/XML Documents (.xml files)
         ↓
   ID Assignment (export_ids.py)
   • Scans all XML files
   • Generates unique 10-char IDs
   • Adds xml:id to <s> elements
         ↓
   Sentence Extraction (export_ids.py)
   • Extracts text from <s> elements
   • Filters by language (fo only)
   • Normalizes whitespace
         ↓
   Deduplication & Sorting
   • Removes duplicate sentences
   • Sorts alphabetically
         ↓
   sentences.jsonl (Final dataset)

Detailed Processing Steps

1. ID Generation (id_utils.py)

Generates cryptographically secure random IDs using Python's secrets module
Uses base32 alphabet: abcdefghijklmnopqrstuvwxyz234567
Ensures IDs start with a letter (XML compliance)
Tracks used IDs in utils/used_ids.txt to prevent collisions
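
A generator with these properties might look like the sketch below. It is not the code in id_utils.py: only the name generate_b32_id and the length parameter come from this README, while the used_ids argument and the retry loop are assumptions. Persisting IDs to utils/used_ids.txt is left out.

```python
import secrets

ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"
LETTERS = ALPHABET[:26]


def generate_b32_id(length=10, used_ids=None):
    """Return a random base32 ID that starts with a letter and is not in used_ids."""
    used_ids = set() if used_ids is None else used_ids
    while True:
        first = secrets.choice(LETTERS)  # xml:id values may not start with a digit
        rest = "".join(secrets.choice(ALPHABET) for _ in range(length - 1))
        candidate = first + rest
        if candidate not in used_ids:
            used_ids.add(candidate)  # record the ID so it is never handed out twice
            return candidate


print(generate_b32_id())
```

With 26 choices for the first character and 32 for each of the remaining nine, collisions are rare, but the explicit check keeps uniqueness independent of probability.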

2. XML Processing (export_ids.py)

Parses XML with lxml while preserving whitespace
Namespace handling:
- TEI namespace: http://www.tei-c.org/ns/1.0
- XML namespace: http://www.w3.org/XML/1998/namespace
XPath queries: tree.xpath('//tei:s[@xml:id]', namespaces=namespaces)
Adds missing IDs to sentences without xml:id attribute
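
The ID assignment step can be sketched as follows. To stay dependency-free, this example uses the standard library's xml.etree.ElementTree rather than lxml, so it does not reproduce the whitespace-preserving behaviour described above. The names new_id and add_missing_ids are hypothetical stand-ins, not the functions in export_ids.py.

```python
import secrets
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"


def new_id(used_ids):
    """Hypothetical ID generator: one letter, then nine base32 characters."""
    while True:
        candidate = secrets.choice(ALPHABET[:26]) + "".join(
            secrets.choice(ALPHABET) for _ in range(9))
        if candidate not in used_ids:
            return candidate


def add_missing_ids(root, used_ids):
    """Give every TEI <s> without xml:id a fresh ID; return how many were added."""
    sentences = list(root.iter(f"{{{TEI_NS}}}s"))
    # First pass: register IDs that already exist so they are never reissued.
    used_ids.update(s.get(XML_ID) for s in sentences if s.get(XML_ID))
    added = 0
    for s in sentences:
        if s.get(XML_ID) is None:
            fresh = new_id(used_ids)
            s.set(XML_ID, fresh)
            used_ids.add(fresh)
            added += 1
    return added


doc = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<s xml:id="woyjvu7qcg" xml:lang="fo">Hvør er árligi kostnaðurin?</s>'
    '<s xml:lang="fo">Hetta er ein uppfylging.</s>'
    '</body></text></TEI>')
used = set()
print(add_missing_ids(doc, used), len(used))
```

Running the function a second time on the same tree adds nothing, which is the property that keeps existing IDs stable across regenerations.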

3. Sentence Extraction

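# Sketch of the extraction logic using lxml (run from the repository root)
from pathlib import Path
from lxml import etree

TEI_S = "{http://www.tei-c.org/ns/1.0}s"
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

sentences = []
for xml_file in sorted(Path(".").rglob("*.xml")):
    tree = etree.parse(str(xml_file))
    for sentence_element in tree.iter(TEI_S):
        if sentence_element.get(XML_NS + "lang") == "da":
            continue  # Skip Danish sentences

        # Join the text of the element and its children, then normalize whitespace
        text = " ".join("".join(sentence_element.itertext()).split())

        sentences.append({
            "id": sentence_element.get(XML_NS + "id"),
            "text": text,
        })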

4. Deduplication & Output

Duplicates removed based on text content
First occurrence of each sentence retained
Remaining sentences sorted alphabetically (case-insensitive)
Written as JSONL (one JSON object per line)
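
A minimal sketch of this step is shown below. The function names are hypothetical, the IDs and texts in the example are made up, and deduplicating before sorting is an assumption that follows the order shown in the pipeline diagram.

```python
import json


def dedupe_and_sort(sentences):
    """Keep the first record per distinct text, then sort by case-folded text."""
    seen = set()
    unique = []
    for record in sentences:
        if record["text"] not in seen:
            seen.add(record["text"])
            unique.append(record)
    return sorted(unique, key=lambda r: r["text"].casefold())


def to_jsonl(records):
    """Serialize records as JSON Lines, keeping non-ASCII characters readable."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


rows = [
    {"id": "bbbbbbbbbb", "text": "b"},
    {"id": "aaaaaaaaaa", "text": "A"},
    {"id": "cccccccccc", "text": "b"},  # duplicate text, dropped
]
print(to_jsonl(dedupe_and_sort(rows)), end="")
```

Passing ensure_ascii=False keeps characters such as ð and ø as literal UTF-8 in the output instead of \u escapes.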

Regenerating the Dataset

To regenerate sentences.jsonl after modifying XML files:

cd utils
python3 export_ids.py

This will:

Scan all XML files in the repository
Assign IDs to any new sentences
Extract all Faroese sentences
Generate a fresh sentences.jsonl in the parent directory (the repository root)

Utility Scripts

The utils/ directory contains Python scripts for data processing and quality assurance:

`compute_stats.py`

Generates statistics from sentences.jsonl:

python3 utils/compute_stats.py [path/to/sentences.jsonl]

Output: Markdown table with metrics:

Total sentence count
Token count (space-split)
Vocabulary size (unique tokens, case-folded)
Average/median sentence length
Percentile ranges
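
The core metrics can be approximated in a few lines. This sketch is not the code in compute_stats.py: sentence_stats is a hypothetical function, the sample strings are toy data, and percentile ranges and character lengths are omitted.

```python
import statistics


def sentence_stats(texts):
    """Compute space-split token statistics for a list of sentence strings."""
    lengths = [len(t.split(" ")) for t in texts]
    tokens = [tok.casefold() for t in texts for tok in t.split(" ")]
    return {
        "sentences": len(texts),
        "tokens": len(tokens),
        "types": len(set(tokens)),  # unique tokens after case folding
        "avg_length": round(sum(lengths) / len(lengths), 2),
        "median_length": statistics.median(lengths),
    }


print(sentence_stats(["a b c", "A b"]))
```

Because tokens are split on spaces only, punctuation stays attached to words, so "kostnaðurin?" and "kostnaðurin" count as different types.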

`section52a_coverage.py`

Computes parliamentary question coverage by year:

python3 utils/section52a_coverage.py

Output:

PQ_STATS.json - Machine-readable coverage data
PQ_STATS.md - Markdown table for README
Console output showing gaps in question numbering

`detect_gaps.py`

Detects missing parliamentary questions by analyzing filename sequences:

python3 utils/detect_gaps.py

Identifies missing question numbers (e.g., 52-015-2020 through 52-019-2020).
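
Gap detection over a sequence of question numbers can be sketched as below. The function find_gaps is hypothetical and the input numbers are made up. Note that this approach only finds gaps between the lowest and highest number present; it cannot tell whether questions are missing before the first or after the last one.

```python
def find_gaps(numbers):
    """Return (start, end) ranges of question numbers missing between min and max."""
    present = sorted(set(numbers))
    gaps = []
    for prev, cur in zip(present, present[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


print(find_gaps([13, 14, 20, 21, 23]))  # [(15, 19), (22, 22)]
```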

`export_ids.py`

Core processing script - assigns IDs and generates sentences.jsonl:

cd utils
python3 export_ids.py

Functions:

xml_files(path) - Recursively finds all .xml files
parse_sentences(filepath) - Extracts existing IDs
add_ids_to_file(filepath, used_ids) - Adds missing IDs
parse_sentences_for_extraction(filepath) - Extracts sentence text
process_files(path) - Main processing loop

`id_utils.py`

ID generation utility (imported by other scripts):

from id_utils import generate_b32_id

# Generate a new 10-character ID
new_id = generate_b32_id(length=10)  # e.g., 'woyjvu7qcg'

Dependencies

The utility scripts require:

Python 3.10+
lxml - XML processing library
```
pip install lxml
```

Provenance & Legal Notes

See the headers of individual documents for the original source of data.
Content includes material exempt under §9 (public documents) and §27 (public debate) of the Faroese Copyright Act.

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

You are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.

Stats

The summary below was computed from sentences.jsonl, a sentence-level JSONL file with fields id and text. It contains only full, deduplicated sentences where formatting such as bullet points, ordinal list numbers, and legal section symbols (§) has been removed. Sentence segmentation has been reviewed by a Faroese native speaker.

Metric	Value
Sentences	23,945
Tokens (space-split)	469,088
Types (unique tokens, case-folded)	50,559
Avg. sentence length (tokens)	19.59
Median sentence length (tokens)	18
5th-95th percentile sentence length (tokens)	7-39
Avg. sentence length (characters)	127.4

Coverage by Decade

How the dataset is distributed across different decades:

Decade	Sentences	% of Total	Tokens	Types	Avg. Length (tokens)	Avg. Length (chars)
1900s	8	0.03%	137	98	17.12	85.8
1940s	13	0.05%	246	151	18.92	114.2
1990s	785	3.28%	13,487	4,196	17.18	112.8
2000s	1,400	5.85%	27,417	7,394	19.58	127.6
2010s	9,731	40.64%	192,050	28,557	19.74	128.7
2020s	11,319	47.27%	221,287	28,276	19.55	127.0
Unknown	689	2.88%	14,464	3,442	20.99	133.8

Coverage of Parliamentary Questions

The table below reports yearly coverage of parliamentary questions - currently limited to §52a - by comparing the number of TEI/XML records present in this repository with the official yearly totals. These figures are computed from the files under parliamentary-questions/<YEAR>/ using the script in utils/section52a_coverage.py.

Year	Collected	Official total	Coverage	Missing
2008	24	39	61.5%	15
2009	50	115	43.5%	65
2010	85	85	100.0%	0
2011	46	46	100.0%	0
2012	60	60	100.0%	0
2013	66	66	100.0%	0
2014	108	108	100.0%	0
2015	71	71	100.0%	0
2016	76	100	76.0%	24
2017	24	86	27.9%	62
2018	31	88	35.2%	57
2019	43	120	35.8%	77
2020	79	169	46.7%	90
2021	222	222	100.0%	0
2022	135	135	100.0%	0
2023	141	141	100.0%	0
2024	119	119	100.0%	0

Totals: Collected 1,380 of 1,770 (overall coverage 78%). Note: these figures currently exclude regular written and oral parliamentary questions; those will be added in a later release.
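
The Coverage and Missing columns follow directly from the two counts. As a small worked check (the helper name coverage is an assumption, not taken from section52a_coverage.py):

```python
def coverage(collected, official):
    """Return (percentage rounded to one decimal, missing count) for one row."""
    return round(100 * collected / official, 1), official - collected


print(coverage(24, 39))      # 2008 row: (61.5, 15)
print(coverage(1380, 1770))  # totals: (78.0, 390)
```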

Contributing

Issues and pull requests are welcome. Please open an issue to discuss substantial changes.

Disclaimer

Tingmál is provided “as is,” without warranties of any kind.
The author(s)/maintainer(s) are not affiliated with Løgtingið.
You are responsible for complying with applicable law when redistributing or adapting the data. For example, §27(2) limits the "public debate" exemption: the original contributor retains the exclusive right to collections consisting only of their own contributions.
