Faroese corpus taken from Wikipedia dumps.
This repository will contain a corpus of the Faroese language taken from the content dump of the Faroese Wikipedia.
This project uses pipenv. See the pipenv documentation for how to install it. Then install the project dependencies:
pipenv install
In order to read 7zip archives (used by Wikia's XML dumps) you need to install libarchive:
sudo apt install libarchive-dev
Run pipenv shell before running the scripts.
The longest words taken from the dump, each followed by its length in characters:
1 llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch - 58
2 samvinnufelagiðsamvinnufelagnum - 31
3 krabbameinsgranskingarstovnurin - 31
4 southernplayalisticadillacmuzik - 31
5 barnabókavirðislønavinnararnar - 30
6 norðurlandameistarakappingini - 29
7 sjónvarpsundirhaldssendingini - 29
8 bókmentakritikaraheiðurslønir - 29
9 einstaklingaítróttargreinunum - 29
10 vegsúkklukappingarmeistaranum - 29
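
A ranking like the one above could be produced roughly as follows. This is a minimal sketch, not the repository's actual script: the function name `longest_words`, the `WORD_RE` pattern, and the tie-breaking rule (alphabetical among words of equal length) are all assumptions made for illustration, and the input is assumed to be plain text already extracted from the dump.

```python
import re

# Assumption: a "word" is any run of Unicode letters (no digits, no underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def longest_words(text, limit=10):
    """Return up to `limit` (word, length) pairs, longest first.

    Hypothetical helper: words are lowercased and deduplicated, and ties
    are broken alphabetically so the result is deterministic.
    """
    unique_words = {word.lower() for word in WORD_RE.findall(text)}
    ranked = sorted(unique_words, key=lambda word: (-len(word), word))
    return [(word, len(word)) for word in ranked[:limit]]


if __name__ == "__main__":
    sample = "Føroyar eru oyggjar í Norðuratlantshavi"
    # Print in the same "rank word - length" layout as the list above.
    for rank, (word, length) in enumerate(longest_words(sample, limit=3), start=1):
        print(f"{rank} {word} - {length}")
```

Lengths are counted in Unicode code points, so letters such as ð and ø count as one character each, provided the text is in composed (NFC) form.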