Description
When running the annif upload command to push models to a Hugging Face repository, if an uploaded project is an NN ensemble, the nn-train.mdb directory is also uploaded, and the files it contains can take up a lot of space (yso-en.zip takes 1.26 GB in https://huggingface.co/NatLibFi/FintoAI-data-YSO/tree/main/projects).
The intention was to not upload any files that Annif creates in training preprocessing, see
Lines 112 to 117 in 371d1eb
    def _is_train_file(fname: str) -> bool:
        train_file_patterns = ("-train", "tmp-")
        for pat in train_file_patterns:
            if pat in fname:
                return True
        return False

but this does not work for directories. I already made a "fix": detect "-train" in directory names as well.
However, it now occurs to me that an NN ensemble can be used for online learning, which may require the original nn-train.mdb directory to be present. In that case the directory might need to be uploadable and downloadable via Hugging Face, so the "fix" should not be made. That needs to be checked; for now I am just creating this issue for this possible bug.
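
For discussion, a directory-aware check could look roughly like the sketch below. This is only an illustration, not the actual change: `is_train_path` is a hypothetical helper name, and it assumes the upload code can hand it a relative, slash-separated path for each candidate file. It reuses the same two patterns as `_is_train_file` but tests every path component, so a file inside a directory such as nn-train.mdb is matched as well.

```python
from pathlib import PurePosixPath

# Same patterns as in _is_train_file above.
TRAIN_FILE_PATTERNS = ("-train", "tmp-")


def is_train_path(path: str) -> bool:
    """Hypothetical helper: return True if any component of the path
    (directory names or the file name) contains a training-file pattern."""
    parts = PurePosixPath(path).parts
    return any(pat in part for part in parts for pat in TRAIN_FILE_PATTERNS)


if __name__ == "__main__":
    # Example paths are made up for illustration.
    for candidate in (
        "projects/example/nn-train.mdb/data.mdb",
        "projects/example/model-file",
        "projects/example/tmp-cache",
    ):
        print(candidate, is_train_path(candidate))
```

If the nn-train.mdb directory turns out to be needed for online learning, a check like this would have to exempt it, or be made optional, rather than being applied unconditionally.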