nn-train.mdb directory is uploaded to Hugging Face with annif upload #911

@juhoinkinen

Description

When running the annif upload command to push models to a Hugging Face repository, if an uploaded project is an NN ensemble, the nn-train.mdb directory is also uploaded, and the files it contains can take up a lot of space (yso-en.zip takes 1.26 GB in https://huggingface.co/NatLibFi/FintoAI-data-YSO/tree/main/projects).

The intention was to not upload any files that Annif creates in training preprocessing, see

Annif/annif/hfh_util.py

Lines 112 to 117 in 371d1eb

def _is_train_file(fname: str) -> bool:
    train_file_patterns = ("-train", "tmp-")
    for pat in train_file_patterns:
        if pat in fname:
            return True
    return False

but this does not work for directories. I already made a "fix": Detect "-train" also in directory names.
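To illustrate the difference, here is a minimal sketch, not the actual change. It assumes the check is applied to bare file names and that the LMDB directory contains a file named data.mdb. The names is_train_file and is_train_path and the example path are hypothetical, used only for illustration.

```python
from pathlib import PurePosixPath

# Same patterns as in the snippet quoted above.
TRAIN_FILE_PATTERNS = ("-train", "tmp-")


def is_train_file(fname: str) -> bool:
    """File-name-only check, equivalent to the quoted _is_train_file."""
    return any(pat in fname for pat in TRAIN_FILE_PATTERNS)


def is_train_path(path: str) -> bool:
    """Hypothetical variant: check every component of the path,
    so files inside a matching directory are also detected."""
    return any(is_train_file(part) for part in PurePosixPath(path).parts)


# Assumed layout: an LMDB directory holding a data.mdb file.
path = "projects/yso-en/nn-train.mdb/data.mdb"
print(is_train_file(PurePosixPath(path).name))  # False: "data.mdb" matches no pattern
print(is_train_path(path))  # True: the "nn-train.mdb" component matches
```

With a name-only check the file inside the directory slips through, whereas checking each path component would exclude everything under nn-train.mdb.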

However, it now occurs to me that an NN ensemble can be used for online learning, which may require the original nn-train.mdb directory to be present. In that case the directory would need to be uploadable and downloadable via Hugging Face, and the "fix" should not be made. This needs to be checked; for now I am just creating this issue to record the possible bug.
