Using identifierSpace when the entity set uses a mixture of namespaces #139

@osma

There has been some discussion of identifierSpace and schemaSpace before, e.g. in issue #3 and PR #76, and their definitions have shifted over time. The current definition of identifierSpace, in both the latest draft spec and version 0.2, is:

identifier space
The URI namespace (i.e. prefix) for the identifiers of an entity returned by the reconciliation service, for example http://www.wikidata.org/entity/ or https://d-nb.info/gnd/. This URI MAY resolve to a page describing these entities and their identifiers;

We are currently implementing reconciliation API support for Annif (see NatLibFi/Annif#734), and providing the identifierSpace information has caused some headaches. Returning the service manifest is mandatory, and identifierSpace is in turn mandatory within the manifest: "A reconciliation service MUST define two URIs [...] identifierSpace ... schemaSpace"

Service manifest Example 1 given in the spec uses this identifierSpace:

"identifierSpace": "http://vocab.getty.edu/doc/#GVP_URLs_and_Prefixes",

(FWIW, I would like to point out that this doesn't seem to match the definition well - IIRC this is not the URI namespace prefix of any Getty vocabulary, but the URL of a web page explaining them. But that is a separate problem; maybe the example is just outdated.)

Annif uses SKOS vocabularies internally, and often such a vocabulary uses one specific URI namespace; in my understanding, this would be the natural value for identifierSpace. But Annif is currently unaware of this namespace, and in principle nothing prevents a vocabulary from using a mixture of namespaces. For example, a vocabulary could consist of a mixture of Wikidata and GND entities. A perhaps more realistic example would be a mixture of YSO concepts and those of a domain-specific extension vocabulary such as KAUNO (fiction literature), JUHO (public administration) or TERO (health and welfare). All of these are extensions of YSO - you can think of them naively as additional concepts added on top of YSO - and each uses its own URI namespace, which is different from that of YSO.

So what should Annif return in the service manifest for a project that uses a vocabulary whose URI namespace it isn't aware of? Should it look at all the concept URIs and try to infer their longest common prefix? What if the URIs come from a mixture of namespaces and have nothing in common - say, a mixture of http and https URIs?
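To make the "longest common prefix" idea concrete, here is a minimal sketch of what such inference could look like. It is not Annif code: the function name `infer_identifier_space` is hypothetical, the concept URIs are only illustrative, and the rule of cutting the prefix back to the last `/` or `#` is my own assumption about what counts as a namespace.

```python
import os.path
from urllib.parse import urlsplit


def infer_identifier_space(uris):
    """Guess a shared URI namespace for the given concept URIs.

    Hypothetical helper: returns the longest common prefix, cut back to
    the last '/' or '#', or None if the URIs share no usable namespace.
    """
    uris = list(uris)
    if not uris:
        return None
    # Character-wise longest common prefix of all URIs.
    prefix = os.path.commonprefix(uris)
    # Cut back to the last namespace delimiter so that a shared start of
    # the local names (e.g. the "p" in "p1234") is not included.
    cut = max(prefix.rfind("/"), prefix.rfind("#"))
    if cut < 0:
        return None
    candidate = prefix[: cut + 1]
    # A prefix such as "http://" has no host part and is not a namespace.
    if not urlsplit(candidate).netloc:
        return None
    return candidate


# Illustrative URIs only; the exact identifiers are assumptions.
single = ["http://www.yso.fi/onto/yso/p1234", "http://www.yso.fi/onto/yso/p5678"]
extended = single + ["http://www.yso.fi/onto/kauno/p1"]
mixed = ["http://www.wikidata.org/entity/Q42", "https://d-nb.info/gnd/118540238"]

print(infer_identifier_space(single))
print(infer_identifier_space(extended))
print(infer_identifier_space(mixed))
```

With a single namespace this recovers the expected prefix. With an extension vocabulary added, it returns a shorter prefix that is not the namespace of either vocabulary, and for the Wikidata/GND mixture it finds nothing at all, which is exactly the problem described above.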

Or should the value be something more custom (somewhat like the Getty document in the example) that isn't really a URI namespace at all, but is unique to the vocabulary / entity set? For example, the reconciliation service at /rest/v1/projects/myproject/reconcile could return an identifierSpace of /rest/v1/vocabs/myvocab (i.e. the vocabulary used by myproject). That doesn't seem to match the current definition of identifierSpace, as it talks specifically about URI namespace prefixes, but would at least be a shared identifier that could also be referenced by other endpoints at the same Annif instance which use the same underlying vocabulary.

Or is it OK to return an identifierSpace of "" (the current quick-and-dirty solution in the Annif draft PR)? It seems to work fine with OpenRefine; apparently this information is not used at all. Maybe providing identifierSpace shouldn't be a MUST in the spec if it's not actually used by the main client tool that this API is targeting.
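For comparison, the three options discussed above could be sketched as follows. This is only an illustration: `build_manifest` is a hypothetical helper, the field names follow my reading of the 0.2 manifest and should be checked against the spec, and the schemaSpace value (the SKOS namespace) and the YSO prefix are assumptions rather than anything Annif currently returns.

```python
import json


def build_manifest(project_name, identifier_space):
    """Build a minimal service manifest as a plain dict (hypothetical helper)."""
    return {
        "versions": ["0.2"],
        "name": f"Annif reconciliation service for {project_name}",
        "identifierSpace": identifier_space,
        # Assumption: the SKOS namespace is used as the schemaSpace.
        "schemaSpace": "http://www.w3.org/2004/02/skos/core#",
    }


# The three candidate values for identifierSpace discussed in this issue.
candidates = {
    "namespace prefix": "http://www.yso.fi/onto/yso/",
    "vocabulary path": "/rest/v1/vocabs/myvocab",
    "empty string": "",
}

for label, value in candidates.items():
    manifest = build_manifest("myproject", value)
    print(label, json.dumps(manifest, indent=2))
```

Only the first value matches the current definition of identifierSpace; the second is not a URI namespace prefix, and the third carries no information at all.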

Labels: question (Further information is requested)
