Identifiers

Content-Derived Identifiers in the Semantic Engine

Built into the Semantic Engine is a particular kind of identifier called a SAID (Self-Addressing Identifier). Unlike traditional identifiers that are assigned to a resource, SAIDs are derived directly from the content itself. They are computed—typically using cryptographic hashing—so the identifier is intrinsically bound to the exact bytes of the resource it represents.

These identifiers are not designed to be human-friendly. They are long, opaque strings. But that trade-off enables something more important for research and data systems: verification. If a resource is referenced by a SAID, you can independently confirm that what you have is exactly what was intended. If the content changes, the identifier no longer matches. In that sense, SAIDs are tamper-evident and self-authenticating.
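
The binding between content and identifier can be illustrated with a minimal sketch. It assumes SHA-256 and URL-safe Base64 as the digest and encoding, and the function names are hypothetical; the actual SAID derivation (its hash algorithm, encoding, and how the identifier is embedded in a resource) is not described in this post and may differ.

```python
import base64
import hashlib


def content_digest(data: bytes) -> str:
    """Return a URL-safe Base64 SHA-256 digest of the exact bytes given."""
    raw = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def matches(data: bytes, identifier: str) -> bool:
    """Recompute the digest and compare it with the identifier being checked."""
    return content_digest(data) == identifier


# Placeholder content standing in for a schema or dataset.
original = b'{"name": "soil_moisture", "unit": "percent"}'
identifier = content_digest(original)

print(matches(original, identifier))         # same bytes: True
print(matches(original + b" ", identifier))  # one extra byte: False
```

Even a single added space produces a different digest, which is what makes this kind of identifier tamper-evident.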

Why Identifier Types Matter in Standards

Many specifications—particularly in research data and interoperability frameworks—depend on identifiers and are explicit about what types are allowed. This ensures consistency, portability, and long-term usability across systems.

One commonly accepted class is the URN (Uniform Resource Name). Because URNs are standardized and designed for persistence, they are frequently permitted in specifications where long-lived, location-independent identifiers are required.

IANA and Global Recognition

The Internet Assigned Numbers Authority (IANA) is responsible for coordinating key elements of the internet’s infrastructure, including identifier namespaces. When IANA registers a namespace, it becomes part of the globally recognized technical foundation used across systems and standards.

SAIDs have now been formally registered with IANA as a new URN namespace: urn:said. This elevates them from an ecosystem-specific mechanism to a globally recognized identifier scheme.

URNs vs URLs

A URN identifies what something is, while a URL (Uniform Resource Locator) identifies where something is located.

URNs are not inherently resolvable—you cannot simply use one to retrieve a resource without additional infrastructure. Instead, they are designed to be persistent names that systems can interpret.

SAIDs fit naturally into this model but add an important property: because they are content-derived, they can be independently verified. Anyone can build a resolver that retrieves content and checks whether it matches the SAID. Trust does not depend on the resolver—it depends on the content itself.
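
A rough sketch of such a verifying resolver follows. The `urn:said:` prefix followed directly by a digest string is an assumed layout, the hex SHA-256 digest is a stand-in for the real SAID derivation, and the function names and in-memory stores are hypothetical.

```python
import hashlib
from typing import Callable, Optional

URN_PREFIX = "urn:said:"  # assumed layout: prefix followed by the digest string


def digest_of(data: bytes) -> str:
    # Stand-in digest (hex SHA-256); a real SAID uses its own derivation.
    return hashlib.sha256(data).hexdigest()


def resolve_verified(urn: str, fetch: Callable[[str], Optional[bytes]]) -> bytes:
    """Fetch content from any source and accept it only if it matches the URN."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError("not a urn:said identifier")
    expected = urn[len(URN_PREFIX):]
    data = fetch(expected)
    if data is None:
        raise LookupError("resolver returned nothing")
    if digest_of(data) != expected:
        raise ValueError("content does not match identifier")
    return data


schema = b"example schema bytes"
urn = URN_PREFIX + digest_of(schema)

# Two untrusted stores: one honest, one returning altered content.
honest = {digest_of(schema): schema}
tampered = {digest_of(schema): b"something else"}

print(resolve_verified(urn, honest.get) == schema)  # True
try:
    resolve_verified(urn, tampered.get)
except ValueError as err:
    print("rejected:", err)
```

The caller never has to trust either store: the check is performed on the returned bytes, so a dishonest resolver is detected rather than believed.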

Implications for Research Data Infrastructure

The registration of urn:said means that SAIDs can now be used anywhere URNs are accepted. This has direct implications for research data standards and infrastructure.

The Semantic Engine already uses SAIDs to generate secure, tamper-evident identifiers. With official URN recognition, those identifiers can now integrate cleanly into broader ecosystems—supporting interoperability across repositories, metadata standards, and distributed workflows.

This represents a shift in how identifiers function within research systems. Instead of relying solely on assigned names backed by registries, systems can incorporate identifiers that are self-verifying by design. For research data—where integrity, provenance, and reproducibility are central concerns—this provides a stronger and more flexible foundation.

Written by Carly Huitema

In Canada, national research data infrastructure is coordinated by the Digital Research Alliance of Canada (DRAC). The Alliance provides the digital tools and platforms that researchers depend on to manage data, perform advanced computing, and leverage research software. Supported by federal funding, DRAC works with partners across the country to expand access, improve security, and strengthen the digital research workforce. These efforts enable Canadian researchers in all disciplines to conduct more efficient, secure, and interoperable research.

As part of the ongoing modernization of Canada’s research infrastructure, DRAC is preparing to introduce a national Registration Agency for Research Activity Identifiers (RAiDs). RAiDs are a relatively new category of persistent identifiers designed to support the accurate identification, management, and linking of research activities—often conceptualized as research “projects”—throughout their full lifecycle.

Why RAiDs Matter

RAiDs provide a globally unique, persistent identifier for a research activity and connect that activity to:

People (researchers, collaborators)
Organizations (institutions, funders)
Outputs (publications, datasets, software)
Related resources (grants, ethics approvals, infrastructure)

This enables research projects to be tracked, referenced, and integrated across multiple systems. RAiDs are especially relevant in environments where interoperability is critical, such as national and international research data platforms.

For Canada, RAiDs are being positioned as a foundational component of the Canadian Research Data Platform, where they will facilitate information exchange between services, reduce duplication, and improve project-level transparency across institutions.

How RAiDs Are Minted

A key principle of the RAiD system is that researchers cannot independently mint RAiD identifiers. RAiDs must be generated through a recognized RAiD Service Provider. This approach ensures consistency, quality, and proper registration within the global RAiD infrastructure.

Two other widely used identifiers illustrate different minting processes:

ORCID allows individuals to mint their own researcher identifier at the ORCID website.
DOIs, however, must be issued by an authorized DOI service provider such as Dataverse, Zenodo, or Figshare.

RAiDs follow the DOI model rather than the ORCID model. Institutions, not individuals, carry the responsibility for minting and maintaining the associated metadata.

As DRAC moves toward becoming a national RAiD Registration Agency, Canadian researchers and institutions will gain a dedicated domestic pathway to obtain RAiDs that are recognized and resolvable globally.

The Global RAiD Registry

All RAiD identifiers and their metadata are maintained in a centralized global registry managed by the International RAiD Data Service, currently coordinated by the Australian Research Data Commons and partner organizations. This registry serves as the authoritative source for RAiD information and provides stable, persistent resolution of RAiD identifiers.

The registry stores:

The RAiD itself
Descriptive metadata about the research activity
Relationships to researchers, institutions, datasets, and grants
Activity lifecycle events (start, updates, completion)
Version histories and changes over time
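
To make the shape of such a record concrete, here is a loose sketch of a data structure holding the kinds of information listed above. The class and field names are illustrative only and are not taken from the RAiD metadata schema; all values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Link:
    kind: str        # e.g. "person", "organization", "output", "grant"
    identifier: str  # e.g. an ORCID iD or a DOI (placeholders below)


@dataclass
class ActivityRecord:
    raid: str
    title: str
    links: List[Link] = field(default_factory=list)
    events: List[str] = field(default_factory=list)
    version: int = 1

    def update(self, event: str) -> None:
        """Record a lifecycle event and bump the record version."""
        self.events.append(event)
        self.version += 1


record = ActivityRecord(raid="RAID-PLACEHOLDER", title="Example project")
record.links.append(Link("person", "ORCID-PLACEHOLDER"))
record.links.append(Link("output", "DOI-PLACEHOLDER"))
record.update("started")
record.update("completed")

print(record.version)  # 3
print(record.events)   # ['started', 'completed']
```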

Functionally, the RAiD registry operates in a manner similar to:

DataCite, which maintains DOI metadata
ORCID, which maintains researcher metadata

It is the central location where systems can query, resolve, and verify RAiD information.

The RAiD metadata schema is published openly and can be reviewed at:
https://metadata.raid.org/en/v1.6/index.html

Can Anyone Use the RAiD Metadata Schema?

Any organization—or individual—can choose to document their research activities using the publicly available RAiD metadata model. However, without going through an authorized RAiD Service Provider, they cannot mint an official RAiD identifier, and the resulting record will not be registered in the global RAiD registry or participate in the broader RAiD ecosystem.

Official registration is what ensures global uniqueness, persistent resolution, and interoperability across research platforms.

Conclusion

RAiDs are emerging as a critical component of modern research infrastructure, offering a structured, persistent mechanism for identifying and connecting research activities with all related people, outputs, and systems. The Digital Research Alliance of Canada’s plan to establish a national RAiD Registration Agency represents a significant step toward improving the coordination, traceability, and interoperability of research in Canada.

As Canada’s research ecosystem continues to evolve, the adoption of standardized, globally recognized identifiers like RAiDs will support more transparent, connected, and efficient research workflows—benefiting researchers, institutions, and the broader scientific community.

Written by Carly Huitema

On November 19th, Carly Huitema presented at the Trust over IP 5-Year Symposium on emerging opportunities to use ORCID as a trust registry to help build secure, verifiable research data spaces (a recording is available on YouTube). As research becomes increasingly digital and distributed, identity plays a central role in how data is created, shared, and validated. ORCID is already foundational to this ecosystem, and new standards such as Decentralized Identifiers (DIDs) open the door to extending its function beyond its original scope.

ORCID’s Existing Role in Trust and Attribution

For more than a decade, ORCID has provided the research community with a simple but essential service: a persistent, globally unique identifier for every researcher. ORCID IDs sit at the centre of scholarly workflows, connecting people to publications, datasets, affiliations, peer reviews, grants, and contributions.

ORCID IDs reduce ambiguity, streamline reporting, and support better attribution. Because ORCID records are self-maintained, researchers can manage their own scholarly identities while benefiting from integrations with publishers, repositories, and institutions.

One important feature is ORCID’s use of trust signals. When an external organization—such as a university or publisher—adds data to a researcher’s ORCID record, a green check mark appears. This indicates that the information was supplied by a trusted, authenticated source rather than by the researcher alone. These verified entries help create a more authoritative identity record, making ORCID a dependable reference point throughout the research landscape.

Why Bring Decentralized Identifiers Into the Picture?

The W3C’s Decentralized Identifier (DID) standard introduces cryptographically verifiable identifiers designed for secure digital interactions. Unlike traditional identifiers, a DID resolves to a DID Document, which contains:

Authentication keys — for proving control of the identifier.
Assertion keys — for signing statements such as dataset provenance records.
Key agreement keys — for establishing encrypted communication channels.
Service endpoints — for interacting with the DID subject or associated services.
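
As a rough illustration, a DID Document can be pictured as a JSON-like structure in which each kind of key is listed under its own property. The property names below (`verificationMethod`, `authentication`, `assertionMethod`, `keyAgreement`, `service`) follow the W3C DID data model; the `did:example` identifier, key type, key values, service entry, and the helper function are placeholders for illustration.

```python
did_document = {
    "id": "did:example:123",
    "verificationMethod": [
        {"id": "did:example:123#key-1", "type": "Multikey",
         "controller": "did:example:123", "publicKeyMultibase": "z-placeholder-1"},
        {"id": "did:example:123#key-2", "type": "Multikey",
         "controller": "did:example:123", "publicKeyMultibase": "z-placeholder-2"},
        {"id": "did:example:123#key-3", "type": "Multikey",
         "controller": "did:example:123", "publicKeyMultibase": "z-placeholder-3"},
    ],
    "authentication": ["did:example:123#key-1"],   # proving control
    "assertionMethod": ["did:example:123#key-2"],  # signing statements
    "keyAgreement": ["did:example:123#key-3"],     # encrypted channels
    "service": [
        {"id": "did:example:123#data", "type": "ExampleService",
         "serviceEndpoint": "https://example.org/data"},
    ],
}


def keys_for(doc: dict, purpose: str) -> list:
    """Return the verification methods listed under one relationship."""
    methods = {m["id"]: m for m in doc.get("verificationMethod", [])}
    found = []
    for entry in doc.get(purpose, []):
        # An entry is either a reference (string) or an embedded method (dict).
        if isinstance(entry, str):
            if entry in methods:
                found.append(methods[entry])
        else:
            found.append(entry)
    return found


print(keys_for(did_document, "assertionMethod")[0]["id"])  # did:example:123#key-2
```

A verifier checking a signed provenance record would look up the `assertionMethod` keys, while a login service would use the `authentication` keys.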

The underlying storage of DID data varies depending on the DID method. It may be anchored in distributed ledgers, file systems, websites, or other registries. More than 200 DID methods exist, all interoperable through the same resolution model.

These keys provide practical capabilities:

Authenticate into systems using cryptographic proofs.
Sign datasets, workflows, and research outputs.
Securely transfer data using encrypted channels.
Support machine-to-machine operations with verifiable identity.

This aligns closely with the needs of modern research data ecosystems.

Two Proposals: How ORCID Could Support DIDs

Carly’s presentation introduced two possible integration models for combining ORCID with DIDs.

Proposal 1: Researchers List Their DIDs Within ORCID

The simplest model is to allow researchers to store one or more DIDs in their ORCID profile. ORCID remains the authoritative registry for researcher identity, while DIDs provide a layer of cryptographic capability.

This approach would enable:

Provenance Tracking

Researchers could sign datasets, computational workflows, or experimental logs using their DID keys, enabling verifiable provenance across repositories and platforms.

Authentication

DID authentication keys could be used for secure, passwordless login to research infrastructure—HPC clusters, repositories, cloud notebooks, and more.

Secure Data Transfer

Key agreement keys and service endpoints could support encrypted communication channels for sensitive or controlled-access data.

Building Data Spaces on Existing Keys

Research data spaces could use DID information listed in ORCID as a ready-made trust layer, without requiring a separate identity infrastructure.

This model preserves ORCID’s role while extending the capabilities of researcher identifiers.

Proposal 2: ORCID Hosts Full DID Infrastructure

A second, more ambitious option is for ORCID to operate DID infrastructure directly. In this model, ORCID could:

Allow researchers to create DIDs in addition to listing them.
Issue DIDs using DID methods suited for web publication (e.g., did:webvh, did:webs).
Provide self-certifying identifiers with key pre-rotation support.
Maintain verifiable histories and cryptographic proofs of key and document changes.
Support witnesses or watchers that monitor DID updates.
Publish DID Documents via the ORCID registry for portability and persistence.

This would bootstrap more secure research data spaces by giving the entire ecosystem access to standardized, interoperable, verifiable researcher identifiers backed by ORCID’s trust framework.

This technology is freely available as open-source tooling from either the Government of British Columbia (did:webvh) or the KERI Foundation (did:webs), and it could be hosted by any organization, including Agri-food Data Canada (ADC). An advantage of DIDs is that they are decentralized: they can be operated by many different entities and even used in combination for multi-key security.

Toward a More Verifiable Research Ecosystem

ORCID already plays a central role in identity and attribution. Adding support for Decentralized Identifiers, whether simply listed in profiles or fully hosted by ORCID, ADC, or any other research data space, would expand its capabilities to include authentication, digital signatures, secure communication, and cryptographically verifiable provenance.

As research data spaces continue to develop, these capabilities become essential. ORCID’s existing trust signals, combined with DID-based cryptographic assurance, could form a powerful foundation for next-generation research infrastructure—linking people, systems, and data through verifiable, interoperable digital identity.

For more on these technologies, see the short educational videos at Bite Size Trust.

Written by Carly Huitema

At Agri-food Data Canada (ADC), we often emphasize the importance of content-derived identifiers—unique fingerprints generated from the actual content of a resource. These identifiers are especially valuable in research and data analysis, where reproducibility and long-term verification are essential. When you cite a resource using a derived identifier, such as a digest of source code, you’re ensuring that years down the line, anyone can confirm that the referenced material hasn’t changed.

One of the best tools for managing versioned documents, especially code, is GitHub, which is built on the Git version control system. Not only does GitHub make it easy to track changes over time, but Git also automatically generates a derived identifier every time you commit your work.

What Is a GitHub Commit Digest?

Every time you make a commit (i.e., save a snapshot of your code or document), Git computes a SHA-1 digest of that commit, and GitHub displays it. This digest is a unique identifier derived from the content and metadata of the commit. It acts as a cryptographic fingerprint that lets anyone detect changes to the data.

Here’s what goes into creating a GitHub commit digest:

Snapshot of the File Tree: Includes all file names, contents, and directory structure.
Parent Commit(s): References to previous commits, which help maintain the history.
Author Information: Name, email, and timestamp of the person who wrote the code.
Committer Information: May differ from the author; includes who actually committed the change.
Commit Message: The message describing the change.

All of this is bundled together and run through the SHA-1 hashing algorithm, producing a 40-character hexadecimal string like:

e68f7d3c9a4b8f1e2c3d4a5b6f7e8d9c0a1b2c3d

GitHub typically displays only the first 7–8 characters of this digest (e.g., e68f7d3c), which is usually enough to uniquely identify a commit within a repository.
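
The bundling and hashing described above can be reproduced in a few lines. Git hashes a short header (the object type and the size of the body) followed by the body itself; the sketch below applies this to an empty file, whose ID is a well-known constant, and to a commit body. The author, timestamp, and message are placeholders, and the function name is hypothetical.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over '<type> <size>' plus a NUL byte, followed by the object body."""
    header = f"{obj_type} {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


# An empty file always hashes to this well-known blob ID.
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# A commit body lists the tree, any parents, author, committer, then the message.
# This one has no parent line, as for the first commit in a repository.
commit_body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A Researcher <a.researcher@example.org> 1700000000 +0000\n"
    b"committer A Researcher <a.researcher@example.org> 1700000000 +0000\n"
    b"\n"
    b"Initial commit\n"
)
commit_id = git_object_id("commit", commit_body)
print(len(commit_id))  # 40
```

Because the tree ID, parent IDs, and message are all part of the hashed body, changing any file, any earlier commit, or even the commit message yields a different digest.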

Why You Should Reference Commit Digests

When citing code or documents stored on GitHub, always include the commit digest. This practice ensures that:

Your references are precise and verifiable.
Others can reproduce your work exactly as you did.
The cited material remains unchanged and trustworthy, even years later.

Whether you’re publishing a paper, sharing an analysis, or collaborating on a project, referencing the commit digest helps maintain transparency and reproducibility, promoting FAIR research.
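
One practical way to include the digest in a citation is to link to the file as it existed at that commit. The sketch below builds such a link using GitHub's `/blob/<commit>/<path>` URL form; the function name, owner, repository, and file path are hypothetical, and the digest is the example string shown earlier rather than a real commit.

```python
import re

FULL_SHA = re.compile(r"[0-9a-f]{40}")


def github_permalink(owner: str, repo: str, commit: str, path: str) -> str:
    """Build a link to a file as it existed at one specific commit."""
    if not FULL_SHA.fullmatch(commit):
        # Short prefixes can become ambiguous as a repository grows.
        raise ValueError("use the full 40-character commit digest")
    return f"https://github.com/{owner}/{repo}/blob/{commit}/{path}"


print(github_permalink(
    "example-org", "example-repo",
    "e68f7d3c9a4b8f1e2c3d4a5b6f7e8d9c0a1b2c3d",
    "analysis/model.py",
))
```

Requiring the full digest in citations is a design choice here: the abbreviated form is convenient on screen, but only the full string is guaranteed to keep pointing at a single commit.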

Final Thoughts

GitHub’s built-in support for derived identifiers makes it a powerful platform for version control and long-term citation. By simply noting the commit digest when referencing code or documents, you’re contributing to a more robust and verifiable research ecosystem.

So next time you cite GitHub work, take a moment to copy that digest. It’s a small step that makes a big difference in the integrity of your research.

Written by Carly Huitema

In previous blog posts, we’ve discussed identifiers—specifically, derived identifiers, which are calculated directly from the digital content they represent. The key advantage of a derived identifier is that anyone can verify that the cited content is exactly what was intended. When you use a derived identifier, it ensures that the digital resource is authentic, no matter where it appears.

In contrast, authoritative identifiers work differently. They must be resolved through a trusted service, and you have to rely on that service to ensure the identifier hasn’t been altered and that the target hasn’t changed.

The Limitations of Derived Identifiers

One drawback of derived identifiers is that they only work for content that can be processed to generate a unique digest. Additionally, once an identifier is created, the content can no longer be updated. This can be a challenge when dealing with dynamic content, such as an evolving dataset or a standard that goes through multiple versions.

This brings us to the concept of identity, which goes beyond a simple identifier.

What Does Identity Mean?

Let’s take an example. The Global Alliance for Genomics and Health (GA4GH) publishes a data standard called Phenopackets. In this case Phenopackets is an identifier. Currently, there are two released versions (and two identifiers). However, anyone could create a new schema and call it “Phenopackets v3.” The key question is: is just naming something and giving it an identifier enough to have it be recognized as Phenopackets v3?

A name is not enough. What matters is whether GA4GH itself releases “Phenopackets v3,” because we care about who endorses the identifier. In this case, identity comes from GA4GH, the governing organization of Phenopackets.

Identity Through Reputation

Identity is established through reputation, which is gained in two main ways:

Transferred reputation – When an official organization (like GA4GH) endorses an identifier, the identity is backed by its authority and reputation.
Acquired reputation – Even without a governing body, something gains identity via reputation if it becomes widely recognized and trusted.

For example, Bitcoin was created by an anonymous person (or group) using the pseudonym Satoshi Nakamoto, a name that is not linked to any legal identity that could lend it reputation. Yet the name Satoshi Nakamoto has a strong identity through acquired reputation because of Bitcoin’s success and widespread recognition.

The key is that identity isn’t just about an identifier—it’s about who assigns it and why people trust it. To fully capture identity, we need to track not only the identifier but also the authority or reputation behind it.

How Do We Use an Identity?

Right now, we don’t have a universal system for identifying and verifying identity in a structured, machine-readable way. This is because identity combines an identifier with the reputation or authority behind it, and our current identifier systems don’t clearly recognize these two aspects. Instead, we rely on indirect methods, like website URLs and domain names, as stand-ins for the authority behind an identity.

For example, to verify the identity of the Phenopackets schema, you would want to find its associated authority. You might search for the Phenopackets name (the identifier) online or follow a link to its official GitHub repository. But how do you know that the GitHub page is legitimate? To confirm, you would check whether the official GA4GH website links to it. Otherwise, anyone could create a GitHub repository and name it Phenopackets. The identifier is not enough; you also need to find the authority associated with the identity.

Academic journals are another example of how the authority behind an identity is presented. When they publish research, they lend their reputation and peer-review process to the paper, building its reputation and identity. However, this system has flaws. When researchers cite papers, they use DOIs, which are identifiers of specific journal articles. The connection between a publication’s DOI and the identity of the paper is not standardized, which makes it hard to discover important changes to the paper such as corrections and retractions. Sometimes the retraction notice appears alongside the article on the journal’s webpage, but this doesn’t always happen. This disconnect between the identifiers and the identity of publications leads to the proliferation of zombie publications, which continue to be used even after they have been debunked.

Future Directions

As it stands, we lack effective tools for managing digital identity. This gap creates risks, including identity impersonation and difficulties in tracking updates, corrections, or retractions. Because our current citation system focuses on identifiers without strongly linking them to identity, important information can get lost. Efforts are underway to address these challenges, but we’re still in the early stages of finding solutions.

One early technology that addresses how an identity evolves is the Decentralized Identifier (DID). We’ll talk more about DIDs later, but they allow an identifier to be assigned to an identity that evolves and is provably under the control of an associated governance authority.

We hope this post has helped clarify the distinction between identifiers and identity, which are often entangled, and why finding better ways to assign and verify identity is a problem worth solving.

Written by Carly Huitema

Funding for Agri-food Data Canada is provided in part by the Canada First Research Excellence Fund
