Identifiers

On November 19th, Carly Huitema presented (YouTube link) at the Trust over IP 5-Year Symposium on emerging opportunities to use ORCID as a trust registry to help build secure, verifiable research data spaces. As research becomes increasingly digital and distributed, identity plays a central role in how data is created, shared, and validated. ORCID is already foundational to this ecosystem, and new standards such as Decentralized Identifiers (DIDs) open the door to extending its function beyond its original scope.

ORCID’s Existing Role in Trust and Attribution

For more than a decade, ORCID has provided the research community with a simple but essential service: a persistent, globally unique identifier for every researcher. ORCID IDs sit at the centre of scholarly workflows, connecting people to publications, datasets, affiliations, peer reviews, grants, and contributions.

ORCID IDs reduce ambiguity, streamline reporting, and support better attribution. Because ORCID records are self-maintained, researchers can manage their own scholarly identities while benefiting from integrations with publishers, repositories, and institutions.

One important feature is ORCID’s use of trust signals. When an external organization—such as a university or publisher—adds data to a researcher’s ORCID record, a green check mark appears. This indicates that the information was supplied by a trusted, authenticated source rather than by the researcher alone. These verified entries help create a more authoritative identity record, making ORCID a dependable reference point throughout the research landscape.

Why Bring Decentralized Identifiers Into the Picture?

The W3C’s Decentralized Identifier (DID) standard introduces cryptographically verifiable identifiers designed for secure digital interactions. Unlike traditional identifiers, a DID resolves to a DID Document, which contains:

  • Authentication keys — for proving control of the identifier.

  • Assertion keys — for signing statements such as dataset provenance records.

  • Key agreement keys — for establishing encrypted communication channels.

  • Service endpoints — for interacting with the DID subject or associated services.
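
As a rough illustration of the list above, here is a minimal sketch of what a DID Document might look like when loaded into Python. The property names (verificationMethod, authentication, assertionMethod, keyAgreement, service) come from the W3C DID specification; the DID itself, the key values, and the endpoint URL are made-up placeholders, and the helper keys_for is a hypothetical convenience function, not part of any library.

```python
# Illustrative only: the DID, key material, and endpoint are placeholders.
DID = "did:example:123456789abcdefghi"

did_document = {
    "@context": ["https://www.w3.org/ns/did/v1"],
    "id": DID,
    # Public keys are declared once here...
    "verificationMethod": [
        {
            "id": f"{DID}#key-1",
            "type": "Ed25519VerificationKey2020",
            "controller": DID,
            "publicKeyMultibase": "z-placeholder-signing-key",
        },
        {
            "id": f"{DID}#key-2",
            "type": "X25519KeyAgreementKey2020",
            "controller": DID,
            "publicKeyMultibase": "z-placeholder-agreement-key",
        },
    ],
    # ...and then referenced by purpose.
    "authentication": [f"{DID}#key-1"],   # proving control of the identifier
    "assertionMethod": [f"{DID}#key-1"],  # signing statements
    "keyAgreement": [f"{DID}#key-2"],     # setting up encrypted channels
    "service": [
        {
            "id": f"{DID}#data",
            "type": "ExampleDataService",  # hypothetical service type
            "serviceEndpoint": "https://data.example.org/",
        }
    ],
}


def keys_for(document: dict, purpose: str) -> list[dict]:
    """Return the keys referenced for a purpose such as 'authentication'.

    This sketch only handles keys referenced by ID, not keys embedded inline.
    """
    by_id = {method["id"]: method for method in document["verificationMethod"]}
    return [by_id[ref] for ref in document.get(purpose, [])]


signing_keys = keys_for(did_document, "assertionMethod")
print([key["id"] for key in signing_keys])
```

Separating the key declarations from the purposes they serve is what lets one identifier carry different keys for login, signing, and encryption.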

The underlying storage of DID data varies depending on the DID method. It may be anchored in distributed ledgers, file systems, websites, or other registries. More than 200 DID methods exist, all interoperable through the same resolution model.

These keys provide practical capabilities:

  • Authenticate into systems using cryptographic proofs.

  • Sign datasets, workflows, and research outputs.

  • Securely transfer data using encrypted channels.

  • Support machine-to-machine operations with verifiable identity.

This aligns closely with the needs of modern research data ecosystems.

Two Proposals: How ORCID Could Support DIDs

Carly’s presentation introduced two possible integration models for combining ORCID with DIDs.

Proposal 1: Researchers List Their DIDs Within ORCID

The simplest model is allowing researchers to store one or more DIDs in their ORCID profile. ORCID remains the authoritative registry for researcher identity, while DIDs provide a layer of cryptographic capability.

This approach would enable:

Provenance Tracking

Researchers could sign datasets, computational workflows, or experimental logs using their DID keys, enabling verifiable provenance across repositories and platforms.

Authentication

DID authentication keys could be used for secure, passwordless login to research infrastructure—HPC clusters, repositories, cloud notebooks, and more.

Secure Data Transfer

Key agreement keys and service endpoints could support encrypted communication channels for sensitive or controlled-access data.

Building Data Spaces on Existing Keys

Research data spaces could use DID information listed in ORCID as a ready-made trust layer, without requiring a separate identity infrastructure.

This model preserves ORCID’s role while extending the capabilities of researcher identifiers.
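
To make Proposal 1 a little more concrete, here is a minimal sketch of a record that pairs an ORCID iD with a list of DIDs. ResearcherRecord and add_did are hypothetical names invented for this example and do not reflect ORCID's actual data model or API. The only check performed is that each DID follows the generic did:method:identifier syntax; the ORCID iD and DIDs shown are placeholders.

```python
import re
from dataclasses import dataclass, field

# Generic DID syntax: "did:" + lowercase method name + ":" + method-specific id.
DID_PATTERN = re.compile(
    r"^did:[a-z0-9]+:"
    r"(?:(?:[A-Za-z0-9._-]|%[0-9A-Fa-f]{2})*:)*"
    r"(?:[A-Za-z0-9._-]|%[0-9A-Fa-f]{2})+$"
)


@dataclass
class ResearcherRecord:
    """Hypothetical record: one ORCID iD with zero or more listed DIDs."""

    orcid_id: str
    dids: list[str] = field(default_factory=list)

    def add_did(self, did: str) -> None:
        """List a DID on the record after a basic syntax check."""
        if not DID_PATTERN.match(did):
            raise ValueError(f"not a syntactically valid DID: {did!r}")
        if did not in self.dids:  # avoid listing the same DID twice
            self.dids.append(did)


record = ResearcherRecord(orcid_id="0000-0002-1825-0097")  # placeholder iD
record.add_did("did:example:researcher-signing-identity")
record.add_did("did:webvh:placeholder-scid:example.org")
print(record.dids)
```

A syntax check like this says nothing about who controls the DID. In practice, a registry would also want proof that the researcher holds the corresponding keys before listing it.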

Proposal 2: ORCID Hosts Full DID Infrastructure

A second, more ambitious option is for ORCID to operate DID infrastructure directly. In this model, ORCID could:

  • Allow researchers to create DIDs in addition to listing them.

  • Issue DIDs using DID methods suited for web publication (e.g., did:webvh, did:webs).

  • Provide self-certifying identifiers with key pre-rotation support.

  • Maintain verifiable histories and cryptographic proofs of key and document changes.

  • Support witnesses or watchers that monitor DID updates.

  • Publish DID Documents via the ORCID registry for portability and persistence.

This would bootstrap more secure research data spaces by giving the entire ecosystem access to standardized, interoperable, verifiable researcher identifiers backed by ORCID’s trust framework.

This technology is freely available as open source tooling, either from the Government of British Columbia (did:webvh) or the KERI Foundation (did:webs), and could be hosted by any organization (including ADC). The advantage of DIDs is that they are decentralized, can be operated by many different entities, and can even be used in combination for multi-key security.

Toward a More Verifiable Research Ecosystem

ORCID already plays a central role in identity and attribution. Adding support for Decentralized Identifiers, whether simply listed in profiles or fully hosted by ORCID, ADC, or any other research data space, would expand its capabilities to include authentication, digital signatures, secure communication, and cryptographically verifiable provenance.

As research data spaces continue to develop, these capabilities become essential. ORCID’s existing trust signals, combined with DID-based cryptographic assurance, could form a powerful foundation for next-generation research infrastructure—linking people, systems, and data through verifiable, interoperable digital identity.

For more about these technologies, check out the short educational videos at Bite Size Trust.

Written by Carly Huitema

At Agri-food Data Canada (ADC), we often emphasize the importance of content-derived identifiers—unique fingerprints generated from the actual content of a resource. These identifiers are especially valuable in research and data analysis, where reproducibility and long-term verification are essential. When you cite a resource using a derived identifier, such as a digest of source code, you’re ensuring that years down the line, anyone can confirm that the referenced material hasn’t changed.

One of the best tools for managing versioned documents, especially code, is GitHub. Not only does GitHub make it easy to track changes over time, but Git, the version control system it is built on, also automatically generates a derived identifier every time you commit your work.

What Is a GitHub Commit Digest?

Every time you make a commit (i.e., save a snapshot of your code or document), Git creates a SHA-1 digest of that commit, and GitHub displays it alongside the commit. This digest is a unique identifier derived from the content and metadata of the commit. It acts as a cryptographic fingerprint that ensures the integrity of the data.

Here’s what goes into creating a GitHub commit digest:

  • Snapshot of the File Tree: Includes all file names, contents, and directory structure.
  • Parent Commit(s): References to previous commits, which help maintain the history.
  • Author Information: Name, email, and timestamp of the person who wrote the code.
  • Committer Information: May differ from the author; includes who actually committed the change.
  • Commit Message: The message describing the change.

All of this is bundled together and run through the SHA-1 hashing algorithm, producing a 40-character hexadecimal string like:

e68f7d3c9a4b8f1e2c3d4a5b6f7e8d9c0a1b2c3d

GitHub typically displays only the first 7–8 characters of this digest (e.g., e68f7d3c), which is usually enough to uniquely identify a commit within a repository.
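
For readers curious how the ingredients listed above turn into a digest, here is a minimal sketch that reproduces Git's object hashing in Python. It assumes a repository using Git's default SHA-1 object format. The author details, timestamp, and commit message are placeholders, and git_object_id and commit_body are names chosen for this example.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash an object the way Git does: '<type> <size>\\0' header plus the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree: str, parents: list[str], author: str,
                committer: str, message: str) -> bytes:
    """Assemble the text of a commit object from its parts."""
    lines = [f"tree {tree}"]                      # snapshot of the file tree
    lines += [f"parent {p}" for p in parents]     # zero or more parent commits
    lines.append(f"author {author}")              # name, email, timestamp
    lines.append(f"committer {committer}")        # may differ from the author
    return ("\n".join(lines) + "\n\n" + message).encode()


# The tree of an empty repository snapshot (a tree object with no entries).
EMPTY_TREE = git_object_id("tree", b"")

# Placeholder identity: name <email> unix-timestamp timezone.
who = "Example Author <author@example.org> 1700000000 +0000"

commit_id = git_object_id(
    "commit",
    commit_body(EMPTY_TREE, [], who, who, "Initial commit\n"),
)
print(commit_id)      # full 40-character digest
print(commit_id[:7])  # abbreviated form
```

Because the parent commit digests are part of the hashed text, changing anything in an earlier commit changes the digest of every commit that follows it.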

Why You Should Reference Commit Digests

When citing code or documents stored on GitHub, always include the commit digest. This practice ensures that:

  • Your references are precise and verifiable.
  • Others can reproduce your work exactly as you did.
  • The cited material remains unchanged and trustworthy, even years later.

Whether you’re publishing a paper, sharing an analysis, or collaborating on a project, referencing the commit digest helps maintain transparency and reproducibility, promoting FAIR research.
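
One practical way to include the digest in a citation is to link to the file at that exact commit. The sketch below builds such a link using GitHub's blob URL layout. The owner, repository, and file path are hypothetical placeholders, the digest is the example string shown earlier, and commit_permalink is a name chosen for this example.

```python
import re

FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def commit_permalink(owner: str, repo: str, digest: str, path: str) -> str:
    """Build a link to a file as it existed at one specific commit."""
    if not FULL_SHA1.fullmatch(digest):
        # Insist on the full digest: abbreviations can become ambiguous
        # as a repository grows.
        raise ValueError("expected a full 40-character hexadecimal commit digest")
    return f"https://github.com/{owner}/{repo}/blob/{digest}/{path}"


link = commit_permalink(
    owner="example-org",            # placeholder
    repo="example-repo",            # placeholder
    digest="e68f7d3c9a4b8f1e2c3d4a5b6f7e8d9c0a1b2c3d",
    path="analysis/model.py",       # placeholder
)
print(link)
```

Unlike a link to a branch name, which follows the branch as it changes, a link containing the commit digest always points to the same content.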

Final Thoughts

GitHub’s built-in support for derived identifiers makes it a powerful platform for version control and long-term citation. By simply noting the commit digest when referencing code or documents, you’re contributing to a more robust and verifiable research ecosystem.

So next time you cite GitHub work, take a moment to copy that digest. It’s a small step that makes a big difference in the integrity of your research.

Written by Carly Huitema

In previous blog posts, we’ve discussed identifiers—specifically, derived identifiers, which are calculated directly from the digital content they represent. The key advantage of a derived identifier is that anyone can verify that the cited content is exactly what was intended. When you use a derived identifier, it ensures that the digital resource is authentic, no matter where it appears.
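
Here is a minimal sketch of that verification step. It assumes SHA-256 as the digest algorithm, which is one common choice rather than the only one, and the function names and sample content are illustrative.

```python
import hashlib


def derive_identifier(content: bytes) -> str:
    """Compute a derived identifier: here, the SHA-256 digest of the content."""
    return hashlib.sha256(content).hexdigest()


def matches(content: bytes, identifier: str) -> bool:
    """Check that the content in hand is what the identifier refers to."""
    return derive_identifier(content) == identifier


original = b"sample_id,yield_kg\nA1,42\nA2,37\n"   # placeholder dataset
cited_identifier = derive_identifier(original)

# Anyone holding a copy can recompute the digest, wherever the copy came from.
print(matches(original, cited_identifier))

# Even a one-character change no longer matches the cited identifier.
altered = original.replace(b"42", b"43")
print(matches(altered, cited_identifier))
```

No trusted service is involved in the check: the identifier and the content are all a verifier needs.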

In contrast, authoritative identifiers work differently. They must be resolved through a trusted service, and you have to rely on that service to ensure the identifier hasn’t been altered and that the target hasn’t changed.

The Limitations of Derived Identifiers

One drawback of derived identifiers is that they only work for content that can be processed to generate a unique digest. Additionally, once an identifier is created, the content can no longer be updated without producing a different identifier. This can be a challenge when dealing with dynamic content, such as an evolving dataset or a standard that goes through multiple versions.

This brings us to the concept of identity, which goes beyond a simple identifier.

What Does Identity Mean?

Let’s take an example. The Global Alliance for Genomics and Health (GA4GH) publishes a data standard called Phenopackets. In this case Phenopackets is an identifier. Currently, there are two released versions (and two identifiers). However, anyone could create a new schema and call it “Phenopackets v3.” The key question is: is just naming something and giving it an identifier enough to have it be recognized as Phenopackets v3?

A name is not enough. What matters is whether GA4GH itself releases “Phenopackets v3”: we care about who endorses the identifier. In this case, identity comes from GA4GH, the governing organization of Phenopackets.

Identity Through Reputation

Identity is established through reputation, which is gained in two main ways:

  1. Transferred reputation – When an official organization (like GA4GH) endorses an identifier, the identity is backed by its authority and reputation.
  2. Acquired reputation – Even without a governing body, something gains identity via reputation if it becomes widely recognized and trusted.

For example, Bitcoin was created by an anonymous person (or group) using the pseudonym Satoshi Nakamoto—a name that doesn’t link to any legal identity which could grant it some reputation. Yet, the name Satoshi Nakamoto has strong identity via acquired reputation because of Bitcoin’s success and widespread recognition.

The key is that identity isn’t just about an identifier—it’s about who assigns it and why people trust it. To fully capture identity, we need to track not only the identifier but also the authority or reputation behind it.

How Do We Use an Identity?

Right now, we don’t have a universal system for identifying and verifying identity in a structured, machine-readable way. Identity combines an identifier with the reputation or authority behind it, and our current identifier systems don’t clearly recognize these two aspects. Instead, we rely on indirect methods, like website URLs and domain names, as stand-ins for the identity authority.

For example, if you want to verify the Phenopackets schema’s identity, you would want to seek out its associated authority. You might search for the Phenopackets name (the identifier) online or follow a link to its official GitHub repository. But how do you know that the GitHub page is legitimate? To confirm, you would check if the official GA4GH website links to it. Otherwise, anyone could create a GitHub repository and name it Phenopackets. The identifier is not enough; you also need to find the authority associated with the identity.
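
To illustrate what a more structured version of this check might look like, here is a small sketch that treats identity as a pair: an identifier plus the authority that stands behind it. Everything here is hypothetical, including the Endorsement type, the version labels, and the idea that an authority publishes such a list. It is a thought experiment, not a description of any existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endorsement:
    """An identifier together with the authority that endorses it."""

    identifier: str
    authority: str


# Hypothetical list of what the authority itself has released.
# The version labels are illustrative.
PUBLISHED_BY_AUTHORITY = {
    Endorsement("Phenopackets v1", "GA4GH"),
    Endorsement("Phenopackets v2", "GA4GH"),
}


def is_endorsed(identifier: str, authority: str,
                published: set[Endorsement] = PUBLISHED_BY_AUTHORITY) -> bool:
    """A name only counts if the expected authority has endorsed it."""
    return Endorsement(identifier, authority) in published


print(is_endorsed("Phenopackets v2", "GA4GH"))
# Anyone can pick the name, but the name alone carries no endorsement.
print(is_endorsed("Phenopackets v3", "GA4GH"))
```

The hard part, which this sketch skips, is knowing that the published list really comes from the authority. Today that is the role played by checking the official website.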

Academic journals are another example of how we present the authority behind an identity. When they publish research, they lend their reputation and peer-review process to the paper, building its reputation and identity. However, this system has flaws. When researchers cite papers, they use DOIs, which are specific identifiers of the journal article. The connection between a publication’s DOI and the identity of the paper is not standardized, which makes it challenging to discover important changes to the paper, such as corrections and retractions. Sometimes when you find the article on the journal webpage you will also find the retraction notice, but this doesn’t always happen. This disconnect between identifiers and identity leads to the proliferation of zombie publications, which continue to be used even after they have been debunked.

Future Directions

As it stands, we lack effective tools for managing digital identity. This gap creates risks, including identity impersonation and difficulties in tracking updates, corrections, or retractions. Because our current citation system focuses on identifiers without strongly linking them to identity, important information can get lost. Efforts are underway to address these challenges, but we’re still in the early stages of finding solutions.

One early technology to address the growth of an identity has been Decentralized Identifiers (DIDs). We’ll talk more about them later, but they allow an identifier to be assigned to an identity that evolves and is provably under the control of an associated governance authority.

We hope this post has helped clarify the distinction between identifiers and identity, which are often entangled, and why finding better ways to assign and verify identity is a problem worth solving.

Written by Carly Huitema