Research Infrastructure

Broken links are a common frustration when navigating the web. If you’ve ever clicked on a reference only to land on a “404 Not Found” page, you know how difficult it can be to track down missing content. In research, broken links pose a serious challenge to reproducibility—if critical datasets, software, or methodologies referenced in a study disappear, how can others verify or build upon the original work?

Persistent Identifiers (PIDs) help solve this problem by creating stable, globally unique references to digital and physical objects. Unlike regular URLs, PIDs are designed to outlast the web address of any particular webpage or database, ensuring long-term access to research outputs.

A persistent identifier should identify a resource globally and uniquely, it should persist over time, and ideally it should be resolvable: entering the identifier into a resolver takes you to the thing it references. Perhaps the most successful PID in research is the DOI (Digital Object Identifier), which provides persistent links to published digital objects such as papers and datasets. Other PIDs include ORCiDs, RORs and many others, either in use or being proposed.

We can break identifiers into two basic groups – identifiers created by assignment, and identifiers that are derived. Once we have the identifier we have options on how they can be resolved.

Identifiers by assignment

Most research identifiers in use are assigned by a governing body. An identifier is minted (created) for the object it identifies and is added to a metadata record containing additional information describing that object. For example, researchers can be identified with an ORCiD identifier, a randomly generated 16-digit number. Metadata associated with the ORCiD value includes the name of the individual it references as well as information such as their affiliations and publications.
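Although the digits of an ORCiD are randomly assigned, the final character is not arbitrary: it is a check character computed from the first 15 digits using the ISO 7064 MOD 11-2 algorithm that ORCiD documents for its identifiers. A minimal sketch (the example iD is the widely cited sample identifier 0000-0002-1825-0097):

```python
def orcid_check_character(digits: str) -> str:
    # ISO 7064 MOD 11-2 over the first 15 digits of the iD
    # (hyphens ignored); returns the 16th character, "X" for ten.
    total = 0
    for ch in digits.replace("-", ""):
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

# The widely cited example iD 0000-0002-1825-0097 ends in the
# check character this computes from its first 15 digits:
print(orcid_check_character("0000-0002-1825-009"))  # 7
```

The check character lets software catch mistyped iDs before ever consulting the ORCiD registry, but it says nothing about who the iD belongs to; that connection lives only in the registry's metadata.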

We expect the governance body in charge of an identifier to ensure that identifiers are globally unique, and to maintain them and their associated metadata for years to come. If an adversary (or someone by mistake) altered the metadata of an identifier, they could change its meaning by changing which resource it references. The DOI associated with my significant research publication could be changed to reference a picture of a chihuahua.

An AI generated image of a researcher being directed to a picture of a dog.

Other identifiers such as DOIs, PURLs, ARKs and RORs are also generated by assignment and connected to the content they identify. The important detail about identifiers by assignment is that if you find something (a person, an organism, or anything else) you cannot determine which identifier it has been assigned unless you can match it to the metadata of an existing identifier. These assigned identifiers aren't resilient; they depend on the maintenance of the identifier documentation. If the organization(s) operating these identifiers go away, so do the identifiers. All research that used them to identify specific objects is left with the research equivalent of the web's broken link.

We also have organizations that create and assign identifiers for physical objects. For example, the American Type Culture Collection (ATCC) mints identifiers, maintains metadata, and stores the canonical, physical cell lines and microorganism cultures. If the samples are lost, or if ATCC can no longer maintain the metadata of the identifiers, then the identifiers lose their connection. Yes, there will be E. coli DH5α cells in labs around the world, but cell lines drift and microorganisms growing in labs mutate, which is precisely the challenge the ATCC was created to address.

To help record which identifier has been assigned to a digital object you can include the identifier inside the object, and indeed this is done for convenience. Published journal articles will often include the DOI in the footer of the document, but this is not authoritative. Anyone could publish a document and add a fake DOI; only the metadata maintained by the DOI organization can be considered authentic and authoritative.

Identifiers by derivation

Derived identifiers are those where the content of the identifier is derived from the object itself. In chemistry, for example, IUPAC provides naming conventions to systematically identify chemical compounds. While (6E,13E)-18-bromo-12-butyl-11-chloro-4,8-diethyl-5-hydroxy-15-methoxytricosa-6,13-dien-19-yne-3,9-dione does not roll off the tongue, anyone given the chemical structure would derive the same identifier. Digital objects can use an analogous method to calculate identifiers: hashing functions reproducibly generate the same identifier (digest) for a given digital input. The md5 checksums sometimes provided alongside data for download are an example of digests produced by a hashing function.
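The key property is reproducibility: the same bytes always yield the same digest, so anyone holding a copy of the file can recompute its identifier independently. A small sketch using Python's standard library (the dataset contents are made up for illustration):

```python
import hashlib

# A digest is derived from the bytes of the object itself.
data = b"plot,yield\nA1,3.2\nA2,2.9\n"  # hypothetical dataset contents

md5_digest = hashlib.md5(data).hexdigest()        # as in download checksums
sha256_digest = hashlib.sha256(data).hexdigest()  # a stronger modern choice

# Recomputing from an identical copy gives an identical digest,
# so the identifier can be re-derived without consulting any registry.
assert hashlib.sha256(b"plot,yield\nA1,3.2\nA2,2.9\n").hexdigest() == sha256_digest

print(md5_digest)
print(sha256_digest)
```

Note that changing even one byte of the input produces a completely different digest, which is what makes digests useful for detecting altered copies.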

An IUPAC name is bi-directional: given the structure you can determine the identifier, and vice versa. A digest is a one-way function; from the identifier you cannot go back and calculate the original content. This one-way feature makes digests useful for identifying sensitive objects such as personal health datasets.

With derived identifiers you have globally unique and secure identifiers which persist indefinitely, but this still depends on the authority and authenticity of the method used to derive them. IUPAC naming rules are authoritative when you can verify that the rules you follow come directly from IUPAC (e.g. someone didn't add their own rules to the set and claim they are IUPAC rules). Hashing functions are likewise calculated according to precisely specified algorithms, which are typically published widely in the scientific literature, by standards bodies, in public code repositories and in function libraries. An important point about derived identifiers is that you can always verify them for yourself by comparing the claimed identifier against the derived value. You can never verify an assigned identifier; the authoritative table maintained by the identifier's governance body is the only source of truth.

The Semantic Engine generates schemas where each component of the schema is given a derived identifier. These identifiers are inserted directly into the content of the schema (as self-addressing identifiers, or SAIDs) where they can be verified. The schema is thus a self-addressing document (SAD) which contains the SAID.
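The trick that makes a self-addressing document possible is to compute the digest over the document with the identifier field held as a fixed placeholder, then insert the result. The sketch below is a simplified illustration only: real SAIDs follow the CESR encoding rules of the KERI/ACDC specifications, whereas this sketch substitutes a plain SHA-256 hex digest and JSON serialization to show the idea.

```python
import hashlib
import json

def compute_said(doc: dict, field: str = "d") -> str:
    # Replace the identifier field with a fixed-length placeholder,
    # then derive the digest over the serialized document.
    temp = dict(doc)
    temp[field] = "#" * 64
    serialized = json.dumps(temp, sort_keys=True).encode()
    return hashlib.sha256(serialized).hexdigest()

def saidify(doc: dict) -> dict:
    # Insert the derived identifier into the document itself,
    # producing a self-addressing document (SAD).
    out = dict(doc)
    out["d"] = compute_said(out)
    return out

def verify_said(doc: dict) -> bool:
    # Anyone can recompute the digest and compare it to the embedded SAID.
    return doc.get("d") == compute_said(doc)

# A hypothetical schema fragment:
schema = {"d": "", "title": "Field trial schema", "attributes": ["plot", "yield"]}
sad = saidify(schema)
print(verify_said(sad))  # True: the document verifies against its own SAID
```

Because verification needs nothing but the document and the hashing algorithm, any recipient can confirm the schema is the one the SAID names, with no central registry involved.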

Resolving identifiers

Once you have the identifier, how do you go from identifier to object? How do you resolve the identifier? At a basic level, many of these identifiers require some kind of look-up table. In this table will be the identifier and the corresponding link that points to the object the identifier is referencing. There may be additional metadata in this table (like DOI records which also contain catalogue information about the digital object being referenced), but ultimately there is a look-up table where you can go from identifier to the object. Maintaining the look-up table, and maintaining trust in the look-up table is a key function of any governance body maintaining an identifier.
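At its core, resolution is exactly this table look-up. The sketch below uses a hypothetical DOI-style identifier and URL; real resolvers such as doi.org maintain these tables at scale, with governance processes around every change.

```python
# A minimal resolver: a look-up table from identifier to location.
# The identifier and URL are made up for illustration.
lookup_table = {
    "10.1000/example.12345": "https://publisher.example.org/articles/12345",
}

def resolve(identifier: str) -> str:
    # Go from identifier to the object's location, or fail loudly
    # (the research equivalent of a 404 for an unknown identifier).
    if identifier not in lookup_table:
        raise KeyError(f"Unresolvable identifier: {identifier}")
    return lookup_table[identifier]

print(resolve("10.1000/example.12345"))
```

Everything the rest of this post discusses, from governance to blockchains, is about who maintains this table and why its contents can be trusted.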

For traditional digital identifiers, the content of the identifier look-up table is either curated and controlled by the governance body (or bodies) of the identifier (e.g. DOI, PURL, ARK), or contributed by the community or individual (ORCiD, ROR). In all cases, though, we must depend on the governance body of the identifier to maintain its look-up tables. If the look-up table were deleted or lost we could not easily recreate it (imagine going through every journal article to find the printed DOI and insert it back into the look-up table). If an adversary (or someone by mistake) altered the look-up table, they could change the meaning of an identifier by changing what resource it points to. The only way to detect or correct this is to depend on the systems the identifier's governance body has in place to find mistakes, undo changes and reset the identifier.

Having trust in a governance authority to maintain the integrity of the look-up table is efficient but requires complete trust in the governance authority. It is also brittle in that if the organization cannot maintain the look-up table the identifiers lose resolution. The original idea of blockchain was to address the challenge of trust and resilience in data management. The goal was to maintain data (the look-up table) in a way that all parties could access and verify its contents, with a transparent record of who made changes and when. Instead of relying on a central server controlled by a governance body (which had complete and total responsibility over the content of the look-up table), blockchain distributes copies of the look-up table to all members. It operates under predefined rules for making changes, coupled with mechanisms that make it extremely difficult to retroactively alter the record.

Over time, various blockchain models have emerged, ranging from fully public systems, such as Bitcoin's energy-intensive Proof of Work approach, to more efficient Proof of Stake systems, as well as private and hybrid public-private blockchains designed for specific use cases. Each variation aims to balance transparency, security, efficiency, and accessibility depending on the application. Many Decentralized Identifiers (DIDs, a W3C standard) use blockchains to distribute their look-up tables, ensuring resiliency and building trust because many eyes can observe changes to the look-up table (and updates to it are cryptographically controlled). Currently no research identifiers use a DID method or blockchain to ensure the provenance and resiliency of identifiers.

With assigned identifiers only the organization that created the identifier can be the authoritative source of the look-up table. Only the organization holds the authoritative information linking the content to the identifier. For derived identifiers the situation is different. Anyone can create a look-up table to point to the digital resource because anyone can verify the content by recalculating the identifier and comparing it to the expected value. In fact, digital resources can be copied and stored in multiple locations and there can be multiple look-up tables pointing to the same object copied in a number of locations. As long as the identifiers are the same as the calculated value they are all authoritative. This can be especially valuable as organizational funding rises and falls and organizations may lose the ability to maintain look-up tables and metadata directories.
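This multi-mirror property can be sketched directly: with a derived identifier, any copy in any repository can be checked against the identifier itself, so no single look-up table is the sole source of truth. The dataset bytes below are hypothetical.

```python
import hashlib

def verify_copy(content: bytes, claimed_digest: str) -> bool:
    # Any holder of a copy can check it against the derived identifier
    # without trusting the repository it came from.
    return hashlib.sha256(content).hexdigest() == claimed_digest

# The identifier is derived once from the original content.
identifier = hashlib.sha256(b"dataset v1").hexdigest()

# Copies retrieved from two different repositories:
print(verify_copy(b"dataset v1", identifier))           # True: faithful copy
print(verify_copy(b"dataset v1 (edited)", identifier))  # False: drifted copy
```

Any mirror whose copy verifies is as authoritative as any other, which is exactly the resilience assigned identifiers cannot offer.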

Summary

Persistent identifiers (PIDs) play a crucial role in ensuring the longevity and accessibility of research outputs by providing stable, globally unique references to digital and physical objects. Unlike traditional web links, which can break over time, PIDs persist through governance and resolution mechanisms. There are two primary types of identifiers: those assigned by an authority (e.g., DOIs, ORCiDs, and RORs) and those derived from an object’s content (e.g., IUPAC chemical names and cryptographic hashes). Assigned identifiers depend on a central governance body to maintain their authenticity, whereas derived identifiers allow independent verification by recalculating the identifier. Regardless of their method of generation, both types of identifiers benefit from resolution services which can either be centralized or decentralized. As research infrastructure evolves, balancing governance, resilience, and accessibility in identifier systems remains a key concern for ensuring long-term reproducibility and trust in scientific data.

Written by Carly Huitema

What does it mean for research to be decentralized? And how does this relate to ADC? I will use Learning Digital Identity (see also this blog post by the author Phil Windley) for definitions as I think it does the best job.

Centralized – Decentralized axis: This axis describes the control of the system. Is the system under the control of a single entity or multiple entities?

Distributed – Co-located axis: This axis describes the location of the system. Are its components in one place, or spread across many locations?

Here are some examples:

Co-Located and Centralized

  • Examples:
    • A standalone server in a data center managed by a single company.
    • A single-location biobank storing samples under the control of one institution.
  • Characteristics:
    • All components are physically or logically co-located.
    • Single entity has full control, making coordination simpler.

 

Co-Located and Decentralized

  • Examples:
    • A shared core facility in a research institution, such as a microscopy center used by multiple research groups.
    • Multi-department access to a centralized HPC cluster for computations.
  • Characteristics:
    • Components are co-located physically or logically.
    • Coordination requires agreements or standards between entities.

 

Distributed and Centralized

  • Examples:
    • A research network where one institution operates a centralized database pulling data from distributed studies.
    • A cloud platform like AWS or Google Cloud, where distributed servers across the globe are controlled by a single organization.
    • A national climate research project with distributed weather stations feeding data into a central repository.
  • Characteristics:
    • Components are geographically or logically distributed.
    • Centralized control simplifies governance but requires robust coordination tools.

 

Distributed and Decentralized

  • Examples:
    • The internet: Nodes (servers, routers, ISPs) are controlled by different entities but work together using standardized protocols.
    • Blockchain networks like Bitcoin or Ethereum, where no single entity controls all nodes.
    • International collaborations like the Large Hadron Collider (LHC), where experiments are distributed across institutions worldwide and no single entity has full control.
  • Characteristics:
    • Infrastructure and activities are distributed among many entities.
    • Coordination is achieved through agreements, protocols, and shared governance, often requiring significant effort to maintain interoperability.
Control vs location for system architectures.

Implications

In general, many aspects of research are distributed and decentralized. Health data often comes with rules about where it can be stored and shared, ensuring it remains co-located. Different researchers work together on projects without a central governance authority that can dictate the work of its members. Centralization and co-location of resources can be very useful, with instances of groups coming together to share collective benefits. However, even these nodes of centralization will not join a single platform for all research data run by a single organization. Imagine how successful the World Wide Web would have been if the design had called for a single server to host all the webpages and a single governance authority to coordinate and approve all the work.

At Agri-food Data Canada we recognize the reality of research: it needs significant coordination through shared standards and protocols. Indeed, the FAIR principles (that digital objects be more Findable, Accessible, Interoperable and Reusable) support more efficient and better decentralized and distributed systems. ADC is supporting the development of better standards and protocols, and improving decentralized and distributed research, by following the FAIR principles and producing tools such as the Semantic Engine. The Semantic Engine helps users write better, machine-readable schemas that can be used across the ecosystem by any participant, helping to maintain data interoperability and contributing to data reuse.

The Semantic Engine also embeds the use of digests into its research data objects. These digests (specifically, Self Addressing IDentifiers or SAIDs) are an important feature of a decentralized and distributed research data ecosystem because no central authority controls the issuance of SAIDs and they can be used throughout the entire research ecosystem while maintaining their meaning. Read more about this in an upcoming blog post!

 

Written by Carly Huitema

Part of the blog series on Collaborative Research IT Infrastructure

In our last post, we talked about the problems with researchers ‘DIYing’ their IT infrastructure; today we’ll explore a few benefits that researchers and institutions can gain by moving to a shared IT infrastructure.

As research grows increasingly data-driven, the challenges of managing IT infrastructure independently are becoming harder to ignore. Moving to a shared research compute and storage system offers a smarter alternative—one that delivers tangible benefits for researchers and institutions alike.

One of the most immediate and impactful advantages of shared infrastructure is the significant cost savings it provides. When research teams maintain individual systems, the result is often a costly duplication of servers, storage solutions, and software licenses, all of which could be streamlined. Consolidating these resources into a shared infrastructure eliminates redundancy, maximizing efficiency and freeing up funds for other priorities. Institutions benefit further from economies of scale, leveraging bulk purchasing power for hardware, software, and support contracts to secure better pricing and reduce per-unit costs.

Beyond acquisition, shared systems also optimize resource utilization. Individual servers in isolated labs often sit underutilized, wasting potential. Shared infrastructure ensures resources such as computing power and storage are dynamically allocated, meeting real-time demands and avoiding waste. Maintenance becomes simpler and more cost-effective, with IT teams focusing their efforts on managing one cohesive system instead of troubleshooting fragmented setups. This professional oversight reduces costly downtime, ensuring researchers face fewer disruptions and can work more efficiently.

The financial advantages of shared infrastructure extend to scalability. Research demands are rarely static, and shared systems allow institutions to scale resources up or down cost-effectively. Instead of requiring research teams to over-purchase hardware or anticipate growth inaccurately, shared infrastructure adapts seamlessly to evolving needs, accommodating projects of any size without costly piecemeal expansions.

While cost savings are crucial, shared infrastructure also protects institutional investments by ensuring compliance with essential guidelines such as the Canadian Federal Government’s National Security Guidelines for Research Partnerships Risk Assessment Form and the Government of Canada’s Sensitive Technology Research and Affiliations of Concern policy. These guidelines demand rigorous safeguards for sensitive research data, international collaborations, and access control. A unified system streamlines compliance by applying consistent security protocols across the board, simplifying risk assessments, and ensuring adherence to regulatory requirements. Proactive monitoring within a shared infrastructure mitigates risks such as unauthorized access or affiliations of concern, safeguarding research integrity while protecting the institution’s reputation and funding eligibility.

In addition to enhancing security and compliance, shared systems amplify research potential by making cutting-edge tools more accessible. Advanced technologies for data analysis, machine learning, and computational modeling—often cost-prohibitive for individual teams—become available through pooled resources. This accessibility fosters innovation, attracts top talent, and strengthens the institution’s competitiveness for grants and partnerships, further justifying the investment in a shared system.

Ultimately, transitioning to a shared infrastructure isn’t just about reducing costs or addressing security concerns; it’s about creating an environment where researchers can focus on advancing knowledge without being bogged down by IT challenges. A shared, professionally managed system enables institutions to optimize resources, ensure compliance, and support groundbreaking research, positioning them to thrive in a rapidly evolving academic and technological landscape.

Stay tuned for our next post, where we’ll debunk the myths around research autonomy in a shared system.

Written by

Lucas Alcantara

Featured picture generated by Pixlr

Part of the blog series on Collaborative Research IT Infrastructure

In our first post, we discussed the challenges universities face with fragmented IT systems and the need for unified solutions. Here, we explore the specific issues that arise when research teams independently manage their IT infrastructure.

Research IT infrastructure often develops in response to immediate needs, resulting in uncoordinated solutions across departments. This patchwork approach limits interoperability and visibility, isolating teams and stifling collaboration opportunities.

Researchers excel in their fields, but managing IT demands specialized knowledge. Critical tasks like ensuring security, applying updates, and maintaining backups are often overlooked. This leads to vulnerabilities, data loss, and project delays that research teams are ill-equipped to address.

Modern research generates data at an unprecedented scale. Storage solutions patched together to meet short-term needs often become too expensive, insecure, or simply inadequate as projects grow. These limitations are particularly evident when researchers need to share data with collaborators. Consider a scenario where a researcher has collected a large dataset on a personal server but needs to provide access to an external collaborator. Material Transfer Agreements (MTAs) may require that the data remain on a university-approved system. However, the server’s setup lacks the robust infrastructure to facilitate secure sharing, such as proper user management or a dedicated interface for external access.

To overcome these limitations, the researcher might use built-in services to expose their data over the internet. While convenient, such solutions often bypass institutional IT controls and lack the comprehensive security measures necessary for protecting sensitive research data. From an IT perspective, this approach introduces risks: the server operates as an isolated node with unknown management practices, limited auditing capabilities, and no integration into the broader university security ecosystem. Additionally, if the server is connected to the university’s local network, any vulnerabilities could potentially extend beyond the server to the institution’s wider infrastructure.

Alternatively, the researcher might attempt to migrate their dataset to an approved system within the university, only to face challenges such as delays in provisioning, insufficient computational resources, or incompatible tools. These barriers can slow research progress and, in some cases, discourage collaboration altogether.

The risks of managing IT infrastructure without institutional support become even more pronounced when researchers receive confidential data from collaborators under MTAs. These agreements often include strict provisions to safeguard sensitive research data, with severe legal and reputational consequences for non-compliance. If a researcher stores such data on a makeshift server with inadequate security controls, the liability extends beyond the research group to the entire department or university.

In the event of a breach—whether due to outdated software, weak access controls, or insufficient monitoring—the fallout can be catastrophic. Confidential data leakage not only jeopardizes the integrity of the research but also exposes the institution to potential lawsuits, loss of trust, and damage to its reputation. Furthermore, responding to a breach requires significant resources, from forensic investigations to notifying affected parties and implementing remediation measures, all of which could have been avoided with a more secure and compliant infrastructure.

Students and early-career researchers often shoulder the burden of IT management. While this can foster technical skill development, it frequently distracts from their primary academic goals. As these systems grow in complexity, maintaining them consumes valuable time that could be spent on research. Moreover, when these individuals leave, institutional knowledge is often lost, perpetuating inefficiencies.

Isolated IT systems hinder collaboration and drive costs up due to duplicated efforts. Shared IT infrastructure allows universities to pool resources, reduce expenses, and enhance capacity. Institutions with robust, scalable IT systems are better positioned to secure funding and partnerships, enabling them to tackle complex challenges and remain competitive.

Managing IT independently may seem like a quick way to move ahead, but it often leads to inefficiency, security vulnerabilities, and reduced productivity. All of these hinder collaboration and jeopardize the integrity of the research environment.

 

Looking Ahead: Toward a Smarter Solution

 

Shifting to shared, collaborative IT infrastructure offers a sustainable path forward. Consolidating resources can reduce costs, improve security, and provide scalability. More importantly, it improves data management, data sharing, and creates a foundation for advanced tools like machine learning and large-scale data analysis. A well-managed infrastructure empowers researchers to focus on advancing knowledge while fostering collaboration and innovation.

Stay tuned for our next post, where we’ll explore these benefits in detail.

 

Written by Lucas Alcantara

Featured picture generated by Pixlr

In today’s data-driven research environment, universities face a growing challenge: while researchers excel at pushing the boundaries of knowledge, they often face challenges managing the technology that supports their work.

Many university research teams still operate on isolated, improvised systems for computing and data storage—servers tucked in closets or offices, ad-hoc storage solutions, and no consistent approach to backups or security. These isolated systems may meet immediate needs but often create inefficiencies, security risks, and lost opportunities for collaboration and innovation.

In this blog series, we’ll explore how university research teams can benefit by identifying what they hold in common and consolidating their IT infrastructure. A unified system, professionally managed by a dedicated research IT team, brings enhanced security, greater scalability, improved collaboration, and increased efficiency to researchers, allowing them to focus on discovery, not IT overhead. We’ll break down the benefits of this shift and how it can help research institutions thrive in today’s data-intensive landscape. Specifically, we will describe how a Collaborative Research IT Infrastructure might help University of Guelph researchers meet their IT needs while freeing up researchers’ time to focus on their core research objectives rather than the underlying IT infrastructure.

What You Can Expect from This Series

 

Post 1: The Problem with Doing It All Yourself

We’ll kick off by examining the issues that arise when research teams manage their own IT infrastructure—uncoordinated systems, security vulnerabilities, inefficient storage, and the burden of maintaining it all. We’ll explore the risks and costs of decentralized research IT infrastructure and the toll it takes on research productivity.

Posted on Nov 29 and available here.

Post 2: Why a Shared Infrastructure Makes Sense

Next, we’ll explore the advantages of moving to a shared research compute and storage system. From cost savings to enhanced security and easier scalability, we’ll show how a well-managed, shared resource pool can transform the way researchers handle data, computations, and infrastructure, giving them access to state-of-the-art tools and adding scalability by leveraging idle capacity from other research groups.

Post 3: Debunking the Myths: Research Autonomy in a Shared System

A common concern is that adopting a shared infrastructure means losing control. In this post, we’ll discuss how a collaborative system can actually increase flexibility, offering tailored environments for different research needs, while freeing researchers from the technical burdens of IT management. We’ll also explore how it fosters easier collaboration across departments and institutions.

Post 4: The Benefits of Shared Storage

Research generates vast amounts of data, and managing it efficiently is key to success. This post will look at how shared storage solutions offer more than just space—providing reliable backups, cost-effective scaling, and multiple storage tiers to meet various research needs, from active datasets to long-term archives.

Post 5: Scaling for the Future: Building a System That Grows with Your Research

As research projects evolve, so do their IT demands. This post will highlight how shared infrastructure offers scalability and adaptability, ensuring that universities can support growing data and computational needs. We’ll also discuss how investing in shared systems today sets universities up to leverage future advancements in research computing.

Post 6: Transitioning to a Shared System: Key Considerations

In our final post, we’ll discuss key considerations for the University of Guelph to explore the move to a shared research compute and storage system. We’ll look at the importance of securing sustainable funding, fostering consensus across departments, and navigating shared governance to ensure all voices are heard. Additionally, we’ll examine how existing organizational structures influence the establishment of dedicated roles for managing this infrastructure. This discussion aims to highlight the factors that can guide a smooth transition toward a collaborative research IT environment.

 

The Case for Change

 

By the end of this series, you’ll have a clear understanding of why shared research infrastructure is the future for universities. We’ll show that this approach isn’t just about technology—it’s about improving collaboration, safeguarding data, and ultimately empowering researchers to focus on what really matters: driving innovation. Join us as we explore the journey from siloed systems to shared success.

 

Written by Lucas Alcantara

Featured picture generated by Pixlr