FAIR

Uh-oh, here she goes again!  I’ve been pondering the direction I wanted to take with this post – do I dig more into the data opportunities that may exist in the OAC Annual Reports, or twist my Rubik’s cube to look at a different facet of this ongoing conversation?  Let’s look at another facet.

You may recall that we created a tagline for Agri-food Data Canada (ADC) a few years ago:

Making agri-food data FAIR!

Remember: Findable, Accessible, Interoperable, Reusable.  If you’ve been following this blog, you know we have gone into detail on FAIR and the tools we have created to make data FAIR.  Now the question – which I have posed in the past as well – do we need to qualify WHAT data we make FAIR?  Does someone need to decide which data has more “value” and should be FAIR?  Oh, you know that question will not end well!  So, let’s be objective and take that tagline at face value – agri-food data – whether it was created today or 150+ years ago.   I know this may seem a little counter to what I said a few weeks ago – but I do feel that we need to think about the goal!  Why would we want to make that 150+ year-old data FAIR?  What am I going to use that historical data for?   If we can find it, dig it out, and document the older data, we also need to think about the resources needed to create that FAIR data resource AND whether they outweigh how – and if – the data will be Reused.  Back to that circular conversation again 🙂

So why am I bringing this up yet again?  Because I was reminded this week that there are INDEED projects out there – Canadian projects – that are using older and historic data files.   For instance, check out the GeoREACH Lab at UPEI – a FABULOUS use case for using historical data.

So… back to my original question – is there VALUE in historical ag data?  Should we spend the resources to uncover that hidden data trove of OAC research?

Michelle

The Case for Uploadable Form Data: A More Flexible Approach to Online Submissions

Anyone who has worked extensively with online submission systems will recognize a familiar frustration: you have done the hard work of gathering, drafting, and refining your content – often collaboratively, across multiple documents and tools – and now you face the tedious task of manually copying everything into a web form, field by field. The content is ready; the process of getting it into the system is not.

This challenge comes up repeatedly in our work with Agri-food Data Canada (ADC) and the Climate-Smart Data Collaboration Centre (CS-DCC), and it is not unique to any one platform or domain. It reflects a structural gap in how most online forms are designed: they are built for data entry, not data transfer.

The Collaboration Problem

This tension is particularly well illustrated in the context of data management plans. As noted in a recent article from Upstream:

“Data management planning is often a collaborative process, involving researchers, librarians, and institutional support staff. External tools and shared documents make it easier to iterate on plans, incorporate guidance, and ensure alignment with institutional policies and available resources. When plans are created directly within submission systems, that collaborative process can become more difficult.”

The same dynamic applies across many submission workflows. Researchers, teams, and support staff work best when they can iterate freely in shared documents, draw on existing resources, and incorporate guidance from multiple sources. Forcing that process into a single submission interface introduces friction at exactly the wrong moment – when content is nearly complete and should be easy to finalize.

Why APIs Are Not Always the Answer

The conventional infrastructure response to interoperability problems is to connect systems via APIs. While APIs are powerful and appropriate in many contexts, they come with real constraints. Both parties must be ready and willing to build and maintain the connection. Integration work requires technical resources on both sides. Security and access management become more complex. And the result is a series of point-to-point connections rather than a flexible, open approach that any participant can use.

APIs are well suited for tightly coupled systems with dedicated integration teams. They are less well suited for the diverse, distributed ecosystems that characterize research infrastructure – where institutions, tools, and workflows vary enormously and where not every participant has the capacity to build custom integrations.

A Simpler Proposal: Publishable Schemas and Uploadable Data Files

At ADC, we have been exploring a different approach. The core idea is straightforward: if an online form publishes its expected data structure as a downloadable schema or sample data file, then users can prepare their submissions outside the system – collaboratively, using whatever tools work best for them – and upload a structured file (such as a .json file) when they are ready to submit.

The form receives the uploaded file, validates it against the expected format, and populates the interface with the pre-filled content. The user retains final control, reviewing and editing within the UI before submitting. This preserves the benefits of collaborative, tool-agnostic preparation while keeping the submission process human-centered and editable.
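As a sketch of how a form might handle such an upload (illustrative Python; the field names and schema format here are assumptions, not any particular system’s), the server parses the file, checks it against the published schema, and returns the fields to pre-fill along with any problems for the user to resolve in the UI:

```python
import json

# Hypothetical published schema: expected field names and types.
# A real system would publish this as a downloadable schema file.
SCHEMA = {
    "project_title": str,
    "principal_investigator": str,
    "data_types": list,
}

def load_submission(raw: str) -> dict:
    """Parse an uploaded JSON file and check it against the schema.

    Returns the fields used to pre-fill the form; unknown fields are
    ignored and missing fields are reported so the user can fill them in.
    """
    data = json.loads(raw)
    prefill, problems = {}, []
    for field, expected in SCHEMA.items():
        if field not in data:
            problems.append(f"missing: {field}")
        elif not isinstance(data[field], expected):
            problems.append(f"wrong type: {field}")
        else:
            prefill[field] = data[field]
    return {"prefill": prefill, "problems": problems}

uploaded = '{"project_title": "Soil survey", "principal_investigator": "A. Researcher"}'
result = load_submission(uploaded)
# "data_types" is absent, so it is flagged rather than silently dropped.
print(result["problems"])  # ['missing: data_types']
```

Missing or mistyped fields are surfaced rather than rejected outright, so the user still arrives at an editable, mostly pre-filled form.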

Consider the DMP Assistant, the Alliance’s data management planning tool. Currently, users draft their plans directly in the online interface. Under this model, a researcher could instead work with their existing lab documentation, institutional guidance, and an AI assistant to compile all the relevant information into a structured .json file, then upload it into the DMP Assistant to populate the form in a single step – arriving at the editing stage with a complete draft rather than a blank form.

We Already Do This

This is not a theoretical proposal. ADC already supports this kind of workflow in the Semantic Engine’s schema development tool. We provide a structured prompt that helps users draft their schema content with an AI assistant, then upload the resulting .json file directly into the Semantic Engine editor. The file is parsed, the fields are populated, and the user continues working from there – a draft version that may contain AI errors but offers the opportunity to correct and improve.

The pattern works. It reduces manual data entry, supports collaborative preparation, and lowers the barrier for users who want to work in familiar tools before engaging with a specialized system. We believe it is a model worthy of broader adoption, and one that submission systems of all kinds could implement without requiring significant infrastructure investment from either side.

Written by Carly Huitema

 

Content-Derived Identifiers in the Semantic Engine

Built into the Semantic Engine is a particular kind of identifier called a SAID (Self-Addressing Identifier). Unlike traditional identifiers that are assigned to a resource, SAIDs are derived directly from the content itself. They are computed—typically using cryptographic hashing—so the identifier is intrinsically bound to the exact bytes of the resource it represents.

These identifiers are not designed to be human-friendly. They are long, opaque strings. But that trade-off enables something more important for research and data systems: verification. If a resource is referenced by a SAID, you can independently confirm that what you have is exactly what was intended. If the content changes, the identifier no longer matches. In that sense, SAIDs are tamper-evident and self-authenticating.
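The verification step can be illustrated in a few lines of Python. This is a simplification: real SAIDs use a defined derivation and encoding (so the identifier can even be embedded inside the content it identifies), but the core idea of hashing and re-hashing is the same:

```python
import hashlib

def content_id(data: bytes) -> str:
    # Simplified: derive the identifier directly from the bytes.
    # Real SAIDs use a specific encoding and placeholder technique.
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, identifier: str) -> bool:
    # Anyone can recompute the digest and compare; no registry needed.
    return content_id(data) == identifier

doc = b'{"schema": "field measurements", "version": 1}'
said_like = content_id(doc)
assert verify(doc, said_like)              # unchanged content verifies
assert not verify(doc + b" ", said_like)   # any change breaks the match
```

Trust comes from the content itself: whoever holds the bytes can check the identifier independently.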

Why Identifier Types Matter in Standards

Many specifications—particularly in research data and interoperability frameworks—depend on identifiers and are explicit about what types are allowed. This ensures consistency, portability, and long-term usability across systems.

One commonly accepted class is the URN (Uniform Resource Name). Because URNs are standardized and designed for persistence, they are frequently permitted in specifications where long-lived, location-independent identifiers are required.

IANA and Global Recognition

The Internet Assigned Numbers Authority (IANA) is responsible for coordinating key elements of the internet’s infrastructure, including identifier namespaces. When IANA registers a namespace, it becomes part of the globally recognized technical foundation used across systems and standards.

SAIDs have now been formally registered with IANA as a new URN namespace: urn:said. This elevates them from an ecosystem-specific mechanism to a globally recognized identifier scheme.

URNs vs URLs

A URN identifies what something is, while a URL (Uniform Resource Locator) identifies where something is located.

URNs are not inherently resolvable—you cannot simply use one to retrieve a resource without additional infrastructure. Instead, they are designed to be persistent names that systems can interpret.

SAIDs fit naturally into this model but add an important property: because they are content-derived, they can be independently verified. Anyone can build a resolver that retrieves content and checks whether it matches the SAID. Trust does not depend on the resolver—it depends on the content itself.

Implications for Research Data Infrastructure

The registration of urn:said means that SAIDs can now be used anywhere URNs are accepted. This has direct implications for research data standards and infrastructure.

The Semantic Engine already uses SAIDs to generate secure, tamper-evident identifiers. With official URN recognition, those identifiers can now integrate cleanly into broader ecosystems—supporting interoperability across repositories, metadata standards, and distributed workflows.

This represents a shift in how identifiers function within research systems. Instead of relying solely on assigned names backed by registries, systems can incorporate identifiers that are self-verifying by design. For research data—where integrity, provenance, and reproducibility are central concerns—this provides a stronger and more flexible foundation.

– Written by Carly Huitema

That is the question to ask – when it comes to historical research data.

So – yes, I found some of the original research data collected by OAC researchers back in 1877 – BUT…   do we spend the time and resources to pull it out of the PDFs and make it accessible to the world?  I know I’ve asked this question in a different way in previous posts – but it’s a question that keeps coming up.

There are definitely aspects of this question that need to be objectively reviewed:

  • WHY undertake this venture?  For curiosity, or because it is valuable to you as a researcher?
  • HOW far back do we go? 10 years? 50 years? 150 years?
  • WHERE will you keep this data?  Hmm…  that’s a big question today!
  • WHO will steward this data?

As a data geek, I want to pull this out and steward it – but I have to be practical about it as well.  How much have research technologies changed over this time period?  How valid is this data today?  Let’s think about this in a different way.  Textbooks – when we teach, we update our textbooks on a regular basis since there are new materials to teach and new ways to view and teach them.  I have statistics textbooks going way back – to the 1950s – but I don’t use these to teach; I use the new, updated 2025 texts.  I may read the older texts to get a perspective on why or how things have changed, but I will use the newer texts as resources for my students.

Let’s go back to data.   I have this cool data on weights and feed intakes of animals in the 1870s, but our animals have changed over the past 150+ years.  Is the historical data really of use in today’s active research projects?   Probably not – unless you are a historian.   See how I can always find a way or reason to keep this data?  But really, time and available resources come into play – do we have the time, money, and resources to create and preserve this data, which may or may not be of use?

Oish!  I can talk circles around this!  I believe that everyone will come to a point where they will need to make these types of decisions – but for today’s research data, let’s document it and deposit it into a repository – the data is relevant!  Not like my 150+ yr old data 🙂

Michelle

Imagine this scenario. In her first field season as a principal investigator, a professor watched a graduate student realize—two weeks too late—that no one had recorded soil temperature at the sampling sites. The team had pH, moisture, GPS coordinates… but not the one variable that explained the anomaly in their results. A return trip wasn’t possible. The data gap was permanent.

After that, she changed how her lab collected data.

Instead of relying on ad hoc spreadsheets, she worked with her students to design schemas for their lab’s routine data collection. These weren’t schemas for final data deposit—they were practical structures for the messy, active phase of research. The goal was simple: define in advance what gets collected, how it’s recorded, and which values are allowed.

Researchers can use the Semantic Engine to create schemas that they need for all stages of their research program, from active data collection to final data deposition.

For data collection, once a schema is established, it can be uploaded into the Semantic Engine to generate a Data Entry Excel (DEE) file.

Each DEE contains:

  • A schema description sheet – documentation pulled directly from the schema, including variable definitions and code lists.

  • A data entry sheet – pre-labeled columns that follow the schema rules.

The schema description sheet of a Data Entry Excel.
Data Entry Excel showing the sheet for data entry.

Because the documentation lives in the same file as the data, nothing has to be retyped, reinvented, or remembered from scratch. The schema description sheet also includes code lists that populate the drop-down menus in the data entry sheet, reducing inconsistent terminology and formatting errors.

If the standard schema isn’t sufficient, it can be edited in the Semantic Engine. Researchers can add attributes or adjust fields without rebuilding everything from scratch. The updated schema can then generate a new DEE, preserving previous structure while incorporating the changes.

This approach addresses a common problem: unstructured Excel data. Without standardization, spreadsheets accumulate inconsistent date formats, unit mismatches, ambiguous abbreviations, and missing values. Cleaning that data later is costly and error-prone.

By organizing data entry around a schema:

  • Required information is visible and less likely to be forgotten.

  • Fieldwork becomes more reliable – critical variables are collected the first time.

  • Data from multiple researchers or projects can be harmonized more easily.

  • Manual cleaning and interpretation are reduced.

The generated DEE does not enforce full validation inside Excel (beyond drop-down lists). For formal validation, the completed spreadsheet can be uploaded to the Semantic Engine’s Data Verification tool.
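As a rough sketch of what such verification involves (illustrative Python, not the actual Semantic Engine implementation; the field names and rule format are assumptions), a verifier checks each row against the schema’s required fields and code lists—the same code lists that populate the drop-down menus in the DEE:

```python
# Each rule names whether the field is required and, if applicable,
# its allowed codes, mirroring the drop-down lists in the DEE.
RULES = {
    "site": {"required": True, "codes": {"north", "south", "east"}},
    "soil_temp_c": {"required": True, "codes": None},  # free numeric entry
}

def verify_row(row: dict) -> list:
    """Return a list of problems found in one row of entered data."""
    errors = []
    for field, rule in RULES.items():
        value = row.get(field)
        if value in (None, ""):
            if rule["required"]:
                errors.append(f"{field}: missing")
        elif rule["codes"] is not None and value not in rule["codes"]:
            errors.append(f"{field}: '{value}' not in code list")
    return errors

print(verify_row({"site": "north", "soil_temp_c": "12.5"}))  # []
print(verify_row({"site": "West"}))
# ["site: 'West' not in code list", 'soil_temp_c: missing']
```

Because the rules come from the schema rather than being hard-coded per spreadsheet, the same check applies uniformly across every DEE generated from that schema.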

Using schema-driven Data Entry Excel files turns data structure into a practical research tool. Instead of discovering gaps during analysis, researchers define expectations at the point of collection—when it matters most.

Written by Carly Huitema

Generalist and Specialist Data Repositories

Research data repositories can be described along two important dimensions:

  1. how broad or specialized their scope is, and

  2. where they sit in the research data lifecycle.

Understanding these distinctions helps clarify the technologies and repositories available for research data.

Generalist Repositories

Generalist repositories are designed to accept many different kinds of data across disciplines. They prioritize inclusivity and flexibility, offering a common technical platform where researchers can deposit datasets that do not fit neatly into a single domain.

A useful metaphor is the junk drawer in a kitchen. A junk drawer contains many useful items—batteries, spare cables, elastic bands—but finding a specific item often requires some effort. Similarly, generalist repositories can hold valuable datasets, but those datasets may be described with relatively generic metadata and limited domain-specific structure.

As a result, data in generalist repositories can be:

  • Harder to discover through precise searches

  • More difficult to interpret without additional context

  • Less immediately reusable by domain experts

Examples of generalist repositories include Dataverse (Borealis in Canada), Figshare and OSF.

Specialist Repositories

Specialist repositories focus on a specific discipline, data type, or research community. They typically enforce domain-specific metadata standards, controlled vocabularies, and structured submission requirements.

Continuing the kitchen metaphor, specialist repositories resemble a cutlery drawer: clearly organized, purpose-built, and easy to use—provided you are looking for the right type of item. Knives go in one place, forks in another, and everything has a defined role.

Because of this structure, specialist repositories tend to make data:

  • More findable through precise, domain-aware search

  • Easier to interpret due to consistent metadata

  • More interoperable with related tools and systems

  • More reusable for future research

In other words, data in specialist repositories are often more FAIR than data in generalist repositories. However, this specialization also limits what they can accept. Many interdisciplinary datasets—particularly in agri-food research—do not align cleanly with the strict models of existing specialist repositories and therefore end up in generalist ones. Examples of specialist repositories include Genbank, PDB and GEO.

The Research Data Lifecycle: Active and Archival Data

Another important way to think about data repositories is in relation to the research data lifecycle.

Research data typically move through several phases:

  1. Planning and collection

  2. Processing, active analysis and refinement

  3. Publication and dissemination

  4. Long-term preservation and reuse

Repositories are often designed to support either active data or archival data, but not both equally well.

Active Data

Active data are produced and used during the course of research. They may be incomplete, frequently updated, or subject to access restrictions due to confidentiality, sensitivity, or competitive concerns.

This is the phase where data are still being cleaned, analyzed, and interpreted. Changes are expected, and collaboration is often ongoing. Most formal repositories are not designed to support this stage, which is typically handled through local storage, shared drives, or project-specific platforms.

Archival Data

Once research is complete and results have been published, data generally move into an archival phase. At this point, datasets are more stable, less likely to change, and often less sensitive—especially if they have been anonymized or if concerns about being “scooped” no longer apply.

Most well-known repositories, including Dataverse, Figshare, and domain-specific archives such as the Protein Data Bank (PDB), are designed primarily for archival data. Their strengths lie in long-term preservation, persistent identifiers (PIDs like DOIs), citation, and access, rather than supporting ongoing analysis or frequent updates.

Bridging the Gaps

It would be inefficient to build a highly specialized repository for every possible type of dataset—much like building a kitchen with a separate drawer for every object that might otherwise end up in the junk drawer. Instead, a more scalable approach is to improve the organization and description of data held in generalist repositories.

Agri-food Data Canada’s approach focuses on developing tools, guidance, and training that help researchers add structure and context to their data wherever it is deposited. By enhancing metadata quality and enabling interoperability between repositories, it becomes possible to make data in generalist repositories more FAIR—without requiring a proliferation of narrowly specialized infrastructure.

Together, specialist and generalist repositories, along with active and archival data systems, form complementary parts of the research data ecosystem. Recognizing their respective roles helps researchers choose appropriate platforms and supports more effective data reuse over time.

Written by Carly Huitema

Streamlining Data Documentation in Research

In research, data documentation is often a complex and time-consuming task. To help researchers better document their data, ADC has created the Semantic Engine, a powerful tool for creating structured, machine-readable data schemas. These schemas serve as blueprints that describe the various features and constraints of a dataset, making it easier to share, verify, and reuse data across projects and disciplines.

Defining Data

By guiding users through the process of defining their data in a standardized format, the Semantic Engine not only improves data clarity but also enhances interoperability and long-term usability. Researchers can specify the types of data they are working with, the descriptions of data elements, units of measurement used, and other rules that govern their values—all in a way that computers can easily interpret.

Introducing Range Overlays

With the next important update, the Semantic Engine now includes support for a new feature: range overlays.

Range overlays allow researchers to define expected value ranges for specific data fields, and to specify whether each bound is inclusive or exclusive (e.g. up to but not including zero). This is particularly useful for quality control and verification. For example, if a dataset is expected to contain only positive values—such as measurements of temperature, population counts, or financial figures—a range overlay can enforce this expectation. By specifying acceptable minimum and maximum values, researchers can quickly identify anomalies, catch data entry errors, and ensure their datasets meet predefined standards.
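The inclusive/exclusive distinction is easy to illustrate. The function below is a sketch of the underlying logic (the parameter names are assumptions for illustration, not the actual overlay format):

```python
def in_range(value, minimum=None, maximum=None,
             min_inclusive=True, max_inclusive=True):
    """Check a value against optional lower/upper bounds.

    An exclusive bound rejects the boundary value itself.
    """
    if minimum is not None:
        if value < minimum or (not min_inclusive and value == minimum):
            return False
    if maximum is not None:
        if value > maximum or (not max_inclusive and value == maximum):
            return False
    return True

# "Strictly positive": greater than zero, with zero itself excluded.
assert in_range(0.1, minimum=0, min_inclusive=False)
assert not in_range(0, minimum=0, min_inclusive=False)
assert in_range(0, minimum=0, min_inclusive=True)
```

The exclusive-minimum case is exactly the “up to but not including zero” example above: 0 passes an inclusive bound but fails an exclusive one.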

Verifying Data

In addition to enhancing schema definition, range overlay support has now been integrated into the Semantic Engine’s Data Verification tool. This means researchers can not only define expected value ranges in their schema, but also actively check their datasets against those ranges during the verification process.

When you upload your dataset into the Data Verification tool—everything running locally on your machine for privacy and security—you can quickly verify your data within your web browser. The tool scans each field for compliance with the defined range constraints and flags any values that fall outside the expected bounds. This makes it easy to identify and correct data quality issues early in the research workflow, without needing to write custom scripts or rely on external verification services.

Empowering Researchers to Ensure Data Quality

Whether you’re working with clinical measurements, survey responses, or experimental results, this feature lets you catch outliers, prevent errors, and ensure your data adheres to the standards you’ve set—all in a user-friendly interface.

 

Written by Carly Huitema

Alrighty, let’s briefly introduce this topic.  AI or LLMs are the latest shiny object in the world of research and everyone wants to use them to create really cool things!  I, myself, am just starting to drink the Kool-Aid by using Copilot to clean up some of my writing – not these blog posts, obviously!!

Now, all these really cool AI tools or agents use data.  You’ve all heard the saying “Garbage In… Garbage Out…”?  So, think about that for a moment.  IF our students and researchers collect data with little to no documentation – and then that data becomes available to an AI agent…  how comfortable are you with the results?  What are they based on?  Data without documentation???

Let’s flip the conversation the other way now.   Imagine using AI agents for data creation or data analysis without understanding how the AI works, what data it draws on, or how the models work – throwing all those questions to the wind and using the AI agent’s results just the same.  How do you think that will affect our research world?

I’m not going to dwell on these questions – but I want to get them out there and have folks think about them.   Agri-food Data Canada (ADC) has created data documentation tools that can easily fit into the AI world – let’s encourage everyone to document their data and build better data resources that can then be used in developing AI agents.

Michelle

 

 


At Agri-food Data Canada (ADC), we often emphasize the importance of content-derived identifiers—unique fingerprints generated from the actual content of a resource. These identifiers are especially valuable in research and data analysis, where reproducibility and long-term verification are essential. When you cite a resource using a derived identifier, such as a digest of source code, you’re ensuring that years down the line, anyone can confirm that the referenced material hasn’t changed.

One of the best tools for managing versioned documents—especially code—is GitHub. Not only does GitHub make it easy to track changes over time, but it also automatically generates derived identifiers every time you save your work.

What Is a GitHub Commit Digest?

Every time you make a commit (i.e., save a snapshot of your code or document), Git creates a SHA-1 digest of that commit, which GitHub displays. This digest is a unique identifier derived from the content and metadata of the commit. It acts as a cryptographic fingerprint that ensures the integrity of the data.

Here’s what goes into creating a GitHub commit digest:

  • Snapshot of the File Tree: Includes all file names, contents, and directory structure.
  • Parent Commit(s): References to previous commits, which help maintain the history.
  • Author Information: Name, email, and timestamp of the person who wrote the code.
  • Committer Information: May differ from the author; includes who actually committed the change.
  • Commit Message: The message describing the change.

All of this is bundled together and run through the SHA-1 hashing algorithm, producing a 40-character hexadecimal string like:

e68f7d3c9a4b8f1e2c3d4a5b6f7e8d9c0a1b2c3d

GitHub typically displays only the first 7–8 characters of this digest (e.g., e68f7d3c), which is usually enough to uniquely identify a commit within a repository.
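You can reproduce this derivation yourself. Git (the version-control system underlying GitHub) hashes a commit object by prefixing the commit body with a header containing the object type and size, then applying SHA-1. A sketch in Python (the tree hash below is Git’s well-known empty-tree hash; the author details and timestamps are invented for illustration):

```python
import hashlib

# The body of a commit object: tree hash, author/committer lines, message.
body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author Jane Doe <jane@example.com> 1700000000 +0000\n"
    b"committer Jane Doe <jane@example.com> 1700000000 +0000\n"
    b"\n"
    b"Initial commit\n"
)
# Git prepends a header with the object type and size, then hashes it all.
header = b"commit %d\x00" % len(body)
digest = hashlib.sha1(header + body).hexdigest()
print(digest[:8])  # the short form GitHub displays
```

Because every field above feeds into the hash, changing even one byte of the message, tree, or timestamps yields an entirely different digest.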

Why You Should Reference Commit Digests

When citing code or documents stored on GitHub, always include the commit digest. This practice ensures that:

  • Your references are precise and verifiable.
  • Others can reproduce your work exactly as you did.
  • The cited material remains unchanged and trustworthy, even years later.

Whether you’re publishing a paper, sharing an analysis, or collaborating on a project, referencing the commit digest helps maintain transparency and reproducibility, promoting FAIR research.

Final Thoughts

GitHub’s built-in support for derived identifiers makes it a powerful platform for version control and long-term citation. By simply noting the commit digest when referencing code or documents, you’re contributing to a more robust and verifiable research ecosystem.

So next time you cite GitHub work, take a moment to copy that digest. It’s a small step that makes a big difference in the integrity of your research.

Written by Carly Huitema

At Agri-food Data Canada (ADC), we are developing tools to help researchers create high-quality, machine-readable metadata. But what exactly is metadata, and what types does ADC work with?

What Is Metadata?

Metadata is essentially “data about data.” It provides context and meaning to data, making it easier to understand, interpret, and reuse. While the data itself doesn’t change, metadata describes its structure, content, and usage. Different organizations may define metadata slightly differently, depending on how they use it, but the core idea remains the same: metadata adds value by enhancing data context and improving the FAIRness of data.

Key Types of Metadata at ADC

At ADC, we focus on several types of metadata that are especially relevant to research outputs:

1. Catalogue Metadata

Catalogue metadata describes the general characteristics of a published work—such as the title, author(s), publication date, and publisher. If you’ve ever used a library card catalogue, you’ve interacted with this type of metadata. Similarly, when you cite a paper in your research, the citation includes catalogue metadata to help others locate the source.
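As an illustration, a catalogue metadata record can be represented as a handful of fields (the field names loosely follow common catalogue conventions; all values below are invented, not a real dataset):

```python
# A minimal catalogue metadata record with invented example values.
record = {
    "title": "Soil measurements from an example field trial",
    "creator": ["Researcher, A.", "Colleague, B."],
    "publisher": "Example Repository",
    "date": "2025-01-15",
}

# The same fields a citation carries, assembled in machine-readable form.
citation = (f"{'; '.join(record['creator'])} ({record['date'][:4]}). "
            f"{record['title']}. {record['publisher']}.")
print(citation)
# Researcher, A.; Colleague, B. (2025). Soil measurements from an example field trial. Example Repository.
```

A library catalogue, a reference manager, and a data repository landing page all draw on essentially this same set of fields.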

2. Schema Metadata

Schema metadata provides detailed information about the structure and content of a dataset. It includes descriptions of variables, data formats, measurement units, and other relevant attributes. At ADC, we’ve developed a tool called the Semantic Engine to assist researchers in creating robust data schemas.

3. License Metadata

This type of metadata outlines the terms of use for a dataset, including permissions and restrictions. It ensures that users understand how the data can be legally accessed, shared, and reused.

These three types of metadata play a crucial role in supporting data discovery, interpretation, and responsible reuse.

Combining Metadata Types

Metadata types are not isolated—they often work together. For example, catalogue metadata typically follows a structured schema, such as Darwin Core, which itself has licensing terms (license metadata). Interestingly, Darwin Core is also catalogued: the Darwin Core schema specification has a title, authors, and a publication date.

– written by Carly Huitema