Semantic Engine

Built into the Semantic Engine are SAIDs (Self-Addressing Identifiers), a form of content-derived identifier. Rather than assigning an arbitrary identifier to a resource, a SAID is calculated directly from the content itself. Because the identifier is derived from the content, any change to the content produces a different identifier.

Within the Semantic Engine, every OCA schema package is assigned a SAID. OCA is also highly compositional: schemas contain nested components, overlays, and dependencies, and each of those components is independently assigned its own SAID as well. This creates a structure where every significant part of the schema can be independently referenced and verified.

The recent registration of urn:said introduces a standardized way to express SAIDs as URNs. This matters because many interoperability and metadata specifications permit URIs or URNs as identifiers, but until now there has not been a formally recognized URN representation for SAIDs.

Referencing OCA Schemas in Other Standards

In some contexts, referencing an OCA schema is straightforward. The OCA specification itself already assumes SAIDs as identifiers, so there is no ambiguity about identifier type or interpretation.

However, OCA schemas are often embedded within broader metadata ecosystems and interoperability frameworks. Many of these standards do not prescribe a specific identifier system. Instead, they simply require that identifiers be expressed as URIs or IRIs.

This is where urn:said becomes useful.

An OCA package SAID can now be represented directly as a URN:

urn:said:EA1JHinLxjatOM46wD2rsWsApfytFRQYSdDS7lwABaj-

This allows OCA identifiers to integrate cleanly into systems that already support URNs without introducing a new identifier mechanism or custom syntax.

Embedding SAIDs Directly Into Metadata

The introduction of urn:said also established a preferred pattern for embedding SAIDs into metadata documents and schemas that expect URI-based identifiers.

The process works in two stages.

First, calculate the SAID in the standard way by placing placeholder characters in the identifier field during hashing. For example:

{
  "id": "######################################"
}

The placeholder ensures that the identifier field itself does not affect the calculation unpredictably.

Once the SAID has been calculated, replace the placeholder with the final URN-form identifier:

{
  "id": "urn:said:EA1JHinLxjatOM46wD2rsWsApfytFRQYSdDS7lwABaj-"
}

This preserves the normal SAID derivation process while also producing metadata that conforms to URI and URN expectations in external specifications.
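
The two stages above can be sketched in Python. This is a simplified illustration under stated assumptions, not a conformant SAID implementation: it uses SHA-256 from the standard library in place of whatever digest a real SAID toolchain is configured with, it treats compact JSON as the serialization, and it assumes a 44-character placeholder and a one-letter digest code `I`. The function names `compute_said`, `embed_said_urn`, and `verify_said_urn` are hypothetical.

```python
import base64
import hashlib
import json

PLACEHOLDER = "#" * 44      # assumed: same length as the final identifier
URN_PREFIX = "urn:said:"


def serialize(doc: dict) -> bytes:
    # Assumption: compact JSON in insertion order stands in for the
    # serialization rules a real implementation would follow.
    return json.dumps(doc, separators=(",", ":"), ensure_ascii=False).encode("utf-8")


def compute_said(doc: dict, id_field: str = "id") -> str:
    """Stage 1: hash the document with the placeholder in the identifier field."""
    draft = dict(doc)
    draft[id_field] = PLACEHOLDER
    digest = hashlib.sha256(serialize(draft)).digest()
    # One zero pad byte plus 32 digest bytes is 33 bytes: 44 base64url characters.
    encoded = base64.urlsafe_b64encode(b"\x00" + digest).decode("ascii")
    # Assumption: overwrite the leading pad character with a one-letter digest code.
    return "I" + encoded[1:]


def embed_said_urn(doc: dict, id_field: str = "id") -> dict:
    """Stage 2: replace the placeholder with the URN form of the identifier."""
    final = dict(doc)
    final[id_field] = URN_PREFIX + compute_said(doc, id_field)
    return final


def verify_said_urn(doc: dict, id_field: str = "id") -> bool:
    """Recompute the identifier from the content and compare it with the embedded URN."""
    claimed = str(doc.get(id_field, ""))
    if not claimed.startswith(URN_PREFIX):
        return False
    return claimed[len(URN_PREFIX):] == compute_said(doc, id_field)


schema = embed_said_urn({"id": "", "name": "soil_samples", "attributes": ["pH", "moisture"]})
print(schema["id"])
print(verify_said_urn(schema))                          # True
print(verify_said_urn({**schema, "name": "changed"}))   # False
```

Verification reverses the process: strip the `urn:said:` prefix, put the placeholder back in the identifier field, recompute, and compare. Any change to the content makes the comparison fail.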

Why This Matters

Many metadata and research-data standards are intentionally identifier-agnostic. They require identifiers to be globally unique and URI-compatible, but they do not dictate whether those identifiers are URLs, DOIs, ARKs, URNs, or something else.

The urn:said namespace allows SAIDs to participate directly in those ecosystems while preserving their core property: verifiability through content derivation.

This creates a bridge between conventional metadata infrastructure and content-addressable architectures. Instead of identifiers being purely assigned labels maintained by registries, identifiers can now also function as cryptographic integrity checks tied directly to the referenced content.

For OCA schemas and Semantic Engine workflows, this means schema packages can now be referenced cleanly across external standards, metadata frameworks, and distributed systems using a globally recognized URN format.

The SAID specification is part of the KERI suite of open standards, and discussions about the specification process and URN registration are publicly available through the KERI specification meetings.

— Written by Carly Huitema

Maintaining clean, consistent data remains one of the biggest challenges in data management. Entry codes—also known as picklists—have long played a key role in improving data quality by standardizing how information is captured. Building on this foundation, a new Entry Code Library feature has been introduced in the Semantic Engine schema writer, making it easier than ever to reuse proven standards and reduce errors at the point of data entry.

The Value of Entry Codes (Picklists)

Entry codes provide a structured alternative to free-text data entry. Instead of allowing users to manually type values, entry codes limit input to a predefined list of acceptable options. This approach helps:

  • Prevent spelling mistakes and inconsistent terminology
  • Ensure uniform data across datasets and projects
  • Improve searchability, aggregation, and downstream analysis

By capturing standardized codes rather than variable text, datasets become more reliable, interoperable, and easier to maintain over time.
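
As a rough illustration of why codes are easier to check than free text, the sketch below flags any value that is not in a predefined list. The code list and the helper name `invalid_entries` are made up for this example and are not taken from the Semantic Engine.

```python
# Hypothetical entry codes for a "weather" variable: code -> label.
WEATHER_CODES = {"SUN": "Sunny", "CLD": "Cloudy", "RAIN": "Rain"}


def invalid_entries(values, allowed):
    """Return (position, value) pairs for entries that are not allowed codes."""
    return [(i, v) for i, v in enumerate(values) if v not in allowed]


entries = ["SUN", "sun", "CLD", "Sunny "]
print(invalid_entries(entries, WEATHER_CODES))  # [(1, 'sun'), (3, 'Sunny ')]
```

The check is an exact match, so differences in case or stray whitespace are caught rather than silently accepted.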

Introducing the Entry Code Library

Based on direct user feedback, the Semantic Engine team has introduced an Entry Code Library to streamline schema creation and encourage reuse of existing work.

When defining a variable in the schema writer, users who select List as their initial data type now gain access to a premade library of entry codes.

Adding a list to a variable.

Rather than building a list from scratch each time, you can browse and search the library for existing code lists that meet your needs.

Selecting entry codes from the entry code library.

Search, Reuse, and Align with Standards

The Entry Code Library is designed to save time and improve consistency by helping users:

  • Search for commonly used entry code lists
  • Reuse established vocabularies and standards
  • Avoid duplication of effort across projects
  • Reduce data cleanup caused by inconsistent entry values

By leveraging shared entry code lists, datasets across teams and domains can align more easily, improving overall data interoperability.

Contributing to the Library

The Entry Code Library is a growing resource. If you have created—or identified—a code list that you believe would be valuable to others, we encourage you to contribute.

If you see a list you would like added to the library, please contact us at adc@uoguelph.ca.

Your contributions help build a stronger, more reusable ecosystem for high-quality data entry.

Moving Toward Cleaner Data by Design

Entry codes have always been a powerful tool for enforcing consistency at the point of data capture. With the introduction of the Entry Code Library in the Semantic Engine schema writer, users now have even greater support for creating standardized, reusable, and error-resistant schemas.

By combining structured entry codes with shared libraries and community input, data quality improves not after collection—but from the very beginning.

Written by Carly Huitema

Content-Derived Identifiers in the Semantic Engine

Built into the Semantic Engine is a particular kind of identifier called a SAID (Self-Addressing Identifier). Unlike traditional identifiers that are assigned to a resource, SAIDs are derived directly from the content itself. They are computed—typically using cryptographic hashing—so the identifier is intrinsically bound to the exact bytes of the resource it represents.

These identifiers are not designed to be human-friendly. They are long, opaque strings. But that trade-off enables something more important for research and data systems: verification. If a resource is referenced by a SAID, you can independently confirm that what you have is exactly what was intended. If the content changes, the identifier no longer matches. In that sense, SAIDs are tamper-evident and self-authenticating.

Why Identifier Types Matter in Standards

Many specifications—particularly in research data and interoperability frameworks—depend on identifiers and are explicit about what types are allowed. This ensures consistency, portability, and long-term usability across systems.

One commonly accepted class is the URN (Uniform Resource Name). Because URNs are standardized and designed for persistence, they are frequently permitted in specifications where long-lived, location-independent identifiers are required.

IANA and Global Recognition

The Internet Assigned Numbers Authority (IANA) is responsible for coordinating key elements of the internet’s infrastructure, including identifier namespaces. When IANA registers a namespace, it becomes part of the globally recognized technical foundation used across systems and standards.

SAIDs have now been formally registered with IANA as a new URN namespace: urn:said. This elevates them from an ecosystem-specific mechanism to a globally recognized identifier scheme.

URNs vs URLs

A URN identifies what something is, while a URL (Uniform Resource Locator) identifies where something is located.

URNs are not inherently resolvable—you cannot simply use one to retrieve a resource without additional infrastructure. Instead, they are designed to be persistent names that systems can interpret.

SAIDs fit naturally into this model but add an important property: because they are content-derived, they can be independently verified. Anyone can build a resolver that retrieves content and checks whether it matches the SAID. Trust does not depend on the resolver—it depends on the content itself.
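
A minimal sketch of that idea follows. It assumes a hypothetical `fetch` callable that stands in for any resolver, and a stand-in `toy_digest` (a plain SHA-256 hex string) rather than the real SAID derivation. The one detail taken from the general URN syntax is that the `urn` scheme and the namespace identifier are case-insensitive, while the identifier after them is case-sensitive.

```python
import hashlib
from typing import Callable, Optional


def parse_said_urn(urn: str) -> str:
    """Return the identifier part of a urn:said URN."""
    scheme, _, rest = urn.partition(":")
    nid, _, nss = rest.partition(":")
    # "urn" and the namespace identifier are case-insensitive;
    # the namespace-specific string (the identifier itself) is not.
    if scheme.lower() != "urn" or nid.lower() != "said" or not nss:
        raise ValueError(f"not a urn:said identifier: {urn!r}")
    return nss


def toy_digest(content: bytes) -> str:
    # Stand-in only: a real SAID uses the digest and encoding from the SAID specification.
    return hashlib.sha256(content).hexdigest()


def resolve_and_verify(
    urn: str,
    fetch: Callable[[str], Optional[bytes]],
    digest: Callable[[bytes], str] = toy_digest,
) -> bytes:
    """Fetch content from an untrusted resolver and accept it only if it matches."""
    expected = parse_said_urn(urn)
    content = fetch(expected)
    if content is None:
        raise LookupError(f"resolver has no content for {expected}")
    if digest(content) != expected:
        raise ValueError("retrieved content does not match its identifier")
    return content


document = b'{"name":"soil_samples"}'
key = toy_digest(document)
honest_store = {key: document}
tampered_store = {key: b'{"name":"something else"}'}

print(resolve_and_verify(f"urn:said:{key}", honest_store.get))
try:
    resolve_and_verify(f"URN:SAID:{key}", tampered_store.get)
except ValueError as err:
    print("rejected:", err)
```

The caller never has to trust the store: a resolver that returns altered content is detected by the caller's own check.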

Implications for Research Data Infrastructure

The registration of urn:said means that SAIDs can now be used anywhere URNs are accepted. This has direct implications for research data standards and infrastructure.

The Semantic Engine already uses SAIDs to generate secure, tamper-evident identifiers. With official URN recognition, those identifiers can now integrate cleanly into broader ecosystems—supporting interoperability across repositories, metadata standards, and distributed workflows.

This represents a shift in how identifiers function within research systems. Instead of relying solely on assigned names backed by registries, systems can incorporate identifiers that are self-verifying by design. For research data—where integrity, provenance, and reproducibility are central concerns—this provides a stronger and more flexible foundation.

– Written by Carly Huitema

Imagine this scenario. On her first field season as a principal investigator, a professor watched a graduate student realize—two weeks too late—that no one had recorded soil temperature at the sampling sites. The team had pH, moisture, GPS coordinates… but not the one variable that explained the anomaly in their results. A return trip wasn’t possible. The data gap was permanent.

After that, she changed how her lab collected data.

Instead of relying on ad hoc spreadsheets, she worked with her students to design schemas for their lab’s routine data collection. These weren’t schemas for final data deposit—they were practical structures for the messy, active phase of research. The goal was simple: define in advance what gets collected, how it’s recorded, and which values are allowed.

Researchers can use the Semantic Engine to create schemas that they need for all stages of their research program, from active data collection to final data deposition.

For data collection, once a schema is established, it can be uploaded into the Semantic Engine to generate a Data Entry Excel (DEE) file.

Each DEE contains:

  • A schema description sheet – documentation pulled directly from the schema, including variable definitions and code lists.

  • A data entry sheet – pre-labeled columns that follow the schema rules.

The schema description sheet of a Data Entry Excel.
Data Entry Excel showing the sheet for data entry.

Because the documentation lives in the same file as the data, nothing has to be retyped, reinvented, or remembered from scratch. The schema description sheet also includes code lists that populate the drop-down menus in the data entry sheet, reducing inconsistent terminology and formatting errors.

If the standard schema isn’t sufficient, it can be edited in the Semantic Engine. Researchers can add attributes or adjust fields without rebuilding everything from scratch. The updated schema can then generate a new DEE, preserving previous structure while incorporating the changes.

This approach addresses a common problem: unstructured Excel data. Without standardization, spreadsheets accumulate inconsistent date formats, unit mismatches, ambiguous abbreviations, and missing values. Cleaning that data later is costly and error-prone.

By organizing data entry around a schema:

  • Required information is visible and less likely to be forgotten.

  • Fieldwork becomes more reliable – critical variables are collected the first time.

  • Data from multiple researchers or projects can be harmonized more easily.

  • Manual cleaning and interpretation are reduced.

The generated DEE does not enforce full validation inside Excel (beyond drop-down lists). For formal validation, the completed spreadsheet can be uploaded to the Semantic Engine’s Data Verification tool.

Using schema-driven Data Entry Excel files turns data structure into a practical research tool. Instead of discovering gaps during analysis, researchers define expectations at the point of collection—when it matters most.

Written by Carly Huitema

I recently had the opportunity to conduct an in-person workshop at the Cultivating Resilience: Building Climate-Smart Food Systems Together Summit in Vancouver, BC. The Summit was hosted by the Agricultural Genomics Action Centre, sister hub to the Climate-Smart Data Collaboration Centre, in which ADC is an active partner and supporter. I chose to talk about Collaboration – a topic that is near and dear to me, yet one that can create a lot of chaos and stress.

Collaboration can mean a few different things to people, but I have always taken the lens of people working together to achieve a common goal or to exchange knowledge. In the workshop, I asked people to introduce themselves at their table and discuss their relationship with “data”. The noise level in the room rose and the conversations were fantastic! I closed this part of the workshop with the statement: “See! How easy it can be to start a collaboration?” Laughter ensued, along with an audience comment: “Have you WORKED with people!! They can be… well… interesting at times!” Yes, I agree 100% with this statement! I can make it sound so easy, yet we all know the challenges behind working and collaborating with people.

Now let’s see what “data collaboration” looks like!   Let’s start with the simple example of a table:

1 13 74 Sunny
2 15 78 Cloudy
3 21 71 Part cloudy
4 28 99 Rain
6 20 75 Sunny

How in the world – can I work with this table, let alone start any collaborations?  Yes – any regular readers will anticipate where I am going with this – DOCUMENTATION!!!  Without it – this is garbage!  Sorry folks, sometimes the truth is ugly.  Doesn’t matter who, how much time, or how much money was spent on collecting the information in this table – without documentation – it’s garbage!

Think about that brief introduction chat you had at the table or with a new colleague and/or collaborator – you usually start with the basic information about yourself: your name, your occupation, where you work, and maybe something personal like city/country you live in or whether you have a pet.  Now, if I had that basic information about this table – I MIGHT be able to do something with it – title? headings?  It’s a start and just like any people collaboration – it needs work.

The amount of work you put into a collaboration – whether it’s people or data – can lead you down a very rewarding path and outcome.   Give it a thought – especially the next time you collect data, forget to document it, and go back to use it in a month or a year – Oops!

Don’t forget the Semantic Engine: a great place to start that data collaboration!

Michelle

image generated by AI

In research environments, effective data management depends on clarity, transparency, and interoperability. As datasets grow in complexity and scale, institutions must ensure that research data is FAIR: not only accessible but also well-documented, interoperable, and reusable across diverse systems and contexts in research Data Spaces.

The Semantic Engine (which runs OCA Composer), developed by Agri-Food Data Canada (ADC) at the University of Guelph, addresses this need.

What is the OCA Composer

The OCA Composer is based on the Overlays Capture Architecture (OCA), an open standard for describing data in a structured, machine-readable format. Using OCA allows datasets to become self-describing, meaning that each element, unit, and context is clearly defined and portable.

This approach reduces reliance on separate documentation files or institutional knowledge. Instead, OCA schemas ensure that the meaning of data remains attached to the data itself, improving how datasets are shared, reused, and integrated over time. This makes data easier to interpret for both humans and machines.

The OCA Composer provides a visual interface for creating these schemas. Researchers and data managers can build machine-readable documentation without programming skills, making structured data description more accessible to those involved in data governance and research.

Why Use OCA Composer in your Data Space

Implementing standards can be challenging for many Data Spaces and organizations. The OCA Composer simplifies this process by offering a guided workflow for creating structured data documentation. This can help researchers:

  • Standardize data descriptions across projects and teams
  • Improve dataset discoverability and interoperability
  • Support collaboration through consistent documentation templates (e.g. Data Entry Excel)
  • Increase transparency and trust in data definitions

By making metadata a central part of data management, researchers can strengthen their overall data strategy.

Integration and Customization

The OCA Composer can support the creation and running of Data Spaces by organizations, departments, research projects and more. These Data Spaces often have unique digital environments and branding requirements. The OCA Composer supports this through embedding and white labelling features. These allow the tool to be integrated directly into existing platforms, enabling users to create and verify schemas while remaining within the infrastructure of the Data Space. Institutions can also apply their own branding to maintain a consistent visual identity.

This flexibility means the Composer can be incorporated into internal portals, research management systems, or open data platforms including Data Spaces while preserving organizational control and customization.

To integrate the OCA Composer in your systems or Data Space, check out our more technical details. Alternatively, consult with Agri-Food Data Canada for help and support, or as a partner in your grant application.

Written by Ali Asjad and Carly Huitema

Streamlining Data Documentation in Research

In research, data documentation is often a complex and time-consuming task. To help researchers better document their data, ADC has created the Semantic Engine, a powerful tool for creating structured, machine-readable data schemas. These schemas serve as blueprints that describe the various features and constraints of a dataset, making it easier to share, verify, and reuse data across projects and disciplines.

Defining Data

By guiding users through the process of defining their data in a standardized format, the Semantic Engine not only improves data clarity but also enhances interoperability and long-term usability. Researchers can specify the types of data they are working with, the descriptions of data elements, units of measurement used, and other rules that govern their values—all in a way that computers can easily interpret.

Introducing Range Overlays

With its latest update, the Semantic Engine now includes support for a new feature: range overlays.

Range overlays allow researchers to define expected value ranges for specific data fields, and whether each bound is inclusive or exclusive (e.g. up to but not including zero). This is particularly useful for quality control and verification. For example, if a dataset is expected to contain only positive values—such as measurements of temperature, population counts, or financial figures—the range overlay can be used to enforce this expectation. By specifying acceptable minimum and maximum values, researchers can quickly identify anomalies, catch data entry errors, and ensure their datasets meet predefined standards.

Verifying Data

In addition to enhancing schema definition, range overlay support has now been integrated into the Semantic Engine’s Data Verification tool. This means researchers can not only define expected value ranges in their schema, but also actively check their datasets against those ranges during the verification process.

When you upload your dataset into the Data Verification tool—everything running locally on your machine for privacy and security—you can quickly verify your data within your web browser. The tool scans each field for compliance with the defined range constraints and flags any values that fall outside the expected bounds. This makes it easy to identify and correct data quality issues early in the research workflow, without needing to write custom scripts or rely on external verification services.
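
The kind of check described above can be sketched as follows. This is an illustration of range checking with inclusive and exclusive bounds, not the Data Verification tool's actual code or the overlay's file format; `RangeRule` and `out_of_range` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RangeRule:
    """Optional lower and upper bounds, each either inclusive or exclusive."""
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    min_inclusive: bool = True
    max_inclusive: bool = True

    def allows(self, value: float) -> bool:
        if self.minimum is not None:
            if value < self.minimum or (value == self.minimum and not self.min_inclusive):
                return False
        if self.maximum is not None:
            if value > self.maximum or (value == self.maximum and not self.max_inclusive):
                return False
        return True


def out_of_range(values, rule):
    """Return (row number, value) pairs for values the rule rejects."""
    return [(row, v) for row, v in enumerate(values, start=1) if not rule.allows(v)]


# "Strictly positive": a minimum of 0 that is exclusive, with no maximum.
positive = RangeRule(minimum=0, min_inclusive=False)
print(out_of_range([4.2, 0, -1.5, 17], positive))  # [(2, 0), (3, -1.5)]
```

Keeping inclusivity as a separate flag per bound lets the same rule express both "greater than zero" and "zero or greater" without changing the bound itself.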

Empowering Researchers to Ensure Data Quality

Whether you’re working with clinical measurements, survey responses, or experimental results, this feature lets you catch outliers, prevent errors, and ensure your data adheres to the standards you’ve set, all in a user-friendly interface.

Written by Carly Huitema

Alrighty, let’s briefly introduce this topic. AI or LLMs are the latest shiny object in the world of research and everyone wants to use them and create really cool things! I, myself, am just starting to drink the Kool-Aid by using Copilot to clean up some of my writing – not these blog posts – obviously!!

Now, all these really cool AI tools or agents use data.  You’ve all heard the saying “Garbage In…. Garbage Out…”?  So, think about that for a moment.  IF our students and researchers collect data and create little to no documentation with their data – then that data becomes available to an AI agent…  how comfortable are you with the results?  What are they based on?  Data without documentation???

Let’s flip the conversation the other way now. Imagine using AI agents for data creation or data analysis without understanding how the AI works, what it is using for its data, or how the models work – throwing all those questions to the wind and using the AI agent results just the same. How do you think that will affect our research world?

I’m not going to dwell on these questions – but want to get them out there and have folks think about them.   Agri-food Data Canada (ADC) has created data documentation tools that can easily fit into the AI world – let’s encourage everyone to document their data, build better data resources – that can then be used in developing AI agents.

Michelle

image created by AI

In our ongoing exploration of using the Semantic Engine to describe your data, there’s one concept we haven’t yet discussed—but it’s an important one: cardinality.

Cardinality refers to the number of values that a data field (specifically an array) can contain. It’s a way of describing how many items you’re expecting to appear in a given field, and it plays a crucial role in data descriptions, verification, and interpretation.

What Is an Array?

Before we talk about cardinality, we need to understand arrays. In data terms, an array is a field that can hold multiple values, rather than just one.

For example, imagine a dataset where you’re recording the languages a person speaks. Some people might speak only one language, while others might speak three or more. Instead of creating separate fields for “language1”, “language2”, and so on, you might store them all in one field as an array.

In an Excel spreadsheet, this might look like:

An example of an array attribute with a list of languages.

Here, the “Languages” column contains comma-separated lists—an informal representation of an array. Each cell in that column holds one or more values.

What Is Cardinality?

Once you know you’re dealing with arrays, cardinality describes how many values are expected or allowed.

You can define:

  • Minimum cardinality – the fewest number of values allowed

  • Maximum cardinality – the greatest number of values allowed

Let’s return to the “Languages” example. If every person must list at least one language, you would set the minimum cardinality to 1. If your system supports a maximum of three languages per person, you would set the maximum cardinality to 3. You can also specify both a minimum and a maximum (for example, a minimum of 1 and a maximum of 5).
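
A small sketch of a cardinality check on comma-separated cells, like the “Languages” column above. The helper names `split_array` and `check_cardinality` are illustrative, and the comma separator is an assumption about how the array is written in the spreadsheet.

```python
from typing import Optional


def split_array(cell: str, separator: str = ",") -> list[str]:
    """Split a cell into trimmed, non-empty items."""
    return [item.strip() for item in cell.split(separator) if item.strip()]


def check_cardinality(cell: str, minimum: Optional[int] = None,
                      maximum: Optional[int] = None) -> bool:
    """True if the number of items in the cell is within the allowed bounds."""
    count = len(split_array(cell))
    if minimum is not None and count < minimum:
        return False
    if maximum is not None and count > maximum:
        return False
    return True


rows = ["English", "English, French, Spanish", "", "English, French, German, Cree"]
print([check_cardinality(r, minimum=1, maximum=3) for r in rows])
# [True, True, False, False]
```

The empty cell fails the minimum of 1, and the cell with four languages exceeds the maximum of 3.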

Why Does Cardinality Matter?

Cardinality helps verify data by ensuring that each entry meets the expected structure. Because the Semantic Engine supports cardinality in its descriptions of research data, these expectations are also captured in machine-readable form.

Cardinality is a simple but essential concept when working with arrays of data. Whether you’re developing survey responses, cataloguing plant attributes, or managing research metadata, specifying cardinality ensures that your data behaves as expected.

In short: if your data field can hold more than one value, cardinality lets you define how many it should hold.

Written by Carly Huitema


Short answer: Not really — but also, kind of.

Why you can’t just use an LLM to write a schema

At first glance, writing an OCA (Overlays Capture Architecture) schema might seem simple. After all, it’s just JSON, and tools like ChatGPT or Microsoft Copilot are great at generating structured text. But when it comes to OCA schemas, large language models (LLMs) run into two big limitations:

  1. LLMs struggle with exact syntax.
    LLMs don’t truly “understand” JSON or schema structures — they generate text by predicting what comes next based on patterns. This means their output might look right but contain subtle errors like missing brackets, incorrect fields, or made-up syntax. Fixing these issues often requires manual correction.

  2. LLMs can’t calculate digests.
    OCA schemas use cryptographic digests — unique strings calculated from the exact contents of the schema. If the schema changes, even slightly, the digest must be recalculated. But LLMs can’t compute these digests — that requires separate code. Without the correct digests, an OCA schema isn’t valid.

Why you kind of can

That said, LLMs can still play a useful role in the schema-writing process.

With the right prompt, an LLM can generate a nearly-correct OCA JSON schema package. While it won’t include valid digests (and may need a few syntax tweaks to fix it enough to be recognized by the Semantic Engine), the Semantic Engine can import this “almost right” schema and help correct remaining errors. Once inside the Semantic Engine, it can calculate the proper digests and export a valid OCA schema package.

This approach is especially helpful if you already have schema information in a structured format — like an Excel table — and want to save time converting it into JSON.

What does a prompt look like?

Here’s an example of a prompt that works well with LLMs to create OCA schema packages. You may need to adjust it for your specific case, but if you’ve got structured schema data, it can be a great starting point for working with the Semantic Engine.

Webpage containing LLM prompt to be copied in two parts.

In short, while you can’t use an LLM to fully generate a valid OCA schema on its own, you can use it to speed up the process, as long as you’re ready to do a bit of post-processing: use a tool such as a JSON formatter to validate and fix syntax, and use the Semantic Engine to fill in the gaps.
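
If you would rather check the syntax locally than paste the output into an online formatter, Python’s standard `json` module reports where parsing fails. The sketch below only checks that the text is well-formed JSON; it says nothing about whether the result is a valid OCA schema package. The helper name `find_json_error` and the sample text are illustrative.

```python
import json
from typing import Optional


def find_json_error(text: str) -> Optional[str]:
    """Return a short description of the first JSON syntax error, or None if the text parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# An LLM-style slip: the list is never closed before the object ends.
draft = '{"name": "soil_samples", "attributes": ["pH", "moisture"}'
print(find_json_error(draft))
print(find_json_error('{"name": "soil_samples"}'))  # None
```

Once the text parses cleanly, it can be imported into the Semantic Engine to correct the remaining errors and calculate the digests.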


Written by Carly Huitema