In research environments, effective data management depends on clarity, transparency, and interoperability. As datasets grow in complexity and scale, institutions must ensure that research data is FAIR: not only accessible, but also well-documented, interoperable, and reusable across diverse systems and contexts in research Data Spaces.
The Semantic Engine (which runs OCA Composer), developed by Agri-Food Data Canada (ADC) at the University of Guelph, addresses this need.
The OCA Composer is based on the Overlays Capture Architecture (OCA), an open standard for describing data in a structured, machine-readable format. Using OCA allows datasets to become self-describing, meaning that each element, unit, and context is clearly defined and portable.
This approach reduces reliance on separate documentation files or institutional knowledge. Instead, OCA schemas ensure that the meaning of data remains attached to the data itself, improving how datasets are shared, reused, and integrated over time. This makes data easier to interpret for both humans and machines.
The OCA Composer provides a visual interface for creating these schemas. Researchers and data managers can build machine-readable documentation without programming skills, making structured data description more accessible to those involved in data governance and research.
Implementing standards can be challenging for many Data Spaces and organizations. The OCA Composer simplifies this process by offering a guided workflow for creating structured data documentation. By making metadata a central part of data management, researchers can strengthen their overall data strategy.
The OCA Composer can support the creation and running of Data Spaces by organizations, departments, research projects and more. These Data Spaces often have unique digital environments and branding requirements. The OCA Composer supports this through embedding and white labelling features. These allow the tool to be integrated directly into existing platforms, enabling users to create and verify schemas while remaining within the infrastructure of the Data Space. Institutions can also apply their own branding to maintain a consistent visual identity.
This flexibility means the Composer can be incorporated into internal portals, research management systems, or open data platforms including Data Spaces while preserving organizational control and customization.
To integrate the OCA Composer into your systems or Data Space, check out our more detailed technical documentation. Alternatively, consult with Agri-food Data Canada for help and support, or engage us as a partner in your grant application.
Written by Ali Asjad and Carly Huitema
In research, data documentation is often a complex and time-consuming task. To help researchers better document their data, ADC has created the Semantic Engine as a powerful tool for creating structured, machine-readable data schemas. These schemas serve as blueprints that describe the various features and constraints of a dataset, making it easier to share, verify, and reuse data across projects and disciplines.
By guiding users through the process of defining their data in a standardized format, the Semantic Engine not only improves data clarity but also enhances interoperability and long-term usability. Researchers can specify the types of data they are working with, the descriptions of data elements, units of measurement used, and other rules that govern their values—all in a way that computers can easily interpret.
With its latest important update, the Semantic Engine now includes support for a new feature: range overlays.
Range overlays allow researchers to define expected value ranges for specific data fields, including whether the bounds are inclusive or exclusive (e.g. up to but not including zero). This is particularly useful for quality control and verification. For example, if a dataset is expected to contain only positive values—such as measurements of temperature, population counts, or financial figures—the range overlay can be used to enforce this expectation. By specifying acceptable minimum and maximum values, researchers can quickly identify anomalies, catch data entry errors, and ensure their datasets meet predefined standards.
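To make the idea concrete, here is a minimal sketch (not the Semantic Engine's own implementation) of a range check with inclusive and exclusive bounds; the field values and limits are hypothetical.

```python
# Illustrative sketch only: a simple range check with inclusive/exclusive bounds.
# This is not the Semantic Engine's implementation; values and limits are hypothetical.

def in_range(value, minimum=None, maximum=None,
             min_inclusive=True, max_inclusive=True):
    """Return True if value satisfies the (optional) lower and upper bounds."""
    if minimum is not None:
        if min_inclusive and value < minimum:
            return False
        if not min_inclusive and value <= minimum:
            return False
    if maximum is not None:
        if max_inclusive and value > maximum:
            return False
        if not max_inclusive and value >= maximum:
            return False
    return True

# Example: readings must be strictly greater than 0 (exclusive minimum).
readings = [21.5, 0.0, -3.2, 18.9]
flagged = [v for v in readings if not in_range(v, minimum=0, min_inclusive=False)]
print(flagged)  # [0.0, -3.2] would be flagged for review
```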
In addition to enhancing schema definition, range overlay support has now been integrated into the Semantic Engine’s Data Verification tool. This means researchers can not only define expected value ranges in their schema, but also actively check their datasets against those ranges during the verification process.
When you upload your dataset into the Data Verification tool—everything running locally on your machine for privacy and security—you can quickly verify your data within your web browser. The tool scans each field for compliance with the defined range constraints and flags any values that fall outside the expected bounds. This makes it easy to identify and correct data quality issues early in the research workflow, without needing to write custom scripts or rely on external verification services.
Whether you’re working with clinical measurements, survey responses, or experimental results, this feature lets you catch outliers, prevent errors, and ensure your data adheres to the standards you’ve set—all in a user-friendly interface.
Written by Carly Huitema
Alrighty let’s briefly introduce this topic. AI or LLMs are the latest shiny object in the world of research and everyone wants to use it and create really cool things! I, myself, am just starting to drink the Kool-Aid by using CoPilot to clean up some of my writing – not these blog posts – obviously!!
Now, all these really cool AI tools or agents use data. You’ve all heard the saying “Garbage In…. Garbage Out…”? So, think about that for a moment. IF our students and researchers collect data and create little to no documentation with their data – then that data becomes available to an AI agent… how comfortable are you with the results? What are they based on? Data without documentation???
Let’s flip the conversation the other way now. What about using AI agents for data creation or data analysis without understanding how the AI works, what data it is using, or how the models work – throwing all those questions to the wind and using the AI agent’s results just the same? How do you think that will affect our research world?
I’m not going to dwell on these questions – but want to get them out there and have folks think about them. Agri-food Data Canada (ADC) has created data documentation tools that can easily fit into the AI world – let’s encourage everyone to document their data, build better data resources – that can then be used in developing AI agents.
In our ongoing exploration of using the Semantic Engine to describe your data, there’s one concept we haven’t yet discussed—but it’s an important one: cardinality.
Cardinality refers to the number of values that a data field (specifically an array) can contain. It’s a way of describing how many items you’re expecting to appear in a given field, and it plays a crucial role in data descriptions, verification, and interpretation.
Before we talk about cardinality, we need to understand arrays. In data terms, an array is a field that can hold multiple values, rather than just one.
For example, imagine a dataset where you’re recording the languages a person speaks. Some people might speak only one language, while others might speak three or more. Instead of creating separate fields for “language1”, “language2”, and so on, you might store them all in one field as an array.
In an Excel spreadsheet, this might look like a single “Languages” column where each cell contains a comma-separated list such as “English, French”. This is an informal representation of an array: each cell in that column holds one or more values.
Once you know you’re dealing with arrays, cardinality describes how many values are expected or allowed.
You can define:
Minimum cardinality – the smallest number of values allowed
Maximum cardinality – the largest number of values allowed
Let’s return to the “Languages” example. If every person must list at least one language, you would set the minimum cardinality to 1. If your system supports a maximum of three languages per person, you would set the maximum cardinality to 3. You can also specify a minimum and maximum (for example a minimum of 1 and a maximum of 5).
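As an illustration of the idea (not the Semantic Engine's own code), here is a minimal sketch of a cardinality check on a comma-separated field; the sample values and limits are hypothetical.

```python
# Illustrative sketch only: checking cardinality of a comma-separated array field.
# Not the Semantic Engine's implementation; the sample data and limits are hypothetical.

def check_cardinality(cell, min_card=1, max_card=3):
    """Return True if the number of values in a comma-separated cell is within bounds."""
    values = [v.strip() for v in cell.split(",") if v.strip()]
    return min_card <= len(values) <= max_card

rows = ["English", "English, French", "English, French, Mandarin, Spanish", ""]
for row in rows:
    print(repr(row), "->", "ok" if check_cardinality(row) else "violates cardinality")
```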
Cardinality helps with data verification by ensuring that each entry meets the expected structure, and because the Semantic Engine supports cardinality descriptions, this expectation becomes part of the machine-readable schema.
Cardinality is a simple but essential concept when working with arrays of data. Whether you’re developing survey responses, cataloguing plant attributes, or managing research metadata, specifying cardinality ensures that your data behaves as expected.
In short: if your data field can hold more than one value, cardinality lets you define how many it should hold.
Written by Carly Huitema
Short answer: Not really — but also, kind of.
At first glance, writing an OCA (Overlays Capture Architecture) schema might seem simple. After all, it’s just JSON, and tools like ChatGPT or Microsoft Copilot are great at generating structured text. But when it comes to OCA schemas, large language models (LLMs) run into two big limitations:
LLMs struggle with exact syntax.
LLMs don’t truly “understand” JSON or schema structures — they generate text by predicting what comes next based on patterns. This means their output might look right but contain subtle errors like missing brackets, incorrect fields, or made-up syntax. Fixing these issues often requires manual correction.
LLMs can’t calculate digests.
OCA schemas use cryptographic digests — unique strings calculated from the exact contents of the schema. If the schema changes, even slightly, the digest must be recalculated. But LLMs can’t compute these digests — that requires separate code. Without the correct digests, an OCA schema isn’t valid.
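As a rough illustration of the general idea only (the actual digest algorithm and canonicalization rules are defined by the OCA specification and are not shown here), the sketch below hashes a hypothetical schema fragment; change one character of the content and the digest changes completely.

```python
# Rough illustration of content-derived digests in general.
# This is NOT the OCA digest algorithm; OCA defines its own canonicalization and
# encoding. It only shows why an LLM cannot "guess" a valid digest.
import hashlib
import json

schema_fragment = {"attributes": {"temperature": "Numeric"}}  # hypothetical content

serialized = json.dumps(schema_fragment, sort_keys=True, separators=(",", ":"))
digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
print(digest)

# Any edit to the content produces a completely different digest, which is why
# digests must be recalculated by code after every change to the schema.
```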
That said, LLMs can still play a useful role in the schema-writing process.
With the right prompt, an LLM can generate a nearly-correct OCA JSON schema package. While it won’t include valid digests (and may need a few syntax tweaks before it is recognized), the Semantic Engine can import this “almost right” schema and help correct the remaining errors. Once the schema is imported, the Semantic Engine can calculate the proper digests and export a valid OCA schema package.
This approach is especially helpful if you already have schema information in a structured format — like an Excel table — and want to save time converting it into JSON.
Here’s an example of a prompt that works well with LLMs to create OCA schema packages. You may need to adjust it for your specific case, but if you’ve got structured schema data, it can be a great starting point for working with the Semantic Engine.
Webpage containing LLM prompt to be copied in two parts.
In short, while you can’t use an LLM to fully generate a valid OCA schema on its own, you can use it to speed up the process—as long as you’re ready to do a bit of post-processing, using a tool such as a JSON formatter to validate and fix syntax and the Semantic Engine to fill in the gaps.
Written by Carly Huitema
In research and data-intensive environments, precision and clarity are critical. Yet one of the most common sources of confusion—often overlooked—is how units of measure are written and interpreted.
Take the unit micromolar, for example. Depending on the source, it might be written as uM, μM, umol/L, μmol/l, or umol-1. Each of these notations attempts to convey the same concentration unit. But when machines—or even humans—process large amounts of data across systems, this inconsistency introduces ambiguity and errors.
To ensure clarity, consistency, and interoperability, standardized units are essential. This is especially true in environments where data is:
Shared across labs or institutions
Processed by machines or algorithms
Reused or aggregated for meta-analysis
Integrated into digital infrastructures like knowledge graphs or semantic databases
Standardization ensures that “1 μM” in one dataset is understood exactly the same way in another, which helps ensure that data is FAIR (Findable, Accessible, Interoperable and Reusable).
One widely adopted system for encoding units is UCUM—the Unified Code for Units of Measure. Developed by the Regenstrief Institute, UCUM is designed to be unambiguous, machine-readable, compact, and internationally applicable.
In UCUM:
micromolar becomes umol/L
acre becomes [acr_us]
milligrams per deciliter becomes mg/dL
This kind of clarity is vital when integrating data or automating analyses.
While UCUM covers a broad range of units, it’s not exhaustive. Many disciplines use niche or domain-specific units that UCUM doesn’t yet describe. This can be a problem when strict adherence to UCUM would mean leaving out critical information or forcing awkward approximations. Furthermore, UCUM doesn’t offer an exhaustive list of all possible units; instead, the UCUM specification describes rules for creating units. For the Semantic Engine we have adopted and extended existing lists of units to create a list of common units for agri-food which can be used by the Semantic Engine.
To bridge the gap between familiar, domain-specific unit expressions and standardized UCUM representations, the Semantic Engine supports what’s known as a unit framing overlay.
Here’s how it works:
Researchers can input units in a familiar format (e.g., acre or uM).
Researchers can add a unit framing overlay which helps them map their units to UCUM codes (e.g., "[acr_us]" or "umol/L").
The result is data that is human-friendly, machine-readable, and standards-compliant—all at the same time.
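As a simple illustration of the mapping idea (this is not the unit framing overlay format itself), here is a sketch that maps familiar unit spellings to the UCUM codes mentioned above; the mapping table and input list are hypothetical.

```python
# Illustrative sketch only: mapping familiar unit spellings to UCUM codes.
# Not the unit framing overlay format; the mapping table is hypothetical.

unit_framing = {
    "uM": "umol/L",      # micromolar
    "acre": "[acr_us]",  # US survey acre
    "mg/dl": "mg/dL",    # milligrams per deciliter
}

recorded_units = ["uM", "acre", "mg/dl"]
standardized = [unit_framing.get(u, u) for u in recorded_units]
print(standardized)  # ['umol/L', '[acr_us]', 'mg/dL']
```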
This approach offers both flexibility for researchers and consistency for machines.
Standardized units aren’t just a technical detail—they’re a cornerstone of data reliability, semantic precision, and interoperability. Adopting standards like UCUM helps ensure that your data can be trusted, reused, and integrated with confidence.
By adopting unit framing overlays with UCUM, ADC enables data documentation that meets both the practical needs of researchers and the technical requirements of modern data infrastructure.
Written by Carly Huitema
When designing a data schema, you’re not only choosing what data to collect but also how that data should be structured. Format rules help ensure consistency by defining the expected structure for specific types of data and are especially useful for data verification.
For example, a date might follow the YYYY-MM-DD format, an email address should look like name@example.com, and a DNA sequence may only use the letters A, T, G, and C. These rules are often enforced using regular expressions or standardized format types to validate entries and prevent errors. Using the Semantic Engine, we have already described how users can select format rules for data input. Now we introduce the ability to add custom format rules for your data verification.
Format rules that are understood by the Semantic Engine are written in a language called Regex.
Regex—short for regular expressions—is a powerful pattern-matching language used to define the format that input data must follow. It allows schema designers to enforce specific rules on strings, such as requiring that a postal code follow a certain structure or ensuring that a genetic code only includes valid base characters.
For example, a simple regex for a 4-digit year would be: ^\d{4}$
This means the value must consist of exactly four digits.
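As a quick illustration (the Semantic Engine uses JavaScript-style regex, so also confirm patterns in that environment), here is a sketch that tests the 4-digit-year pattern against a few made-up sample values.

```python
# Illustrative sketch only: testing the 4-digit-year pattern against sample values.
# Python's re flavour is close to JavaScript's for simple patterns like this one,
# but always confirm your pattern in a JavaScript-style tester as well.
import re

pattern = re.compile(r"^\d{4}$")

samples = ["2025", "95", "20255", "202a"]
for value in samples:
    print(value, "->", "matches" if pattern.fullmatch(value) else "does not match")
```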
While regex is a widely adopted standard, it comes in different flavours depending on the programming language or system you’re using. Common flavours include:
PCRE (Perl Compatible Regular Expressions) – used in PHP and many other systems
JavaScript Regex – the flavour used in browsers and front-end validation
Python Regex (re module) – similar to PCRE with some minor syntax differences
POSIX – a more limited, traditional regex used in Unix tools like grep and awk
The Semantic Engine uses a flavour of regex aligned with JavaScript-style regular expressions, so it’s important to test your patterns using tools or environments that support this style.
Regex is notoriously powerful—but also notoriously easy to get wrong. A misplaced symbol or an overly broad pattern can lead to incorrect validation or unexpected matches. You can use AI assistants such as ChatGPT to help you start writing and understanding your regex, and you can also paste in an unfamiliar regex expression and ask for an explanation.
You can also use dedicated online regex testing tools to build and check your patterns.
It’s essential to test your regular expressions before using them in your schema. Always test your regex with both expected inputs and edge cases to ensure your data validation is reliable and robust. With the Semantic Engine you can export a schema containing your custom regex and then check it against a dataset using the data verification tool.
By using regex effectively, the Semantic Engine ensures that your data conforms to the exact formats you need, improving data quality, interoperability, and trust in your datasets.
Written by Carly Huitema
When you’re building a data schema you’re making decisions not only about what data to collect, but also how it should be structured. One of the most useful tools you have is format restrictions.
A format entry in a schema defines a specific pattern or structure that a piece of data must follow, such as a date written as YYYY-MM-DD or an email address in the form name@example.com.
These formats are usually enforced using rules like regular expressions (regex) or standardized format types.
Restricting the format of data entries is about ensuring data quality, consistency, and usability. Here’s why it’s important:
✅ To Avoid Errors Early
If someone enters a date as “15/03/25” instead of “2025-03-15”, you might not know whether that means March 15 or March 25, or even which year. A clear format prevents confusion and catches errors before they become a problem.
✅ To Make Data Machine-Readable
Computers need consistency. A standardized format means data can be processed, compared, or validated automatically. For example, if every date follows the YYYY-MM-DD format, it’s easy to sort dates chronologically or filter them by year (as illustrated in the short example below). This is especially helpful for sorting files in folders on your computer.
✅ To Improve Interoperability
When data is shared across systems or platforms, shared formats ensure everyone understands it the same way. This is especially important in collaborative research.
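As a quick, hypothetical illustration of the machine-readability point above, YYYY-MM-DD values sort chronologically even when treated as plain text; the dates here are made up.

```python
# Illustrative sketch only: ISO-formatted (YYYY-MM-DD) dates sort chronologically
# even as plain strings, which is part of what makes the shared format so convenient.
dates = ["2025-03-15", "2023-11-02", "2024-07-30"]
print(sorted(dates))                               # ['2023-11-02', '2024-07-30', '2025-03-15']
print([d for d in dates if d.startswith("2024")])  # filter by year: ['2024-07-30']
```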
Using the Semantic Engine you can add a format feature to your schema and describe what format you want the data to be entered in. While the schema stores the format rule as RegEx, you don’t need to learn how to write it. Instead, the Semantic Engine offers a set of prepared RegEx rules that users can select from. These are documented in the format GitHub repository, where new format rules can be proposed by the community.
After you have created format rules in your schema you can use the Data Entry Web tool of the Semantic Engine to verify your results against your rules.
Format restrictions may seem technical, but they’re essential to building reliable, reusable, and clean data. When you use them thoughtfully, they help everyone—from data collectors to analysts—work more confidently and efficiently.
Written by Carly Huitema
If you have already been using the Semantic Engine to write your schemas you will have come across the .zip schema bundle. This is the machine-readable version of your schema written in JSON, where each component of your schema is a separate file inside a .zip folder.

Overlays Capture Architecture has now transitioned to a schema package that has the same content, with a few extra pieces of information and a new way to extend the functionalities of schemas. First, rather than each component of the schema being a separate file inside a .zip, the components are listed one after another inside a single .json file. The content of each schema component is the same as in the schema bundle. We found that .zips were a problem for our Mac users especially, and they were also challenging to include in repositories. The change to a single JSON object addresses these challenges.
Second, ADC has extended the functionality of OCA to cover use cases that aren’t included in the original specification, such as attribute ordering. These extra overlays follow the same syntax and structure as OCA and are placed in a second object called extensions. Both the OCA bundle and its extensions are included in a JSON file called the OCA Package. You can read more about this update in the OCA Package specification.
Because OCA uses derived identifiers, this structure means we can give a derived identifier to the OCA Bundle which contains standard, well specified overlays. We also give researchers and developers flexibility to add their own functionalities via the OCA extensions where new additions won’t change the core bundle derived identifier. Finally, the entire OCA Package is given a derived identifier thus binding all the content together. When you reference your OCA schema you should use the Package identifier.
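The sketch below is only a conceptual picture of this nesting, with made-up field names and a toy hash standing in for the real derived-identifier calculation; the actual layout and digest rules are defined in the OCA Package specification.

```python
# Conceptual sketch only: how content-derived identifiers could nest in a package.
# Field names and the hashing scheme are hypothetical; the real OCA Package format
# and digest calculation are defined in the OCA Package specification.
import hashlib
import json

def derived_id(obj):
    """Toy content-derived identifier: hash of the canonicalized JSON."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()[:16]

bundle = {"attributes": {"plot": "Numeric"}}  # standard, well-specified overlays
extensions = {"ordering": ["plot"]}           # extra functionality, e.g. attribute order

package = {
    "bundle": bundle,
    "bundle_id": derived_id(bundle),  # unaffected by whatever goes into extensions
    "extensions": extensions,
}
package["package_id"] = derived_id(package)  # binds bundle and extensions together
print(package["bundle_id"], package["package_id"])
```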

One of the first extensions we have added to OCA schemas is the ability to keep the ordering of attributes. You may have noticed that in an OCA schema all the attributes are ordered alphabetically. Now, with the ordering overlay, we support user-entered ordering, and this functionality is included in all Semantic Engine tools.
The Semantic Engine can continue to consume and use .zip schema bundles, but the default is now to export .json schema packages. The .json packages will continue to be developed, so if you are using any .zip schema bundles you can open them in the Semantic Engine and export them to obtain the .json version.
Written by Carly Huitema
Using the Semantic Engine you can enter both your data and your schema and compare your data against the rules of your schema. This is useful for data verification, and the tool is called Data Entry Web (DEW).
When you use the DEW tool all your data will be verified, and the cells will be coloured red or green depending on whether or not they match the rules set out in the schema.

The filtering tool of DEW has been improved to help users more easily find which data doesn’t pass the schema rules. This can be very helpful when you have very large datasets. Now you can filter your data to show only those rows that have errors. You can even filter further and specify which types of errors you want to look at.

Once you have identified your rows that have errors you can correct them within the DEW tool. After you have corrected all your errors you can verify your data again to check that all corrections have been applied. Then you can export your data and continue on with your analysis.
Written by Carly Huitema