Semantic Engine

An engine to help researchers generate meaning for data.

The benefits of better data schemas

Data must be structured to be understood, and a schema describes that structure.

An example table of data where the column headers let us infer the meaning of the numbers in the columns. However, we need more information if we want to be able to reuse and understand the data contained in the table.

For example, a schema can describe what information is contained within the columns of a dataset. Researchers can tune the detail of descriptions in their data schemas depending on their needs.

A table representation of a schema, listing the attributes of each column that appears in the associated data table.
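As a minimal sketch of this idea, a schema can be written down as a small table with one row per column of the dataset. The column names, units, and descriptions below are invented for illustration and do not come from a real dataset; the helper function is likewise hypothetical.

```python
# Hypothetical schema: one entry per column of an imagined data table.
schema = {
    "plot_id": {"type": "Text", "unit": None, "description": "Identifier of the field plot"},
    "yield": {"type": "Numeric", "unit": "kg/ha", "description": "Harvested grain yield"},
    "moisture": {"type": "Numeric", "unit": "%", "description": "Grain moisture at harvest"},
}


def undocumented_columns(header, schema):
    """Return the columns in a data table header that the schema does not describe."""
    return [name for name in header if name not in schema]


# A header with one "mystery" column that nobody documented.
header = ["plot_id", "yield", "moisture", "x1"]
print(undocumented_columns(header, schema))  # ['x1']
```

Even a schema this small is enough to flag mystery columns before anyone starts analysing them.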

The better a schema is, the more value it adds to the associated dataset. For researchers this brings a host of benefits. By documenting your data better, you help your present self, your future self, and your collaborators. You can avoid mystery data, and avoid spending time following references to work out what you did with the data years ago. You can also avoid costly mistakes, such as realizing after hours of analysis that your assumptions about the data were wrong (or worse, publishing and having someone else discover that your conclusions rest on incorrectly interpreted data).

A better data schema also helps the wider research community when you share your data with other researchers. Better documentation means that you spend less time answering questions from other data users. You can communicate the context of the data more clearly and help ensure that your data is used appropriately. This can be especially valuable in cross-disciplinary research, where other people are less familiar with the conventions of your discipline.

How to easily write better schemas

Writing better data schemas can be difficult because of the amount of work involved and the knowledge it requires. Agri-food Data Canada is creating the semantic engine to help researchers write better data schemas with less effort. We are developing the semantic engine together with researchers to ensure that it meets their needs.

To create the semantic engine, Agri-food Data Canada is partnering with the Human Colossus Foundation to adopt their work on Overlay Capture Architecture (OCA) as the underlying schema standard. Overlay Capture Architecture is an extensible, flexible, international, open, and machine-accessible standard for schemas.

The Human Colossus Foundation has developed Overlay Capture Architecture (OCA), an open, international standard for data schemas. Agri-food Data Canada is adopting and adapting OCA in partnership with the Human Colossus Foundation.

An OCA schema takes a table representation of a schema and splits each feature into a separate layer. Each layer is a separate file (written in a machine-readable format) that references the capture base, which is the basic foundation of the schema describing the dataset. Layers add more detail to the schema, making it easier to understand and use data that has been collected and structured according to that schema.

The different features of the data schema can be expressed as layers (or overlays) of the capture base. This is the Overlay Capture Architecture and it can be expressed in a machine-readable format.

There are many benefits to this layered schema architecture, especially improved interoperability and extensibility. Each layer is independent and references the unique identifier of the schema base. You can begin with a very basic schema and as it becomes necessary (or popular) you can add layers referencing the capture base and increase schema usability. You can also extend and improve other people’s schemas to fit your needs. For example, you can add a layer with the labels and information in your own language to make it easier for users who don’t speak the original documented language. Rather than creating a new schema, you add new layers while keeping the same schema base which keeps your data interoperable.
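The layering described above can be sketched in a few lines of Python. This is an illustration only, not the OCA file format: the field names are simplified and hypothetical, and the identifier is a plain SHA-256 digest of the JSON text, used here as a stand-in for whatever identifier scheme the OCA specification defines.

```python
import hashlib
import json


def identifier(obj):
    """Stand-in identifier: SHA-256 of canonical JSON (an assumption, not the OCA algorithm)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The capture base holds only the bare structure: attribute names and types.
capture_base = {"attributes": {"plot_id": "Text", "yield": "Numeric"}}
base_id = identifier(capture_base)

# Each overlay is independent and points at the capture base by its identifier.
label_en = {"capture_base": base_id, "type": "label", "language": "en",
            "labels": {"plot_id": "Plot ID", "yield": "Yield"}}
# A second language is added as a new layer; the capture base is untouched.
label_fr = {"capture_base": base_id, "type": "label", "language": "fr",
            "labels": {"plot_id": "Identifiant de parcelle", "yield": "Rendement"}}


def overlays_for(base, overlays):
    """Return the overlays that reference the given capture base."""
    wanted = identifier(base)
    return [o for o in overlays if o["capture_base"] == wanted]


matching = overlays_for(capture_base, [label_en, label_fr])
print([o["language"] for o in matching])  # ['en', 'fr']
```

Because the French labels live in their own layer, adding them does not change the capture base or its identifier, so data described by the original schema stays interoperable.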

Examples of types of layers that are part of the Overlay Capture Architecture (OCA) specification. A label layer contains a human-readable label for each data field of the data table. An information layer contains text that describes the data in a specific data field, for example, which protocol was used for the measurement. A unit layer records the units used for each data field in your data table (e.g. uM). An entry layer lets you restrict the values of a data field to a list you provide, for example, a list of three specific field locations to avoid confusing names.
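To show how an entry layer could be put to use, here is a small hypothetical sketch that checks data rows against a list of allowed values. The layer structure, attribute names, and field locations are made up for the example and are not taken from the OCA specification.

```python
# Hypothetical entry layer: the only accepted values for the "location" field.
entry_layer = {
    "type": "entry",
    "allowed": {"location": ["North Field", "South Field", "East Field"]},
}


def invalid_entries(rows, entry_layer):
    """Return (row index, attribute, value) for each value outside its allowed list."""
    problems = []
    for index, row in enumerate(rows):
        for attribute, allowed in entry_layer["allowed"].items():
            if attribute in row and row[attribute] not in allowed:
                problems.append((index, attribute, row[attribute]))
    return problems


rows = [
    {"location": "North Field", "concentration": 2.5},
    {"location": "north fld", "concentration": 3.1},  # inconsistent name
]
print(invalid_entries(rows, entry_layer))  # [(1, 'location', 'north fld')]
```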

Layers can be more than just descriptions and labels. For example, you can add a data transformation layer, which contains the instructions for transforming data from another schema into your format. This might be important when you want to work with data recorded in unusual units; the data transformation layer records how to convert data from one schema to another, making data collected with two different schema bases interoperable.
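A transformation layer of this kind might look like the following sketch, which renames a field and converts its units. The layer structure and field names are assumptions made for illustration; only the conversion factor from pounds per acre to kilograms per hectare (about 1.12085) is a real-world value.

```python
# Hypothetical transformation layer: how to map another schema's field onto ours.
transformation = {
    "rename": {"yield_lb_ac": "yield_kg_ha"},
    "scale": {"yield_lb_ac": 1.12085},  # lb/acre to kg/ha (approximate)
}


def transform(row, transformation):
    """Apply the scaling and renaming rules to one data row, leaving other fields as-is."""
    converted = {}
    for name, value in row.items():
        factor = transformation["scale"].get(name)
        if factor is not None:
            value = value * factor
        converted[transformation["rename"].get(name, name)] = value
    return converted


source_row = {"plot_id": "A1", "yield_lb_ac": 1000.0}
result = transform(source_row, transformation)
print(sorted(result))  # ['plot_id', 'yield_kg_ha']
```

Keeping the rules as data rather than code means the same layer can be stored alongside the schema and read by any tool that understands it.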

The semantic engine being created by Agri-food Data Canada in partnership with the Human Colossus Foundation lets researchers create, use and export schemas using the flexible and extensible OCA standard. The semantic engine is an engine to help researchers generate meaning for data.