Semantic Engine

Using the Semantic Engine you can enter both your data and your schema and compare your data against the rules set out in the schema. This is useful for data verification, and the tool is called Data Entry Web (DEW).

When you use the DEW tool all your data will be verified and the cells coloured red or green depending on whether or not they match the rules set out in the schema.

The Data Entry Web tool of the Semantic Engine.

The filtering tool of DEW has been improved to help users more easily find which data doesn’t pass the schema rules. This can be very helpful when you have very large datasets. Now you can filter your data and show only those rows that have errors. You can even filter further and specify which types of errors you want to look at.

The Data Entry Web tool showing filtering based on rows with errors.

Once you have identified your rows that have errors you can correct them within the DEW tool. After you have corrected all your errors you can verify your data again to check that all corrections have been applied. Then you can export your data and continue on with your analysis.

Written by Carly Huitema

Understanding Data Requires Context

Data without context is challenging to interpret and utilize effectively. Consider an example: raw numbers or text without additional information can be ambiguous and meaningless. Without context, data fails to convey its full value or purpose.

Data does not speak for itself.

By providing additional information, we can place data within a specific context, making it more understandable and actionable – more FAIR. This context is often supplied through metadata, which is essentially “data about data.” A schema, for instance, is a form of metadata that helps define the structure and meaning of the data, making it clearer and more usable.

Data is more useful when it can be placed in context.

The Role of Schemas in Contextualizing Data

A data schema is a structured form of metadata that provides crucial context to help others understand and work with data. It describes the organization, structure, and attributes of a dataset, allowing data to be more effectively interpreted and utilized.

Data are described by schemas.

A well-documented schema serves as a guide to understanding the dataset’s column labels (attributes), their meanings, the data types, and the units of measurement. In essence, a schema outlines the dataset’s structure, making it accessible to users.

For example, each column in a dataset corresponds to an attribute, and a schema specifies the details of that column:

  • Units: What units the data is measured in (e.g., meters, seconds).
  • Format: What format the data should follow (e.g., date formats).
  • Type: Whether the data is numerical, textual, Boolean, etc.
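To make this concrete, here is a minimal sketch (in Python, not the actual OCA file format) of the kind of per-attribute information a schema records; the attribute names, units, and formats below are invented for illustration:

```python
# Hypothetical sketch of per-attribute schema information; this is not the
# OCA overlay format, and the attribute names, units, and formats are invented.
schema_attributes = {
    "plant_height": {"type": "numeric", "unit": "m", "format": "decimal number"},
    "sample_date": {"type": "date", "unit": None, "format": "YYYY-MM-DD"},
    "field_notes": {"type": "text", "unit": None, "format": "up to 50 characters"},
}

# A reader (or a tool) can look up what each column means and how it is recorded.
for attribute, details in schema_attributes.items():
    print(f"{attribute}: type={details['type']}, unit={details['unit']}, format={details['format']}")
```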

The more features included in a schema to describe each attribute, the richer the metadata, and the easier it becomes for users to understand and leverage the dataset.

Richer schemas have more features which better describe the data.

Writing and Using Schemas

When preparing to collect data—or after you’ve already gathered a dataset—you can enhance its usability by creating a schema. Tools like the Semantic Engine can help you write a schema, which can then be downloaded as a separate file. When sharing your dataset, including the schema ensures that others can fully understand and use the data.

Reusing and Extending Schemas

Instead of creating a new schema for every dataset, you can reuse existing schemas to save time and effort. By building upon prior work, you can modify or extend existing schemas—adding attributes or adjusting units to align with your specific dataset requirements.

One Schema for Multiple Datasets

In many cases, one schema can be used to describe a family of related datasets. For instance, if you collect similar data year after year, a single schema can be applied across all those datasets.

Publishing schemas in repositories (e.g., Dataverse) and assigning them unique identifiers (such as DOIs) promotes reusability and consistency. Referencing a shared schema ensures that datasets remain interoperable over time, reducing duplication and enhancing collaboration.

Conclusion

Context is essential to making data understandable and usable. Schemas provide this context by describing the structure and attributes of datasets in a standardized way. By creating, reusing, and extending schemas, we can make data more accessible, interoperable, and valuable for users across various domains.

Written by Carly Huitema

Using an ontology in agri-food research provides a structured and standardized way to manage the complex data that is common in this field. Ontologies are an important tool to improve data FAIRness.

Ontologies define relationships between concepts, allowing researchers to organize information about crops, livestock, environmental conditions, agricultural practices, and food systems in a consistent manner. This structured approach ensures that data from different studies, regions, or research teams can be easily integrated and compared, helping with collaboration and knowledge sharing across the agri-food domain.

One key advantage of ontologies in agri-food research is their ability to enable semantic interoperability. By using a shared vocabulary and a defined set of relationships, researchers can ensure that the meaning of data remains consistent across different systems and databases. For example, when studying soil health, an ontology can define related terms such as soil type, nutrient content, and pH level, ensuring that these concepts are understood uniformly across research teams and databases.

Moreover, ontologies allow for enhanced data analysis and discovery. They support advanced querying, reasoning, and the ability to infer new knowledge from existing data. In agri-food research, where data is often generated from diverse sources such as satellite imaging, field sensors, and lab experiments, ontologies provide a framework to draw connections between different datasets, leading to insights into food security, climate resilience, and sustainable agriculture.

Agri-food Data Canada is working to make it easier to incorporate ontologies into research by developing tools to help incorporate ontologies into research data.

One way you can connect ontologies to your research data is through Entry Codes (aka pick lists) of a data schema. By limiting entries for a specific attribute (aka variables or data columns) to a selected list drawn from an ontology you can be sure to use terms and definitions that a community has already established. Using the Data Entry Web tool of the Semantic Engine you can verify that your data uses only allowed terms drawn from the entry code list. This helps maintain data quality and ensures it is ready for analysis.
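As an illustration of the idea (this is not the Semantic Engine’s implementation), checking a data column against an entry-code list boils down to a set-membership test; the column, terms, and values below are invented:

```python
# Hypothetical entry-code list for a "soil_type" column, with terms assumed to
# come from a community ontology; both the terms and the data values are invented.
allowed_soil_types = {"clay", "loam", "sand", "silt"}

observed_values = ["loam", "clay", "Loamy", "sand"]

# Flag any entry that is not on the pick list (exact matches only, like entry codes).
invalid_entries = [value for value in observed_values if value not in allowed_soil_types]
print("Entries not in the entry-code list:", invalid_entries)  # ['Loamy']
```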

There are many places to find ontologies as a source for terms, the organization CGIAR has published a resource of common Ontologies for agriculture.

Agri-food Data Canada is continuing to develop ways to more easily incorporate standard terms and ontologies into researcher data, helping improve data FAIRness and contributing to better cross-domain data integration.

 

Written by Carly Huitema

My last post was all about where to store your data schemas and how to search for them.  Now let’s take it to the next step – how do I search for what’s INSIDE a data schema – in other words how do I search for the variables or attributes that someone has described in their data schema?  A little caveat here – up to this point, we have been trying to take advantage of National data platforms that are already available – how can we take advantage of these with our services?  Notice the words in that last statement “up to this point” – yes that means we have some new options and tools coming VERY soon.  But for now – let’s see how we can take advantage of another National data repository odesi.ca.

Findable in the FAIR principles?

How can a data schema help us meet the recommendations of this principle?   Well…. technically I showed you one way in my last post – right?  Finding the data schema or the metadata about our dataset.  But let’s dig a little deeper and try another example using the Ontario Dairy Research Centre (ODRC) data schemas to find the variables that we’re measuring this time.

As I noted in my last post there are more than 30 ODRC data schemas and each has a listing of the variables that are being collected.  As a researcher who works in the dairy industry – I’m REALLY curious to see WHAT is being collected at the ODRC – by this I mean – what variables, measures, attributes.  But, when I look at the README file for the data schemas in Borealis, I have to read it all and manually look through the variable list OR use some keyboard combination to search within the file.  This means I need to search for the data schema first and then search within all the relevant data schemas.  This sounds tedious and, heck, I’ll even admit it – too much work!

README text

BUT! There is another solution – odesi.ca – another National data repository hosted by the Ontario Council of University Libraries (OCUL) that curates over 5,700 datasets, and has recently incorporated the Borealis collection of research data. Let’s see what we can see using this interface.

Let’s work through our example – I want to see what milking variables are being used by the ODRC – in other words, are we collecting “milking” data?  Let’s try it together:

  1. Visit odesi.ca

odesi.ca homepage

  2. To avoid the large number of results from any searches, let’s start by restricting our results to the Borealis entries – Under Collection – Uncheck Select All – Select Borealis Research Data
  3. In the Find Data box type: milking
  4. Since we are interested in which variables collect milking data – change Anywhere to Variable

odesi.ca search example

 

  5. Click Search
  6. I see over 20 results – notice that there are Replication dataset results along with the data schema results.  The Replication dataset entries refer to studies that have been conducted and whose data has been uploaded and made available to researchers – this is fabulous – it shows you how previous projects have also collected milking data.

 

odesi.ca search results

For our purposes – let’s review the information related to the ODRC Data Schema entries.  Let’s pick the first one on my list, ODRC data schema: Tie stalls and maternity milkings.   Notice that it states there is one Matching Variable?  It is the variable called milking_device.   If you select the data schema you will see all the relevant and limited study level metadata along with a DOI for this schema.  By selecting the variable you will also see a little more detail regarding the chosen attribute.

NOTE – there is NO data with our data schemas – we have added dummy data to allow odesi.ca to search at a variable level, but we are NOT making any data available here – we are using this interface to increase the visibility of the types of variables we are working with.  To access any data associated with these data schemas, researchers need to visit the ODRC website as noted in the associated study metadata.

I hope you found this as exciting as I do!  Researchers across Canada can now see what variables and information are being collected at our Research Centres – so cool!!

Look forward to some more exciting posts on how to search within and across data schemas created by the Semantic Engine.  Go try it out for yourself!!!

Michelle

With the introduction of OCA schemas for data verification, let’s dig a bit more into the format overlay, which is an important piece of data verification.

When you are writing a data schema using the Semantic Engine you can build up your schema documentation by adding features. One of the features that you can add is called format.

In an OCA (Overlays Capture Architecture) schema, you can specify the format for different types of data. This format dictates the structure and type of data expected for each field, ensuring that the data conforms to certain predefined rules. For example, for a numeric data type, you can define the format to expect only integers or decimal numbers, which ensures that the data is valid for calculations or further processing. Similarly, for a text data type, you can set a format that restricts the input to a specific number of characters, such as a string up to 50 characters in length, or constrain it to only allow alphanumeric characters. By defining these formats, the OCA schema provides a mechanism for validating the data, ensuring it meets the expected requirements.

Adding a format rule for data entry in the Semantic Engine.

Specifying the format for data in an OCA schema is valuable because it guarantees consistency and accuracy in data entry and validation. By imposing these rules, you can prevent errors such as inputting the wrong type of data (e.g., letters instead of numbers) or exceeding field limits. This level of control reduces data corruption, minimizes the risk of system errors, and improves the quality of the information being collected or shared. When systems across different platforms adhere to these defined formats, it enables seamless data exchange and interoperability, improving data FAIRness.

The rules for defining data formats in an OCA schema are typically written using Regular Expressions (RegEx). RegEx is a sequence of characters that forms a search pattern, used for matching strings against specific patterns. It allows for very precise and flexible definitions of what is considered valid data. For example, RegEx can specify that a field should contain only digits, letters, or specific formats like dates (YYYY-MM-DD) or email addresses. RegEx is widely used for input validation because of its ability to handle complex patterns and enforce strict rules on data format, making it ideal for ensuring data consistency in systems like OCA.
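For illustration, here is a small sketch of the kinds of RegEx patterns a format rule might use; these patterns are examples only and may differ from the rules published in the Semantic Engine’s format rule repository:

```python
import re

# Illustrative format patterns; the Semantic Engine's published rules may differ.
format_rules = {
    "integer": r"^-?\d+$",                        # whole numbers only
    "date (YYYY-MM-DD)": r"^\d{4}-\d{2}-\d{2}$",  # ISO-style calendar date
    "text up to 50 characters": r"^.{0,50}$",
}

sample_values = {
    "integer": "42",
    "date (YYYY-MM-DD)": "2024-07-01",
    "text up to 50 characters": "Holstein",
}

for rule_name, pattern in format_rules.items():
    value = sample_values[rule_name]
    is_valid = re.fullmatch(pattern, value) is not None
    print(f"{rule_name}: {value!r} -> {'valid' if is_valid else 'invalid'}")
```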

To help our users be consistent, the Semantic Engine limits users to a set of format rules, which is documented in the format rule GitHub repository. If the rule you want isn’t listed here it can be added by reaching out to us at ADC or raising a GitHub issue in the repository.

After you have added format rules to your data schema you can use the data verification tool to check your data against your new schema rules.

Verifying data using a schema in the DEW tool of the Semantic Engine.

 

Written by Carly Huitema

When data entry into an Excel spreadsheet is not standardized, it can lead to inconsistencies in formats, units, and terminology, making it difficult to interpret and integrate research data. For instance, dates entered in various formats, inconsistent use of abbreviations, or missing values can cause problems during analysis, leading to errors.

Organizing data according to a schema—essentially a predefined structure or set of rules for how data should be entered—makes data entry easier and more standardized. A schema, such as one written using the Semantic Engine, can define fields, formats, and acceptable values for each column in the spreadsheet.

Using a standardized Excel sheet for data entry ensures uniformity across datasets, making it easier to validate, compare, and combine data. The benefits include improved data quality, reduced manual cleaning, and streamlined data analysis, ultimately leading to more reliable research outcomes.

After you have created a schema using the Semantic Engine, you can use this schema (the machine-readable version) to generate a Data Entry Excel.

The link to generating a Data Entry Excel sheet after uploading a schema in the Semantic Engine.

When you open your Data Entry Excel you will see it consists of two sheets, one for the schema description and one for data entry. The schema description sheet takes information from the schema that was uploaded and puts it into an information table.

The schema description sheet of a Data Entry Excel.

At the very bottom of the information table are listed all of the entry code lists from the schema. This information is used on the data entry side for populating drop-down lists.

On the data entry sheet of the Data Entry Excel you can find the pre-labeled columns for data entry according to the rules of your schema. You can rearrange the columns as you want, and you can see that the Data Entry Excel comes with prefilled drop-down lists for those variables (attributes) that have entry codes. There is no drop-down list if the data is expected to be an array of entries or if the list is very long. As well, you will need to wrestle with Excel time/date attributes to have them appear according to what is documented in the schema description.
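As a rough sketch of how an entry-code drop-down behaves inside Excel (this is not how the Semantic Engine builds its Data Entry Excel; it is just an illustration using the openpyxl library with an invented attribute and pick list):

```python
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active
ws.title = "Data entry"
ws["A1"] = "farm"  # hypothetical attribute that has entry codes in the schema

# Restrict entries in column A (rows 2-100) to an invented pick list of entry codes.
dropdown = DataValidation(type="list", formula1='"farm_A,farm_B,farm_C"', allow_blank=True)
ws.add_data_validation(dropdown)
dropdown.add("A2:A100")

wb.save("data_entry_sketch.xlsx")
```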

Data Entry Excel showing the sheet for data entry.

Apart from the creation of the drop-down lists, no data verification is set up in Excel when you generate your Data Entry Excel. For data verification you can upload your Data Entry Excel to the Data Verification tool available on the Semantic Engine.

Using the Data Entry Excel feature lets you put your data schemas to use, helping you document and harmonize your data. You can store your data in Excel sheets with pre-filled information about what kind of data you are collecting! You can also use this to easily collect data as part of a larger project where you want to combine data later for analysis.

Written by Carly Huitema

Alrighty – so you have been learning about the Semantic Engine and how important documentation is when it comes to research data – ok, ok,  yes documentation is important to any and all data, but we’ll stay in our lanes here and keep our conversation to research data.  We’ve talked about Research Data Management and how the FAIR principles intertwine and how the Semantic Engine is one fabulous tool to enable our researchers to create FAIR research data.

But…  now that you’ve created your data schema, where can you save it and make it available for others to see and use?   There’s nothing wrong with storing it within your research group environment, but what if there are others around the world working on a related project?  Wouldn’t it be great to share your data schemas?  Maybe get a little extra reference credit along your academic path?

Let me walk you through what we have been doing with the data schemas created for the Ontario Dairy Research Centre data portal.   There are 30+ data schemas that reflect the many data sources/datasets that are collected dynamically at the Ontario Dairy Research Centre (ODRC), and we want to ensure that the information regarding our data collection and data sources is widely available to our users and beyond by depositing our data schemas into a data repository.   We want to encourage the use and reuse of our data schemas – can we say R in FAIR?

Storing the ADC data schemas

Agri-food Data Canada (ADC) supports, encourages, and enables the use of national platforms such as Borealis – the Canadian Dataverse Repository.   The ADC team has been working with local researchers to deposit their research data into this repository for many years through our OAC Historical Data project.   As we work on developing FAIR data and ensuring our data resources are available in a national data repository, we began to investigate the use of Borealis as a repository for ADC data schemas.  We recognize the need to share data schemas and encourage all to do so – data repositories are not just for data – let’s publish our data schemas!

If you are interested in publishing your data schemas, please contact adc@uoguelph.ca for more information.   Our YouTube series: Agri-food Data Canada – Data Deposits into Borealis (Agri-environmental Data Repository) will be updated this semester to provide you guidance on recommended practices on publishing data schemas.

Where is the data schema?

So, I hope you understand now that we can deposit data schemas into a data repository – and here at ADC, we are using the Borealis research data repository.  But now the question becomes – how in the world do I find the data schemas?  I’ll walk you through an example to help you find data schemas that we have created and deposited for the data collected at the ODRC.

  1. Visit Borealis (the Canadian Dataverse Repository) or the data repository for research data.

Borealis data repository entry page

  2. In the search box type:  Milking data schema
  3. You will get a LOT of results (152,870+) so let’s try that one again
  4. Go back to the search box and, using boolean searching techniques, type: “data schema” AND milking

Borealis search text

  5. Now you should have around 35 results – essentially any entry that has the words data schema together and milking somewhere in the record
  6. From this list select the entry that matches the data you are aiming to collect – let’s say the students were working with the cows in the milking parlour.  So you would select ODRC data schema: Milk parlour data

Now you have a data schema that you can use and share among your colleagues, classmates, labmates, researchers, etc…..

Remember to check out what else you can do with these schemas by reading all about Data Verification.

Summary

A quick summary:

  1. I can deposit my data schemas into a repository – safe keeping, sharing, and getting academic credit all in one shot!
  2. I can search for a data schema in a repository such as Borealis
  3. I can use a data schema someone else has created for my own data entry and data verification!

Wow!  Research data life is getting FAIRer by the day!

Michelle


What do you do when you’ve collected data but you also need to include notes in the data? Do you mix the data together with the notes?

Here we build on our previous blog post describing data quality comments with worked examples.

An example of quality comments embedded into numeric data is when you include values such as NULL or NA in a data table. Below are some examples of datatypes being assigned to different attributes (variables v1-v8). You can see in v5 that there are numeric measurement values mixed together with quality notations such as NULL, NA, or BDL (below detection limit).

Examples of different types of datatype classifications.

Technically, this type of data would be given the datatype of text when using the Semantic Engine. However, you may wish to use v5 as a numeric datatype so that you can perform analysis with it. You could delete all the text values, but then you would be losing this important data quality information.

As we described in a previous blog post, one solution to this challenge is to add quality comments to your dataset. How you would do this is demonstrated in the next data example.

In this next example there are two variables: c and v. Variable v contains a mixture of numeric values and text.

step 1: Rename v to v_raw. It is good practice to always keep raw data in its original state.

step 2: copy the values into v_analysis and here you can remove any text values and make other adjustments to values.

step 3: document your adjustments in a new column called v_quality, using a quality code table.

The quality code table is noted on the right of the data. When using the Semantic Engine you would put this in a separate .csv file and import it as an entry code list. You would also remove the highlighted datatypes (numeric, text etc.) which don’t belong in the dataset but are written here to make it easier to understand.
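A minimal sketch of the same three steps using pandas; the values and the quality codes (Q1, Q2) below are illustrative rather than the exact worked example:

```python
import pandas as pd

# Illustrative data: variable v mixes numeric values with quality notations.
df = pd.DataFrame({"c": [1, 2, 3, 4], "v": ["5.2", "NA", "6.1", "BDL"]})

# Step 1: keep the raw data unchanged, renamed to v_raw.
df = df.rename(columns={"v": "v_raw"})

# Step 2: copy the values into v_analysis; text values become missing (NaN).
df["v_analysis"] = pd.to_numeric(df["v_raw"], errors="coerce")

# Step 3: document the adjustments in v_quality using an invented quality code table.
quality_codes = {"NA": "Q1", "BDL": "Q2"}  # e.g. Q1 = value not available, Q2 = below detection limit
df["v_quality"] = df["v_raw"].map(quality_codes).fillna("")

print(df)
```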

Example of adding additional columns of data to a dataset with quality comments.
Quality Annotations

You can watch the entire example being worked through using the Semantic Engine in this YouTube video. Note that even without using the Semantic Engine you can annotate data with quality comments, the Semantic Engine just makes the process easier.

 

Written by Carly Huitema

There is a new feature just released in the Semantic Engine!

Now, after you have written your schema you can use this schema to enter and verify data using your web browser.

Find the link to the new tool in the Quick Link lists, after you have uploaded a schema. Watch our video tutorial on how to easily create your own schema.

Link to the Data Entry Web plus Verification tool in the Quick Links section.

Add data

The Data Entry Web tool lets you upload your schema and then you can optionally upload a dataset. If you choose to upload a dataset, remember that Agri-food Data Canada and the Semantic Engine tool never receive your data. Instead, your data is ‘uploaded’ into your browser and all the data processing happens locally.

If you don’t want to upload a dataset, you can skip this step and go right to the end where you can enter and verify your data in the web browser. You add rows of blank data using the ‘Add rows’ button at the bottom and then enter the data. You can hover over the ?’s to see what data is expected, or click on the ‘verification rules’ to see the schema again to help you enter your data.

 

Screenshot of entering data following the rules of a schema using Data Entry Web.

 

If you upload your dataset you will be able to use the ‘match attributes’ feature. If your schema and your dataset use the same column headers (aka variables or attributes), then the DEW tool will automatically match those columns with the corresponding schema attributes. The list of unmatched data column headers appears in the unassigned variables box to help you identify what is still available to be matched. You can create a match by selecting the correct column name in the associated drop-down, and by selecting the column name you can also unmatch an assigned match.

 

Matching attributes between schema and dataset in the DEW tool.

 

Matching data does two things:

1) Lets you verify the data in a data column (aka variable or attribute) against the rules of the schema. No matching, no verification.

2) When you export data from the DEW tool you have the option of renaming your column names to the schema name. This will automate future matching attempts and can also help you harmonize your dataset to the schema. No matching, no renaming.

Verify data

After you have either entered or ‘uploaded’ data, it is time to use one of the important tools of DEW – the verification tool! (read our blog post about why it is verification and not validation).

Verification works by comparing the data you have entered against the rules of the schema. It can only verify against the schema rules so if the rule isn’t documented or described correctly in the schema it won’t verify correctly either. You can always schedule a consultation with ADC to receive one-on-one help with writing your schema.

 

Verifying data using a schema in the DEW tool of the Semantic Engine.

 

In the above example you can see the first variable/attribute/column is called farm and the DEW tool displays it as a list to select items from. In your schema you would set this feature up by making an attribute a list (aka entry codes). The other errors we can see in this table are the times. When looking up the schema rules (either via the link to verification rules which pops up the schema for reference, or by hovering over the column’s ?) you can see the expected time should be in ISO standard (HH:MM:SS), which means two digits for hour. The correct times would be something like 09:15:00. These format rules and more are available as the format overlay in the Semantic Engine when writing your schema. See the figure below for an example of adding a format rule to a schema using the Semantic Engine.
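For example, here is a quick sketch of the HH:MM:SS check described above; the exact pattern used by the schema’s format rule may be written slightly differently:

```python
import re

# ISO-style HH:MM:SS with two digits for the hour; this pattern is illustrative.
time_rule = r"^([01]\d|2[0-3]):[0-5]\d:[0-5]\d$"

for value in ["9:15", "09:15:00"]:
    ok = re.fullmatch(time_rule, value) is not None
    print(f"{value!r} -> {'passes' if ok else 'fails'} the format rule")
```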

 

Add format rules for data entry using the Semantic Engine

Export data

A key thing to remember: because ADC and the Semantic Engine don’t ever store your data, if you leave the webpage, you lose the data! After you have done all the hard work of fixing your data you will want to export the data to keep your results.

You have a few choices when you export the data. If you export to .csv you have the option of keeping your original data headers or changing your headers to the matched schema attributes. When you export to Excel you will generate an Excel file following our Data Entry Excel template. The first sheet will contain all the schema documentation and the next sheet will contain your data with the matching schema attribute names.

The new Data Entry Web tool of the Semantic Engine can help you enter and verify your data. Reuse your schema and improve your data quality using these tools available at the Semantic Engine.

 

Written by Carly Huitema

When submitting a publication to a journal you are often asked to submit data, publish it in a repository, or otherwise make it available. The journals may ask that your data supports FAIR principles (that data is Findable, Accessible, Interoperable and Reusable). You may be asked to submit supplementary data to a generalist or specialist repository, or you may choose to make the data available on request.

More FAIR data

Writing schemas to document your data using the Semantic Engine can help you meet these journal submission goals and requirements. The information documented in a schema (which may also be described as the data dictionary or the dataset metadata) helps your research data be more FAIR.

Documented information makes the data more findable in searches, accessible because people know what is in your datasets and can understand it, and interoperable because people don’t need to guess what your data means, what your units are, or how you measured certain variables. All of these contribute to improving the reusability of your dataset.

Deposit a schema

When you submit a dataset in any repository you can include the schemas (both the machine-readable .zip/JSON version and the human-readable and archival Readme.txt version) in your submission.

If you only want to make your data available by request you could publish just your schema, giving it a DOI, and referencing it in your publication. This way, anyone who wants to know if your data is useful before requesting it can look at the schema to see if it could contain information that they need.

The Semantic Engine makes it easy to document your schema because it is an easy-to-follow web interface with prompts and help information that assist you in writing your data schema. Follow our tutorial video to see how easy it is to create your own schema. You can use this documentation when submitting your data to a journal publication so that other people can understand and benefit from your data.

 

Written by Carly Huitema