Funding for Agri-food Data Canada is provided in part by the Canada First Research Excellence Fund
My last post was all about where to store your data schemas and how to search for them. Now let’s take it to the next step – how do I search for what’s INSIDE a data schema – in other words how do I search for the variables or attributes that someone has described in their data schema? A little caveat here – up to this point, we have been trying to take advantage of National data platforms that are already available – how can we take advantage of these with our services? Notice the words in that last statement “up to this point” – yes that means we have some new options and tools coming VERY soon. But for now – let’s see how we can take advantage of another National data repository odesi.ca.
How can a data schema help us meet the recommendations of this principle? Well…. technically I showed you one way in my last post – right? Finding the data schema or the metadata about our dataset. But let’s dig a little deeper and try another example using the Ontario Dairy Research Centre (ODRC) data schemas to find the variables that we’re measuring this time.
As I noted in my last post there are more than 30 ODRC data schemas and each has a listing of the variables that are being collected. As a researcher who works in the dairy industry – I’m REALLY curious to see WHAT is being collected at the ODRC – by this I mean – what variables, measures, attributes. But, when I look at the README file for the data schemas in Borealis, I have to read it all and manually look through the variable list OR use some keyboard combination to search within the file. This means I need to search for the data schema first and then search within all the relevant data schemas. This sounds tedious and heck I’ll even admit too much work!
BUT! There is another solution – odesi.ca – another National data repository hosted by Ontario Council of University Libraries (OCUL) that curates over 5,700 datasets, and has recently incorporated the Borealis collection of research data. Let’s see what we can see using this interface.
Let’s work through our example – I want to see what milking variables are being used by the ODRC – in other words, are we collecting “milking” data? Let’s try it together:
For our purposes – let’s review the information related to the ODRC Data Schema entries. Let’s pick the first one on my list, ODRC data schema: Tie stalls and maternity milkings. Notice that it states there is one Matching Variable? It is the variable called milking_device. If you select the data schema you will see all the relevant and limited study level metadata along with a DOI for this schema. By selecting the variable you will also see a little more detail regarding the chosen attribute.
NOTE – there is NO data with our data schemas – we have added dummy data to allow odesi.ca to search at a variable level, but we are NOT making any data available here – we are using this interface to increase the visibility of the types of variables we are working with. To access any data associated with these data schemas, researchers need to visit the ODRC website as noted in the associated study metadata.
I hope you found this as exciting as I do! Researchers across Canada can now see what variables and information are being collected at our Research Centres – so cool!!
Look forward to some more exciting posts on how to search within and across data schemas created by the Semantic Engine. Go try it out for yourself!!!
With the introduction of using OCA schemas for data verification let’s dig a bit more into the format overlay which is an important piece for data verification.
When you are writing a data schema using the Semantic Engine you can build up your schema documentation by adding features. One of the features that you can add is called format.
In an OCA (Overlays Capture Architecture) schema, you can specify the format for different types of data. This format dictates the structure and type of data expected for each field, ensuring that the data conforms to certain predefined rules. For example, for a numeric data type, you can define the format to expect only integers or decimal numbers, which ensures that the data is valid for calculations or further processing. Similarly, for a text data type, you can set a format that restricts the input to a specific number of characters, such as a string up to 50 characters in length, or constrain it to only allow alphanumeric characters. By defining these formats, the OCA schema provides a mechanism for validating the data, ensuring it meets the expected requirements.
Specifying the format for data in an OCA schema is valuable because it guarantees consistency and accuracy in data entry and validation. By imposing these rules, you can prevent errors such as inputting the wrong type of data (e.g., letters instead of numbers) or exceeding field limits. This level of control reduces data corruption, minimizes the risk of system errors, and improves the quality of the information being collected or shared. When systems across different platforms adhere to these defined formats, it enables seamless data exchange and interoperability improving data FAIRness.
The rules for defining data formats in an OCA schema are typically written using Regular Expressions (RegEx). RegEx is a sequence of characters that forms a search pattern, used for matching strings against specific patterns. It allows for very precise and flexible definitions of what is considered valid data. For example, RegEx can specify that a field should contain only digits, letters, or specific formats like dates (YYYY-MM-DD) or email addresses. RegEx is widely used for input validation because of its ability to handle complex patterns and enforce strict rules on data format, making it ideal for ensuring data consistency in systems like OCA.
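To make the idea of a format rule concrete, here is a minimal Python sketch of checking values against a few Regular Expressions. The rule names (integer, iso_date, short_text) and the patterns are illustrative assumptions and are not copied from the Semantic Engine's own rule list.

```python
import re

# Illustrative format rules; names and patterns are assumptions for this sketch,
# not the official Semantic Engine rule list.
FORMAT_RULES = {
    "integer": r"^-?[0-9]+$",
    "iso_date": r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$",
    "short_text": r"^.{0,50}$",
}


def conforms(value: str, rule_name: str) -> bool:
    """Return True when the whole value matches the named format rule."""
    return re.fullmatch(FORMAT_RULES[rule_name], value) is not None


print(conforms("42", "integer"))           # digits only
print(conforms("4.2", "integer"))          # contains a decimal point
print(conforms("2024-13-01", "iso_date"))  # month 13 is outside 01-12
```

Note that a pattern like this checks only the shape of a value. The iso_date rule above would accept 2024-02-30 because it does not know the calendar, so format rules complement rather than replace a careful look at your data.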
To help our users be consistent, the Semantic Engine limits users to a set of format rules, which is documented in the format rule GitHub repository. If the rule you want isn’t listed here it can be added by reaching out to us at ADC or raising a GitHub issue in the repository.
After you have added format rules to your data schema you can use the data verification tool to check your data against your new schema rules.
Written by Carly Huitema
When data entry into an Excel spreadsheet is not standardized, it can lead to inconsistencies in formats, units, and terminology, making it difficult to interpret and integrate research data. For instance, dates entered in various formats, inconsistent use of abbreviations, or missing values can cause problems during analysis, leading to errors.
Organizing data according to a schema—essentially a predefined structure or set of rules for how data should be entered—makes data entry easier and more standardized. A schema, such as one written using the Semantic Engine, can define fields, formats, and acceptable values for each column in the spreadsheet.
Using a standardized Excel sheet for data entry ensures uniformity across datasets, making it easier to validate, compare, and combine data. The benefits include improved data quality, reduced manual cleaning, and streamlined data analysis, ultimately leading to more reliable research outcomes.
After you have created a schema using the Semantic Engine, you can use this schema (the machine-readable version) to generate a Data Entry Excel.
When you open your Data Entry Excel you will see it consists of two sheets, one for the schema description and one for data entry. The schema description sheet takes information from the schema that was uploaded and puts it into an information table.
At the very bottom of the information table are listed all of the entry code lists from the schema. This information is used on the data entry side for populating drop-down lists.
On the data entry sheet of the Data Entry Excel you can find the pre-labeled columns for data entry according to the rules of your schema. You can rearrange the columns as you want, and you can see that the Data Entry Excel comes with prefilled dropdown lists for those variables (attributes) that have entry codes. There is no dropdown list if the data is expected to be an array of entries or if the list is very long. As well, you will need to wrestle with Excel time/date attributes to have them appear according to what is documented in the schema description.
There is no verification of data in Excel that is set up when you generate your Data Entry Excel apart from the creation of the drop-down lists. For data verification you can upload your Data Entry Excel to the Data Verification tool available on the Semantic Engine.
Using the Data Entry Excel feature lets you put your data schemas to use, helping you document and harmonize your data. You can store your data in Excel sheets with pre-filled information about what kind of data you are collecting! You can also use this to easily collect data as part of a larger project where you want to combine data later for analysis.
Written by Carly Huitema
Alrighty – so you have been learning about the Semantic Engine and how important documentation is when it comes to research data – ok, ok, yes documentation is important to any and all data, but we’ll stay in our lanes here and keep our conversation to research data. We’ve talked about Research Data Management and how the FAIR principles intertwine and how the Semantic Engine is one fabulous tool to enable our researchers to create FAIR research data.
But… now that you’ve created your data schema, where can you save it and make it available for others to see and use? There’s nothing wrong with storing it within your research group environment, but what if there are others around the world working on a related project? Wouldn’t it be great to share your data schemas? Maybe get a little extra reference credit along your academic path?
Let me walk you through what we have been doing with the data schemas created for the Ontario Dairy Research Centre data portal. There are 30+ data schemas that reflect the many data sources/datasets that are collected dynamically at the Ontario Dairy Research Centre (ODRC), and we want to ensure that the information regarding our data collection and data sources is widely available to our users and beyond by depositing our data schemas into a data repository. We want to encourage the use and reuse of our data schemas – can we say R in FAIR?
Agri-food Data Canada (ADC) supports, encourages, and enables the use of national platforms such as Borealis – Canadian Dataverse Repository. The ADC team has been working with local researchers to deposit their research data into this repository for many years through our OAC Historical Data project. As we work on developing FAIR data and ensuring our data resources are available in a national data repository, we began to investigate the use of Borealis as a repository for ADC data schemas. We recognize the need to share data schemas and encourage all to do so – data repositories are not just for data – let’s publish our data schemas!
If you are interested in publishing your data schemas, please contact adc@uoguelph.ca for more information. Our YouTube series: Agri-food Data Canada – Data Deposits into Borealis (Agri-environmental Data Repository) will be updated this semester to provide you guidance on recommended practices on publishing data schemas.
So, I hope you understand now that we can deposit data schemas into a data repository – and here at ADC, we are using the Borealis research data repository. But now the question becomes – how in the world do I find the data schemas? I’ll walk you through an example to help you find data schemas that we have created and deposited for the data collected at the ODRC.
Now you have a data schema that you can use and share among your colleagues, classmates, labmates, researchers, etc…..
Remember to check out what else you can do with these schemas by reading all about Data Verification.
A quick summary:
Wow! Research data life is getting FAIRer by the day!
What do you do when you’ve collected data but you also need to include notes in the data? Do you mix the data together with the notes?
Here we build on our previous blog post describing data quality comments with worked examples.
An example of quality comments embedded into numeric data is if you include values such as NULL or NA when you have a data table. Below are some examples of datatypes being assigned to different attributes (variables v1-v8). You can see in v5 that there are numeric measurement values mixed together with quality notations such as NULL, NA, or BDL (below detection limit).
Technically, this type of data would be given the datatype of text when using the Semantic Engine. However, you may wish to use v5 as a numeric datatype so that you can perform analysis with it. You could delete all the text values, but then you would be losing this important data quality information.
As we described in a previous blog post, one solution to this challenge is to add quality comments to your dataset. How you would do this is demonstrated in the next data example.
In this next example there are two variables: c and v. Variable v contains a mixture of numeric values and text.
Step 1: Rename v to v_raw. It is good practice to always keep raw data in its original state.
Step 2: Copy the values into v_analysis, where you can remove any text values and make other adjustments to values.
Step 3: Document your adjustments in a new column called v_quality, using a quality code table.
The quality code table is noted on the right of the data. When using the Semantic Engine you would put this in a separate .csv file and import it as an entry code list. You would also remove the highlighted datatypes (numeric, text etc.) which don’t belong in the dataset but are written here to make it easier to understand.
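If you prefer to do the three steps in a script rather than by hand, they can be sketched roughly as follows in Python. The quality codes (Q0, Q1, Q2, Q3, Q9) and the sample rows are made up for illustration; your own quality code table would replace them.

```python
# Hypothetical quality code table; a real project would define its own codes.
QUALITY_CODES = {"NULL": "Q1", "NA": "Q2", "BDL": "Q3"}
OK_CODE = "Q0"       # assumed code meaning "value used as recorded"
UNKNOWN_CODE = "Q9"  # assumed code for a text note that is not in the table


def split_quality(v_raw):
    """Return (v_analysis, v_quality) for one raw cell; v_raw itself is not changed."""
    text = v_raw.strip()
    try:
        return float(text), OK_CODE
    except ValueError:
        # Non-numeric entry: no analysis value, but keep a record of why.
        return None, QUALITY_CODES.get(text.upper(), UNKNOWN_CODE)


# Made-up sample rows with the two variables from the example, c and v (already renamed v_raw).
rows = [
    {"c": "a", "v_raw": "1.5"},
    {"c": "b", "v_raw": "BDL"},
    {"c": "c", "v_raw": "NA"},
]
for row in rows:
    row["v_analysis"], row["v_quality"] = split_quality(row["v_raw"])

for row in rows:
    print(row)
```

The raw column is kept exactly as entered, so nothing is lost if you later decide to treat a note such as BDL differently.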
You can watch the entire example being worked through using the Semantic Engine in this YouTube video. Note that even without using the Semantic Engine you can annotate data with quality comments, the Semantic Engine just makes the process easier.
Written by Carly Huitema
There is a new feature just released in the Semantic Engine!
Now, after you have written your schema you can use this schema to enter and verify data using your web browser.
Find the link to the new tool in the Quick Link lists, after you have uploaded a schema. Watch our video tutorial on how to easily create your own schema.
The Data Entry Web tool lets you upload your schema and then you can optionally upload a dataset. If you choose to upload a dataset, remember that Agri-food Data Canada and the Semantic Engine tool never receive your data. Instead, your data is ‘uploaded’ into your browser and all the data processing happens locally.
If you don’t want to upload a dataset, you can skip this step and go right to the end where you can enter and verify your data in the web browser. You add rows of blank data using the ‘Add rows’ button at the bottom and then enter the data. You can hover over the ?’s to see what data is expected, or click on the ‘verification rules’ to see the schema again to help you enter your data.
If you upload your dataset you will be able to use the ‘match attributes’ feature. If your schema and your dataset use the same column headers (aka variables or attributes), then the DEW tool will automatically match those columns with the corresponding schema attributes. Any unmatched data column headers are listed in the unassigned variables box to help you identify what is still available to be matched. You can create a match by selecting the correct column name in the associated drop-down. By selecting the column name you can unmatch an assigned match.
Matching data does two things:
1) Lets you verify the data in a data column (aka variable or attribute) against the rules of the schema. No matching, no verification.
2) When you export data from the DEW tool you have the option of renaming your column names to the schema name. This will automate future matching attempts and can also help you harmonize your dataset to the schema. No matching, no renaming.
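The matching and renaming idea can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the DEW tool's actual code; the function names and the sample headers are assumptions.

```python
def match_headers(data_headers, schema_attributes):
    """Pair identical names; return (matches, unassigned data headers)."""
    schema_set = set(schema_attributes)
    matches = {header: header for header in data_headers if header in schema_set}
    unassigned = [header for header in data_headers if header not in schema_set]
    return matches, unassigned


def rename_for_export(row, matches):
    """Rename matched columns to their schema attribute; leave other columns unchanged."""
    return {matches.get(header, header): value for header, value in row.items()}


# Made-up example: one data header ("start") does not match the schema automatically.
schema_attributes = ["animal_id", "begin_time", "total_yield"]
data_headers = ["animal_id", "start", "total_yield"]

matches, unassigned = match_headers(data_headers, schema_attributes)
print(unassigned)  # headers still waiting for a match

# A manual match, like choosing a column name from the drop-down.
matches["start"] = "begin_time"
unassigned.remove("start")

row = rename_for_export(
    {"animal_id": "101", "start": "09:15:00", "total_yield": "12.4"}, matches
)
print(row)
```

Once the exported file carries the schema names, the automatic matching step finds every column the next time around.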
After you have either entered or ‘uploaded’ data, it is time to use one of the important tools of DEW – the verification tool! (read our blog post about why it is verification and not validation).
Verification works by comparing the data you have entered against the rules of the schema. It can only verify against the schema rules so if the rule isn’t documented or described correctly in the schema it won’t verify correctly either. You can always schedule a consultation with ADC to receive one-on-one help with writing your schema.
In the above example you can see the first variable/attribute/column is called farm and the DEW tool displays it as a list to select items from. In your schema you would set this feature up by making an attribute a list (aka entry codes). The other errors we can see in this table are the times. When looking up the schema rules (either via the link to verification rules which pops up the schema for reference, or by hovering over the column’s ?) you can see the expected time should be in ISO standard (HH:MM:SS), which means two digits for hour. The correct times would be something like 09:15:00. These format rules and more are available as the format overlay in the Semantic Engine when writing your schema. See the figure below for an example of adding a format rule to a schema using the Semantic Engine.
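As a rough illustration of the two kinds of rules just described (a list of entry codes and a time format), here is a small Python sketch that reports which cells break a rule. The farm codes, column names and sample rows are hypothetical, and the real DEW tool may work differently.

```python
import re

# Assumed rules for illustration only.
ENTRY_CODES = {"farm": {"farm_a", "farm_b"}}  # hypothetical entry code list
FORMATS = {"time": re.compile(r"([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]")}  # HH:MM:SS


def verify(rows):
    """Return (row_number, column, value) for every cell that breaks a rule."""
    problems = []
    for number, row in enumerate(rows, start=1):
        for column, value in row.items():
            if column in ENTRY_CODES and value not in ENTRY_CODES[column]:
                problems.append((number, column, value))
            if column in FORMATS and not FORMATS[column].fullmatch(value):
                problems.append((number, column, value))
    return problems


rows = [
    {"farm": "farm_a", "time": "09:15:00"},
    {"farm": "farm_a", "time": "9:15"},       # hour has one digit and seconds are missing
    {"farm": "farm_c", "time": "14:30:00"},   # farm_c is not in the entry code list
]
for problem in verify(rows):
    print(problem)
```

Columns without a rule are simply skipped, which mirrors the point that verification can only check what the schema documents.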
A key thing to remember, because ADC and the Semantic Engine don’t ever store your data, if you leave the webpage, you lose the data! After you have done all the hard work of fixing your data you will want to export the data to keep your results.
You have a few choices when you export the data. If you export to .csv you have the option of keeping your original data headers or changing your headers to the matched schema attributes. When you export to Excel you will generate an Excel following our Data Entry Excel template. The first sheet will contain all the schema documentation and the next sheet will contain your data with the matching schema attribute names.
The new Data Entry Web tool of the Semantic Engine can help you enter and verify your data. Reuse your schema and improve your data quality using these tools available at the Semantic Engine.
Written by Carly Huitema
When submitting a publication to a journal you are often asked to submit data, publish it in a repository, or otherwise make it available. The journals may ask that your data supports FAIR principles (that data is Findable, Accessible, Interoperable and Reusable). You may be asked to submit supplementary data to a generalist or specialist repository, or you may choose to make the data available on request.
Writing schemas to document your data using the Semantic Engine can help you meet these journal submission goals and requirements. The information documented in a schema (which may also be described as the data dictionary or the dataset metadata) helps your research data be more FAIR.
Documented information makes the data more findable in searches, accessible because people know what is in your datasets and can understand it, and interoperable because people don’t need to guess what your data means, what your units are, and how you measured certain variables. All of these contribute to improving the reusability of your dataset.
When you submit a dataset in any repository you can include the schemas (both the machine-readable .zip/JSON version and the human-readable and archival Readme.txt version) in your submission.
If you only want to make your data available by request you could publish just your schema, giving it a DOI, and referencing it in your publication. This way, anyone who wants to know if your data is useful before requesting it can look at the schema to see if it could contain information that they need.
The Semantic Engine makes it easy to document your schema because it is an easy to follow web interface with prompts and help information which assist you in writing your data schema. Follow our tutorial video to see how easy it is to create your own schema. You can use this documentation when submitting your data to a journal publication so that other people can understand and benefit from your data.
Written by Carly Huitema
When you create a schema using the Semantic Engine you are documenting information that can make your dataset more FAIR, helping others use and understand your data. The schema created using the Semantic Engine is understood by machines and is written in JSON. At first glance, it is not so easy for people to read JSON, which is where the readme.txt file version comes in to help. All information of the schema bundle is copied into the readme.txt along with some extra helping information. To support long-term archiving it is important to document using low-requirement data formats, which is why the plain-text format has been selected for a human-readable, archive-ready version of your schema written using the Semantic Engine.
The readme text file begins with reference material. This reference material is the same for every OCA schema readme.txt. At the top it gives the version number of the readme (1.0 in this example), provides citations of where the information is coming from, and gives a short introduction to what a schema is.
BEGIN_REFERENCE_MATERIAL
******************************************************************
OCA_READ_ME/1.0
This is a human-readable schema, based on the OCA schema standard.
Reference for Overlays Capture Architecture (OCA):
https://doi.org/10.5281/zenodo.7707467
Reference for OCA_READ_ME/1.0:
https://github.com/agrifooddatacanada/OCA_README
A schema describes details about a dataset.
In OCA, a schema consists of a capture_base which documents the attributes and their most basic features.
A schema may also contain overlays which add details to the capture_base.
For each overlay and capture_base, a hash of their original contents has been calculated and is reported here as the SAID value.
This README format documents the capture_base and overlays that were associated together in a single OCA Bundle.
OCA_MANIFEST lists all components of the OCA Bundle.
For the OCA_BUNDLE, each section between rows of ****'s contains the details of one "layer type/version" of the OCA Bundle.
******************************************************************
END_REFERENCE_MATERIAL
After the reference material we list the manifest – the contents of schema listed overlay by overlay along with their digest identifiers. The digest identifiers are calculated from the contents of the schema components and are written here to help with reproducibility.
BEGIN_OCA_MANIFEST
**********************************************************************
Bundle SAID/digest: unavailable
capture_base SAID/digest: ElQVB8ffr4TdvPvCgxmHjZxhUR_JcPkuLRpuHY1oU7HA,
character_encoding SAID/digest: EKwa4p3qiRjizl-bhiVy-sC5jd8FzNLyhL842vbEGpXM,
conformance SAID/digest: ECj97Q3zZQYLyuyHli2x7rLvLaPKmpKkurPnnPMD9wbY,
entry (en) SAID/digest: EIbRDpClXxWw202M3D5sTYPq5G4ZnLEta8FvK9lclunQ,
entry_code SAID/digest: E6AuDvomYlHQ6k9HMRUCRYQnkESaGPZzh17CkVgsltPo,
format SAID/digest: EDozfjgDRT3YzWoGo23E2VYt-Nh4iepYMc3kf02Uh1u4,
information (en) SAID/digest: EU-VGxKVUPBqBPqdQvi_pdLBduJvFIjrQJZHKHlBsAvM,
label (en) SAID/digest: EgOwKdgjdcEP5y0l8Nx8RmpU74GKB-opBZj7LF-Y1hFc,
meta (en) SAID/digest: EUmhlW5XLF7GtyZeToaaP0XNcaOKD61s_48bFCX6J-sw,
unit SAID/digest: "EaN1jMNQamXdPTRm-CB4Si5Oj6kt3xjmE2BjXkOzT664"
**********************************************************************
END_OCA_MANIFEST
Next come the components of the schema bundle where each component is separated by a row of *’s. Each layer is described with a name and version (e.g. capture_base layer version 1.0) and the SAID reproduced from the manifest.
In this section, the capture_base is documented with the schema classification (RDF402) and any attributes marked as sensitive (animal_id). After that comes a list of all the attributes (variables) in the schema along with the attribute’s datatype.
BEGIN_OCA_BUNDLE
**********************************************************************
Layer name: capture_base/1.0
SAID/digest: ElQVB8ffr4TdvPvCgxmHjZxhUR_JcPkuLRpuHY1oU7HA
classification: RDF402
flagged_attributes: [animal_id]
Schema attribute: data type
animal_id: Numeric
begin_time: DateTime
date: DateTime
dim: Numeric
duration: DateTime
end_date: DateTime
end_time: DateTime
lact_n: Numeric
milking_location: Text
session_n: Numeric
total_yield: Numeric
Each overlay of the schema bundle is documented in the readme.txt file. For example, here is the format overlay (version 1.0) listing each attribute and the format feature for each attribute (written in Regular Expressions).
**********************************************************************
Layer name: format/1.0
SAID/digest: EDozfjgDRT3YzWoGo23E2VYt-Nh4iepYMc3kf02Uh1u4
Schema attribute: format/1.0
animal_id: ^-?[0-9]+$
begin_time: ^([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$/gm
date: ^(?:(?:19|20)\\d{2})-(?:0[1-9]|1[0-2])-(?:0[1-9]|[1-2]\\d|3[0-1])$
dim: ^-?[0-9]+$
duration: ^([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$/gm
end_date: ^(?:(?:19|20)\\d{2})-(?:0[1-9]|1[0-2])-(?:0[1-9]|[1-2]\\d|3[0-1])$
end_time: ^([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$/gm
lact_n: ^-?[0-9]+$
milking_location: ^.{0,50}$
session_n: ^-?[0-9]+$
total_yield: ^[-+]?\\d*\\.?\\d+$
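If you ever want to reuse these rules outside the Semantic Engine, a format layer like this one can be read back into a script. The Python sketch below makes a few assumptions that are worth checking against your own readme.txt: that each rule sits on its own line as "attribute: pattern", that the doubled backslashes are JSON-style escaping, and that a trailing /gm is a JavaScript-style flag suffix that Python does not need. The function name parse_format_layer is made up for this example.

```python
import re

# A few lines copied from the format layer, assuming one "attribute: pattern" pair per line.
LAYER_TEXT = r"""
animal_id: ^-?[0-9]+$
begin_time: ^([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$/gm
total_yield: ^[-+]?\\d*\\.?\\d+$
"""


def parse_format_layer(text):
    """Turn 'attribute: pattern' lines into a dict of compiled regular expressions."""
    rules = {}
    for line in text.strip().splitlines():
        attribute, pattern = line.split(": ", 1)
        if pattern.endswith("/gm"):
            pattern = pattern[:-3]  # drop the assumed JavaScript-style flags
        pattern = pattern.replace("\\\\", "\\")  # assumed JSON escaping: two backslashes become one
        rules[attribute] = re.compile(pattern)
    return rules


rules = parse_format_layer(LAYER_TEXT)
print(bool(rules["total_yield"].fullmatch("12.4")))
print(bool(rules["begin_time"].fullmatch("9:15:00")))
```

For everyday use the machine-readable JSON version of the schema is the better source for scripts; the readme.txt is there for people and for the archive.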
One by one, each overlay is described until the end of the schema bundle. The readme.txt file can be renamed to whatever is suitable for your dataset and can be stored as a human-readable and archival version of your schema to accompany your machine-readable JSON version of a schema.
Written by Carly Huitema
At the Semantic Engine we have created a new video example where we walk through the process of describing a dataset with a schema. We are using a dataset with milking data that has been downloaded from the research dairy barn.
You can watch the video on YouTube or follow along in the schema writing tutorial, and then go to the Semantic Engine and write your own dataset schema.
The video covers several tips and tricks that have been discussed here in our blog including:
Importing Entry Codes from another schema
Using ISO standards for dates and times
Written by Carly Huitema
The Semantic Engine has a new upgrade for importing existing entry codes!
If you don’t know what entry codes are, you can check out our blog post about how to use entry codes. We also walk through an example of entry codes in our video tutorial.
While you can type your entry codes and labels in directly when writing your schema, if you have a lot of entry codes it might be easier to import them. We already discussed how to import entry codes from a .csv file, or copy them from another attribute, but you can also import them from another OCA schema. You use the same process for uploading the schema bundle as you would for the .csv file.
The advantage of using entry codes from an existing schema is that you can reuse work that someone has already done. If you like their choice of entry codes now your schema can also include them. After importing a list of entry codes you can extend the list by adding more codes as needed.
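The reuse-then-extend idea looks something like this in Python. The .csv layout (a code column and a label column) and the codes themselves are assumptions made up for the sketch, so check them against the file you actually import.

```python
import csv
import io

# Assumed layout: one entry code per row, with "code" and "label" columns. Codes are made up.
CSV_TEXT = "code,label\nTS,Tie stall\nMP,Milking parlour\n"


def load_entry_codes(csv_file):
    """Read entry codes and their labels from an open .csv file."""
    return {row["code"]: row["label"] for row in csv.DictReader(csv_file)}


entry_codes = load_entry_codes(io.StringIO(CSV_TEXT))

# Extend the imported list; setdefault leaves an existing code untouched.
entry_codes.setdefault("RB", "Robot")
entry_codes.setdefault("TS", "Something else")

print(entry_codes)
```

Using setdefault here means the imported codes keep their original labels, so your extended list stays compatible with the list you borrowed.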
You can watch an example of entry codes in action in our tutorial video.
Entry codes are very valuable and can really help with your data standardization. The Semantic Engine can help you add them to your data schemas.
Written by Carly Huitema
© 2023 University of Guelph