You are using a dataset and you come across some missing values, or unusual entries like NA or ND. What do these mean? Why is there data missing?
Quality indicators for individual measurements or observations can be included directly in your data and will help users understand your dataset and use it correctly.
Rather than store a quality indicator directly within a data column (for example, a text entry such as NULL or NA in what should be a numerical list of measurements), you should pair your measurement column with a quality column. This way your measurements will not be ‘contaminated’ with non-numerical indicators that interfere with analysis.
Documenting your schema with OCA using the Semantic Engine gives you many helpful tools for managing and communicating data quality, such as the ability to add entry codes and descriptions to your schema so that others can interpret your data and coding.
To begin adding quality information, you can pair a measurement column with an additional column, appending _qual to the variable name (or something similar that you recognize and that you document). It is in this ‘var_qual’ column that you record quality information about the measurements held in the associated ‘var’ column.
Next, to ensure that your ‘var_qual’ column has consistent data entry, you can use a system of entry codes. For example, you can set the DataType of the var_qual attribute to be numerical and mark it as a list (that is, use entry codes). On the Entry Codes screen you can then enter a table of quality codes and their meanings, adjusting it to suit your needs.
You can continue to expand on data quality attributes as your analysis progresses. For example, you could create additional quality columns where you specify the reasons for rejecting values from analysis (e.g. ‘confirmed outlier’ might be one of the entry code labels for this new quality column). Schemas for different stages of collection and analysis would fit into your data organization structure, such as within the codebook folders of the TIER Protocol.
An example would be a dataset where each measurement column is paired with a quality column that scores each value.
Numbers that are unusual because of equipment errors or other reasons can be flagged and dealt with appropriately during the data analysis cycles.
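Here is a minimal sketch of what working with such paired columns could look like, using pandas in Python. The measurement name, values, and quality codes below are invented for illustration only (0 = accepted, 1 = missing, 2 = equipment error, 3 = confirmed outlier).

```python
import pandas as pd

# Hypothetical example: a measurement column paired with its quality column.
# Quality codes (invented): 0 = accepted, 1 = missing, 2 = equipment error, 3 = confirmed outlier.
data = pd.DataFrame({
    "plot_id":    ["01", "02", "03", "04"],
    "yield":      [5.2, 4.8, None, 12.9],
    "yield_qual": [0, 0, 1, 2],
})

# Keep only measurements whose quality code marks them as accepted,
# so flagged values never contaminate the numerical analysis.
accepted = data[data["yield_qual"] == 0]
print(accepted["yield"].mean())
```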
Written by Carly Huitema
When you document your data schema using the Semantic Engine, you are writing a schema in the schema language of Overlays Capture Architecture (OCA). Let’s do a deeper dive into one of the features of OCA.
Attributes in OCA
When you start using the Semantic Engine, you can either drag and drop your dataset, or you can begin to manually add attributes. Attributes are the names of the columns in your dataset, which should match the variable names in your experiments. In fact, when you drag a dataset into the Semantic Engine, the engine reads the first line of your data, assumes it contains your column headers, and uses them to create the list of attributes in your schema.
If you are entering your attributes manually, they should match the column headers of your dataset.
Labels in OCA
A few screens into the Semantic Engine and you will be asked to add language-specific labels. This might be a bit confusing if you’ve already entered your attributes and you only have one language! The attribute names and the English language labels might even look exactly the same – and this is OK!
The attributes and their corresponding labels may be the same (or very close), but sometimes they can be very different, especially if the column names are very cryptic. This is your chance to give your data more human readable labels while still preserving the underlying data structure.
Internationalization with OCA
With labels as well as attributes, OCA is able to support internationalization. This means that many people can use the same schemas while having helpful information provided to them in their own language.
Your labels may not be very different from your attribute names if they are both in English, but the ability to give labels to attributes will help make your schema more accessible in other languages. All you will need to do is edit your schema in the Semantic Engine and add additional languages.
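As an illustration of the idea only (this is not the exact OCA overlay format, and the attribute names and translations are invented), cryptic attribute names can be mapped to human-readable labels in more than one language:

```python
# Illustrative only: invented attribute names mapped to labels in two languages.
# In OCA this information lives in the language-specific label overlays of the schema.
labels = {
    "en": {"ht_cm": "Plant height (cm)", "trt": "Treatment group"},
    "fr": {"ht_cm": "Hauteur de la plante (cm)", "trt": "Groupe de traitement"},
}

def label_for(attribute: str, language: str) -> str:
    """Return the human-readable label for an attribute, falling back to the attribute name."""
    return labels.get(language, {}).get(attribute, attribute)

print(label_for("ht_cm", "fr"))  # Hauteur de la plante (cm)
```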
Written by Carly Huitema
Maintaining data quality can be a constant challenge. One effective solution is the use of entry codes. Let’s explore what entry codes entail, why they are crucial for clean data, and how they are seamlessly integrated into the Semantic Engine using Overlays Capture Architecture (OCA).
Understanding Entry Codes in Data Entry
Entry codes serve as structured identifiers in data entry, offering a systematic approach to input data. Instead of allowing free-form text, entry codes limit choices to a predefined list, ensuring consistency and accuracy in the dataset.
The Need for Clean Data
Data cleanliness is essential for meaningful analysis and decision-making. Without restrictions on data entry, datasets often suffer from various spellings and abbreviations of the same terms, leading to confusion and misinterpretation.
Practical Examples of Entry Code Implementation
Consider scenarios in scientific research where specific information, such as research locations, gene names, or experimental conditions, needs to be recorded. Entry codes provide a standardized framework, reducing the likelihood of inconsistent data entries.
Overcoming Cleanup Challenges
In the past, when working with datasets lacking entry codes, manual cleanup or tools like OpenRefine were essential. OpenRefine is a useful data cleaning tool that lets users standardize data after collection has been completed.
Leveraging OCA for Improved Data Management
Overlays Capture Architecture (OCA) takes entry codes a step further by allowing the creation of lists to limit data entry choices. Invalid entries, those not on the predefined list (entry code list), are easily identified, enhancing the overall quality of the dataset.
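As a minimal sketch of that check (the site codes and observations are invented for illustration), entries that are not on the entry code list can be flagged in a few lines of Python:

```python
import pandas as pd

# Invented entry code list for a 'site' attribute.
allowed_sites = {"GUE", "RID", "ELO"}

observations = pd.DataFrame({
    "site":  ["GUE", "Guelph", "RID", "elora"],
    "yield": [5.1, 4.9, 5.6, 4.4],
})

# Entries not on the predefined list are easy to identify and correct.
invalid = observations[~observations["site"].isin(allowed_sites)]
print(invalid)
```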
Language-specific Labels in OCA
OCA introduces a noteworthy feature – language-specific labels for entry codes. In instances like financial data entry, where numerical codes may be challenging to remember, users can associate user-friendly labels (e.g., account names) with numerical entry codes. This ensures ease of data entry without compromising accuracy.
Multilingual Support for Global Usability
OCA’s multilingual support adds a layer of inclusivity, enabling the incorporation of labels in multiple languages. This feature facilitates international collaboration, allowing users worldwide to engage with the dataset in a language they are comfortable with.
Crafting Acceptable Data Entries in OCA
When creating lists in OCA, users define acceptable data entries for specific attributes. Labels accompanying entry codes aid users in understanding and selecting the correct code, contributing to cleaner datasets.
Clarifying the Distinction between Labels and Entry Codes
It’s important to note that, in OCA, the emphasis is on entry codes rather than labels. While labels provide user-friendly descriptions, it is the entry code itself that becomes part of the dataset, ensuring data uniformity.
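For example (using invented account codes and labels, echoing the financial example above), the dataset holds only the entry code, while the label is looked up when presenting the record to a person:

```python
# Invented numerical account codes paired with user-friendly labels.
account_labels = {
    "1001": "Seed purchases",
    "1002": "Fertilizer",
    "2001": "Equipment rental",
}

# The dataset itself stores only the entry code...
record = {"account": "1002", "amount": 350.00}

# ...while the label is used for display and to guide data entry.
print(f"{account_labels[record['account']]}: ${record['amount']:.2f}")
```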
In conclusion, entry codes play an important role in streamlining data entry and enhancing the quality of datasets. Through the practical implementation of entry codes supported by Overlays Capture Architecture, organizations can ensure that their data remains accurate, consistent, and accessible on a global scale.
When you use the Semantic Engine to create a schema, one of the first things you are asked to do is to classify your schema.
It might seem simple, but as you move further from your domain, what seems like an obvious classification to you may not be so obvious to people outside of your specialty. For example, if someone is talking about a bar there are multiple meanings depending on the context. It could be the location for socialization and drinking, or it could be the exam for lawyers.
With the addition of machine learning and machine-assisted searching, it is even more important to add contextual cues to our information to help machines produce more reasonable responses.
A recent publication from the Canadian Federated Research Data Repository (FRDR) demonstrated the challenge they had with automated metadata (e.g. classification) reconciliation. A working group investigated how to build an automated or semi-automated workflow to reconcile metadata keywords from harvested datasets. The majority of their term reconciliation work could not be automated. Ultimately FRDR chose to abandon the assignment of standardized terms to metadata records. The downstream impact means relevant datasets may not appear in relevant searches and research will miss out on opportunities to find and potentially reuse data.
The Semantic Engine supports the findability and categorization of schemas through the addition of schema classifications using the controlled vocabulary of Statistics Canada, specifically the Canadian Research and Development Classification (CRDC) 2020 Version 1.0 – Field of Research (FOR). When you enter your schema classification you are using one of the terms from this controlled list.
Ultimately, by classifying your schema you help ensure that both machines and people can better understand and find your schema and be more confident that they are using it for its intended purpose.
Written by Carly Huitema
Recommendation: Long format of datasets is recommended for schema documentation. Long format is more flexible because it is more general. The schema is reusable for other experiments, either by the researcher or by others. It is also easier to reuse the data and combine it with similar experiments.
Data must help answer specific questions or meet specific goals and that influences the way the data can be represented. For example, analysis often depends on data in a specific format, generally referred to as wide vs long format. Wide datasets are more intuitive and easier to grasp when there are relatively few variables, while long datasets are more flexible and efficient for managing complex, structured data with many variables or repeated measures. Researchers and data analysts often transform data between these formats based on the requirements of their analysis.
Wide Dataset:
Format: In a wide dataset, each variable or attribute has its own column, and each observation or data point is a single row. Repeated measures often have their own data column. This representation is typically seen in Excel.
Structure: It typically has a broader structure with many columns, making it easier to read and understand when there are relatively few variables.
Use Cases: Wide datasets are often used for summary or aggregated data, and they are suitable for simple statistical operations like means and sums.
For example, here is a dataset in wide format. Repeated measures (HT1-6) are described in separate columns (e.g. HT1 is the height of the subject measured at the end of week 1; HT2 is the height of the subject measured at the end of week 2 etc.).
ID | TREATMENT | HT1 | HT2 | HT3 | HT4 | HT5 | HT6 |
01 | A | 12 | 18 | 19 | 26 | 34 | 55 |
02 | A | 10 | 15 | 19 | 24 | 30 | 45 |
03 | B | 11 | 16 | 20 | 25 | 32 | 50 |
04 | B | 9 | 11 | 14 | 22 | 38 | 42 |
Long Dataset:
Format: In a long dataset, there are fewer columns, and the data is organized with multiple rows for each unique combination of variables. Typically, you have columns for “variable,” “value,” and potentially other categorical identifiers.
Structure: It is more compact and vertically oriented, making it easier to work with when you have a large number of variables or need to perform complex data transformations.
Use Cases: Long datasets are well-suited for storing and analyzing data with multiple measurements or observations over time or across different categories. They facilitate advanced statistical analyses like regression and mixed-effects modeling. In Excel you can use pivot tables to view summary statistics of long datasets.
For example, here is some of the same data represented in a long format. Compared to the wide format, the repeated measures no longer have separate columns; instead, height is recorded in a single HEIGHT column and the weeks (1-6) are recorded in a WEEK column.
ID | TREATMENT | WEEK | HEIGHT |
01 | A | 1 | 12 |
01 | A | 2 | 18 |
01 | A | 3 | 19 |
01 | A | 4 | 26 |
01 | A | 5 | 34 |
01 | A | 6 | 55 |
Long format is the better choice when documenting data with a schema, as it is easier to document and clearer to understand.
For example, the column headers (attributes) in the wide format are repetitive, which results in duplicated documentation. Wide format is also less flexible, as each additional week needs an additional column and therefore another attribute described in the schema. This means that each time you add a variable you change the structure of the capture base of the schema, reducing interoperability.
Documenting a schema in long format is more flexible because it is more general. This makes the schema reusable for other experiments, either by the researcher or by others. It is also easier to reuse the data and combine it with similar experiments.
At the time of analysis, the data can be transformed from long to wide if necessary and many data analysis programs have specialized functions that help researchers with this task.
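As a sketch of that transformation in one common analysis environment (pandas in Python; tools such as R's tidyr offer equivalent functions), the wide example above can be reshaped to long format and back:

```python
import pandas as pd

# The wide-format example from above: one column per weekly height measurement.
wide = pd.DataFrame({
    "ID": ["01", "02", "03", "04"],
    "TREATMENT": ["A", "A", "B", "B"],
    "HT1": [12, 10, 11, 9], "HT2": [18, 15, 16, 11], "HT3": [19, 19, 20, 14],
    "HT4": [26, 24, 25, 22], "HT5": [34, 30, 32, 38], "HT6": [55, 45, 50, 42],
})

# Wide -> long: melt the HT1-HT6 columns into WEEK and HEIGHT columns.
long = wide.melt(id_vars=["ID", "TREATMENT"], var_name="WEEK", value_name="HEIGHT")
long["WEEK"] = long["WEEK"].str.replace("HT", "", regex=False).astype(int)

# Long -> wide again, if a particular analysis needs that shape.
back_to_wide = long.pivot(index=["ID", "TREATMENT"], columns="WEEK", values="HEIGHT")
print(long.head())
```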
Written by Carly Huitema
Overlays Capture Architecture (OCA) is a structured way to describe data schemas. By making a schema a collection of separate functional parts bundled together, you can introduce the benefits of modular design. A modular schema means you can have multiple parties using their own expertise for separate features of a schema. For example, you could have someone with ontology experience annotate your schema to connect it to ontologies, you could have another person contributing the data validation rules, and you could have your subject matter expert describe in detail how to understand data for each attribute. All these parts come together into a single, useful schema that can perform as many (or as few) functions as a researcher needs. A modular design also means you can reuse and recombine parts from other schemas.
The OCA schema architecture can be remixed and redisplayed into multiple formats which have different functionalities.
OCA Excel Template – the first step in the development of the OCA standard, the Excel Template lets users write their schema using Excel. This Excel file is then read by the OCA parser to make the other OCA formats. While it can be convenient to write a schema in Excel, it requires a separate tutorial to learn the Excel syntax. As the OCA Ecosystem evolves we will be moving away from using the OCA Excel Template.
OCA Bundle – this format is machine-readable and packaged as a .zip file. If you open the zip file you can see the separate documents that together describe all the overlays and the capture base of a schema. Each file is written in JSON, an open standard file format for data exchange. The contents can look a little tricky to understand because they don’t have any line breaks to help people read the text, but you can find lots of online tools for viewing a JSON file in a more human-readable way (a small inspection sketch follows this list of formats).
OCA Readme – the readme format takes the schema content and puts it into a human readable and archivable plain text format. OCA Readme represents the contents of a schema in a way that is accessible now and into the future because of its technological simplicity. It is a lengthy but complete document of schema features and eminently suitable for sharing and archiving alongside the machine-readable OCA Bundle.
OCA File – this is a somewhat hidden representation of OCA schema data. It is the core data format of an OCA schema ecosystem. It is the format stored in an OCA Repository and it tracks the history of the schema which is the key to enabling rich search. If OCA Bundle is like a compiled program, OCA File is like the source code.
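Returning to the OCA Bundle: as a small sketch using only Python's standard library (the bundle file name is hypothetical, and no assumptions are made about the names of the JSON files inside), you can pretty-print every JSON file in a bundle to inspect it yourself:

```python
import json
import zipfile

# Hypothetical path to an OCA Bundle downloaded from the Semantic Engine.
bundle_path = "my_schema_bundle.zip"

with zipfile.ZipFile(bundle_path) as bundle:
    for name in bundle.namelist():
        if not name.endswith(".json"):
            continue
        with bundle.open(name) as f:
            overlay = json.load(f)
        # Re-serialize with indentation so the capture base and overlays are readable.
        print(f"--- {name} ---")
        print(json.dumps(overlay, indent=2, ensure_ascii=False))
```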
These different representations of the content of an OCA Schema all serve different purposes. This highlights the flexibility of OCA for representing schemas. Information can be presented in either human- or machine-readable formats, or stored in a way to connect the history of the schema to enable better search. All are connected through the schema identifier (the SAID) and all represent the exact same content.
Written by Carly Huitema
In the realm of data analysis, having well-structured and clear data is fundamental for meaningful insights. Data schemas provide a blueprint for understanding and organizing datasets. In today’s digital landscape, where automation and machine-readability are vital, machine-actionable schemas offer enhanced utility. This blog post will guide you through the process of creating a machine-actionable schema using SemanticEngine.org, a user-friendly platform. We’ll explore how to develop such schemas step by step and the benefits of utilizing the Overlays Capture Architecture (OCA) for better data management.
The Practicality of Machine-Actionable Schemas
Think of a schema as a roadmap that navigates you through your data landscape. Machine-actionable schemas take this up a notch by making the schema easily understandable for computers. The advantages range from ensuring data accuracy to streamlining data integration and analysis.
Getting Started: Crafting Your Machine-Actionable Schema
SemanticEngine.org simplifies schema creation through a tutorial on crafting an Excel Template schema. Before you begin, have a clear picture of your dataset or the data you intend to collect. Decide on concise attribute names, avoiding spaces and complex characters to maintain clarity and consistency.
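As a small illustration (the file name, the verbose column headers, and the short attribute names below are all invented), tidying spreadsheet headers into concise attribute names before building a schema could look like this:

```python
import pandas as pd

# Hypothetical input file with verbose, inconsistent column headers.
raw = pd.read_csv("field_trial.csv")

# Invented mapping from verbose headers to concise attribute names
# without spaces or special characters.
concise_names = {
    "Plant Height (cm) - Week 1": "ht_cm_wk1",
    "Treatment Group": "treatment",
    "Plot Identifier #": "plot_id",
}

tidy = raw.rename(columns=concise_names)
tidy.to_csv("field_trial_tidy.csv", index=False)
```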
Overlays Capture Architecture (OCA) is the language that SemanticEngine.org uses to express your schema. It brings flexibility to schema design, such as having descriptive labels for your schema attributes. These labels provide context and descriptive information, ensuring accessibility for diverse users and enriching the schema’s comprehensibility. OCA also has features such as adding units to your schema attributes, descriptions to help users understand attributes, and more.
A Gradual Approach to Schema Improvement
The flexibility of OCA enables you to start with a basic schema and add details as your project progresses. This adaptable approach accommodates evolving project needs without overwhelming you with upfront complexity.
Creating the Machine-Actionable Version
SemanticEngine.org’s parser transforms your Excel Template schema into an official OCA Bundle. This bundle compiles all schema features into machine-readable JSON format, packaged as a .zip file.
Future Developments and the OCA Standard
SemanticEngine.org aims to expand beyond Excel Templates and introduce additional functionalities. The OCA Bundle adheres to the OCA Open Standard hosted by the Human Colossus Foundation. This standardization ensures compatibility and interoperability within the data ecosystem.
Contributing to the OCA Standard
For those interested in shaping OCA’s future, participating in OCA Standard meetings offers an avenue for contribution. By sharing insights, you can contribute to refining this open standard.
SemanticEngine.org empowers researchers with practical tools to create machine-actionable schemas. By simplifying schema creation and incorporating OCA, SemanticEngine.org facilitates efficient, accurate, and collaborative data-driven research. Delve into the world of machine-actionable schemas and optimize your data management with SemanticEngine.org.
Written by Carly Huitema