Schema

You are using a dataset and you come across some missing values, or unusual entries like NA or ND. What do these mean? Why is there data missing?

Quality indicators for individual measurements or observations can be included directly in your dataset; they help users of your data understand it and use it correctly.

Rather than storing a quality indicator directly within a data column (for example, a text entry such as NULL or NA in what should be a numerical list of measurements), you should pair your measurement column with a quality column. This way your measurements are not ‘contaminated’ with non-numerical indicators that would interfere with analysis.

Documenting your schema with OCA using the Semantic Engine gives you many helpful tools for managing and communicating data quality, such as the ability to add entry codes and descriptions to your schema so that others can interpret your data and coding.

To begin adding quality information, you can create a companion column for a measurement column and append _qual to the variable name (or use something similar that you will recognize and document). In this ‘var_qual’ column you can record quality information about the measurements held in the associated ‘var’ column.

Next, to ensure that your ‘var_qual’ column has consistent data entry, you can use a system of entry codes. For example, you can set the DataType of the var_qual attribute to numerical and mark it as a list (that is, use entry codes). When you reach the Entry Codes screen you can enter the following table, adjusting it to suit your needs.

Examples of data entry error codes

You can continue to expand on data quality attributes as your analysis progresses. For example, you could create additional quality columns that specify the reasons for rejecting values from analysis (e.g. ‘confirmed outlier’ might be one of the entry code labels for this new quality column). Schemas for the different stages of collection and analysis would fit into your data organization structure, such as within the codebook folders of the TIER Protocol.

An example of a dataset that scores the quality of data would be as follows:

Sample of data showing the measurement and associated quality column measurement_qual

Numbers that are unusual because of equipment errors or other reasons can be flagged and dealt with appropriately during the data analysis cycles.
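As an illustration of how this pairing works during analysis, here is a minimal sketch in Python using pandas. The column names, quality codes, and values are hypothetical examples, not a prescribed coding scheme:

```python
import pandas as pd

# Hypothetical quality entry codes; your own entry code table may differ.
QUALITY_CODES = {
    0: "accepted value",
    1: "missing",
    2: "below detection limit",
    3: "suspected equipment error",
}

# Example data: the measurement column stays purely numeric because the
# quality information lives in its paired measurement_qual column.
df = pd.DataFrame({
    "sample_id": [1, 2, 3, 4],
    "measurement": [12.4, 11.9, None, 250.0],
    "measurement_qual": [0, 0, 1, 3],
})

# During analysis, flagged values are easy to exclude or inspect.
accepted = df[df["measurement_qual"] == 0]
print(accepted["measurement"].mean())             # mean of accepted values only
print(df["measurement_qual"].map(QUALITY_CODES))  # human-readable quality labels
```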

Written by Carly Huitema

When you document your data schema using the Semantic Engine, you are writing a schema in the schema language of Overlays Capture Architecture (OCA). Let’s do a deeper dive into one of the features of OCA.

Attributes in OCA

When you start using the Semantic Engine, you can either drag and drop your dataset, or you can begin to manually add attributes. Attributes are the names of the columns in your dataset, which should match the variable names in your experiments. In fact, when you drag a dataset into the Semantic Engine, the engine reads the first line of your data, assumes it contains your column headers, and uses those headers to create the list of attributes in your schema.

A list of attributes displayed in the Semantic Engine.

If you are entering your attributes manually, they should match the column headers of your dataset.
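Conceptually, what the Semantic Engine does when you drop a file is similar to reading just the header row of a CSV file. A minimal sketch of that idea (the file name is hypothetical):

```python
import csv

# Read only the header row; each header becomes a candidate attribute name.
with open("field_trial.csv", newline="") as f:
    attributes = next(csv.reader(f))

print(attributes)  # e.g. ['ID', 'TREATMENT', 'WEEK', 'HEIGHT']
```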

Labels in OCA

A few screens into the Semantic Engine you will be asked to add language-specific labels. This might be a bit confusing if you’ve already entered your attributes and you only have one language! The attribute labels and the English language labels might even look exactly the same – and this is OK!

Adding English attribute labels in the Semantic Engine.

The attributes and their corresponding labels may be the same (or very close), but sometimes they can be very different, especially if the column names are very cryptic. This is your chance to give your data more human-readable labels while still preserving the underlying data structure.

Attributes and English labels can sometimes be the same or similar.

Internationalization with OCA

By supporting labels as well as attributes, OCA is able to support internationalization. This means that many people can use the same schema while having helpful information provided to them in their own language.

An example of attributes where they are not written in English.

Your labels may not be very different from your attribute names if both are in English, but the ability to give labels to attributes will help make your schema more accessible in other languages. All you will need to do is edit your schema in the Semantic Engine and add additional languages.
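To illustrate the idea, here is a small sketch of how one attribute can carry labels in more than one language. The attribute names and translations are invented examples, not the internal OCA representation:

```python
# Hypothetical language-specific labels for two attributes.
labels = {
    "HEIGHT": {"en": "Plant height (cm)", "fr": "Hauteur de la plante (cm)"},
    "TREATMENT": {"en": "Treatment group", "fr": "Groupe de traitement"},
}

def label_for(attribute: str, language: str) -> str:
    """Return the label for an attribute in the requested language,
    falling back to the attribute name itself if no label exists."""
    return labels.get(attribute, {}).get(language, attribute)

print(label_for("HEIGHT", "fr"))  # Hauteur de la plante (cm)
print(label_for("HEIGHT", "en"))  # Plant height (cm)
```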

Written by Carly Huitema

Maintaining data quality can be a constant challenge. One effective solution is the use of entry codes. Let’s explore what entry codes entail, why they are crucial for clean data, and how they are seamlessly integrated into the Semantic Engine using Overlays Capture Architecture (OCA).

 

Understanding Entry Codes in Data Entry

Entry codes serve as structured identifiers in data entry, offering a systematic approach to input data. Instead of allowing free-form text, entry codes limit choices to a predefined list, ensuring consistency and accuracy in the dataset.

 

The Need for Clean Data

Data cleanliness is essential for meaningful analysis and decision-making. Without restrictions on data entry, datasets often suffer from various spellings and abbreviations of the same terms, leading to confusion and misinterpretation.

 

Practical Examples of Entry Code Implementation

Consider scenarios in scientific research where specific information, such as research locations, gene names, or experimental conditions, needs to be recorded. Entry codes provide a standardized framework, reducing the likelihood of inconsistent data entries.

 

Overcoming Cleanup Challenges

In the past, when working with datasets lacking entry codes, manual cleanup or tools like OpenRefine were essential. OpenRefine is a useful data cleaning tool that lets users standardize data after collection has been completed.

 

Leveraging OCA for Improved Data Management

Overlays Capture Architecture (OCA) takes entry codes a step further by allowing the creation of lists to limit data entry choices. Invalid entries, those not on the predefined list (entry code list), are easily identified, enhancing the overall quality of the dataset.

 

Language-specific Labels in OCA

OCA introduces a noteworthy feature – language-specific labels for entry codes. In instances like financial data entry, where numerical codes may be challenging to remember, users can associate user-friendly labels (e.g., account names) with numerical entry codes. This ensures ease of data entry without compromising accuracy.

An example of adding entry codes to a schema using the Semantic Engine.

Multilingual Support for Global Usability

OCA’s multilingual support adds a layer of inclusivity, enabling the incorporation of labels in multiple languages. This feature facilitates international collaboration, allowing users worldwide to engage with the dataset in a language they are comfortable with.

 

Crafting Acceptable Data Entries in OCA

When creating lists in OCA, users define acceptable data entries for specific attributes. Labels accompanying entry codes aid users in understanding and selecting the correct code, contributing to cleaner datasets.

 

Clarifying the Distinction between Labels and Entry Codes

It’s important to note that, in OCA, the emphasis is on entry codes rather than labels. While labels provide user-friendly descriptions, it is the entry code itself that becomes part of the dataset, ensuring data uniformity.
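A small sketch can make this distinction concrete. Using the financial example from above, the account codes and labels below are invented for illustration; they are not an official chart of accounts or the OCA overlay format itself:

```python
# Hypothetical entry codes for an 'account' attribute. The code is what gets
# stored in the dataset; the labels only help the person entering the data.
ACCOUNT_CODES = {
    "4010": {"en": "Tuition revenue", "fr": "Revenus de scolarité"},
    "5020": {"en": "Lab supplies", "fr": "Fournitures de laboratoire"},
}

def invalid_entries(column):
    """Return the values that are not on the predefined entry code list."""
    return [value for value in column if value not in ACCOUNT_CODES]

data_column = ["4010", "5020", "Lab supplies"]  # the last entry is a label, not a code
print(invalid_entries(data_column))             # ['Lab supplies'] -> flagged as invalid
```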

 

In conclusion, entry codes play an important role in streamlining data entry and enhancing the quality of datasets. Through the practical implementation of entry codes supported by Overlays Capture Architecture, organizations can ensure that their data remains accurate, consistent, and accessible on a global scale.

When you use the Semantic Engine to create a schema, one of the first things you are asked to do is to classify your schema.

An example of adding a schema classification to a schema using the Semantic Engine.

It might seem simple, but as you move further from your domain, what seems like an obvious classification to you may not be so obvious to people outside of your specialty. For example, if someone is talking about a bar there are multiple meanings depending on the context. It could be the location for socialization and drinking, or it could be the exam for lawyers.

With the addition of machine learning and machine-assisted searching, it is even more important to add contextual cues to our information to help machines produce more reasonable responses.

A recent publication from the Canadian Federated Research Data Repository (FRDR) demonstrated the challenge they had with automated metadata (e.g. classification) reconciliation. A working group investigated how to build an automated or semi-automated workflow to reconcile metadata keywords from harvested datasets. The majority of their term reconciliation work could not be automated. Ultimately FRDR chose to abandon the assignment of standardized terms to metadata records. The downstream impact means relevant datasets may not appear in relevant searches and research will miss out on opportunities to find and potentially reuse data.

The Semantic Engine supports the findability and categorization of schemas through the addition of schema classifications using the controlled vocabulary of Statistics Canada, specifically the Canadian Research and Development Classification (CRDC) 2020 Version 1.0 – Field of Research (FOR). When you enter your schema classification you are using one of the terms from this controlled list.

Ultimately, by classifying your schema you help ensure that both machines and people can better understand and find your schema and be more confident that they are using it for its intended purpose.

Written by Carly Huitema

Recommendation: use the long format of datasets for schema documentation. Long format is more flexible because it is more general: the schema is reusable for other experiments, either by the researcher or by others, and it is easier to reuse the data and combine it with similar experiments.

Data must help answer specific questions or meet specific goals and that influences the way the data can be represented. For example, analysis often depends on data in a specific format, generally referred to as wide vs long format. Wide datasets are more intuitive and easier to grasp when there are relatively few variables, while long datasets are more flexible and efficient for managing complex, structured data with many variables or repeated measures. Researchers and data analysts often transform data between these formats based on the requirements of their analysis.

Wide Dataset:

Format: In a wide dataset, each variable or attribute has its own column, and each observation or data point is a single row. Repeated measures often have their own data column. This representation is typically seen in Excel.

Structure: It typically has a broader structure with many columns, making it easier to read and understand when there are relatively few variables.

Use Cases: Wide datasets are often used for summary or aggregated data, and they are suitable for simple statistical operations like means and sums.

For example, here is a dataset in wide format. Repeated measures (HT1-6) are described in separate columns (e.g. HT1 is the height of the subject measured at the end of week 1; HT2 is the height of the subject measured at the end of week 2 etc.).

ID TREATMENT HT1 HT2 HT3 HT4 HT5 HT6
01 A 12 18 19 26 34 55
02 A 10 15 19 24 30 45
03 B 11 16 20 25 32 50
04 B 9 11 14 22 38 42

 

Long Dataset:

Format: In a long dataset, there are fewer columns, and the data is organized with multiple rows for each unique combination of variables. Typically, you have columns for “variable,” “value,” and potentially other categorical identifiers.

Structure: It is more compact and vertically oriented, making it easier to work with when you have a large number of variables or need to perform complex data transformations.

Use Cases: Long datasets are well-suited for storing and analyzing data with multiple measurements or observations over time or across different categories. They facilitate advanced statistical analyses like regression and mixed-effects modeling. In Excel you can use pivot tables to view summary statistics of long datasets.

For example, here is some of the same data represented in a long format. Compared to the wide format, the repeated measures no longer have separate columns; instead, height is recorded in a single ‘HEIGHT’ column and the weeks (1-6) are recorded in a ‘WEEK’ column.

ID TREATMENT WEEK HEIGHT
01 A 1 12
01 A 2 18
01 A 3 19
01 A 4 26
01 A 5 34
01 A 6 55

 

Long format data is the better choice when selecting a format to document with a schema, as it is easier to document and clearer to understand.

For example, column headers (attributes) in the wide format are repetitive, and this results in duplicated documentation. The wide format is also less flexible, as each additional week needs an additional column and therefore another attribute described in the schema. This means that each time you add a variable you change the structure of the capture base of the schema, reducing interoperability.

Documenting a schema in long format is more flexible because it is more general. This makes the schema reusable for other experiments, either by the researcher or by others. It is also easier to reuse the data and combine it with similar experiments.

At the time of analysis, the data can be transformed from long to wide if necessary and many data analysis programs have specialized functions that help researchers with this task.
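For example, here is a minimal sketch of that transformation in Python using pandas, based on the wide-format table shown above (the column-name handling is just one possible approach):

```python
import pandas as pd

# The wide-format example from above.
wide = pd.DataFrame({
    "ID": ["01", "02", "03", "04"],
    "TREATMENT": ["A", "A", "B", "B"],
    "HT1": [12, 10, 11, 9], "HT2": [18, 15, 16, 11], "HT3": [19, 19, 20, 14],
    "HT4": [26, 24, 25, 22], "HT5": [34, 30, 32, 38], "HT6": [55, 45, 50, 42],
})

# Wide -> long: one row per ID and week combination.
long = wide.melt(id_vars=["ID", "TREATMENT"], var_name="WEEK", value_name="HEIGHT")
long["WEEK"] = long["WEEK"].str.replace("HT", "", regex=False).astype(int)

# Long -> wide again, for an analysis that expects one column per week.
back_to_wide = long.pivot(index=["ID", "TREATMENT"], columns="WEEK", values="HEIGHT")
print(long.head())
```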

 

Written by: Carly Huitema

Overlays Capture Architecture (OCA) is a structured way to describe data schemas. By making a schema a collection of separate functional parts bundled together, you can introduce the benefits of modular design. A modular schema means you can have multiple parties using their own expertise for separate features of a schema. For example, you could have someone with ontology experience annotate your schema to connect it to ontologies, you could have another person contributing the data validation rules, and you could have your subject matter expert describe in detail how to understand data for each attribute. All these parts come together into a single, useful schema that can perform as many (or as few) functions as a researcher needs. A modular design also means you can reuse and recombine parts from other schemas.

The OCA schema architecture can be remixed and redisplayed into multiple formats which have different functionalities.

OCA Excel Template – the first step in the development of the OCA standard, the Excel Template lets users write their schema using Excel. This Excel file is then read by the OCA parser to make the other OCA formats. While it can be convenient to write a schema in Excel, it requires a separate tutorial to learn the Excel syntax. As the OCA Ecosystem evolves we will be moving away from using the OCA Excel Template.

OCA Bundle – this format is machine-readable and comes as a .zip file. If you open the zip file you can see the separate documents that together describe all the overlays and the capture base of a schema. Each file is written in JSON, an open standard file format for data exchange. It can look a little tricky to understand because it doesn’t have any line breaks to help people read the text, but you can find many online tools for viewing the contents of a JSON file in a more human-readable way.
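If you would rather not use an online viewer, a short Python snippet can pretty-print any of the JSON files from an unzipped bundle. The file name here is only a placeholder for whichever file you extract:

```python
import json

# Pretty-print one JSON file from an unzipped OCA Bundle (hypothetical file name).
with open("capture_base.json") as f:
    schema_part = json.load(f)

print(json.dumps(schema_part, indent=2, ensure_ascii=False))
```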

OCA Readme – the readme format takes the schema content and puts it into a human readable and archivable plain text format. OCA Readme represents the contents of a schema in a way that is accessible now and into the future because of its technological simplicity. It is a lengthy but complete document of schema features and eminently suitable for sharing and archiving alongside the machine-readable OCA Bundle.

OCA File – this is a somewhat hidden representation of OCA schema data. It is the core data format of an OCA schema ecosystem. It is the format stored in an OCA Repository and it tracks the history of the schema which is the key to enabling rich search. If OCA Bundle is like a compiled program, OCA File is like the source code.

These different representations of the content of an OCA Schema all serve different purposes. This highlights the flexibility of OCA for representing schemas. Information can be presented in either human- or machine-readable formats, or stored in a way to connect the history of the schema to enable better search. All are connected through the schema identifier (the SAID) and all represent the exact same content.


Written by Carly Huitema

In the realm of data analysis, having well-structured and clear data is fundamental for meaningful insights. Data schemas provide a blueprint for understanding and organizing datasets. In today’s digital landscape, where automation and machine-readability are vital, machine-actionable schemas offer enhanced utility. This blog post will guide you through the process of creating a machine-actionable schema using SemanticEngine.org, a user-friendly platform. We’ll explore how to develop such schemas step by step and the benefits of utilizing the Overlays Capture Architecture (OCA) for better data management.

The Practicality of Machine-Actionable Schemas

Think of a schema as a roadmap that navigates you through your data landscape. Machine-actionable schemas take this up a notch by making the schema easily understandable for computers. The advantages range from ensuring data accuracy to streamlining data integration and analysis.

Getting Started: Crafting Your Machine-Actionable Schema

SemanticEngine.org simplifies schema creation through a tutorial on crafting an Excel Template schema. Before you begin, have a clear picture of your dataset or the data you intend to collect. Decide on concise attribute names, avoiding spaces and complex characters to maintain clarity and consistency.

Overlays Capture Architecture (OCA) is the language that SemanticEngine.org uses to express your schema. It brings flexibility to schema design, such as adding descriptive labels for your schema attributes. These labels provide context and descriptive information, ensuring accessibility for diverse users and enriching the schema’s comprehensibility. OCA also has features such as adding units to your schema attributes, descriptions that help users understand attributes, and more.

A Gradual Approach to Schema Improvement

The flexibility of OCA enables you to start with a basic schema and add details as your project progresses. This adaptable approach accommodates evolving project needs without overwhelming you with upfront complexity.

Creating the Machine-Actionable Version

SemanticEngine.org’s parser transforms your Excel Template schema into an official OCA Bundle. This bundle compiles all schema features into machine-readable JSON format, packaged as a .zip file.

A pathway for working with schemas: first write your schema template in Excel, then parse this template into your machine-readable .zip bundle at semanticengine.org. You can save the .zip bundle and the Excel template together with your data and share them alongside your data. You can also deposit the .zip and Excel template schemas in Borealis through the library process. You will then have a published schema with a DOI.

Future Developments and the OCA Standard

SemanticEngine.org aims to expand beyond Excel Templates and introduce additional functionalities. The OCA Bundle adheres to the OCA Open Standard hosted by the Human Colossus Foundation. This standardization ensures compatibility and interoperability within the data ecosystem.

Contributing to the OCA Standard

For those interested in shaping OCA’s future, participating in OCA Standard meetings offers an avenue for contribution. By sharing insights, you can contribute to refining this open standard.

SemanticEngine.org empowers researchers with practical tools to create machine-actionable schemas. By simplifying schema creation and incorporating OCA, SemanticEngine.org facilitates efficient, accurate, and collaborative data-driven research. Delve into the world of machine-actionable schemas and optimize your data management with SemanticEngine.org.

Written by Carly Huitema

In the world of research and data analysis, understanding and structuring data effectively is crucial. The way data is documented and organized can greatly impact its usability and value. This is where data schemas come into play. A data schema is like a roadmap that guides researchers through the intricacies of their datasets, ensuring clarity, consistency, and accurate interpretation. In this blog post, we’ll explore the challenges posed by working with poorly described data, the benefits of well-documented data schemas, and how the University of Guelph’s Semantic Engine is revolutionizing the way researchers create and utilize these schemas.

The Challenge of Poor Data Descriptions

Imagine receiving a dataset with columns of numbers and labels that are cryptic at best. Without proper documentation, interpreting the data becomes a daunting task. Researchers may struggle to understand what each column represents, the units of measurement, and the data types involved. This lack of clarity not only hinders individual research efforts but also makes it challenging to collaborate, share, and replicate findings accurately.

The Need for Well-Documented Data Schemas

To make sense of data, researchers rely on data schemas – structured descriptions that outline the composition and meaning of the dataset. A robust schema provides insights into column labels, data types, units, and relationships between different data elements. By offering this comprehensive view, a well-documented schema ensures that researchers can quickly grasp the essence of the data, minimizing misinterpretations and errors.

Furthermore, data schemas play a pivotal role in data sharing. Researchers often collaborate across disciplines, and clear documentation ensures that the context of the data is communicated effectively. It’s especially valuable when researchers from different domains come together, as they might not be familiar with the conventions and nuances of each other’s fields.

Enter the Semantic Engine: Simplifying Schema Creation

Recognizing the importance of data schemas, the University of Guelph has developed the Semantic Engine – a powerful set of tools designed to help researchers generate machine-accessible meaning for their data. One of the standout features of this engine is its ability to simplify the creation and utilization of data schemas, making it easier for researchers to harness the full potential of their data.

Developed in collaboration with researchers, the Semantic Engine takes a user-friendly approach to schema creation. It provides an intuitive interface that allows researchers to craft data schemas with minimal effort, even if they lack specialized knowledge in schema design.

Machine-Actionable Schemas and Beyond

One of the key advantages of the Semantic Engine is its ability to generate machine-actionable schemas. This means that the schemas created using the engine can be easily understood and processed by computers. This machine-readability comes in handy when verifying data – the schema enforces formatting rules, ensuring that the data aligns with the intended structure.
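As a rough illustration of the kind of check this enables, here is a minimal sketch in Python. The attributes and type rules are invented examples and this is not the actual OCA format; it only shows the idea of verifying data against declared formatting rules:

```python
# Hypothetical formatting rules that a schema might declare for each attribute.
rules = {
    "ID": str,
    "TREATMENT": str,
    "WEEK": int,
    "HEIGHT": float,
}

record = {"ID": "01", "TREATMENT": "A", "WEEK": "three", "HEIGHT": 19.0}

# Report attributes whose values cannot be read as the declared type.
problems = []
for attribute, expected_type in rules.items():
    try:
        expected_type(record[attribute])
    except (KeyError, TypeError, ValueError):
        problems.append(attribute)

print(problems)  # ['WEEK'] -> 'three' cannot be interpreted as an integer
```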

Moreover, having a well-structured schema becomes invaluable when researchers aim to combine datasets. Whether merging data from different sources or conducting meta-analyses, a clear schema facilitates seamless integration, leading to more robust and comprehensive insights.

The Overlays Capture Architecture (OCA) Advantage

At the heart of the Semantic Engine’s schema standard lies the Overlays Capture Architecture (OCA), developed by the Human Colossus Foundation. OCA is an open, international standard that organizes data schemas into layers or overlays. This layered approach provides a detailed and organized representation of the schema, making it easier to comprehend and use.

Each layer of the OCA schema corresponds to a specific feature of the data, and these layers can be added or modified independently. This modularity enhances flexibility and ensures that researchers can adapt their schemas as their research evolves.

Storing and Sharing Schemas

The Semantic Engine not only helps in creating effective data schemas but also facilitates their storage and sharing. Researchers can store the schemas alongside their data, ensuring that the context and structure are preserved. Researchers can also share their schemas when they share their data or deposit their schemas in repositories, assigning them citable identifiers like Digital Object Identifiers (DOIs). This makes the schemas citable and shareable, promoting transparency and reproducibility in research.

In the world of research, the importance of well-documented data cannot be overstated. Clear and structured data schemas empower researchers to unlock the true potential of their datasets, facilitating understanding, collaboration, and meaningful insights. The University of Guelph’s Semantic Engine, with its focus on user-friendly schema creation, machine-actionable designs, and utilization of the Overlays Capture Architecture, is a game-changer in the realm of data schemas. By simplifying schema creation and enhancing data clarity, the Semantic Engine is paving the way for more efficient, accurate, and impactful research across disciplines.

 

Written by Carly Huitema