Overlays Capture Architecture

When you use the Semantic Engine to create a schema, one of the first things you are asked to do is to classify your schema.

An example of adding a schema classification to a schema using the Semantic Engine.

It might seem simple, but as you move further from your domain, what seems like an obvious classification to you may not be so obvious to people outside of your specialty. For example, if someone is talking about a bar, there are multiple meanings depending on the context: it could be a place for socializing and drinking, or it could be the licensing exam for lawyers.

With the addition of machine learning and machine-assisted searching, it is even more important to add contextual cues to our information to help machines produce more reasonable responses.

A recent publication from the Canadian Federated Research Data Repository (FRDR) demonstrated the challenge they had with automated metadata (e.g. classification) reconciliation. A working group investigated how to build an automated or semi-automated workflow to reconcile metadata keywords from harvested datasets. The majority of their term reconciliation work could not be automated. Ultimately FRDR chose to abandon the assignment of standardized terms to metadata records. The downstream impact is that relevant datasets may not appear in relevant searches, and researchers will miss opportunities to find and potentially reuse data.

The Semantic Engine supports the findability and categorization of schemas through the addition of schema classifications using the controlled vocabulary of Statistics Canada, specifically the Canadian Research and Development Classification (CRDC) 2020 Version 1.0 – Field of Research (FOR). When you enter your schema classification you are using one of the terms from this controlled list.

Ultimately, by classifying your schema you help ensure that both machines and people can better understand and find your schema and be more confident that they are using it for its intended purpose.

Written by Carly Huitema

Recommendation: long format is recommended for datasets documented with a schema because it is more general, more flexible, and easier to reuse, as explained below.

Data must help answer specific questions or meet specific goals and that influences the way the data can be represented. For example, analysis often depends on data in a specific format, generally referred to as wide vs long format. Wide datasets are more intuitive and easier to grasp when there are relatively few variables, while long datasets are more flexible and efficient for managing complex, structured data with many variables or repeated measures. Researchers and data analysts often transform data between these formats based on the requirements of their analysis.

Wide Dataset:

Format: In a wide dataset, each variable or attribute has its own column, and each observation or data point is a single row. Repeated measures often have their own data column. This representation is typically seen in Excel.

Structure: It typically has a broader structure with many columns, making it easier to read and understand when there are relatively few variables.

Use Cases: Wide datasets are often used for summary or aggregated data, and they are suitable for simple statistical operations like means and sums.

For example, here is a dataset in wide format. Repeated measures (HT1-6) are described in separate columns (e.g. HT1 is the height of the subject measured at the end of week 1; HT2 is the height of the subject measured at the end of week 2 etc.).

ID  TREATMENT  HT1  HT2  HT3  HT4  HT5  HT6
01  A           12   18   19   26   34   55
02  A           10   15   19   24   30   45
03  B           11   16   20   25   32   50
04  B            9   11   14   22   38   42

Long Dataset:

Format: In a long dataset, there are fewer columns, and the data is organized with multiple rows for each unique combination of variables. Typically, you have columns for “variable,” “value,” and potentially other categorical identifiers.

Structure: It is more compact and vertically oriented, making it easier to work with when you have a large number of variables or need to perform complex data transformations.

Use Cases: Long datasets are well-suited for storing and analyzing data with multiple measurements or observations over time or across different categories. They facilitate advanced statistical analyses like regression and mixed-effects modeling. In Excel you can use pivot tables to view summary statistics of long datasets.

For example, here is some of the same data represented in long format. Compared to the wide format, the repeated measures no longer have separate columns: the heights are recorded in a single HEIGHT column, and the weeks (1-6) are recorded in a WEEK column.

ID  TREATMENT  WEEK  HEIGHT
01  A             1      12
01  A             2      18
01  A             3      19
01  A             4      26
01  A             5      34
01  A             6      55

Long format is the better choice when choosing a format to document with a schema, as it is easier to document and clearer to understand.

For example, the column headers (attributes) in the wide format are repetitive, which results in duplicated documentation. Wide format is also less flexible: each additional week needs an additional column, and therefore another attribute described in the schema. Each time you add a variable you change the structure of the schema's capture base, reducing interoperability.

Documenting data in long format is more flexible because the schema is more general. This makes the schema reusable for other experiments, either by the researcher or by others. It is also easier to reuse the data and combine it with similar experiments.

At the time of analysis, the data can be transformed from long to wide if necessary, and many data analysis programs have specialized functions that help researchers with this task.
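To make this concrete, here is a minimal sketch of the transformation in Python using the pandas library (one tool among many; R's pivot_longer and pivot_wider play the same role). The data frame mirrors the wide-format example table above:

import pandas as pd

# The wide-format example: one column (HT1-HT6) per weekly height measure.
wide = pd.DataFrame({
    "ID": ["01", "02", "03", "04"],
    "TREATMENT": ["A", "A", "B", "B"],
    "HT1": [12, 10, 11, 9],
    "HT2": [18, 15, 16, 11],
    "HT3": [19, 19, 20, 14],
    "HT4": [26, 24, 25, 22],
    "HT5": [34, 30, 32, 38],
    "HT6": [55, 45, 50, 42],
})

# Wide -> long: melt the HT1-HT6 columns into WEEK/HEIGHT pairs.
long = wide.melt(id_vars=["ID", "TREATMENT"], var_name="WEEK", value_name="HEIGHT")
long["WEEK"] = long["WEEK"].str[2:].astype(int)  # "HT3" -> 3
long = long.sort_values(["ID", "WEEK"]).reset_index(drop=True)

# Long -> wide again, for an analysis that expects one column per week.
wide_again = long.pivot(index=["ID", "TREATMENT"], columns="WEEK", values="HEIGHT")

# The long format also makes pivot-table-style summaries one-liners:
print(long.groupby("TREATMENT")["HEIGHT"].mean())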

Written by Carly Huitema

Overlays Capture Architecture (OCA) is a structured way to describe data schemas. By making a schema a collection of separate functional parts bundled together, you can introduce the benefits of modular design. A modular schema means you can have multiple parties using their own expertise for separate features of a schema. For example, you could have someone with ontology experience annotate your schema to connect it to ontologies, you could have another person contributing the data validation rules, and you could have your subject matter expert describe in detail how to understand data for each attribute. All these parts come together into a single, useful schema that can perform as many (or as few) functions as a researcher needs. A modular design also means you can reuse and recombine parts from other schemas.

The OCA schema architecture can be remixed and redisplayed in multiple formats, each with different functionality.

OCA Excel Template – the first step in the development of the OCA standard, the Excel Template lets users write their schema using Excel. This Excel file is then read by the OCA parser to make the other OCA formats. While it can be convenient to write a schema in Excel, it requires a separate tutorial to learn the Excel syntax. As the OCA Ecosystem evolves we will be moving away from using the OCA Excel Template.

OCA Bundle – this format is machine-readable and packaged as a .zip file. If you open the zip file you can see the separate documents that together describe all the overlays and the capture base of a schema. Each file is written in JSON, an open standard file format for data exchange. The files can look tricky to read because they contain no line breaks, but you can find many online tools for viewing the contents of a JSON file in a more human-readable way.
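Because the bundle is simply a zip of JSON files, a few lines of Python are enough to add those line breaks back yourself. This is an illustrative sketch, and the bundle filename is a placeholder for your own downloaded bundle:

import json
import zipfile

bundle_path = "my_schema_bundle.zip"  # placeholder: path to a downloaded OCA Bundle

with zipfile.ZipFile(bundle_path) as bundle:
    for name in bundle.namelist():
        if name.endswith(".json"):
            document = json.loads(bundle.read(name))
            print(f"--- {name} ---")
            print(json.dumps(document, indent=2))  # restore line breaks and indentation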

OCA Readme – the readme format takes the schema content and puts it into a human-readable and archivable plain text format. OCA Readme represents the contents of a schema in a way that is accessible now and into the future because of its technological simplicity. It is a lengthy but complete document of schema features and is eminently suitable for sharing and archiving alongside the machine-readable OCA Bundle.

OCA File – this is a somewhat hidden representation of OCA schema data. It is the core data format of an OCA schema ecosystem. It is the format stored in an OCA Repository, and it tracks the history of the schema, which is the key to enabling rich search. If OCA Bundle is like a compiled program, OCA File is like the source code.

These different representations of the content of an OCA Schema all serve different purposes. This highlights the flexibility of OCA for representing schemas. Information can be presented in either human- or machine-readable formats, or stored in a way to connect the history of the schema to enable better search. All are connected through the schema identifier (the SAID) and all represent the exact same content.

Written by Carly Huitema

In the realm of data analysis, having well-structured and clear data is fundamental for meaningful insights. Data schemas provide a blueprint for understanding and organizing datasets. In today’s digital landscape, where automation and machine-readability are vital, machine-actionable schemas offer enhanced utility. This blog post will guide you through the process of creating a machine-actionable schema using SemanticEngine.org, a user-friendly platform. We’ll explore how to develop such schemas step by step and the benefits of utilizing the Overlays Capture Architecture (OCA) for better data management.

The Practicality of Machine-Actionable Schemas

Think of a schema as a roadmap that navigates you through your data landscape. Machine-actionable schemas take this up a notch by making the schema easily understandable for computers. The advantages range from ensuring data accuracy to streamlining data integration and analysis.

Getting Started: Crafting Your Machine-Actionable Schema

SemanticEngine.org simplifies schema creation through a tutorial on crafting an Excel Template schema. Before you begin, have a clear picture of your dataset or the data you intend to collect. Decide on concise attribute names, avoiding spaces and complex characters to maintain clarity and consistency.
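As a small illustration of that advice, the hypothetical Python helper below (not part of SemanticEngine.org) reduces a raw column header to a concise, consistent attribute name:

import re

def tidy_attribute_name(raw: str) -> str:
    # Replace spaces with underscores, then drop any remaining
    # characters that are not letters, digits, or underscores.
    name = raw.strip().replace(" ", "_")
    return re.sub(r"[^0-9A-Za-z_]", "", name)

print(tidy_attribute_name("Height (end of week)"))  # prints: Height_end_of_week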

Overlays Capture Architecture (OCA) is the language that SemanticEngine.org uses to express your schema. It brings flexibility to schema design, such as adding descriptive labels to your schema attributes. These labels provide context and descriptive information, ensuring accessibility for diverse users and enriching the schema's comprehensibility. OCA also supports features such as units for your schema attributes, descriptions that help users understand each attribute, and more.
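To make this concrete, here is an illustrative sketch of what a label overlay and a unit overlay might contain, written as Python dictionaries. The field names are loosely modeled on the OCA specification and the SAID value is a placeholder, so treat this as a sketch and check your parser output or the OCA Open Standard for the exact layout:

import json

# Placeholder: real overlays reference the capture base by its SAID.
capture_base_said = "EXAMPLE_CAPTURE_BASE_SAID"

# A label overlay attaches human-readable labels to attributes.
label_overlay = {
    "type": "spec/overlays/label/1.0",
    "capture_base": capture_base_said,
    "language": "en",
    "attribute_labels": {
        "HEIGHT": "Plant height at the end of the week",
        "WEEK": "Week of measurement",
    },
}

# A unit overlay records the measurement unit for each attribute.
unit_overlay = {
    "type": "spec/overlays/unit/1.0",
    "capture_base": capture_base_said,
    "attribute_units": {"HEIGHT": "cm"},
}

print(json.dumps([label_overlay, unit_overlay], indent=2))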

A Gradual Approach to Schema Improvement

The flexibility of OCA enables you to start with a basic schema and add details as your project progresses. This adaptable approach accommodates evolving project needs without overwhelming you with upfront complexity.

Creating the Machine-Actionable Version

SemanticEngine.org's parser transforms your Excel Template schema into an official OCA Bundle. This bundle compiles all schema features into machine-readable JSON format, packaged as a .zip file.

A pathway for working with schemas: first write your schema template in Excel, then parse the template into your machine-readable .zip bundle at semanticengine.org. You can keep the .zip bundle and the Excel template together with your data and share them alongside it. You can also deposit the .zip and Excel template schemas in Borealis through the library process, which gives you a published schema with a DOI.

Future Developments and the OCA Standard

SemanticEngine.org aims to expand beyond Excel Templates and introduce additional functionalities. The OCA Bundle adheres to the OCA Open Standard hosted by the Human Colossus Foundation. This standardization ensures compatibility and interoperability within the data ecosystem.

Contributing to the OCA Standard

For those interested in shaping OCA’s future, participating in OCA Standard meetings offers an avenue for contribution. By sharing insights, you can contribute to refining this open standard.

SemanticEngine.org empowers researchers with practical tools to create machine-actionable schemas. By simplifying schema creation and incorporating OCA, SemanticEngine.org facilitates efficient, accurate, and collaborative data-driven research. Delve into the world of machine-actionable schemas and optimize your data management with SemanticEngine.org.

Written by Carly Huitema