Data Usage

Imagine this scenario. During her first field season as a principal investigator, a professor watched a graduate student realize—two weeks too late—that no one had recorded soil temperature at the sampling sites. The team had pH, moisture, GPS coordinates… but not the one variable that explained the anomaly in their results. A return trip wasn’t possible. The data gap was permanent.

After that, she changed how her lab collected data.

Instead of relying on ad hoc spreadsheets, she worked with her students to design schemas for their lab’s routine data collection. These weren’t schemas for final data deposit—they were practical structures for the messy, active phase of research. The goal was simple: define in advance what gets collected, how it’s recorded, and which values are allowed.

Researchers can use the Semantic Engine to create the schemas they need at every stage of their research program, from active data collection to final data deposition.

For data collection, once a schema is established, it can be uploaded into the Semantic Engine to generate a Data Entry Excel (DEE) file.

Each DEE contains:

  • A schema description sheet – documentation pulled directly from the schema, including variable definitions and code lists.

  • A data entry sheet – pre-labeled columns that follow the schema rules.

The schema description sheet of a Data Entry Excel.
Data Entry Excel showing the sheet for data entry.

Because the documentation lives in the same file as the data, nothing has to be retyped, reinvented, or remembered from scratch. The schema description sheet also includes code lists that populate the drop-down menus in the data entry sheet, reducing inconsistent terminology and formatting errors.

If the standard schema isn’t sufficient, it can be edited in the Semantic Engine. Researchers can add attributes or adjust fields without rebuilding everything from scratch. The updated schema can then generate a new DEE, preserving previous structure while incorporating the changes.

This approach addresses a common problem: unstructured Excel data. Without standardization, spreadsheets accumulate inconsistent date formats, unit mismatches, ambiguous abbreviations, and missing values. Cleaning that data later is costly and error-prone.

By organizing data entry around a schema:

  • Required information is visible and less likely to be forgotten.

  • Fieldwork becomes more reliable – critical variables are collected the first time.

  • Data from multiple researchers or projects can be harmonized more easily.

  • Manual cleaning and interpretation are reduced.

The generated DEE does not enforce full validation inside Excel (beyond drop-down lists). For formal validation, the completed spreadsheet can be uploaded to the Semantic Engine’s Data Verification tool.
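To illustrate what this kind of schema-based verification involves, here is a minimal plain-Python sketch. The column names and code lists are hypothetical examples, not the Semantic Engine’s actual format or implementation:

```python
# Minimal sketch of schema-based verification: check each data row
# against the code lists defined in a schema.
# Column names and allowed values are hypothetical examples.
schema_code_lists = {
    "Fertilizer": {"Control", "High-N"},
    "Cultivar": {"A", "B"},
}

def verify_rows(rows):
    """Return a list of (row_index, column, bad_value) problems."""
    problems = []
    for i, row in enumerate(rows):
        for column, allowed in schema_code_lists.items():
            value = row.get(column)
            if value not in allowed:
                problems.append((i, column, value))
    return problems

rows = [
    {"Fertilizer": "Control", "Cultivar": "A"},
    {"Fertilizer": "high-n", "Cultivar": "A"},  # wrong case: flagged
]
print(verify_rows(rows))  # -> [(1, 'Fertilizer', 'high-n')]
```

Even this simple check catches the inconsistent-terminology problems (wrong case, typos, out-of-list values) that drop-down menus reduce but cannot fully prevent.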

Using schema-driven Data Entry Excel files turns data structure into a practical research tool. Instead of discovering gaps during analysis, researchers define expectations at the point of collection—when it matters most.

Written by Carly Huitema

Generalist and Specialist Data Repositories

Research data repositories can be described along two important dimensions:

  1. how broad or specialized their scope is, and

  2. where they sit in the research data lifecycle.

Understanding these distinctions helps researchers navigate the technologies and repositories available for research data.

Generalist Repositories

Generalist repositories are designed to accept many different kinds of data across disciplines. They prioritize inclusivity and flexibility, offering a common technical platform where researchers can deposit datasets that do not fit neatly into a single domain.

A useful metaphor is the junk drawer in a kitchen. A junk drawer contains many useful items—batteries, spare cables, elastic bands—but finding a specific item often requires some effort. Similarly, generalist repositories can hold valuable datasets, but those datasets may be described with relatively generic metadata and limited domain-specific structure.

As a result, data in generalist repositories can be:

  • Harder to discover through precise searches

  • More difficult to interpret without additional context

  • Less immediately reusable by domain experts

Examples of generalist repositories include Dataverse (Borealis in Canada), Figshare, and OSF.

Specialist Repositories

Specialist repositories focus on a specific discipline, data type, or research community. They typically enforce domain-specific metadata standards, controlled vocabularies, and structured submission requirements.

Continuing the kitchen metaphor, specialist repositories resemble a cutlery drawer: clearly organized, purpose-built, and easy to use—provided you are looking for the right type of item. Knives go in one place, forks in another, and everything has a defined role.

Because of this structure, specialist repositories tend to make data:

  • More findable through precise, domain-aware search

  • Easier to interpret due to consistent metadata

  • More interoperable with related tools and systems

  • More reusable for future research

In other words, data in specialist repositories are often more FAIR than data in generalist repositories. However, this specialization also limits what they can accept. Many interdisciplinary datasets—particularly in agri-food research—do not align cleanly with the strict models of existing specialist repositories and therefore end up in generalist ones. Examples of specialist repositories include GenBank, PDB, and GEO.

The Research Data Lifecycle: Active and Archival Data

Another important way to think about data repositories is in relation to the research data lifecycle.

Research data typically move through several phases:

  1. Planning and collection

  2. Processing, active analysis and refinement

  3. Publication and dissemination

  4. Long-term preservation and reuse

Repositories are often designed to support either active data or archival data, but not both equally well.

Active Data

Active data are produced and used during the course of research. They may be incomplete, frequently updated, or subject to access restrictions due to confidentiality, sensitivity, or competitive concerns.

This is the phase where data are still being cleaned, analyzed, and interpreted. Changes are expected, and collaboration is often ongoing. Most formal repositories are not designed to support this stage, which is typically handled through local storage, shared drives, or project-specific platforms.

Archival Data

Once research is complete and results have been published, data generally move into an archival phase. At this point, datasets are more stable, less likely to change, and often less sensitive—especially if they have been anonymized or if concerns about being “scooped” no longer apply.

Most well-known repositories, including Dataverse, Figshare, and domain-specific archives such as the Protein Data Bank (PDB), are designed primarily for archival data. Their strengths lie in long-term preservation, persistent identifiers (PIDs like DOIs), citation, and access, rather than supporting ongoing analysis or frequent updates.

Bridging the Gaps

It would be inefficient to build a highly specialized repository for every possible type of dataset—much like building a kitchen with a separate drawer for every object that might otherwise end up in the junk drawer. Instead, a more scalable approach is to improve the organization and description of data held in generalist repositories.

Agri-food Data Canada’s approach focuses on developing tools, guidance, and training that help researchers add structure and context to their data wherever they are deposited. By enhancing metadata quality and enabling interoperability between repositories, it becomes possible to make data in generalist repositories more FAIR—without requiring a proliferation of narrowly specialized infrastructure.

Together, specialist and generalist repositories, along with active and archival data systems, form complementary parts of the research data ecosystem. Recognizing their respective roles helps researchers choose appropriate platforms and supports more effective data reuse over time.

Written by Carly Huitema

How Pivot Tables Simplify Analysis of Field Measurements

If you’re working with agricultural or experimental data, Excel’s pivot tables can make summarizing results quick and intuitive. Instead of manually calculating averages or totals for each treatment, you can let Excel do the heavy lifting—organizing your measurements by treatment, cultivar, and replicate automatically.


The Scenario

Suppose you ran a field experiment measuring plant height under different fertilizer treatments and cultivars.
Your spreadsheet might look like this:

Replicate   Fertilizer   Cultivar   Height (cm)
1           Control      A          42
2           Control      A          45
3           Control      A          43
1           High-N       A          56
2           High-N       A          58
3           High-N       A          57
1           Control      B          39
2           Control      B          41
3           Control      B          40
1           High-N       B          50
2           High-N       B          52
3           High-N       B          51


That’s a lot of data points—especially if you have more cultivars or treatments. A pivot table can summarize this instantly.


Step 1: Create the Pivot Table

  1. Select your data range.
  2. Go to Insert → PivotTable.
  3. In the PivotTable Field List:
    • Drag Fertilizer to Rows.
    • Drag Cultivar to Columns.
    • Drag Height (cm) to Values.

By default, Excel might show the Sum of Height (cm)—but you can change this.


Step 2: Start with a Count Check

Before jumping into averages, it’s good practice to first check that your dataset is complete.
In the Values field, choose Value Field Settings → Count.

This shows how many measurements were recorded for each fertilizer–cultivar combination.
For example:

Pivot table in Excel showing the count for a dataset.

If you notice missing or extra counts, you’ll know your data entry needs review before proceeding. This quick check often catches typos or missing replicates that might otherwise distort your summary statistics.


Step 3: Calculate Averages and Standard Deviations

Once the counts look correct:

  • Change the Values field to show Average of Height (cm) for mean comparison.
  • To assess variability, you can add Height (cm) to Values again and set it to StdDev.
  • Remove totals and grand totals, which don’t make sense for this analysis.

Your pivot table might now look like this:

Pivot table with settings to calculate average and standard deviation.

To save the results outside of the pivot table, copy the pivot table cells and paste them as values.
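The same count, average, and standard-deviation summary can also be reproduced outside Excel. Here is a plain-Python sketch, using only the standard library, applied to the example rows from the table above:

```python
from collections import defaultdict
from statistics import mean, stdev

# Example rows from the field-trial table above:
# (replicate, fertilizer, cultivar, height_cm)
rows = [
    (1, "Control", "A", 42), (2, "Control", "A", 45), (3, "Control", "A", 43),
    (1, "High-N", "A", 56), (2, "High-N", "A", 58), (3, "High-N", "A", 57),
    (1, "Control", "B", 39), (2, "Control", "B", 41), (3, "Control", "B", 40),
    (1, "High-N", "B", 50), (2, "High-N", "B", 52), (3, "High-N", "B", 51),
]

# Group heights by (fertilizer, cultivar) — the equivalent of dragging
# those fields to Rows and Columns in a pivot table.
groups = defaultdict(list)
for _, fertilizer, cultivar, height in rows:
    groups[(fertilizer, cultivar)].append(height)

# Count check first, then average and standard deviation.
for key in sorted(groups):
    heights = groups[key]
    print(key, len(heights), round(mean(heights), 1), round(stdev(heights), 1))
```

As in the pivot-table workflow, the count (`len`) is worth checking before the averages: a group with two or four replicates instead of three signals a data-entry problem.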


Why It’s Powerful

  • Data validation: Count first to ensure consistent replication.
  • Efficient summarization: Quickly compute averages and standard deviations.
  • Flexible exploration: Swap Cultivar and Fertilizer fields, or add Replicate to drill into variability.
  • Instant updates: Refresh when new data is added—no formula updates required.



Takeaway

Pivot tables turn raw experimental data into structured insights. In field trials, this means you can check data completeness, summarize treatment effects, and explore cultivar differences—all from the same dataset, all within Excel.

Written by Carly Huitema

Alrighty, let’s briefly introduce this topic.  AI, or LLMs, are the latest shiny object in the world of research and everyone wants to use them to create really cool things!  I, myself, am just starting to drink the Kool-Aid by using CoPilot to clean up some of my writing – not these blog posts, obviously!!

Now, all these really cool AI tools or agents use data.  You’ve all heard the saying “Garbage In… Garbage Out”?  So, think about that for a moment.  If our students and researchers collect data and create little to no documentation with their data – and then that data becomes available to an AI agent – how comfortable are you with the results?  What are they based on?  Data without documentation???

Let’s flip the conversation the other way now.  Imagine using AI agents for data creation or data analysis without understanding how the AI works, what data it draws on, or how the models work – throwing all those questions to the wind and using the AI agent’s results just the same.  How do you think that will affect our research world?

I’m not going to dwell on these questions – but want to get them out there and have folks think about them.   Agri-food Data Canada (ADC) has created data documentation tools that can easily fit into the AI world – let’s encourage everyone to document their data and build better data resources that can then be used in developing AI agents.

Michelle


image created by AI

When you’re building a data schema, you’re making decisions not only about what data to collect, but also about how it should be structured. One of the most useful tools you have is format restrictions.

What Are Format Entries?

A format entry in a schema defines a specific pattern or structure that a piece of data must follow. For example:

  • A date must look like YYYY-MM-DD, following the ISO 8601 standard.
  • An email address must have the format name@example.com
  • A DNA sequence might only include the letters A, T, G, and C

These formats are usually enforced using rules like regular expressions (regex) or standardized format types.
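For example, rules like the three above can be expressed as regular expressions. The patterns below are illustrative simplifications written for this post, not the Semantic Engine’s own rules (real-world email and date validation is usually stricter):

```python
import re

# Simplified, illustrative format rules.
formats = {
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),       # YYYY-MM-DD
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "dna": re.compile(r"^[ATGC]+$"),                   # only A, T, G, C
}

def matches(kind, value):
    """Return True if value satisfies the named format rule."""
    return bool(formats[kind].match(value))

print(matches("date", "2025-03-15"))  # True
print(matches("date", "15/03/25"))    # False
print(matches("dna", "ATTGC"))        # True
print(matches("dna", "ATXGC"))        # False
```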

Why Would You Want to Restrict Format?

Restricting the format of data entries is about ensuring data quality, consistency, and usability. Here’s why it’s important:

To Avoid Errors Early

If someone enters a date as “15/03/25” instead of “2025-03-15”, you might not know whether it means March 15 or March 25, or even which year. A clear format prevents confusion and catches errors before they become a problem.

To Make Data Machine-Readable

Computers need consistency. A standardized format means data can be processed, compared, or validated automatically. For example, if every date follows the YYYY-MM-DD format, it’s easy to sort them chronologically or filter them by year. This is especially helpful for sorting files in folders on your computer.
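The sorting benefit follows directly from the format itself: because the largest unit (the year) comes first, ISO-formatted date strings sort chronologically even when treated as plain text. A quick illustration:

```python
# ISO YYYY-MM-DD strings sort chronologically as plain text,
# because year precedes month precedes day.
dates = ["2025-03-15", "2024-12-01", "2025-01-07"]
print(sorted(dates))  # ['2024-12-01', '2025-01-07', '2025-03-15']
```

The same property is what makes ISO-dated filenames line up in order in a file browser.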

To Improve Interoperability

When data is shared across systems or platforms, shared formats ensure everyone understands it the same way. This is especially important in collaborative research.

Format in the Semantic Engine

Using the Semantic Engine you can add a format feature to your schema and describe what format you want the data to be entered in. While the schema writes the format rule in RegEx, you don’t need to learn how to do this. Instead, the Semantic Engine uses a set of prepared RegEx rules that users can select from. These are documented in the format GitHub repository where new format rules can be proposed by the community.

After you have added format rules to your schema, you can use the Semantic Engine’s Data Entry Web tool to verify your data against those rules.

Final Thoughts

Format restrictions may seem technical, but they’re essential to building reliable, reusable, and clean data. When you use them thoughtfully, they help everyone—from data collectors to analysts—work more confidently and efficiently.

Written by Carly Huitema

…and we’re back to the data ownership quandary…

Just when I think I may have heard all the different types of questions and situations that can arise in the context of data ownership – I hear a new one.  When I first heard the situation I’m going to share with you in a moment – I thought nah… this must be a one-off.  But then I heard it again from a different individual in a different situation – so it MUST be a “thing”!  When I’m honest with myself, look back, and contemplate my own situations – I’m left wondering too!!!

So let’s work through a research situation.  You have been hired onto a project as a graduate student – working towards your MSc.  You’re SO excited and happy about this wonderful opportunity you have.  You work with your supervisor and lab group to create the most appropriate experimental design to answer your research question, and begin your data collection.   You heard about the Semantic Engine and created your data schema to match your data collection.  Two years down the road and you’re ready to move on – your thesis is complete and you’ve graduated.  What about your data?  What do you do with it?

The BIG question here – WHO owns this data?  The supervisor, who is the PI on the research project you’ve been hired onto?  OR you, as the data collector and analyser?  Hmmm…  When you think about these questions, the next question becomes WHO is responsible for the data and what happens to it?  I would love to hear what readers think about this.  Email me at edwardsm@uoguelph.ca if you have an opinion.

OK what are my thoughts? I’ll let you know on my next blog post 🙂

Michelle


image created by CoPilot