Data Usage

In research data management, small decisions about naming can have outsized consequences. One of the simplest and most important best practices is to avoid spaces when assigning names. While spaces may seem harmless and human-friendly, they often create problems when data is processed, shared, or analyzed across different tools and systems. The Semantic Engine, for example, enforces simple rules that produce better attribute names without spaces.

The Problem with Spaces

Spaces can introduce ambiguity and errors in many computational environments. For example:

  • Programming languages may require additional syntax (such as quotes or special characters) to correctly reference file names or fields containing spaces.
  • APIs and URLs convert spaces into encoded forms (like %20), making paths harder to read and increasing the chance of errors.

In contrast, using consistent, space-free naming makes data easier to automate, share, and scale—key goals in reproducible research.
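As a rough illustration of the two points above, the following Python sketch shows what happens to a file name containing spaces when it is placed in a URL or passed on a command line. The file names are made up for the example, and only standard-library functions are used.

```python
from urllib.parse import quote
import shlex

# Hypothetical file name containing spaces
file_name = "soil samples 2024.csv"

# In a URL path, each space is percent-encoded as %20
encoded = quote(file_name)
print(encoded)  # soil%20samples%202024.csv

# On a POSIX command line, the name must be quoted to stay a single argument
print(shlex.quote(file_name))  # 'soil samples 2024.csv'

# A space-free name passes through both functions unchanged
safe_name = "soil_samples_2024.csv"
print(quote(safe_name))        # soil_samples_2024.csv
print(shlex.quote(safe_name))  # soil_samples_2024.csv
```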

A Better Approach: Naming Conventions

Instead of spaces, researchers use structured naming conventions that improve readability while remaining machine-friendly. The most common conventions are outlined below.

1. Snake Case (snake_case)

  • Words are written in lowercase and separated by underscores.
  • Example: sample_id, gene_expression_level
  • Best for: Data schemas, file names, and many programming contexts (especially Python).
  • Benefits: Clear separation between words, widely supported, easy to read.

2. Kebab Case (kebab-case)

  • Words are lowercase and separated by hyphens.
  • Example: sample-id, gene-expression-level
  • Best for: File names and URLs.
  • Benefits: Clean and readable; commonly used in web contexts.
  • Note: Some programming languages interpret hyphens as minus signs, so avoid in variable names.
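The note about hyphens can be checked directly in Python, used here as one example of such a language. The sketch below tests whether each name is a valid identifier and then shows how the parser reads the kebab-case form.

```python
import ast

# A snake_case name is a valid Python identifier; a kebab-case name is not
print("sample_id".isidentifier())  # True
print("sample-id".isidentifier())  # False

# Parsed as code, sample-id is read as the subtraction "sample minus id"
tree = ast.parse("sample-id", mode="eval")
is_subtraction = isinstance(tree.body, ast.BinOp) and isinstance(tree.body.op, ast.Sub)
print(is_subtraction)  # True
```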

3. Camel Case (camelCase)

  • The first word is lowercase; subsequent words are capitalized.
  • Example: sampleId, geneExpressionLevel
  • Best for: Programming variables (common in JavaScript and many APIs).
  • Benefits: Compact and readable without separators.

4. Pascal Case (PascalCase)

  • Each word starts with a capital letter, including the first.
  • Example: SampleId, GeneExpressionLevel
  • Best for: Class names and structured types in many programming languages.
  • Benefits: Clearly distinguishes multi-word identifiers, often used for higher-level objects.

Choosing the Right Convention

The key is consistency. Pick a convention appropriate for your tools and stick with it across your dataset, codebase, and documentation.
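If existing names need to be brought in line with one convention, a small helper can do most of the work. The sketch below is one possible way to convert names to snake case; the function name to_snake_case is hypothetical and the rules are deliberately simple (for example, a run of capitals such as an acronym is lowercased as a single word).

```python
import re

def to_snake_case(name):
    """Convert a name with spaces, hyphens, or camel/Pascal casing to snake_case."""
    cleaned = name.strip()
    # Insert an underscore at each lowercase-or-digit to uppercase boundary
    cleaned = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", cleaned)
    # Replace runs of spaces, hyphens, or underscores with a single underscore
    cleaned = re.sub(r"[\s\-_]+", "_", cleaned)
    return cleaned.lower()

print(to_snake_case("Gene Expression Level"))  # gene_expression_level
print(to_snake_case("gene-expression-level"))  # gene_expression_level
print(to_snake_case("geneExpressionLevel"))    # gene_expression_level
print(to_snake_case("SampleId"))               # sample_id
```

Whatever convention is chosen, applying it with a single shared function rather than by hand is one way to keep names consistent across a dataset.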

Conclusion

Eliminating spaces in file and schema names is a small change that greatly improves the reliability and portability of research data. By adopting clear naming conventions like snake case, kebab case, camel case, or Pascal case, researchers create datasets that are easier to process, share, and reproduce.

Written by Carly Huitema

Maintaining clean, consistent data remains one of the biggest challenges in data management. Entry codes—also known as picklists—have long played a key role in improving data quality by standardizing how information is captured. Building on this foundation, a new Entry Code Library feature has been introduced in the Semantic Engine schema writer, making it easier than ever to reuse proven standards and reduce errors at the point of data entry.

The Value of Entry Codes (Picklists)

Entry codes provide a structured alternative to free-text data entry. Instead of allowing users to manually type values, entry codes limit input to a predefined list of acceptable options. This approach helps:

  • Prevent spelling mistakes and inconsistent terminology
  • Ensure uniform data across datasets and projects
  • Improve searchability, aggregation, and downstream analysis

By capturing standardized codes rather than variable text, datasets become more reliable, interoperable, and easier to maintain over time.
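To make the idea concrete, here is a minimal Python sketch of checking entered values against a code list. The codes and the helper name find_invalid_entries are invented for illustration and do not come from the Semantic Engine.

```python
# Hypothetical entry codes for a "soil_type" attribute
SOIL_TYPE_CODES = {"CL": "Clay", "SA": "Sand", "SI": "Silt", "LO": "Loam"}

def find_invalid_entries(values, allowed_codes):
    """Return (row_number, value) pairs for values that are not in the code list."""
    return [(i, v) for i, v in enumerate(values, start=1) if v not in allowed_codes]

# Free-text style entries mixed in with valid codes
entries = ["CL", "SA", "clay", "LO", "Snad"]
print(find_invalid_entries(entries, SOIL_TYPE_CODES))
# [(3, 'clay'), (5, 'Snad')]
```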

Introducing the Entry Code Library

Based on direct user feedback, the Semantic Engine team has introduced an Entry Code Library to streamline schema creation and encourage reuse of existing work.

When defining a variable in the schema writer, users who select List as their initial data type now gain access to a premade library of entry codes.

Adding a list to a variable.

Rather than building a list from scratch each time, you can browse and search the library for existing code lists that meet your needs.

Selecting entry codes from the entry code library.

Search, Reuse, and Align with Standards

The Entry Code Library is designed to save time and improve consistency by helping users:

  • Search for commonly used entry code lists
  • Reuse established vocabularies and standards
  • Avoid duplication of effort across projects
  • Reduce data cleanup caused by inconsistent entry values

By leveraging shared entry code lists, datasets across teams and domains can align more easily, improving overall data interoperability.

Contributing to the Library

The Entry Code Library is a growing resource. If you have created—or identified—a code list that you believe would be valuable to others, we encourage you to contribute.

If you see a list you would like added to the library, please contact us at adc@uoguelph.ca.

Your contributions help build a stronger, more reusable ecosystem for high-quality data entry.

Moving Toward Cleaner Data by Design

Entry codes have always been a powerful tool for enforcing consistency at the point of data capture. With the introduction of the Entry Code Library in the Semantic Engine schema writer, users now have even greater support for creating standardized, reusable, and error-resistant schemas.

By combining structured entry codes with shared libraries and community input, data quality improves not after collection—but from the very beginning.

Written by Carly Huitema

Uh-oh, here she goes again!  I’ve been pondering the direction I wanted to take with this post – do I dig more into the data opportunities that may exist in the OAC Annual reports, or twist my Rubik’s cube to look at a different facet of this ongoing conversation?  Let’s look at another facet.

You may recall that we created a tagline for Agri-food Data Canada (ADC) a few years ago:

Making agri-food data FAIR!

Remember: Findable, Accessible, Interoperable, Reusable.  If you’ve been following this blog, you know we have gone into detail on FAIR and the tools that we have created to make data FAIR.  Now the question – which I have also posed in the past – is: do we need to qualify WHAT data we make FAIR?  Does someone need to decide which data has more “value” and should be FAIR?  Oh, you know that question will not end well!  So, let’s be objective and take that tagline at face value – agri-food data – whether it was created today or 150+ years ago.   I know this may seem a little counter to what I said a few weeks ago – but I do feel that we need to think about the goal!  Why would we want to make that 150+ year old data FAIR?  What am I going to use that historical data for?   If we can find it, dig it out, and document the older data – we also need to think about the resources that are needed to create that FAIR data resource AND whether they outweigh how, and if, the data will be Reused.  Back to that circular conversation again 🙂

So why am I bringing this up yet again?  Because I was reminded this week that there are INDEED projects out there – Canadian projects – that are using older and historic data files.   For instance, check out the GeoREACH Lab at UPEI – a FABULOUS use case for historical data.

So….  back to my original question – is there VALUE in historical ag data?  Should we spend the resources to uncover that hidden data trove of OAC research?

Michelle

Imagine this scenario. On her first field season as a principal investigator, a professor watched a graduate student realize—two weeks too late—that no one had recorded soil temperature at the sampling sites. The team had pH, moisture, GPS coordinates… but not the one variable that explained the anomaly in their results. A return trip wasn’t possible. The data gap was permanent.

After that, she changed how her lab collected data.

Instead of relying on ad hoc spreadsheets, she worked with her students to design schemas for their lab’s routine data collection. These weren’t schemas for final data deposit—they were practical structures for the messy, active phase of research. The goal was simple: define in advance what gets collected, how it’s recorded, and which values are allowed.

Researchers can use the Semantic Engine to create schemas that they need for all stages of their research program, from active data collection to final data deposition.

For data collection, once a schema is established, it can be uploaded into the Semantic Engine to generate a Data Entry Excel (DEE) file.

Each DEE contains:

  • A schema description sheet – documentation pulled directly from the schema, including variable definitions and code lists.

  • A data entry sheet – pre-labeled columns that follow the schema rules.

The schema description sheet of a Data Entry Excel.
Data Entry Excel showing the sheet for data entry.

Because the documentation lives in the same file as the data, nothing has to be retyped, reinvented, or remembered from scratch. The schema description sheet also includes code lists that populate the drop-down menus in the data entry sheet, reducing inconsistent terminology and formatting errors.

If the standard schema isn’t sufficient, it can be edited in the Semantic Engine. Researchers can add attributes or adjust fields without rebuilding everything from scratch. The updated schema can then generate a new DEE, preserving previous structure while incorporating the changes.

This approach addresses a common problem: unstructured Excel data. Without standardization, spreadsheets accumulate inconsistent date formats, unit mismatches, ambiguous abbreviations, and missing values. Cleaning that data later is costly and error-prone.

By organizing data entry around a schema:

  • Required information is visible and less likely to be forgotten.

  • Fieldwork becomes more reliable – critical variables are collected the first time.

  • Data from multiple researchers or projects can be harmonized more easily.

  • Manual cleaning and interpretation are reduced.

The generated DEE does not enforce full validation inside Excel (beyond drop-down lists). For formal validation, the completed spreadsheet can be uploaded to the Semantic Engine’s Data Verification tool.
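The kind of check that catches a missing variable can be sketched in a few lines of Python. This is only an illustration of the idea, not how the Data Verification tool works; the schema, attribute names, and the helper find_gaps are all hypothetical.

```python
# Hypothetical schema: attribute name -> whether a value is required
SCHEMA = {"sample_id": True, "ph": True, "soil_temperature_c": True, "notes": False}

def find_gaps(rows, schema):
    """Return (row_number, attribute) pairs where a required value is missing or blank."""
    gaps = []
    for row_number, row in enumerate(rows, start=1):
        for attribute, required in schema.items():
            value = row.get(attribute)
            if required and (value is None or str(value).strip() == ""):
                gaps.append((row_number, attribute))
    return gaps

# Made-up field records: row 2 has a blank temperature, row 3 lacks the column entirely
rows = [
    {"sample_id": "S01", "ph": 6.8, "soil_temperature_c": 14.2, "notes": ""},
    {"sample_id": "S02", "ph": 7.1, "soil_temperature_c": "", "notes": "cloudy"},
    {"sample_id": "S03", "ph": 6.5},
]
print(find_gaps(rows, SCHEMA))
# [(2, 'soil_temperature_c'), (3, 'soil_temperature_c')]
```

Running a check like this while the team is still in the field is what turns a permanent data gap into a same-day correction.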

Using schema-driven Data Entry Excel files turns data structure into a practical research tool. Instead of discovering gaps during analysis, researchers define expectations at the point of collection—when it matters most.

Written by Carly Huitema

Generalist and Specialist Data Repositories

Research data repositories can be described along two important dimensions:

  1. how broad or specialized their scope is, and

  2. where they sit in the research data lifecycle.

Understanding these distinctions makes it easier to navigate the technologies and repositories available for research data.

Generalist Repositories

Generalist repositories are designed to accept many different kinds of data across disciplines. They prioritize inclusivity and flexibility, offering a common technical platform where researchers can deposit datasets that do not fit neatly into a single domain.

A useful metaphor is the junk drawer in a kitchen. A junk drawer contains many useful items—batteries, spare cables, elastic bands—but finding a specific item often requires some effort. Similarly, generalist repositories can hold valuable datasets, but those datasets may be described with relatively generic metadata and limited domain-specific structure.

As a result, data in generalist repositories can be:

  • Harder to discover through precise searches

  • More difficult to interpret without additional context

  • Less immediately reusable by domain experts

Examples of generalist repositories include Dataverse (Borealis in Canada), Figshare and OSF.

Specialist Repositories

Specialist repositories focus on a specific discipline, data type, or research community. They typically enforce domain-specific metadata standards, controlled vocabularies, and structured submission requirements.

Continuing the kitchen metaphor, specialist repositories resemble a cutlery drawer: clearly organized, purpose-built, and easy to use—provided you are looking for the right type of item. Knives go in one place, forks in another, and everything has a defined role.

Because of this structure, specialist repositories tend to make data:

  • More findable through precise, domain-aware search

  • Easier to interpret due to consistent metadata

  • More interoperable with related tools and systems

  • More reusable for future research

In other words, data in specialist repositories are often more FAIR than data in generalist repositories. However, this specialization also limits what they can accept. Many interdisciplinary datasets, particularly in agri-food research, do not align cleanly with the strict models of existing specialist repositories and therefore end up in generalist ones. Examples of specialist repositories include GenBank, PDB and GEO.

The Research Data Lifecycle: Active and Archival Data

Another important way to think about data repositories is in relation to the research data lifecycle.

Research data typically move through several phases:

  1. Planning and collection

  2. Processing, active analysis and refinement

  3. Publication and dissemination

  4. Long-term preservation and reuse

Repositories are often designed to support either active data or archival data, but not both equally well.

Active Data

Active data are produced and used during the course of research. They may be incomplete, frequently updated, or subject to access restrictions due to confidentiality, sensitivity, or competitive concerns.

This is the phase where data are still being cleaned, analyzed, and interpreted. Changes are expected, and collaboration is often ongoing. Most formal repositories are not designed to support this stage, which is typically handled through local storage, shared drives, or project-specific platforms.

Archival Data

Once research is complete and results have been published, data generally move into an archival phase. At this point, datasets are more stable, less likely to change, and often less sensitive—especially if they have been anonymized or if concerns about being “scooped” no longer apply.

Most well-known repositories, including Dataverse, Figshare, and domain-specific archives such as the Protein Data Bank (PDB), are designed primarily for archival data. Their strengths lie in long-term preservation, persistent identifiers (PIDs like DOIs), citation, and access, rather than supporting ongoing analysis or frequent updates.

Bridging the Gaps

It would be inefficient to build a highly specialized repository for every possible type of dataset—much like building a kitchen with a separate drawer for every object that might otherwise end up in the junk drawer. Instead, a more scalable approach is to improve the organization and description of data held in generalist repositories.

Agri-food Data Canada’s approach focuses on developing tools, guidance, and training that help researchers add structure and context to their data wherever it is deposited. By enhancing metadata quality and enabling interoperability between repositories, it becomes possible to make data in generalist repositories more FAIR—without requiring a proliferation of narrowly specialized infrastructure.

Together, specialist and generalist repositories, along with active and archival data systems, form complementary parts of the research data ecosystem. Recognizing their respective roles helps researchers choose appropriate platforms and supports more effective data reuse over time.

Written by Carly Huitema

How Pivot Tables Simplify Analysis of Field Measurements

If you’re working with agricultural or experimental data, Excel’s pivot tables can make summarizing results quick and intuitive. Instead of manually calculating averages or totals for each treatment, you can let Excel do the heavy lifting—organizing your measurements by treatment, cultivar, and replicate automatically.


The Scenario

Suppose you ran a field experiment measuring plant height under different fertilizer treatments and cultivars.
Your spreadsheet might look like this:

Replicate  Fertilizer  Cultivar  Height (cm)
1          Control     A         42
2          Control     A         45
3          Control     A         43
1          High-N      A         56
2          High-N      A         58
3          High-N      A         57
1          Control     B         39
2          Control     B         41
3          Control     B         40
1          High-N      B         50
2          High-N      B         52
3          High-N      B         51

That’s a lot of data points—especially if you have more cultivars or treatments. A pivot table can summarize this instantly.


Step 1: Create the Pivot Table

  1. Select your data range.
  2. Go to Insert → PivotTable.
  3. In the PivotTable Field List:
    • Drag Fertilizer to Rows.
    • Drag Cultivar to Columns.
    • Drag Height (cm) to Values.

By default, Excel might show the Sum of height values—but you can change this.


Step 2: Start with a Count Check

Before jumping into averages, it’s good practice to first check that your dataset is complete.
In the Values field, choose Value Field Settings → Count.

This shows how many measurements were recorded for each fertilizer–cultivar combination.
For example:

Pivot table in Excel showing the count for a dataset.

If you notice missing or extra counts, you’ll know your data entry needs review before proceeding. This quick check often catches typos or missing replicates that might otherwise distort your summary statistics.
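For readers who also work outside Excel, the same count check can be sketched in Python with the standard library. The rows are the example data from the table above; the expected count of three replicates per combination is taken from that table.

```python
from collections import Counter

# (replicate, fertilizer, cultivar, height_cm) rows from the example spreadsheet
rows = [
    (1, "Control", "A", 42), (2, "Control", "A", 45), (3, "Control", "A", 43),
    (1, "High-N", "A", 56), (2, "High-N", "A", 58), (3, "High-N", "A", 57),
    (1, "Control", "B", 39), (2, "Control", "B", 41), (3, "Control", "B", 40),
    (1, "High-N", "B", 50), (2, "High-N", "B", 52), (3, "High-N", "B", 51),
]

# Count the measurements for each fertilizer-cultivar combination
counts = Counter((fertilizer, cultivar) for _, fertilizer, cultivar, _ in rows)

expected_replicates = 3
for combination, n in sorted(counts.items()):
    note = "" if n == expected_replicates else "  <-- check data entry"
    print(combination, n, note)
```

A misspelled treatment name such as "High N" would show up here as an extra combination with a count of one, which is the same signal the pivot table gives.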


Step 3: Calculate Averages and Standard Deviations

Once the counts look correct:

  • Change the Values field to show Average of Height (cm) for mean comparison.
  • To assess variability, you can add Height (cm) to Values again and set it to StdDev.
  • Remove totals and grand totals, which don’t make sense for this analysis.

Your pivot table might now look like this:

Pivot table with settings to calculate average and standard deviation.

To save the data outside of the pivot table, copy the fields and paste as values.
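The averages and standard deviations can be reproduced outside Excel as a cross-check. The sketch below uses Python's statistics module on the example data; it assumes the pivot table's StdDev setting is the sample standard deviation, which is what statistics.stdev computes.

```python
from collections import defaultdict
from statistics import mean, stdev

# (replicate, fertilizer, cultivar, height_cm) rows from the example spreadsheet
rows = [
    (1, "Control", "A", 42), (2, "Control", "A", 45), (3, "Control", "A", 43),
    (1, "High-N", "A", 56), (2, "High-N", "A", 58), (3, "High-N", "A", 57),
    (1, "Control", "B", 39), (2, "Control", "B", 41), (3, "Control", "B", 40),
    (1, "High-N", "B", 50), (2, "High-N", "B", 52), (3, "High-N", "B", 51),
]

# Group heights by fertilizer-cultivar combination
heights = defaultdict(list)
for _, fertilizer, cultivar, height_cm in rows:
    heights[(fertilizer, cultivar)].append(height_cm)

# Mean and sample standard deviation for each group
summary = {key: (mean(values), stdev(values)) for key, values in heights.items()}

for (fertilizer, cultivar), (average, deviation) in sorted(summary.items()):
    print(f"{fertilizer:8} {cultivar}  mean={average:.2f}  sd={deviation:.2f}")
```

For the example data, High-N on cultivar A has a mean height of 57 cm and a standard deviation of 1 cm.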


Why It’s Powerful

  • Data validation: Count first to ensure consistent replication.
  • Efficient summarization: Quickly compute averages and standard deviations.
  • Flexible exploration: Swap Cultivar and Fertilizer fields, or add Replicate to drill into variability.
  • Instant updates: Refresh when new data is added—no formula updates required.

 


Takeaway

Pivot tables turn raw experimental data into structured insights. In field trials, this means you can check data completeness, summarize treatment effects, and explore cultivar differences—all from the same dataset, all within Excel.

Written by Carly Huitema

Alrighty, let’s briefly introduce this topic.  AI tools and LLMs are the latest shiny objects in the world of research and everyone wants to use them and create really cool things!  I, myself, am just starting to drink the Kool-Aid by using CoPilot to clean up some of my writing – not these blog posts – obviously!!

Now, all these really cool AI tools or agents use data.  You’ve all heard the saying “Garbage In…. Garbage Out…”?  So, think about that for a moment.  IF our students and researchers collect data and create little to no documentation with their data – then that data becomes available to an AI agent…  how comfortable are you with the results?  What are they based on?  Data without documentation???

Let’s flip the conversation the other way now.   Picture using AI agents for data creation or data analysis without understanding how the AI works, what data it draws on, or how the models work – throwing all those questions to the wind and using the AI agent results just the same.  How do you think that will affect our research world?

I’m not going to dwell on these questions – but want to get them out there and have folks think about them.   Agri-food Data Canada (ADC) has created data documentation tools that can easily fit into the AI world – let’s encourage everyone to document their data, build better data resources – that can then be used in developing AI agents.

Michelle

 

 

image created by AI

When you’re building a data schema you’re making decisions not only about what data to collect, but also how it should be structured. One of the most useful tools you have is format restrictions.

What Are Format Entries?

A format entry in a schema defines a specific pattern or structure that a piece of data must follow. For example:

  • A date must look like YYYY-MM-DD or be in the ISO duration format.
  • An email address must have the format name@example.com
  • A DNA sequence might only include the letters A, T, G, and C

These formats are usually enforced using rules like regular expressions (regex) or standardized format types.
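As a small illustration of how a regular expression enforces a format, the Python sketch below checks values against two patterns. The patterns and the helper matches_format are written for this example only and are not the rules used by the Semantic Engine.

```python
import re

# Illustrative patterns only; not the Semantic Engine's actual rules
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")  # YYYY-MM-DD shape
DNA_PATTERN = re.compile(r"[ATGC]+")             # only the letters A, T, G, C

def matches_format(value, pattern):
    """Return True when the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None

print(matches_format("2025-03-15", DATE_PATTERN))  # True
print(matches_format("15/03/25", DATE_PATTERN))    # False
print(matches_format("ATGCCGTA", DNA_PATTERN))     # True
print(matches_format("ATGXCG", DNA_PATTERN))       # False
```

Note that the date pattern here checks only the shape of the value: a string such as 2025-13-45 would still pass, so a stricter rule or a real date parser is needed to reject impossible dates.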

Why Would You Want to Restrict Format?

Restricting the format of data entries is about ensuring data quality, consistency, and usability. Here’s why it’s important:

To Avoid Errors Early

If someone enters a date as “15/03/25” instead of “2025-03-15”, you can’t tell whether it means March 15, 2025 or March 25, 2015. A clear format prevents that confusion and catches errors before they become a problem.
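The ambiguity is easy to demonstrate. In the Python sketch below, the same string is parsed under two different assumptions about field order, and both succeed with different results.

```python
from datetime import datetime

ambiguous = "15/03/25"

# Read as day/month/year
as_day_first = datetime.strptime(ambiguous, "%d/%m/%y").date()
# Read as year/month/day
as_year_first = datetime.strptime(ambiguous, "%y/%m/%d").date()

print(as_day_first)   # 2025-03-15
print(as_year_first)  # 2015-03-25
```

Neither reading raises an error, so without a declared format the software has no way to know which one the person entering the data intended.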

To Make Data Machine-Readable

Computers need consistency. A standardized format means data can be processed, compared, or validated automatically. For example, if every date follows the YYYY-MM-DD format, it’s easy to sort them chronologically or filter them by year. This is especially helpful for sorting files in folders on your computer.
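A short Python sketch shows why this works: with YYYY-MM-DD, plain alphabetical sorting is also chronological sorting, which is not true of day-first formats. The dates are made up for the example.

```python
# The same three dates written in two formats
iso_dates = ["2025-03-15", "2025-12-01", "2024-01-07"]
day_first_dates = ["15/03/2025", "01/12/2025", "07/01/2024"]

# Text sorting of ISO-style dates gives chronological order
print(sorted(iso_dates))
# ['2024-01-07', '2025-03-15', '2025-12-01']

# Text sorting of day-first dates does not
print(sorted(day_first_dates))
# ['01/12/2025', '07/01/2024', '15/03/2025']

# Filtering by year is a simple prefix check
dates_in_2025 = [d for d in iso_dates if d.startswith("2025")]
print(dates_in_2025)
# ['2025-03-15', '2025-12-01']
```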

To Improve Interoperability

When data is shared across systems or platforms, shared formats ensure everyone understands it the same way. This is especially important in collaborative research.

Format in the Semantic Engine

Using the Semantic Engine you can add a format feature to your schema and describe what format you want the data to be entered in. While the schema writes the format rule in RegEx, you don’t need to learn how to do this. Instead, the Semantic Engine uses a set of prepared RegEx rules that users can select from. These are documented in the format GitHub repository where new format rules can be proposed by the community.

After you have created format rules in your schema, you can use the Data Entry Web tool of the Semantic Engine to verify your data against those rules.

Final Thoughts

Format restrictions may seem technical, but they’re essential to building reliable, reusable, and clean data. When you use them thoughtfully, they help everyone—from data collectors to analysts—work more confidently and efficiently.

Written by Carly Huitema

…and we’re back to the data ownership quandary…

Just when I think I may have heard all the different types of questions and situations that may arise in the context of data ownership – I hear a new one.  When I first heard the situation I’m going to share with you in a moment – I thought nah..  this must be a one-off.  But then I heard it again from a different individual and situation – so it MUST be a “thing”!  When I’m honest with myself, look back, and contemplate my own situations – I’m left wondering too!!!

So let’s work through a research situation.  You have been hired onto a project as a graduate student – working towards your MSc.  You’re SO excited and happy about this wonderful opportunity you have.  You work with your supervisor and lab group to create the most appropriate experimental design to answer your research question, and begin your data collection.   You heard about the Semantic Engine and created your data schema to match your data collection.  Two years down the road and you’re ready to move on – your thesis is complete and you’ve graduated.  What about your data?  What do you do with it?

The BIG question here – WHO owns this data?  The supervisor – who is the PI on the research project you’ve been hired onto?  OR you as the data collector and analyser?  Hmmm…… When you think about these questions – the next question becomes WHO is responsible for the data and what happens to it?   I would love to hear what readers think about this.  Email me at edwardsm@uoguelph.ca if you have an opinion.

OK what are my thoughts? I’ll let you know on my next blog post 🙂

Michelle

 

 

image created by CoPilot