Documentation

In research data management, small decisions about naming can have outsized consequences. One of the simplest and most important best practices is to avoid spaces when assigning names. While spaces may seem harmless and human-friendly, they often create problems when data is processed, shared, or analyzed across different tools and systems. The Semantic Engine, for example, enforces simple rules that lead to better attribute names without spaces.

The Problem with Spaces

Spaces can introduce ambiguity and errors in many computational environments. For example:

  • Programming languages may require additional syntax (such as quotes or special characters) to correctly reference file names or fields containing spaces.
  • APIs and URLs convert spaces into encoded forms (like %20), making paths harder to read and increasing the chance of errors.

In contrast, using consistent, space-free naming makes data easier to automate, share, and scale—key goals in reproducible research.
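To make the two problems above concrete, here is a minimal Python sketch using only the standard library. The file names are made up for illustration; the point is simply to compare how a spaced name and a space-free name behave when placed in a URL or passed to a POSIX shell.

```python
import shlex
from urllib.parse import quote

# Hypothetical file names, used only for illustration.
spaced = "gene expression level.csv"
clean = "gene_expression_level.csv"

# In a URL path, each space is percent-encoded as %20.
encoded_spaced = quote(spaced)
encoded_clean = quote(clean)

# On a POSIX shell, a name with spaces must be quoted to stay one argument.
shell_spaced = shlex.quote(spaced)
shell_clean = shlex.quote(clean)

print(encoded_spaced)  # gene%20expression%20level.csv
print(encoded_clean)   # gene_expression_level.csv
print(shell_spaced)    # 'gene expression level.csv'
print(shell_clean)     # gene_expression_level.csv
```

The space-free name passes through both steps unchanged, which is why it is easier to automate.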

A Better Approach: Naming Conventions

Instead of spaces, researchers use structured naming conventions that improve readability while remaining machine-friendly. The most common conventions are outlined below.

1. Snake Case (snake_case)

  • Words are written in lowercase and separated by underscores.
  • Example: sample_id, gene_expression_level
  • Best for: Data schemas, file names, and many programming contexts (especially Python).
  • Benefits: Clear separation between words, widely supported, easy to read.

2. Kebab Case (kebab-case)

  • Words are lowercase and separated by hyphens.
  • Example: sample-id, gene-expression-level
  • Best for: File names and URLs.
  • Benefits: Clean and readable; commonly used in web contexts.
  • Note: Some programming languages interpret hyphens as minus signs, so avoid kebab case in variable names.

3. Camel Case (camelCase)

  • The first word is lowercase; subsequent words are capitalized.
  • Example: sampleId, geneExpressionLevel
  • Best for: Programming variables (common in JavaScript and many APIs).
  • Benefits: Compact and readable without separators.

4. Pascal Case (PascalCase)

  • Each word starts with a capital letter, including the first.
  • Example: SampleId, GeneExpressionLevel
  • Best for: Class names and structured types in many programming languages.
  • Benefits: Clearly distinguishes multi-word identifiers, often used for higher-level objects.

Choosing the Right Convention

The key is consistency. Pick a convention appropriate for your tools and stick with it across your dataset, codebase, and documentation.
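As a rough illustration of the four conventions described above, the following Python sketch converts a human-friendly name with spaces into each style. The function names are our own and are not part of the Semantic Engine; the sketch assumes names made of plain words separated by spaces, underscores, or hyphens.

```python
import re


def split_words(name: str) -> list[str]:
    """Split a name into lowercase words on spaces, underscores, and hyphens."""
    return [w.lower() for w in re.split(r"[\s_\-]+", name.strip()) if w]


def snake_case(name: str) -> str:
    return "_".join(split_words(name))


def kebab_case(name: str) -> str:
    return "-".join(split_words(name))


def pascal_case(name: str) -> str:
    return "".join(w.capitalize() for w in split_words(name))


def camel_case(name: str) -> str:
    words = split_words(name)
    if not words:
        return ""
    # First word stays lowercase; the rest are capitalized.
    return words[0] + "".join(w.capitalize() for w in words[1:])


label = "Gene Expression Level"
print(snake_case(label))   # gene_expression_level
print(kebab_case(label))   # gene-expression-level
print(camel_case(label))   # geneExpressionLevel
print(pascal_case(label))  # GeneExpressionLevel
```

Running one such function over every column header in a dataset is a simple way to apply a chosen convention consistently.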

Conclusion

Eliminating spaces in file and schema names is a small change that greatly improves the reliability and portability of research data. By adopting clear naming conventions like snake case, kebab case, camel case, or Pascal case, researchers create datasets that are easier to process, share, and reproduce.

Written by Carly Huitema

Maintaining clean, consistent data remains one of the biggest challenges in data management. Entry codes—also known as picklists—have long played a key role in improving data quality by standardizing how information is captured. Building on this foundation, a new Entry Code Library feature has been introduced in the Semantic Engine schema writer, making it easier than ever to reuse proven standards and reduce errors at the point of data entry.

The Value of Entry Codes (Picklists)

Entry codes provide a structured alternative to free-text data entry. Instead of allowing users to manually type values, entry codes limit input to a predefined list of acceptable options. This approach helps:

  • Prevent spelling mistakes and inconsistent terminology
  • Ensure uniform data across datasets and projects
  • Improve searchability, aggregation, and downstream analysis

By capturing standardized codes rather than variable text, datasets become more reliable, interoperable, and easier to maintain over time.
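The idea can be sketched in a few lines of Python. The code list and the attribute name below are hypothetical examples, not entries from the Semantic Engine's library; the sketch only shows how a predefined list catches the kinds of variation that free text lets through.

```python
# Hypothetical entry code list (picklist) for a "soil_type" attribute.
SOIL_TYPE_CODES = {"clay", "loam", "sand", "silt"}


def find_invalid_entries(values, allowed):
    """Return (row_index, value) pairs for values that are not in the allowed list."""
    return [(i, v) for i, v in enumerate(values) if v not in allowed]


# Free-text entries with typical problems: capitalization, variants, typos.
free_text = ["loam", "Loam", "sandy", "clay", "lome"]

print(find_invalid_entries(free_text, SOIL_TYPE_CODES))
# [(1, 'Loam'), (2, 'sandy'), (4, 'lome')]
```

With a picklist in place at data entry, none of the three flagged values could have been typed in the first place.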

Introducing the Entry Code Library

Based on direct user feedback, the Semantic Engine team has introduced an Entry Code Library to streamline schema creation and encourage reuse of existing work.

When defining a variable in the schema writer, users who select List as their initial data type now gain access to a premade library of entry codes.

Adding a list to a variable.

Rather than building a list from scratch each time, you can browse and search the library for existing code lists that meet your needs.

Selecting entry codes from the entry code library.

Search, Reuse, and Align with Standards

The Entry Code Library is designed to save time and improve consistency by helping users:

  • Search for commonly used entry code lists
  • Reuse established vocabularies and standards
  • Avoid duplication of effort across projects
  • Reduce data cleanup caused by inconsistent entry values

By leveraging shared entry code lists, datasets across teams and domains can align more easily, improving overall data interoperability.

Contributing to the Library

The Entry Code Library is a growing resource. If you have created—or identified—a code list that you believe would be valuable to others, we encourage you to contribute.

If you see a list you would like added to the library, please contact us at adc@uoguelph.ca.

Your contributions help build a stronger, more reusable ecosystem for high-quality data entry.

Moving Toward Cleaner Data by Design

Entry codes have always been a powerful tool for enforcing consistency at the point of data capture. With the introduction of the Entry Code Library in the Semantic Engine schema writer, users now have even greater support for creating standardized, reusable, and error-resistant schemas.

By combining structured entry codes with shared libraries and community input, data quality improves not after collection—but from the very beginning.

Written by Carly Huitema

The Case for Uploadable Form Data: A More Flexible Approach to Online Submissions

Anyone who has worked extensively with online submission systems will recognize a familiar frustration: you have done the hard work of gathering, drafting, and refining your content – often collaboratively, across multiple documents and tools – and now you face the tedious task of manually copying everything into a web form, field by field. The content is ready; the process of getting it into the system is not.

This challenge comes up repeatedly in our work with Agri-food Data Canada (ADC) and the Climate-Smart Data Collaboration Centre (CS-DCC), and it is not unique to any one platform or domain. It reflects a structural gap in how most online forms are designed: they are built for data entry, not data transfer.

The Collaboration Problem

This tension is particularly well illustrated in the context of data management plans. As noted in a recent article from Upstream:

“Data management planning is often a collaborative process, involving researchers, librarians, and institutional support staff. External tools and shared documents make it easier to iterate on plans, incorporate guidance, and ensure alignment with institutional policies and available resources. When plans are created directly within submission systems, that collaborative process can become more difficult.”

The same dynamic applies across many submission workflows. Researchers, teams, and support staff work best when they can iterate freely in shared documents, draw on existing resources, and incorporate guidance from multiple sources. Forcing that process into a single submission interface introduces friction at exactly the wrong moment – when content is nearly complete and should be easy to finalize.

Why APIs Are Not Always the Answer

The conventional infrastructure response to interoperability problems is to connect systems via APIs. While APIs are powerful and appropriate in many contexts, they come with real constraints. Both parties must be ready and willing to build and maintain the connection. Integration work requires technical resources on both sides. Security and access management become more complex. And the result is a series of point-to-point connections rather than a flexible, open approach that any participant can use.

APIs are well suited for tightly coupled systems with dedicated integration teams. They are less well suited for the diverse, distributed ecosystems that characterize research infrastructure – where institutions, tools, and workflows vary enormously and where not every participant has the capacity to build custom integrations.

A Simpler Proposal: Publishable Schemas and Uploadable Data Files

At ADC, we have been exploring a different approach. The core idea is straightforward: if an online form publishes its expected data structure as a downloadable schema or sample data file, then users can prepare their submissions outside the system – collaboratively, using whatever tools work best for them – and upload a structured file (such as a .json file) when they are ready to submit.

The form receives the uploaded file, validates it against the expected format, and populates the interface with the pre-filled content. The user retains final control, reviewing and editing within the UI before submitting. This preserves the benefits of collaborative, tool-agnostic preparation while keeping the submission process human-centered and editable.
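A minimal sketch of the receiving side might look like the following Python. Everything here is an assumption made for illustration: the field names, the `EXPECTED_FIELDS` structure, and the `load_submission` function are hypothetical and do not describe the DMP Assistant, the Semantic Engine, or any existing system. The sketch parses an uploaded .json file, checks it against the published structure, and returns pre-filled form values together with a list of problems for the user to review.

```python
import json

# Hypothetical published structure: field name -> (expected type, required?)
EXPECTED_FIELDS = {
    "project_title": (str, True),
    "data_description": (str, True),
    "storage_plan": (str, False),
}


def load_submission(raw_json: str):
    """Parse an uploaded file and return (prefilled_form, problems)."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as err:
        return {}, [f"not valid JSON: {err.msg}"]
    if not isinstance(data, dict):
        return {}, ["top level must be a JSON object"]

    problems = []
    form = {}
    for name, (expected_type, required) in EXPECTED_FIELDS.items():
        if name not in data:
            if required:
                problems.append(f"missing required field: {name}")
            form[name] = ""  # left blank for the user to fill in
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for field: {name}")
            form[name] = ""
        else:
            form[name] = data[name]

    # Fields the form does not know about are reported, not silently dropped.
    for name in data:
        if name not in EXPECTED_FIELDS:
            problems.append(f"unexpected field ignored: {name}")
    return form, problems


upload = '{"project_title": "Soil survey", "extra": 1}'
form, problems = load_submission(upload)
print(form)
print(problems)
```

Returning the problems alongside the partially filled form, rather than rejecting the upload outright, keeps the user in control: they land in the editing stage with whatever could be populated and a clear list of what still needs attention.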

Consider the DMP Assistant, the Alliance’s data management planning tool. Currently, users draft their plans directly in the online interface. Under this model, a researcher could instead work with their existing lab documentation, institutional guidance, and an AI assistant to compile all the relevant information into a structured .json file, then upload it into the DMP Assistant to populate the form in a single step – arriving at the editing stage with a complete draft rather than a blank form.

We Already Do This

This is not a theoretical proposal. ADC already supports this kind of workflow in the Semantic Engine’s schema development tool. We provide a structured prompt that helps users draft their schema content with an AI assistant, then upload the resulting .json file directly into the Semantic Engine editor. The file is parsed, the fields are populated, and the user continues working from there – a draft version that may contain AI errors but offers the opportunity to correct and improve.

The pattern works. It reduces manual data entry, supports collaborative preparation, and lowers the barrier for users who want to work in familiar tools before engaging with a specialized system. We believe it is a model worth broader adoption, and one that submission systems of all kinds could implement without requiring significant infrastructure investment from either side.

Written by Carly Huitema

 

Wow!  Isn’t it amazing how our world can change in an instant?  Remember not that long ago when AI was an up and coming “thing” but not yet a mainstream facet of our research lives?  Now it seems everything is about AI or has some AI component to it.  I’m not saying that it’s a bad thing nor am I loving it –  I’m just stating the facts!  It does worry me a bit though – but then again I’m an old fogey who is kinda set in her ways 😀

Research funding calls now seem to be centred around AI – which is great but again there is a challenge to this that I think folks are missing.  If you are building tools to enable researchers – how will an LLM fit in?  How do we hit pause or rather encourage the research to slow down just a little until we can convince our researchers that we need those basic building blocks before we go and build the next CN Tower?  Alright Michelle – what are you really trying to say?

AI and LLMs can help us build and expand our data ecosystems – but if we don’t have the basics – aka documentation – how can we build an effective tool and LLM?  Remember the old adage used in statistics?  Garbage in – garbage out?  If we cannot create proper documentation for our datasets – then what will the AI models use?

AI is a great tool – but I think we need to remember exactly that – it is a tool!  We can use it to enhance our current data ecosystems – but let’s not rely on it.  We need to teach our up and coming students those basic building blocks – what is data, what is an attribute, what is a data type, etc…. before we unleash the AI tools that promise to make our lives easier.

Would love to hear your thoughts on this!

Michelle

That is the question to ask – when it comes to historical research data.

So – yes I found some of the original research data that was collected by OAC researchers back in 1877 – BUT…   do we put the time and resources into pulling it out of the PDFs and making it accessible to the world?  I know I’ve asked this question in a different way in previous posts – but it’s a question that keeps coming up.

There are definitely aspects of this question that need to be objectively reviewed:

  • WHY undertake this venture?  For curiosity or because it is valuable to you as a researcher?
  • HOW far back do we go? 10 yrs? 50 yrs? 150 yrs?
  • WHERE will you keep this data?  Hmm…  that’s a big question today!
  • WHO will steward this data?

As a data geek – I want to pull this out and steward it – but I have to be practical about it as well.  How much have research technologies changed over this time period?  How valid is this data today?  Let’s think about this in a different way.  Textbooks – When we teach we update our textbooks on a regular basis since there are new materials to teach and new ways to view and teach the materials.  I have Statistics textbooks going way back – 1950s – but I don’t use these to teach – I use the new updated 2025 texts.  I may read the older texts to get a perspective on why or how things have changed –  but I will use the newer texts as resources for my students.

Let’s go back to data.   I have this cool data on weights and feed intakes of animals in the 1870s but our animals have changed over the past 150+ years.  Is the historical data really of use in today’s active research projects?   Probably not – unless you are a historian?   See – how I can always find a way or reason to keep this data?  But really, time and available resources come into play – do we have the time,  money, and resources to create and preserve this data – that may or may not be of use?

Oish!  I can talk circles around this!  I believe that everyone will come to a point where they will need to make these types of decisions –  but for today’s research data – let’s document it, and deposit it into a repository – the data is relevant!  Not like my 150+yr old data 🙂

Michelle

Imagine this scenario. On her first field season as a principal investigator, a professor watched a graduate student realize—two weeks too late—that no one had recorded soil temperature at the sampling sites. The team had pH, moisture, GPS coordinates… but not the one variable that explained the anomaly in their results. A return trip wasn’t possible. The data gap was permanent.

After that, she changed how her lab collected data.

Instead of relying on ad hoc spreadsheets, she worked with her students to design schemas for their lab’s routine data collection. These weren’t schemas for final data deposit—they were practical structures for the messy, active phase of research. The goal was simple: define in advance what gets collected, how it’s recorded, and which values are allowed.

Researchers can use the Semantic Engine to create schemas that they need for all stages of their research program, from active data collection to final data deposition.

For data collection, once a schema is established, it can be uploaded into the Semantic Engine to generate a Data Entry Excel (DEE) file.

Each DEE contains:

  • A schema description sheet – documentation pulled directly from the schema, including variable definitions and code lists.

  • A data entry sheet – pre-labeled columns that follow the schema rules.

The schema description sheet of a Data Entry Excel.
Data Entry Excel showing the sheet for data entry.

Because the documentation lives in the same file as the data, nothing has to be retyped, reinvented, or remembered from scratch. The schema description sheet also includes code lists that populate the drop-down menus in the data entry sheet, reducing inconsistent terminology and formatting errors.

If the standard schema isn’t sufficient, it can be edited in the Semantic Engine. Researchers can add attributes or adjust fields without rebuilding everything from scratch. The updated schema can then generate a new DEE, preserving previous structure while incorporating the changes.

This approach addresses a common problem: unstructured Excel data. Without standardization, spreadsheets accumulate inconsistent date formats, unit mismatches, ambiguous abbreviations, and missing values. Cleaning that data later is costly and error-prone.

By organizing data entry around a schema:

  • Required information is visible and less likely to be forgotten.

  • Fieldwork becomes more reliable – critical variables are collected the first time.

  • Data from multiple researchers or projects can be harmonized more easily.

  • Manual cleaning and interpretation are reduced.

The generated DEE does not enforce full validation inside Excel (beyond drop-down lists). For formal validation, the completed spreadsheet can be uploaded to the Semantic Engine’s Data Verification tool.
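To give a feel for what checks beyond drop-down lists can look like, here is a small Python sketch. It is not the Semantic Engine's Data Verification tool and makes no claim about how that tool works; the attribute names and rules are hypothetical, and each row is assumed to be a dictionary of strings such as one read from an exported spreadsheet.

```python
from datetime import date

# Hypothetical attributes that every row is expected to fill in.
REQUIRED_FIELDS = ("sample_id", "sample_date", "soil_temperature_c")


def check_row(row: dict) -> list:
    """Return a list of problems found in one data entry row."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not row.get(field, "").strip():
            problems.append(f"{field}: missing")

    # Dates are expected in ISO 8601 form, for example 2024-05-03.
    sample_date = row.get("sample_date", "").strip()
    if sample_date:
        try:
            date.fromisoformat(sample_date)
        except ValueError:
            problems.append("sample_date: not an ISO 8601 date")

    # Temperature must be numeric when present.
    temperature = row.get("soil_temperature_c", "").strip()
    if temperature:
        try:
            float(temperature)
        except ValueError:
            problems.append("soil_temperature_c: not a number")
    return problems


rows = [
    {"sample_id": "S1", "sample_date": "2024-05-03", "soil_temperature_c": "12.5"},
    {"sample_id": "S2", "sample_date": "03/05/2024", "soil_temperature_c": ""},
]
for number, row in enumerate(rows, start=1):
    print(number, check_row(row))
```

The second row shows the two kinds of problem the post describes: a value that was never recorded and a date typed in an inconsistent format.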

Using schema-driven Data Entry Excel files turns data structure into a practical research tool. Instead of discovering gaps during analysis, researchers define expectations at the point of collection—when it matters most.

Written by Carly Huitema

I’m sure by now you’ve heard of the AAFC news – seven research facilities closing with many job cuts.  Research facilities with over a century of research, data, reports….  Oh you all know where I’m going with this!!!  Yup!  Where is all that data?  Gone?  Hidden?  Maybe in some repository?  I don’t know!

What I do know is that we, as an industry and as data researchers and archivists, need to seriously think about that data!   Lacombe Research Centre – 119 years of research – many of these in the field of meat science!  If anyone works in that area – you are well aware of the changes we’ve made over time in the quality of our meats – how we evaluate and grade – a lot of that research was developed at Lacombe!   As a beef geneticist who worked in the meat science field, I am crying if that data is not saved or at least documented!  Uh-oh I said that magic word “document”.

I’m trying to stay optimistic and hopeful – but when I attend industry related meetings and the primary question that arises is “What data?” followed by “Where is the data?” I get scared!  The only reason I am familiar with the type of research and data that was collected at Lacombe is because of my research background.  If I was to run a search today for pork grade data – ok – let’s try it for giggles.

Screenshot: Google results for “pork grade data”

Hmmm…  ok I should add Canada and see if that changes anything….

Screenshot: Google results for “Canadian pork grade data”

 

Yup!  As I suspected nothing but reports – no data!  So – that initial question of “What data?” followed by “Where is the data?” is not being answered!

Two points I want to make here:

  1. Data is NOT easy to find – nothing showing up for Lacombe information?   If you didn’t know this data existed you wouldn’t know to ask about it.  The classic “If you don’t know you don’t know!”  So – if we don’t know it exists then it’s ok to let it go?  Maybe I shouldn’t worry about the data that’s been collected for the past 119 years?
  2. This is the MAIN problem that we are trying to solve with both ADC and the CS-DCC!   A catalogue of data sources to search across.  A place to visit to determine IF the data exists – followed by where the data exists.  BUT if we don’t know it exists or if it disappears then….

Let’s wake up and acknowledge that our data is VALUABLE and needs to be preserved!

Let’s hope I am wrong and the data collected at the seven AAFC facilities slated for closure will be preserved and FAIR!

Michelle

You’ve seen this word thrown around a lot!  Data about data.  Data Documentation.  Information about your data.  So many different ways to define “metadata”.

If you’ve been reading our blog posts – you know that we are STRONG advocates for data documentation!!  I, personally, am a STRONG believer in metadata – without it – all that time and money that was put into data collection has been flushed away.  Without that crucial documentation or metadata – the data you or your team collected is useless since no one can understand what the data is – let alone understand how to use it.

Let’s add another word now – Standards.  Yes!  Believe it or not there are many different metadata standards out there!  I would argue that most scientific disciplines have an established metadata standard.  Now – as a researcher – are you familiar with these?  Did you know there was a metadata standard for your field of research?

At Agri-food Data Canada, we are aware that this can be very overwhelming – so that’s one of the primary reasons we encourage you to use the Semantic Engine to document the data that you collect – as you would collect it.  Let’s work at documenting the data in a machine readable/actionable format – then we can translate it to the metadata standard that your field of research uses.  WOW!  Easy peasy?  Ok there’s some work involved in creating crosswalks across metadata standards – but first and foremost – let’s NOT fret about what the best or recommended metadata standard is in your field – let’s DOCUMENT that data – and cross-walk it over later.  Let’s be honest – most of us forget to document and need to go back months later and remember what we did!  So document now in an easy to use format – Semantic Engine – and then come talk to us about how to cross-walk to the metadata standard in your field.

Hang on!  One more word today:

INTEROPERABILITY!   

Let me just drop that one here – isn’t this part of what I’m rambling on about today?

Michelle