Funding for Agri-food Data Canada is provided in part by the Canada First Research Excellence Fund
Part of the blog series on Collaborative Research IT Infrastructure
In our last post, we explored how shared storage provides more than just space—it delivers reliability, cost-effectiveness, and compliance while ensuring that researchers maintain secure, separate environments. That conversation naturally leads us to the bigger picture: how do we prepare not just for today’s data needs, but for tomorrow’s?
Research is never static. Projects that begin with modest requirements can quickly grow into large-scale endeavors generating terabytes or even petabytes of data. Computational models that once ran overnight on a single server may soon demand clusters of GPUs or cloud-scale capacity. Without careful planning, institutions can find themselves stuck in a cycle of constantly reinvesting in fragmented, short-term solutions.
Shared infrastructure breaks this cycle. Because it is designed as a pooled resource, it can scale both horizontally and vertically: adding new storage tiers, expanding compute capacity, and adopting emerging technologies without requiring each individual research group to reinvent the wheel. This adaptability allows universities to grow capacity in step with research demands, ensuring that no project is limited by yesterday’s infrastructure.
Where centralization happens matters. To build a system that truly grows with research, centralization should ideally happen at the institutional level, where resources can be managed for efficiency, compliance, and security. If infrastructure is instead decentralized down to departments or individual labs, scaling becomes fragmented and harder to secure. At the same time, decentralization at the inter-university level, where robust institutional systems can be federated, creates opportunities for large-scale collaboration without sacrificing autonomy. The right balance between these layers is what makes scaling sustainable, secure, and future-ready.
Equally important is to recognize that many departments have already invested heavily in their own infrastructure. Transitioning to a shared system should not be seen as abandoning those investments, but as building upon them. By working with IT, researchers can integrate existing resources into the shared system, ensuring past investments continue to provide value while gaining the benefits of enhanced security, professional management, and long-term efficiency.
By investing in shared systems today, universities position themselves to take advantage of tomorrow’s advancements in research computing—from AI-driven analytics to new storage technologies—without requiring researchers to overhaul their individual setups. This creates an environment where innovation is not slowed by technical limitations, but instead supported by a strong, adaptable foundation.
Stay tuned for our next post, where we’ll discuss the key considerations for transitioning to a shared system at the University of Guelph, from funding to governance to staffing.
Written by Lucas Alcantara
Featured picture generated by Pixlr
Part of the blog series on Collaborative Research IT Infrastructure
In our last post, we debunked the myth that shared infrastructure reduces research autonomy. In fact, we showed how a shared system can actually enhance flexibility by offering tailored environments, freeing researchers from the burdens of IT management, and fostering seamless collaboration across teams and institutions. With that foundation in mind, it’s time to look more closely at one of the most critical components of shared infrastructure: storage.
Research today generates enormous amounts of data, from raw experimental outputs to refined datasets ready for analysis. Managing this data effectively is no small task. Without the right systems in place, researchers face risks such as data loss, limited scalability, and unnecessary costs. Shared storage solutions address these challenges directly, providing not just space, but a strategic foundation for secure, efficient, and sustainable research.
Let’s first address a common misunderstanding: that “shared storage” means everyone’s files end up in the same place. In reality, shared storage means we share the infrastructure, not the data. Each research group has its own secure, separate space for its data, but the underlying platform is common. Unlike isolated lab servers or external drives, shared storage systems are professionally managed and built with reliability in mind. They include features such as automated backups, replication, and monitoring, ensuring that data is preserved even in the face of hardware failures or unexpected disruptions. This peace of mind allows researchers to focus on discovery rather than worrying about whether their results will still be there tomorrow.
Cost efficiency is another major advantage. Instead of each research group investing in its own storage infrastructure, resources are pooled across the institution. This consolidation lowers per-unit costs and allows universities to invest in high-quality systems that would be prohibitively expensive for individual teams. Shared storage also scales easily—expanding capacity as projects grow, without the need for researchers to procure and maintain new hardware themselves.
Equally important is flexibility. Modern shared storage platforms offer multiple tiers tailored to different research needs. High-performance storage supports active datasets that require frequent access and rapid analysis, while lower-cost archival tiers ensure long-term preservation of completed projects or regulatory records. Researchers can move seamlessly between tiers, paying only for the performance and capacity they actually need at each stage of the research lifecycle.
Another key benefit is compliance. In addition to internal institutional data storage policies, certain research projects must meet strict national and international requirements—such as Canada’s National Security Guidelines for Research Partnerships or the Sensitive Technology Research and Affiliations of Concern (STRAC) policy. Managing these obligations individually is difficult and uneven, but shared storage makes compliance consistent across the institution. With centralized oversight, encryption, access controls, and professional monitoring built into the system, researchers can trust that their data is protected and their projects meet security obligations without extra effort on their part.
Shared storage also supports collaboration by making it easier to share datasets securely across departments and institutions. Standardized access controls and compliance measures ensure that sensitive data remains protected while still enabling the kind of data sharing that drives interdisciplinary research and open science.
In short, shared storage is more than a convenient place to put data—it’s an enabler of modern research. By combining reliability, scalability, cost savings, flexibility, and compliance into an enhanced collaborative environment, it provides a strong foundation for both everyday work and ambitious, large-scale projects.
Stay tuned for our next post, Scaling for the Future: Building a System That Grows with Your Research, where we’ll explore how shared infrastructure offers adaptability for evolving projects and prepares universities to leverage the next generation of research computing tools.
Written by Lucas Alcantara
Featured picture generated by Pixlr
I hope you’ve enjoyed our blog posts in 2024! There are more to come in 2025, but first we’ll be taking a short holiday break!
See you back in 2025: our first blog post of the year will be on January 10!
Happy New Year!
Photo generated by AI

In the world of academic research, data is the cornerstone of discovery and innovation. Graduate students and faculty members invest countless hours into gathering, analyzing, and interpreting data, making its protection crucial. Despite this, the importance of data backups often gets overlooked until disaster strikes. By cultivating a backup culture within research teams, we can ensure the safety and integrity of our valuable research data.
Why Backup Culture Matters in Academia
Imagine spending months on a groundbreaking experiment, only to lose all your data to a computer crash. Or consider the nightmare of a virus corrupting your thesis draft days before submission. These scenarios are not just horror stories—they’re real risks that researchers face daily. A strong backup culture can prevent these disasters, ensuring that your hard work is never lost.
In academia, where research data can be irreplaceable, a backup culture ensures that data loss doesn’t derail your projects. It also supports compliance with institutional requirements and funding bodies, which increasingly mandate robust data management plans.
Building a Robust Backup Culture
So, how do we create a culture where backing up data is second nature? It starts with understanding and implementing best practices. Let’s explore some key strategies.
Regular Backup Schedules
Backing up data should be a routine part of your research workflow. Automated backup software can handle this effortlessly, ensuring that backups occur without anyone needing to remember to run them. For critical data, daily backups are ideal, while less critical information might be backed up weekly. Moreover, maintaining multiple versions of your data allows you to revert to previous states if recent changes introduce errors.
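As a rough illustration of versioned backups, here is a minimal Python sketch that copies a file under a timestamped name and keeps only the most recent copies. The function name make_versioned_copy, the keep parameter, and the file names are hypothetical choices for this example; in practice a scheduler or dedicated backup software would trigger the copy, and institutional tools may work quite differently.

```python
import shutil
import tempfile
from datetime import datetime, timedelta, timezone
from pathlib import Path


def make_versioned_copy(source, backup_dir, keep=7, now=None):
    """Copy `source` into `backup_dir` under a timestamped name and
    keep only the `keep` most recent versions (hypothetical helper)."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    source, backup_dir = Path(source), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # A UTC timestamp in ISO 8601 basic format sorts chronologically by name.
    moment = now or datetime.now(timezone.utc)
    stamp = moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = backup_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)

    # Delete everything except the newest `keep` versions.
    versions = sorted(backup_dir.glob(f"{source.stem}_*{source.suffix}"))
    for old_version in versions[:-keep]:
        old_version.unlink()
    return target


# Demonstration in a temporary folder: three daily backups, two retained.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "trial.csv"
    data_file.write_text("animal_id,milk_kg\nA1,31.2\n")
    first_day = datetime(2024, 1, 1, tzinfo=timezone.utc)
    for day in range(3):
        make_versioned_copy(
            data_file,
            Path(tmp) / "backups",
            keep=2,
            now=first_day + timedelta(days=day),
        )
    remaining = sorted(p.name for p in (Path(tmp) / "backups").iterdir())

print(remaining)
# ['trial_20240102T000000Z.csv', 'trial_20240103T000000Z.csv']
```

Keeping several dated copies, rather than overwriting a single backup, is what makes it possible to roll back when a recent change turns out to be a mistake.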
Multiple Backup Locations
Relying on a single backup location is risky. Instead, diversify where you store your data:
Data Encryption
Securing your backups is as important as making them. Encrypt your backup data to protect it from unauthorized access, especially when dealing with sensitive research information. Ensure that the encryption methods used comply with institutional and funding body regulations.
Regular Testing and Verification
It’s not enough to just back up your data; you need to ensure those backups are usable:
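One simple check, sketched here in Python, is to compare a checksum of the original file with a checksum of its backup copy. This is only an illustration under stated assumptions: the helper names sha256_of and backup_is_intact and the file names are hypothetical, and the comparison is only meaningful if the original has not changed since the backup was taken. It does not replace a periodic full test restore.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(original, backup):
    """True when both files have identical contents (hypothetical helper)."""
    return sha256_of(original) == sha256_of(backup)


# Demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "trial_data.csv"
    original.write_text("animal_id,milk_kg\nA1,31.2\n")
    backup = Path(tmp) / "trial_data_backup.csv"
    shutil.copy2(original, backup)
    intact_before = backup_is_intact(original, backup)

    # Simulate silent corruption of the backup copy.
    backup.write_text("animal_id,milk_kg\nA1,3.12\n")
    intact_after = backup_is_intact(original, backup)

print(intact_before, intact_after)  # True False
```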
Fostering a Backup Culture in Research Teams
Creating a backup culture goes beyond individual practices—it involves building a team-wide commitment to data safety.
Encouraging Regular Backups
Team leaders and senior researchers play a crucial role in setting the tone. By leading by example and consistently backing up their own data, they can influence others to do the same. Recognizing and rewarding team members who follow backup protocols can also foster this culture. Regular reminders and checklists can help keep the habit of backing up data top of mind.
Providing Training and Resources
Not everyone is familiar with the best practices for data backup. Providing training sessions and resources can equip team members with the knowledge they need. Create a central repository of guides and tutorials on backup procedures and best practices, and establish a support system where team members can get help with backup-related issues.
Monitoring and Enforcing Backup Policies
Regular audits can ensure that backup policies are being followed. These audits can identify gaps and areas for improvement. Creating a feedback loop allows team members to suggest improvements and report issues, making the process collaborative and dynamic.
Celebrating Successes and Learning from Failures
Highlighting instances where effective backups saved the day can reinforce the importance of regular backups. Sharing these success stories within the team can motivate others to maintain good backup habits. Conversely, analyzing any data loss incidents to understand what went wrong and how backup processes can be improved helps prevent future occurrences.
Conclusion
In the fast-paced world of academic research, protecting your data should be a top priority. By cultivating a backup culture and implementing best practices, graduate students and faculty can safeguard their invaluable research data. Start building your backup culture today, and ensure that your hard work and discoveries are never lost to chance.
Remember, a strong backup culture not only protects your data but also enhances the credibility and reliability of your research. Take the first step now and make data backup an integral part of your academic journey.
Lucas Alcantara
I recently attended the 46th ADSA Discover Conference, themed “Milking the Data – Value Driven Dairy Farming,” and the discussions there really drove home how crucial data integration is for the dairy industry. I want to share some insights and reflections, highlighting why this topic is so important, the challenges we face, and some specific use cases in research.
Why Data Integration Matters
First, let’s talk about why integrating data in dairy farming is such a game-changer. Imagine having a single dashboard where you can see everything about your farm—milk yields, feed efficiency, animal health, and even environmental impact. That’s the power of data integration. It brings together information from various sources to give you a complete picture, enabling you to make smarter, more informed decisions.
While the benefits of data integration in dairy farming are evident for farmers, its importance extends far beyond the barn. Researchers also stand to gain significantly from integrated data systems in the agricultural sector.
Challenges to Overcome
Of course, the journey to fully integrated data systems isn’t without its hurdles. Here are some of the main challenges we need to tackle:
Conference Insights
The conference sessions really brought these points to life. One session that stood out was on precision livestock farming, where experts discussed the latest in AI and sensor technologies. These advancements are paving the way for more precise and actionable insights into farm operations.
Another highlight was the discussion on data governance. It’s not just about collecting data; it’s about managing it responsibly. Who owns the data? How should it be used? These questions are critical, and the conference provided a platform to explore these ethical and practical considerations.
Discussions also underscored a significant obstacle: the lack of APIs and fully documented dataset schemas from most software providers. This is a major bottleneck for the seamless flow of data across platforms. Without standardized APIs and comprehensive documentation, accessing and consolidating data becomes extremely challenging.
Research Use Cases for Data Integration
A prime example of how data integration is improving research is the Ontario Dairy Research Centre (ODRC), where we’ve integrated a vast array of data to support cutting-edge research. This includes everything from milk production records, feed intake, and health monitoring data to environmental conditions. Here’s how it makes a difference:
Impact on Research Time
Imagine the typical research project timeline. A significant portion is usually dedicated to data collection, cleaning, and integration—often consuming up to 50% of the total project time. By streamlining these processes through integrated data systems, researchers can potentially halve their preliminary work phase. This not only accelerates the pace of research but also amplifies the impact of findings by enabling more rapid dissemination and application of results.
Steps Forward: Leveraging Agri-Food Data Canada (ADC)
At ADC we are leading the way in addressing these challenges. Here are some steps ADC is taking to enhance data integration and usability:
The 46th ADSA Discover Conference was a fantastic opportunity to learn about the current state and future potential of data integration in dairy farming. It’s clear that while there are challenges to overcome, the benefits are immense. By embracing these changes, we can create a more efficient, sustainable, and profitable dairy industry. Let’s keep the conversation going about how we can harness the power of data to improve research and transform dairy farming through innovation!
Lucas Alcantara
The Ontario Dairy Research Centre is owned by the Agricultural Research Institute of Ontario and managed by the University of Guelph through the Ontario Agri-Food Innovation Alliance, a collaboration between the Government of Ontario and the University of Guelph.
Ensuring code consistency and reproducibility is paramount. Imagine collaborating on a project where each member uses different package versions, leading to inconsistencies in the results obtained. Two fundamental steps toward reproducibility are setting up an organized, self-contained R project and using renv, an R package that manages project-specific dependencies and environments.
Creating an R project in RStudio is a straightforward process
Step 1: Open RStudio
Launch RStudio on your computer. If you haven’t installed RStudio yet, you can download it from the official website: RStudio Download Page.
Step 2: Create a New Project
Once RStudio is open, navigate to the top menu and click on “File” > “New Project”. A dialog box will appear with options for creating a new project.
Step 3: Choose Project Type
In the dialog box, you’ll see several project types to choose from. Select “New Directory” and then choose the type of project you want to create. For a generic R project, select “New Project”.
Step 4: Choose Project Directory
After selecting “New Project”, click “Next”. You’ll be prompted to choose a directory for your new project. This is where all your project files will be stored. You can either create a new directory or choose an existing one. I suggest you always create a new directory.
Step 5: Enter Project Name
Give your project a name in the “Directory name” field. This name will be used to name the new directory and will also be the name that identifies your project in RStudio.
Step 6: Additional Options
On the same screen, you’ll see additional options. Check the box that says “Create a git repository” if you want to initialize a Git repository for version control. Next, check the box that says “Use renv with this project” to utilize renv for managing project dependencies. This will automatically set up renv to manage your project’s package dependencies.
Step 7: Create Project
Once you’ve chosen a directory, entered a project name, and selected the desired options, click “Create Project”. RStudio will create the project directory, set up Git (if selected), activate renv, and open the project as a new RStudio session.
Step 8: Start Working
Your new project is now set up and ready to use. You’ll see the project directory in the “Files” pane on the bottom right of the RStudio interface. You can start working on your R scripts, import data, create plots, and more within this project.
Using renv Package Manager
If you have already initialized renv when you created your project, skip to Step 2.
Step 1: Initializing renv
Start by installing and loading the renv package. If it’s not already installed, a simple installation command gets the job done. Once initialized, you don’t need to load and initialize it again, so you should comment those lines out.
# Install renv (only needed once per machine)
install.packages("renv")
# Load renv and initialize it for this project (only needed once per project)
library(renv)
renv::init()
Step 2: Installing and Managing Packages
With renv activated, installing and managing packages becomes a breeze. You can install packages as usual from various sources like CRAN, GitHub, or even specific versions.
# Install the latest dplyr version
install.packages("dplyr")
# Or install a specific dplyr version directly using renv
renv::install("dplyr@1.0.7")
Step 3: Saving Project Dependencies
A crucial step in ensuring reproducibility is saving project dependencies. renv accomplishes this by creating a lockfile (renv.lock) that records the exact versions of all installed packages. To ensure new dependencies are added to the lockfile, you can create a snapshot of your project using renv::snapshot().
Step 4: Collaborating and Restoring Environments
Sharing your project with collaborators is seamless. Just share the project along with the renv.lock file. Collaborators can then restore the project environment to its exact state using renv::restore().
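To make the role of the lockfile more concrete: renv.lock is a JSON file with a "Packages" section that records a version for each package. renv itself is driven from R with renv::snapshot() and renv::restore(); the Python sketch below only reads two lockfiles to show how a version mismatch between collaborators could be spotted. The lockfile contents are hypothetical and heavily trimmed, and the helper names are invented for this example.

```python
import json

# Hypothetical, heavily trimmed lockfile contents. A real renv.lock
# stores more fields per package (source, repository, hash, and so on).
MY_LOCKFILE = """
{
  "R": {"Version": "4.3.1"},
  "Packages": {
    "dplyr": {"Package": "dplyr", "Version": "1.1.0"},
    "tibble": {"Package": "tibble", "Version": "3.2.1"}
  }
}
"""

COLLABORATOR_LOCKFILE = """
{
  "R": {"Version": "4.3.1"},
  "Packages": {
    "dplyr": {"Package": "dplyr", "Version": "1.0.7"},
    "tibble": {"Package": "tibble", "Version": "3.2.1"}
  }
}
"""


def package_versions(lockfile_text):
    """Map each package name in a lockfile to its recorded version."""
    lock = json.loads(lockfile_text)
    return {name: entry["Version"] for name, entry in lock["Packages"].items()}


def version_mismatches(mine, theirs):
    """Return {package: (my_version, their_version)} where the two differ."""
    a, b = package_versions(mine), package_versions(theirs)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


mismatches = version_mismatches(MY_LOCKFILE, COLLABORATOR_LOCKFILE)
print(mismatches)  # {'dplyr': ('1.1.0', '1.0.7')}
```

When everyone restores from the same committed renv.lock, a comparison like this comes back empty, which is exactly the point of sharing the lockfile.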
Why does this matter?
Let’s dive into an example showcasing the importance of renv in maintaining code consistency over time. Consider the scenario where the dplyr package introduces a new feature, such as “.by” in version 1.1.0.
# Load dplyr, which provides the starwars dataset, %>% and summarise()
library(dplyr)
# Summarise mean height by species and homeworld
starwars %>%
  summarise(
    mean_height = mean(height),
    .by = c(species, homeworld)
  )
# If you run the code above, you will get the following on the R console:
# A tibble: 57 × 3
#    species homeworld mean_height
#    <chr>   <chr>           <dbl>
#  1 Human   Tatooine         179.
#  2 Droid   Tatooine         132
#  3 Droid   Naboo             96
#  4 Human   Alderaan         176.
#  5 Human   Stewjon          182
#  6 Human   Eriadu           180
#  7 Wookiee Kashyyyk         231
#  8 Human   Corellia         175
#  9 Rodian  Rodia            173
# 10 Hutt    Nal Hutta        175
Now, if your collaborators are using an older version of dplyr, say version 1.0.7, that did not have the “.by” feature, inconsistencies will arise.
# Running the same code as above would instead return the following:
# A tibble: 174 × 2
#    mean_height .by
#          <dbl> <chr>
#  1          NA Human
#  2          NA Droid
#  3          NA Droid
#  4          NA Human
#  5          NA Human
#  6          NA Human
#  7          NA Human
#  8          NA Droid
#  9          NA Human
# 10          NA Human
By leveraging renv, you can ensure that your R projects remain reproducible and consistent across different environments. Managing dependencies, sharing projects, and adapting to package updates becomes effortless, enabling smooth collaboration and reliable analysis.
So, next time you start a new project, make sure to set up an R project in RStudio, and remember the power of renv in keeping your code reproducible and your results consistent.
Happy coding!
Written by Lucas Alcantara
As researchers, we’re no strangers to the complexities of data management, especially when it comes to handling date and time information. Whether you’re conducting experiments, analyzing trends, or collaborating on projects, accurate temporal data is crucial. Like in many other fields, precision is key, and one powerful tool at our disposal for managing temporal data is the ISO 8601 standard.
Understanding ISO Date and Time
ISO 8601, the international standard for representing dates and times, provides a unified format that is recognized and utilized across various disciplines. At its core, ISO date and time formatting adheres to a logical and consistent structure, making it ideal for storing and exchanging temporal data.
In the ISO 8601 format:
Altogether, ISO 8601 allows for a comprehensive framework for managing temporal information with precision and clarity. For example:
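A small Python sketch can make the format concrete. It uses only the standard library; the specific date and time values are arbitrary examples chosen for illustration. It also shows one practical consequence of the year-month-day ordering: ISO 8601 date strings sort chronologically when sorted as plain text.

```python
from datetime import date, datetime, timezone

# Arbitrary example values, chosen only for illustration.
sampling_date = date(2024, 3, 15)
sampling_time = datetime(2024, 3, 15, 14, 30, 0, tzinfo=timezone.utc)

# isoformat() writes ISO 8601 strings: date, "T" separator, time, UTC offset.
print(sampling_date.isoformat())  # 2024-03-15
print(sampling_time.isoformat())  # 2024-03-15T14:30:00+00:00

# Because the largest unit comes first, text sorting is chronological sorting.
stamps = ["2024-03-15", "2023-12-01", "2024-01-20"]
print(sorted(stamps))  # ['2023-12-01', '2024-01-20', '2024-03-15']
```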
Advantages of ISO Date and Time
Best Practices for Working with ISO Date and Time
Working with Date and Time in R and Python
Using R for Date and Time Manipulation
R provides powerful libraries like lubridate from the tidyverse for easy and intuitive date and time manipulation. With functions like ymd_hms() and with_tz(), parsing and converting date-time strings to different time zones is straightforward. Additionally, R offers extensive support for extracting and manipulating various components of date-time objects.
For code examples in R, refer to this code snippet on GitHub.
Using Python for Date and Time Manipulation
Python’s datetime module and the pytz package offer comprehensive functionality for handling date and time operations. Parsing datetime strings and converting between time zones can be achieved using the fromisoformat() and astimezone() methods. Python also allows arithmetic on datetime objects using timedelta.
For code examples in Python, refer to this code snippet on GitHub.
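For readers without access to the linked snippet, the following is a minimal sketch of the same ideas and is not the author's original code. To stay self-contained it uses a fixed UTC-5 offset from the standard library instead of pytz, so it does not handle daylight saving time; the timestamp and the 12-hour interval are arbitrary assumptions.

```python
from datetime import datetime, timedelta, timezone

# Parse an ISO 8601 string that carries an explicit UTC offset.
utc_time = datetime.fromisoformat("2024-03-15T14:30:00+00:00")

# Fixed UTC-5 offset as a stand-in for a real time zone (assumption:
# no daylight saving rules are applied, unlike a pytz time zone).
utc_minus_5 = timezone(timedelta(hours=-5))
local_time = utc_time.astimezone(utc_minus_5)
print(local_time.isoformat())  # 2024-03-15T09:30:00-05:00

# Date arithmetic with timedelta: the next reading 12 hours later.
next_reading = utc_time + timedelta(hours=12)
print(next_reading.isoformat())  # 2024-03-16T02:30:00+00:00

# Converting between offsets changes the representation, not the instant.
print(local_time == utc_time)  # True
```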
Conclusion
When it comes to accurate research data, effective management of temporal data is indispensable for conducting rigorous analyses and drawing meaningful conclusions. By embracing the ISO 8601 standard for date and time representation, researchers can harness the power of standardized formatting to ensure data FAIRness.
Written by Lucas Alcantara
Comic’s Source: https://xkcd.com/1179
Introduction
Data is the backbone of informed decision-making in livestock management. However, the volume and complexity of data generated in modern livestock farms pose challenges to maintaining its quality. Inaccurate or unreliable data can have profound consequences on research programs and overall farm operations. In this technical exploration, we delve into automated data cleaning and quality assurance in livestock databases, focusing on the impact of missing data and outliers.
The Need for Data Quality in Livestock Databases
Livestock management relies heavily on data-driven insights. Accurate and reliable data is critical for making informed decisions regarding breeding, health monitoring, and resource allocation, as well as for conducting research projects. Aside from inaccurate research findings, poor data quality can lead to misguided decisions, affecting animal welfare and farm profitability. Ensuring high-quality data is, therefore, foundational to the success of livestock operations. Let’s explore two common data quality issues in livestock databases.
Missing Data
Missing data can sometimes compromise the accuracy and reliability of decision-making in livestock management. When critical information is missing, analyses may be skewed, leading to incomplete insights and potentially flawed conclusions.
This is particularly concerning in scenarios where missing data is not random, introducing bias into the analysis. For example, if certain health records are more likely to be missing for a specific group of livestock, any decision based on the available data may not accurately represent the entire population.
Moreover, the handling of missing data can impact statistical analyses. Traditional methods, such as listwise (row-wise) deletion, discard entire records that contain missing values, which reduces the sample size and can introduce bias. Whenever applicable, livestock data professionals should employ robust imputation techniques to address missing data systematically.
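There are three main mechanisms through which data can be missing:
Missing Completely at Random (MCAR): the probability that a value is missing is unrelated to any observed or unobserved data.
Missing at Random (MAR): the probability that a value is missing depends only on other observed data.
Missing Not at Random (MNAR): the probability that a value is missing depends on the unobserved value itself.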
Understanding these mechanisms is crucial for selecting appropriate imputation methods and addressing missing data effectively in livestock databases.
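To illustrate the trade-off between deleting incomplete records and imputing missing values, here is a minimal Python sketch. The records, field names, and helper functions are hypothetical, and simple mean imputation is used only because it is easy to follow; it is not presented as the robust technique of choice, and it is generally only defensible when values are missing completely at random.

```python
from statistics import mean

# Hypothetical records; None marks a missing measurement.
records = [
    {"animal_id": "A1", "milk_kg": 31.2, "weight_kg": 640.0},
    {"animal_id": "A2", "milk_kg": None, "weight_kg": 655.0},
    {"animal_id": "A3", "milk_kg": 28.4, "weight_kg": None},
    {"animal_id": "A4", "milk_kg": 33.0, "weight_kg": 610.0},
]
FIELDS = ("milk_kg", "weight_kg")


def listwise_delete(rows):
    """Keep only the rows where every measured field is present."""
    return [r for r in rows if all(r[f] is not None for f in FIELDS)]


def mean_impute(rows):
    """Replace each missing value with the mean of the observed values."""
    means = {f: mean(r[f] for r in rows if r[f] is not None) for f in FIELDS}
    return [
        {**r, **{f: r[f] if r[f] is not None else means[f] for f in FIELDS}}
        for r in rows
    ]


complete = listwise_delete(records)
imputed = mean_impute(records)

# Deletion halves the sample here; imputation keeps all four animals.
print(len(records), len(complete), len(imputed))  # 4 2 4
print(imputed[2]["weight_kg"])  # 635.0
```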
Data Outliers
Outliers in livestock data can distort analyses and lead to misguided decisions. An outlier, which is an observation significantly different from other data points, may indicate a measurement error, a rare event, or an underlying issue requiring attention. Failing to identify and handle outliers can result in skewed statistical measures and inaccurate predictions, potentially impacting the health and productivity of the livestock.
Outliers in livestock data can arise from various sources, including:
Addressing outliers involves a combination of statistical methods and machine learning approaches to ensure robust and accurate analyses.
Some statistical methods and machine learning approaches for detecting and addressing outliers are commonly used with livestock data, such as:
Applying a combination of statistical and machine learning techniques can also help identify and address outliers, ensuring the integrity of livestock data analyses. These approaches play a critical role in maintaining data quality and, consequently, making informed decisions in the dynamic field of livestock management.
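As one concrete example of a statistical approach, the Python sketch below flags outliers with the interquartile range (IQR) rule. The choice of the IQR rule, the conventional multiplier of 1.5, the function name, and the milk yield values are all assumptions made for illustration; flagged values should be reviewed rather than deleted automatically, since an outlier may be a real event such as a sick animal.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical daily milk yields (kg) for one cow; 3.1 looks like a
# misplaced decimal point or a genuine health event worth checking.
daily_milk_kg = [29.5, 30.1, 31.0, 30.4, 29.8, 30.7, 3.1, 30.2]

flagged = iqr_outliers(daily_milk_kg)
print(flagged)  # [3.1]
```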
Conclusion
In this initial exploration, we’ve laid the groundwork for understanding the importance of data quality in livestock databases and highlighted two critical challenges: missing data and outliers. Subsequent sections will delve into the technical aspects of automated data cleaning, providing insights into techniques, tools, and best practices to overcome these challenges. As we navigate through the intricacies of data cleaning and quality assurance, we aim to empower technical audiences to implement robust processes that elevate the reliability and utility of their livestock data. Stay tuned for deeper insights into automated data cleaning techniques in future posts.
Written by Lucas Alcantara
© 2023 University of Guelph