Funding for Agri-food Data Canada is provided in part by the Canada First Research Excellence Fund
At Agri-food Data Canada (ADC), we often emphasize the importance of content-derived identifiers—unique fingerprints generated from the actual content of a resource. These identifiers are especially valuable in research and data analysis, where reproducibility and long-term verification are essential. When you cite a resource using a derived identifier, such as a digest of source code, you’re ensuring that years down the line, anyone can confirm that the referenced material hasn’t changed.
One of the best tools for managing versioned documents—especially code—is GitHub. Not only does GitHub make it easy to track changes over time, but it also automatically generates derived identifiers every time you save your work.
Every time you make a commit in GitHub (i.e., save a snapshot of your code or document), GitHub creates a SHA-1 digest of that commit. This digest is a unique identifier derived from the content and metadata of the commit. It acts as a cryptographic fingerprint that ensures the integrity of the data.
Here’s what goes into creating a GitHub commit digest:
All of this is bundled together and run through the SHA-1 hashing algorithm, producing a 40-character hexadecimal string like:
e68f7d3c9a4b8f1e2c3d4a5b6f7e8d9c0a1b2c3d
GitHub typically displays only the first 7–8 characters of this digest (e.g., e68f7d3c
), which is usually enough to uniquely identify a commit within a repository.
When citing code or documents stored on GitHub, always include the commit digest. This practice ensures that:
Whether you’re publishing a paper, sharing an analysis, or collaborating on a project, referencing the commit digest helps maintain transparency and reproducibility, promoting FAIR research.
GitHub’s built-in support for derived identifiers makes it a powerful platform for version control and long-term citation. By simply noting the commit digest when referencing code or documents, you’re contributing to a more robust and verifiable research ecosystem.
So next time you cite GitHub work, take a moment to copy that digest. It’s a small step that makes a big difference in the integrity of your research.
Written by Carly Huitema
Oh WOW! Back in October I talked about possible places to store and search for data schemas. For a quick reminder check out Searching for Data Schemas and Searching for variables within Data Schemas. I think I also stated somewhere or rather sometime, that as we continue to add to the Semantic Engine tools we want to take advantage of processes and resources that already exist – like Borealis, Lunaris, and odesi. In my opinion, by creating data schemas and storing them in these national platforms, we have successfully made our Ontario Agricultural College research data and Research Centre data FAIR.
But, there’s still a little piece to the puzzle that is missing – and that my dear friends is a catalogue. Oh I think I heard my librarian colleagues sigh :). Searching across National platforms is fabulous! It allows users to “stumble” across data that they may not have known existed and this is what we want. But, remember those days when you searched for a book in the library – found the catalogue number – walked up a few flights of stairs or across the room to find the shelf and then your book? Do you remember what else you did when you found that shelf and book? Maybe looked at the titles of the books that were found on the same shelf as the book you sought out? The titles were not the same, they may have contained different words – but the topic was related to the book you wanted. Today, when you perform a search – the results come back with the “word” you searched for. Great! Fabulous! But it doesn’t provide you with the opportunity to browse other related results. How these results are related will depend on how the catalogue was created.
I want to share with you the beginnings of a catalogue or rather catalogues. Now, let’s be clear, when I say catalogue, I am using the following definition: “a complete list of items, typically one in alphabetical or other systematic order.” (from the Oxford English Dictionary). We have started to create 2 catalogues at ADC – one is the Agri-food research centre schema library listing all the data schemas currently available at the Ontario Dairy Research Centre and the Ontario Beef Research Centre and the second is a listing of data schemas being used in a selection of Food from Thought research projects.
As we continue to develop these catalogues, keep an eye out for more study level information and a more complete list of data schemas.
image created by AI
GitHub is more than just a code repository, it is a a powerful tool for collaborative documentation and standards development. GitHub is an important tool for the development of FAIR data. In the context of writing and maintaining documentation, GitHub provides a comprehensive ecosystem that enhances the quality, accessibility, and efficiency of the process. Here’s why GitHub is invaluable for documentation:
To learn more about how to use GitHub, ADC has contributed content to this online book with introductions to research data management and how to use GitHub for people who write documentation. This project itself is an example of documentation hosted in GitHub and using the static site generator Jekyll to turn back-end markdown pages into an HTML-based webpage.
From the GitHub introduction you can learn about how to navigate GitHub, write in Markdown, edit files and folders, work on different branches of a project, and sync your GitHub work with your local computer. All of these techniques are useful for working collaboratively on documentation and standards using GitHub.
Agri-food Data Canada is a partner in the recently announced Climate Smart Agriculture and Genomics project and is a member of the Data Hub. One of our outputs as part of this team has been the introduction to GitHub documentation.
Written by Carly Huitema
© 2023 University of Guelph