FAIR

Findable

Accessible (where possible)

Interoperable

Reusable

Ah, the last blog post in the series of four regarding the FAIR principles.  The last or the first, depending on how you look at it :). F for Findable!  A quick review from the FAIR website:

F1. (Meta)data are assigned a globally unique and persistent identifier
F2. Data are described with rich metadata (defined by R1 below)
F3. Metadata clearly and explicitly include the identifier of the data they describe
F4. (Meta)data are registered or indexed in a searchable resource

As we’ve ventured through the FAIR principles, we’ve highlighted the reproducibility crisis, we’ve discussed the challenges of interoperability – using the wrench as an example – and we’ve talked about making the (meta)data accessible.  Has anyone noticed the PRIMARY theme behind all of these?

Yup, my favourite: METADATA!  So this post will be rather short, since I’ve already tackled many of the aspects I want to highlight.

First though, if you read through the FAIR principles, they say (meta)data – I would challenge anyone to find where the principles say that the DATA itself needs to be accessible, etc…  The FAIR principles were created to help researchers ensure that the data is FAIR by way of the metadata.  We all know we cannot share all the data we collect – but, as I noted in an earlier post, we should at least be aware of the data through its metadata.  Hence my going on and on and on about metadata, or that love letter to yourself, or better yet, the data schema!

So let’s talk briefly about Findable.  How do we do this?  I know you already know the answer – by documenting your data, or by building that data schema!  The A, I, and R really can’t happen before we fulfill Findable 🙂

We already talked about the unique identifier (DOI) in the A in FAIR post.  Now let’s take a closer peek at how we can describe our data.  Here at Agri-food Data Canada (ADC), we’ve been developing the Semantic Engine: a suite of tools to help you create your data schema – rich metadata that describes your data.

Review what the Semantic Engine can do for you by watching this little video.


To address the F principles, we just need to create a data schema, or metadata.  Sounds simple enough, right?  The Semantic Engine tools make it easy for you to create this – so try it out at:

https://www.semanticengine.org/

Remember, if you need help, reach out to us at adc@uoguelph.ca or book an appointment with one of our team members at https://agrifooddatacanada.ca/consultations/

Let’s continue to build knowledge and change the Data Culture by creating FAIR data!

Michelle


Entry codes can be very useful for ensuring your data is high quality and for catching mistakes that might mess up your analysis.

For example, you might have taken multiple measurements of two samples (WH10 and WH20) collected during your research.

[Figure: Example data table]

You have a standardized sample name, a measurement (iron concentration), and a condition score for the samples, from one to three. You can easily group your analysis by sample because the samples have consistent names. Incidentally, this is an example of a dataset in a ‘long’ format. Learn more about wide and long formats in our blog post.

The data is clean and ready for analysis, but perhaps in the beginning the data looked more like this, especially if you had multiple people contributing to the data collection:

[Figure: Example data table with errors in the data]

Sample names and condition scores are inconsistent. You will need to go in and correct the data before you can analyze it (perhaps using a tool such as OpenRefine). Also, if someone else uses your dataset they may not even be aware of the problems: they may not know that the condition score can only have the values 1, 2, or 3, or which sample names should be used consistently.
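To make this concrete, here is a minimal sketch of how you might flag values that fall outside a set of entry codes before analysis. This is not the Semantic Engine itself, just an illustration assuming Python with pandas, using hypothetical column names (Sample, Condition) borrowed from the example above:

```python
import pandas as pd

# Hypothetical messy data, as in the example table with errors above
df = pd.DataFrame({
    "Sample": ["WH10", "wh10", "WH-20", "WH20"],
    "Condition": [1, "one", 3, 2],
})

# The entry codes a schema would declare for each column
allowed = {
    "Sample": {"WH10", "WH20"},
    "Condition": {1, 2, 3},
}

# Flag every value that is not one of the allowed entry codes
for column, codes in allowed.items():
    bad = df.loc[~df[column].isin(codes), column]
    if not bad.empty:
        print(f"Invalid values in '{column}': {list(bad)}")
```

Running this would flag ‘wh10’, ‘WH-20’, and ‘one’ – exactly the kinds of inconsistencies described above.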

You can help address this problem by documenting this information in a schema using the Semantic Engine with Entry Codes. Look at the figure below to see what entry codes you could use for the data collected.

[Figure: An example of adding entry codes in the Semantic Engine]

You can see that you have two entry code sets created: one (WH10, WH20) for Sample and one (1, 2, 3) for Condition. The labels do not always need to differ from the entry code itself. Labels become much more important when you are using multiple languages in your schema, because they can help with internationalization, or when the label represents a more human-understandable concept. In this example we can help users understand the Condition codes by providing English labels: 1 = fresh sample, 2 = frozen sample, and 3 = unknown condition sample. However, it is very important to note that the entry code (1, 2, or 3), and not the label (fresh, frozen, unknown), is what appears in the actual dataset.
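A quick sketch of the code-versus-label distinction (illustration only; the dictionary structure is mine, and the French labels are hypothetical translations, not part of the Semantic Engine):

```python
# The dataset stores the entry code; labels exist only for presentation.
labels = {
    "en": {1: "fresh sample", 2: "frozen sample", 3: "unknown condition sample"},
    "fr": {1: "échantillon frais", 2: "échantillon congelé", 3: "état inconnu"},
}

row = {"Sample": "WH10", "Condition": 2}  # the code 2 is what lives in the data file
print(labels["en"][row["Condition"]])     # frozen sample (for an English report)
print(labels["fr"][row["Condition"]])     # échantillon congelé (for a French report)
```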

An example of a complete schema for the above dataset, incorporating entry codes, is below:

[Figure: An example schema with entry codes created using the Semantic Engine]

If you have a long list of entry codes, you can even import entry codes in .csv format or from another schema. For example, you may wish to specify a list of gene names; you can go to a separate database or ontology (or use the UNIQUE function in Excel) to extract a list of correct data entry codes.
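As a rough sketch of that workflow (the file names and column headers here are hypothetical), you could load the code list once and reuse it to check each new data file:

```python
import pandas as pd

# Hypothetical files: a one-column list of valid gene names, and the data to check
codes = set(pd.read_csv("gene_codes.csv")["code"])
df = pd.read_csv("measurements.csv")

# Report rows whose gene name is not in the entry code list
invalid = df[~df["gene"].isin(codes)]
print(f"{len(invalid)} rows use a gene name outside the entry code list")
```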

Entry codes can help you keep your data consistent and accurate, helping to ensure your analysis is correct. You can easily add entry codes when writing your schema using the Semantic Engine.


Written by Carly Huitema

Findable

Accessible (where possible)

Interoperable

Reusable

Good day everyone!  We’re back looking at the FAIR principles and have now moved on to A for Accessible (where possible).  Let’s first review the two principles under Accessible:

A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
A2. Metadata are accessible, even when the data are no longer available

Source: FAIR principles

Oh my!  What in the world are you talking about?  Metadata? Identifier?  Accessible when data are no longer available?  Come on!  I’m a researcher who just wants to collect my data, document it as I need for my purposes, publish the papers and data where required, and move on to the next project.  Now you’re telling me I need to ensure my metadata has an identifier?  Like a DOI (digital object identifier)??  How in the world do I do that?  And what????  My metadata should be accessible even if my data is NOT available?  OK – now I’m confused!

The above is a conversation I imagined having with my younger researcher self – let’s say 20 years ago, to put this into context.  However, I can almost imagine having this conversation with some of our researchers today!  Lucky for them, we have tools and services to help out with what appears to be a scary and very time-consuming proposition.

So, let’s walk through this.  I’ve been going on and on and on about metadata – I know!  What can I say?  I love my metadata and am trying to convince everyone around me that they should love their metadata too!  I’ve heard this saying a few times now: “writing a love letter/note to one’s future self”, or something similar (those of you who know me are well aware of my inability to remember sayings).  This is a great way to think about metadata: writing a note to remind your future self about what you’re doing with your data.

Now let’s think about principle A1 above – we’re suggesting that you save your metadata, that love letter – put it somewhere safe and in a place where you can retrieve it when you want it or need it.  Me, I have my notebooks – YES, I have saved them all going back a number of years.  BUT ask me to find something in them – ha ha ha!  Not going to happen quickly!  I’ll find it “sometime”.  Now if you take your metadata, love letter, or data schema and deposit it in… say… Borealis… guess what?  You will now have a unique identifier for it (a DOI) and you’ll be able to find it – not like me going through all my notebooks!  You’ve got nothing to lose and everything to gain – so let’s try it!  Check out Saving, Depositing and/or Publishing your Schema, or book an appointment with the ADC team to work with you.

Alrighty, we’re moving towards increased accessibility of our data and metadata.  Now, what about data that, for one reason or another, canNOT be shared or made available?  Why in the world should I still create that metadata, love letter, data schema, and deposit it?  This is where I feel we have failed the world of science over the past few decades.  Isn’t science about sharing and building knowledge?  How can we do this if we are not aware of the data that has been collected or the studies that have been conducted?  I know, I know, I should stay on top of journal publications!  BUT are all studies published?  Heck no!  Why?  In some cases, because the study results are not statistically significant – so why publish?  But if you don’t publish, how can anyone be aware of the data that may have been collected?  Word of mouth only goes so far!  So let’s publish that data schema, metadata, love letter!  Remember, it’s the metadata only – NO data!  So NO excuses!

Here’s another reason why you should publish your data schemas – citations!  YES!  You can cite a data schema the same way you would cite a paper reference.  Hmm…  hang on now, citations of my works…  yes!  A benefit to you!!

Let’s round this all up by saying that the more (meta)data we make accessible, the more knowledge we build.  There really is NO downside to sharing your metadata!  So let’s do it!

Michelle


Findable

Accessible (where possible)

Interoperable

Reusable

Oh my, INTEROPERABILITY!  What a HUGE word this is – and let’s be honest, are we all comfortable with what it means?  Remember: data – interoperability…  Ah, but let’s start with a silly and very basic example first: a tool – more specifically, a wrench.

Let’s think about a wrench – you know, that tool that helps us remove or tighten nuts.  OK, before someone pipes up and corrects me, let’s be more specific: a ratcheting or combination wrench (google it to get a picture).  They come in different sizes – which is great and really helpful for those tough nuts, since they fit snugly around the nut and you can really yank on that wrench to loosen it.  I’m sure many of you can relate to this.  However, how many times have you tried this and that darn wrench is just a little too big or just a little too small – and we know that those nuts are a standard size!  What’s going on???  Don’t laugh too hard at this analogy, as I don’t know how many times we’ve encountered this in our household.  Metric vs SAE – yup!  Different standards for the tools – ugh!  Now where’s my 8 mm, or was it the 1/4 inch?

If these were interoperable, I should be able to use one wrench for my nuts regardless of whether they were metric or SAE.  But alas, two standards – and the only way around them is to use an adjustable wrench.  Aha, an answer to the two standards and a way to interoperability?

Let’s turn to data now – how do we gather information about the data we are working with?  Metadata!  The metadata will, or rather should, point us in the direction of how the data was measured, whether any standards were used, whether the metadata follows a standard, and, in general, what the data we’re working with represents.  Now, let’s say we are working with weights measured on dogs.  I am more comfortable weighing my dogs in pounds (lbs) and my colleague is more comfortable weighing their dogs in kilograms (kg).  What happens when I pool these two data sources together?  I end up with an interesting set of data with a mix of weights in lbs and kg.  I may well have a Great Dane who appears to weigh the same as a Chihuahua!  I NEED that metadata to help me, as a researcher or data user, understand what my data represents and what transformations I may or may not need in order to pool the data!
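To make the dog-weight example concrete, here is a minimal sketch of harmonizing the two sources before pooling (the column names are hypothetical; the point is that the unit column – the metadata – is what makes the conversion possible at all):

```python
import pandas as pd

# Two hypothetical sources: mine in pounds, my colleague's in kilograms
mine = pd.DataFrame({"dog": ["Great Dane"], "weight": [120.0], "unit": ["lbs"]})
theirs = pd.DataFrame({"dog": ["Chihuahua"], "weight": [2.5], "unit": ["kg"]})

pooled = pd.concat([mine, theirs], ignore_index=True)

# Without the 'unit' column we could not tell 2.5 kg from 2.5 lbs
KG_TO_LBS = 2.20462
pooled.loc[pooled["unit"] == "kg", "weight"] *= KG_TO_LBS
pooled["unit"] = "lbs"
print(pooled)  # all weights now comparable, in lbs
```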

Without this information I cannot integrate different sources of data!  Think of that wrench – my nuts were metric and my wrench was SAE – it just won’t fit.  The only difference with data: I can still pool all that data and come up with interesting and nonsensical results – that just won’t work with the wrench.

So interoperability, when we think of the FAIR principles, is the ability to integrate data from different sources, as long as we know the following:

I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (Meta)data use vocabularies that follow FAIR principles
I3. (Meta)data include qualified references to other (meta)data

Source: FAIR Principles

Michelle

Findable

Accessible (where possible)

Interoperable

Reusable

I believe most of us are now familiar with this acronym: the FAIR principles, published in 2016.  I have to admit that part of me really wants to create a song around these four words – but I’ll save you all from that scary venture.  Seriously though, how many of us are aware of the FAIR principles?  Better yet, how many of us are aware of the impact of the FAIR principles?  Over my next blog posts we’ll take a look at each of the FAIR letters, and I’ll pull them all together with the RDM posts – YES, there is a relationship!

So, YES, I’m working backwards, and there’s a reason for this.  I really want to “sell” you on the idea of FAIR.  Why do we consider it so important and a key to effective Research Data Management?  Oh heck, it is also a MAJOR key to science today.

R is for Reusable

Reusable data – hang on – you want to REUSE my data?  But I’m the only one who understands it!  I’m not finished using it yet!  This data was created to answer one research question; there’s no way it could be useful to anyone else!  Do any of these statements sound familiar?  Hmmm…  I may have pointed some of these out in the RDM posts – but aside from that, truthfully, can you relate to any of these statements?  No worries, I already know the answer, and I’m not going to ask you to confess to believing, saying, or thinking any of these.  Ah, I think I just heard that community sigh of relief 🙂

So let’s look at what can happen when a researcher does not take care of their data or does not put measures in place to make their data FAIR – remember, we’re concentrating on the R for Reusable today.

Reproducibility Crisis?

Have you heard about the reproducibility crisis in our scientific world – the inability to reproduce published studies?  Imagine statements like this: “…in the field of cancer research, only about 20-25% of the published studies could be validated or reproduced…” (Miyakawa, 2020).  How scary is that?  Sometimes when we think about reproducibility and reuse of our data, questions come to mind – at least to my mind – like: why would someone want my data?  It’s not that exciting!  But boy oh boy, when you step back and think about the bigger picture – holy cow!!!  We are not just talking about data in our little neck of the woods – this challenge of making your research data available to others has a MUCH broader and larger impact!  20-25% of published studies!!!  And that’s just in the cancer research field.  If you start looking into this crisis you will see other numbers too!

So, really, what’s the problem here?  Someone cannot reproduce a study – maybe it’s the age of the equipment, or, my favourite, the statistical methodologies were not written in a way that would let the reader reproduce the results IF they had access to the original data.  There are many reasons why a study may not be reproducible – BUT – our focus is the DATA!

The study I referred to above also talks about some of the issues the author encountered in his capacity as a reviewer.  The issue that I want to highlight here is access to the RAW data, or insufficient documentation about the data – aha!!  That’s the link to RDM.  Creating adequate documentation about your data will only help you and any future users of your data!  Many studies cannot be reproduced because the raw data is NOT accessible and/or it is NOT documented!

Pitfalls of NO Reusable Data

A few notable researchers have lost their careers because of their data, or rather the lack thereof.  One is Brian Wansink, formerly of Cornell University.  His research was ground-breaking at the time – studying eating habits, looking at how cafeterias could make food more appealing to children – it was truly great stuff!  BUT…  when asked for the raw data…  that’s when everything fell apart.  To learn more about this situation, follow the link I provided above, which will take you to a TIME article.

This is a worst-case scenario – I know – but maybe I am trying to scare you!  Let’s start treating our data as a first-class citizen and not an artifact of our research projects.  FAIR data is research data that is Findable, Accessible (where possible), Interoperable, and REUSABLE!  Start thinking beyond your study – one never knows when the data you collected during your MSc or PhD may be crucial to a future study.  Let’s ensure it’s available and documented – remember Research Data Management best practices – for the future.

Michelle