FAIR

Is your data ready to be described with a schema? How can you ensure the fewest hiccups when writing your schema (for example, with the Semantic Engine)? What kinds of data should you document in your schema and what kinds can be left out?

Document data in chunks

When you prepare to describe your data with a schema, try to ensure that you are documenting ‘data chunks’: data that is grouped together based on function. Raw data is one type of data ‘chunk’ that deserves its own schema. If you calculate or manipulate data for presentation in a figure or as a published table, you could describe this with a separate schema.

For example, suppose you take averages of values and put them in a new column, treat this as a background signal, and then subtract it from your measurements, which you record in another column. This is a summarizing/analyzing process, and the result is probably a different kind of data ‘chunk’. Document all of the data columns that exist before this analysis in your schema, and keep the manipulated data in a separate table (e.g. in a separate Excel sheet) described by a separate schema. Examples of data ‘chunks’ include ‘raw data’, ‘analysis data’, ‘summary data’ and ‘figure and table data’. You can also look to the TIER protocol for how to organize chunks of data throughout your analysis procedures.
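As a rough illustration, here is a minimal sketch (in Python with pandas, using hypothetical file and column names) of keeping a raw data ‘chunk’ and a derived analysis ‘chunk’ in separate files, each of which would be described by its own schema:

```python
import pandas as pd

# Raw data chunk: one row per measurement, never modified after collection.
raw = pd.read_csv("raw_measurements.csv")  # hypothetical columns: sample, signal

# Analysis chunk: background-corrected averages derived from the raw data.
background = raw.loc[raw["sample"] == "blank", "signal"].mean()
analysis = (
    raw.loc[raw["sample"] != "blank"]
       .groupby("sample", as_index=False)["signal"]
       .mean()
       .rename(columns={"signal": "mean_signal"})
)
analysis["corrected_signal"] = analysis["mean_signal"] - background

# The derived chunk goes into its own file, documented by a separate schema.
analysis.to_csv("analysis_summary.csv", index=False)
```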

Look for Entry Code opportunities

Entry codes can help you streamline your data entry and improve existing data quality.

Example data table with errors in the data.

For example, here is a dataset that could benefit from entry codes. The sample name looks like it should consist of two sample types (WH10 and WH20), but there are multiple ways of writing each sample name. The same is true for the condition column. You can read our blog post about entry codes, which works through the above example. If you have many entry codes you can also import them from other schemas or from a .csv file using the Semantic Engine.

Separate out columns for clarity

Sometimes you may have compressed multiple pieces of information into a single column. For example, your sample identifier might encode several pieces of useful information. While this can be very useful for naming samples, you can keep the sample ID and add extra columns where you pull the condensed information out into separate attributes, one for each ‘fact’. This helps others understand the information coded in your sample names, and also makes this information more easily accessible for analysis. Another good example of data that should be separated is latitude and longitude, which benefit from being recorded in separate columns.

Splitting a single column that contains multiple pieces of information into separate columns, one per piece of information, for clarity.
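As a rough sketch (Python with pandas, with a hypothetical sample-ID format), here is one way this split can be done without losing the original identifier:

```python
import pandas as pd

df = pd.DataFrame({"sample_id": ["WH10-2024-A", "WH20-2024-B"]})

# Suppose the ID packs together sample name, collection year, and site code.
df[["sample_name", "year", "site"]] = df["sample_id"].str.split("-", expand=True)
print(df)
```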

Consider adopting error coding

If codes start appearing in your data as you annotate problems with collection or missing samples, consider putting this information in an adjacent data quality column so that it doesn’t interfere with your data analysis. Each column of data should contain only one type of information (the data itself); annotations about the data can be moved to an adjacent quality column. Read our blog post to learn more about adding quality comments to a dataset using the Semantic Engine.
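Here is a minimal sketch (Python with pandas, hypothetical column names) of what moving annotations into an adjacent quality column might look like:

```python
import pandas as pd

df = pd.DataFrame({"iron_ppm": ["3.2", "4.1", "sample lost", "3.9"]})

# Keep only numeric values in the measurement column...
df["iron_ppm_clean"] = pd.to_numeric(df["iron_ppm"], errors="coerce")

# ...and move anything that is not a measurement into a quality column beside it.
df["iron_ppm_quality"] = df["iron_ppm"].where(df["iron_ppm_clean"].isna(), "")
print(df)
```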

Look for standards you can use

It is most helpful if you can find ways to harmonize your work with the community by using standards. For example, there is an ISO standard for date/time values (ISO 8601) which you could use when formatting these kinds of attributes (even if you need to fight Excel to do so!).
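For instance, here is a small sketch of producing ISO 8601 date/time values in Python; the same formats can be requested from Excel or other tools:

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
print(now.strftime("%Y-%m-%d"))           # date only, e.g. 2024-05-21
print(now.isoformat(timespec="seconds"))  # full timestamp, e.g. 2024-05-21T14:03:07+00:00
```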

Consider schema reuse

Schemas are often written for one specific dataset, but it can be very beneficial to write your schema to be more general. Think about your own research: do you collect the same kinds of data over and over again? Could you write a single schema that you can reuse for each of these datasets? In research, schemas written for reuse are very valuable (a complex example is Phenopackets), and reusable schemas help with data interoperability, improving FAIRness.

In conclusion, you can do many things to prepare your data for documentation. This will help both you and others understand your data and your thinking process better, ensuring greater data FAIRness and higher-quality research. You can also contribute back to the community: if you develop a schema that others can use, you can publish it and give it an identifier such as a DOI for others to cite and reuse.

Written by Carly Huitema

 

How should you organize your files and folders when you start on a research project?

Or perhaps you have already started but can’t really find things.

Did you know that there is a recommendation for that? The TIER protocol will help you organize your data and associated analysis scripts as well as metadata documentation. The TIER protocol is written explicitly for performing analysis entirely by scripts, but there is a lot of good advice that researchers can apply even if they aren’t using scripts yet.

“Documentation that meets the specifications of the TIER Protocol contains all the data, scripts, and supporting information necessary to enable you, your instructor, or an interested third party to reproduce all the computations necessary to generate the results you present in the report you write about your project.” [TIER protocol]

The folder structure of the TIER 4.0 protocol for how to organize research data and analysis scripts.

If you go to the TIER protocol website, you can explore the folder structure and read about the contents of each folder. You have folders for raw data, for intermediate data, and for data ready for analysis. You also have folders for all the scripts used in your analysis, as well as any associated descriptive metadata.

You can use the Semantic Engine to write the schema metadata, the data that describes the contents of each of your datasets. Your schemas (both the machine-readable format and the human-readable .txt file) would go into metadata folders of the TIER protocol. The TIER protocol calls data schemas “Codebooks”.

Remember how important it is to never change raw data! Store your raw collected data in the Input Data Files folder before any changes are made, and never! ever! change the raw data. Make a copy to work from. It is most valuable when you can work with your data using scripts (stored in the scripts folder of the TIER protocol) rather than making changes to the data directly via (for example) Excel. The benefits include reproducibility and the ease of changing your analysis method. If you write a script you always have a record of how you transformed your data, and anyone can re-run the script if needed. If you make a mistake you don’t have to painstakingly go back through your data and try to remember what you did; you just make the change in the script and re-run it.
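As a rough sketch (Python with pandas, with hypothetical file paths echoing the TIER folder layout), a cleaning script in this workflow might look like this:

```python
import pandas as pd

# Read from the untouched raw-data folder; never write back to it.
raw = pd.read_csv("InputDataFiles/field_measurements_raw.csv")

# All cleaning happens in code, so every change is recorded and repeatable.
clean = raw.dropna(subset=["sample_id"]).copy()
clean["collection_date"] = pd.to_datetime(clean["collection_date"]).dt.date

# The cleaned copy is written to a separate folder for analysis.
clean.to_csv("AnalysisData/field_measurements_clean.csv", index=False)
```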

The TIER protocol is written explicitly for performing analysis entirely by scripts. If you don’t use scripts to analyze your data or for some of your data preparation steps, you should be sure to write out all the steps carefully in an analysis documentation file. If you are doing the analysis in Excel, for example, you would document each manual step you take to sort, clean, normalize, and subset your data as you develop your analysis. How did you use a pivot table? How did you decide which data points were outliers? Why did you choose to exclude values from your analysis? You can imitate the TIER protocol by storing all of this information in its scripts folder.

Even if you don’t follow all the directions of the TIER protocol, you can explore its structure to get ideas on how to best manage your own data folders and files. Be sure to also look at advice on how to name your files to ensure everything is very clear.

Written by Carly Huitema

Findable

Accessible (where possible)

Interoperable

Reusable

Ah, the last blog post in the series of 4 regarding the FAIR principles.  The last or the first, depending on how you look at it :). F for Findable!  Quick review from the FAIR website:

F1. (Meta)data are assigned a globally unique and persistent identifier
F2. Data are described with rich metadata (defined by R1 below)
F3. Metadata clearly and explicitly include the identifier of the data they describe
F4. (Meta)data are registered or indexed in a searchable resource

As we’ve ventured through the FAIR principles, we’ve highlighted the reproducibility crisis, we’ve discussed the challenges of interoperability – using the wrench as an example, and we’ve talked about making the (meta)data accessible.  Has anyone noticed the PRIMARY theme behind all of these?

Yup, my favourite: METADATA! So this post will be rather short, since I’ve already tackled many of the aspects that I want to highlight.

First though, if you read through the FAIR principles they say (meta)data – I would challenge anyone to show where the principles say that the DATA itself needs to be accessible, etc…  The FAIR principles were created to help researchers ensure that their data is FAIR by way of the metadata.  We all know we cannot share all the data we collect – but as I noted in an earlier post, we should at least be aware of the data through its metadata.  Hence my going on and on and on about metadata, or that love letter to yourself, or better yet the data schema!

So let’s talk briefly about Findable.  How do we do this?   I know you already know the answer – by documenting your data or by building that data schema!   The A, I, and R really can’t happen before we fulfill the needs of the Findable 🙂

We already talked about the unique identifier (DOI) in the A in FAIR post.  Now let’s take a closer peek at how we can describe our data.  Here at Agri-food Data Canada (ADC), we’ve been developing the Semantic Engine: a suite of tools to help you create your data schema – rich metadata that describes your data.

Review what the Semantic Engine can do for you by watching this little video

 

To address the F principles, we just need to create a data schema, or metadata.  Sounds simple enough, right?  The Semantic Engine tools make it easy for you to create this – so try it out at:

https://www.semanticengine.org/

Remember, if you need help reach out to us at adc@uoguelph.ca or by booking an appointment with one of our team members at https://agrifooddatacanada.ca/consultations/ 

Let’s continue to build knowledge and change the Data Culture by creating FAIR data!

Michelle

 

Entry codes can be very useful to ensure your data is high quality and to catch mistakes that might mess up your analysis.

For example, you might have taken multiple measurements of two samples (WH10 and WH20) collected during your research.

Example data table

You have a standardized sample name, a measurement (iron concentration) and a condition score (from one to three) for each sample. You can easily group your analysis by sample because the samples have consistent names. Incidentally, this is an example of a dataset in a ‘long’ format. Learn more about wide and long formats in our blog post.
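If you are curious how the same measurements would look in a ‘wide’ format, here is a minimal sketch (Python with pandas, hypothetical values); melt() reshapes wide back into long:

```python
import pandas as pd

# The same measurements, first in a wide layout...
wide = pd.DataFrame({
    "Sample": ["WH10", "WH20"],
    "Rep1":   [41.3, 38.9],
    "Rep2":   [40.8, 39.4],
})

# ...and then reshaped into a long layout: one measurement per row.
long = wide.melt(id_vars="Sample", var_name="Replicate", value_name="Iron_concentration")
print(long)
```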

The data is clean and ready for analysis, but perhaps in the beginning the data looked more like this, especially if you had multiple people contributing to the data collection:

Example data table with errors in the data.

Sample names and condition scores are inconsistent. You will need to go in and correct the data before you can analyze it (perhaps using a tool such as OpenRefine). Also, if someone else uses your dataset they may not even be aware of the problems: they may not know that the condition score can only have the values 1, 2 or 3, or which sample names should be used consistently.

You can help address this problem by documenting this information in a schema using the Semantic Engine with Entry Codes. Look at the figure below to see what entry codes you could use for the data collected.

An example of adding entry codes in the Semantic Engine.

You can see that two entry code sets have been created: one (WH10, WH20) for Sample and one (1, 2, 3) for Condition. It is not always necessary for the labels to be different from the entry code itself. Labels become much more important when you are using multiple languages in your schema, because they help with internationalization, or when the label represents a more human-understandable concept. In this example we can help people understand the Condition codes by providing English labels: 1=fresh sample, 2=frozen sample and 3=unknown condition sample. However, it is very important to note that the entry code (1, 2 or 3), and not the label (fresh, frozen or unknown), is what appears in the actual dataset.
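To make the idea concrete, here is a minimal sketch (Python with pandas, hypothetical column names; this is not how the Semantic Engine itself works internally) of using entry codes to flag values outside the allowed lists and to attach labels for display:

```python
import pandas as pd

# Entry codes from the schema: allowed values plus human-readable labels.
allowed_samples = {"WH10", "WH20"}
condition_labels = {1: "fresh sample", 2: "frozen sample", 3: "unknown condition sample"}

df = pd.DataFrame({"Sample": ["WH10", "wh10", "WH20"], "Condition": [1, 3, 4]})

# Flag any values that fall outside the entry-code lists.
bad = ~df["Sample"].isin(allowed_samples) | ~df["Condition"].isin(set(condition_labels))
print(df[bad])

# Labels can be attached for display, but the codes are what stay in the data.
df["Condition_label"] = df["Condition"].map(condition_labels)
```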

An example of a complete schema for the above dataset, incorporating entry codes, is below:

An example schema with entry codes created using the Semantic Engine.

If you have a long list of entry codes you can even import them in a .csv format or from another schema. For example, you may wish to specify a list of gene names: you can go to a separate database or ontology (or use the UNIQUE function in Excel) to extract a list of correct data entry codes.
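As a rough sketch (Python with pandas, hypothetical file and column names), extracting a candidate list of entry codes from an existing dataset might look like this:

```python
import pandas as pd

# Pull the unique gene names out of an existing dataset...
genes = pd.read_csv("measurements.csv")["gene_name"].dropna().unique()

# ...and save them as a one-column .csv ready to import as entry codes.
pd.Series(sorted(genes), name="gene_name").to_csv("gene_entry_codes.csv", index=False)
```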

Entry codes can help you keep your data consistent and accurate, helping to ensure your analysis is correct. You can easily add entry codes when writing your schema using the Semantic Engine.

 

Written by Carly Huitema

Findable

Accessible (where possible)

Interoperable

Reusable

Good day everyone!  We’re back looking at the FAIR principles and have now moved on to talk about A for Accessible (where possible).  Let’s first review the 2 principles under Accessible:

A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
A2. Metadata are accessible, even when the data are no longer available

Source: FAIR principles

Oh my!  What in the world are you talking about?  Metadata? Identifier?  Accessible when data are no longer available?  Come on!  I’m a researcher that just wants to collect my data, document as I need for my purposes, publish the papers and data where required and move on to the next project.  Now you’re telling me I need to ensure my metadata has an identifier?  Like a DOI (digital object identifier)??  How in the world do I do that?  and what????  My metadata should be accessible even if my data is NOT available?  Ok – now I’m confused!

The above is a conversation I imagined having with my younger researcher self.  Let’s say 20 years ago, to put this into context.  However, I can almost imagine having this conversation with some of our researchers today!  Lucky for them we have tools and services to help out with what appears to be a scary and very time-consuming proposition.

So, let’s walk through this.  I’ve been going on and on and on about metadata – I know!  What can I say?  I love my metadata and am trying to convince everyone around me that they should love their metadata too!  I’ve heard this saying a few times now “Writing a love letter/note to one’s future self” or something similar to this (those of you that know me – are well aware of my inability to remember sayings).  This is a great way to think about metadata.  Writing a note to yourself to remind yourself about what you’re doing with your data.  Now let’s think about principle A1 above – we’re suggesting that you save your metadata or that love letter – put it somewhere safe and in a place where you can retrieve it when you want it or need it.  Me, I have my notebooks, YES, I have saved them all going back a number of years.  BUT, ask me to find something in them – ha ha ha!  Not going to happen quickly!  I’ll find it “sometime”.  Now if you take your metadata, love letter, or data schema and deposit it in… say.. Borealis…  Guess what?  You will now have a unique identifier for it (DOI) and you’ll be able to find it – not like me going through all my notebooks!  You’ve got nothing to lose and everything to gain – so let’s try it!  Check out Saving, Depositing and/or Publishing your Schema or book an appointment with the ADC team to work with you.

Alrighty, we’re moving towards increased accessibility of our data and metadata.  Now, what about data that, for one reason or another, canNOT be shared or made available?  Why in the world should I still create that metadata, love letter, data schema, and deposit it?  This is where I feel we have failed the world of science over the past few decades.   Isn’t science about sharing and building knowledge?  How can we do this if we are not aware of the data that has been collected or the studies that have been conducted?  I know, I know, I should stay on top of journal publications!  BUT!  are all studies published?  Heck no!  Why?  In some cases, because study results are not statistically significant, so why publish?  If you don’t publish, how can anyone be aware of the data that may have been collected?   Word of mouth only goes so far!  So let’s publish that data schema, metadata, love letter!  Remember it’s the metadata only – NO data!  So NO excuses!

Here’s another reason why you should publish your data schemas – citations!  YES! you can cite a data schema, the same way as you would cite a paper reference.  Hmm…  hang on now, citations of my works…  yes!  A benefit to you!!

Let’s round this all up to say that the more (meta)data we make accessible, the more knowledge we build.  There really is NO negative side of sharing your metadata!  So let’s do it!

Michelle


Findable

Accessible (where possible)

Interoperable

Reusable

Oh my, INTEROPERABILITY!   What a HUGE word this is, and let’s be honest, are we all comfortable with what it means?  Remember, data – interoperability…..  Ah, but let’s start with a silly and very basic example first:  a tool – more specifically, a wrench.

Tools – let’s think about a wrench, you know, that tool that helps us remove or tighten nuts – ok, before someone pipes up and corrects me, let’s be more specific:  a ratcheting or combination wrench (google it to get a picture).  They come in different sizes – which is great and really helpful for those tough nuts, since they fit snugly around the nut and you can really yank on that wrench to loosen the nut.   I’m sure many of you can relate to this.  However, how many times have you tried this and that darn wrench is just a little too big or just a little too small – and we know that those nuts are a standard size!  What’s going on???  Don’t laugh too hard at this analogy – I don’t know how many times we’ve encountered this in our household.  Metric vs SAE – yup!  Different standards for the tools – Ugh!  Now where’s my 8mm, or was it the 1/4 inch?

If these were interoperable I should be able to use one wrench for my nuts regardless of whether they were metric or SAE, but alas, there are 2 standards, and the only way around them is to use an adjustable wrench – aha, an answer to the 2 standards and a way to interoperability?

Let’s turn to data now – how do we gather information about the data we are working with?  Metadata!  The metadata will, or rather should, point us in the direction of how the data was measured, whether any standards were used, whether the metadata follows a standard, and in general what data we’re working with.  Now, let’s say we are working with weights measured on dogs.  I am more comfortable weighing my dogs in pounds (lbs) and my colleague is more comfortable weighing their dogs in kilograms (kg).  What happens when I pool these 2 data sources together?  I may have an interesting set of data with a mix of weights in lbs and kg.  I may well have a Great Dane who appears to weigh the same as a Chihuahua!  I NEED that metadata to help me, as a researcher or data user, understand what my data represents and what transformations I may or may not need in order to pool the data!
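To make this concrete, here is a minimal sketch (Python with pandas, hypothetical column names) of how knowing the units from the metadata lets you transform one source before pooling:

```python
import pandas as pd

mine = pd.DataFrame({"dog": ["Great Dane"], "weight_lbs": [140.0]})
colleague = pd.DataFrame({"dog": ["Chihuahua"], "weight_kg": [2.5]})

# The metadata tells us one source recorded pounds, so convert before pooling.
mine["weight_kg"] = mine["weight_lbs"] * 0.453592
pooled = pd.concat([mine[["dog", "weight_kg"]], colleague], ignore_index=True)
print(pooled)
```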

Without this information I cannot integrate different sources of data!  Think of that wrench – my nuts were metric and my wrench was SAE – it just won’t fit.  The only difference with data is that I can still pool all that data and come up with interesting but nonsensical results – something that just won’t work with the wrench.

So interoperability, when we think of the FAIR principles, is the ability to integrate data from different sources, as long as we know the following:

I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (Meta)data use vocabularies that follow FAIR principles
I3. (Meta)data include qualified references to other (meta)data

Source: FAIR Principles

Michelle

Findable

Accessible (where possible)

Interoperable

Reusable

I believe most of us are now familiar with this acronym?  The FAIR principles, published in 2016.  I have to admit that part of me really wants to create a song around these 4 words – but I’ll save you all from that scary venture.  Seriously though, how many of us are aware of the FAIR principles?  Better yet, how many of us are aware of the impact of the FAIR principles?  Over my next blog posts we’ll take a look at each of the FAIR letters and I’ll pull them all together with the RDM posts – YES, there is a relationship!

So, YES I’m working backwards and there’s a reason for this.  I really want to “sell” you on the idea of FAIR.  Why do we consider this so important and a key to effective Research Data Management?  Oh heck, it is also a MAJOR key to science today.

R is for Reusable

Reusable data – hang on – you want to REUSE my data?  But I’m the only one who understands it!   I’m not finished using it yet!  This data was created to answer one research question, there’s no way it could be useful to anyone else!  Any of these statements sound familiar?   Hmmm…  I may have pointed some of these out in the RDM posts – but aside from that – truthfully, can you relate to any of these statements?  No worries, I already know the answer and I’m not going to ask you to confess to believing or having said or thought any of these.  Ah I think I just heard that community sigh of relief 🙂

So let’s look at what can happen when a researcher does not take care of their data or does not put measures into place to make their data FAIR – remember we’re concentrating on the R for reusability today.

Reproducibility Crisis?

Have you heard about the reproducibility crisis in our scientific world?  The inability to reproduce published studies.  Imagine statements like this: “…in the field of cancer research, only about 20-25% of the published studies could be validated or reproduced…” (Miyakawa, 2020). How scary is that?  Sometimes when we think about reproducibility and reuse of our data, questions come to mind – at least to my mind – why would someone want my data?  It’s not that exciting!  But boy oh boy, when you step back and think about the bigger picture – holy cow!!!  We are not just talking about data in our little neck of the woods – this challenge of making your research data available to others has a MUCH broader and larger impact!  20-25% of published studies!!! and that’s just in the cancer research field.  If you start looking into this crisis you will see other numbers too!

So, really, what’s the problem here?   Someone cannot reproduce a study – maybe it’s the age of the equipment, or, my favourite, the statistical methodologies were not written in a way that the reader could reproduce the results IF they had access to the original data.  There are many reasons why a study may not be reproducible – BUT – our focus is the DATA!

The study I referred to above also talks about some of the issues the author encountered in his capacity as a reviewer.  The issue that I want to highlight here is access to the RAW data, or insufficient documentation about the data – aha!!  That’s the link to RDM.  Creating adequate documentation about your data will only help you and any future users of your data!  Many studies cannot be reproduced because the raw data is NOT accessible and/or it is NOT documented!

Pitfalls of NO reusable data

There have been a few notable researchers who have lost their careers because of their data, or rather the lack thereof.  One notable example is Brian Wansink, formerly of Cornell University.  His research was ground-breaking at the time – studying eating habits and looking at how cafeterias could make food more appealing to children – it was truly great stuff!  BUT…..  when asked for the raw data…..  that’s when everything fell apart.  To learn more about this situation, follow the link I provided above that will take you to a TIME article.

This is a worst-case scenario – I know – but maybe I am trying to scare you!  Let’s start treating our data as a first-class citizen and not an artifact of our research projects.  FAIR data is research data that is Findable, Accessible (where possible), Interoperable, and REUSABLE!  Start thinking beyond your study – one never knows when the data you collected during your MSc or PhD may be crucial to a study in the future.  Let’s ensure it’s available and documented – remember Research Data Management best practices – for the future.

Michelle