Research Data Management

Recommendation: Document datasets in long format. Long format is more flexible because it is more general: the schema can be reused for other experiments, either by the researcher or by others, and the data itself is easier to reuse and combine with similar experiments.

Data must help answer specific questions or meet specific goals, and that influences the way the data is represented. For example, analysis often depends on data being in a specific format, generally referred to as wide versus long format. Wide datasets are more intuitive and easier to grasp when there are relatively few variables, while long datasets are more flexible and efficient for managing complex, structured data with many variables or repeated measures. Researchers and data analysts often transform data between these formats based on the requirements of their analysis.

Wide Dataset:

Format: In a wide dataset, each variable or attribute has its own column, and each observation or data point is a single row. Repeated measures each get their own column. This is the representation typically seen in Excel.

Structure: It typically has a broader structure with many columns, making it easier to read and understand when there are relatively few variables.

Use Cases: Wide datasets are often used for summary or aggregated data, and they are suitable for simple statistical operations like means and sums.

For example, here is a dataset in wide format. Repeated measures (HT1-6) are described in separate columns (e.g. HT1 is the height of the subject measured at the end of week 1; HT2 is the height of the subject measured at the end of week 2, etc.). A code sketch of the same table follows it.

ID   TREATMENT   HT1   HT2   HT3   HT4   HT5   HT6
01   A            12    18    19    26    34    55
02   A            10    15    19    24    30    45
03   B            11    16    20    25    32    50
04   B             9    11    14    22    38    42
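
To make the wide structure concrete, here is a minimal sketch of the same table in Python with pandas (pandas is just one possible tool for this, not something the text prescribes; any statistics package would do):

    import pandas as pd

    # Wide format: one row per subject, one column per repeated measure.
    wide = pd.DataFrame({
        "ID": ["01", "02", "03", "04"],
        "TREATMENT": ["A", "A", "B", "B"],
        "HT1": [12, 10, 11, 9],
        "HT2": [18, 15, 16, 11],
        "HT3": [19, 19, 20, 14],
        "HT4": [26, 24, 25, 22],
        "HT5": [34, 30, 32, 38],
        "HT6": [55, 45, 50, 42],
    })
    print(wide)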


Long Dataset:

Format: In a long dataset, there are fewer columns, and the data is organized with one row for each unique combination of identifying variables. Typically, you have columns for “variable,” “value,” and potentially other categorical identifiers.

Structure: It is more compact and vertically oriented, making it easier to work with when you have a large number of variables or need to perform complex data transformations.

Use Cases: Long datasets are well-suited for storing and analyzing data with multiple measurements or observations over time or across different categories. They facilitate advanced statistical analyses like regression and mixed-effects modeling. In Excel you can use pivot tables to view summary statistics of long datasets (an equivalent is sketched after the table below).

For example, here is some of the same data represented in long format. Unlike the wide format, the repeated measures do not have separate columns: height is a single HEIGHT column, and the weeks (1-6) are recorded in a WEEK column.

ID   TREATMENT   WEEK   HEIGHT
01   A              1       12
01   A              2       18
01   A              3       19
01   A              4       26
01   A              5       34
01   A              6       55
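
Continuing the pandas sketch from above (again, just one possible tool), melt() produces exactly this long layout from the wide table, and groupby() then gives the kind of summary an Excel pivot table would:

    # Reshape wide -> long: the HT1..HT6 column headers become rows.
    long = wide.melt(
        id_vars=["ID", "TREATMENT"],
        var_name="WEEK",
        value_name="HEIGHT",
    )
    # Turn the "HT1".."HT6" labels into week numbers 1..6.
    long["WEEK"] = long["WEEK"].str[2:].astype(int)

    # Pivot-table-style summary: mean height per treatment and week.
    print(long.groupby(["TREATMENT", "WEEK"])["HEIGHT"].mean())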


Long format is the better choice when selecting a format to be documented with a schema, as it is easier to document and clearer to understand.

For example, column headers (attributes) in the wide format are repetitive, and this results in duplicated documentation. The wide format is also less flexible: each additional week needs an additional column and therefore another attribute described in the schema. This means that each time you add a variable you change the structure of the schema’s capture base, reducing interoperability.

Documenting a schema in long format is more flexible because it is more general. This makes the schema reusable for other experiments, either by the researcher or by others. It is also easier to reuse the data and combine it with similar experiments.
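
As a toy illustration (these attribute lists are hypothetical and not written in any particular schema language), compare what each schema has to name:

    # Hypothetical attribute lists, for illustration only.
    wide_attributes = ["ID", "TREATMENT", "HT1", "HT2", "HT3", "HT4", "HT5", "HT6"]
    # A 7th week of measurements forces a structural change:
    # wide_attributes.append("HT7")

    long_attributes = ["ID", "TREATMENT", "WEEK", "HEIGHT"]
    # A 7th week is just more data rows; the attribute list is unchanged.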

At the time of analysis, the data can be transformed from long to wide if necessary, and many data analysis programs have specialized functions that help researchers with this task; one example is sketched below.
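
Continuing the pandas sketch, pivot() is one such specialized function (in R, tidyr's pivot_wider() plays the same role):

    # Reshape long -> wide again at analysis time: one column per week.
    back_to_wide = long.pivot(
        index=["ID", "TREATMENT"],
        columns="WEEK",
        values="HEIGHT",
    ).reset_index()
    print(back_to_wide)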


Written by: Carly Huitema

Anyone who knows me or has sat in one of my classes will know how much I LOVE the data life cycle.  As researchers we have been taught and have embraced the research life cycle, and I’m sure many of you could recite how that works:  Idea → Research proposal → Funding proposal → Data collection → Analysis → Publication → A new idea – and we start again.  The data part of this always seemed the part that took the longest – other than maybe the writing – and it really just kind of stopped there.  As a grad student, many years ago – too many to count anymore – I worked with the data, massaged it, cleaned it, re-massaged it, and analyzed it until I was happy with the results and my supervisor was happy with them as well.  Then all the work and concentration shifted gears to the chapter writing and publication.  The data?  It just sat there – with my MSc project, the data entry pieces sat in a banker’s box until my supervisor cleared out the lab and shipped that box out to me in Alberta or Ontario.  So, the data lives, but in a box.

We talk about FAIR data – Findable, Accessible, Interoperable, and Reusable – um….  my MSc data?  It is Findable to me – it’s here on the floor under my desk at home.  Accessible?  Maybe – it’s a box of printouts of the raw data that was entered in 1989.  Interoperable?  Let’s not even think about that!  Reusable?  Um… maybe as a foot stool!  So my MSc data, as I’m describing it to you right now, is NOT FAIR!

Why not?  Because we never thought of the data life cycle back then!  Collect data, analyze data, publish!

Today, we know better!!!  I look back and get sad at the thought of all the data that was collected that, well….  is no longer out there – consider my last post about the OAC 150th anniversary.

Today, we strive to observe and follow the data life cycle – we should be telling data’s story – we should be managing our data so that it can be FAIR!  Imagine, just for a moment, if I had managed my MSc research data – who knows what further research could have been completed.  Now, funny story – there was a project here at the University of Guelph that was doing what I did with my MSc but with new technologies.  The student who worked on the current project reached out to me to talk about my work – all I could do was tell them about my experiences.  My data was inaccessible to them – and it turns out so was my thesis – the only copy I had was here in my office, and there was/is no accessible PDF version of it.  Now – if my data had been managed and archived (I’ll talk more about this in a later post), the student may have been able to incorporate it into her thesis work – now how cool would that have been?  Imaging pigs across 30 years?  But….  as we know, that did not happen.

So I’m going on and on about this – the reason is to convince you all NOT to leave your data by the wayside.  You need to manage your research data – you need to create documentation so that YOU can tell your data’s story once you’ve published your work, and so your data can live on and have the opportunity to play a role in someone else’s project.  I never imagined someone doing work similar to mine 30 years later – so you just never know!

I’m going to leave the data life cycle diagram above for you to consider.  Next time I’ll start digging into the HOWs of Research Data Management (RDM) rather than the WHYs.


Have you heard the news?  The Ontario Agricultural College will be 150 years old in 2024.  Wow!!  150 years of being recognized for our research, our students, our faculty, and our community in the areas of food, agriculture, communities and the environment.  Now, as a data archivist and researcher, I only have one question:  Where is all the research data collected over all these years?

Yes, we can find some of the data – no worries – and some may argue that the data is in the journal articles; I may agree with you in some instances.  BUT, overall, we need to come to the realization that the older data is more than likely gone and lost.  Older media – 5.25″ diskettes, magnetic tapes – or older software – VPplanner, QuattroPro, my favourite WordPerfect – have led us to a time where we can no longer access the older data.  Over the past few decades, data allowed us to answer our research questions, but once it completed its job, it was often left on a shelf, or in a box, or in the basement.

We MUST view and treat data as a valuable asset.  Take it off the shelf, out of the box, bring it back into the light, and treat it as the valuable asset it is!  Data should be viewed as gold in our research field.  So, how do we do this?  The quick answer is Research Data Management!

In my next blog post, I’ll talk about the Data Life Cycle and start digging into the details of what YOU can do to make your data available for our future students and researchers.