Research Data Management

There is a new feature just released in the Semantic Engine!

Now, after you have written your schema you can use this schema to enter and verify data using your web browser.

Find the link to the new tool in the Quick Links list after you have uploaded a schema. Watch our video tutorial on how to easily create your own schema.

Link to the Data Entry Web plus Verification tool in the Quick Links section.

Add data

The Data Entry Web tool lets you upload your schema and then you can optionally upload a dataset. If you choose to upload a dataset, remember that Agri-food Data Canada and the Semantic Engine tool never receive your data. Instead, your data is ‘uploaded’ into your browser and all the data processing happens locally.

If you don’t want to upload a dataset, you can skip this step and go right to the end where you can enter and verify your data in the web browser. You add rows of blank data using the ‘Add rows’ button at the bottom and then enter the data. You can hover over the ?’s to see what data is expected, or click on the ‘verification rules’ to see the schema again to help you enter your data.

 

Screenshot of entering data following the rules of a schema using Data Entry Web.

 

If you upload your dataset you will be able to use the ‘match attributes’ feature. If your schema and your dataset use the same column headers (aka variables or attributes), then the DEW tool will automatically match those columns with the corresponding schema attributes. Your unmatched data column headers are listed in the unassigned variables box to help you identify what is still available to be matched. You can create a match by selecting the correct column name in the associated drop-down. You can also undo an assigned match by selecting its column name.

 

Matching attributes between schema and dataset in the DEW tool.

 

Matching data does two things:

1) Lets you verify the data in a data column (aka variable or attribute) against the rules of the schema. No matching, no verification.

2) When you export data from the DEW tool you have the option of renaming your column names to the schema name. This will automate future matching attempts and can also help you harmonize your dataset to the schema. No matching, no renaming.
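To make the matching idea concrete, here is a minimal Python sketch. It is not how the DEW tool is implemented; the function name match_attributes and the example attribute and column names are hypothetical, and the only rule assumed is the one described above: identical names are matched automatically and everything else stays unassigned until you match it yourself.

```python
def match_attributes(schema_attributes, data_headers):
    """Pair each schema attribute with an identically named data column.

    Returns (matches, unassigned):
      matches    maps schema attribute -> data column header
      unassigned lists data column headers that are not matched yet
    """
    headers = set(data_headers)
    matches = {attr: attr for attr in schema_attributes if attr in headers}
    unassigned = [h for h in data_headers if h not in matches]
    return matches, unassigned


# Hypothetical schema attributes and dataset headers.
schema_attributes = ["farm", "sample_date", "sample_time"]
data_headers = ["farm", "date", "sample_time", "notes"]

matches, unassigned = match_attributes(schema_attributes, data_headers)

# A manual match, like picking a column name in the drop-down.
matches["sample_date"] = "date"
unassigned.remove("date")

print(matches)
print(unassigned)
```

Only the matched columns could then be verified against the schema rules or renamed on export; the leftover "notes" column has no schema attribute to check it against.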

Verify data

After you have either entered or ‘uploaded’ data, it is time to use one of the important tools of DEW – the verification tool! (read our blog post about why it is verification and not validation).

Verification works by comparing the data you have entered against the rules of the schema. It can only verify against the schema rules so if the rule isn’t documented or described correctly in the schema it won’t verify correctly either. You can always schedule a consultation with ADC to receive one-on-one help with writing your schema.

 

Verifying data using a schema in the DEW tool of the Semantic Engine.

 

In the above example you can see the first variable/attribute/column is called farm and the DEW tool displays it as a list to select items from. In your schema you would set this feature up by making an attribute a list (aka entry codes). The other errors we can see in this table are the times. When looking up the schema rules (either via the link to verification rules which pops up the schema for reference, or by hovering over the column’s ?) you can see the expected time should be in ISO standard (HH:MM:SS), which means two digits for hour. The correct times would be something like 09:15:00. These format rules and more are available as the format overlay in the Semantic Engine when writing your schema. See the figure below for an example of adding a format rule to a schema using the Semantic Engine.

 

Add format rules for data entry using the Semantic Engine
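If you are curious what a format check like the HH:MM:SS rule above amounts to, here is a minimal Python sketch. The regular expression and the function names are assumptions made for illustration; the Semantic Engine's own format rules and internals may differ.

```python
import re

# Assumed rule for illustration: two digits each for hour, minute and second.
TIME_PATTERN = re.compile(r"([01]\d|2[0-3]):[0-5]\d:[0-5]\d")


def verify_time(value):
    """Return True when the whole value is a valid HH:MM:SS time."""
    return TIME_PATTERN.fullmatch(value) is not None


def verify_column(values):
    """Return (row_index, value) pairs for entries that break the rule."""
    return [(i, v) for i, v in enumerate(values) if not verify_time(v)]


times = ["09:15:00", "9:15:00", "14:30", "23:59:59"]
print(verify_column(times))
```

The second and third entries are flagged: one is missing the leading zero on the hour and the other is missing the seconds.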

Export data

A key thing to remember: because ADC and the Semantic Engine never store your data, if you leave the webpage, you lose the data! After you have done all the hard work of fixing your data you will want to export the data to keep your results.

You have a few choices when you export the data. If you export to .csv you have the option of keeping your original data headers or changing your headers to the matched schema attributes. When you export to Excel you will generate an Excel file following our Data Entry Excel template. The first sheet will contain all the schema documentation and the next sheet will contain your data with the matching schema attribute names.
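As a rough illustration of the two .csv choices, the Python sketch below writes the same rows either with the original headers or with the headers renamed to schema attribute names. The function export_csv and the header_map argument are hypothetical names for this example, not part of the DEW tool.

```python
import csv
import io


def export_csv(rows, header_map=None):
    """Write rows (a list of dicts keyed by original header) as CSV text.

    header_map maps original header -> schema attribute name. Headers that
    are not in the map are kept as they are.
    """
    header_map = header_map or {}
    original = list(rows[0].keys())
    exported = [header_map.get(h, h) for h in original]

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(exported)
    for row in rows:
        writer.writerow([row[h] for h in original])
    return buffer.getvalue()


rows = [{"date": "2024-01-15", "farm": "A"}]

print(export_csv(rows))                           # original headers
print(export_csv(rows, {"date": "sample_date"}))  # schema attribute names
```

Only the header row changes between the two exports; the data values are written exactly as they were entered.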

The new Data Entry Web tool of the Semantic Engine can help you enter and verify your data. Reuse your schema and improve your data quality using these tools available at the Semantic Engine.

 

Written by Carly Huitema

How should you organize your files and folders when you start on a research project?

Or perhaps you have already started but can’t really find things.

Did you know that there is a recommendation for that? The TIER protocol will help you organize data and associated analysis scripts as well as metadata documentation. The TIER protocol is written explicitly for performing analysis entirely by scripts but there is a lot of good advice that researchers can apply even if they aren’t using scripts yet.

“Documentation that meets the specifications of the TIER Protocol contains all the data, scripts, and supporting information necessary to enable you, your instructor, or an interested third party to reproduce all the computations necessary to generate the results you present in the report you write about your project.” [TIER protocol]

The folder structure of the TIER 4.0 protocol for how to organize research data and analysis scripts.

If you go to the TIER protocol website, you can explore the folder structure and read about the contents of each folder. You have folders for raw data, for intermediate data, and data ready for analysis. You also have folders for all the scripts used in your analysis, as well as any associated descriptive metadata.
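If you like to set up your folders with a script, a small Python sketch such as the one below can create a starting skeleton. The folder names here are assumptions based only on the description above (raw, intermediate and analysis-ready data, scripts, and metadata); check the TIER protocol website for the official names before you adopt them.

```python
from pathlib import Path
import tempfile

# Hypothetical folder names inspired by the description above.
# These are NOT the official TIER Protocol 4.0 names.
FOLDERS = [
    "data/input_data",
    "data/input_data/metadata",
    "data/intermediate_data",
    "data/analysis_data",
    "scripts",
]


def create_project_skeleton(root):
    """Create the folder skeleton under root and return the created paths."""
    root = Path(root)
    created = []
    for folder in FOLDERS:
        path = root / folder
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    for path in create_project_skeleton(Path(tmp) / "my_project"):
        print(path.relative_to(tmp))
```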

You can use the Semantic Engine to write the schema metadata, the data that describes the contents of each of your datasets. Your schemas (both the machine-readable format and the human-readable .txt file) would go into metadata folders of the TIER protocol. The TIER protocol calls data schemas “Codebooks”.

Remember how important it is to never change raw data! Store your raw collected data before any changes are made in the Input Data Files folder and never! ever! change the raw data. Make a copy to work from. It is most valuable when you can work with your data using scripts (stored in the scripts folder of the TIER protocol) rather than making changes to the data directly via (for example) Excel. Benefits include reproducibility and the ease of changing your analysis method. If you write a script you always have a record of how you transformed your data, and anyone can re-run the script if needed. If you make a mistake you don’t have to painstakingly go back through your data and try to remember what you did; you just make the change in the script and re-run it.
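Here is a minimal Python sketch of that habit: the script only reads the raw file and writes its result to a separate file. The column names, the file names, and the rule of dropping rows with a missing weight are all made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def prepare_analysis_data(raw_path, analysis_path):
    """Read the raw CSV and write a cleaned copy; raw_path is never written to.

    Returns the number of rows that were dropped.
    """
    with open(raw_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Documented step (hypothetical rule): drop rows with a missing weight.
    kept = [row for row in rows if row["weight_kg"].strip() != ""]

    with open(analysis_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    return len(rows) - len(kept)


# Demonstrate with a small made-up file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_weights.csv"
    raw.write_text("animal_id,weight_kg\nA1,512\nA2,\nA3,498\n")
    analysis = Path(tmp) / "analysis_weights.csv"

    dropped = prepare_analysis_data(raw, analysis)
    print(dropped, "row(s) dropped")
    print(analysis.read_text())
```

If the rule turns out to be wrong, you change one line in the script and re-run it, and the raw file is still exactly as it was collected.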

The TIER protocol is written explicitly for performing analysis entirely by scripts. If you don’t use scripts to analyze your data or for some of your data preparation steps you should be sure to write out all the steps carefully in an analysis documentation file. If you are doing the analysis for example in Excel you would document each manual step you make to sort, clean, normalize, and subset your data as you develop your analysis. How did you use a pivot table? How did you decide which data points were outliers? Why did you choose to exclude values from your analysis? The TIER protocol can be imitated such that all of this information is also stored in the scripts folder of the TIER protocol.

Even if you don’t follow all the directions of the TIER protocol, you can explore the structure to get ideas of how to best manage your own data folders and files. Be sure to also look at advice on how to name your files as well to ensure things are very clear.

Written by Carly Huitema

Findable

Accessible (where possible)

Interoperable

Reusable

I believe most of us are now familiar with this acronym?  The FAIR principles  published in 2016.  I have to admit that part of me really wants to create a song around these 4 words – but I’ll save you all from that scary venture.  Seriously though, how many of us are aware of the FAIR principles?  Better yet, how many of us are aware of the impact of the FAIR principles?  Over my next blog posts we’ll take a look at each of the FAIR letters and I’ll pull them all together with the RDM posts – YES there is a relationship!

So, YES I’m working backwards and there’s a reason for this.  I really want to “sell” you on the idea of FAIR.  Why do we consider this so important and a key to effective Research Data Management – oh heck it is also a MAJOR key to science today.

R is for Reusable

Reusable data – hang on – you want to REUSE my data?  But I’m the only one who understands it!   I’m not finished using it yet!  This data was created to answer one research question, there’s no way it could be useful to anyone else!  Any of these statements sound familiar?   Hmmm…  I may have pointed some of these out in the RDM posts – but aside from that – truthfully, can you relate to any of these statements?  No worries, I already know the answer and I’m not going to ask you to confess to believing or having said or thought any of these.  Ah I think I just heard that community sigh of relief 🙂

So let’s look at what can happen when a researcher does not take care of their data or does not put measures into place to make their data FAIR – remember we’re concentrating on the R for reusability today.

Reproducibility Crisis?

Have you heard about the reproducibility crisis in our scientific world?  The inability to reproduce published studies.  Imagine statements like this: “…in the field of cancer research, only about 20-25% of the published studies could be validated or reproduced…”? (Miyakawa, 2020). How scary is that?  Sometimes when we think about reproducibility and reuse of our data – questions that come to mind – at least my mind – why would someone want my data?  It’s not that exciting?  But boys oh boys when you step back and think about the bigger picture – holy cow!!!  We are not just talking about data in our little neck of the woods – this challenge of making your research data available to others – has a MUCH broader and larger impact!  20-25% of published studies!!! and that’s just in the cancer research field.  If you start looking into this crisis you will see other numbers too!

So, really what’s the problem here?   Someone cannot reproduce a study – maybe it’s age of the equipment, or my favourite – the statistical methodologies were not written in a way the reader could reproduce the results IF they had access to the original data.  There are many reasons why a study may not be reproducible – BUT – our focus is the DATA!

The study I referred to above also talks about some of the issues the author encountered in his capacity as a reviewer.  The issue that I want to highlight here is access to the RAW data or insufficient documentation about the data – aha!!  That’s the link to RDM.  Creating adequate documentation about your data will only help you and any future users of your data!  Many studies cannot be reproduced because the raw data is NOT accessible and/or it is NOT documented!

Pitfalls to NO Reusable data

There have been a few notable researchers that have lost their career because of their data or rather lack thereof.  One notable one is Brian Wansink, formerly of Cornell University.  His research was ground-breaking at the time, studying eating habits, looking at how cafeterias could make food more appealing to children, it was truly great stuff!  BUT…..  when asked for the raw data…..  that’s when everything fell apart.  To learn more about this situation follow the link I provided above that will take you to a TIME article.

This is a worst case scenario – I know – but maybe I am trying to scare you!  Let’s start treating our data as a first class citizen and not an artifact of our research projects.  FAIR data is research data that should be Findable, Accessible (where possible), Interoperable, and REUSABLE!  Start thinking beyond your study – one never knows when the data you collected during your MSc or PhD may be crucial to a study in the future.  Let’s ensure it’s available and documented – remember Research Data Management best practices – for the future.

Michelle

In the rapidly evolving landscape of livestock research, the ability to harness data from diverse sources is paramount. From sensors monitoring animal health to weather data influencing grazing patterns, the insights derived from integrated data can drive informed decisions and innovative solutions. However, integrating data into a centralized livestock research database presents a myriad of challenges that require careful consideration and robust solutions.

Challenges of Data Integration:

  1. Diverse Data Sources: Livestock research generates data from a multitude of sources, including sensors, health monitoring devices, laboratory tests, and manual observations. Each source may produce data in different formats and structures, complicating the integration process.
  2. Data Quality and Consistency: Ensuring data quality and consistency across disparate sources is crucial for meaningful analysis and interpretation. Discrepancies in data formats, missing values, and inconsistencies pose significant challenges that must be addressed.
  3. Real-Time Data Flow: In the dynamic environment of livestock research, timely access to data is essential. Establishing systems for continuous data flow ensures that researchers have access to the latest information for analysis and decision-making.

Solutions for Seamless Data Integration:

  1. Standardized Data Formats: Implementing standardized data formats, such as JSON or CSV, facilitates easier integration across different sources. By establishing data standards, organizations can streamline the integration process and improve interoperability.
  2. Data Governance and Quality Assurance: Developing robust data governance policies and quality assurance processes helps maintain data integrity throughout the integration pipeline. Regular audits, validation checks, and data cleaning protocols ensure that only high-quality data is integrated into the research database.
  3. APIs and Data Pipelines: Leveraging application programming interfaces (APIs) and data pipelines enables automated data retrieval and integration from various sources. APIs provide a standardized way to access and transmit data, while data pipelines automate the flow of data, ensuring seamless integration and synchronization.
  4. Data Synchronization and Monitoring: Implementing mechanisms for data synchronization and monitoring ensures that data flows continuously and is not missing. Regular checks and alerts can notify database administrators of any disruptions in data flow, allowing for timely resolution.
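The solutions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the example records, and the functions validate and integrate are hypothetical, and a real pipeline would read from files or APIs rather than from strings.

```python
import csv
import io
import json

# Hypothetical set of fields every integrated record must carry.
REQUIRED_FIELDS = ("animal_id", "timestamp", "measurement", "value")


def records_from_csv(text):
    """Parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def records_from_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def validate(record):
    """Return a list of problems found in one record (empty means OK)."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS
                if record.get(field) in (None, "")]
    if not problems:
        try:
            float(record["value"])
        except (TypeError, ValueError):
            problems.append("value is not numeric")
    return problems


def integrate(sources):
    """Split records from all sources into accepted and rejected lists."""
    accepted, rejected = [], []
    for records in sources:
        for record in records:
            problems = validate(record)
            if problems:
                rejected.append((record, problems))
            else:
                clean = {field: record[field] for field in REQUIRED_FIELDS}
                clean["value"] = float(record["value"])
                accepted.append(clean)
    return accepted, rejected


csv_text = (
    "animal_id,timestamp,measurement,value\n"
    "C001,2024-01-15T09:15:00,body_temp_c,38.6\n"
    "C002,2024-01-15T09:20:00,body_temp_c,\n"
)
json_text = (
    '[{"animal_id": "C003", "timestamp": "2024-01-15T09:25:00", '
    '"measurement": "weight_kg", "value": 512}]'
)

accepted, rejected = integrate(
    [records_from_csv(csv_text), records_from_json(json_text)]
)
print(len(accepted), "accepted;", len(rejected), "rejected")
for record, problems in rejected:
    print(record["animal_id"], problems)
```

Keeping the rejected records together with the reason for rejection is one simple way to support the regular checks and alerts mentioned above, because a sudden jump in rejections points to a problem at the source.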

Conclusion:

In the pursuit of advancing livestock research, data integration plays a pivotal role in unlocking valuable insights and driving innovation. By addressing the challenges associated with integrating data from diverse sources and formats, organizations can create a centralized research database that serves as a foundation for evidence-based decision-making and scientific discovery. Through standardized formats, robust governance practices, and automated data pipelines, seamless data integration becomes achievable, empowering researchers to harness the full potential of data in advancing livestock management and welfare.

 

Written by Lucas Alcantara

We first talked about writing filenames back in a post about Organizing your data: Research Data Management (RDM)

To further improve your filenaming game, check out a naming convention worksheet from Caltech that helps researchers create a great filenaming system that works with their workflows.

Example: My file naming convention is “SA-MPL-EID_YYYYMMDD_###_status.tif” Examples are “P1-MUS-023_20200229_051_raw.tif” and “P2-DRS-285_20191031_062_composite.tif”.
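Once you have a convention, you can check file names against it automatically. The Python sketch below is one way to do that for the example convention above. The exact pattern (two characters, three letters and three digits in the sample ID, and a lowercase status word) is an assumption inferred from the two example file names, so adjust it to your own convention.

```python
import re
from datetime import datetime

# Pattern inferred from the example names; treat it as an assumption.
FILENAME_PATTERN = re.compile(
    r"(?P<sample_id>[A-Z0-9]{2}-[A-Z]{3}-\d{3})"
    r"_(?P<date>\d{8})"
    r"_(?P<sequence>\d{3})"
    r"_(?P<status>[a-z]+)\.tif"
)


def parse_filename(name):
    """Return the parts of a conforming file name, or None if it does not fit."""
    match = FILENAME_PATTERN.fullmatch(name)
    if match is None:
        return None
    parts = match.groupdict()
    try:
        # Reject impossible dates such as 20210229.
        datetime.strptime(parts["date"], "%Y%m%d")
    except ValueError:
        return None
    return parts


for name in [
    "P1-MUS-023_20200229_051_raw.tif",
    "P2-DRS-285_20191031_062_composite.tif",
    "mouse final v2.tif",
]:
    print(name, "->", parse_filename(name))
```

Because the date is written as YYYYMMDD, sorting conforming file names alphabetically within a sample also sorts them chronologically.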

 

Why standardize file names?

Standardizing file naming conventions helps researchers better organize their own work and collaborate with others. Here are some key benefits of adopting standardized file naming conventions:

 

Consistency:

Standardized file naming ensures a consistent structure across files, making it easier for researchers to locate and identify documents. This consistency reduces confusion and streamlines file management.

 

Improved Searchability:

A standardized naming convention makes searching easier. Users can quickly locate files based on keywords, project names, or other relevant information embedded in the file names, reducing the time spent searching for specific documents.

 

Ease of Sorting:

Uniform file names make it easier to sort and arrange files in alphabetical or chronological order. This aids in maintaining an organized file structure, especially when dealing with a large volume of documents.

 

Enhanced Collaboration:

In collaborative environments (such as sharing data with your supervisor), standardized file naming promotes a shared understanding of how files are named and organized.

 

Version Control:

Including version numbers or dates in file names helps manage version control effectively. This is particularly important in situations where multiple iterations of a document are created, ensuring that users can identify the most recent or relevant version.

By adopting a consistent approach to file naming, researchers can improve their overall file management processes, prevent mistakes, and enhance productivity.

 

File organization:

Keep reading our blog for an upcoming post on file organization using the TIER protocol.

 

Written by Carly Huitema

How many of you document your statistical analysis code?  SAS and R users, do you add comments in your programs?  Or do you fly by the seat of your pants, write, modify code, and know that you’ll remember what you did at a later time?  I know we don’t have the time to add all this information in our code, but I cannot stress enough how IMPORTANT it is to do something!  I’ll use the same story as I did a few posts ago – and be honest!  Will you 100% remember WHY you adjusted that weight variable?  HOW you adjusted that weight variable?  WHY you dropped that observation or the other?   If you’re a paper person like myself – you may have written it down in your lab or notebook.  Fabulous!  BUT!  What happens to your notes when you archive your data and associated statistical programs?

Many if not all of the statistical programs that you are using have the ability to add comments among your code.  This is the easiest way to start documenting your statistical analysis.  When you archive your data, you will also archive your statistical analysis programs (.sas and .r are text files), so future users of your data will understand how your data was created.  Yes, there are other options to capture your documentation, one of these is Markdown.  Imagine writing your SAS or R code, adding documentation explaining what you do as you work through the code AND adding documentation to your output – all at the same time!
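Here is a small sketch of what documenting with comments can look like. It is written in Python purely for illustration, and the same habit applies to comments in your SAS or R programs. The animal ID, the reason for dropping it, and the unit conversion are hypothetical examples, not taken from a real study.

```python
LB_TO_KG = 0.45359237  # exact definition of the international pound in kg


def prepare_weights(records):
    """Return analysis-ready records with weights in kilograms."""
    prepared = []
    for record in records:
        # WHY dropped: animal A17 was ill on the sampling day
        # (hypothetical reason; also recorded in the lab notebook).
        if record["animal_id"] == "A17":
            continue

        # WHY and HOW adjusted: the scale recorded pounds, but the
        # analysis uses kilograms, so convert and round to 0.1 kg.
        weight_kg = round(record["weight_lb"] * LB_TO_KG, 1)
        prepared.append({"animal_id": record["animal_id"], "weight_kg": weight_kg})
    return prepared


records = [
    {"animal_id": "A16", "weight_lb": 1000},
    {"animal_id": "A17", "weight_lb": 950},
]
print(prepare_weights(records))
```

Five years from now, the two comments answer the WHY and HOW questions without anyone having to find the notebook.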

If you’re curious how Markdown works for both R and SAS, review this tutorial “Saving your code and making your research REPRODUCIBLE”. A more recent workshop demonstrating how to use R Markdown for R processes can be found at the Agri-food Research Data Workshop Series: Workshop 6: Documenting your Data and Processes with R Markdown .

Enjoy!  And remember if you have any questions or comments  – please let us know at adc@uoguelph.ca

Michelle

 

Maintaining data quality can be a constant challenge. One effective solution is the use of entry codes. Let’s explore what entry codes entail, why they are crucial for clean data, and how they are seamlessly integrated into the Semantic Engine using Overlays Capture Architecture (OCA).

 

Understanding Entry Codes in Data Entry

Entry codes serve as structured identifiers in data entry, offering a systematic approach to input data. Instead of allowing free-form text, entry codes limit choices to a predefined list, ensuring consistency and accuracy in the dataset.

 

The Need for Clean Data

Data cleanliness is essential for meaningful analysis and decision-making. Without restrictions on data entry, datasets often suffer from various spellings and abbreviations of the same terms, leading to confusion and misinterpretation.

 

Practical Examples of Entry Code Implementation

Consider scenarios in scientific research where specific information, such as research locations, gene names, or experimental conditions, needs to be recorded. Entry codes provide a standardized framework, reducing the likelihood of inconsistent data entries.

 

Overcoming Cleanup Challenges

In the past, when working with datasets lacking entry codes, manual cleanup or tools like Open Refine were essential. Open Refine is a useful data cleaning tool that lets users standardize data after collection has been completed.

 

Leveraging OCA for Improved Data Management

Overlays Capture Architecture (OCA) takes entry codes a step further by allowing the creation of lists to limit data entry choices. Invalid entries, those not on the predefined list (entry code list), are easily identified, enhancing the overall quality of the dataset.

 

Language-specific Labels in OCA

OCA introduces a noteworthy feature – language-specific labels for entry codes. In instances like financial data entry, where numerical codes may be challenging to remember, users can associate user-friendly labels (e.g., account names) with numerical entry codes. This ensures ease of data entry without compromising accuracy.

An example of adding entry codes to a schema using the Semantic Engine.

Multilingual Support for Global Usability

OCA’s multilingual support adds a layer of inclusivity, enabling the incorporation of labels in multiple languages. This feature facilitates international collaboration, allowing users worldwide to engage with the dataset in a language they are comfortable with.

 

Crafting Acceptable Data Entries in OCA

When creating lists in OCA, users define acceptable data entries for specific attributes. Labels accompanying entry codes aid users in understanding and selecting the correct code, contributing to cleaner datasets.

 

Clarifying the Distinction between Labels and Entry Codes

It’s important to note that, in OCA, the emphasis is on entry codes rather than labels. While labels provide user-friendly descriptions, it is the entry code itself that becomes part of the dataset, ensuring data uniformity.
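To illustrate the difference between entry codes and labels, here is a minimal Python sketch. The dictionary layout, the farm codes, and the labels are hypothetical and are not the actual OCA file format; the sketch only shows the idea that labels help people choose while the code is what gets stored and verified.

```python
# Hypothetical entry codes with language-specific labels.
# This layout is for illustration only and is NOT the OCA format.
ENTRY_CODES = {
    "farm": {
        "codes": ["F01", "F02", "F03"],
        "labels": {
            "en": {"F01": "North farm", "F02": "River farm", "F03": "Hill farm"},
            "fr": {
                "F01": "Ferme du nord",
                "F02": "Ferme de la rivière",
                "F03": "Ferme de la colline",
            },
        },
    }
}


def label_for(attribute, code, language="en"):
    """Return the user-friendly label shown for an entry code."""
    return ENTRY_CODES[attribute]["labels"][language][code]


def invalid_entries(attribute, values):
    """Return (row_index, value) pairs that are not on the entry code list."""
    allowed = set(ENTRY_CODES[attribute]["codes"])
    return [(i, v) for i, v in enumerate(values) if v not in allowed]


entries = ["F01", "f02", "North farm", "F03"]
print(invalid_entries("farm", entries))
print(label_for("farm", "F03", "fr"))
```

Both "f02" and "North farm" are flagged: only the exact entry codes belong in the dataset, even though the label is what a person would read on screen.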

 

In conclusion, entry codes play an important role in streamlining data entry and enhancing the quality of datasets. Through the practical implementation of entry codes supported by Overlays Capture Architecture, organizations can ensure that their data remains accurate, consistent, and accessible on a global scale.

Happy New Year everyone!!!  Welcome to 2024 – Leap year!! 

Oh wow!  How time is really flying by!  It’s so easy for us to say this and see it happen in our every day lives – BUT – yes it also happens at work and with our research.  I remember as a graduate student, in the thick of data collection, thinking I’m never going to finish this project by a given date – there’s too much to do – it’ll never happen!  And just like that, whoosh, it’s over and done, and I’ve managed to complete a few research projects since.  It’s just amazing how time really does fly by.

As we start a new year with new aspirations, what a great time to implement new habits in our research work!  Ah yes, the dreaded documentation piece.  Last time we spoke, I talked about variable names and provided you with a list of recommended best practices when creating your variable names for analysis.  I also nudged you about keeping those labels, and using the Semantic Engine to create your data schema – check out Carly’s post about Crafting Effective Machine-Actionable Schemas.

So, we have variable names and a data schema, but is that ALL the documentation you should be keeping when you conduct a research project?  Of course the answer is NO!  Let’s review some other possible documentation pieces and ways to create the documentation.

README file

Let’s tackle the easy piece first and probably the one that will take the longest.   A README file is a text file that you should keep in the top folder of your project.  Now, let’s first talk about what I mean by a text file.  A file created and saved using Notepad on a Windows machine OR TextEdit on a Mac – NOT Word!!!  Now I’m sure you’re asking why in the world would I want to use a text editor – a program with NO formatting ability – my document is going to be ugly!  Yes it will!  BUT – using a text editor, aka creating a file with a .txt ending, will provide you with the comfort that your file will be readable by researchers in the future.  Thinking about the Word program as an example, are you 100% positive that the next release will be readable say 5 years from now?  Can we read older Word documents today?  If you have an older computer with an older version of Word, can you read a document that was created in a newer version of Word?  Chances are you’ll have formatting challenges.  So…. let’s just avoid that nonsense and use a format that is archivable!  .txt

So now that we got that out of the way, what should we include in a README file?  Think of the README file as your project or study level documentation.  This is where you will describe your folder structure and explain your acronyms.  This is also where you will give a brief abstract of your study, who the principal investigators are, timeframes, and any information you believe should be passed on to future researchers.  Things like challenges with data collection – downpour on day 10 prevented data collection – data collection was conducted by 3 new individuals on day 15, etc…  Think about the information that you would find important if YOU were using another study’s data.  If you are looking for examples, check out the READMEs in the OAC Historical Research Data and Reproducibility Project dataverse.
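If a blank page is what stops you from starting, a short script can lay out the headings for you. The Python sketch below builds a plain-text README from the items listed above. The section titles, the function name build_readme, and all of the example study details are hypothetical, so rename and reorder them to suit your project.

```python
def build_readme(title, investigators, timeframe, abstract,
                 folders, acronyms, notes):
    """Return plain README text built from study-level details."""
    lines = [title, "=" * len(title), ""]
    lines += ["PRINCIPAL INVESTIGATORS"]
    lines += [f"- {name}" for name in investigators]
    lines += ["", "TIMEFRAME", timeframe, "", "ABSTRACT", abstract, ""]
    lines += ["FOLDER STRUCTURE"]
    lines += [f"- {folder}: {purpose}" for folder, purpose in folders.items()]
    lines += ["", "ACRONYMS"]
    lines += [f"- {short}: {meaning}" for short, meaning in acronyms.items()]
    lines += ["", "NOTES FOR FUTURE RESEARCHERS"]
    lines += [f"- {note}" for note in notes]
    return "\n".join(lines) + "\n"


# All details below are made up for illustration.
readme_text = build_readme(
    title="Example grazing study",
    investigators=["Principal Investigator Name"],
    timeframe="Start date to end date",
    abstract="Two or three sentences describing the study.",
    folders={"data": "raw and analysis-ready data", "scripts": "analysis code"},
    acronyms={"ADG": "average daily gain"},
    notes=["Downpour on day 10 prevented data collection."],
)
print(readme_text)

# To save it as an archivable text file in your project's top folder:
# with open("README.txt", "w", encoding="utf-8") as f:
#     f.write(readme_text)
```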

The README file is often a skipped yet crucial documentation piece to any project.  Some projects use a lab book to capture this information.  No matter what media you use the end goal is to capture this information and create a text file for future use.

Conclusion

One more piece of documentation I want to talk about is capturing what happens in your statistical analysis.  Let’s leave that for the next post.

Michelle

 

In the dynamic realm of research, where data is the cornerstone of ground-breaking discoveries, ensuring the integrity, reproducibility, and collaboration of your work is paramount. One indispensable tool that researchers often overlook but should embrace wholeheartedly is version control. In this blog post, we’ll delve into the importance of version control for research data, explore its benefits, and provide a few tricks on using GitHub to effectively track changes in your analysis code.

 

The Importance of Version Control in Research Data

  1. Preserving Data Integrity

Version control acts as a safety net for your research data, preserving its integrity by keeping a historical record of changes. This ensures that no information is lost or irreversibly altered, allowing researchers to retrace their steps if necessary.

 

  2. Enhancing Reproducibility

Reproducibility is a cornerstone of scientific research. Version control enables the precise recreation of experiments and analyses by maintaining a detailed history of code changes. This not only validates your findings but also allows others to replicate and build upon your work.

 

  3. Facilitating Collaboration

Collaborative research projects often involve multiple contributors. Version control platforms, such as GitHub, streamline collaboration by providing a centralized hub where team members can contribute, review, and merge changes seamlessly. This collaborative environment fosters a more efficient and organized research process.

 

  4. Streamlining Peer Review

Version control systems offer a transparent view of the evolution of your research, making it easier for peers and reviewers to understand and assess your methodology. This transparency enhances the credibility of your work and facilitates a smoother peer-review process.

 

Tricks for Using GitHub to Track Changes

  1. Branching Strategies

Leverage GitHub’s branching features to create separate branches for different features or experimental approaches. This allows you to experiment without affecting the main codebase and facilitates the integration of successful changes.

 

  2. Commits with Descriptive Messages

Adopt a disciplined approach to commit messages. Each commit should have a clear and concise message describing the changes made. This practice enhances traceability and makes it easier to understand the purpose of each modification.

 

  3. Utilize Pull Requests

GitHub’s pull request feature is a powerful tool for code review and collaboration. It enables contributors to propose changes, discuss modifications, and ensure the quality of the code before it is merged into the main branch.

 

  4. Continuous Integration

Integrate continuous integration tools with your GitHub repository to automatically test changes and ensure that new code additions do not break existing functionality. This ensures a more robust and error-free codebase.
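To give a feel for what a continuous integration tool would actually run, here is a minimal Python sketch of an analysis function together with two small checks. The function mean_daily_gain and its numbers are hypothetical. The test_ naming follows a convention used by common Python test runners, and which runner your CI service calls is something you would configure yourself.

```python
def mean_daily_gain(start_weight_kg, end_weight_kg, days):
    """Average weight gained per day over the measurement period."""
    if days <= 0:
        raise ValueError("days must be positive")
    return (end_weight_kg - start_weight_kg) / days


def test_mean_daily_gain():
    # 30 kg gained over 30 days is 1 kg per day.
    assert mean_daily_gain(100.0, 130.0, 30) == 1.0


def test_rejects_non_positive_days():
    try:
        mean_daily_gain(100.0, 130.0, 0)
    except ValueError:
        return
    raise AssertionError("expected a ValueError for days=0")


if __name__ == "__main__":
    # Run the checks directly; a CI tool would run them on every change.
    test_mean_daily_gain()
    test_rejects_non_positive_days()
    print("all checks passed")
```

If someone later edits mean_daily_gain in a way that changes its results, the checks fail before the change is merged into the main branch.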

 

Learn More with Resources

For those eager to dive deeper into the world of version control and GitHub, there are plenty of accessible resources available. For a comprehensive understanding of version control concepts, “Pro Git” by Scott Chacon and Ben Straub is an excellent e-book available for free online. It covers the basics and advanced features of Git, the underlying technology behind GitHub, making it an invaluable resource for both beginners and experienced users.

 

YouTube is also a treasure trove of tutorials. The “freeCodeCamp.org” channel offers a series of well-structured videos covering everything from the basics to advanced workflows. The tutorials provide hands-on demonstrations, making it easier for visual learners to grasp the concepts.

 

If you prefer a more interactive learning experience, GitHub Skills is an online platform offering hands-on courses directly within the GitHub environment. These interactive courses guide you through real-world scenarios, allowing you to practice version control concepts in a practical setting.

 

Last but not least, Agri-food Data Canada hosted a hands-on “Introduction to GitHub” workshop this summer as part of the successful Research Data Workshop Series, so make sure to check our slide notes to learn more about GitHub and stay tuned for the next editions of this workshop series.

 

Remember, mastering version control and GitHub is an ongoing journey, and these resources serve as invaluable companions, whether you’re a novice seeking a solid foundation or an experienced user looking to refine your skills. Happy learning!

 

In conclusion, version control, especially when harnessed through platforms like GitHub, is not just a technical nicety but a fundamental necessity in the realm of research data. By adopting these practices, researchers can safeguard the integrity of their work, enhance collaboration, and contribute to the broader scientific community with confidence. Embrace version control and watch your research journey unfold with unprecedented clarity and efficiency.

 

Lucas Alcantara

The next stop on our RDM travels is “Documenting your work”. Those 3 words can scare a lot of people. Let’s face it: that means spending time writing things down or creating scripts, and it could be viewed as taking time away from conducting research and analysis. Yes, I know, I know – and anyone who has worked with me in the past knows that I value documentation VERY highly! Without documentation, your data is valuable to YOU at this moment, but 6 months or 5 years down the road it may become useless. On this note, before I start talking about the details of documenting your work, I would like to share the Data Sharing and Management Snafu in 3 Short Acts video. I cannot believe that this video is 10 years old, but it is still SO relevant. If you have not seen it, please watch it! It highlights WHY we are talking about RDM, and near the end it deals with our topic today: documenting your data.

Reference:  NYU Health Sciences Library. “Data Sharing and Management Snafu in 3 Short Acts” YouTube, 19 Dec 2012, https://www.youtube.com/watch?v=N2zK3sAtr-4.

Variable Names

So let’s talk about variable names for your statistical analyses. Variable names are usually created with statistical analysis packages in mind. Let’s be honest, we only want to create the variable names once: if we have to rename them, we increase our chances of introducing oddities in our analyses and outputs. Hmm… could I be talking about personal experiences? How many times, in the past, have I fallen into the trap of naming my variables V1, V2, V3, etc… or ME1 or ME_final? It is so easy to fall into these situations, especially when we have a deadline looming. So let’s try to build some habits that will help us avoid these situations and help us create data documentation that can eventually be shared and understood by researchers outside of our inner circle. A great place to begin is by reviewing the naming characteristics of the most popular packages used by University of Guelph researchers, based on a survey I conducted in 2017.

Length of Variable Name

SAS: 32 characters long
Stata: 32 characters long
Matlab: 63 characters long
SPSS: 64 bytes long = 64 characters in English or 32 characters in Chinese
R: 10,000 characters long

1st Character of a Variable Name

SAS: MUST be a letter or an underscore
Stata: MUST be a letter or an underscore
Matlab: MUST be a letter
SPSS: MUST be a letter or @,#,$
R: MUST be a letter, or a period that is not followed by a number

Special Characters in Variable Names

SAS: NOT allowed
Stata: NOT allowed
Matlab: NOT allowed
SPSS: ONLY Period, @ are allowed
R: ONLY Period is allowed

Case in Variable Names

SAS: Mixed case – Presentation only
Stata: Case sensitive
Matlab: Case sensitive
SPSS: Mixed case – Presentation only
R: Case sensitive

Recommended Best Practice for Variable Names

Based on the naming characteristics listed above, the following Recommended Best Practices are worth considering when naming your variables:

  1. Set Maximum length to 32 characters
  2. ALWAYS start variable names with a letter
  3. Numbers can be used anywhere in the variable name AFTER the first character
  4. The ONLY special character to use in a variable name is the underscore “_”
  5. Do NOT use blanks or spaces
  6. Use lowercase
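The recommendations above can be sketched as a small Python check. This is only an illustration: the function name `check_variable_name` and the wording of the messages are assumptions made for this example, not part of any statistical package.

```python
import re

# Recommendation 1: the shortest limit among the packages listed above.
MAX_LENGTH = 32


def check_variable_name(name):
    """Return a list of problems found in a proposed variable name.

    An empty list means the name follows all of the recommendations.
    """
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    if not name[:1].isalpha():
        problems.append("does not start with a letter")
    if " " in name:
        problems.append("contains a blank or space")
    if name != name.lower():
        problems.append("contains uppercase letters")
    # Only unaccented letters, digits and underscores pass this check
    # (spaces are reported separately above).
    if not re.fullmatch(r"[A-Za-z0-9_ ]*", name):
        problems.append("contains a special character other than underscore")
    return problems


if __name__ == "__main__":
    for candidate in ["diet_a", "fibre_cm", "Diet A", "2weight", "price.paid"]:
        print(candidate, check_variable_name(candidate))
```

Running a check like this over the column headers of a dataset before starting an analysis is one way to catch names that will cause trouble in one package or another.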

Example Variable Names

Heading in Excel or description of the measure to be taken → variable name to be used in a statistical analysis

Diet A → diet_a
Fibre length in centimetres → fibre_cm
Location of farm → location
Price paid for fleece → price
Weight measured during 2nd week of trial → weight2

Label or description of the variable

Let’s ALWAYS make sure that the label or description that goes with each variable name is documented. Check out the Semantic Engine, an easy-to-use tool to document your dataset!
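As a rough sketch of what keeping labels with variable names can look like, the Python snippet below stores the example names and their descriptions in a small data dictionary and writes it out as CSV text. The column headers `variable_name` and `label` and the function name are assumptions made for this illustration; this is not the format used by the Semantic Engine.

```python
import csv
import io

# Variable name -> label, taken from the example variable names above.
data_dictionary = {
    "diet_a": "Diet A",
    "fibre_cm": "Fibre length in centimetres",
    "location": "Location of farm",
    "price": "Price paid for fleece",
    "weight2": "Weight measured during 2nd week of trial",
}


def dictionary_to_csv(dictionary):
    """Return the data dictionary as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["variable_name", "label"])  # hypothetical column headers
    for name, label in dictionary.items():
        writer.writerow([name, label])
    return buffer.getvalue()


if __name__ == "__main__":
    print(dictionary_to_csv(data_dictionary))
```

Saving a file like this next to the dataset means the short names used in the analysis can always be traced back to what was actually measured.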

Conclusion

Variable names are only one piece of the documentation for any study, but they are usually the first piece we tend to work on as we collect our data or once we start the analysis. In the next RDM post, I will talk about the other aspects of documentation and present different ways to do it.

Michelle