Today I was diving deep into one of my pipelines to fix potential issues I noticed in my results. In short, I had neglected to remove invasive species from my phylogenetic regionalization analyses (a topic I'm sure I need to talk about more in depth at some point). I'm not sure if this has drastic effects downstream, but regardless, I thought it would be best to remove these taxa altogether.
And I was reminded of one of the best R tips for pipelines that I think too many of us ignore.
Breaking your scripts into steps
I actually talk about this a little in my Fundamentals of R for Biologists online course, but it's such a useful practice that I thought I'd write up this post while some of my analyses are running!
Instead of having one very large script that performs all the data cleaning, analysis, and visualization for your entire thesis (if you feel called out by this, yes, you should stop doing this immediately), create a series of scripts that perform the analyses step by step.
Doing so will keep your code more organized, make your coding experience that much easier, and allow you to make changes to your pipeline incredibly quickly.
I know that more senior programmers, or those with great coding etiquette, think this is basic, but hey, some people only code once every six months when their field season grant money runs out.
For reference, every project of mine has folders for inputs, outputs, and scripts. Just another organization tip. Here is what the scripts folder looks like for this R project.
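Something along these lines (the 1.1, 2, and 3 scripts are real ones I'll come back to below; the rest are placeholder names):
1_clean_occurrence_data.R
1.1_graft_phylogenies.R
2_clean_spatial.R
3_combine_GBIF_IUCN.R
999_posthoc_latitude.R
999_posthoc_richness.R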
My preferred naming format is simple: Step#_descriptive_filename.R
If a particular step has an optional substep, or needs to fork for different data subsets, or I'm feeling lazy and don't want to rename every file, I just add a decimal sub-number: Step#.Sub#_descriptive_filename.R
See the file 1.1_graft_phylogenies.R, which is an old analysis that is no longer used in this pipeline but that I kept JUST IN CASE.
Oh! And all the analyses prefixed with 999 are post-hoc analyses that aren't in any particular order. All of them are fed the outputs of different steps in the pipeline. I use 999 so they are put at the end of my scripts folder when sorting by name.
Why do this though?
It makes editing, updating, and changing your pipeline IMMENSELY easier. Here's why:
Three months ago, I finally decided to take my messy, long, drawn-out scripts and convert them to this stepwise format. Since then I have had a laundry list of changes to make: updating taxa names, fine-tuning the spatial resolution, running my analyses twice, once each for amphibians and squamates, adding new scripts for post-hoc tests, and so on and so on.
When I used to have my entire pipeline in 2 or 3 massive scripts, I often felt overwhelmed, worried that a single change would break my downstream analyses, and generally frustrated at how much effort it took to edit my script.
Now each step acts as a self-contained box of analyses. I know that the file 2_clean_spatial.R is there to clean my spatial data. I know the outputs of that file are fed into 3_combine_GBIF_IUCN.R, and so on and so forth.
While it can be a bit difficult to figure out where to split your steps, I try to keep each one short, highly focused on only one or a few tasks, and easy to tell apart from the other scripts. Of course, it's incredibly easy to break your scripts up even further or combine several together.
This enables me to know exactly which elements of the pipeline are not working, it helps keep my coding environment clean, and when R inevitably crashes and I lose all my data (this literally just happened), it's a lot easier to pick back up where I left off.
I’ve brought up this script format with students before, and I often receive initial pushback along the lines of:
But what about all the objects for each step? I have a billion different objects in my environment that I need for my analyses! This is going to create more work for me, right?
That's where part two comes in: using .RData files, save(), and load().
Let's take a look at my 3_combine_GBIF_IUCN.R file. This step requires occurrence data that I've already cleaned and a raster that shows the extent of my study. It will output a raster file of grids, a full occurrence dataset, and a presence-absence matrix telling me which grid cells each taxon is found in.
At the very top of the script, right after I load the libraries I need, there is a single function call that loads in all the data this step needs:
load("Outputs/2_cleaned_data.rData")
This single call easily loads six objects into my R environment.
All of these objects are the outputs of the previous pipeline step, hence the 2 at the start of the file name.
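A handy detail: load() invisibly returns the names of the objects it just created, so if you ever forget what's inside a .RData file, you can capture and print them (the object names in the comment below are just hypothetical examples):
loaded <- load("Outputs/2_cleaned_data.RData")
print(loaded)
# e.g. "occ_clean" "taxa_list" "study_extent" ... (hypothetical names)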
With my stepped pipeline structure, I prefer to have as many data inputs as needed, but only one data output at the end of each step. This way, it is incredibly easy to know which step of the pipeline a particular data object was derived from.
At the end of this script, for example, I want to export the full occurrence dataset, a presence-absence matrix, and a raster grid overlaying my study area. So I simply wrote:
save(occ_full, pres_ab, raster, file = "Outputs/3_occurrence.RData")
The .RData file format is specifically suited to holding multiple R objects in a single file. So the file “3_occurrence.RData”, which is housed in my Outputs folder, contains the objects occ_full, pres_ab, and raster.
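Then, at the top of the next step's script (say, 4_phylo_regionalization.R, a made-up name for illustration), all three objects come right back with a single line:
load("Outputs/3_occurrence.RData")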
What's great about this format is that it cuts down on object clutter, makes it easy to understand what data is going where, and dramatically cuts down on my time spent editing.
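One related trick: if you just want to peek inside a .RData file without dumping its objects into your current environment, load() can target a throwaway environment instead:
# Load into a separate environment and list its contents
peek <- new.env()
load("Outputs/3_occurrence.RData", envir = peek)
ls(peek)
# returns: "occ_full" "pres_ab" "raster"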
If you want a short template for how this works, I'll leave it here:
# Each script starts with the libraries you need
library(tidyverse)
library(sf)

# Then we load in any data we need
load("Outputs/1_first_step_of_pipeline.RData")

# Then I start doing all that I need my script to do
# Use
# Your
# Imagination
# Assume
# It's
# Related
# To
# Amphibians

# Then we save the R objects we want to export. I try to keep this list as small as possible
save(object1, object2, object3, object4, file = "Outputs/2_second_step_of_pipeline.RData")
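And when I want to re-run the whole pipeline from a clean slate, I can just source() each script in order (using the template's made-up file names):
source("Scripts/1_first_step_of_pipeline.R")
source("Scripts/2_second_step_of_pipeline.R")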
Of course, if you want to level up your R game, check out our Fundamentals of R for Biologists online course, where we cover this best practice as well as many more skills related to data cleaning, data manipulation, and data visualization!