Getting Started with Reproducible Research…

Goal: Make Research Reproducible

What?

The Turing Way project illustration by Scriberia. DOI.

Different kinds of “reproducible”

Today, we’re focused on “Reproducible”

Why?

How?

Lots of ways!

Barriers to Reproducibility

If reproducibility is such a great thing to ensure in research, why are we here in 2024 learning about it (possibly for the first time)?

A slide outlining some of the barriers to reproducible research from Kirstie Whitaker’s talk about The Turing Way at csv,conf,v4 in May 2019. DOI

Reproducibility is hard

Fortunately, reproducibility is not “all-or-nothing.”

Even some reproducibility is better than none!

Today, let’s focus on “good enough”: the least amount of work to get a respectably reproducible research project.

What we’ll cover

🔁 Version control
🪪 Licensing
🌅 Environments

🔁 Version Control

Version control is a workflow where the entire history of a set of documents is preserved.

Why do we want our work under version control?

Why version control

Tracking project history. DOI.

Provenance
Version history
Hide older versions
Distributed work

Create an account on GitHub

Create a repo

Fill out the form information.

Repo name: Something pithy that describes the project you’re working on.
Description: A one-line summary of what you’re using this repo for.
Public / Private: We suggest “public” by default, but keeping the repo “private” during development and making it “public” during and after review also makes sense.
Add a README: This is where you’ll go into detail about what your repo is doing. Definitely add this!
Add .gitignore: A bit more technical; depends on the work you’re doing.
Choose a license: Absolutely yes–we’ll go into more detail in the next section.

Create a folder structure

This is a suggestion!
Use what is relevant, ignore what is not
(PSA: Don’t put full data in version control. Sample data is great! Or, add data/ to .gitignore)
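
As one way to act on the suggestions above, here is a minimal Python sketch that creates a starter layout and adds data/ to .gitignore. The folder names and the scaffold_project helper are hypothetical, chosen only for illustration; swap in whatever fits your project.

```python
from pathlib import Path

# Hypothetical folder names; the layout is only a suggestion, so adapt freely.
FOLDERS = ["data", "src", "notebooks", "results", "docs"]


def scaffold_project(root: str) -> Path:
    """Create a starter folder layout and a .gitignore that excludes data/."""
    root_path = Path(root)
    for name in FOLDERS:
        (root_path / name).mkdir(parents=True, exist_ok=True)

    # Keep full datasets out of version control by ignoring the data/ folder.
    gitignore = root_path / ".gitignore"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    if "data/" not in lines:
        lines.append("data/")
    gitignore.write_text("\n".join(lines) + "\n")
    return root_path
```

Calling scaffold_project("my-project") a second time is harmless: existing folders are left alone and data/ is not added to .gitignore twice.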

Add everything to version control

Version control workflow
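
One pass through the workflow might look like the sketch below, assuming git is installed and a user name and email are already configured. The helper names commit_commands and run_workflow are hypothetical; the git subcommands themselves (init, add, commit) are standard.

```python
import shutil
import subprocess
from pathlib import Path


def commit_commands(message: str) -> list:
    """Return the git commands for one stage-and-commit cycle."""
    return [
        ["git", "add", "--all"],           # stage every new or changed file
        ["git", "commit", "-m", message],  # record a snapshot with a message
    ]


def run_workflow(repo: str, message: str) -> None:
    """Run the cycle inside `repo`, initialising the repository if needed."""
    if shutil.which("git") is None:
        raise RuntimeError("git is not installed or not on PATH")
    if not (Path(repo) / ".git").exists():
        subprocess.run(["git", "init"], cwd=repo, check=True)
    for cmd in commit_commands(message):
        subprocess.run(cmd, cwd=repo, check=True)
```

Sending commits to GitHub additionally needs a remote and git push, which is left out here. Note that git commit exits with an error when there is nothing new to commit, so run_workflow will raise in that case.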

🪪 Licensing

Licensing is how we spell out the rights of others to use, modify, or build on our work.

Why is it important to give our work (code, data, content) a license?

Patents, Trademarks, and Copyright

Multiple kinds of intellectual property protection.

Here, we’re focusing on Copyright.

Copyright can cover usage rights for:

Code
Data
Hardware
ML models
Content (slides, books, pictures, figures…)

Choosing a license

How does one pick a license?

Fortunately, there’s a handy flowchart!

https://choosealicense.com/

Adding a license file to your repo

You can do this right when you create the repo!

🌅 Environments

Making your environment reproducible means configuring your workspace (code, software, programs, even the operating system) so that it is identical to the environment in which the research was originally done.

Why is it important to have reproducible environments?

How to “capture” an environment

Computational environments. DOI.

Virtual machines
Docker / containers
Conda / Mamba
Binder

VMs are perhaps the “easiest” way of passing an environment around, but certainly the heaviest: they often run well into the gigabytes. Preferable to this would be something more akin to a recipe for building an environment locally.
This “recipe” idea is what containers aim for. A Dockerfile spells out the “recipe” for the environment, which is then executed locally to build it.
Where VMs and containers recreate entire compute environments, conda / mamba are package managers specific to coding environments (they often run inside VMs or containers). They can also express a recipe: the specific packages, and their versions, that make up a research environment.
Binder is a self-contained platform that builds a coding environment much as conda / mamba do, but without needing the command line or a local VM or container; all you need is a repo (on GitHub) and a web browser.
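
To make the “recipe” idea concrete, here is a minimal sketch that records the current Python environment using only the standard library. It is a lighter-weight stand-in for illustration, not a replacement for a conda environment file or a Dockerfile; the capture_environment name and the shape of the output are assumptions.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment() -> dict:
    """Collect the interpreter version, OS, and pinned package versions."""
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip entries with missing metadata
    )
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "packages": packages,
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so it can be saved alongside the project.
    json.dump(capture_environment(), sys.stdout, indent=2)
```

Saving this output in the repo gives collaborators a record of which package versions the work was run with, even before a full environment recipe exists.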