Course Repository
The official course repository containing all the relevant material (slides, recordings, images, etc.) is: https://github.com/federicoruggeri/phdlectures-r3
Introduction
The number of scientific articles published in Computer Science (and similar fields) increases steadily every year. This is mainly due to breakthroughs like Deep Learning and, more recently, Large Language Models.
Paradoxically, researchers are struggling even more to reproduce published research. This issue affects all possible aspects of research, including methodology, data curation, approach comparison, and implementation.
In this course, we’ll introduce and discuss the concept of ‘reproducibility’ in research. In particular, we’ll survey current issues in research and existing attempts to address them. We’ll focus on data curation, experimental setup, model comparison, and programming best practices.
This course is recommended for all types of researchers, from those who have just embarked on their journey to those who have always wondered how certain research managed to get published. See Section Prerequisites for more details.
Part 1: Reproducibility in Research
We discuss the current risks of doing research, characterized by non-reproducible findings, resources (e.g., data, code, artifacts) that are not publicly available, and an unsustainable pressure to publish large numbers of papers in a short time. In this world, whether as an aspiring or an experienced researcher, would you accept (trust) a work that (i) doesn’t provide sufficient information for reproducibility; (ii) doesn’t provide any code; (iii) doesn’t provide the data or guidelines for collecting its contributing dataset; (iv) doesn’t provide training details like model hyper-parameters and data partitioning?
Lecture recordings
Readings
- Stodden and Miguez, 2013 - Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research.
- McNutt, 2014 - Reproducibility.
- Allison et al., 2016 - Reproducibility: A tragedy of errors.
- Baker, 2016 - 1,500 scientists lift the lid on reproducibility.
- Edwards and Roy, 2017 - Academic Research in the 21st Century: Maintaining Scientific Integrity in a Climate of Perverse Incentives and Hypercompetition.
- Raff, 2019 - A Step Toward Quantifying Independently Reproducible Machine Learning Research.
- Bouthillier et al., 2019 - Unreproducible Research is Reproducible.
- Pineau et al., 2020 - Improving Reproducibility in Machine Learning Research.
- Serra-Garcia and Gneezy, 2021 - Nonreplicable publications are cited more than replicable ones.
- Leonelli, 2023 - Philosophy of Open Science.
- Kapoor and Narayanan, 2023 - Leakage and the reproducibility crisis in machine-learning-based science.
Part 2: Data Collection and Annotation
Reproducibility can target different aspects of the research pipeline, from the conceptualization of ideas to collecting resources and running experiments. We cover several aspects of data collection, since it often represents the backbone of most machine learning research. In particular, we cover annotation paradigms, requirements for collecting and annotating data, issues and risks when collecting data, and evaluating annotation quality.
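As a purely illustrative sketch of the last point, annotation quality is often estimated via inter-annotator agreement. Assuming scikit-learn is available and using made-up labels from two hypothetical annotators:

```python
# Minimal sketch: estimating annotation quality via inter-annotator agreement.
# The annotators and labels below are made-up illustrative data.
from sklearn.metrics import cohen_kappa_score

# Two hypothetical annotators labelling the same 8 items (binary task).
annotator_a = [1, 0, 1, 1, 0, 0, 1, 0]
annotator_b = [1, 0, 1, 0, 0, 1, 1, 0]

# Cohen's kappa corrects raw agreement for agreement expected by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```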
Lecture recordings
- Lecture 3 – Data Collection and Annotation.
Readings
- Geva et al., 2019 - Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets.
- Liao et al., 2021 - Are We Learning Yet? A Meta-Review of Evaluation Failures Across Machine Learning.
- Paullada et al., 2021 - Data and its (dis)contents: A survey of dataset development and use in machine learning research.
- Koch et al., 2021 - Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research.
- Röttger et al., 2022 - Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks.
- Cabitza et al., 2023 - Toward a Perspectivist Turn in Ground Truthing for Predictive Computing.
- Ruggeri et al., 2025 - Let Guidelines Guide You: A Prescriptive Guideline-Centered Data Annotation Methodology.
Part 3: Modeling and Experimenting
Reproducibility is equally essential in the remaining stages of a machine learning pipeline: modeling, experimenting, and evaluation. We cover data partitioning, data leakage, random seeding, performance comparison, and metrics.
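As a purely illustrative sketch of two of these points (fixing and reporting random seeds, and fitting preprocessing on the training split only to avoid leakage), assuming scikit-learn and synthetic data:

```python
# Minimal sketch: fixed random seed + leakage-free preprocessing.
# Data and model choices are illustrative; scikit-learn is assumed available.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

SEED = 42  # report the seed(s) you used; ideally repeat runs over several seeds

rng = np.random.default_rng(SEED)
X, y = rng.normal(size=(200, 5)), rng.integers(0, 2, size=200)

# Partition BEFORE fitting any preprocessing statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)

# Fit the scaler on the training split only, then apply it to the test split:
# fitting it on the full dataset would leak test statistics into training.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(random_state=SEED).fit(scaler.transform(X_train), y_train)
print("test accuracy:", model.score(scaler.transform(X_test), y_test))
```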
Lecture recordings
- Lecture 4 – Modeling and Experimenting.
Readings
- Blagec et al., 2020 - A critical analysis of metrics used for measuring progress in artificial intelligence.
- Henderson et al., 2018 - Deep Reinforcement Learning That Matters.
- Dror et al., 2018 - The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing.
- Gorman and Bedrick, 2019 - We need to talk about standard splits.
- Amrhein et al., 2019 - Scientists rise up against statistical significance.
- Forde and Paganini, 2019 - The Scientific Method in the Science of Machine Learning.
- Hovy and Prabhumoye, 2020 - Five sources of bias in natural language processing.
- Azer et al., 2020 - Not All Claims are Created Equal: Choosing the Right Statistical Approach to Assess Hypotheses.
- Dehghani et al., 2021 - The Benchmark Lottery.
- Søgaard et al., 2021 - We Need to Talk About Random Splits.
- van der Goot, 2021 - We Need to Talk About train-dev-test Splits.
- Bethard, 2021 - We need to talk about random seeds.
- Marie et al., 2021 - Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers.
- Kapoor and Narayanan, 2023 - Leakage and the reproducibility crisis in machine-learning-based science.
- Lones, 2024 - How to avoid machine learning pitfalls: a guide for academic researchers.
Part 4: Responsible Research
While there are many reproducibility issues we might encounter, there is also an ongoing effort to develop solutions that mitigate them. One such solution, well aligned with reproducible research, is the adoption of recommendation checklists and documentation artifacts such as model cards and datasheets.
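As a purely illustrative sketch (all field values are placeholders), a model-card-like record loosely following the sections proposed by Mitchell et al. (2019) could be captured alongside your code or configuration:

```python
# Minimal sketch: a model-card-like record, loosely following the sections
# proposed in Mitchell et al. (2019). All field values are placeholders.
model_card = {
    "model_details": {"developers": "...", "version": "...", "license": "..."},
    "intended_use": {"primary_uses": "...", "out_of_scope_uses": "..."},
    "training_data": {"source": "...", "preprocessing": "..."},
    "evaluation_data": {"source": "...", "splits": "..."},
    "metrics": {"accuracy": None, "f1": None},
    "ethical_considerations": "...",
    "caveats_and_recommendations": "...",
}
```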
Lecture recordings
- Lecture 5 – Responsible Research.
Readings
- Arnold et al., 2019 - FactSheets: Increasing trust in AI services through supplier’s declarations of conformity.
- Mitchell et al., 2019 - Model Cards for Model Reporting.
- Gebru et al., 2021 - Datasheets for Datasets.
- Pushkarna et al., 2022 - Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI.
- Kapoor and Narayanan, 2023 - Leakage and the reproducibility crisis in machine-learning-based science.
- Mancini et al., 2025 - Promoting the Responsible Development of Speech Datasets for Mental Health and Neurological Disorders Research.
Part 5: Programming Best Practices
Whether you like it or not, your experimental setting will likely require you to write some code.
Good research code translates to (see the minimal sketch after this list):
- Transparency (don’t you dare do some cheap tricks!)
- Correctness (your code should reflect your paper statements)
- Readability (please, don’t make this a nightmare)
- Efficiency (time is money)
- Maintainability (I’m sure you’ll re-use this code)
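As a purely illustrative example of some of these points (type hints and a docstring for readability and transparency, plus a pytest-style test for correctness; the function and file names are made up):

```python
# metrics.py -- minimal sketch: typed, documented, and testable research code.
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions matching the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)


# test_metrics.py -- run with `pytest`
def test_accuracy() -> None:
    assert accuracy([1, 0, 1], [1, 1, 1]) == 2 / 3
```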
Lecture recordings
Readings
- The Theory of Type Hints
- The state of type hints in Python
- The Correct Way to Overload Functions in Python
- Ultimate Guide to Python Debugging
- Pytest Features, That You Need in Your (Testing) Life
- Making Python Programs Blazingly Fast
- Optimizing Memory Usage in Python Applications
- Python CLI Tricks That Don’t Require Any Code Whatsoever
- Profiling and Analyzing Performance of Python Programs
Part 6: Cinnamon, a Lightweight Python Library for Research
Cinnamon is a lightweight Python library for general-purpose configuration and code logic decoupling.
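To give a feel for the underlying idea (this is not Cinnamon's actual API; all names below are invented for illustration), configuration/logic decoupling means the experiment logic never hard-codes its parameters but receives them from a separately defined configuration object:

```python
# Minimal sketch of configuration/logic decoupling in plain Python.
# This is NOT Cinnamon's actual API; names are invented for illustration.
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    learning_rate: float = 1e-3
    batch_size: int = 32
    epochs: int = 10


def train(config: TrainingConfig) -> None:
    # The logic only reads parameters from the config object,
    # so changing an experiment never requires touching this function.
    for epoch in range(config.epochs):
        print(f"epoch {epoch}: lr={config.learning_rate}, batch={config.batch_size}")


if __name__ == "__main__":
    train(TrainingConfig(learning_rate=5e-4))  # override defaults per experiment
```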
Lecture recordings
- Lecture 8 – Cinnamon: a lightweight python library for research.
Readings
Course History
- 2024-2025: “Robust and Reproducible Research” (16 hours)
- 2022-2023: “Robust and Reproducible Experimental Deep Learning Setting” (10 hours)
Course Info
16 hours. Lecture format: 2-hour-long hybrid lectures.
Prerequisites
Lectures are meant to be interactive.
- Programming: Intermediate
- Deep Learning Theory: Intermediate
- Jupyter Notebook: Beginner