releasing-research-code/notebooks/code_checklist-analysis.Rmd

---
title: "ML Code Completeness Checklist Analysis"
output:
  pdf_document: default
  html_notebook: default
---

This notebook contains the ML Code Completeness analysis for NeurIPS 2019 repositories.

For a run & rendered version of this notebook please see: [code_checklist-analysis.pdf](code_checklist-analysis.pdf).

Official repositories for NeurIPS 2019 papers fetched from: https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019

A random 25% sample has been selected and manually annotated according to the 5 critera of the ML Code Completness Checklist. The result has been saved into `code_checklist-neurips2019.csv`.

```{r}
library(tidyverse)
library(ggplot2)
library(MASS)
library(RColorBrewer)

t = read_csv("code_checklist-neurips2019.csv")
cat("Number of rows:", nrow(t), "\n")
```

We'll focus only on Python repositories, since this is the dominant language in ML and repositories in other languages tend to have a smaller number of stars just because the community is smaller.

```{r}
t = t[t$python==1,]
cat("Number of rows:", nrow(t), "\n")
```

Next, we calculate the score as a sum of of individual checklist items and calculate summary stats.

```{r}
t$score = rowSums(t[,4:8])
```

We group repositories based on their score and calculate summary stats.

```{r}

cat("Spread of values in each group:\n")
summaries = tapply(t$stars, t$score, summary)
names(summaries) = paste(names(summaries), "ticks")
print(summaries)

cat("Proportion of repos in each group:\n")
props = tapply(t$stars, t$score, length)
props = props/sum(props)
names(props) = paste(names(props), "ticks")
print(props)

# Extract medians
medians = unlist(lapply(tapply(t$stars, t$score, summary), function(x) x["Median"]))
names(medians) = paste(sub(".Median", "", names(medians)), "ticks")
```

Generate summary graphs.

```{r}
par(oma=c(0,1,0,1))
layout(matrix(c(1,2), 1, 2, byrow = TRUE), widths=c(3,2))
barplot(medians,
        xlab="",
        ylab="Median GitHub stars", ylim=c(0,200),
        col=brewer.pal(6, "Blues"), cex.axis=0.6, cex.names=0.6)
mtext("GitHub repos grouped by number of ticks on ML code checklist", side=1, line=3, cex=0.6)

pie(rev(props), col=rev(brewer.pal(6, "Blues")), cex=0.6)
mtext("Proportion of repositories in each group", side=1, line=3, cex=0.6)
```

Compare using box plots.

```{r}
tp = t
tp$score = as.factor(tp$score)
par(mfrow=c(1,1))
boxplot(stars~score, data=t, ylim=c(0,200), col=brewer.pal(6, "Blues"),
        xlab="ML code checklist ticks", ylab="Github stars")
```

Fit robust regression and test significance of results

```{r}
print(summary(rlm(stars~training+evaluation+pretrained_model+results+dependencies, data=t)))

for(i in 0:4){
  cat("\nScore5 vs Score", i, "\n")
  print(wilcox.test(t$stars[t$score==5], t$stars[t$score==i]))
}
```

### Session information

```{r}
sessionInfo()
```