Another fun project: during my comprehensive exams I was given a large set of JCR papers to read and review. Since I had all these PDFs lying around, it was a good opportunity to learn more about automated data extraction from PDF documents. I decided to look at the distribution of P values in all 2019-2020 JCR papers.

# Required Packages

```
library(pdftools)
library(tidyverse)
library(ggthemes)
library(ggtext)
library(ggpubr)
```

# Extract P Values

The code chunk below extracts the P values from the papers in the following way:

- Get the names of all papers in the folder containing them
- Loop through each paper name and use the `pdftools` package to extract its raw text
- Do some basic cleaning (e.g. getting rid of “\n”, which denotes a line break)
- Extract all strings of text that start with `p >`, `p <`, or `p =` and are followed by a decimal without a leading zero, such as `.05`
- Output a data frame with 2 columns: `P_Value` and `Paper`
```
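# List every PDF in the folder
files <- list.files("PDFs")

# Empty data frame to collect the results
Results <- data.frame(matrix(ncol = 2, nrow = 0))
names(Results) <- c("P_Value", "Paper")

for (i in seq_along(files)) {
  name <- files[i]

  # pdf_text() returns one string per page
  text <- pdf_text(
    here::here("blog", "jcr_pvals", "PDFs", name)
  )

  # Basic cleaning: drop line breaks and spaces
  text <- gsub("\n", "", text)
  text <- gsub(" ", "", text)

  # Match "p", an operator, and a decimal without a leading zero (e.g. "p<.05")
  values <- unlist(str_extract_all(text, 'p\\s?[=<>]\\s?\\.\\d{1,4}')) |>
    as.data.frame() |>
    mutate(Paper = name)

  # Skip papers with no matches
  if (nrow(values) == 0) {
    next
  } else {
    names(values) <- c("P_Value", "Paper")
    Results <- rbind(values, Results)
  }
}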
```
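To see what the extraction pattern does and does not pick up, here is a minimal sketch that runs the same regular expression on a made-up sentence. The sample text is invented for illustration, and the sketch uses base R (`regmatches()` and `gregexpr()`) rather than `str_extract_all()` so that it runs without any PDFs or extra packages.

```r
# Same pattern as in the extraction loop
pattern <- "p\\s?[=<>]\\s?\\.\\d{1,4}"

# Invented sample text, not taken from any paper
sample_text <- paste(
  "The effect was significant (p < .05),",
  "but the interaction was not (p = .213).",
  "A third test reported p = 0.04."
)

# Pull out every match in the string
matches <- regmatches(
  sample_text,
  gregexpr(pattern, sample_text, perl = TRUE)
)[[1]]

print(matches)
```

Only the first two values are matched. The pattern expects the decimal point right after the operator, so a value written with a leading zero (like `p = 0.04`) is skipped, as is anything written with an uppercase `P`.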

# Cleaning

Now that I have every P value, I need to extract the actual number. This is moderately challenging: P values are often reported without the leading `0` (e.g. p = .07), and a p value that is reported as **greater than** 0.05 is different from one that is **equal** to or **less than** it, so those differences need to be recorded somewhere for any future work I may do.

In summary, the code below does the following:

- Extract the “raw numeric value” from each reported p value string
- Replace the leading “p [<=>]” with a “0”
- Convert this column to numeric
- Store the comparison operator in its own column

```
Cleaned_Results <- Results |>
  mutate(
    # Replace the leading "p", operator, and optional spaces with "0"
    # (e.g. "p < .05" becomes "0.05"), then convert to numeric
    Raw_Value = as.numeric(str_replace(P_Value, "^p\\s?[=<>]\\s?", "0")),
    # Keep the operator so "<", "=", and ">" can be told apart later
    Operator = str_extract(P_Value, "[=<>]")
  )
```
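A quick way to sanity-check the cleaning step is to run it on a few hand-written strings. The sketch below is a base R version that uses a single regular expression. The input strings are made up, and the names `p_strings` and `cleaned` are only for illustration.

```r
# Made-up examples of strings the extraction step could return
p_strings <- c("p < .05", "p=.213", "p >.1", "p = .0004")

# Swap the leading "p", operator, and optional spaces for a "0",
# then convert the result to numeric
raw_value <- as.numeric(sub("^p\\s?[=<>]\\s?", "0", p_strings, perl = TRUE))

# Keep the operator on its own so it is not lost
operator <- regmatches(p_strings, regexpr("[=<>]", p_strings))

cleaned <- data.frame(
  P_Value = p_strings,
  Raw_Value = raw_value,
  Operator = operator
)

print(cleaned)
```

If any `Raw_Value` comes back as `NA`, the string had a shape the pattern did not anticipate, which is worth checking before plotting.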

# Plotting

Finally, I wanted to plot the raw p values I’ve found. There’s little analytic code here; it is mostly ggplot aesthetic wrangling. I used the `{ggthemes}` package for R to get the `theme_fivethirtyeight` function, which gives me a lot of aesthetic power, for lack of a better term. The chunk below applies `theme_pubclean()` from `{ggpubr}` as its base theme.

I’ve added X axis breaks at the standard p value thresholds - 0.1, 0.05, 0.01, 0.001. We should expect that there is significant clustering around these thresholds, as most researchers seem to report inequalities (p < x) rather than exact values.
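One way to check that expectation before plotting is to tabulate how many values sit exactly on a threshold, split by operator. This is a minimal sketch on a made-up data frame that mimics the `Raw_Value` and `Operator` columns. The values in `toy` are invented and are not results from the papers.

```r
# Hypothetical stand-in for the cleaned results (values invented)
toy <- data.frame(
  Raw_Value = c(0.05, 0.05, 0.01, 0.001, 0.213, 0.05, 0.032),
  Operator  = c("<", "<", "<", "<", "=", ">", "=")
)

# The standard thresholds mentioned above
thresholds <- c(0.1, 0.05, 0.01, 0.001)

# Flag values that sit exactly on a threshold
toy$At_Threshold <- toy$Raw_Value %in% thresholds

# Cross-tabulate operator against the threshold flag
counts <- table(Operator = toy$Operator, At_Threshold = toy$At_Threshold)

print(counts)
```

In this toy data every inequality lands on a threshold while the exact values do not, which is the pattern the histogram would show as spikes.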

P values also span several orders of magnitude, so the X axis is put on a log scale for easier interpretation.

```
ggplot(Cleaned_Results, aes(x = Raw_Value)) +
  geom_histogram(bins = 50, fill = "black") +
  # Log scale with breaks at the usual p value thresholds
  scale_x_log10(
    breaks = c(0.0001, 0.001, 0.01, 0.05, 0.1, 1),
    labels = c(".0001", ".001", ".01", ".05", ".1", "1")
  ) +
  theme_pubclean() +
  theme(
    axis.title.x = element_text(),
    axis.title.y = element_text(),
    # Render the HTML-styled title via ggtext
    plot.title = element_markdown(),
    plot.background = element_rect(
      colour = "black",
      fill = NA,
      size = 2
    )
  ) +
  labs(
    title = "<span style = 'color: #ed713a;'>Distribution of P Values</span>",
    subtitle = "Appearing in JCR, 2019-2020 editions",
    x = "P Values",
    y = "Count"
  )
```
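As a rough illustration of why the log scale helps, the sketch below compares how much of the X axis the region at or below .05 would occupy on a linear scale versus a log10 scale, using the same breaks as the plot. The variable names are just for this example.

```r
# Same axis breaks as in the plot
breaks <- c(0.0001, 0.001, 0.01, 0.05, 0.1, 1)

# Where each break falls after a log10 transform
log_positions <- log10(breaks)

# Share of the axis taken up by values at or below .05
linear_share <- (0.05 - min(breaks)) / (max(breaks) - min(breaks))
log_share <- (log10(0.05) - min(log_positions)) /
  (max(log_positions) - min(log_positions))

print(round(log_positions, 2))
print(round(c(linear = linear_share, log = log_share), 3))
```

On a linear axis everything at or below .05 is squeezed into roughly 5% of the width, while on the log axis it takes up about two thirds, so the thresholds of interest become distinguishable.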