# FanficReadeR

A package for scraping public data from the fanfiction website ArchiveOfOurOwn (AO3).
## The Package

FanficReadeR scrapes data from ArchiveOfOurOwn (AO3), one of the world's leading fanfiction websites, with more than 3.7 million registered users and 7.6 million works listed on the platform. This package gathers data about three broad categories: Fandoms, Users, and Works.
### Fandoms

Works are organized into fandoms: the media franchises each work is a fanfiction of. Fandoms are things like "Harry Potter", "Percy Jackson", or "Stranger Things".
### Works

Works are simply stories written by users. Works contain chapters, and all works have at least one chapter. Works can be either incomplete (chapters are still being posted) or complete (the author has marked the work as finished).
Each work has a series of attributes associated with it:
- Length: defined as either chapter-count or word-count
- Kudos: the AO3 equivalent of “likes”
- Comments: responses to the fic generated by readers (with timestamps)
- Bookmarks: the number of times a fic has been “saved” by a reader to display on their profile
- Pairings: denotes the romantic/sexual pairings of characters in the story (F/F, F/M, M/M, Multi, etc)
- Fandoms: lists all the fandoms a work is associated with. This is important because many fics are “crossovers”, using intellectual property from multiple different story universes (e.g.: a Harry Potter/Twilight crossover might have Bella attending Hogwarts)
- Published: the date the fiction was first posted to AO3
- Completion Status: whether the fiction is complete or not
- Completed: the date the fiction was completed
- Last updated: the date the fiction was last updated
- Language: the language the story is written in
- Characters: the characters involved in the story, and who they "pair off" with
### Users
There are a number of features about users that may prove interesting to an external observer.
- Bookmarks: a list of all the fics a user has saved for later, or wishes to recommend to others
- Works: a list of all the stories the author has written or is in the process of writing
- Biography: a set of user characteristics like join date, user ID, name, and any pseudonyms they may have
- Join date: the date a user joined the platform
- User ID: the user's unique ID
## Functions

FanficReadeR uses a small set of functions to generate this data about individual works and/or authors.
### Fandom Data

Fandoms can be searched in one of two ways: `GetFandomIndex()` and `GetSearchIndex()`. `GetFandomIndex()` simply searches for all fictions within a fandom. For example, if you wanted an index of all fictions within the Harry Potter fandom, this function would return all of them, up to a maximum of 5,000 pages of results. Each page lists 20 fictions, so the most `GetFandomIndex()` can return is 100,000 fictions. This limit only matters for the largest fandoms, and it can be worked around by calling `GetSearchIndex()` with iterated date ranges.
`GetSearchIndex()` is a more advanced tool for gathering information on fictions. Rather than returning all relevant fictions, `GetSearchIndex()` returns only the fictions that meet particular criteria, such as when a fiction was updated (`date_from` to `date_to`) or its completion status (complete, incomplete, or all fictions). Basic descriptions of each function are listed below:
| Function | Inputs | Description |
|---|---|---|
| `GetFandomIndex()` | Fandom name, max pages to collect, start page (default = 1) | Gathers an index of fanfiction URLs for a given fandom, output as a dataframe. Currently, this function selects the most recently updated fanfictions in the given fandom. When using this function, you need to specify how many pages of results to gather. AO3 displays 20 results per page, so to gather 20 fanfiction URLs you would set the number of pages to 1; for 100 URLs, set it to 5. |
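As an illustration, a fandom index scrape might look like the sketch below. The argument names used here (`fandom`, `max_pages`, `start_page`, `date_from`, `date_to`, `complete`) are assumptions based on the descriptions above, not confirmed signatures — check the package help pages for the real ones.

```r
library(FanficReadeR)

# Gather the 100 most recently updated Harry Potter works (5 pages x 20 works).
# Argument names are illustrative; see ?GetFandomIndex for the real signature.
hp_index <- GetFandomIndex("Harry Potter - J. K. Rowling",
                           max_pages = 5, start_page = 1)

# For fandoms larger than 100,000 works, iterate GetSearchIndex() over date
# ranges so that no single query exceeds the 5,000-page cap.
months  <- seq(as.Date("2023-01-01"), as.Date("2023-06-01"), by = "month")
indexes <- lapply(head(seq_along(months), -1), function(i) {
  GetSearchIndex(
    fandom    = "Harry Potter - J. K. Rowling",
    date_from = months[i],
    date_to   = months[i + 1] - 1,  # end the day before the next window starts
    complete  = "all"
  )
})
all_works <- do.call(rbind, indexes)  # combine the windows into one dataframe
```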
### Works Data
| Function | Inputs | Description |
|---|---|---|
| `GetWorksInfo()` | Work link OR chapter link | Gathers basic summary data about the work in question, including: work title, completion status, user engagement (kudos, comments, bookmarks, hits), and romantic pairings (M/M, F/M, F/F, Multi, etc.) |
| `GetChapterIndex()` | Work link OR chapter link | Creates an index of all chapters in the relevant work, listing their names and chapter order and providing a URL for each |
| `GetComments()` | Work link OR chapter link | Gathers all comments on the relevant work. For each comment, this function also reports which user made the comment, whether that user was the author, when the comment was made, and which chapter it was made on. Because this can generate large amounts of data, the function has a few extra options: 1) `keep.text = TRUE` is the default, and preserves the original text of each comment in the output; 2) `excl.author = FALSE` is the default; setting it to `TRUE` removes comments made by the author on their own work |
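Putting the three together, a single-work pipeline might look like the following sketch. The work URL is a placeholder, and the option names are taken from the table above; verify them against the package documentation.

```r
library(FanficReadeR)

# Placeholder URL -- substitute a real AO3 work or chapter link.
work_url <- "https://archiveofourown.org/works/12345678"

info     <- GetWorksInfo(work_url)     # title, status, kudos, comments, bookmarks, hits
chapters <- GetChapterIndex(work_url)  # one row per chapter, with chapter URLs
comments <- GetComments(work_url,
                        keep.text   = TRUE,  # keep original comment text (default)
                        excl.author = TRUE)  # drop the author's own replies
```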
## Rate Limits

AO3 limits how many requests can be made to their website. Their servers use the `rack-attack` middleware, which limits requests to AO3's website to 60 per minute (the exact rate limit is actually 300 requests per 300 seconds, per the open-source AO3 GitHub). In practice, however, AO3's throttle threshold is much smaller than this. I've tested the functions here many times, and at no point will you consistently achieve 60 requests per minute without hitting an HTTP 429 error for making too many requests.
I've experimented with different parameters for a `Sys.sleep()` call attached to every HTML request, and found that a 5.5-second delay per request is the smallest delay that lets these functions run continuously without error. This is obviously not ideal, and frankly I don't know why the OTW archive says they have a 1 req/s limit when in practice their limits are more like 0.2 req/s.
Unless you’re trying to gather large quantities of data, this shouldn’t matter all that much to you. The scraper works relatively fast for single-fanfic scrapings, but the delays will make it so that large-scale scraping efforts could take multiple hours or even days. I’d appreciate any suggestions for improving the speed at which these functions can continuously collect data.
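For long-running scrapes, a defensive retry layer can help. The package's functions already pause between requests internally, so the sketch below only handles the occasional HTTP 429 by backing off and retrying; the use of `GetWorksInfo()` and the `retries`/`backoff` parameters are illustrative, not part of the package.

```r
# Hypothetical retry wrapper around a FanficReadeR call. On failure
# (e.g. an HTTP 429), wait out the rate-limit window and try again.
scrape_with_retry <- function(url, retries = 3, backoff = 60) {
  for (attempt in seq_len(retries)) {
    result <- tryCatch(GetWorksInfo(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(backoff)  # back off before the next attempt
  }
  NULL  # give up after `retries` failed attempts
}
```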
## Examples

See `example_scrape.qmd` for a functioning scraper workflow that gets data on ~3,000 Harry Potter fanfictions, as well as the associated comment sections and authors.
## Future roadmap

While this package can gather most of the information from AO3 that any researcher might want, there are still several ways in which it could be (and will be) improved. Specifically:

- Improve `GetSearchIndex()` to encompass a broader range of search parameters
- Add higher-level wrapper functions that combine several major functions, reducing the need for users to write their own loops when scraping data on many works
- Add unit tests with `testthat` to check whether future changes to the package break anything
- Add more informative `roxygen2` documentation
## Installation

You can install FanficReadeR with the following code:

```r
install.packages("devtools") # if you have not installed the "devtools" package
devtools::install_github("SEthanMilne/FanficReadeR")
```
## Citation

If you use this package for academic purposes, I ask that you cite me using the information below:

Ethan Milne (2024). FanficReadeR. R package version 1.0.

A BibTeX entry for LaTeX users is:

```
@Manual{,
  title  = {FanficReadeR},
  author = {Ethan Milne},
  year   = {2021},
  note   = {R package version 1.0},
}
```