# FanficReadeR

A package for scraping public data from the fanfiction website ArchiveOfOurOwn (AO3).
## The Package

FanficReadeR scrapes data from ArchiveOfOurOwn (AO3), one of the world's leading fanfiction websites, with more than 3.7 million registered users and 7.6 million works listed on the platform. This package gathers data about three broad categories: Fandoms, Users, and Works.
### Fandoms

Works are organized into fandoms: the media franchises each work is a fanfiction of. Fandoms are things like "Harry Potter", "Percy Jackson", or "Stranger Things".
### Works

Works are simply stories written by users. Works contain chapters, and all works have at least one chapter. Works can be either incomplete (chapters are still being posted) or complete (the author has marked the work as finished).
Each work has a series of attributes associated with it:
- Length: defined as either chapter-count or word-count
- Kudos: the AO3 equivalent of “likes”
- Comments: responses to the fic generated by readers (with timestamps)
- Bookmarks: the number of times a fic has been “saved” by a reader to display on their profile
- Pairings: denotes the romantic/sexual pairings of characters in the story (F/F, F/M, M/M, Multi, etc)
- Fandoms: lists all the fandoms a work is associated with. This is important because many fics are “crossovers”, using intellectual property from multiple different story universes (e.g.: a Harry Potter/Twilight crossover might have Bella attending Hogwarts)
- Published: the date the fiction was first posted to AO3
- Completion Status: whether the fiction is complete or not
- Completed: the date the fiction was completed
- Last updated: the date the fiction was last updated
- Language: the language the story is written in
- Characters: the characters involved in the story, and who they "pair off" with
### Users
There are a number of features about users that may prove interesting to an external observer.
- Bookmarks: a list of all the fics a user has saved for later, or wishes to recommend to others
- Works: a list of all the stories the author has written or is in the process of writing
- Biography: a set of user characteristics like join date, user ID, name, and any pseudonyms they may have
- Join date: the date a user joined the platform
- User ID: the user's unique ID
## Functions

FanficReadeR uses a small set of functions to generate this data about individual works and/or authors.
### Fandom Data

Fandoms can be searched in one of two ways: `GetFandomIndex()` and `GetSearchIndex()`. `GetFandomIndex()` simply searches for all fictions within a fandom. For example, if you wanted an index of all fictions within the Harry Potter fandom, this function would return all of them, up to a maximum of 5,000 pages of results. Each page lists 20 fictions, so the most `GetFandomIndex()` can return is 100,000 fictions. This limit only matters for the largest fandoms, and it can be worked around by calling `GetSearchIndex()` with iterated date ranges.
`GetSearchIndex()` is a more advanced tool for gathering information on fictions. Rather than returning all relevant fictions, `GetSearchIndex()` returns only the fictions that meet particular criteria, such as when a fiction was updated (`date_from` to `date_to`) or its completion status (complete, incomplete, or all fictions). Basic descriptions of each function are listed below:
| Function | Inputs | Description |
|---|---|---|
| `GetFandomIndex()` | Fandom name, max pages to collect, start page (default = 1) | Gathers an index of fanfiction URLs for a given fandom, output as a dataframe. Currently, this function selects the most recently updated fanfictions in the given fandom. When using this function, you need to specify how many pages of results to gather. AO3 displays 20 results per page, so to gather 20 fanfiction URLs you would set the number of pages to 1; for 100 URLs, set it to 5. |
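As an illustration, a fandom index scrape might look like the sketch below. The argument names used here (`fandom`, `max_pages`, `start_page`, `date_from`, `date_to`, `complete`) are assumptions based on the descriptions above, not confirmed signatures — check the package help pages for the real ones.

```r
library(FanficReadeR)

# Gather the 100 most recently updated Harry Potter works (5 pages x 20 works).
# Argument names are illustrative; see ?GetFandomIndex for the real signature.
hp_index <- GetFandomIndex("Harry Potter - J. K. Rowling",
                           max_pages = 5, start_page = 1)

# For fandoms larger than 100,000 works, iterate GetSearchIndex() over date
# ranges so that no single query exceeds the 5,000-page cap.
months  <- seq(as.Date("2023-01-01"), as.Date("2023-06-01"), by = "month")
indexes <- lapply(head(seq_along(months), -1), function(i) {
  GetSearchIndex(
    fandom    = "Harry Potter - J. K. Rowling",
    date_from = months[i],
    date_to   = months[i + 1] - 1,  # end the day before the next window starts
    complete  = "all"
  )
})
all_works <- do.call(rbind, indexes)  # combine the windows into one dataframe
```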
### Works Data
| Function | Inputs | Description |
|---|---|---|
| `GetWorksInfo()` | Work link OR chapter link | Gathers basic summary data about the work in question, including: work title, completion status, user engagement (kudos, comments, bookmarks, hits), and romantic pairings (M/M, F/M, F/F, Multi, etc.) |
| `GetChapterIndex()` | Work link OR chapter link | Creates an index of all chapters in the relevant work, listing their names and chapter order and providing a URL for each |
| `GetComments()` | Work link OR chapter link | Gathers all comments on the relevant work. For each comment, this function also reports which user made the comment, whether that user was the author, when the comment was made, and which chapter it was made on. Because this can generate large amounts of data, the function has a few extra options: 1) `keep.text = TRUE` is the default, and preserves the original text of each comment in the output; 2) `excl.author = FALSE` is the default; setting it to `TRUE` removes comments made by the author on their own work |
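Putting the three together, a single-work pipeline might look like the following sketch. The work URL is a placeholder, and the option names are taken from the table above; verify them against the package documentation.

```r
library(FanficReadeR)

# Placeholder URL -- substitute a real AO3 work or chapter link.
work_url <- "https://archiveofourown.org/works/12345678"

info     <- GetWorksInfo(work_url)     # title, status, kudos, comments, bookmarks, hits
chapters <- GetChapterIndex(work_url)  # one row per chapter, with chapter URLs
comments <- GetComments(work_url,
                        keep.text   = TRUE,  # keep original comment text (default)
                        excl.author = TRUE)  # drop the author's own replies
```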
## Rate Limits

AO3 limits how many requests can be made to their website. Their servers use the `rack-attack` middleware, which limits requests to AO3's website to 60 per minute (the exact rate limit is actually 300 requests per 300 seconds, per the open-source AO3 GitHub). In practice, however, AO3's throttle threshold is much smaller than this. I've tested the functions here many times, and at no point will you consistently achieve 60 requests per minute without hitting an HTTP 429 error for making too many requests.
I've experimented with different parameters for a `Sys.sleep()` call attached to every HTML request, and found that a 5.5-second delay per request is the smallest delay that lets these functions run continuously without error. This is obviously not ideal, and frankly I don't know why the OTW archive says they have a 1 req/s limit when in practice their limits are more like 0.2 req/s.
Unless you’re trying to gather large quantities of data, this shouldn’t matter all that much to you. The scraper works relatively fast for single-fanfic scrapings, but the delays will make it so that large-scale scraping efforts could take multiple hours or even days. I’d appreciate any suggestions for improving the speed at which these functions can continuously collect data.
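For long-running scrapes, a defensive retry layer can help. The package's functions already pause between requests internally, so the sketch below only handles the occasional HTTP 429 by backing off and retrying; the use of `GetWorksInfo()` and the `retries`/`backoff` parameters are illustrative, not part of the package.

```r
# Hypothetical retry wrapper around a FanficReadeR call. On failure
# (e.g. an HTTP 429), wait out the rate-limit window and try again.
scrape_with_retry <- function(url, retries = 3, backoff = 60) {
  for (attempt in seq_len(retries)) {
    result <- tryCatch(GetWorksInfo(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(backoff)  # back off before the next attempt
  }
  NULL  # give up after `retries` failed attempts
}
```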
## Examples

See `example_scrape.qmd` for a functioning scraper workflow that gets data on ~3,000 Harry Potter fanfictions, as well as the associated comment sections and authors.
## Future roadmap

While this package can gather most of the information from AO3 that any researcher might want, there are still several ways in which it could be (and will be) improved. Specifically:

- Improve `GetSearchIndex()` to encompass a broader range of search parameters
- Add higher-level wrapper functions that combine several major functions, reducing the need for users to write their own loops when scraping data on many works
- Add unit tests with `testthat` to check whether future changes to the package break anything
- Add more informative `roxygen2` documentation
## Installation

You can install FanficReadeR with the following code:

```r
install.packages("devtools") # if you have not installed the "devtools" package
devtools::install_github("SEthanMilne/FanficReadeR")
```
## Citation

If you use this package for academic purposes, I ask that you cite me using the information below:

Ethan Milne (2024). FanficReadeR. R package version 1.0.

A BibTeX entry for LaTeX users is:

```
@Manual{,
  title  = {FanficReadeR},
  author = {Ethan Milne},
  year   = {2021},
  note   = {R package version 1.0},
}
```