What is Bookworm?

Bookworm makes it easy to interactively explore massive collections of texts as data.

If you have a huge collection of texts, it provides a way to interpret them, make them explorable, and open them up to a wide variety of uses, even if you can't share them freely for copyright or other reasons.

For people interested in coding new ways of accessing texts, it makes it possible to build tools for exploring text data that can be easily plugged into many interesting collections, without having to reinvent the wheel of tokenization or build complicated indexing systems.

For researchers, it provides a concise but powerful API for creating complicated queries across textual metadata that can easily be accessed in your statistical analysis framework of choice, whether you want to use a web framework or not.

And for everyday readers, it gives a set of new, useful, and compelling ways to explore digital libraries that get beyond the restrictive search engine to help you discover macro patterns and interesting individual texts.


Bookworm was initially developed at the Harvard Cultural Observatory, founded by Erez Lieberman Aiden and Jean-Baptiste Michel. The project is currently jointly steered by Erez Aiden from the Rice Cultural Observatory and Ben Schmidt from Northeastern University. Matt Nicklay at Rice provides ongoing contributions.

Major contributions between 2011 and 2013 were made by Martin Camacho, Billy Janitsch, Neva Cherniavsky, Erez Aiden, Ben Schmidt, Matt Nicklay, and JB Michel.

Funding and institutional support have been provided by the Harvard Cultural Observatory, the Digital Public Library of America, Northeastern University, and Rice University.

What's in this book

The first half of this book is nuts-and-bolts instructions for building a bookworm; the second half is background and specifications for understanding the architecture, using the API, and extending the functionality of a Bookworm.


I'm going to try to use Bookworm (capitalized) to refer to the project as a whole, and bookworm (lowercase) to describe individual installations.

Editing this document

We welcome fixes or additions based on your experience with the project. File an issue on GitHub, or edit this document directly by opening a pull request against the gh-pages branch of this repository. All edits should be made to the Markdown files in the STATIC directory: the HTML in the base directory is routinely overwritten.