A Bookworm query is a JSON object with keys describing the objects to be fetched. Each key is essentially a function argument. The syntax for queries is largely taken from MongoDB.
This definition is open for review, and will probably change before the 1.0 release.
I'm soliciting any comments. What statistics should be added? How should keys be arranged? What should they be named? Is the capitalization format driving you crazy?
You can think of each of the query keys as doing one of four things; some of these take one key, some take a few.
search_limits
(required: default {})
search_limits is the workhorse function that lets you set the words or other fields to be searched. It has its own syntax: see 4.2.1 for details.
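A minimal query, sketched as a Python dict, might look like the following. The field names inside search_limits and groups ("word", "date_year") are illustrative only; the real names depend on the metadata of the individual bookworm being queried.

```python
import json

# A minimal Bookworm query: how often does "evolution" appear per year?
# The metadata field names here are illustrative, not canonical.
query = {
    "database": "ChronAm",
    "method": "return_json",
    "search_limits": {"word": ["evolution"]},
    "groups": ["date_year"],
    "counttype": ["WordsPerMillion"],
}

# Queries travel to the server as plain JSON.
print(json.dumps(query, sort_keys=True))
```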
compare_limits
(optional)
Unlike search_limits, compare_limits is rarely specified manually, but when used it allows particularly complicated queries. Many queries contain an implicit comparison: you wish to return the number of times a word in a set is used as a percentage of all the words used in that set. compare_limits allows you to specify the comparison explicitly.
By default, compare_limits will be the same as search_limits, but with the words key removed: this makes it trivial to search for the percentage of all words.
But this is not always the most immediately useful comparison. If you want to compare how often two words are used, you can put one in search_limits and one in compare_limits.
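A sketch of that two-word comparison, with made-up field names, and of how the default compare_limits is derived when none is given:

```python
# Hypothetical comparison of two spellings: counts matching search_limits
# are reported relative to counts matching compare_limits.
query = {
    "search_limits": {"word": ["color"], "country": ["USA"]},
    "compare_limits": {"word": ["colour"], "country": ["USA"]},
    "groups": ["year"],
    "counttype": ["WordsRatio"],
}

# When compare_limits is omitted, the default is search_limits with
# the words key dropped -- i.e. "as a share of all words in that set".
default_compare = {k: v for k, v in query["search_limits"].items() if k != "word"}
print(default_compare)
```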
"groups"
groups is an array of metadata fields describing what metadata should be returned. Each entry represents an additional layer of complexity: for example, specifying "groups":["year"] will group only by year, while "groups":["year","city"] will group by both year and city.
Be very careful with these choices, because too many groups can quickly make a query unmanageable. If each field has 100 distinct values, a two-field grouping could easily return a 10,000-row result. (Interactions which do not exist in the source data will not be returned, so the actual number will probably be somewhat lower.)
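The growth is multiplicative, which a quick sketch makes concrete (the cardinalities here are made up):

```python
# Rough upper bound on rows returned by a grouping: the product of the
# number of distinct values of each grouped field. Interactions absent
# from the data are not returned, so the real count is usually lower.
cardinalities = {"year": 100, "city": 100}

def max_rows(groups):
    n = 1
    for field in groups:
        n *= cardinalities[field]
    return n

print(max_rows(["year"]))          # 100
print(max_rows(["year", "city"]))  # 10000
```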
Possible fields include any of the user-defined metadata, as well as "unigram" or "bigram" to return wordcount data.
Grouping by "unigram" or "bigram" can be quite slow, and for the time being should only be attempted on subcorpora of, say, 1 million words or less at a time. (On larger corpora, you'll just end up timing out.)
Ordinarily, each ratio summary statistic ("Percentage of Books," say) refers directly to the interaction of group A and group B. Sometimes, this is less than useful.
Ordinarily a query like
{"groups":["year","library"],"counttype":["TextPercent"]}
will give, for each interaction of year and library, the percentage of texts in that cell that come from that particular library in that year. That's not interesting: by definition, it will always be 100%.
On the other hand,
{"groups":["year","*library"],"counttype":["TextPercent"]}
will drop the library grouping on the superset and give the percentage of all texts for that year that come from the library, so each column will sum to 100%.
{"groups":["*year","library"],"counttype":["TextPercent"]}
will drop the year superset and give the percentage of all texts for that library that come from that year and library.
{"groups":["*year","*library"],"counttype":["TextPercent"]}
will drop both and give, in each cell, the percentage of all texts matching the search_limits or compare_limits: the sum of all the TextPercent cells in the entire return set should be 100. (Though it may not be if year or library is undefined for some items.)
Combining this syntax with a separately defined compare_limits will produce some pretty nonsensical queries, so it's generally better to do just one or the other.
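The normalizations the asterisk implies can be sketched on a small made-up table of text counts per (year, library) cell:

```python
# Made-up text counts per (year, library) cell, to show what the
# asterisked groupings normalize against.
counts = {
    (2000, "A"): 30, (2000, "B"): 70,
    (2001, "A"): 50, (2001, "B"): 50,
}

# "groups":["year","*library"]: drop library from the comparison set,
# so each cell is a percentage of all texts in its year -- each year
# column sums to 100.
year_totals = {}
for (year, _lib), n in counts.items():
    year_totals[year] = year_totals.get(year, 0) + n
pct_within_year = {cell: 100 * n / year_totals[cell[0]] for cell, n in counts.items()}

# "groups":["*year","*library"]: drop both, so all cells of the whole
# return set together sum to 100.
grand_total = sum(counts.values())
pct_of_all = {cell: 100 * n / grand_total for cell, n in counts.items()}

print(pct_within_year[(2000, "A")])  # 30.0
print(sum(pct_of_all.values()))      # 100.0
```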
"counttype"
Example: "counttype":["WordsPerMillion"]
counttype is an array of commands that specify what summary statistics will be returned.
The most commonly used values count the number of words or texts matching the search_limits for each group. (If no words key is specified, the sum of all the words in the book is used.) Also available, and useful in some specialized cases involving comparisons, are:
WordsRatio, equal to WordCount/TotalWords (the same information as requesting TotalWords and WordCount); and TextRatio, equal to TextCount/TotalTexts (the same information as requesting TextCount and TotalTexts).
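Since the ratio statistics are simple quotients of the raw counts, a client holding both underlying numbers can reconstruct them; the figures below are made up for illustration:

```python
# WordsRatio is WordCount/TotalWords; scaling by a million gives the
# per-million form. Numbers are invented for illustration.
row = {"WordCount": 250, "TotalWords": 1_000_000}
words_ratio = row["WordCount"] / row["TotalWords"]
words_per_million = words_ratio * 1_000_000
print(words_per_million)  # 250.0
```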
database
(required: server-specific defaults)
Example: {"database":"ChronAm"}
A single server can contain several bookworms: this is a string describing which one to run queries on.
method
(required: default "return_json")
The type of results to be returned. For standard queries, this should be one of the standard return formats, which return the summary statistics requested in counttype.
There are also some special methods that override other settings. One returns the fields that can be used in groups or in search_limits, along with some data about their type; for it, all fields but "database" are ignored. Another, "search_results", returns the texts matching the search_limits. "groups" is ignored, and "counttype" is used in a special way (see ordertype). By default only the first 100 results are returned; there is currently no way to page past them.
ordertype
(default: dynamic)
In progress: comments welcome.
When method is "search_results", the books are sorted before being returned. This sort ordering can be controlled.
By default, results are sorted by the percentage of hits in the text. That biases towards either texts that use the words a lot, or texts that use them rarely.
Often you want not the top texts, but some representative texts; random sorting serves this purpose.
Currently, random sorting is handled in an interesting way. If the counttype relies on the number or ratio of texts, it sorts the texts in random order.
If the counttype relies on the number or ratio of words, however, it tries to sort the texts randomly, weighted by the number of times the words appear in each. This means that a random word from the first text should represent a random usage from the overall sample.
The current MySQL-python implementation uses an approximation for this: LOG(1-RAND())/sum(main.count)
that should mimic a weighted random ordering for most distributions, but in some cases it may not behave as intended.
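The approximation leans on a standard trick: for a uniform u, -log(1-u)/w is an exponential variate with rate w, and the minimum of independent exponentials falls on each item with probability proportional to its rate. A small simulation, with made-up weights, shows a heavily weighted text winning the top position about as often as its share of the words:

```python
import math
import random
from collections import Counter

random.seed(42)

# Two texts with made-up match counts: "big" contains the search term
# nine times as often as "small".
weights = {"big": 90, "small": 10}

def first_text():
    # Mimic ORDER BY LOG(1 - RAND()) / count DESC: -log(1-u)/w is an
    # Exponential(w) variate, so the text with the largest (least
    # negative) key wins with probability proportional to its weight.
    keys = {t: math.log(1 - random.random()) / w for t, w in weights.items()}
    return max(keys, key=keys.get)

draws = Counter(first_text() for _ in range(10_000))
print(draws["big"] / 10_000)  # roughly 0.9
```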
In progress. True weighted random ordering will be more expensive in time but potentially useful.
Depending on the usefulness of search ordering, this could be extended to support other sort orders.
words_collation
(optional: default "case_sensitive")
Example: "words_collation":"case_sensitive"
A string representing how to handle case matching on the "words" term in groupings.
Possible values:
In progress: I'm inclined to think this should be eliminated and instead users could specify 'casesens','case_insens' or 'stem' directly, and the API would translate the results appropriately. It's slightly uglier, but would allow more complicated queries (such as mixing case sensitive and insensitive in the same limits, or using separate values for groupings and search limits)
If you build a web or analysis app using Bookworm, you're encouraged to use the dict to add other keys storing other elements of the state. For example, the layout preferences for the D3 bookworm are stored in an aesthetic field which maps to a dictionary, and both GUIs use a field called smoothingSpan to represent smoothing.
The advantages of doing this are state persistence for RESTful apps, portability, and more helpful logs.
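For instance, a query carrying GUI state alongside the API keys might look like this; the contents of the aesthetic dictionary are invented for illustration, and only the key names aesthetic and smoothingSpan come from the GUIs described above:

```python
import json

# API keys plus application state in one dict. The extra keys are
# ignored by the API itself but round-trip with the query.
query = {
    "database": "ChronAm",
    "search_limits": {"word": ["evolution"]},
    "groups": ["date_year"],
    "counttype": ["WordsPerMillion"],
    # Application state (illustrative values):
    "aesthetic": {"x": "date_year", "y": "WordsPerMillion"},
    "smoothingSpan": 5,
}

# The whole state serializes as one JSON object, e.g. into a URL.
state = json.dumps(query, sort_keys=True)
print(state)
```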
We may need to reserve a few keys for our own use down the road. So if you do define something, avoid using the following unless you're contributing to a core project:
D3-bookworm reserved
Future authentication needs