Query Language

Intro

NGRAMS' query language is similar to a very simple form of regular expression. A query, formed as a sequence of terms and query operators, is matched against all indexed ngrams in the selected corpus.

The query

how * you *

returns

2468812   How do you know
 845786   How do you feel
 830062   How did you know
 755042   How are you ?
 739312   How do you do
...

Results are sorted by the ngram's total match count in descending order. See Data Model for details.

Because the underlying ngrams are between 1 and 5 tokens long, a query must be matchable within this range. For example, a query such as one two three four five six with 6 terms will never match. Sometimes a term will be split further by NGRAMS in order to match how the underlying ngrams were tokenized by Google. In this case a query can exceed the limit of 5 tokens. More about that in Tokenization.

Query Operators

Here is the complete list of query operators.

Operator	Name	Description	Example
`*`	Star	Matches one token.	what * day
`**`	StarStar	Matches zero or more tokens.	what ** day
`a / b`	Alternation	Matches either a or b — see Alternation.	what a sunny / rainy day
`"a b"`	TermGroup	Treats multiple terms as one entity — see TermGroup.	you are / "will be" doing
`prefix~`	Completion	Matches tokens starting with prefix.	what an aw~ day
`*_ADJ`	Star_ADJ	Matches one adjective.	I feel *_ADJ
`*_ADP`	Star_ADP	Matches one adposition (preposition or postposition).	working *_ADP home
`*_ADV`	Star_ADV	Matches one adverb.	she sings *_ADV
`*_CONJ`	Star_CONJ	Matches one conjunction.	tea *_CONJ coffee
`*_DET`	Star_DET	Matches one determiner or article.	go *_DET way
`*_NOUN`	Star_NOUN	Matches one single noun.	buy some *_NOUN
`*_NUM`	Star_NUM	Matches one numeral.	buy *_NUM bottles
`*_PRON`	Star_PRON	Matches one pronoun.	bring *_PRON flowers
`*_PRT`	Star_PRT	Matches one particle.	to step *_PRT
`*_VERB`	Star_VERB	Matches one verb.	I *_VERB you
`_START_`	SentenceStart	Matches the start of a sentence. See Sentence Boundary Tags.	_START_ as expected *
`_END_`	SentenceEnd	Matches the end of a sentence. See Sentence Boundary Tags.	as expected * _END_

Alternation

An alternation checks multiple terms at once. The / can be read like a logical OR operator.

what a sunny / rainy / windy * checks

what a sunny *
what a rainy *
what a windy *

TermGroup

A term group treats zero or more terms as one entity. It is only useful within an alternation, i.e. as the left or right side of the / operator. A term group can also be empty to let an alternation check for the empty string.

you are / "will be" / "" doing checks

you are doing
you will be doing
you doing

When a query is parsed from left to right, a token like " or "foo (opening token) is interpreted as the start of a term group. The next " or bar" (closing token) is interpreted as the end of this term group. It is an error if a closing token comes before an opening token. Unintended opening and closing tokens can be avoided using an escape sequence.

Sentence Boundary Tags

_START_ and _END_ are artificial tokens that were inserted by Google after sentence detection. _START_ is placed before the first word of a sentence. _END_ is placed after the punctuation mark that finishes a sentence. As the ngrams in the dataset do not span across sentence boundaries, you will not find these tokens in the middle of an ngram.

Escape Sequences

To disable the semantics of characters or whole tokens that are used as operators, you have to backslash-escape them. For example, to search for a literal * you have to enter \*, and so on. Here is the complete list of escape sequences:

Operator	Escape Sequence	Note
`*`	`\*`
`**`	`\**`
`/`	`\/`
`"a`	`\"a`	`a` can be any term or empty.
`b"`	`b\"`	`b` can be any term or empty.
`prefix~`	`prefix\~`	`prefix` can be any term or empty.
`*_ADJ`	`\*_ADJ`	Same for other part-of-speech wildcards.
`_START_`	`\_START_`
`_END_`	`\_END_`

Tokenization

When Google compiled the dataset, they applied some tokenization (and normalization) to the raw material. For example, punctuation marks that usually follow directly after a term are split off to form a separate token, e.g. hello! became hello !. Another example is the splitting of contractions like they're into they 're for linguistic reasons.

In order to match a query against these ngrams, the query has to be tokenized the same way. NGRAMS does this by default. However, this has the effect that a query can exceed the number of tokens (5) and is not matchable anymore. In this case NGRAMS responds with an error. You then have to make your query shorter.

You can turn auto-tokenization off in search settings.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Query Language

Intro

Query Operators

Alternation

TermGroup

Sentence Boundary Tags

Escape Sequences

Tokenization

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Home

Ngram Dataset

Query Language

Search Settings

REST API

Terms

About

FAQ

Clone this wiki locally