Charting the Geosciences with Google Ngram Viewer

Danita S. Brandt, Department of Earth and Environmental Geosciences, Michigan State University, East Lansing, Michigan 48824,

INTRODUCTION

  Frequency of mention in books can be used to trace the evolution of a discipline, from the first recorded use of the word or phrase to its current standing, as measured by the number of books that include the phrase. Ngram Viewer, the tool developed by a team at Google Books (Michel et al., 2011), places a database (“corpus”) of >500 billion words at the disposal of its users ( Here I describe how this tool can be used to examine patterns suggested by qualitative ideas about the intellectual development of the geosciences. An example of the Ngram Viewer output is given in Figure 1.

N-GRAMS

  An N-gram is a contiguous string of n items from a given sequence of text or speech. A 1-gram (also known as a unigram) is a string of characters uninterrupted by a space, e.g., “trilobite” or “3.14159.” An N-gram is a sequence of 1-grams, e.g., “trilobite extinction” (2-gram or bigram), and “Michigan State University” (3-gram or trigram). N-grams are used by computer scientists and computational linguists for text mining and natural language processing (Jurafsky and Martin, 2014). Google Books, a service of search-engine giant Google Inc., has amassed a database of more than 25 million scanned books. From this resource, a subset of over five million books, chosen for the quality of their optical scan and metadata (e.g., date of publication), comprises the corpus of Google Ngram Viewer. Currently, Ngram Viewer is restricted to a maximum word string length of n = 5 (five-grams), and counts only N-grams that occur at least 40 times in the corpus. The data consist of books published from the 1500s to 2000, and include children’s literature, trade, and other books but no journal articles. The full data set is available at and

CAVEATS TO USING THE CORPUS

  Problems with the unfiltered use of the Google Books corpus are well documented, including errors introduced during optical scanning and entering metadata (Nunberg, 2009). Pechenick et al. (2015) described limits to inferring cultural and linguistic evolution from the Google N-gram corpus, including the problem of the burgeoning number of scientific texts since 1990, which skews the results toward academic usage of N-grams and is therefore less reflective of cultural context. However, if the user’s purpose is to trace the history of a scientific discipline rather than a cultural phenomenon, as the purpose is here, the bias Pechenick et al. (2015) described skews in a constructive direction. Because the database consists of books only, rather than journal articles, N-gram results might lag the intellectual development of a discipline.

APPLICATION TO THE GEOLOGICAL SCIENCES

  Ngram Viewer is useful for suggesting testable hypotheses by identifying correlations. Two important caveats to keep in mind when using Ngram Viewer are that, as in any analysis, correlation does not necessarily indicate causation, and that, as with any online resource (Wikipedia, for example), Ngram Viewer provides a starting point to stimulate further investigation, not an end in itself. Here, in approximate chronological order, are three examples of Ngram Viewer searches drawn from geological topics chosen to illustrate the potential and the limitations of these data. Search terms and phrases (the N-grams) are enclosed in quotes.

  The frequency of the unigram “geology” shows an increase at 1830, coincident with publication of the first volume of Charles Lyell’s Principles of Geology. Volume one was followed by volumes two and three in 1832 and 1833, respectively. The N-gram frequency chart supports the hypothesis that Lyell’s books contributed to an increase in the frequency of the unigram “geology”; the conclusion that Lyell’s work had a major impact on the growth of geology is supported independently by historians of our discipline (Rudwick, 2010).

  N-gram frequency of “micropaleontology” reached a maximum in the early 1950s, coincident with that decade’s “petroleum” boom, and reflects the well-documented connection between microbiostratigraphy and petroleum exploration (Haq and Boersma, 1998). However, not all possible correlations are easily tested using Ngram Viewer. An attempt to chart the N-grams “micropaleontology” and “petroleum” on the same graph returned a display in which the line tracing the frequency of “micropaleontology” was indistinguishable from the x-axis; the frequency of the N-gram “petroleum” swamped “micropaleontology.” The corpus is also sensitive to N-gram size and word order: the trigram “extinction of trilobites” successfully returned results, whereas a query for “trilobite extinction” returned no N-grams. Although Ngram Viewer does not allow for easy comparison of N-grams with wildly different occurrence rates, this obstacle can be overcome by downloading and replotting the Ngram Viewer data using programs such as R.

  Cause-and-effect is suggested by the graph of “geosynclines” and “plate tectonics” (Fig. 1). The graph traces the displacement of the older “geosynclines” paradigm for explaining crustal tectonics by the emergence of “plate tectonics.” The dramatic shift from “geosynclines” to “plate tectonics” occurred in the mid-1970s, as plate tectonic theory supplanted the pre-tectonic explanation of crustal dynamics and made its way into textbooks. The apparent causal connection between the

