Numerical Analysis of Literature
Can the technologies of
Big Data, which are transforming so many areas of life, change our
understanding of American novels? After conducting research with Google's Ngram database, which tabulates the frequency of words used
in more than five million books, I believe the answer is yes.
Consider the question of
which themes and books characterize a literary era. The time-honored approach
to this problem has been for an august critic or group of distinguished
scholars to select and analyze key novels. That methodology, however, has its
flaws. No one person or team of readers can do more than dip their toes into
the vast sea of literary works. By the 1840s Americans wrote more than 100
novels annually; by the 1880s, more than 1,000; by the early 21st century, more
than 10,000. In addition, there is the threat of subjective bias. Not long ago,
for example, critics focused their attention almost exclusively on white male
authors.
The Ngram
database offers an alternative approach. As I demonstrate in the latest
issue of the journal Social Science History, by examining the changing
frequencies of key words in books published in the
There are important
caveats in using this source. The "American English" subset of the Ngram database includes a broad selection of books
published in the
Take the role of women in
mid-19th-century American novels. Scholars have long argued that domesticity
shaped the world of middle-class women and that novels relegated them to the
home and restricted their activities. Women had influence only when they
persuaded men to act, as in Harriet Beecher Stowe's 1852 novel, "Uncle Tom's
Cabin." Women were supposed to be submissive, pious, domestic and pure.
But Ngram data indicate that the use of those words
peaked, respectively, in 1807, 1814, 1835 and 1847. All fell off before
midcentury. By contrast, striking gains were recorded during these years in the
usage of "woman's rights." Virtually unknown before the 1840s, the term
soared in frequency after the Seneca Falls Convention in 1848 and did not peak
until 1884. Perhaps we need to invert the conventional wisdom and declare as
"representative" those midcentury novels criticizing domesticity and
celebrating independent women: books like Fanny Fern's "Ruth Hall," published
in 1854, and E.D.E.N. Southworth's "Hidden Hand,"
which first appeared in serial form in 1859.
Ngram data also provide a new
perspective on the novels of the 1930s. These years are traditionally viewed as
the heyday of the proletarian novel, a time of gloom and a period when business
leaders were despised. John Steinbeck's 1939 novel, "The Grapes of Wrath," is
considered a quintessential novel of the decade. But according to Ngram data, the use of "businessman," a term virtually
unknown before 1930, surged during the decade. Of course, you might guess that
those citations were negative, but trends in other terms point to a more
positive reading. References to "optimism" rose throughout the decade,
while "pessimism" declined. Mentions of the "American dream," a term
rarely seen before 1930, also climbed steeply. So instead of Steinbeck's
novel, works highlighting scrappy, successful entrepreneurs may best mark this
decade. In Zora Neale Hurston's "Their Eyes Were
Watching God," published in 1937, for example, the heroine's first two husbands
were successful businessmen who overcame tough times and racial prejudice.
Similarly, Margaret Mitchell's "Gone With the Wind"
(1936) details Scarlett O'Hara's campaign to regain the affluence she once
enjoyed.
Our view of postmodern
fiction might also need adjusting. Chaos, conspiracy and nihilism are thought
to reign in this literary world, as in the unsettling early works of Thomas
Pynchon and Bret Easton Ellis. Word usage, however, indicates a more positive
dynamic: the growing attention paid to children. Among the terms whose
frequency escalates after 1960
are "caring," "nurturing," "infant," "toddler" and "childhood."
It could be that the truly representative works of this era are novels like
Toni Morrison's "Beloved," Philip Roth's "American Pastoral" and Cormac McCarthy's "The Road," all of which feature deep
parent-child bonds.
Such hypotheses are
merely suggestive, but as tools like the Ngram
database continue to improve, the insights they make possible should encourage
scholars to revisit longstanding assumptions with a critical eye.
Marc Egnal, a professor of history at
Crunching Literary Numbers
Published: July 12, 2013
http://www.nytimes.com/2013/07/14/opinion/sunday/crunching-literary-numbers.html?_r=0