A couple of months ago I was in a hostel, desperately looking for an English-language book. Since I was in a Spanish-speaking country and this particular location wasn't on the main backpacking track, the selection was a little limited. Yet there was a book by Danielle Steel (as there seemed to be wherever I traveled). While I've never read a Danielle Steel book, I was aware of her and knew that she is an incredibly prolific writer.
I started wondering whether she actually wrote her books, so I committed to putting a corpus together and attacking it with my limited natural language tools to try to answer that question. Along the way I noticed some interesting trends.
This is a quick run-through of the methods and tools I used.
Classify text with pretrained NER models
Named Entity Recognition is a tool for applying labels to text. A sentence such as:
Tim {name} helped Sarah {name} move to San Diego {location} in 2016.
would have the labels in brackets after being processed (though likely in a different data structure). I had gained a passing familiarity with NER while working through Natural Language Processing with Python (7 years ago!) but had only gone so far as to implement a part-of-speech tagger.
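To make that concrete, here is a hand-written illustration (not actual tagger output) of the kind of data structure an NER tagger such as the Stanford NER used below returns: a list of (token, label) pairs, with "O" marking tokens that are not named entities.
# Illustrative only: NER output as (token, label) tuples
tagged_sentence = [
    ("Tim", "PERSON"), ("helped", "O"), ("Sarah", "PERSON"),
    ("move", "O"), ("to", "O"),
    ("San", "LOCATION"), ("Diego", "LOCATION"),
    ("in", "O"), ("2016", "O"), (".", "O"),
]
# Pull out just the location tokens
locations = [token for token, label in tagged_sentence if label == "LOCATION"]
print(locations)  # ['San', 'Diego']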
I started off trying to roll my own, but it quickly became a rabbit hole, so I decided to stand on the shoulders of giants. There are several services that offer NER at varying degrees of quality. I decided to use an open tool so I could dive into the source in the future, when my interests are less split. Since this is exploratory work I don't have a high accuracy requirement; otherwise I'd likely use one of the other tools (or train my own model, if I had time to label text). The Stanford Natural Language Processing Group makes the Stanford NER freely available along with pretrained models.
Constructing a corpus
First, I obtained 88 ebooks by Danielle Steel in azw3, epub, and mobi formats, spanning the 70s until now. I used calibre's ebook-convert tool to convert them all to text with this script. I explored the text to see if there were many conversion errors; I saw a few but decided not to dedicate much time to data cleanup. I do expect that some publisher names, editor credits, and other matter often included at the beginnings and ends of books made it into the end result when they should not have.
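The conversion script itself isn't reproduced here, but it amounted to looping over the ebook files and shelling out to ebook-convert; a rough sketch of that step (the "ebooks" and "corpus" directory names are placeholders, not necessarily the ones I used) would be:
import pathlib
import subprocess

# Sketch of the conversion loop: ebook-convert infers the input and output
# formats from the file extensions.
for book in pathlib.Path("ebooks").iterdir():
    if book.suffix in {".azw3", ".epub", ".mobi"}:
        txt_path = pathlib.Path("corpus") / (book.stem + ".txt")
        subprocess.run(["ebook-convert", str(book), str(txt_path)], check=True)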
NLTK and initial exploratory data analysis
import numpy as np
import nltk
import matplotlib
import matplotlib.pyplot as plt
import wordcloud
plt.style.use('fivethirtyeight')
%matplotlib inline
# Read in corpus from local corpus directory
corpus = nltk.corpus.PlaintextCorpusReader("corpus", r'.*\.txt')
# Calculate basic statistics
works = corpus.fileids()
titles = [title.split()[0] for title in works]
print("Total works by Danielle Steele: {}".format(len(works)))
word_count = [len(corpus.words(work)) for work in works]
print("Average word count {:.2f}".format(np.average(word_count)))
print("Median word count {:.2f}".format(np.median(word_count)))
print("Standard deviation word count {:.2f}".format(np.std(word_count)))
print("Minimum words {}".format(np.min(word_count)))
print("Maximum words {}".format(np.max(word_count)))
Total works by Danielle Steel: 88
Average word count 133223.85
Median word count 135017.50
Standard deviation word count 34262.29
Minimum words 56394
Maximum words 215792
# View the distribution of works by word count
n, bins, patches = plt.hist(word_count, 20, range=(0,220000))
_ = plt.ylabel("Total works")
_ = plt.xlabel("Total words")
Data Cleanliness
This corpus is not perfectly clean. It contains some formatting data left over from converting 88 epub, mobi, and azw3 files to plain text, and it represents the majority of Danielle Steel's prolific body of work from 1973-2009. However, that doesn't prevent us from making some general observations: the majority of her works are around 135,000 words (54.5% were between 100,000 and 150,000 words).
print("{:.1f}%".format(len([words for words in word_count if words < 150000 and words > 100000])/len(works)*100))
54.5%
Wordclouds because why not
I found a wonderful little Python module for wordclouds and thought it would be a fun way to explore the text, even if not amazingly useful for understanding it.
# Create a wordcloud to look at textual data
wordcloud_im = wordcloud.WordCloud(width=800, height=600)
_ = wordcloud_im.generate(corpus.raw())
_ = plt.imshow(wordcloud_im)
_ = plt.axis("off")
Extracting locations
I enjoy maps. So what if we were to look at all the location names mentioned in all 88 novels, geocode them, and display them on a map? The NER tagger labels individual words, so I used the heuristic that location tags appearing close together (within two non-location words) should be merged into one combined entity. For example, a location would pass through three stages: Santa Cruz -> Santa [location] Cruz [location] -> Santa Cruz [location]. After trying it out, it looked good enough.
# Extract locations
# After spending some time trying to roll my own from the GNIS dataset and other gazetteers I decided
# to use the Stanford NER tagger
from nltk.tag.stanford import StanfordNERTagger
# NLTK makes calling out to the pretrained classifier model (first arg) via the
# Stanford NER jar (second arg) fairly seamless; requires Java.
tagger = StanfordNERTagger("ner/classifiers/english.all.3class.distsim.crf.ser.gz", "ner/stanford-ner.jar")
tagged_corpus = []
for work in corpus.fileids():
    tagged_corpus.append(tagger.tag(corpus.words(work)))
# Extract locations
filtered_locations = []
for tagged_work in tagged_corpus:
    # Add all consecutive location tags; if there is a spacing of 3 or more, separate
    spacing = 0
    SPACE_LIMIT = 2
    work_storage = []
    location_storage = []
    for word in tagged_work:
        # if it is a location word add it
        if word[1] == "LOCATION":
            location_storage.append(word[0])
        else:
            # if it is not a location word, consume a space
            spacing += 1
        # if the spacing limit is reached add the aggregated location
        # and reset the parameters
        if spacing > SPACE_LIMIT:
            # Add only nonempty lists
            if len(location_storage) != 0:
                work_storage.append(location_storage[:])
            spacing = 0
            location_storage = []
    filtered_locations.append(work_storage)
    work_storage = []
# Utility function for adding all of the tags together
def create_place_name(tagged_locs: list):
    agg_string = ""
    for place_name in tagged_locs:
        agg_string += place_name
        agg_string += " "
    return agg_string
# Take a peek at the filtered locations to get an idea of accuracy
# but first a sanity check
locations_only = [create_place_name(item) for sublist in filtered_locations for item in sublist]
location_works = []  # list of lists, containing locations, matches up with corpus.fileids() order
for work in filtered_locations:
    location_works.append([create_place_name(item) for item in work])
print("Do we have 88 works as expected?: {}".format(len(filtered_locations) == 88))
print("Total number of locations tagged: {}".format(len(locations_only)))
print([create_place_name(places) for places in filtered_locations[0][:5]])
print([create_place_name(places) for places in filtered_locations[15][50:55]])
print([create_place_name(places) for places in filtered_locations[50][100:105]])
print([create_place_name(places) for places in filtered_locations[80][155:160]])
Do we have 88 works as expected?: True
Total number of locations tagged: 46230
['Victoria ', 'New York ', 'West Village ', 'Tribeca ', 'West Village ']
['Berlin ', 'Berlin ', 'Verdun ', 'Berlin ', 'Berlin ']
['Santa Claus ', 'New York ', 'California ', 'New York ', 'New York ']
['France ', 'Riviera Nice ', 'London ', 'New York ', 'Rome ']
Geocoding
Not a particularly scientific approach, but the results look geocodable enough to explore the locations in a novel. Depending on the geocoder used, I might be able to restrict matches to smaller places if I can supply bounding extents.
That is a hell of a lot of locations, though, and well above what is easy to geocode: the Google Geocoding service, for example, limits free use to 2,500 queries a day. Paying to geocode the rest would cost about $22, which I'd rather not spend. Other services can do it, but in my experience they aren't as accurate, especially for partial locations.
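One way to stretch a limited query budget (not something I did for the run below) would be to geocode each distinct place name only once and reuse the result, since names like New York repeat constantly. A minimal sketch, assuming the locations_only and location_works lists built above; the API key is a placeholder:
import geopy

# Sketch: geocode each distinct place name once and cache the result
geocoder = geopy.GoogleV3(api_key='XXXXXXXXXXXX')
unique_names = sorted(set(locations_only))
print("Distinct place names: {}".format(len(unique_names)))
cache = {name: geocoder.geocode(name) for name in unique_names}
# Look up every mention through the cache instead of re-querying
geocoded_all = [cache[name] for work in location_works for name in work]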
# Distribution of location counts
_ = plt.hist([len(x) for x in location_works])
_ = plt.ylabel("Total Works")
_ = plt.xlabel("Total Locations Mentioned")
# Can't geocode them all, so selecting a random book
# to explore locations
import geopy
import time
geocoder = geopy.GoogleV3(api_key='XXXXXXXXXXXX')
book_locs = location_works[30]
geocoded_locs = []
for loc in book_locs:
    geocoded_locs.append(geocoder.geocode(loc))
    time.sleep(0.1)  # brief pause between requests to stay under the rate limit
# Adapted from example code found at http://matplotlib.org/basemap/users/robin.html
lat = [loc.latitude for loc in geocoded_locs if loc is not None]
long = [loc.longitude for loc in geocoded_locs if loc is not None]
latlong = zip(lat, long)
from mpl_toolkits.basemap import Basemap
m = Basemap(projection='robin',lon_0=-80, resolution='l')
m.drawcoastlines()
m.drawcountries()
m.fillcontinents(color='yellow',lake_color='black')
# draw parallels and meridians.
# m.drawparallels(np.arange(-90.,120.,30.), color='white')
# m.drawmeridians(np.arange(0.,360.,30.), color='white')
m.drawmapboundary(fill_color='black')
x, y = m(long, lat)
m.plot(x, y, 'r.', alpha=0.3, markersize=5)
plt.title("Locations mentioned in {}".format(titles[30]))
plt.show()
Conclusions
NER is a surprisingly effective way to discover new insights in aggregated text. While this wasn't groundbreaking, there are many interesting projects, especially in digital history, that use the technique for exploration:
Digital History Methods
Generative Historiography of Religion
Placing Literature
As for Danielle Steel: most of her books are based around NYC, SF, and Paris. According to a summary I found, her work is primarily set in NYC and SF, and the map mostly reflects that, with a fair amount of coverage in Europe as well. This style of visualization doesn't demonstrate concentration especially well and would likely be better suited to a heatmap; generally, though, the method appears to work.
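As a rough idea of what a density view might look like, here is a sketch using hexagonal binning over the projected coordinates from the map cell above (it reuses the m, x, y, and titles variables defined there):
# Sketch: density view of the same geocoded points via hexagonal binning.
# Reuses the Basemap instance `m` and projected coordinates `x`, `y` from above.
m.drawcoastlines()
m.drawmapboundary(fill_color='black')
_ = plt.hexbin(x, y, gridsize=50, bins='log', cmap='hot', mincnt=1)
_ = plt.colorbar(label="Mentions (log scale)")
plt.title("Density of locations mentioned in {}".format(titles[30]))
plt.show()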
Like most projects I have only more questions. Some future projects for this data set:
- Visualize location changes throughout a novel (binned arbitrarily). Perhaps find a way to track movement, or sequences where movement occurs, by increases in the occurrence of particular verbs or phrases.
- Are there temporal location preferences (by setting, or by publication date)?
- What does the full data set look like?
- Map directionality of mentions along with strength (mentioning London immediately after Paris suggests a stronger connection than a mention of Boston 10 locations/entities/words later).
- Measure accuracy without labeling (How?)
- Did Danielle Steel write her novels? (Probably yes; her pacing isn't inhuman.)