From 5f5e4206055466da74a2a2ccf3bbbdf5fa688cb7 Mon Sep 17 00:00:00 2001 From: Surya Date: Thu, 22 Feb 2018 16:51:45 +0530 Subject: [PATCH 1/2] add python3 compatibility and remove deprecated calls --- A Smattering of NLP in Python.ipynb | 1212 +++++++++++++++++++++------ 1 file changed, 977 insertions(+), 235 deletions(-) diff --git a/A Smattering of NLP in Python.ipynb b/A Smattering of NLP in Python.ipynb index 13ed366..0f32e4a 100644 --- a/A Smattering of NLP in Python.ipynb +++ b/A Smattering of NLP in Python.ipynb @@ -1,241 +1,983 @@ { - "metadata": { - "name": "A Smattering of NLP in Python", - "signature": "sha256:5b38818827e50ee282fa44155be8b71ad71466229789ff67e945f3a6d2570004" - }, - "nbformat": 3, - "nbformat_minor": 0, - "worksheets": [ + "cells": [ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": "# A Smattering of NLP in Python\n*by Charlie Greenbacker [@greenbacker](https://twitter.com/greenbacker)*\n\n[![Python Powered logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/python-powered-w-200x80.png)](https://www.python.org/)\n\n### Part of a [joint meetup on Natural Language Processing](http://www.meetup.com/stats-prog-dc/events/177772322/) - 9 July 2014\n- #### [Statistical Programming DC](http://www.meetup.com/stats-prog-dc/)\n- #### [Data Wranglers DC](http://www.meetup.com/Data-Wranglers-DC/)\n- #### [DC Natural Language Processing](http://dcnlp.org/)\n\n***\n\n## Introduction\nBack in the dark ages of data science, each group or individual working in Natural Language Processing (NLP) generally maintained an assortment of homebrew utility programs designed to handle many of the common tasks involved with NLP. Despite everyone's best intentions, most of this code was lousy, brittle, and poorly documented -- not a good foundation upon which to build your masterpiece. Fortunately, over the past decade, mainstream open source software libraries like the [Natural Language Toolkit for Python (NLTK)](http://www.nltk.org/) have emerged to offer a collection of high-quality reusable NLP functionality. These libraries allow researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.\n\nThis presentation will cover a handful of the NLP building blocks provided by NLTK (and a few additional libraries), including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. Several of these components will then be assembled to build a very basic document summarization program.\n\n[![Natural Language Processing with Python book cover](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/cat.gif)](http://oreilly.com/catalog/9780596516499/)\n\n### Initial Setup\nObviously, you'll need Python installed on your system to run the code examples used in this presentation. We enthusiatically recommend using [Anaconda](https://store.continuum.io/cshop/anaconda/), a Python distribution provided by [Continuum Analytics](http://www.continuum.io/). 
Anaconda is free to use, it includes nearly [200 of the most commonly used Python packages for data analysis](http://docs.continuum.io/anaconda/pkg-docs.html) (including NLTK), and it works on Mac, Linux, and yes, even Windows.\n\n[![Anaconda logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/anaconda_logo_web.png)](https://store.continuum.io/cshop/anaconda/)\n\nWe'll make use of the following Python packages in the example code:\n\n- [nltk](http://www.nltk.org/install.html) (comes with Anaconda)\n- [readability-lxml](https://github.com/buriy/python-readability)\n- [BeautifulSoup4](http://www.crummy.com/software/BeautifulSoup/) (comes with Anaconda)\n- [scikit-learn](http://scikit-learn.org/stable/install.html) (comes with Anaconda)\n\nPlease note that the **readability** package is not distributed with Anaconda, so you'll need to download & install it separately using something like easy_install readability-lxml or pip install readability-lxml.\n\nIf you don't use Anaconda, you'll also need to download & install the other packages separately using similar methods. Refer to the homepage of each package for instructions.\n\nYou'll want to run nltk.download() one time to get all of the NLTK packages, corpora, etc. (see below). Select the \"all\" option. Depending on your network speed, this could take a while, but you'll only need to do it once.\n\n#### Java libraries (optional)\nOne of the examples will use NLTK's interface to the [Stanford Named Entity Recognizer](http://www-nlp.stanford.edu/software/CRF-NER.shtml#Download), which is distributed as a Java library. In particular, you'll want the following files handy in order to run this particular example:\n\n- stanford-ner.jar\n- english.all.3class.distsim.crf.ser.gz\n\n[![Stanford NLP Group logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/stanford-nlp.jpg)](http://www-nlp.stanford.edu/software/CRF-NER.shtml#Download)\n\n***\n\n## Getting Started\nThe first thing we'll need to do is import nltk:" - }, - { - "cell_type": "code", - "collapsed": false, - "input": "import nltk", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "#### Downloading NLTK resources\nThe first time you run anything using NLTK, you'll want to go ahead and download the additional resources that aren't distributed directly with the NLTK package. Upon running the nltk.download() command below, the the NLTK Downloader window will pop-up. In the Collections tab, select \"all\" and click on Download. As mentioned earlier, this may take several minutes depending on your network connection speed, but you'll only ever need to run it a single time." - }, - { - "cell_type": "code", - "collapsed": false, - "input": "nltk.download()", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "## Extracting text from HTML\nNow the fun begins. We'll start with a pretty basic and commonly-faced task: extracting text content from an HTML page. Python's urllib package gives us the tools we need to fetch a web page from a given URL, but we see that the output is full of HTML markup that we don't want to deal with.\n\n(N.B.: Throughout the examples in this presentation, we'll use Python *slicing* (e.g., [:500] below) to only display a small portion of a string or list. 
Otherwise, if we displayed the entire item, sometimes it would take up the entire screen.)" - }, - { - "cell_type": "code", - "collapsed": false, - "input": "from urllib import urlopen\n\nurl = \"http://venturebeat.com/2014/07/04/facebooks-little-social-experiment-got-you-bummed-out-get-over-it/\"\nhtml = urlopen(url).read()\nhtml[:500]", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "#### Stripping-out HTML formatting\nFortunately, NTLK provides a method called clean_html() to get the raw text out of an HTML-formatted string. It's still not perfect, though, since the output will contain page navigation and all kinds of other junk that we don't want, especially if our goal is to focus on the body content from a news article, for example." - }, - { - "cell_type": "code", - "collapsed": false, - "input": "text = nltk.clean_html(html)\ntext[:500]", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "#### Identifying the Main Content\nIf we just want the body content from the article, we'll need to use two additional packages. The first is a Python port of a Ruby port of a Javascript tool called Readability, which pulls the main body content out of an HTML document and subsequently \"cleans it up.\" The second package, BeautifulSoup, is a Python library for pulling data out of HTML and XML files. It parses HTML content into easily-navigable nested data structure. Using Readability and BeautifulSoup together, we can quickly get exactly the text we're looking for out of the HTML, (*mostly*) free of page navigation, comments, ads, etc. Now we're ready to start analyzing this text content." - }, - { - "cell_type": "code", - "collapsed": false, - "input": "from readability.readability import Document\nfrom bs4 import BeautifulSoup\n\nreadable_article = Document(html).summary()\nreadable_title = Document(html).title()\nsoup = BeautifulSoup(readable_article)\nprint '*** TITLE *** \\n\\\"' + readable_title + '\\\"\\n'\nprint '*** CONTENT *** \\n\\\"' + soup.text[:500] + '[...]\\\"'", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "## Frequency Analysis\nHere's a little secret: much of NLP (and data science, for that matter) boils down to counting things. If you've got a bunch of data that needs *analyzin'* but you don't know where to start, counting things is usually a good place to begin. Sure, you'll need to figure out exactly what you want to count, how to count it, and what to do with the counts, but if you're lost and don't know what to do, **just start counting**.\n\nPerhaps we'd like to begin (as is often the case in NLP) by examining the words that appear in our document. To do that, we'll first need to tokenize the text string into discrete words. Since we're working with English, this isn't so bad, but if we were working with a non-whitespace-delimited language like Chinese, Japanese, or Korean, it would be much more difficult.\n\nIn the code snippet below, we're using two of NLTK's tokenize methods to first chop up the article text into sentences, and then each sentence into individual words. (Technically, we didn't need to use sent_tokenize(), but if we only used word_tokenize() alone, we'd see a bunch of extraneous sentence-final punctuation in our output.) 
By printing each token alphabetically, along with a count of the number of times it appeared in the text, we can see the results of the tokenization. Notice that the output contains some punctuation & numbers, hasn't been loweredcased, and counts *BuzzFeed* and *BuzzFeed's* separately. We'll tackle some of those issues next." - }, - { - "cell_type": "code", - "collapsed": false, - "input": "tokens = [word for sent in nltk.sent_tokenize(soup.text) for word in nltk.word_tokenize(sent)]\n\nfor token in sorted(set(tokens))[:30]:\n print token + ' [' + str(tokens.count(token)) + ']'", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "#### Word Stemming\n[Stemming](http://en.wikipedia.org/wiki/Stemming) is the process of reducing a word to its base/stem/root form. Most stemmers are pretty basic and just chop off standard affixes indicating things like tense (e.g., \"-ed\") and possessive forms (e.g., \"-'s\"). Here, we'll use the Snowball stemmer for English, which comes with NLTK.\n\nOnce our tokens are stemmed, we can rest easy knowing that *BuzzFeed* and *BuzzFeed's* are now being counted together as... *buzzfe*? Don't worry: although this may look weird, it's pretty standard behavior for stemmers and won't affect our analysis (much). We also (probably) won't show the stemmed words to users -- we'll normally just use them for internal analysis or indexing purposes." - }, - { - "cell_type": "code", - "collapsed": false, - "input": "from nltk.stem.snowball import SnowballStemmer\n\nstemmer = SnowballStemmer(\"english\")\nstemmed_tokens = [stemmer.stem(t) for t in tokens]\n\nfor token in sorted(set(stemmed_tokens))[50:75]:\n print token + ' [' + str(stemmed_tokens.count(token)) + ']'", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "#### Lemmatization\n\nAlthough the stemmer very helpfully chopped off pesky affixes (and made everything lowercase to boot), there are some word forms that give stemmers indigestion, especially *irregular* words. While the process of stemming typically involves rule-based methods of stripping affixes (making them small & fast), **lemmatization** involves dictionary-based methods to derive the canonical forms (i.e., *lemmas*) of words. For example, *run*, *runs*, *ran*, and *running* all correspond to the lemma *run*. However, lemmatizers are generally big, slow, and brittle due to the nature of the dictionary-based methods, so you'll only want to use them when necessary.\n\nThe example below compares the output of the Snowball stemmer with the WordNet lemmatizer (also distributed with NLTK). Notice that the lemmatizer correctly converts *women* into *woman*, while the stemmer turns *lying* into *lie*. Additionally, both replace *eyes* with *eye*, but neither of them properly transforms *told* into *tell*." - }, - { - "cell_type": "code", - "collapsed": false, - "input": "lemmatizer = nltk.WordNetLemmatizer()\ntemp_sent = \"Several women told me I have lying eyes.\"\n\nprint [stemmer.stem(t) for t in nltk.word_tokenize(temp_sent)]\nprint [lemmatizer.lemmatize(t) for t in nltk.word_tokenize(temp_sent)]", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "#### NLTK Frequency Distributions\nThus far, we've been working with lists of tokens that we're manually sorting, uniquifying, and counting -- all of which can get to be a bit cumbersome. 
Fortunately, NLTK provides a data structure called FreqDist that makes it more convenient to work with these kinds of frequency distributions. The code snippet below builds a FreqDist from our list of stemmed tokens, and then displays the top 25 tokens appearing most frequently in the text of our article. Wasn't that easy?" - }, - { - "cell_type": "code", - "collapsed": false, - "input": "fdist = nltk.FreqDist(stemmed_tokens)\n\nfor item in fdist.items()[:25]:\n print item", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "#### Filtering out Stop Words\nNotice in the output above that most of the top 25 tokens are worthless. With the exception of things like *facebook*, *content*, *user*, and perhaps *emot* (emotion?), the rest are basically devoid of meaningful information. They don't really tells us anything about the article since these tokens will appear is just about any English document. What we need to do is filter out these [*stop words*](http://en.wikipedia.org/wiki/Stop_words) in order to focus on just the important material.\n\nWhile there is no single, definitive list of stop words, NLTK provides a decent start. Let's load it up and take a look at what we get:" - }, - { - "cell_type": "code", - "collapsed": false, - "input": "sorted(nltk.corpus.stopwords.words('english'))[:25]", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "Now we can use this list to filter-out stop words from our list of stemmed tokens before we create the frequency distribution. You'll notice in the output below that we still have some things like punctuation that we'd probably like to remove, but we're much closer to having a list of the most \"important\" words in our article." - }, - { - "cell_type": "code", - "collapsed": false, - "input": "stemmed_tokens_no_stop = [stemmer.stem(t) for t in stemmed_tokens if t not in nltk.corpus.stopwords.words('english')]\n\nfdist2 = nltk.FreqDist(stemmed_tokens_no_stop)\n\nfor item in fdist2.items()[:25]:\n print item", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "## Named Entity Recognition\nAnother task we might want to do to help identify what's \"important\" in a text document is [named entity recogniton (NER)](http://en.wikipedia.org/wiki/Named-entity_recognition). Also called *entity extraction*, this process involves automatically extracting the names of persons, places, organizations, and potentially other entity types out of unstructured text. Building an NER classifier requires *lots* of annotated training data and some [fancy machine learning algorithms](http://en.wikipedia.org/wiki/Conditional_random_field), but fortunately, NLTK comes with a pre-built/pre-trained NER classifier ready to extract entities right out of the box. This classifier has been trained to recognize PERSON, ORGANIZATION, and GPE (geo-political entity) entity types.\n\n(At this point, I should include a disclaimer stating [No True Computational Linguist](http://en.wikipedia.org/wiki/No_true_Scotsman) would ever use a pre-built NER classifier in the \"real world\" without first re-training it on annotated data representing their particular task. So please don't send me any hate mail -- I've done my part to stop the madness.)\n\n![Retrain my classifier models? 
Ain't nobody got time for that!](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/no_time.jpg)\n\nIn the example below (inspired by [this gist from Gavin Hackeling](https://gist.github.com/gavinmh/4735528/) and [this post from John Price](http://freshlyminted.co.uk/blog/2011/02/28/getting-band-and-artist-names-nltk/)), we're defining a method to perform the following steps:\n\n- take a string as input\n- tokenize it into sentences\n- tokenize the sentences into words\n- add part-of-speech tags to the words using nltk.pos_tag()\n- run this through the NLTK-provided NER classifier using nltk.ne_chunk()\n- parse these intermediate results and return any extracted entities\n\nWe then apply this method to a sample sentence and parse the clunky output format provided by nltk.ne_chunk() (it comes as a [nltk.tree.Tree](http://www.nltk.org/_modules/nltk/tree.html)) to display the entities we've extracted. Don't let these nice results fool you -- NER output isn't always this satisfying. Try some other sample text and see what you get." - }, - { - "cell_type": "code", - "collapsed": false, - "input": "def extract_entities(text):\n\tentities = []\n\tfor sentence in nltk.sent_tokenize(text):\n\t chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))\n\t entities.extend([chunk for chunk in chunks if hasattr(chunk, 'node')])\n\treturn entities\n\nfor entity in extract_entities('My name is Charlie and I work for Altamira in Tysons Corner.'):\n print '[' + entity.node + '] ' + ' '.join(c[0] for c in entity.leaves())", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "If you're like me, you've grown accustomed over the years to working with the [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) library for Java, and you're suspicious of NLTK's built-in NER classifier (especially because it has *chunk* in the name). Thankfully, recent versions of NLTK contain an special NERTagger interface that enables us to make calls to Stanford NER from our Python programs, even though Stanford NER is a *Java library* (the horror!). [Not surprisingly](http://www.yurtopic.com/tech/programming/images/java-and-python.jpg), the Python NERTagger API is slightly less verbose than the native Java API for Stanford NER.\n\nTo run this example, you'll need to follow the instructions for installing the optional Java libraries, as outlined in the **Initial Setup** section above. You'll also want to pay close attention to the comment that says # change the paths below to point to wherever you unzipped the Stanford NER download file." - }, - { - "cell_type": "code", - "collapsed": false, - "input": "from nltk.tag.stanford import NERTagger\n\n# change the paths below to point to wherever you unzipped the Stanford NER download file\nst = NERTagger('/Users/cgreenba/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',\n '/Users/cgreenba/stanford-ner/stanford-ner.jar', 'utf-8')\n\nfor i in st.tag('Up next is Tommy, who works at STPI in Washington.'.split()):\n print '[' + i[1] + '] ' + i[0]", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "## Automatic Summarization\nNow let's try to take some of what we've learned and build something potentially useful in real life: a program that will [automatically summarize](http://en.wikipedia.org/wiki/Automatic_summarization) documents. 
For this, we'll switch gears slightly, putting aside the web article we've been working on until now and instead using a corpus of documents distributed with NLTK.\n\nThe Reuters Corpus contains nearly 11,000 news articles about a variety of topics and subjects. If you've run the nltk.download() command as previously recommended, you can then easily import and explore the Reuters Corpus like so:" - }, - { - "cell_type": "code", - "collapsed": false, - "input": "from nltk.corpus import reuters\n\nprint '** BEGIN ARTICLE: ** \\\"' + reuters.raw(reuters.fileids()[0])[:500] + ' [...]\\\"'", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "Our [painfully simplistic](http://anthology.aclweb.org/P/P11/P11-3014.pdf) automatic summarization tool will implement the following steps:\n\n- assign a score to each word in a document corresponding to its level of \"importance\"\n- rank each sentence in the document by summing the individual word scores and dividing by the number of tokens in the sentence\n- extract the top N highest scoring sentences and return them as our \"summary\"\n\nSounds easy enough, right? But before we can say \"*voila!*,\" we'll need to figure out how to calculate an \"importance\" score for words. As we saw above with stop words, etc. simply counting the number of times a word appears in a document will not necessarily tell you which words are most important.\n\n#### Term Frequency - Inverse Document Frequency (TF-IDF)\n\nConsider a document that contains the word *baseball* 8 times. You might think, \"wow, *baseball* isn't a stop word, and it appeared rather frequently here, so it's probably important.\" And you might be right. But what if that document is actually an article posted on a baseball blog? Won't the word *baseball* appear frequently in nearly every post on that blog? In this particular case, if you were generating a summary of this document, would the word *baseball* be a good indicator of importance, or would you maybe look for other words that help distinguish or differentiate this blog post from the rest?\n\nContext is essential. What really matters here isn't the raw frequency of the number of times each word appeared in a document, but rather the **relative frequency** comparing the number of times a word appeared in this document against the number of times it appeared across the rest of the collection of documents. \"Important\" words will be the ones that are generally rare across the collection, but which appear with an unusually high frequency in a given document.\n\nWe'll calculate this relative frequency using a statistical metric called [term frequency - inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf). We could implement TF-IDF ourselves using NLTK, but rather than bore you with the math, we'll take a shortcut and use the TF-IDF implementation provided by the [scikit-learn](http://scikit-learn.org/) machine learning library for Python.\n\n![Chevy Chase: \"It was my understanding that there would be no math.\"](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/i-was-told-there-would-be-no-math.jpg)\n\n#### Building a Term-Document Matrix\n\nWe'll use scikit-learn's TfidfVectorizer class to construct a [term-document matrix](http://en.wikipedia.org/wiki/Document-term_matrix) containing the TF-IDF score for each word in each document in the Reuters Corpus. 
In essence, the rows of this sparse matrix correspond to documents in the corpus, the columns represent each word in the vocabulary of the corpus, and each cell contains the TF-IDF value for a given word in a given document.\n\n[![Scikit-learn logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/Scikit-learn_logo.png)](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)\n\nInspired by a [computer science lab exercise from Duke University](http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html), the code sample below iterates through the Reuters Corpus to build a dictionary of stemmed tokens for each article, then uses the TfidfVectorizer and scikit-learn's own built-in stop words list to generate the term-document matrix containing TF-IDF scores." - }, - { - "cell_type": "code", - "collapsed": false, - "input": "import datetime, re, sys\nfrom sklearn.feature_extraction.text import TfidfVectorizer\n\ndef tokenize_and_stem(text):\n tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]\n filtered_tokens = []\n # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)\n for token in tokens:\n if re.search('[a-zA-Z]', token):\n filtered_tokens.append(token)\n stems = [stemmer.stem(t) for t in filtered_tokens]\n return stems\n\ntoken_dict = {}\nfor article in reuters.fileids():\n token_dict[article] = reuters.raw(article)\n \ntfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words='english', decode_error='ignore')\nprint 'building term-document matrix... [process started: ' + str(datetime.datetime.now()) + ']'\nsys.stdout.flush()\n\ntdm = tfidf.fit_transform(token_dict.values()) # this can take some time (about 60 seconds on my machine)\nprint 'done! [process finished: ' + str(datetime.datetime.now()) + ']'", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "#### TF-IDF Scores\n\nNow that we've built the term-document matrix, we can explore its contents:" - }, - { - "cell_type": "code", - "collapsed": false, - "input": "from random import randint\n\nfeature_names = tfidf.get_feature_names()\nprint 'TDM contains ' + str(len(feature_names)) + ' terms and ' + str(tdm.shape[0]) + ' documents'\n\nprint 'first term: ' + feature_names[0]\nprint 'last term: ' + feature_names[len(feature_names) - 1]\n\nfor i in range(0, 4):\n print 'random term: ' + feature_names[randint(1,len(feature_names) - 2)]", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "#### Generating the Summary\n\nThat's all we'll need to produce a summary for any document in the corpus. In the example code below, we start by randomly selecting an article from the Reuters Corpus. We iterate through the article, calculating a score for each sentence by summing the TF-IDF values for each word appearing in the sentence. We normalize the sentence scores by dividing by the number of tokens in the sentence (to avoid bias in favor of longer sentences). Then we sort the sentences by their scores, and return the highest-scoring sentences as our summary. 
The number of sentences returned corresponds to roughly 20% of the overall length of the article.\n\nSince some of the articles in the Reuters Corpus are rather small (i.e., a single sentence in length) or contain just raw financial data, some of the summaries won't make sense. If you run this code a few times, however, you'll eventually see a randomly-selected article that provides a decent demonstration of this simplistic method of identifying the \"most important\" sentence from a document." - }, - { - "cell_type": "code", - "collapsed": false, - "input": "import math\nfrom __future__ import division\n\narticle_id = randint(0, tdm.shape[0] - 1)\narticle_text = reuters.raw(reuters.fileids()[article_id])\n\nsent_scores = []\nfor sentence in nltk.sent_tokenize(article_text):\n score = 0\n sent_tokens = tokenize_and_stem(sentence)\n for token in (t for t in sent_tokens if t in feature_names):\n score += tdm[article_id, feature_names.index(token)]\n sent_scores.append((score / len(sent_tokens), sentence))\n\nsummary_length = int(math.ceil(len(sent_scores) / 5))\nsent_scores.sort(key=lambda sent: sent[0], reverse=True)\n\nprint '*** SUMMARY ***'\nfor summary_sentence in sent_scores[:summary_length]:\n print summary_sentence[1]\n\nprint '\\n*** ORIGINAL ***'\nprint article_text", - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "#### Improving the Summary\nThat was fairly easy, but how could we improve the quality of the generated summary? Perhaps we could boost the importance of words found in the title or any entities we're able to extract from the text. After initially selecting the highest-scoring sentence, we might discount the TF-IDF scores for duplicate words in the remaining sentences in an attempt to reduce repetitiveness. We could also look at cleaning up the sentences used to form the summary by fixing any pronouns missing an antecedent, or even pulling out partial phrases instead of complete sentences. The possibilities are virtually endless.\n\n## Next Steps\nWant to learn more? 
Start by working your way through all the examples in the NLTK book (aka \"the Whale book\"):\n\n[![Natural Language Processing with Python book cover](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/cat.gif)](http://oreilly.com/catalog/9780596516499/)\n\n- [Natural Language Processing with Python (book)](http://oreilly.com/catalog/9780596516499/)\n- (free online version: [nltk.org/book](http://www.nltk.org/book/))\n\n### Additional NLP Resources for Python\n- [NLTK HOWTOs](http://www.nltk.org/howto/)\n- [Python Text Processing with NLTK 2.0 Cookbook (book)](http://www.packtpub.com/python-text-processing-nltk-20-cookbook/book)\n- [Python wrapper for the Stanford CoreNLP Java library](https://pypi.python.org/pypi/corenlp)\n- [guess_language (Python library for language identification)](https://bitbucket.org/spirit/guess_language)\n- [MITIE (new C/C++-based NER library from MIT with a Python API)](https://github.com/mit-nlp/MITIE)\n- [gensim (topic modeling library for Python)](http://radimrehurek.com/gensim/)\n\n### Attend future DC NLP meetups\n\n[![DC NLP logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/dcnlp.jpeg)](http://dcnlp.org/)\n\n- [dcnlp.org](http://dcnlp.org/) | [@DCNLP](https://twitter.com/DCNLP/)" + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# A Smattering of NLP in Python\n", + "*by Charlie Greenbacker [@greenbacker](https://twitter.com/greenbacker)*\n", + "\n", + "[![Python Powered logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/python-powered-w-200x80.png)](https://www.python.org/)\n", + "\n", + "### Part of a [joint meetup on Natural Language Processing](http://www.meetup.com/stats-prog-dc/events/177772322/) - 9 July 2014\n", + "- #### [Statistical Programming DC](http://www.meetup.com/stats-prog-dc/)\n", + "- #### [Data Wranglers DC](http://www.meetup.com/Data-Wranglers-DC/)\n", + "- #### [DC Natural Language Processing](http://dcnlp.org/)\n", + "\n", + "***\n", + "\n", + "## Introduction\n", + "Back in the dark ages of data science, each group or individual working in Natural Language Processing (NLP) generally maintained an assortment of homebrew utility programs designed to handle many of the common tasks involved with NLP. Despite everyone's best intentions, most of this code was lousy, brittle, and poorly documented -- not a good foundation upon which to build your masterpiece. Fortunately, over the past decade, mainstream open source software libraries like the [Natural Language Toolkit for Python (NLTK)](http://www.nltk.org/) have emerged to offer a collection of high-quality reusable NLP functionality. These libraries allow researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.\n", + "\n", + "This presentation will cover a handful of the NLP building blocks provided by NLTK (and a few additional libraries), including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. 
Several of these components will then be assembled to build a very basic document summarization program.\n", + "\n", + "[![Natural Language Processing with Python book cover](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/cat.gif)](http://oreilly.com/catalog/9780596516499/)\n", + "\n", + "### Initial Setup\n", + "Obviously, you'll need Python installed on your system to run the code examples used in this presentation. We enthusiatically recommend using [Anaconda](https://store.continuum.io/cshop/anaconda/), a Python distribution provided by [Continuum Analytics](http://www.continuum.io/). Anaconda is free to use, it includes nearly [200 of the most commonly used Python packages for data analysis](http://docs.continuum.io/anaconda/pkg-docs.html) (including NLTK), and it works on Mac, Linux, and yes, even Windows.\n", + "\n", + "[![Anaconda logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/anaconda_logo_web.png)](https://store.continuum.io/cshop/anaconda/)\n", + "\n", + "We'll make use of the following Python packages in the example code:\n", + "\n", + "- [nltk](http://www.nltk.org/install.html) (comes with Anaconda)\n", + "- [readability-lxml](https://github.com/buriy/python-readability)\n", + "- [BeautifulSoup4](http://www.crummy.com/software/BeautifulSoup/) (comes with Anaconda)\n", + "- [scikit-learn](http://scikit-learn.org/stable/install.html) (comes with Anaconda)\n", + "\n", + "Please note that the **readability** package is not distributed with Anaconda, so you'll need to download & install it separately using something like easy_install readability-lxml or pip install readability-lxml.\n", + "\n", + "If you don't use Anaconda, you'll also need to download & install the other packages separately using similar methods. Refer to the homepage of each package for instructions.\n", + "\n", + "You'll want to run nltk.download() one time to get all of the NLTK packages, corpora, etc. (see below). Select the \"all\" option. Depending on your network speed, this could take a while, but you'll only need to do it once.\n", + "\n", + "#### Java libraries (optional)\n", + "One of the examples will use NLTK's interface to the [Stanford Named Entity Recognizer](http://www-nlp.stanford.edu/software/CRF-NER.shtml#Download), which is distributed as a Java library. In particular, you'll want the following files handy in order to run this particular example:\n", + "\n", + "- stanford-ner.jar\n", + "- english.all.3class.distsim.crf.ser.gz\n", + "\n", + "[![Stanford NLP Group logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/stanford-nlp.jpg)](http://www-nlp.stanford.edu/software/CRF-NER.shtml#Download)\n", + "\n", + "***\n", + "\n", + "## Getting Started\n", + "The first thing we'll need to do is import nltk:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import nltk" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Downloading NLTK resources\n", + "The first time you run anything using NLTK, you'll want to go ahead and download the additional resources that aren't distributed directly with the NLTK package. Upon running the nltk.download() command below, the the NLTK Downloader window will pop-up. In the Collections tab, select \"all\" and click on Download. 
As mentioned earlier, this may take several minutes depending on your network connection speed, but you'll only ever need to run it a single time." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# nltk.download()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extracting text from HTML\n", + "Now the fun begins. We'll start with a pretty basic and commonly-faced task: extracting text content from an HTML page. Python's urllib package gives us the tools we need to fetch a web page from a given URL, but we see that the output is full of HTML markup that we don't want to deal with.\n", + "\n", + "(N.B.: Throughout the examples in this presentation, we'll use Python *slicing* (e.g., [:500] below) to only display a small portion of a string or list. Otherwise, if we displayed the entire item, sometimes it would take up the entire screen.)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "b'\\n\\n\\n 5\u001b[0;31m '/Users/cgreenba/stanford-ner/stanford-ner.jar', 'utf-8')\n\u001b[0m\u001b[1;32m 6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mst\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtag\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Up next is Tommy, who works at STPI in Washington.'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/nltk/tag/stanford.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 173\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 174\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__init__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 175\u001b[0;31m \u001b[0msuper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mStanfordNERTagger\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__init__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 176\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 177\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0mproperty\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/nltk/tag/stanford.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, model_filename, path_to_jar, encoding, verbose, java_options)\u001b[0m\n\u001b[1;32m 56\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_JAR\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpath_to_jar\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 57\u001b[0m \u001b[0msearchpath\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0murl\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0m_stanford_url\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 58\u001b[0;31m 
verbose=verbose)\n\u001b[0m\u001b[1;32m 59\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 60\u001b[0m self._stanford_model = find_file(model_filename,\n", + "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/nltk/__init__.py\u001b[0m in \u001b[0;36mfind_jar\u001b[0;34m(name_pattern, path_to_jar, env_vars, searchpath, url, verbose, is_regex)\u001b[0m\n\u001b[1;32m 719\u001b[0m searchpath=(), url=None, verbose=False, is_regex=False):\n\u001b[1;32m 720\u001b[0m return next(find_jar_iter(name_pattern, path_to_jar, env_vars,\n\u001b[0;32m--> 721\u001b[0;31m searchpath, url, verbose, is_regex))\n\u001b[0m\u001b[1;32m 722\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 723\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/nltk/__init__.py\u001b[0m in \u001b[0;36mfind_jar_iter\u001b[0;34m(name_pattern, path_to_jar, env_vars, searchpath, url, verbose, is_regex)\u001b[0m\n\u001b[1;32m 635\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 636\u001b[0m raise LookupError('Could not find %s jar file at %s' %\n\u001b[0;32m--> 637\u001b[0;31m (name_pattern, path_to_jar))\n\u001b[0m\u001b[1;32m 638\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 639\u001b[0m \u001b[0;31m# Check environment variables\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mLookupError\u001b[0m: Could not find stanford-ner.jar jar file at /Users/cgreenba/stanford-ner/stanford-ner.jar" + ] + } + ], + "source": [ + "from nltk.tag.stanford import StanfordNERTagger\n", + "\n", + "# change the paths below to point to wherever you unzipped the Stanford NER download file\n", + "st = StanfordNERTagger('/Users/cgreenba/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',\n", + " '/Users/cgreenba/stanford-ner/stanford-ner.jar', 'utf-8')\n", + "\n", + "for i in st.tag('Up next is Tommy, who works at STPI in Washington.'.split()):\n", + " print('[' + i[1] + '] ' + i[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Automatic Summarization\n", + "Now let's try to take some of what we've learned and build something potentially useful in real life: a program that will [automatically summarize](http://en.wikipedia.org/wiki/Automatic_summarization) documents. For this, we'll switch gears slightly, putting aside the web article we've been working on until now and instead using a corpus of documents distributed with NLTK.\n", + "\n", + "The Reuters Corpus contains nearly 11,000 news articles about a variety of topics and subjects. If you've run the nltk.download() command as previously recommended, you can then easily import and explore the Reuters Corpus like so:" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "** BEGIN ARTICLE: ** \"ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT\n", + " Mounting trade friction between the\n", + " U.S. And Japan has raised fears among many of Asia's exporting\n", + " nations that the row could inflict far-reaching economic\n", + " damage, businessmen and officials said.\n", + " They told Reuter correspondents in Asian capitals a U.S.\n", + " Move against Japan might boost protectionist sentiment in the\n", + " U.S. 
And lead to curbs on American imports of their products.\n", + " But some exporters said that while the conflict wo [...]\"\n" + ] + } + ], + "source": [ + "from nltk.corpus import reuters\n", + "\n", + "print('** BEGIN ARTICLE: ** \\\"' + reuters.raw(reuters.fileids()[0])[:500] + ' [...]\\\"')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our [painfully simplistic](http://anthology.aclweb.org/P/P11/P11-3014.pdf) automatic summarization tool will implement the following steps:\n", + "\n", + "- assign a score to each word in a document corresponding to its level of \"importance\"\n", + "- rank each sentence in the document by summing the individual word scores and dividing by the number of tokens in the sentence\n", + "- extract the top N highest scoring sentences and return them as our \"summary\"\n", + "\n", + "Sounds easy enough, right? But before we can say \"*voila!*,\" we'll need to figure out how to calculate an \"importance\" score for words. As we saw above with stop words, etc. simply counting the number of times a word appears in a document will not necessarily tell you which words are most important.\n", + "\n", + "#### Term Frequency - Inverse Document Frequency (TF-IDF)\n", + "\n", + "Consider a document that contains the word *baseball* 8 times. You might think, \"wow, *baseball* isn't a stop word, and it appeared rather frequently here, so it's probably important.\" And you might be right. But what if that document is actually an article posted on a baseball blog? Won't the word *baseball* appear frequently in nearly every post on that blog? In this particular case, if you were generating a summary of this document, would the word *baseball* be a good indicator of importance, or would you maybe look for other words that help distinguish or differentiate this blog post from the rest?\n", + "\n", + "Context is essential. What really matters here isn't the raw frequency of the number of times each word appeared in a document, but rather the **relative frequency** comparing the number of times a word appeared in this document against the number of times it appeared across the rest of the collection of documents. \"Important\" words will be the ones that are generally rare across the collection, but which appear with an unusually high frequency in a given document.\n", + "\n", + "We'll calculate this relative frequency using a statistical metric called [term frequency - inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf). We could implement TF-IDF ourselves using NLTK, but rather than bore you with the math, we'll take a shortcut and use the TF-IDF implementation provided by the [scikit-learn](http://scikit-learn.org/) machine learning library for Python.\n", + "\n", + "![Chevy Chase: \"It was my understanding that there would be no math.\"](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/i-was-told-there-would-be-no-math.jpg)\n", + "\n", + "#### Building a Term-Document Matrix\n", + "\n", + "We'll use scikit-learn's TfidfVectorizer class to construct a [term-document matrix](http://en.wikipedia.org/wiki/Document-term_matrix) containing the TF-IDF score for each word in each document in the Reuters Corpus. 
In essence, the rows of this sparse matrix correspond to documents in the corpus, the columns represent each word in the vocabulary of the corpus, and each cell contains the TF-IDF value for a given word in a given document.\n", + "\n", + "[![Scikit-learn logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/Scikit-learn_logo.png)](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)\n", + "\n", + "Inspired by a [computer science lab exercise from Duke University](http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html), the code sample below iterates through the Reuters Corpus to build a dictionary of stemmed tokens for each article, then uses the TfidfVectorizer and scikit-learn's own built-in stop words list to generate the term-document matrix containing TF-IDF scores." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "building term-document matrix... [process started: 2018-02-22 16:41:48.535374]\n", + "done! [process finished: 2018-02-22 16:42:29.088650]\n" + ] + } + ], + "source": [ + "import datetime, re, sys\n", + "from sklearn.feature_extraction.text import TfidfVectorizer\n", + "\n", + "def tokenize_and_stem(text):\n", + " tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]\n", + " filtered_tokens = []\n", + " # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)\n", + " for token in tokens:\n", + " if re.search('[a-zA-Z]', token):\n", + " filtered_tokens.append(token)\n", + " stems = [stemmer.stem(t) for t in filtered_tokens]\n", + " return stems\n", + "\n", + "token_dict = {}\n", + "for article in reuters.fileids():\n", + " token_dict[article] = reuters.raw(article)\n", + " \n", + "tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words='english', decode_error='ignore')\n", + "print('building term-document matrix... [process started: ' + str(datetime.datetime.now()) + ']')\n", + "sys.stdout.flush()\n", + "\n", + "tdm = tfidf.fit_transform(token_dict.values()) # this can take some time (about 60 seconds on my machine)\n", + "print('done! 
[process finished: ' + str(datetime.datetime.now()) + ']')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### TF-IDF Scores\n", + "\n", + "Now that we've built the term-document matrix, we can explore its contents:" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TDM contains 25833 terms and 10788 documents\n", + "first term: 'd\n", + "last term: zzzz\n", + "random term: trackag\n", + "random term: rush\n", + "random term: visa\n", + "random term: government-guarante\n" + ] + } + ], + "source": [ + "from random import randint\n", + "\n", + "feature_names = tfidf.get_feature_names()\n", + "print('TDM contains ' + str(len(feature_names)) + ' terms and ' + str(tdm.shape[0]) + ' documents')\n", + "\n", + "print('first term: ' + feature_names[0])\n", + "print('last term: ' + feature_names[len(feature_names) - 1])\n", + "\n", + "for i in range(0, 4):\n", + " print('random term: ' + feature_names[randint(1,len(feature_names) - 2)])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Generating the Summary\n", + "\n", + "That's all we'll need to produce a summary for any document in the corpus. In the example code below, we start by randomly selecting an article from the Reuters Corpus. We iterate through the article, calculating a score for each sentence by summing the TF-IDF values for each word appearing in the sentence. We normalize the sentence scores by dividing by the number of tokens in the sentence (to avoid bias in favor of longer sentences). Then we sort the sentences by their scores, and return the highest-scoring sentences as our summary. The number of sentences returned corresponds to roughly 20% of the overall length of the article.\n", + "\n", + "Since some of the articles in the Reuters Corpus are rather small (i.e., a single sentence in length) or contain just raw financial data, some of the summaries won't make sense. If you run this code a few times, however, you'll eventually see a randomly-selected article that provides a decent demonstration of this simplistic method of identifying the \"most important\" sentence from a document." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "*** SUMMARY ***\n", + "Thus the dollar firmed to close the period at 1.8320 marks\n", + " and 153.70 yen.\n", + "The dollar had plumbed a post-World War II low of 149.98\n", + " yen on January 19 and reached a seven-year low of 1.7675 marks\n", + " on January 28.\n", + "But on January 28, the dollar closed at 151.50/60 yen\n", + " after dipping as low as 150.40 yen earlier in the session.\n", + "The dollar had risen as high as 2.08 marks and 165 yen in\n", + " early November.\n", + "The Fed's quarterly review of foreign exchange operations\n", + " said that the U.S. bought 50 mln dlrs through the sale of yen\n", + " on January 28.\n", + "\n", + "*** ORIGINAL ***\n", + "U.S. INTERVENED TO AID DLR IN JANUARY, FED SAYS\n", + " U.S. authorities intervened in the\n", + " foreign exchange market to support the dollar on one occasion\n", + " during the period between the start of November 1986 and the\n", + " end of January, the Federal Reserve Bank of New York said in a\n", + " report.\n", + " The Fed's quarterly review of foreign exchange operations\n", + " said that the U.S. 
bought 50 mln dlrs through the sale of yen\n", + " on January 28. This operation was coordinated with the Japanese\n", + " monetary authorities and was funded equally by the Fed and the\n", + " U.S. Treasury.\n", + " The Fed's intervention was on the morning after president\n", + " Reagan's State of the Union message and was \"in a manner\n", + " consistent with the joint statement\" made by U.S. Treasury\n", + " secretary James Baker and Japanese finance minister Kiichi\n", + " Miyazawa after their January 21 consultations.\n", + " At that meeting, the two reaffirmed their willingness to\n", + " cooperate on exchange rate issues.\n", + " The Fed's report did not say at what level the intervention\n", + " occurred. But on January 28, the dollar closed at 151.50/60 yen\n", + " after dipping as low as 150.40 yen earlier in the session. It\n", + " had closed at 151.05/15 yen the previous day.\n", + " The dollar had plumbed a post-World War II low of 149.98\n", + " yen on January 19 and reached a seven-year low of 1.7675 marks\n", + " on January 28. It ended that day at 1.7820/30 marks.\n", + " The Fed noted that, after trading steadily throughout\n", + " November and the first half of December, the dollar moved\n", + " sharply lower until the end of January.\n", + " It closed the three-month review period down more than 11\n", + " pct against the mark and most other Continental currencies and\n", + " seven pct lower against the yen and sterling. It had fallen\n", + " four pct against the Canadian dollar.\n", + " During the final days of January, pressure on the dollar\n", + " subsided. Reports of the U.S.-Japanese intervention operation\n", + " and talk of an upcoming meeting of the major industrial\n", + " countries encouraged expectations for broader cooperation on\n", + " exchange rate and economic policy matters, the Fed said.\n", + " Moreover, doubts had developed about the course of U.S.\n", + " interest rates. The dollar's swift fall had raised questions\n", + " about whether the Fed would let short-term rates ease.\n", + " Thus the dollar firmed to close the period at 1.8320 marks\n", + " and 153.70 yen. 
According to the Fed's trade-weighted index, it\n", + " had declined nine pct since the beginning of the period.\n", + " The dollar had risen as high as 2.08 marks and 165 yen in\n", + " early November.\n", + " The Fed last intervened in the foreign exchange market on\n", + " November 7, 1985 when it bought a total of 102.2 mln dlrs worth\n", + " of marks and yen.\n", + " The Fed's action followed the September 1985 Plaza\n", + " agreement between the five major industrial nations under which\n", + " they agreed to promote an orderly decline of the dollar.\n", + " \n", + "\n", + "\n" + ] + } + ], + "source": [ + "import math\n", + "from __future__ import division\n", + "\n", + "article_id = randint(0, tdm.shape[0] - 1)\n", + "article_text = reuters.raw(reuters.fileids()[article_id])\n", + "\n", + "sent_scores = []\n", + "for sentence in nltk.sent_tokenize(article_text):\n", + " score = 0\n", + " sent_tokens = tokenize_and_stem(sentence)\n", + " for token in (t for t in sent_tokens if t in feature_names):\n", + " score += tdm[article_id, feature_names.index(token)]\n", + " sent_scores.append((score / len(sent_tokens), sentence))\n", + "\n", + "summary_length = int(math.ceil(len(sent_scores) / 5))\n", + "sent_scores.sort(key=lambda sent: sent[0], reverse=True)\n", + "\n", + "print('*** SUMMARY ***')\n", + "for summary_sentence in sent_scores[:summary_length]:\n", + " print(summary_sentence[1])\n", + "\n", + "print('\\n*** ORIGINAL ***')\n", + "print(article_text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Improving the Summary\n", + "That was fairly easy, but how could we improve the quality of the generated summary? Perhaps we could boost the importance of words found in the title or any entities we're able to extract from the text. After initially selecting the highest-scoring sentence, we might discount the TF-IDF scores for duplicate words in the remaining sentences in an attempt to reduce repetitiveness. We could also look at cleaning up the sentences used to form the summary by fixing any pronouns missing an antecedent, or even pulling out partial phrases instead of complete sentences. The possibilities are virtually endless.\n", + "\n", + "## Next Steps\n", + "Want to learn more? 
Start by working your way through all the examples in the NLTK book (aka \"the Whale book\"):\n", + "\n", + "[![Natural Language Processing with Python book cover](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/cat.gif)](http://oreilly.com/catalog/9780596516499/)\n", + "\n", + "- [Natural Language Processing with Python (book)](http://oreilly.com/catalog/9780596516499/)\n", + "- (free online version: [nltk.org/book](http://www.nltk.org/book/))\n", + "\n", + "### Additional NLP Resources for Python\n", + "- [NLTK HOWTOs](http://www.nltk.org/howto/)\n", + "- [Python Text Processing with NLTK 2.0 Cookbook (book)](http://www.packtpub.com/python-text-processing-nltk-20-cookbook/book)\n", + "- [Python wrapper for the Stanford CoreNLP Java library](https://pypi.python.org/pypi/corenlp)\n", + "- [guess_language (Python library for language identification)](https://bitbucket.org/spirit/guess_language)\n", + "- [MITIE (new C/C++-based NER library from MIT with a Python API)](https://github.com/mit-nlp/MITIE)\n", + "- [gensim (topic modeling library for Python)](http://radimrehurek.com/gensim/)\n", + "\n", + "### Attend future DC NLP meetups\n", + "\n", + "[![DC NLP logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/dcnlp.jpeg)](http://dcnlp.org/)\n", + "\n", + "- [dcnlp.org](http://dcnlp.org/) | [@DCNLP](https://twitter.com/DCNLP/)" + ] } - ] + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:anaconda3]", + "language": "python", + "name": "conda-env-anaconda3-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 } From 26ca49efbfb9171b7ccd03b0f19c8dd248273859 Mon Sep 17 00:00:00 2001 From: Surya Date: Thu, 22 Feb 2018 16:56:01 +0530 Subject: [PATCH 2/2] clear outputs --- ...attering of NLP in Python-checkpoint.ipynb | 562 ++++++++++++++++++ A Smattering of NLP in Python.ipynb | 519 ++-------------- .../anaconda_logo_web-checkpoint.png | Bin 0 -> 3834 bytes 3 files changed, 611 insertions(+), 470 deletions(-) create mode 100644 .ipynb_checkpoints/A Smattering of NLP in Python-checkpoint.ipynb create mode 100644 images/.ipynb_checkpoints/anaconda_logo_web-checkpoint.png diff --git a/.ipynb_checkpoints/A Smattering of NLP in Python-checkpoint.ipynb b/.ipynb_checkpoints/A Smattering of NLP in Python-checkpoint.ipynb new file mode 100644 index 0000000..6b1b96c --- /dev/null +++ b/.ipynb_checkpoints/A Smattering of NLP in Python-checkpoint.ipynb @@ -0,0 +1,562 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# A Smattering of NLP in Python\n", + "*by Charlie Greenbacker [@greenbacker](https://twitter.com/greenbacker)*\n", + "\n", + "[![Python Powered logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/python-powered-w-200x80.png)](https://www.python.org/)\n", + "\n", + "### Part of a [joint meetup on Natural Language Processing](http://www.meetup.com/stats-prog-dc/events/177772322/) - 9 July 2014\n", + "- #### [Statistical Programming DC](http://www.meetup.com/stats-prog-dc/)\n", + "- #### [Data Wranglers DC](http://www.meetup.com/Data-Wranglers-DC/)\n", + "- #### [DC Natural Language Processing](http://dcnlp.org/)\n", + "\n", + "***\n", + "\n", + "## Introduction\n", + "Back 
in the dark ages of data science, each group or individual working in Natural Language Processing (NLP) generally maintained an assortment of homebrew utility programs designed to handle many of the common tasks involved with NLP. Despite everyone's best intentions, most of this code was lousy, brittle, and poorly documented -- not a good foundation upon which to build your masterpiece. Fortunately, over the past decade, mainstream open source software libraries like the [Natural Language Toolkit for Python (NLTK)](http://www.nltk.org/) have emerged to offer a collection of high-quality reusable NLP functionality. These libraries allow researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.\n", + "\n", + "This presentation will cover a handful of the NLP building blocks provided by NLTK (and a few additional libraries), including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. Several of these components will then be assembled to build a very basic document summarization program.\n", + "\n", + "[![Natural Language Processing with Python book cover](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/cat.gif)](http://oreilly.com/catalog/9780596516499/)\n", + "\n", + "### Initial Setup\n", + "Obviously, you'll need Python installed on your system to run the code examples used in this presentation. We enthusiatically recommend using [Anaconda](https://store.continuum.io/cshop/anaconda/), a Python distribution provided by [Continuum Analytics](http://www.continuum.io/). Anaconda is free to use, it includes nearly [200 of the most commonly used Python packages for data analysis](http://docs.continuum.io/anaconda/pkg-docs.html) (including NLTK), and it works on Mac, Linux, and yes, even Windows.\n", + "\n", + "[![Anaconda logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/anaconda_logo_web.png)](https://store.continuum.io/cshop/anaconda/)\n", + "\n", + "We'll make use of the following Python packages in the example code:\n", + "\n", + "- [nltk](http://www.nltk.org/install.html) (comes with Anaconda)\n", + "- [readability-lxml](https://github.com/buriy/python-readability)\n", + "- [BeautifulSoup4](http://www.crummy.com/software/BeautifulSoup/) (comes with Anaconda)\n", + "- [scikit-learn](http://scikit-learn.org/stable/install.html) (comes with Anaconda)\n", + "\n", + "Please note that the **readability** package is not distributed with Anaconda, so you'll need to download & install it separately using something like easy_install readability-lxml or pip install readability-lxml.\n", + "\n", + "If you don't use Anaconda, you'll also need to download & install the other packages separately using similar methods. Refer to the homepage of each package for instructions.\n", + "\n", + "You'll want to run nltk.download() one time to get all of the NLTK packages, corpora, etc. (see below). Select the \"all\" option. Depending on your network speed, this could take a while, but you'll only need to do it once.\n", + "\n", + "#### Java libraries (optional)\n", + "One of the examples will use NLTK's interface to the [Stanford Named Entity Recognizer](http://www-nlp.stanford.edu/software/CRF-NER.shtml#Download), which is distributed as a Java library. 
In particular, you'll want the following files handy in order to run this particular example:\n", + "\n", + "- stanford-ner.jar\n", + "- english.all.3class.distsim.crf.ser.gz\n", + "\n", + "[![Stanford NLP Group logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/stanford-nlp.jpg)](http://www-nlp.stanford.edu/software/CRF-NER.shtml#Download)\n", + "\n", + "***\n", + "\n", + "## Getting Started\n", + "The first thing we'll need to do is import nltk:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import nltk" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Downloading NLTK resources\n", + "The first time you run anything using NLTK, you'll want to go ahead and download the additional resources that aren't distributed directly with the NLTK package. Upon running the nltk.download() command below, the the NLTK Downloader window will pop-up. In the Collections tab, select \"all\" and click on Download. As mentioned earlier, this may take several minutes depending on your network connection speed, but you'll only ever need to run it a single time." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# nltk.download()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extracting text from HTML\n", + "Now the fun begins. We'll start with a pretty basic and commonly-faced task: extracting text content from an HTML page. Python's urllib package gives us the tools we need to fetch a web page from a given URL, but we see that the output is full of HTML markup that we don't want to deal with.\n", + "\n", + "(N.B.: Throughout the examples in this presentation, we'll use Python *slicing* (e.g., [:500] below) to only display a small portion of a string or list. Otherwise, if we displayed the entire item, sometimes it would take up the entire screen.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from urllib.request import urlopen\n", + "\n", + "url = \"http://venturebeat.com/2014/07/04/facebooks-little-social-experiment-got-you-bummed-out-get-over-it/\"\n", + "html = urlopen(url).read()\n", + "html[:500]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Stripping-out HTML formatting\n", + "Fortunately, NTLK provides a method called clean_html() to get the raw text out of an HTML-formatted string. It's still not perfect, though, since the output will contain page navigation and all kinds of other junk that we don't want, especially if our goal is to focus on the body content from a news article, for example." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# will throw not implemented error\n", + "text = nltk.clean_html(html)\n", + "text[:500]\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from bs4 import BeautifulSoup\n", + "\n", + "soup = BeautifulSoup(html)\n", + "text = soup.get_text()\n", + "text[:500]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Identifying the Main Content\n", + "If we just want the body content from the article, we'll need to use two additional packages. 
The first is a Python port of a Ruby port of a Javascript tool called Readability, which pulls the main body content out of an HTML document and subsequently \"cleans it up.\" The second package, BeautifulSoup, is a Python library for pulling data out of HTML and XML files. It parses HTML content into easily-navigable nested data structure. Using Readability and BeautifulSoup together, we can quickly get exactly the text we're looking for out of the HTML, (*mostly*) free of page navigation, comments, ads, etc. Now we're ready to start analyzing this text content." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from readability.readability import Document\n", + "\n", + "readable_article = Document(html).summary()\n", + "readable_title = Document(html).title()\n", + "soup = BeautifulSoup(readable_article)\n", + "print('*** TITLE *** \\n\\\"' + readable_title + '\\\"\\n')\n", + "print('*** CONTENT *** \\n\\\"' + soup.text[:500] + '[...]\\\"')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Frequency Analysis\n", + "Here's a little secret: much of NLP (and data science, for that matter) boils down to counting things. If you've got a bunch of data that needs *analyzin'* but you don't know where to start, counting things is usually a good place to begin. Sure, you'll need to figure out exactly what you want to count, how to count it, and what to do with the counts, but if you're lost and don't know what to do, **just start counting**.\n", + "\n", + "Perhaps we'd like to begin (as is often the case in NLP) by examining the words that appear in our document. To do that, we'll first need to tokenize the text string into discrete words. Since we're working with English, this isn't so bad, but if we were working with a non-whitespace-delimited language like Chinese, Japanese, or Korean, it would be much more difficult.\n", + "\n", + "In the code snippet below, we're using two of NLTK's tokenize methods to first chop up the article text into sentences, and then each sentence into individual words. (Technically, we didn't need to use sent_tokenize(), but if we only used word_tokenize() alone, we'd see a bunch of extraneous sentence-final punctuation in our output.) By printing each token alphabetically, along with a count of the number of times it appeared in the text, we can see the results of the tokenization. Notice that the output contains some punctuation & numbers, hasn't been loweredcased, and counts *BuzzFeed* and *BuzzFeed's* separately. We'll tackle some of those issues next." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tokens = [word for sent in nltk.sent_tokenize(soup.text) for word in nltk.word_tokenize(sent)]\n", + "\n", + "for token in sorted(set(tokens))[:30]:\n", + " print(token + ' [' + str(tokens.count(token)) + ']')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Word Stemming\n", + "[Stemming](http://en.wikipedia.org/wiki/Stemming) is the process of reducing a word to its base/stem/root form. Most stemmers are pretty basic and just chop off standard affixes indicating things like tense (e.g., \"-ed\") and possessive forms (e.g., \"-'s\"). Here, we'll use the Snowball stemmer for English, which comes with NLTK.\n", + "\n", + "Once our tokens are stemmed, we can rest easy knowing that *BuzzFeed* and *BuzzFeed's* are now being counted together as... *buzzfe*? 
Don't worry: although this may look weird, it's pretty standard behavior for stemmers and won't affect our analysis (much). We also (probably) won't show the stemmed words to users -- we'll normally just use them for internal analysis or indexing purposes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from nltk.stem.snowball import SnowballStemmer\n",
+    "\n",
+    "stemmer = SnowballStemmer(\"english\")\n",
+    "stemmed_tokens = [stemmer.stem(t) for t in tokens]\n",
+    "\n",
+    "for token in sorted(set(stemmed_tokens))[50:75]:\n",
+    "    print(token + ' [' + str(stemmed_tokens.count(token)) + ']')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Lemmatization\n",
+    "\n",
+    "Although the stemmer very helpfully chopped off pesky affixes (and made everything lowercase to boot), there are some word forms that give stemmers indigestion, especially *irregular* words. While the process of stemming typically involves rule-based methods of stripping affixes (making them small & fast), **lemmatization** involves dictionary-based methods to derive the canonical forms (i.e., *lemmas*) of words. For example, *run*, *runs*, *ran*, and *running* all correspond to the lemma *run*. However, lemmatizers are generally big, slow, and brittle due to the nature of the dictionary-based methods, so you'll only want to use them when necessary.\n",
+    "\n",
+    "The example below compares the output of the Snowball stemmer with the WordNet lemmatizer (also distributed with NLTK). Notice that the lemmatizer correctly converts *women* into *woman*, while the stemmer turns *lying* into *lie*. Additionally, both replace *eyes* with *eye*, but neither of them properly transforms *told* into *tell*."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "lemmatizer = nltk.WordNetLemmatizer()\n",
+    "temp_sent = \"Several women told me I have lying eyes.\"\n",
+    "\n",
+    "print([stemmer.stem(t) for t in nltk.word_tokenize(temp_sent)])\n",
+    "print([lemmatizer.lemmatize(t) for t in nltk.word_tokenize(temp_sent)])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### NLTK Frequency Distributions\n",
+    "Thus far, we've been working with lists of tokens that we're manually sorting, uniquifying, and counting -- all of which can get to be a bit cumbersome. Fortunately, NLTK provides a data structure called FreqDist that makes it more convenient to work with these kinds of frequency distributions. The code snippet below builds a FreqDist from our list of stemmed tokens, and then displays the top 25 tokens appearing most frequently in the text of our article. Wasn't that easy?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fdist = nltk.FreqDist(stemmed_tokens)\n",
+    "\n",
+    "# most_common() returns (token, count) pairs sorted by frequency\n",
+    "for item in fdist.most_common(25):\n",
+    "    print(item)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Filtering out Stop Words\n",
+    "Notice in the output above that most of the top 25 tokens are worthless. With the exception of things like *facebook*, *content*, *user*, and perhaps *emot* (emotion?), the rest are basically devoid of meaningful information. They don't really tell us anything about the article since these tokens will appear in just about any English document. 
What we need to do is filter out these [*stop words*](http://en.wikipedia.org/wiki/Stop_words) in order to focus on just the important material.\n",
+    "\n",
+    "While there is no single, definitive list of stop words, NLTK provides a decent start. Let's load it up and take a look at what we get:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sorted(nltk.corpus.stopwords.words('english'))[:25]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we can use this list to filter out stop words from our list of stemmed tokens before we create the frequency distribution. You'll notice in the output below that we still have some things like punctuation that we'd probably like to remove, but we're much closer to having a list of the most \"important\" words in our article."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "stemmed_tokens_no_stop = [stemmer.stem(t) for t in stemmed_tokens if t not in nltk.corpus.stopwords.words('english')]\n",
+    "\n",
+    "fdist2 = nltk.FreqDist(stemmed_tokens_no_stop)\n",
+    "\n",
+    "for item in fdist2.most_common(25):\n",
+    "    print(item)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Named Entity Recognition\n",
+    "Another task we might want to do to help identify what's \"important\" in a text document is [named entity recognition (NER)](http://en.wikipedia.org/wiki/Named-entity_recognition). Also called *entity extraction*, this process involves automatically extracting the names of persons, places, organizations, and potentially other entity types out of unstructured text. Building an NER classifier requires *lots* of annotated training data and some [fancy machine learning algorithms](http://en.wikipedia.org/wiki/Conditional_random_field), but fortunately, NLTK comes with a pre-built/pre-trained NER classifier ready to extract entities right out of the box. This classifier has been trained to recognize PERSON, ORGANIZATION, and GPE (geo-political entity) entity types.\n",
+    "\n",
+    "(At this point, I should include a disclaimer stating [No True Computational Linguist](http://en.wikipedia.org/wiki/No_true_Scotsman) would ever use a pre-built NER classifier in the \"real world\" without first re-training it on annotated data representing their particular task. So please don't send me any hate mail -- I've done my part to stop the madness.)\n",
+    "\n",
+    "![Retrain my classifier models? 
Ain't nobody got time for that!](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/no_time.jpg)\n", + "\n", + "In the example below (inspired by [this gist from Gavin Hackeling](https://gist.github.com/gavinmh/4735528/) and [this post from John Price](http://freshlyminted.co.uk/blog/2011/02/28/getting-band-and-artist-names-nltk/)), we're defining a method to perform the following steps:\n", + "\n", + "- take a string as input\n", + "- tokenize it into sentences\n", + "- tokenize the sentences into words\n", + "- add part-of-speech tags to the words using nltk.pos_tag()\n", + "- run this through the NLTK-provided NER classifier using nltk.ne_chunk()\n", + "- parse these intermediate results and return any extracted entities\n", + "\n", + "We then apply this method to a sample sentence and parse the clunky output format provided by nltk.ne_chunk() (it comes as a [nltk.tree.Tree](http://www.nltk.org/_modules/nltk/tree.html)) to display the entities we've extracted. Don't let these nice results fool you -- NER output isn't always this satisfying. Try some other sample text and see what you get." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def extract_entities(text):\n", + " entities = []\n", + " for sentence in nltk.sent_tokenize(text):\n", + " chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))\n", + " entities.extend([chunk for chunk in chunks if hasattr(chunk, 'label')])\n", + " return entities\n", + "\n", + "for entity in extract_entities('My name is Charlie and I work for Altamira in Tysons Corner.'):\n", + " print('[' + entity.label() + '] ' + ' '.join(c[0] for c in entity.leaves()))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you're like me, you've grown accustomed over the years to working with the [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) library for Java, and you're suspicious of NLTK's built-in NER classifier (especially because it has *chunk* in the name). Thankfully, recent versions of NLTK contain an special NERTagger interface that enables us to make calls to Stanford NER from our Python programs, even though Stanford NER is a *Java library* (the horror!). [Not surprisingly](http://www.yurtopic.com/tech/programming/images/java-and-python.jpg), the Python NERTagger API is slightly less verbose than the native Java API for Stanford NER.\n", + "\n", + "To run this example, you'll need to follow the instructions for installing the optional Java libraries, as outlined in the **Initial Setup** section above. You'll also want to pay close attention to the comment that says # change the paths below to point to wherever you unzipped the Stanford NER download file." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nltk.tag.stanford import StanfordNERTagger\n", + "\n", + "# change the paths below to point to wherever you unzipped the Stanford NER download file\n", + "st = StanfordNERTagger('/Users/cgreenba/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',\n", + " '/Users/cgreenba/stanford-ner/stanford-ner.jar', 'utf-8')\n", + "\n", + "for i in st.tag('Up next is Tommy, who works at STPI in Washington.'.split()):\n", + " print('[' + i[1] + '] ' + i[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Automatic Summarization\n", + "Now let's try to take some of what we've learned and build something potentially useful in real life: a program that will [automatically summarize](http://en.wikipedia.org/wiki/Automatic_summarization) documents. For this, we'll switch gears slightly, putting aside the web article we've been working on until now and instead using a corpus of documents distributed with NLTK.\n", + "\n", + "The Reuters Corpus contains nearly 11,000 news articles about a variety of topics and subjects. If you've run the nltk.download() command as previously recommended, you can then easily import and explore the Reuters Corpus like so:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nltk.corpus import reuters\n", + "\n", + "print('** BEGIN ARTICLE: ** \\\"' + reuters.raw(reuters.fileids()[0])[:500] + ' [...]\\\"')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our [painfully simplistic](http://anthology.aclweb.org/P/P11/P11-3014.pdf) automatic summarization tool will implement the following steps:\n", + "\n", + "- assign a score to each word in a document corresponding to its level of \"importance\"\n", + "- rank each sentence in the document by summing the individual word scores and dividing by the number of tokens in the sentence\n", + "- extract the top N highest scoring sentences and return them as our \"summary\"\n", + "\n", + "Sounds easy enough, right? But before we can say \"*voila!*,\" we'll need to figure out how to calculate an \"importance\" score for words. As we saw above with stop words, etc. simply counting the number of times a word appears in a document will not necessarily tell you which words are most important.\n", + "\n", + "#### Term Frequency - Inverse Document Frequency (TF-IDF)\n", + "\n", + "Consider a document that contains the word *baseball* 8 times. You might think, \"wow, *baseball* isn't a stop word, and it appeared rather frequently here, so it's probably important.\" And you might be right. But what if that document is actually an article posted on a baseball blog? Won't the word *baseball* appear frequently in nearly every post on that blog? In this particular case, if you were generating a summary of this document, would the word *baseball* be a good indicator of importance, or would you maybe look for other words that help distinguish or differentiate this blog post from the rest?\n", + "\n", + "Context is essential. What really matters here isn't the raw frequency of the number of times each word appeared in a document, but rather the **relative frequency** comparing the number of times a word appeared in this document against the number of times it appeared across the rest of the collection of documents. 
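\n",
+    "\n",
+    "As a very rough sketch of the idea -- not the exact formula scikit-learn uses, since its TfidfVectorizer adds smoothing and normalization on top -- a bare-bones tf-idf score could be computed something like this (simple_tf_idf is just a toy helper invented for illustration):\n",
+    "\n",
+    "```python\n",
+    "import math\n",
+    "\n",
+    "def simple_tf_idf(term, doc_tokens, all_docs):\n",
+    "    # term frequency: how often the term appears in this document\n",
+    "    tf = doc_tokens.count(term) / len(doc_tokens)\n",
+    "    # document frequency: how many documents in the collection contain the term\n",
+    "    df = sum(1 for doc in all_docs if term in doc)\n",
+    "    # terms that are rare across the collection get a bigger idf boost\n",
+    "    idf = math.log(len(all_docs) / (1 + df))\n",
+    "    return tf * idf\n",
+    "```\n",
+    "\n",
+    "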
\"Important\" words will be the ones that are generally rare across the collection, but which appear with an unusually high frequency in a given document.\n", + "\n", + "We'll calculate this relative frequency using a statistical metric called [term frequency - inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf). We could implement TF-IDF ourselves using NLTK, but rather than bore you with the math, we'll take a shortcut and use the TF-IDF implementation provided by the [scikit-learn](http://scikit-learn.org/) machine learning library for Python.\n", + "\n", + "![Chevy Chase: \"It was my understanding that there would be no math.\"](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/i-was-told-there-would-be-no-math.jpg)\n", + "\n", + "#### Building a Term-Document Matrix\n", + "\n", + "We'll use scikit-learn's TfidfVectorizer class to construct a [term-document matrix](http://en.wikipedia.org/wiki/Document-term_matrix) containing the TF-IDF score for each word in each document in the Reuters Corpus. In essence, the rows of this sparse matrix correspond to documents in the corpus, the columns represent each word in the vocabulary of the corpus, and each cell contains the TF-IDF value for a given word in a given document.\n", + "\n", + "[![Scikit-learn logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/Scikit-learn_logo.png)](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)\n", + "\n", + "Inspired by a [computer science lab exercise from Duke University](http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html), the code sample below iterates through the Reuters Corpus to build a dictionary of stemmed tokens for each article, then uses the TfidfVectorizer and scikit-learn's own built-in stop words list to generate the term-document matrix containing TF-IDF scores." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import datetime, re, sys\n", + "from sklearn.feature_extraction.text import TfidfVectorizer\n", + "\n", + "def tokenize_and_stem(text):\n", + " tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]\n", + " filtered_tokens = []\n", + " # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)\n", + " for token in tokens:\n", + " if re.search('[a-zA-Z]', token):\n", + " filtered_tokens.append(token)\n", + " stems = [stemmer.stem(t) for t in filtered_tokens]\n", + " return stems\n", + "\n", + "token_dict = {}\n", + "for article in reuters.fileids():\n", + " token_dict[article] = reuters.raw(article)\n", + " \n", + "tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words='english', decode_error='ignore')\n", + "print('building term-document matrix... [process started: ' + str(datetime.datetime.now()) + ']')\n", + "sys.stdout.flush()\n", + "\n", + "tdm = tfidf.fit_transform(token_dict.values()) # this can take some time (about 60 seconds on my machine)\n", + "print('done! 
[process finished: ' + str(datetime.datetime.now()) + ']')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### TF-IDF Scores\n", + "\n", + "Now that we've built the term-document matrix, we can explore its contents:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from random import randint\n", + "\n", + "feature_names = tfidf.get_feature_names()\n", + "print('TDM contains ' + str(len(feature_names)) + ' terms and ' + str(tdm.shape[0]) + ' documents')\n", + "\n", + "print('first term: ' + feature_names[0])\n", + "print('last term: ' + feature_names[len(feature_names) - 1])\n", + "\n", + "for i in range(0, 4):\n", + " print('random term: ' + feature_names[randint(1,len(feature_names) - 2)])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Generating the Summary\n", + "\n", + "That's all we'll need to produce a summary for any document in the corpus. In the example code below, we start by randomly selecting an article from the Reuters Corpus. We iterate through the article, calculating a score for each sentence by summing the TF-IDF values for each word appearing in the sentence. We normalize the sentence scores by dividing by the number of tokens in the sentence (to avoid bias in favor of longer sentences). Then we sort the sentences by their scores, and return the highest-scoring sentences as our summary. The number of sentences returned corresponds to roughly 20% of the overall length of the article.\n", + "\n", + "Since some of the articles in the Reuters Corpus are rather small (i.e., a single sentence in length) or contain just raw financial data, some of the summaries won't make sense. If you run this code a few times, however, you'll eventually see a randomly-selected article that provides a decent demonstration of this simplistic method of identifying the \"most important\" sentence from a document." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import math\n", + "from __future__ import division\n", + "\n", + "article_id = randint(0, tdm.shape[0] - 1)\n", + "article_text = reuters.raw(reuters.fileids()[article_id])\n", + "\n", + "sent_scores = []\n", + "for sentence in nltk.sent_tokenize(article_text):\n", + " score = 0\n", + " sent_tokens = tokenize_and_stem(sentence)\n", + " for token in (t for t in sent_tokens if t in feature_names):\n", + " score += tdm[article_id, feature_names.index(token)]\n", + " sent_scores.append((score / len(sent_tokens), sentence))\n", + "\n", + "summary_length = int(math.ceil(len(sent_scores) / 5))\n", + "sent_scores.sort(key=lambda sent: sent[0], reverse=True)\n", + "\n", + "print('*** SUMMARY ***')\n", + "for summary_sentence in sent_scores[:summary_length]:\n", + " print(summary_sentence[1])\n", + "\n", + "print('\\n*** ORIGINAL ***')\n", + "print(article_text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Improving the Summary\n", + "That was fairly easy, but how could we improve the quality of the generated summary? Perhaps we could boost the importance of words found in the title or any entities we're able to extract from the text. After initially selecting the highest-scoring sentence, we might discount the TF-IDF scores for duplicate words in the remaining sentences in an attempt to reduce repetitiveness. 
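\n",
+    "\n",
+    "For instance, here is one rough sketch of that sort of redundancy penalty -- it simply skips sentences whose stems mostly overlap with sentences already chosen, and it assumes the sent_scores, tokenize_and_stem, and summary_length values from the cell above are still in memory:\n",
+    "\n",
+    "```python\n",
+    "selected, seen_stems = [], set()\n",
+    "for score, sentence in sorted(sent_scores, key=lambda s: s[0], reverse=True):\n",
+    "    stems = set(tokenize_and_stem(sentence))\n",
+    "    # fraction of this sentence's stems we've already seen in selected sentences\n",
+    "    overlap = len(stems & seen_stems) / (len(stems) or 1)\n",
+    "    if overlap < 0.5:  # keep only sentences that add mostly new content\n",
+    "        selected.append(sentence)\n",
+    "        seen_stems |= stems\n",
+    "    if len(selected) == summary_length:\n",
+    "        break\n",
+    "```\n",
+    "\n",
+    "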
We could also look at cleaning up the sentences used to form the summary by fixing any pronouns missing an antecedent, or even pulling out partial phrases instead of complete sentences. The possibilities are virtually endless.\n", + "\n", + "## Next Steps\n", + "Want to learn more? Start by working your way through all the examples in the NLTK book (aka \"the Whale book\"):\n", + "\n", + "[![Natural Language Processing with Python book cover](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/cat.gif)](http://oreilly.com/catalog/9780596516499/)\n", + "\n", + "- [Natural Language Processing with Python (book)](http://oreilly.com/catalog/9780596516499/)\n", + "- (free online version: [nltk.org/book](http://www.nltk.org/book/))\n", + "\n", + "### Additional NLP Resources for Python\n", + "- [NLTK HOWTOs](http://www.nltk.org/howto/)\n", + "- [Python Text Processing with NLTK 2.0 Cookbook (book)](http://www.packtpub.com/python-text-processing-nltk-20-cookbook/book)\n", + "- [Python wrapper for the Stanford CoreNLP Java library](https://pypi.python.org/pypi/corenlp)\n", + "- [guess_language (Python library for language identification)](https://bitbucket.org/spirit/guess_language)\n", + "- [MITIE (new C/C++-based NER library from MIT with a Python API)](https://github.com/mit-nlp/MITIE)\n", + "- [gensim (topic modeling library for Python)](http://radimrehurek.com/gensim/)\n", + "\n", + "### Attend future DC NLP meetups\n", + "\n", + "[![DC NLP logo](https://raw.githubusercontent.com/charlieg/A-Smattering-of-NLP-in-Python/master/images/dcnlp.jpeg)](http://dcnlp.org/)\n", + "\n", + "- [dcnlp.org](http://dcnlp.org/) | [@DCNLP](https://twitter.com/DCNLP/)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:anaconda3]", + "language": "python", + "name": "conda-env-anaconda3-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/A Smattering of NLP in Python.ipynb b/A Smattering of NLP in Python.ipynb index 0f32e4a..6b1b96c 100644 --- a/A Smattering of NLP in Python.ipynb +++ b/A Smattering of NLP in Python.ipynb @@ -57,10 +57,8 @@ }, { "cell_type": "code", - "execution_count": 3, - "metadata": { - "collapsed": false - }, + "execution_count": null, + "metadata": {}, "outputs": [], "source": [ "import nltk" @@ -76,10 +74,8 @@ }, { "cell_type": "code", - "execution_count": 4, - "metadata": { - "collapsed": false - }, + "execution_count": null, + "metadata": {}, "outputs": [], "source": [ "# nltk.download()" @@ -97,22 +93,9 @@ }, { "cell_type": "code", - "execution_count": 5, - "metadata": { - "collapsed": false - }, - "outputs": [ - { - "data": { - "text/plain": [ - "b'\\n\\n\\n 5\u001b[0;31m '/Users/cgreenba/stanford-ner/stanford-ner.jar', 'utf-8')\n\u001b[0m\u001b[1;32m 6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mst\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtag\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Up next is Tommy, who works at STPI in Washington.'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - 
"\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/nltk/tag/stanford.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 173\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 174\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__init__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 175\u001b[0;31m \u001b[0msuper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mStanfordNERTagger\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__init__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 176\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 177\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0mproperty\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/nltk/tag/stanford.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, model_filename, path_to_jar, encoding, verbose, java_options)\u001b[0m\n\u001b[1;32m 56\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_JAR\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpath_to_jar\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 57\u001b[0m \u001b[0msearchpath\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0murl\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0m_stanford_url\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 58\u001b[0;31m verbose=verbose)\n\u001b[0m\u001b[1;32m 59\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 60\u001b[0m self._stanford_model = find_file(model_filename,\n", - "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/nltk/__init__.py\u001b[0m in \u001b[0;36mfind_jar\u001b[0;34m(name_pattern, path_to_jar, env_vars, searchpath, url, verbose, is_regex)\u001b[0m\n\u001b[1;32m 719\u001b[0m searchpath=(), url=None, verbose=False, is_regex=False):\n\u001b[1;32m 720\u001b[0m return next(find_jar_iter(name_pattern, path_to_jar, env_vars,\n\u001b[0;32m--> 721\u001b[0;31m searchpath, url, verbose, is_regex))\n\u001b[0m\u001b[1;32m 722\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 723\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/nltk/__init__.py\u001b[0m in \u001b[0;36mfind_jar_iter\u001b[0;34m(name_pattern, path_to_jar, env_vars, searchpath, url, verbose, is_regex)\u001b[0m\n\u001b[1;32m 635\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 636\u001b[0m raise LookupError('Could not find %s jar file at %s' %\n\u001b[0;32m--> 637\u001b[0;31m (name_pattern, path_to_jar))\n\u001b[0m\u001b[1;32m 638\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 639\u001b[0m \u001b[0;31m# Check environment variables\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mLookupError\u001b[0m: Could not find stanford-ner.jar jar file at /Users/cgreenba/stanford-ner/stanford-ner.jar" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from nltk.tag.stanford import StanfordNERTagger\n", "\n", @@ -665,27 +367,9 @@ }, { "cell_type": "code", - 
"execution_count": 16, - "metadata": { - "collapsed": false - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "** BEGIN ARTICLE: ** \"ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT\n", - " Mounting trade friction between the\n", - " U.S. And Japan has raised fears among many of Asia's exporting\n", - " nations that the row could inflict far-reaching economic\n", - " damage, businessmen and officials said.\n", - " They told Reuter correspondents in Asian capitals a U.S.\n", - " Move against Japan might boost protectionist sentiment in the\n", - " U.S. And lead to curbs on American imports of their products.\n", - " But some exporters said that while the conflict wo [...]\"\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from nltk.corpus import reuters\n", "\n", @@ -725,20 +409,9 @@ }, { "cell_type": "code", - "execution_count": 17, - "metadata": { - "collapsed": false - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "building term-document matrix... [process started: 2018-02-22 16:41:48.535374]\n", - "done! [process finished: 2018-02-22 16:42:29.088650]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "import datetime, re, sys\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", @@ -776,25 +449,9 @@ }, { "cell_type": "code", - "execution_count": 21, - "metadata": { - "collapsed": false - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "TDM contains 25833 terms and 10788 documents\n", - "first term: 'd\n", - "last term: zzzz\n", - "random term: trackag\n", - "random term: rush\n", - "random term: visa\n", - "random term: government-guarante\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from random import randint\n", "\n", @@ -821,87 +478,9 @@ }, { "cell_type": "code", - "execution_count": 28, - "metadata": { - "collapsed": false - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "*** SUMMARY ***\n", - "Thus the dollar firmed to close the period at 1.8320 marks\n", - " and 153.70 yen.\n", - "The dollar had plumbed a post-World War II low of 149.98\n", - " yen on January 19 and reached a seven-year low of 1.7675 marks\n", - " on January 28.\n", - "But on January 28, the dollar closed at 151.50/60 yen\n", - " after dipping as low as 150.40 yen earlier in the session.\n", - "The dollar had risen as high as 2.08 marks and 165 yen in\n", - " early November.\n", - "The Fed's quarterly review of foreign exchange operations\n", - " said that the U.S. bought 50 mln dlrs through the sale of yen\n", - " on January 28.\n", - "\n", - "*** ORIGINAL ***\n", - "U.S. INTERVENED TO AID DLR IN JANUARY, FED SAYS\n", - " U.S. authorities intervened in the\n", - " foreign exchange market to support the dollar on one occasion\n", - " during the period between the start of November 1986 and the\n", - " end of January, the Federal Reserve Bank of New York said in a\n", - " report.\n", - " The Fed's quarterly review of foreign exchange operations\n", - " said that the U.S. bought 50 mln dlrs through the sale of yen\n", - " on January 28. This operation was coordinated with the Japanese\n", - " monetary authorities and was funded equally by the Fed and the\n", - " U.S. 
Treasury.\n", - " The Fed's intervention was on the morning after president\n", - " Reagan's State of the Union message and was \"in a manner\n", - " consistent with the joint statement\" made by U.S. Treasury\n", - " secretary James Baker and Japanese finance minister Kiichi\n", - " Miyazawa after their January 21 consultations.\n", - " At that meeting, the two reaffirmed their willingness to\n", - " cooperate on exchange rate issues.\n", - " The Fed's report did not say at what level the intervention\n", - " occurred. But on January 28, the dollar closed at 151.50/60 yen\n", - " after dipping as low as 150.40 yen earlier in the session. It\n", - " had closed at 151.05/15 yen the previous day.\n", - " The dollar had plumbed a post-World War II low of 149.98\n", - " yen on January 19 and reached a seven-year low of 1.7675 marks\n", - " on January 28. It ended that day at 1.7820/30 marks.\n", - " The Fed noted that, after trading steadily throughout\n", - " November and the first half of December, the dollar moved\n", - " sharply lower until the end of January.\n", - " It closed the three-month review period down more than 11\n", - " pct against the mark and most other Continental currencies and\n", - " seven pct lower against the yen and sterling. It had fallen\n", - " four pct against the Canadian dollar.\n", - " During the final days of January, pressure on the dollar\n", - " subsided. Reports of the U.S.-Japanese intervention operation\n", - " and talk of an upcoming meeting of the major industrial\n", - " countries encouraged expectations for broader cooperation on\n", - " exchange rate and economic policy matters, the Fed said.\n", - " Moreover, doubts had developed about the course of U.S.\n", - " interest rates. The dollar's swift fall had raised questions\n", - " about whether the Fed would let short-term rates ease.\n", - " Thus the dollar firmed to close the period at 1.8320 marks\n", - " and 153.70 yen. 
According to the Fed's trade-weighted index, it\n", - " had declined nine pct since the beginning of the period.\n", - " The dollar had risen as high as 2.08 marks and 165 yen in\n", - " early November.\n", - " The Fed last intervened in the foreign exchange market on\n", - " November 7, 1985 when it bought a total of 102.2 mln dlrs worth\n", - " of marks and yen.\n", - " The Fed's action followed the September 1985 Plaza\n", - " agreement between the five major industrial nations under which\n", - " they agreed to promote an orderly decline of the dollar.\n", - " \n", - "\n", - "\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "import math\n", "from __future__ import division\n", diff --git a/images/.ipynb_checkpoints/anaconda_logo_web-checkpoint.png b/images/.ipynb_checkpoints/anaconda_logo_web-checkpoint.png new file mode 100644 index 0000000000000000000000000000000000000000..5774facee0df1e7156f75c95d6852b172605e1ca GIT binary patch literal 3834 zcmaJ@X*d*Y)PBZz#|#FOZAh|~WSMO7iXlXnh*FjqOJjR2uk2%)kToh<$}&-@Bq7_# zHukI|D%sbPhM2MsGoSbR{(Qf_Kj*ry^XJ^xeVzN!yIeYinL#rR~DViD>Jj_-) z^0X*7mNVDHy#-Z4K!cw45X|d?;FM1Ty_IXe9?3} z7{?sT*`GKL)vVp2QB}A)Vw81i-~s0JP)J*hf^;~9xB$lrhwPq#5aUu?s$CC~tS}H_ zL2698Otd+eou6;XIjYW6ah!e)(Meqal=Ke&BV12{*?{oMma<@b9w5Ki!LI0JHlm2M zWY)iHdV_h8BxRw+mgE5lLae6yydX(}$prH!j{`pa{Q><$ zLdX1(ddB;oTMwb2Kd*%OK4D--RREaaz=D1!)e1;h?N7D-2lV8adouH^U;AKMj-uXm zpuoolS0m23&$sxar6l1MlFY66`ID0HSqbLvjzi|LXOZjUpoD|?n9ytnPf zpSuynyO{OK1LuKPr*ns=SMi^beG1uhY zE%=E*2zcMtD=4(ze@bZVD`9o{FYqH-KB$cK=Y^-1M!r5K_UErwkF3$}O`FU9V8z)KS2a-8x^63{G#OuikK<cGDlkQig8qeqdZ-ea-W+&#?!QE(V1qKuNgJ@5Olrd4lyIB^*D6x;!)jZ zA1jsOY;{P05tQ6HE$n~iIK%9`es>yzKUN*ZZY&goQl)zs1v3KguQ3o2kbEKbL=9q> zTYmw8S^K@xF~cVDR*}m9HvuAyC~C&>U8#|hEL2)j=XHYpihuFH_I3njU_@^V{LiN%#@s?dE)(nV?6`AroV3nTc2pzm;AZQ zB(;NDc>D5uR++&&g-(z6E&L2s?CvJ@srd3YZElIzvvTvwYF#x5+pQ?)=#A7ph7+ZQ z#W+4gZ}DtYdW3wHq@)=OQ7gJ(-|DzaGF`iTZ9%Zkq`e+FEA*~wO8Pc=RE6*?A!Hg$Z(+faHsUX;p4VM^(|}r&OKKKzb|nqYtz!|y$NiH z?`DMu6#aaP#b3i5TnKJ82NQ8Fqc99d5bGts2$32=G&`2Xvm_XEYJkZRX}HiVT^=Pz zFp<)Dz)duiSos3(s)gNQFuIG4&er<)PLqSlNowp8D0rd;;7O4phfiKB2n#zCHaQ{N zoCTKw!Y{knd$2upm+-dFcW$~vJ(D%QK7*KNbe@U*)q)^*%m1#RgiTu>Rc(de9|pB- zXPtKlJ-j{DO*{iGtqRMzT(^!ZAQGmMTQcF-Dt@(GYxaMJ@fk9+TtSh7i4e*!?!qG{ zz<{`I+n8d1n(u=fhFV{-&W36hFWBZVFr3VSZ7Z5=x^b>vy=KW;-sxF8m{oucae9CR zWnmow>BU#7QatH+0E5Cs_i!M<6Z%SsIo)l9#FMh>O-C!o1|SS|zMaEO>` zH$0Z0Q-XDy=JTq!29=W?39bDLDcJSulwYDyi^a#O&|?=ME1m4Mwe{I;b8$YGD29=> z^0@JgWh^-!5|kmqv#<8GbPa|!`WULc>*rz=-#U#f+>yeHJ36>m`BtX&u}NshmxOLa z>D#heB`s!8I0`E+o_*%>#_G+^K+p_@M!p1^G zwIO%@00`vZ#|WRjE&0tw7OY`zI}j0fPYOjF#Z#`{)r$zhiSK3>EYkK+x{jaOU#Hz{ ziK5I*-Fm$8#p6gMIO^ctU2~?gz%`#oYA>CxLkS#hzQOmWv4%T@!yzcAztm_0MB}X- z&QO)vuWF%t4%RIFO%{W<_JY>p|K~o|m^6MNMfCaFn2o~MbChYty$~=iAtI%)hI+usJiM@D#Z}M5Xr9bfDC2Od` zk>(HU%PQWp>WdlVF$OmxVSFn#A&o@E(Mqj&E-VDBe*9ulBvF0Bfx6pY>tu1laxa8y zZS-u?T>xaIUP0jKJdXMVY!#xO3niZNFwu+c1?-4K%*aYXa@8NClQcX~w zt_nTfnDs-|F8@?(P0E1kwjCn*^fid`H!X~ri&gsVQPR1$H@MQi*K=6Qi4alKaTl*s z7YxevkNBin-%UGXf$Oh*x%WG3)=T=(ymCCFgj_)jiZQzQca2$5Pq#*+OK~tVoT*aT z6LKCZCqUClL0l16Bs^KRMAW7T%GkY|JniJa4)49VYxmM{ep|g!a=76b8&d9D)ZLU( zo0+JuguAP$<}s!foAC8%=I-=mJtem<5weI|Ez>#=z74xx<z8B)At-fY?C{yUy(EmbN|tLqa?xOY$SO!aM8MW#)_MO%%d zO>eYj$IQC33F|{+KkkoFk7cx_WkZ74=05e>#QM%&zjgY-l6Smr=Fu{`=)?2MuOrCK zsN6JaSRn1#xAV||`ow5yGv)A*QLzLwCN2r>cDZh`Tm>uysSc9{8#vlg$Vj(T*Z=2_)qr~#S2teM7jt?2|Mi`VG~={<}Xs5vzof5 zPH;K>TBq@0|G3E=O?DpYyDXOv6@lfs4r2p(+=ZMgV=4A@4Gh{|7P^kOZ~YOZB)VHM zth^aBw-Rk%*!QT*^sLC|ZP=!6NMeY#k{4jN1H9*|s41{nez(iqh1Ne*d@K;Prh-RZ z!S-S;*OZ!8p`e2P(a-KD{fn>-;-)w 
zRSHhEh90HZc3riWw;UOeMiUd()qdJ7=*m;kAAel+?()i6F2^RRXqY#dEKI?Mj3c6* zp&u|r85pLq6}T!ITCa(E=L9antO$?Bn)=ZIHcQV_vd??ss_+`Jh6xKOl*x<=Kq=^eXrDAFMiF`=V4g@uE!hlgd?XZd}=> zzks1_`1G@69PoXCO?2d%-aLFh62|t#Q%2+l3}1;-GQWcH0#UtEwYWJSaLUs5wdz3$ zrl+5(&0Y8*=+zGgejxH|vW z+?3pRqlwW%TT}n|mMDgJ0|z`KY|w@N)R^|S>{AJWx3? z!;kJWZzKOv*^ZNXP4WA7ONT7uKiWUJX_$NJPhM(Bk3-`K_^rUKu5&*z&)jzEd_-%o z7BLohtmKoPxAp-z+Fw}BA~u&j{I35v`~^WFi#kLvfXzx7jxJFBVhKs%Vs;yq^;_l! z4XRMMVCQ&mSudT`5byd)4f3S%V*`>>*IlF5n9)(v7vS$Bhjs^7Uton5)bqVOV0=<3 ze$wRA_ZV+gclWrTNVnbyR9Dje6j@%!0VOZ_ohaHdn6DOSp217MEg;{F_(Ksj;HgrD zhBb2J(pF^oG2