The nltk Package


Last updated 5 years ago


An original version of this guide was contributed by Mike Zhu (@mz888).

The nltk (Natural Language Toolkit) package is a good introduction to some common Natural Language Processing (NLP) tasks, including Sentiment Analysis, Named Entity Recognition, and document preprocessing. You can also use nltk to download text corpora for practice or to serve as training data for machine learning applications.

Reference:

  • http://www.nltk.org/
  • http://www.nltk.org/book

Installation

First, install the package using pip, if necessary:

pip install nltk

Usage

Sentiment Analysis

Sentiment Analysis is one of the most widely used NLP techniques. Among the modules available from nltk is the VADER Sentiment Analyzer, a relatively simple, lexicon-based (vocabulary-based) tool for measuring sentiment.
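To make the idea of a vocabulary-based tool concrete, here is a toy sketch of the general approach: look up each word in a dictionary of weights and add up the results. This is not how VADER itself is implemented (VADER also applies rules for things like negation and punctuation), and the `TOY_LEXICON` words, weights, and the `toy_sentiment` function are invented purely for illustration.

```python
# Hypothetical mini-lexicon: these words and weights are made up for illustration only.
TOY_LEXICON = {"fantastic": 3.0, "useful": 2.0, "cruel": -3.0, "unusual": -1.0}

def toy_sentiment(text, lexicon=TOY_LEXICON):
    """Sum the lexicon weights of the words in text; unknown words count as 0."""
    words = text.lower().split()
    # strip basic punctuation so that "useful." still matches "useful"
    return sum(lexicon.get(word.strip(".,!?"), 0.0) for word in words)

print(toy_sentiment("Python is fantastic and useful"))  # 5.0
print(toy_sentiment("R is cruel and unusual"))          # -4.0
```

A positive total suggests positive sentiment and a negative total suggests negative sentiment. A real lexicon covers thousands of words with carefully rated weights.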

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# you will need to download the VADER sentiment lexicon the first time you use it.
# passing the resource name downloads it directly;
# calling nltk.download() with no arguments opens an interactive downloader instead.
nltk.download("vader_lexicon")

sid = SentimentIntensityAnalyzer()

# let's make two sample sentences to test out Vader
positive = "Python is fantastic and useful"
negative = "R is cruel and unusual"

# VADER outputs negative, neutral, and positive scores, plus a compound score
sid.polarity_scores(positive)
#> {'neg': 0.0, 'compound': 0.7579, 'neu': 0.316, 'pos': 0.684} # the compound score is the overall score of the text, normalized between -1 (most negative) and +1 (most positive)
sid.polarity_scores(negative)
#> {'neg': 0.559, 'compound': -0.5859, 'neu': 0.441, 'pos': 0.0}
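The score dictionaries above can be turned into a single label. The sketch below assumes the cutoff of 0.05 on the compound score that is conventionally suggested for VADER; the `label_sentiment` function name and its `threshold` parameter are hypothetical, and you may want a different cutoff for your own data.

```python
def label_sentiment(scores, threshold=0.05):
    """Convert a VADER-style score dictionary into a simple text label."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"  # compound score is too close to zero to call

# the two score dictionaries shown above
print(label_sentiment({'neg': 0.0, 'compound': 0.7579, 'neu': 0.316, 'pos': 0.684}))
print(label_sentiment({'neg': 0.559, 'compound': -0.5859, 'neu': 0.441, 'pos': 0.0}))
```

In practice you would pass it the result of the analyzer directly, for example label_sentiment(sid.polarity_scores(positive)).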

Entity Identification

Let's say we want to find instances of people, places, or other named entities in a document. This NLP task, called "Named Entity Recognition" (NER), can also be implemented with nltk.

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# The first time you run this, nltk needs to download some resources (a tokenizer, a tagger, and a chunker).
# The resource names vary by nltk version; if one is missing, the LookupError message names it, for example:
# nltk.download("punkt")

# NER in nltk actually takes 3 discrete steps. First, we tokenize the sentence by splitting it up into words.
sent = "Derek Jeter met Mariano Rivera in New York."
token = word_tokenize(sent)
token #> ['Derek', 'Jeter', 'met', 'Mariano', 'Rivera', 'in', 'New', 'York', '.']

# Then, we employ part-of-speech tagging to label the grammatical role of each word
tagged = pos_tag(token)
# notice that each word in the list below is designated a grammatical label: NNP, for example, is a proper noun
tagged #> [('Derek', 'NNP'), ('Jeter', 'NNP'), ('met', 'VBD'), ('Mariano', 'NNP'), ('Rivera', 'NNP'), ('in', 'IN'), ('New', 'NNP'), ('York', 'NNP'), ('.', '.')]

# Finally, we use the ne_chunk function to group the tagged words into named entities.
chunk = ne_chunk(tagged)
# Each Named Entity is assigned a type; New York is identified as a GPE (geopolitical entity)
# (the exact grouping can vary with the nltk version and its models)
chunk #> Tree('S', [Tree('PERSON', [('Derek', 'NNP')]), Tree('PERSON', [('Jeter', 'NNP')]), ('met', 'VBD'), Tree('PERSON', [('Mariano', 'NNP'), ('Rivera', 'NNP')]), ('in', 'IN'), Tree('GPE', [('New', 'NNP'), ('York', 'NNP')]), ('.', '.')])

# Can you think of a way to clean up the output?
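One possible answer to the question above, offered as a sketch rather than the only approach: loop over the top level of the tree and keep just the labeled subtrees, joining their words into a single string. The `extract_entities` function name is hypothetical; it relies on the `label()` and `leaves()` methods of nltk's Tree objects, while plain tokens are ordinary (word, tag) tuples without those methods.

```python
def extract_entities(tree):
    """Return a list of (entity_type, entity_text) pairs from a chunked sentence."""
    entities = []
    for node in tree:
        # named entities are labeled subtrees; ordinary tokens are (word, tag) tuples
        if hasattr(node, "label"):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities
```

Calling extract_entities(chunk) on the tree shown above should give [('PERSON', 'Derek'), ('PERSON', 'Jeter'), ('PERSON', 'Mariano Rivera'), ('GPE', 'New York')].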

The nltk package contains many modules with different functionalities. Consult the NLTK website (http://www.nltk.org/) and the NLTK book (http://www.nltk.org/book), as well as other online guides, to explore its many uses.