Introduction to Text Mining in R
Introduction
This workshop introduces text analysis techniques in R, an open source programming language. Some familiarity with R is required. To prepare, you can attend our Intro to R course or take DataCamp's free online tutorial: https://www.datacamp.com/courses/free-introduction-to-r
Topics
- Preprocessing text
- Creating a DTM (document term matrix)
- Working with DTMs
- Word frequencies
- Wordclouds - of course!
- Discriminating / Distinctive Words
- Time permitting: Dictionary Methods (e.g. Sentiment Analysis)
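To give a feel for the first few topics, here is a minimal sketch of preprocessing, building a document term matrix, and counting word frequencies. It deliberately uses only base R so it runs before any packages are installed; in the workshop itself the tm package (for example its DocumentTermMatrix() function) does this work. The two sample documents and the helper name tokenize are made up for illustration.

```r
# Two toy documents (invented for this example)
docs <- c(doc1 = "Text mining in R is fun.",
          doc2 = "Mining text: fun, fun, fun!")

# Preprocess one document: lowercase, strip punctuation, split on whitespace
tokenize <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  strsplit(trimws(x), "\\s+")[[1]]
}
tokens <- lapply(docs, tokenize)

# Vocabulary: every distinct term across all documents
vocab <- sort(unique(unlist(tokens)))

# Document term matrix: one row per document, one column per term,
# each cell holding how often the term occurs in that document
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Word frequencies over the whole collection, most frequent first
word_freq <- sort(colSums(dtm), decreasing = TRUE)
print(dtm)
print(word_freq)
```

Each row of dtm is a document and each column a term, which is the structure the later topics (word frequencies, wordclouds, distinctive words) build on.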
Dates
November 16, 2016 (9:00 AM - 12:00 PM)
Location
Biomedical Library Classroom 4
Things not covered
- Document Similarity
- Topic modeling
- Clustering
Instructors
Audience
All graduate students and researchers.
Setup
This lesson assumes you have R and RStudio installed on your computer.
R can be downloaded here.
RStudio is an environment for developing in R. It can be downloaded here. You will need the Desktop version for your computer.
We will use the following R packages. If you can, install them before class:
Required R Packages
- tm: text mining in R
- RTextTools: a machine learning package for text classification
- qdap: quantitative discourse analysis
- qdapDictionaries: for sentiment analysis, etc.
- entropy: tools applying Information Theory
- dplyr: data preparation and pipes (%>%)
- ggplot2: for plotting
- SnowballC: for stemming
- matrixStats: for stats
- data.table: for easier data manipulation
- scales: to help us plot
- lsa: latent semantic analysis
- cluster: for clustering analysis
- fpc: flexible procedures for clustering
- mallet: a wrapper around the Java machine learning tool MALLET
- wordcloud: to visualize wordclouds
- rJava: dependency for mallet
- Any dependencies of the packages above
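One possible way to install these before class is sketched below. The package names are copied from the list above; the helper names find_missing and install_missing are hypothetical, and the sketch assumes every package is still available from your configured repository (install.packages() will warn about any it cannot find). Note that rJava, and therefore mallet, also needs a working Java installation on your machine.

```r
# Package names copied from the Required R Packages list
required_packages <- c(
  "tm", "RTextTools", "qdap", "qdapDictionaries", "entropy", "dplyr",
  "ggplot2", "SnowballC", "matrixStats", "data.table", "scales", "lsa",
  "cluster", "fpc", "mallet", "wordcloud", "rJava"
)

# Return the packages in `pkgs` that are not installed yet
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install only what is missing, pulling in
# dependencies as well; returns the names it tried to install
install_missing <- function(pkgs) {
  to_install <- find_missing(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, dependencies = TRUE)
  }
  invisible(to_install)
}
```

Running the block only defines the helpers. To actually install, call install_missing(required_packages) from the R console, then run find_missing(required_packages) again to see whether anything is still outstanding.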
Data
https://drive.google.com/open?id=0ByRar-ghNtRlNGpENWJmNGNlS2s
Resources
- Collaborative Notes: Etherpad
- Computational Text Analysis Workshop Materials by Rochelle Terman (@rochelleterman)
- Text analysis with R for students of literature
- Guide to Text Corpora
- Crimson Hexagon
Credits
- Rochelle Terman https://github.com/rochelleterman/text-analysis-dhbsi
- DataCamp.com for some examples