Machine Learning for the Analysis of Text As Data (Chapel Hill, NC)


  • Brice Acree, University of North Carolina

Quantitative analysis of digitized text is an exciting and challenging frontier of data science across a broad spectrum of disciplines. From identifying patients with diabetes in physicians' notes to assessing global happiness through speech on Twitter, patterns in massive text corpora have led to important scientific advances. In this course, we will cover several central computational and statistical methods for analyzing text as data. Topics will include the manipulation and summarization of text data, dictionary methods of text analysis, prediction and classification with textual data, document clustering, text reuse measurement, and statistical topic models. Each method will be illustrated with hands-on examples in R. Participants will develop an understanding of the challenges and opportunities presented by the analysis of text as data, as well as the practical computational skills to complete independent analyses. The R packages covered in this course include tm, lda, textreuse, glmnet, and openNLP.

One distinguishing focus of this course will be the use of text analytics for the reliable and valid development and testing of scientific theory. Most methods of text analysis were developed with predictive or descriptive motivations. For each method we cover, we will review how it has been, and can be, applied to draw theoretical inferences about the processes surrounding text generation.

Fee: Members = $1700; Non-members = $3200

Tags: text, machine learning

Course Sections

Section 1

Location: University of North Carolina -- Chapel Hill, NC

Date(s): June 19 - June 23

Time: 9:00 AM - 5:00 PM
