Text Analytics


Statistical methods for the analysis of textual data have come of age. These techniques allow you to mine textual data for underlying sentiments, scan for hateful or discriminatory language, or create features that improve familiar predictive models. These methods do more than simply count, though counting through millions of words is in itself impressive. Along with counting, algorithms developed in the field of natural language processing, or NLP, provide richer grammatical and syntactic analysis, such as identifying parts of speech, parsing sentences (e.g., into subject and predicate), and recognizing named entities (people and places). Modern software tools allow the routine use of these methods by non-specialists -- at least those who have taken this course!

Textual data typically comes with other information. Modern data streams routinely combine text with the familiar numerical data that might be used, for instance, in a regression model. For example, real estate listings routinely pair the selling price of a property with a verbal description. Some descriptions include numerical data, such as the number of rooms, but many others only describe the property in words, often using an idiosyncratic vernacular. Advances in text analytics allow us to convert this text into numerical features suitable for other statistical models. Unsupervised techniques are available to create features directly from text, requiring minimal user input. Because these constructions are unsupervised, the resulting features can be used like typical regressors. The techniques range from naïve to subtle: one can simply use raw counts of words, form principal components from these counts, or build regressors from counts of adjacent words. We will consider several examples to illustrate the surprising success of these methods. To partially explain that success, we will explore proposed hierarchical generative models often associated with nonparametric Bayesian analysis. Because regressors derived from text may be difficult to interpret, we also show how to develop interpretive hooks for these quantitative features.
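To make the naïve end of this range concrete, the simplest feature construction is a document-term matrix of raw word counts: one row per listing, one column per vocabulary word. Below is a minimal sketch (in Python rather than R, and with invented toy listings standing in for real data) of how such count features arise from short property descriptions; downstream steps such as principal components would then operate on this matrix.

```python
# Minimal sketch: raw word-count features from short (made-up) listing texts.
from collections import Counter

# Hypothetical property descriptions standing in for real listing data.
descriptions = [
    "charming sunny cottage with garden",
    "spacious modern condo with garden view",
    "sunny modern loft",
]

# Tokenize each description into lowercase words.
tokenized = [d.lower().split() for d in descriptions]

# Vocabulary: every distinct word, in a fixed (sorted) column order.
vocab = sorted({w for doc in tokenized for w in doc})

# Document-term matrix of raw counts: rows are listings, columns are words.
counts = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

for row in counts:
    print(row)
```

Each row of `counts` is a numerical feature vector that could enter a regression directly, or be compressed first via principal components.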

Prerequisites: This course is self-contained, with no explicit prerequisite beyond familiarity with statistical methods at the level of multiple regression. That said, some familiarity with multivariate methods (particularly principal components) and exposure to probability models would be helpful. The course will use R and its contributed packages as the main software tools.

Standard Fee: Members = $1700; Non-members = $3200

Course Sections