What is this course about?
This course introduces core concepts and methods in text mining, focusing on how large-scale textual data can be represented, modeled, and analyzed using machine learning techniques. The course begins with a review of fundamental machine learning concepts and progresses to text representation methods, ranging from sparse models to dense and contextual representations, including word embeddings and transformer-based language models. Building on these representations, students will study text classification, semi-supervised and multi-task learning, and domain adaptation techniques. The course also covers modern search systems, including lexical, neural, and LLM-enhanced retrieval, with an emphasis on applying text mining techniques to real-world problems.
Resources: There is no required textbook for this class. The slides are mostly self-contained. You can refer to the following book for further reading:
Jurafsky and Martin, Speech and Language Processing
Prerequisites: Students are expected to have the following background:
Familiarity with basic programming (Python 3)
Basic knowledge of linear algebra and probability
Grading
Programming assignments: 20%
Midterm exam: 40%
Final exam: 40%
Previous offerings
2025 Fall