What is this course about?
This course focuses on big data analytics with an emphasis on text data, which constitutes a large portion of the data generated daily in modern systems. The course addresses three central questions: how to represent large-scale data, how to learn from it, and how to apply it to real-world problems such as search and recommendation. Students will study core methods for text representation, including sparse and dense models, and learn how these representations are used in text classification, graph-based learning, and modern search and retrieval pipelines.
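To make the sparse/dense distinction concrete, here is a minimal sketch on a toy corpus. It is not course-provided code and assumes scikit-learn is installed; TF-IDF stands in for sparse bag-of-words vectors, and truncated SVD stands in for learned dense embeddings.

```python
# Sketch: sparse vs. dense document representations (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "users search for products",
    "the system recommends products to users",
    "text classification assigns labels to documents",
    "graphs connect users and documents",
]

# Sparse representation: TF-IDF bag-of-words, one dimension per vocabulary term.
vectorizer = TfidfVectorizer()
sparse_vectors = vectorizer.fit_transform(corpus)   # scipy sparse matrix
print(sparse_vectors.shape)                          # (4 documents, vocabulary size)

# Dense representation: project into a low-dimensional space
# (truncated SVD / LSA used here as a stand-in for learned embeddings).
svd = TruncatedSVD(n_components=2)
dense_vectors = svd.fit_transform(sparse_vectors)    # dense numpy array
print(dense_vectors.shape)                           # (4 documents, 2 dimensions)
```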
Resources: There is no required textbook for this class. The slides are mostly self-contained. You may refer to the following books for further reading:
Jurafsky and Martin, Speech and Language Processing
Hamilton, Graph Representation Learning
Prerequisites: Students are expected to have the following background:
Familiarity with basic programming (Python 3)
Basic knowledge of linear algebra and probability
Introductory understanding of machine learning concepts
Grading
Attendance: 10%
Up to five absences carry no penalty. Each absence beyond five results in a 1% deduction.
Programming assignments: 20%
Project: 30%
Final exam: 40%
Previous offerings
2025 Fall: Special talk (Keunchan Park, NAVER), Final exam