Text mining

Study program / study programs: Advanced data analytics

Course: Text Mining

Teacher(s): Jelena Jovanović, Sonja Dimitrijević

Course status: Elective

ECTS points: 10

Prerequisites: none

Course objective:
To guide and assist students in:
- developing a solid understanding of a typical text mining workflow
- learning principal text mining methods and techniques, including those used in text classification and clustering, topic modeling, key-term extraction, and text summarization
- developing working knowledge of text mining in the R and/or Python programming language(s)

Learning outcomes:
Students will be able to apply text mining methods and techniques to classify and cluster unstructured text-based content, as well as to extract key terms and main topics from such content. They will also know how to evaluate the performance of individual methods and techniques, as well as how to benchmark different methods and techniques.

Course structure and content:
The course will cover the overall text mining process and examine in detail each of the key phases of a typical text mining workflow. In particular, the following will be covered:
- exploratory analysis of a given corpus (i.e., a text-based dataset)
- text preprocessing
- transformation of unstructured textual content into a structured numerical format, that is, feature creation; different text representation / feature creation methods will be considered, including both traditional ones (e.g., the vector space model) and more recent ones (e.g., word vectors)
- reduction of the typically very large feature space through feature selection techniques
- selection of a statistical, machine learning, or graph-based algorithm to be used with the created feature set to build a model for pattern mining or information extraction
- examination and evaluation of the results produced by the built model
Various methods for typical text mining tasks will be introduced, including methods for text classification and clustering, as well as those used for the detection of key terms and topics. Finally, the course will demonstrate the iterative (cyclic) nature of the text mining workflow, in which individual phases of the process are revised to achieve better performance.
All phases of the text mining workflow will be introduced through practical work with publicly available software libraries for text mining (e.g., relevant R or Python packages) and real-world corpora (i.e. text-based datasets).
The course relies on relevant Python libraries.
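
As an illustration only, the workflow phases listed above (preprocessing, feature creation with a vector space model, feature selection, model building, and evaluation) can be sketched in a few lines of Python. This is a minimal sketch, not course material: the toy corpus, the stop-word list, the function names, and the choice of a nearest-centroid classifier over TF-IDF vectors are all assumptions made for the example. It uses only the standard library to stay self-contained, whereas practical work would normally rely on dedicated text mining packages.

```python
import math
import re
from collections import Counter

# Toy labelled corpus and stop-word list, invented for illustration only.
TRAIN = [
    ("The striker scored a late goal in the match", "sport"),
    ("The team won the match after a penalty goal", "sport"),
    ("The bank raised interest rates on loans", "finance"),
    ("Investors sold shares as interest rates rose", "finance"),
]
TEST = [
    ("A late penalty decided the match", "sport"),
    ("Shares fell when the bank changed rates", "finance"),
]
STOPWORDS = {"the", "a", "in", "on", "as", "after", "of", "and"}


def preprocess(text):
    """Preprocessing: lowercase, tokenize on letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def fit_idf(token_lists, min_df=1):
    """Feature selection by document frequency, then smoothed IDF weights.

    Only terms that occur in at least min_df documents are kept.
    """
    df = Counter(term for tokens in token_lists for term in set(tokens))
    n = len(token_lists)
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items() if c >= min_df}


def normalise(vec):
    """Scale a sparse vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def tfidf(tokens, idf):
    """Feature creation: map tokens to a unit-length sparse TF-IDF vector."""
    tf = Counter(t for t in tokens if t in idf)
    return normalise({t: f * idf[t] for t, f in tf.items()})


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def train(examples):
    """Model building: one normalised centroid vector per class."""
    token_lists = [preprocess(text) for text, _ in examples]
    idf = fit_idf(token_lists)
    sums = {}
    for tokens, (_, label) in zip(token_lists, examples):
        sums.setdefault(label, Counter()).update(tfidf(tokens, idf))
    return idf, {label: normalise(vec) for label, vec in sums.items()}


def predict(text, idf, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vec = tfidf(preprocess(text), idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Evaluation: accuracy on the held-out toy examples.
idf, centroids = train(TRAIN)
predictions = [predict(text, idf, centroids) for text, _ in TEST]
accuracy = sum(p == gold for p, (_, gold) in zip(predictions, TEST)) / len(TEST)
print(predictions, accuracy)
```

Changing one phase at a time (for example, raising min_df, extending the stop-word list, or swapping the classifier) and re-running the evaluation mirrors the iterative nature of the workflow described above.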

Literature/Readings:
Selected chapters from the following books:
J. Silge & D. Robinson. Text Mining with R – A Tidy Approach. O’Reilly, 2017. E-book publicly available at: http://tidytextmining.com/
T. Kwartler. Text Mining in Practice with R. Wiley, 2017.

The number of class hours per week:
Lectures: 4
Labs: 0
Workshops: 1
Research study: 2
Other classes: 0

Teaching methods:
Lectures will introduce the main concepts of each course topic and will include substantial practical work with topic-specific software libraries. Workshops and research study will be fully practical, based on individual and group work.

Evaluation/Grading (maximum 100 points):
Pre-exam requirements (Project: simple application case): 40
Final exam (Project: real-world application case): 60