Natural Language Processing
for the Byzantine Corpus
Postgraduate Programme — Level 7
LEARNING OUTCOMES
Upon completion, students will be able to:
1) Explain core NLP concepts (tokenization, lemmatization, named entity recognition, text classification, topic modelling) and their relevance to Byzantine philology and historical research.
2) Use Python and standard NLP libraries (CLTK, spaCy, NLTK, pandas) to load, preprocess, and analyse Byzantine Greek texts from major digital repositories.
3) Design and apply annotation guidelines for Byzantine named entities (persons, offices, toponyms, dynasties, dates) and produce annotated datasets using Prodigy or Label Studio.
4) Train, evaluate, and critically interpret a domain-specific NER model using precision, recall, and F1 metrics.
5) Apply stylometric and topic modelling methods to Byzantine corpora and interpret results within their historical and philological context.
6) Critically evaluate the possibilities and limitations of large language models and AI tools when applied to Medieval Greek texts.
7) Design and deliver an end-to-end NLP pipeline project on a Byzantine text, integrating philological expertise with computational methods.
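For orientation on outcome 4, the entity-level metrics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's prescribed evaluation code: the helper names `iob_to_spans` and `precision_recall_f1` and the labels `PER` and `LOC` are hypothetical, and an entity counts as correct only when both its boundaries and its label match the gold annotation exactly.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, label); hypothetical representation
Span = Tuple[int, int, str]


def iob_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a sequence of IOB tags (B-X, I-X, O).

    Stray I-X tags that do not continue an open entity are ignored.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def precision_recall_f1(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Exact-match, entity-level precision, recall, and F1."""
    gold_spans, pred_spans = iob_to_spans(gold), iob_to_spans(predicted)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: the model finds the person but misses the place name.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
p, r, f = precision_recall_f1(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=1.00 R=0.50 F1=0.67
```

Exact matching is the strictest common convention; partial-overlap scoring is more lenient and yields different figures, so an evaluation report should state which one it uses.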
COURSE SYLLABUS
13 Modules
Week 01 | Working with text in Python
Digital resources for Byzantine texts.
Week 02 | Prompt Literacy for Humanities Research
Loading Byzantine texts, basic string operations.
Week 03 | Preprocessing Medieval Greek
Tokenization, normalization, polytonic Unicode, scribal abbreviations. Tools: NLTK, spaCy, CLTK.
Week 04 | Morphological analysis and lemmatization
Challenges of Greek inflection; evaluating CLTK on Byzantine texts.
Week 05 | Corpus statistics and stylometry
Word frequency, Zipf’s Law, authorship analysis applied to Byzantine texts.
Week 06 | Named Entity Recognition: concepts
Entity types in Byzantine texts; why off-the-shelf NER fails; IOB tagging scheme.
Week 07 | Annotation practice
Designing annotation guidelines for Byzantine entities.
Week 08 | Training and evaluating a NER model with spaCy
Evaluation metrics: precision, recall, F1. Error analysis.
Week 09 | Text classification and topic modelling
Genre classification. Critical interpretation.
Week 10 | Knowledge graphs and linked data
Entity linking to PBW, Pleiades, Wikidata; introduction to RDF triples.
Week 11 | Large language models and Byzantine texts
Prompting strategies; critical evaluation of LLM performance on Medieval Greek; hallucination in historical contexts.
Week 12 | Project workshop
Student presentations of NER pipeline progress; peer feedback; troubleshooting.
Week 13 | Final presentations
Closing discussion: computational text analysis in Byzantine scholarship.
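As an illustration of the polytonic Unicode handling listed under Week 03, the sketch below uses only Python's standard `unicodedata` module. The function names `strip_diacritics` and `normalize_token` are hypothetical, and removing all diacritics is only one possible normalization policy: it also drops the iota subscript and the diaeresis, which may not suit every research question.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove accents, breathings, and other combining marks from Greek text."""
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so the result is in a single, predictable form.
    return unicodedata.normalize("NFC", base_only)


def normalize_token(token: str) -> str:
    """Hypothetical matching key: diacritics removed, then lowercased."""
    return strip_diacritics(token).lower()


# Two encodings of the same accented letter compare unequal until normalized:
# U+1F75 (eta with oxia) and U+03AE (eta with tonos).
oxia, tonos = "\u1f75", "\u03ae"
print(oxia == tonos)
print(unicodedata.normalize("NFC", oxia) == tonos)

for word in ["Κωνσταντῖνος", "βασιλεύς"]:
    print(word, "->", normalize_token(word))
```

Applying NFC alone keeps the diacritics while still unifying equivalent encodings, which is usually the safer first step before any lossy stripping.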
ASSESSMENT
Student Evaluation
40% | Weekly exercises: formative; submitted via the course platform; graded on correctness, code quality, and critical reflection.
60% | Final Project: summative; NER pipeline on a Byzantine text of the student’s choosing, including annotation manual, trained model, evaluation report, and public presentation.
Workload — ECTS Distribution
250 Hours Total
Lectures
39
91
120
Course Total
250
Recommended Bibliography
Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O’Reilly.
Jurafsky, D., & Martin, J. H. (2026). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models (3rd ed.). https://web.stanford.edu/~jurafsky/slp3/

