Step 01 / Ingest + Normalize
SumTablets turns Sumerian transliteration into a supervised sequence-to-sequence task over Unicode glyphs, with open data and reproducible baselines.
The dataset pairs source glyph sequences with transliterations for 91,606 tablets (about 6.97M glyphs), preserving structural context via special tokens for surfaces, line breaks, rulings, columns, blank space, and breakage.
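One way to picture the structural tokens is as a flattening step: a tablet's surfaces, columns, and lines become a single token sequence with layout markers interleaved among the glyphs. The sketch below is illustrative only. The token spellings (`<SURFACE>`, `<COLUMN>`, `<LINE>`, `<RULING>`, `<BLANK_SPACE>`, `<BROKEN>`), the `Surface` and `Column` classes, and the `serialize` function are hypothetical names chosen for this example and may not match the dataset's actual vocabulary or preprocessing code.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical token spellings; the dataset's real special tokens may differ.
SURFACE = "<SURFACE>"
COLUMN = "<COLUMN>"
LINE_BREAK = "<LINE>"
RULING = "<RULING>"
BLANK = "<BLANK_SPACE>"
BROKEN = "<BROKEN>"


@dataclass
class Column:
    # Each line is a list of glyphs and/or in-line markers (RULING, BLANK, BROKEN).
    lines: list[list[str]] = field(default_factory=list)


@dataclass
class Surface:
    columns: list[Column] = field(default_factory=list)


def serialize(surfaces: list[Surface]) -> list[str]:
    """Flatten a tablet into one token sequence, keeping layout as tokens."""
    tokens: list[str] = []
    for surface in surfaces:
        tokens.append(SURFACE)
        for column in surface.columns:
            tokens.append(COLUMN)
            for line in column.lines:
                tokens.extend(line)
                tokens.append(LINE_BREAK)  # close every line, including marker-only lines
    return tokens


# Toy tablet: one surface, one column, three lines.
tablet = [
    Surface(columns=[Column(lines=[["𒀭", "𒂗", BROKEN], [RULING], ["𒈗", BLANK]])])
]
sequence = serialize(tablet)
```

Because the layout survives as ordinary tokens, a sequence model can condition on where a line ends or where the clay is broken without any separate structural input.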
On a 90/5/5 split stratified by historical period, a weighted dictionary sampler reaches 61.22 chrF, while an XLM-R-initialized encoder-decoder reaches 97.54 chrF. The objective is practical: speed up philologist review, focus attention on uncertain readings, and make downstream restoration and translation pipelines tractable.
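The split and the dictionary baseline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's code: `stratified_split`, `build_reading_counts`, and `sample_readings` are hypothetical helpers, the period labels are toy values, and the sampler assumes each glyph aligns to exactly one reading and that readings are drawn in proportion to their training frequency.

```python
import random
from collections import Counter, defaultdict


def stratified_split(records, key, fractions=(0.90, 0.05, 0.05), seed=0):
    """Split into train/val/test so each stratum keeps roughly the same ratios."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[key(record)].append(record)

    train, val, test = [], [], []
    for stratum in sorted(by_stratum):  # sorted for reproducibility
        group = by_stratum[stratum]
        rng.shuffle(group)
        n_val = round(len(group) * fractions[1])
        n_test = round(len(group) * fractions[2])
        n_train = len(group) - n_val - n_test
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])
    return train, val, test


def build_reading_counts(pairs):
    """Count how often each glyph appears with each reading (assumes 1:1 alignment)."""
    counts = defaultdict(Counter)
    for glyph, reading in pairs:
        counts[glyph][reading] += 1
    return counts


def sample_readings(glyphs, counts, rng, unknown="<unk>"):
    """Pick one reading per glyph, weighted by its training count."""
    output = []
    for glyph in glyphs:
        options = counts.get(glyph)
        if not options:
            output.append(unknown)  # glyph never seen in training
            continue
        readings = list(options)
        weights = [options[r] for r in readings]
        output.append(rng.choices(readings, weights=weights, k=1)[0])
    return output


# Toy records: (tablet_id, period). Labels and sizes are illustrative only.
records = [(i, "Ur III") for i in range(200)]
records += [(i, "Old Babylonian") for i in range(200, 300)]
train, val, test = stratified_split(records, key=lambda r: r[1])

# Toy aligned (glyph, reading) pairs standing in for the training portion.
counts = build_reading_counts(
    [("𒀭", "an"), ("𒀭", "dingir"), ("𒀭", "an"), ("𒈗", "lugal")]
)
prediction = sample_readings(["𒈗", "𒂗", "𒀭"], counts, random.Random(0))
```

A sampler like this ignores context entirely, which is why it serves as a floor: any gap between it and a contextual model reflects how much the surrounding glyphs disambiguate a reading.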