Cole Simmons

SumTablets

A Transliteration Dataset of Sumerian Tablets

ACL 2024 · ML4AL Workshop

overview

SumTablets turns Sumerian transliteration into a supervised sequence-to-sequence task over Unicode glyphs, with open data and reproducible baselines.

The dataset pairs source glyph sequences with transliterations for 91,606 tablets (about 6.97M glyphs), preserving structural context via special tokens for surfaces, line breaks, rulings, columns, blank space, and breakage.

On a stratified 90/5/5 split by historical period, a weighted dictionary sampler reaches 61.22 chrF, while an XLM-R-initialized encoder-decoder reaches 97.54 chrF. The objective is practical: speed philologist review, target uncertain readings, and make downstream restoration/translation pipelines tractable.

Corpus size 91,606 tablets / 6,970,407 glyphs
Source coverage Readings to glyph 99.93%; glyph to Unicode 99.96%
Evaluation Character-level chrF, averaged by tablet on held-out test data
Model delta Weighted dictionary 61.22 → neural baseline 97.54 chrF
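The chrF metric behind these numbers can be sketched directly. This is a simplified re-implementation (character n-grams only, up to n = 6, β = 2, whitespace stripped) to show what the scores measure; it is not necessarily the exact implementation used for the reported results.

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams; whitespace is stripped before counting.
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: mean character n-gram F-beta over n = 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n
        overlap = sum((hyp & ref).values())
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

Per-tablet scores computed this way are then averaged over the held-out test tablets.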

data construction

The first bottleneck was not model choice. It was producing a stable, machine-usable pairing of glyph strings with transliteration strings that preserves tablet structure and Assyriological conventions.

Input glyphs 𒀀𒂗𒆤 𒈦𒆳𒆳𒊏 𒀊𒁀𒀀𒀀𒌷𒉈𒆤 ...

Output transliteration {d}en-lil2 lugal kur-kur-ra ab-ba dingir-dingir-re2-ne-ke4 ...

Step 01 / Ingest + Normalize

Standardize transliterations from ePSD2/Oracc pipelines

Source transliterations are cleaned and canonicalized into a consistent format suitable for glyph-to-text learning, while retaining period and genre metadata for each tablet ID.
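The kind of canonicalization involved can be sketched as below. The specific rules here (subscript-digit folding, stripping damage and editorial marks) are illustrative of standard Assyriological conventions; the actual pipeline's rule set is more extensive.

```python
import re

# Fold Unicode subscript sign indices into ASCII digits, e.g. lil₂ -> lil2.
SUBSCRIPT_DIGITS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")

def normalize_reading(reading: str) -> str:
    """Hypothetical cleaner for a single transliterated reading."""
    reading = reading.translate(SUBSCRIPT_DIGITS)
    reading = reading.replace("ₓ", "x")            # unindexed readings
    reading = re.sub(r"[\[\]⌈⌉#?!]", "", reading)  # damage/editorial marks
    return reading.lower()
```

Applying this uniformly is what makes readings from different source projects comparable keys for the reading-to-glyph mapping.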

Step 02 / Reading → Glyph Mapping

Recover source glyphs from transliterated readings

A two-stage mapping converts transliterated readings to glyph names and then glyph names to Unicode cuneiform signs. The first stage retained 6,724,498 readings (99.93%); the second mapped 6,638,081 of the resulting glyph names to Unicode signs (99.96%).
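The two stages can be sketched with a lookup table plus the standard-library Unicode name registry. The reading-to-sign entries below are illustrative stand-ins for the sign-list-derived table used in the dataset.

```python
import unicodedata

# Stage 1: reading -> Unicode sign name (illustrative entries only; the
# dataset derives this table from sign-list data).
READING_TO_SIGN = {"a": "A", "en": "EN", "kur": "KUR"}

def reading_to_glyph(reading: str) -> str:
    # Stage 2: sign name -> codepoint via the official Unicode character name.
    sign = READING_TO_SIGN[reading]
    return unicodedata.lookup(f"CUNEIFORM SIGN {sign}")
```

Readings whose sign name has no Unicode character (or which are absent from the table) account for the small residual loss at each stage.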

Step 03 / Structural Fidelity

Encode tablet layout as aligned special tokens

Structure tokens are inserted in both glyph and transliteration streams so models do not lose tablet-level formatting that affects reading constraints.

[<SURFACE>] [\n] [...] [<RULING>] [<COLUMN>] [<BLANK_SPACE>]
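The alignment property can be sketched as a serializer that emits structural tokens at the same positions in both streams. The tablet dict layout here is hypothetical, not the dataset's actual schema.

```python
SURFACE, NEWLINE = "<SURFACE>", "\n"

def serialize(tablet):
    """Emit parallel glyph and transliteration streams with aligned
    structural tokens (sketch; only surfaces and line breaks shown)."""
    glyph_stream, translit_stream = [], []
    for surface in tablet["surfaces"]:
        glyph_stream.append(SURFACE)
        translit_stream.append(SURFACE)
        for line in surface["lines"]:
            glyph_stream.append(line["glyphs"])
            translit_stream.append(line["translit"])
            glyph_stream.append(NEWLINE)
            translit_stream.append(NEWLINE)
    return glyph_stream, translit_stream
```

Because the structural tokens occupy matching positions, a sequence-to-sequence model can learn to copy them through, keeping layout intact in its output.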

Periods 10 labeled periods for temporally aware evaluation and transfer studies.
Genres 14 labeled genres exposing style/domain shift across administrative, literary, legal, and ritual texts.
Partitioning 90/5/5 train/val/test, stratified by period to reduce period leakage and stabilize genre mix.
Lexical handling Lexical texts excluded from validation/test and added to training only.
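The period-stratified 90/5/5 partitioning can be sketched as shuffling within each period stratum and slicing, using integer arithmetic so the proportions are exact per stratum. This is a minimal sketch, not the release's split script.

```python
import random

def stratified_split(tablet_ids, period_of, seed=0):
    """90/5/5 train/val/test split within each historical period (sketch)."""
    rng = random.Random(seed)
    by_period = {}
    for tid in tablet_ids:
        by_period.setdefault(period_of[tid], []).append(tid)
    train, val, test = [], [], []
    for ids in by_period.values():
        rng.shuffle(ids)
        n = len(ids)
        a, b = (n * 90) // 100, (n * 95) // 100
        train += ids[:a]
        val += ids[a:b]
        test += ids[b:]
    return train, val, test
```

Splitting within each period (rather than globally) is what keeps every period represented in validation and test at roughly its corpus proportion.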

modeling approach

Two baselines establish the floor and ceiling: a weighted reading sampler and a multilingual transformer encoder-decoder with task-specific tokenization.

Baseline A / Non-neural

Weighted dictionary sampler

For each glyph, sample from known readings proportional to observed frequency. The weighted mean number of readings per glyph is 6.75.

This baseline establishes how far frequency-only disambiguation can go without context modeling.

[lookup sampling] [frequency prior] [no sequence context]
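The sampler amounts to an independent frequency-weighted draw per glyph. The reading inventory and counts below are hypothetical; the real table is built from corpus frequencies (on average about 6.75 weighted readings per glyph).

```python
import random
from collections import Counter

# Hypothetical per-glyph reading frequencies (real values come from the corpus).
READINGS = {"𒂗": Counter({"en": 9000, "in": 300, "uru16": 40})}

def sample_reading(glyph, rng=random):
    """Draw one reading for a glyph, proportional to observed frequency."""
    readings = READINGS[glyph]
    return rng.choices(list(readings), weights=list(readings.values()))[0]

def transliterate(glyphs, rng=random):
    # Context-free by construction: each glyph is sampled independently.
    return " ".join(sample_reading(g, rng) for g in glyphs)
```

Because each draw ignores its neighbors, the baseline's ceiling is set entirely by how skewed the reading distributions are.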

Baseline B / Neural

XLM-R initialized encoder-decoder

Encoder and decoder both initialize from a 279M-parameter XLM-R checkpoint, then are adapted for glyph-to-transliteration generation.

Custom SentencePiece vocabularies: 632 glyph tokens and 1024 transliteration tokens, each including 11 shared special tokens.

[XLM-R] [SentencePiece] [seq2seq] [beam search]

Stage 1 Encoder MLM pretraining: 50 epochs, seq len 64, LR 5e-5, batch 2048, mask prob 0.10, warmup 200.
Stage 2 Joint model with frozen encoder: LR 1e-4, 2 epochs, batch 128, warmup 100.
Stage 3 Joint model with unfrozen encoder: LR 5e-5, 4 epochs, batch 128, warmup 100.
Inference + Compute Beam size 5; AdamW optimizer; trained on a single A100 SXM 80GB GPU.
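The staged setup above can be sketched with the Hugging Face EncoderDecoderModel API (assuming that API; the checkpoint name and freezing details here are illustrative, and the learning rates and schedules are those listed above).

```python
from transformers import EncoderDecoderModel

# Tie an XLM-R encoder to an XLM-R-initialized decoder (decoder weights are
# adapted with cross-attention added automatically).
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "xlm-roberta-base", "xlm-roberta-base"
)

# Stage 2: train the decoder with the encoder frozen (LR 1e-4).
for p in model.encoder.parameters():
    p.requires_grad = False

# Stage 3: unfreeze the encoder and continue at a lower LR (5e-5).
for p in model.encoder.parameters():
    p.requires_grad = True
```

Freezing first lets the freshly added cross-attention and decoder adapt to the glyph vocabulary before the pretrained encoder representations are disturbed.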

results

The neural baseline closes most of the transliteration gap, but performance remains uneven across historically and stylistically distinct genres.

Category Dictionary chrF Neural chrF
Overall 61.22 97.54
Administrative 63.15 98.14
Royal inscription 54.58 95.15
Literary 37.73 90.67
Liturgy 55.92 77.68

Interpretation

Genre imbalance still matters

Administrative texts dominate the training distribution and also yield the highest scores. Lower-resource genres (especially liturgical and literary texts) remain a meaningful error source, suggesting gains from balanced curriculum sampling and uncertainty-aware decoding.

Operational Value

Useful now as expert-in-the-loop tooling

Even with residual errors, high chrF at corpus scale makes the model practical for generating draft transliterations and for triaging legacy transliterations for specialist review, replacing full manual first-pass reading.

why this matters

SumTablets is infrastructure: it turns transliteration from a one-off manual act into a reproducible ML problem with public benchmarks and reusable assets.

Research trajectory

From transliteration to restoration and translation

A strong glyph-to-transliteration model is a prerequisite for damaged-tablet restoration, uncertainty estimation over readings, and low-resource Akkadian/Sumerian translation pipelines.

Scholarly unlock

Scale philological workflows without flattening nuance

By preserving layout and metadata, the dataset supports models that can respect period, genre, and document structure rather than collapsing tablets into context-free token streams.

My contribution

End-to-end technical ownership

Led data design, preprocessing, Unicode mapping, split strategy, tokenizer retraining, baseline implementation, training, evaluation, and release packaging across ACL paper, repository, and Hugging Face assets.

artifacts

Every major component is public and reusable.