Cole Simmons

On Encoding Gilgamesh

Last month I spent a week trying to get a language model to produce passable Akkadian transliterations of the Standard Babylonian Gilgamesh. The results were instructive, mostly about the limits of the approach.

The setup

The Standard Babylonian version of the Epic of Gilgamesh runs to about 3,000 lines across twelve tablets. In transliteration — the romanized representation of cuneiform signs — the text is dense with determinatives, sign-value ambiguities, and lacunae where the tablets are damaged. A typical line looks like this:

i-na KASKAL-šu ša a-la-ku la i-du-ú

"On his journey, which going he did not know." That's Tablet IX, line 2. The model needs to handle Sumerian logograms (KASKAL), phonetic Akkadian syllables (i-na, a-la-ku), and the grammatical morphology connecting them.
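These layers can be told apart, at least roughly, by case: by Assyriological convention, Sumerian logograms are set in capitals and phonetic Akkadian syllables in lowercase. A toy sketch of that distinction (a deliberate simplification; real transliterations have more sign classes than these two):

```python
# Tag each sign in a hyphenated transliterated word. By convention,
# all-caps signs are Sumerian logograms (KASKAL), lowercase signs are
# phonetic Akkadian syllables (i, na). Determinatives and mixed cases
# are ignored here -- this is a sketch, not a parser.

def classify_signs(word: str) -> list[tuple[str, str]]:
    """Split a hyphenated word into signs and tag each one."""
    tags = []
    for sign in word.split("-"):
        if sign.isupper():
            tags.append((sign, "logogram"))
        else:
            tags.append((sign, "syllable"))
    return tags
```

So `classify_signs("KASKAL-šu")` tags the logogram and its phonetic complement separately, which is exactly the distinction the model has to learn from surface forms alone.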

What went wrong

The core problem is segmentation. Cuneiform transliteration has its own tokenization logic that doesn't map to any modern language the model has seen in training. Hyphens join syllables within a single word. Spaces separate words. But determinatives — semantic classifiers like superscript "d" for deities — have no separator at all. The model kept hallucinating hyphens where there shouldn't be any, merging determinatives with the following word, or splitting compound logograms.

Here's a concrete example. Given the first line of Tablet I:

ša nag-ba i-mu-ru iš-di ma-a-ti

The model produced:

ša nag-ba i-mu-ru iš-di ma-a-ti

Looks identical, right? It's not. The model inserted a zero-width space (U+200B) after "iš-di" and used a plain ASCII hyphen-minus (U+002D) throughout, where Assyriological convention calls for the Unicode hyphen (U+2010). These are the kinds of errors that make downstream processing break silently.
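Catching these invisibles mechanically is straightforward. A small audit function, assuming U+2010 is the expected hyphen and flagging the usual zero-width suspects:

```python
# Surface the invisible differences described above: flag zero-width
# characters by name, and flag HYPHEN-MINUS (U+002D), which renders
# identically to the expected HYPHEN (U+2010) but compares unequal.
import unicodedata

SUSPECTS = {"\u200b", "\u200c", "\u200d", "\ufeff"}  # zero-width chars

def audit_line(line: str) -> list[str]:
    """Report characters that look fine but break exact matching."""
    findings = []
    for i, ch in enumerate(line):
        if ch in SUSPECTS:
            findings.append(f"pos {i}: {unicodedata.name(ch)}")
        elif ch == "\u002d":
            findings.append(f"pos {i}: HYPHEN-MINUS (expected U+2010?)")
    return findings
```

Running it over the model's output of the line above reports both defects; running it over a clean reference line reports nothing, which makes it usable as a cheap validation pass.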

The deeper issue

What I'm really circling around is the question of whether subword tokenization — BPE, WordPiece, whatever — is fundamentally the wrong approach for transliterated cuneiform. The writing system has its own internal logic: a sign can be a logogram, a syllabogram, or a determinative depending on context. No amount of statistical pattern-matching on surface forms can recover that.

I think the answer is a custom tokenizer trained on ATF (ASCII Transliteration Format) corpora, probably with explicit sign-boundary markers. But that's a larger project, and for now I'm stuck with fine-tuning on top of a tokenizer that thinks "lugal" is two tokens.
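As a first approximation of what such a tokenizer would do, here's a sketch that treats each sign as one token and inserts explicit boundary markers. The "|" word-boundary marker is an illustrative choice; the {d} brace notation is how ATF writes determinatives in place of superscripts:

```python
# Sketch of sign-level tokenization: one token per sign, determinatives
# (in ATF-style braces) kept whole, "|" as an explicit word boundary.
# The boundary marker is an assumption for illustration.
import re

# A sign is either a braced determinative or a run of non-separator chars.
SIGN = re.compile(r"\{[^}]+\}|[^-\s{}]+")

def sign_tokens(line: str) -> list[str]:
    """Tokenize a transliterated line into signs, with word boundaries."""
    tokens = []
    for word in line.split():
        tokens.extend(SIGN.findall(word))
        tokens.append("|")  # explicit word-boundary marker
    return tokens[:-1] if tokens else tokens
```

The point of the explicit markers is that segmentation decisions become part of the vocabulary rather than something the model has to infer from byte statistics — which is precisely what BPE gets wrong on "lugal".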

What's next

I've been looking at the ORACC corpus as a training set. It has about 2.5 million words of transliterated cuneiform across multiple languages and periods, all in standardized ATF. The challenge is that the corpus is predominantly Sumerian, and the Akkadian portions are heavily skewed toward administrative texts rather than literary ones. A model trained mostly on grain receipts from Ur III is not going to produce convincing epic poetry.

The plan for now is to build a custom tokenizer, train it on the full ORACC dump, and then fine-tune a small transformer on the literary subcorpus. If that works, I'll write it up properly. If it doesn't, at least I'll have a good tokenizer.