Post-mortem of a fiction generator: asemic writing with some properties of natural languages

John Ohno on 2018-07-02

A couple of years ago I wrote a demo/display hack that generated an alphabet & vocabulary for asemic writing. For NaNoGenMo 2017, I modified it to use PIL instead of turtle, use Markov chains (rather than picking random words), and write to an A1-sized page image.

I’d like to make a couple notes about what’s going on under the hood here.

First, I generate a character set.

Each character is a series of strokes separated by angle changes. Originally this logic was written for the pen-based system of Python’s turtle module, which made a lot of sense for simulated handwriting. So, each stroke feeds into the next one: every character can be drawn without lifting the pen, with the exception of accents. (A character can have one or two dots or grave/acute accents. If a character has two dots it’s an umlaut, and if a character has both an acute and a grave accent it has a caret.)

Every element of the character with the exception of the accents is actually phonetic: each stroke type is a consonant sound and each angle change is a vowel sound. (This is inspired by Hangul, where what appears to be a logogram is actually a cluster of up to three phonetic characters.) In this case we have up to five stroke-angle pairs. These phonetic readings aren’t used in the output, but in the original version of the script they were in the debug output.
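To make the stroke-to-sound idea concrete, here is a minimal sketch of how a reading could be derived from the stroke-angle pairs. The sound tables and the `reading` function are hypothetical: the post does not say which consonants or vowels the script assigned, so these are placeholders.

```python
# Hypothetical sound tables: the post does not say which sounds were used.
CONSONANTS = {
    ("full", "line"): "k",
    ("half", "line"): "t",
    ("full", "semicircle"): "m",
    ("half", "semicircle"): "s",
}
VOWELS = {45: "a", 90: "e", 180: "i", -45: "o", -90: "u"}


def reading(pairs):
    """Spell a character as one consonant-vowel syllable per stroke-angle pair."""
    return "".join(CONSONANTS[stroke] + VOWELS[angle] for stroke, angle in pairs)


# A two-pair character: a full line then a 90 degree turn,
# followed by a half semicircle then a -45 degree turn.
example = [(("full", "line"), 90), (("half", "semicircle"), -45)]
print(reading(example))  # keso
```

Accents are left out here because, as described above, they carry no phonetic value.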

Strokes can be either full length or half length, and they can be either lines or semicircles. Angle changes are limited to multiples of 45 degrees (i.e., 45, 90, 180, -45, and -90). These limitations are intended to mimic the kinds of differences that might actually work in a handwritten language: there needs to be a big threshold between distinct characters or else it’s easy to misread.
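The pen-style drawing can be sketched as a heading that turns after each stroke. This is only an illustration, not the original script: the `trace` function is hypothetical, it treats every stroke as a straight segment (semicircles would need extra arc geometry), it assumes positive angles turn counterclockwise, and it borrows the 5-pixel base stroke length mentioned for the first PDF.

```python
import math

BASE = 5  # base stroke length in pixels (value mentioned for the first PDF)


def trace(pairs, base=BASE):
    """Return the pen positions visited while drawing (length, turn) pairs.

    Simplification: every stroke is drawn as a straight segment.
    """
    x, y, heading = 0.0, 0.0, 0.0
    points = [(x, y)]
    for length, turn in pairs:
        dist = base if length == "full" else base / 2
        x += dist * math.cos(math.radians(heading))
        y += dist * math.sin(math.radians(heading))
        # Round away floating-point noise from the trigonometry.
        points.append((round(x, 6), round(y, 6)))
        # The angle change applies after the stroke, so it steers the next one.
        heading = (heading + turn) % 360
    return points


path = trace([("full", 90), ("full", 90), ("half", 45)])
```

Because each stroke starts where the previous one ended, the whole character comes out as one connected path, which is what lets it be drawn without lifting the pen.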

A character set is between 20 and 36 characters — about the same range as in reality for one- or two-sound characters in phonetic writing systems. Since ours actually has up to five syllables per character, we really should have many more, but that’s a pain.
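A character set along these lines could be generated roughly as follows. This is a sketch under assumptions: the data layout, the accent list, the uniform random choices, and the rejection of duplicate characters are all guesses, and the names `make_character` and `make_character_set` are made up for illustration.

```python
import random

# Assumed encodings: the post names the categories but not the data layout.
STROKE_LENGTHS = ("full", "half")
STROKE_SHAPES = ("line", "semicircle")
ANGLE_CHANGES = (45, 90, 180, -45, -90)
ACCENTS = (None, "dot", "two dots", "grave", "acute", "grave+acute")


def make_character(rng, max_pairs=5):
    """Return one character: up to five stroke-angle pairs plus an accent."""
    pairs = []
    for _ in range(rng.randint(1, max_pairs)):
        stroke = (rng.choice(STROKE_LENGTHS), rng.choice(STROKE_SHAPES))
        pairs.append((stroke, rng.choice(ANGLE_CHANGES)))
    return {"pairs": tuple(pairs), "accent": rng.choice(ACCENTS)}


def make_character_set(rng, low=20, high=36):
    """Build between `low` and `high` distinct characters."""
    target = rng.randint(low, high)
    seen = set()
    charset = []
    while len(charset) < target:
        ch = make_character(rng)
        key = (ch["pairs"], ch["accent"])
        if key not in seen:  # assumption: duplicate glyphs are rejected
            seen.add(key)
            charset.append(ch)
    return charset


charset = make_character_set(random.Random(0))
```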

Then, I create a vocabulary by combining random characters. Originally, I had a bias toward short words and tied this bias to word frequency, but I don’t do that anymore because I was having problems with the output. The vocabulary is supposed to be about 300 words, between one and five characters long.
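The unbiased version of the vocabulary step is simple enough to sketch. Here characters are represented by their index in the character set, word lengths are drawn uniformly, and duplicates are discarded; all three choices, and the name `make_vocabulary`, are assumptions rather than details from the original script.

```python
import random


def make_vocabulary(rng, charset_size, n_words=300, max_len=5):
    """Build `n_words` distinct words, each a tuple of 1..max_len character indices."""
    vocab = set()
    while len(vocab) < n_words:
        length = rng.randint(1, max_len)  # uniform: no bias toward short words
        vocab.add(tuple(rng.randrange(charset_size) for _ in range(length)))
    return sorted(vocab)


vocab = make_vocabulary(random.Random(0), charset_size=26)
```

If duplicates were kept instead of discarded, the vocabulary would come out slightly under the target size, which may be why the figure is only "about" 300.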

Once I have a vocabulary, I make something resembling a grammar by creating a bunch of sentences whose Markov model will resemble a Markov model of a real language. Basically, I create a sentence pool and append randomly chosen words from the vocabulary to randomly chosen sentences in the pool while growing the pool. The result is that some words will have significantly stronger associations, so once we make a Markov model, the distribution of words produced by chaining from that model should be roughly Zipfian, I think. I didn’t actually calculate it out properly, so I might be completely wrong.
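One possible reading of the pool-and-chain step, as a sketch: the growth rule, the `grow_chance` and `steps` parameters, the first-order model, and the restart on dead ends are all assumptions, and the function names are hypothetical. It makes no attempt to verify the resulting word distribution.

```python
import random
from collections import defaultdict


def build_pool(rng, vocab, steps=2000, grow_chance=0.1):
    """Grow a pool of sentences by adding random words to random sentences."""
    pool = [[rng.choice(vocab)]]
    for _ in range(steps):
        if rng.random() < grow_chance:
            pool.append([rng.choice(vocab)])  # grow the pool with a new sentence
        else:
            rng.choice(pool).append(rng.choice(vocab))  # extend an existing one
    return pool


def build_model(pool):
    """First-order Markov model: word -> list of words seen directly after it."""
    model = defaultdict(list)
    for sentence in pool:
        for current, following in zip(sentence, sentence[1:]):
            model[current].append(following)
    return dict(model)


def chain(rng, model, start, length):
    """Walk the model for `length` words, restarting at a random key on dead ends."""
    out = [start]
    while len(out) < length:
        followers = model.get(out[-1])
        out.append(rng.choice(followers) if followers else rng.choice(list(model)))
    return out


rng = random.Random(0)
vocab = list(range(300))  # stand-in word ids; real words would be character tuples
pool = build_pool(rng, vocab)
model = build_model(pool)
text = chain(rng, model, start=pool[0][0], length=50)
```

Sentences that were created early have more chances to be extended, so the pool ends up with a few long sentences and many short ones.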

I create an image for every word in the vocabulary, and then chain & render the result onto pages. I was getting a lot of single-word lines so I created a filter that merged lines 98% of the time, which brought the page count down to something more reasonable.
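The merge filter can be sketched like this, assuming it works by dropping each line break with 98% probability; the post does not spell out the exact rule, so this is one interpretation, and `merge_lines` is a hypothetical name.

```python
import random


def merge_lines(rng, lines, merge_chance=0.98):
    """Join each line onto the previous one with probability `merge_chance`."""
    merged = []
    for line in lines:
        if merged and rng.random() < merge_chance:
            merged[-1] = merged[-1] + list(line)  # drop the line break
        else:
            merged.append(list(line))  # keep the line break
    return merged


sample = [["a"], ["b"], ["c", "d"], ["e"]]
result = merge_lines(random.Random(1), sample)
```

In a real renderer the merged lines would still need wrapping at the page width; this sketch only decides which breaks survive.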

In my first PDF the characters are a little hard to see, since the base stroke unit is so small (5 pixels). So, I created a second one with a 10 pixel base stroke length.

Since getting kerning right is really hard, I turned on cursive mode & created another version with a connected script.

All of these have 50k or more ‘words’.