ASCIIfying

I’ve been adding more automation to my static blog publishing workflow.[1] The scripts themselves are of no use to anyone else, but some bits and pieces may be of wider interest. For example, this morning I wrote a script using a library that converts Unicode strings to their nearest ASCII equivalents.

The script, written to be used as a Text Filter in BBEdit, automates the generation of header lines in the Markdown source code of a post. The header of this post, for example, looks like this:

Title: ASCIIfying
Keywords: python, programming
Date: 2014-10-19 22:10:00
Slug: asciifying
Link: http://www.leancrew.com/all-this/2014/10/asciifying/

I write the Title and Keywords lines as I start the post, using a simple BBEdit Clipping. But before I publish, I need the other lines. The Date is easy to generate using the datetime library. That’s also the library I use to generate the year and month portions of the Link URL. The tricky thing is automating the creation of the Slug, which also shows up in the Link.
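As a rough illustration of the easy part, here is a minimal sketch of building the Date and Link lines with datetime. The function name `date_and_link` and the `BASE_URL` constant are made up for this example (the URL prefix is simply copied from the sample header above); the real script may be organized quite differently.

```python
from datetime import datetime

# Assumption: the URL prefix is copied from the Link line in the sample header.
BASE_URL = "http://www.leancrew.com/all-this"

def date_and_link(slug, now=None):
    """Return the Date and Link header lines for a post with the given slug."""
    now = now or datetime.now()
    # Same timestamp layout as the sample header: 2014-10-19 22:10:00
    date_line = now.strftime("Date: %Y-%m-%d %H:%M:%S")
    # The year and month portions of the URL come from the same datetime object.
    link_line = "Link: {}/{}/{}/".format(BASE_URL, now.strftime("%Y/%m"), slug)
    return date_line, link_line
```

Passing `now` explicitly is only there to make the function easy to check; left out, it uses the current time.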

Oh, it’s very easy to make a slug when the title is as simple as this one, but suppose we started with this:

Title: Çingleton/Montréal isn't done
Keywords: test

Non-ASCII characters are allowed in URLs, but they can be troublesome, and I prefer to avoid them. Also, we can’t have the slash in there, and the apostrophe ought to go, too. Finally, I don’t want any spaces, because they cause nothing but trouble in the file system, and I hate seeing %20 in a URL.

The function I settled on is this:

python:
1:  def slugify(u):
2:    "Convert Unicode string into blog slug."
3:    u = re.sub(u'[–—/:;,.]', '-', u)  # replace separating punctuation
4:    a = unidecode(u).lower()          # best ASCII substitutions, lowercased
5:    a = re.sub(r'[^a-z0-9 -]', '', a) # delete any other characters
6:    a = a.replace(' ', '-')           # spaces to hyphens
7:    a = re.sub(r'-+', '-', a)         # condense repeated hyphens
8:    return a

All of the lines are straightforward except the unidecode call in Line 4. That is the one function exported by the unidecode library (the script gets it with from unidecode import unidecode, and the re calls come from the standard re module), and it does the substitutions that make slugify generate strings that are much more useful than anything I could write with the standard encode and decode methods. My script turns that two-line header above into

Title: Çingleton/Montréal isn't done
Keywords: test
Date: 2014-10-19 21:31:22
Slug: cingleton-montreal-isnt-done
Link: http://www.leancrew.com/all-this/2014/10/cingleton-montreal-isnt-done/

which has a perfectly readable URL that includes nothing but lowercase ASCII characters, numerals, and hyphens.
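To show how the pieces might fit together, here is a sketch of the whole header-completing step. It is not the actual script: `complete_header`, `main`, and `BASE_URL` are names invented for this example, and the fallback used when unidecode isn't installed is a much cruder stand-in that only strips accents.

```python
import re
import sys
from datetime import datetime

try:
    from unidecode import unidecode
except ImportError:
    # Assumption: a crude stdlib stand-in so the sketch runs without unidecode.
    # It only strips accents and is not equivalent to the real library.
    import unicodedata

    def unidecode(s):
        return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

# Assumption: the URL prefix is copied from the sample Link line.
BASE_URL = "http://www.leancrew.com/all-this"

def slugify(u):
    "Convert Unicode string into blog slug."
    u = re.sub(u'[–—/:;,.]', '-', u)   # replace separating punctuation
    a = unidecode(u).lower()           # best ASCII substitutions, lowercased
    a = re.sub(r'[^a-z0-9 -]', '', a)  # delete any other characters
    a = a.replace(' ', '-')            # spaces to hyphens
    a = re.sub(r'-+', '-', a)          # condense repeated hyphens
    return a

def complete_header(text, now=None):
    """Insert Date, Slug, and Link lines after the Keywords line."""
    now = now or datetime.now()
    lines = text.split("\n")
    titles = [l for l in lines if l.startswith("Title:")]
    if not titles:
        return text  # nothing to work from, so leave the text alone
    slug = slugify(titles[0][len("Title:"):].strip())
    new_lines = [
        now.strftime("Date: %Y-%m-%d %H:%M:%S"),
        "Slug: " + slug,
        "Link: {}/{}/{}/".format(BASE_URL, now.strftime("%Y/%m"), slug),
    ]
    out = []
    for line in lines:
        out.append(line)
        if line.startswith("Keywords:"):
            out.extend(new_lines)
    return "\n".join(out)

def main():
    # Read the text to be filtered on stdin and write the result to stdout.
    sys.stdout.write(complete_header(sys.stdin.read()))
```

A filter script built this way would end by calling `main()`, on the assumption that the editor hands over the text on standard input and replaces it with whatever comes back on standard output.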

The unidecode library is a Python port of a Perl module, and its documentation is sparse. If you want to know what it does and why it does it, go to Sean Burke’s writeup of his original Perl module, Text::Unidecode. It lays out his goals for the module, explains its limitations, and includes little gems like this:

“I discourage you from being yet another German who emails me, trying to impel me to consider a typographical nicety of German to be more important than all other languages.”

If you ever need to ASCIIfy some text, Text::Unidecode or one of its ports (there’s one for Ruby, for example) will come in handy.
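For a sense of why a dedicated transliteration table is worth having, here is roughly the best I would expect from the standard library alone: decompose the accented characters, then throw away whatever isn’t ASCII. The helper name `ascii_fold` is made up for this example.

```python
import unicodedata

def ascii_fold(s):
    """Decompose accented characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", s)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_fold("Çingleton/Montréal"))  # accented letters keep their base letter
print(ascii_fold("Søren Weiß"))          # ø and ß have no decomposition, so they vanish
```

The first string comes through as Cingleton/Montreal, but the second loses letters outright and becomes Sren Wei. Filling in that kind of gap with a sensible ASCII substitute is the job a Unidecode-style lookup table is meant to do.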


  1. “Static blog publishing workflow” may be the most jargon-filled four-word phrase I’ve ever written.