SLaks.Blog

Making the world a better place, one line of code at a time

Migrating client-side syntax highlighting to Jekyll

Posted on Sunday, June 02, 2013

The next step in migrating my blog to Jekyll was to convert the code blocks to use Jekyll’s {% highlight %} tag.

Since Blogger has no support for server-side syntax highlighting, all of the code blocks in my HTML are implemented as <pre class="brush: someLanguage">...</pre>, with HTML-escaped code inside the tag (the class name is used by SyntaxHighlighter). I needed to convert these to Liquid tags with raw (non-escaped) code inside them.

To do this, I wrote a small C# script:

// Namespaces used: System, System.Collections.Generic, System.IO, System.Linq,
// System.Net (WebUtility), System.Text.RegularExpressions.
const string PostsFolder = @".../_posts";

// SyntaxHighlighter brush names that differ from the Pygments lexer names.
var langMappings = new Dictionary<string, string>{
	{ "vb", "vb.net" }
};
Func<string, string> GetLanguage = lang => {
	string mapped;
	if (langMappings.TryGetValue(lang, out mapped))
		return mapped;
	return lang;
};

// Wraps a replacement pattern (with $1-style group references) in a MatchEvaluator.
Func<string, MatchEvaluator> Replace = replacement => match => match.Result(replacement);

var regexes = new [] {
	// <pre class="brush: lang">...</pre> becomes a {% highlight lang %} block.
	Tuple.Create<Regex, MatchEvaluator>(
		new Regex(@"\s*\<pre\s*class=""brush:\s*(\w+);?\s*"">\n*(.*?)\n*</pre>\s*", RegexOptions.Singleline),
		m => "\n{% endraw %}\n{% highlight " + GetLanguage(m.Groups[1].Value) + " %}\n"
			+ WebUtility.HtmlDecode(m.Groups[2].Value)
			+ "\n{% endhighlight %}\n{% raw %}\n"
	),
	// Strip empty raw blocks left between adjacent code blocks.
	Tuple.Create(new Regex(@"{% raw %}\s*{% endraw %}\s*", RegexOptions.Singleline), Replace("")),
	// Replace Windows Live Writer's <font> tags with <code>.
	Tuple.Create(new Regex(@"<font face=""Courier New"">([^<]+)</font>", RegexOptions.Singleline), Replace("<code>$1</code>"))
};

// Apply every regex, in order, to each post and write the result back in place.
foreach (var file in Directory.EnumerateFiles(PostsFolder, "*.html")) {
	File.WriteAllText(file,
		regexes.Aggregate(File.ReadAllText(file), (txt, tuple) => tuple.Item1.Replace(txt, tuple.Item2))
	);
}

(This kind of script is most easily run in LINQPad or scriptcs.)

This code loops through every HTML file in PostsFolder and runs the text through a series of regular expressions (yes, evil) to update it for Jekyll.
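To see the "series of regular expressions" idea on its own, here is a minimal sketch of the same Aggregate fold applied to an in-memory string rather than to files. The two rules and the input are made up for illustration and are not part of the migration script; the point is only that each rule sees the output of the previous one, so order matters.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Two hypothetical rules, applied in order like the regexes array in the script.
var rules = new[] {
	// Rule 1: rewrite <b>...</b> as <strong>...</strong>.
	Tuple.Create<Regex, MatchEvaluator>(
		new Regex(@"<b>(.*?)</b>"),
		m => "<strong>" + m.Groups[1].Value + "</strong>"),
	// Rule 2: drop <strong> elements that contain only whitespace.
	Tuple.Create<Regex, MatchEvaluator>(
		new Regex(@"<strong>\s*</strong>"),
		m => "")
};

// Fold the text through every rule; each rule receives the previous rule's output.
Func<string, string> applyAll = text =>
	rules.Aggregate(text, (txt, rule) => rule.Item1.Replace(txt, rule.Item2));

Console.WriteLine(applyAll("<b>bold</b> and <b> </b>empty"));
```

The second rule only has something to remove because the first rule has already run, which is the same reason the empty-block cleanup in the script comes after the code block conversion.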

The first, and biggest, regex matches <pre class="brush: someLanguage">...</pre> and converts each match to a Jekyll {% highlight someLanguage %} block. Since SyntaxHighlighter and Pygments use different names for some languages, langMappings maps language names from the HTML to language names for the Jekyll output. I also un-HTML-escape the contents of each code block. Finally, because I modified blogger2jekyll to wrap each post in a {% raw %} tag, the replacement closes the {% raw %} block before the code and reopens it afterward.
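As a rough illustration of that conversion, the sketch below runs the same pattern and replacement over a single in-memory fragment instead of a file. The sample HTML and the convertCodeBlocks name are invented for this example; the regex, the language mapping, and the output format are the ones from the script.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Same mapping idea as the script: SyntaxHighlighter name -> Pygments name.
var langMappings = new Dictionary<string, string> { { "vb", "vb.net" } };

// Same pattern as the first regex in the script.
var preBlock = new Regex(
	@"\s*\<pre\s*class=""brush:\s*(\w+);?\s*"">\n*(.*?)\n*</pre>\s*",
	RegexOptions.Singleline);

// Hypothetical helper name: converts every <pre class="brush: ..."> block in a string.
Func<string, string> convertCodeBlocks = html => preBlock.Replace(html, m => {
	var lang = m.Groups[1].Value;
	string mapped;
	if (langMappings.TryGetValue(lang, out mapped))
		lang = mapped;
	// Leave the surrounding raw block, emit the highlight tag, then re-enter raw.
	return "\n{% endraw %}\n{% highlight " + lang + " %}\n"
		+ WebUtility.HtmlDecode(m.Groups[2].Value)
		+ "\n{% endhighlight %}\n{% raw %}\n";
});

// Made-up post fragment with one escaped VB snippet.
var sample = "<p>Before</p>\n<pre class=\"brush: vb\">If a &lt; b Then Swap(a, b)</pre>\n<p>After</p>";
Console.WriteLine(convertCodeBlocks(sample));
```

For this input, the highlight tag names vb.net rather than vb, and the &lt; entity comes out as a literal < inside the block.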

Depending on what your code blocks look like, you may want to change it to wrap the contents of each {% highlight %} tag in a {% raw %} block too.

The next regex strips empty {% raw %} blocks in case there are two code blocks in a row (which I had here).
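A small sketch of why that cleanup is needed: when two code blocks are adjacent, the first regex emits a {% raw %} at the end of one block and an {% endraw %} at the start of the next, with only whitespace between them. The input string below is a made-up example of that intermediate output; the pattern is the one from the script.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as the second regex in the script.
var emptyRaw = new Regex(@"{% raw %}\s*{% endraw %}\s*", RegexOptions.Singleline);

// Hypothetical intermediate text: two converted code blocks back to back.
var converted =
	"{% highlight csharp %}\nint a = 1;\n{% endhighlight %}\n{% raw %}\n"
	+ "\n{% endraw %}\n{% highlight csharp %}\nint b = 2;\n{% endhighlight %}\n{% raw %}\n";

// Remove raw/endraw pairs that enclose nothing but whitespace.
var cleaned = emptyRaw.Replace(converted, "");
Console.WriteLine(cleaned);
```

Only the empty pair between the two blocks is removed; the final {% raw %} stays, since it reopens the block that wraps the rest of the post.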

The last regex replaces bad non-semantic <font> tags created by Windows Live Writer with <code> tags.
If your imported HTML has similar issues, you can add more regexes to correct them.
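For reference, here is a minimal sketch of that last replacement on its own, using the same Replace helper shape as the script. The sample sentence is invented. Match.Result expands $1 to the first capture group, so the text inside the <font> tag is carried over unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

// Same helper shape as in the script: turn a replacement pattern into a MatchEvaluator.
Func<string, MatchEvaluator> Replace = replacement => match => match.Result(replacement);

// Same pattern as the third regex in the script.
var fontTag = new Regex(@"<font face=""Courier New"">([^<]+)</font>", RegexOptions.Singleline);

// Made-up sentence in the style Windows Live Writer produces.
var sample = "Call <font face=\"Courier New\">Dispose()</font> when done.";
var result = fontTag.Replace(sample, Replace("<code>$1</code>"));
Console.WriteLine(result);
```

An additional fix for your own HTML would follow the same shape: one more Regex paired with a Replace(...) pattern, appended to the array.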

Next time: Preserving the RSS feed

Categories: Jekyll, C#
