Exporting OpenOffice documents with equations to Markdown with MathJax/TeX formulas

Well, that was the most un-inspirational title ever. Especially because this is the first blog post in ages on here. Anyhow….

The problem:

So a colleague of mine wrote a very hefty 300-page tome on electric fundamentals. It is written in Microsoft Word but, this being the 21st century and all, we would really love to have this syllabus as an online gitbook.com site as well. However, it contains over 700 equations that just won’t get converted.

When we ran gitbook-convert on the .docx, the output was okay-ish (though the image URIs needed some manual labour afterwards). However, there were no equations to be found whatsoever: it simply skipped them like a lazy student.

The solution:

After dicking around with several “solutions” from StackOverflow we finally worked out a solution ourselves. As a fair warning: I suck at/can’t use regex, so you will see some cringy stuff down here… but hey, it works and that’s what counts!

How we solved it:

  1. Save the .docx document as an OpenDocument text file (ODT) from within Microsoft Word.
  2. Send the odt through my epic code (seen below), which does the following:
    1. Unpack the odt file (it’s just a zip with lots of XML files)
    2. Identify the equations in the document
    3. Transform the equations to MathJax-compatible versions
    4. Insert the transformed equations into the odt
    5. Repack everything into an odt file
  3. Send the odt through gitbook-convert
  4. Profit!

Gimme epic code!

Ok, so step 2 was, of course, the main problem. Here’s “a solution”, but as warned it’s just a quick-and-dirty fix.

Step 1: Unpack the odt file

  System.IO.Compression.ZipFile.ExtractToDirectory(source, tempfolder);

Easy huh 😀
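
One catch: the two-argument ExtractToDirectory throws an IOException when a file it wants to write already exists in the target folder, so running the tool twice against the same tempfolder fails. A minimal sketch of one way around that, assuming it is fine to unpack into a brand-new folder under the system temp directory on every run (OdtUnpacker and UnpackToFreshFolder are names made up for this example, they are not part of the original tool):

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class OdtUnpacker
{
    // Hypothetical helper: unpacks the odt into a folder that does not exist
    // yet and returns the path of that folder.
    public static string UnpackToFreshFolder(string source)
    {
        // A GUID-based name makes a clash with an earlier run very unlikely.
        string tempfolder = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));

        // ExtractToDirectory creates the target folder itself.
        ZipFile.ExtractToDirectory(source, tempfolder);
        return tempfolder;
    }
}
```

The returned path can then be used as tempfolder in the rest of the steps.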

Step 2: Identify the equations in the document
Using a very science-y method we discovered that all equations conveniently live in subfolders called “Object x” (x being a number). In each of those, the actual equation is described in a separate content.xml document using MathML, the XML format OpenOffice uses to describe equations.

So we used a simple loop over all extracted folders to identify the XML-files we needed.

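            List<String> files = new List<String>();
            try
            {
                foreach (string d in Directory.GetDirectories(sDir))
                {
                    // Test the folder name only, so an "Object" elsewhere in the path can't match
                    if (Path.GetFileName(d).Contains("Object"))
                    {
                        foreach (var item in Directory.GetFiles(d))
                        {
                            // Each equation folder holds its MathML in content.xml
                            if (Path.GetFileName(item) == "content.xml")
                                files.Add(item);
                        }
                    }
                }
            }
            catch (System.Exception)
            {
                //Quick and dirty: on any error we stop searching and return what we found so far
            }

            return files;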

Step 3: Replace the equations in the document with MathJax-compatible versions
This is the meaty part of the solution.
There are several XSL files to be found online that transform MathML to MathJax/TeX format. We used one of them, mmltex.xsl.

So first we iterate over all the found XML-files from the previous step and transform them using the xsl-files:

                foreach (var item in files)
                {
                    var myXslTrans = new XslCompiledTransform();
                    myXslTrans.Load(@"mmltotex\mmltex.xsl");
      
                    using (StringWriter sw = new StringWriter())
                    using (XmlWriter xwo = XmlWriter.Create(sw, myXslTrans.OutputSettings)) // use OutputSettings of xsl, so it can be output as HTML
                    {
                        myXslTrans.Transform(item, xwo);

Next, we need to save this transformed equation so we can, later on, inject it inside the actual odt-document.

Before saving the transformation I also cleaned it up a bit so that gitbook won’t start crying like a little baby (it’s not very happy with two opening curly braces next to each other, or with multiline equations). After cleanup, I save each equation in a dictionary with the folder name as the key, since this is the same id the main odt document (content.xml in the root of the odt/zip) uses to point to the XML files in the Object folders (long sentence, too tired to write out :p).

And finally I add the much-needed extra dollar signs, since my xsl only adds one and we definitely need two at the start and end :(

                        string dirname = new DirectoryInfo(Path.GetDirectoryName(item)).Name;
                        string texform = sw.ToString();
                        if (!texform.StartsWith("$"))  //No dollar sign yet: collapse multiline equations into a single line and wrap them
                        {
                            texform = $"${texform.Replace(Environment.NewLine, String.Empty)}$";
                        }
                        texform = ("$" + texform + "$").Replace("{{", "{ {"); 
                        res += $"{dirname};{texform};{Environment.NewLine}";

                        eqslistres.Add(dirname, texform);
                    }
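
To make the cleanup easier to check on its own, here is a sketch with the same three operations pulled into a small function. TexCleanup and ToGitbookMath are names made up for this example; the logic is meant to mirror the snippet above, nothing more.

```csharp
using System;

static class TexCleanup
{
    // Same cleanup as in the loop above, as a pure function.
    public static string ToGitbookMath(string texform)
    {
        // Output without a leading "$": strip line breaks and add the first pair of dollar signs.
        if (!texform.StartsWith("$"))
        {
            texform = $"${texform.Replace(Environment.NewLine, String.Empty)}$";
        }

        // Add the second pair of dollar signs and split every "{{" that gitbook chokes on.
        return ("$" + texform + "$").Replace("{{", "{ {");
    }

    static void Main()
    {
        // Prints: $$\frac{ {a}}{b}$$
        Console.WriteLine(ToGitbookMath("$\\frac{{a}}{b}$"));
    }
}
```

Note that String.Replace works on non-overlapping matches, so a run of three opening braces becomes `{ {{` and still contains a `{{`. If your equations nest that deeply, running the replace in a loop until nothing changes would be one way to deal with it.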

Step 4: Insert transformed equations into odt
Next, we replace all equation XML elements inside the main document with new text span elements that contain our transformed equations (records here is the dictionary filled in step 3, where it is called eqslistres):

            var doc = XDocument.Load(path + "\\content.xml");

            var desc = doc.Descendants("{urn:oasis:names:tc:opendocument:xmlns:drawing:1.0}" + "frame").ToList();
            for (int i = 0; i < desc.Count(); i++)
            {
                var item = desc[i];
                var obj = item.Descendants("{urn:oasis:names:tc:opendocument:xmlns:drawing:1.0}" + "object").FirstOrDefault();
                if (obj != null)
                {
                    var atr = obj.Attribute("{http://www.w3.org/1999/xlink}" + "href").Value.Replace("./", string.Empty);
                    if (atr.EndsWith("/"))
                        atr = atr.Replace("/", string.Empty);
                    if (records.ContainsKey(atr))
                    {

                        Console.WriteLine($"{records[atr]}");
                        XElement v = new XElement(XName.Get("span", "urn:oasis:names:tc:opendocument:xmlns:text:1.0"));
                        v.Add(records[atr]);
                        item.ReplaceWith(v);
                    }
                }

            }
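
If you want to try the replacement without a real odt at hand, the following sketch runs the same idea on a tiny in-memory document. FrameReplaceDemo and ReplaceFrames are hypothetical names, the XML is a hand-made fragment rather than real Word output, and unlike the code above it only looks at a draw:object that is a direct child of the frame.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class FrameReplaceDemo
{
    static readonly XNamespace Draw = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0";
    static readonly XNamespace Text = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    static readonly XNamespace XLink = "http://www.w3.org/1999/xlink";

    // Replaces every draw:frame that points at a known "Object x" folder
    // with a text:span holding the TeX string for that folder.
    public static XDocument ReplaceFrames(XDocument doc, IDictionary<string, string> records)
    {
        // ToList() takes a snapshot, so we can modify the tree while looping.
        foreach (var frame in doc.Descendants(Draw + "frame").ToList())
        {
            var href = frame.Element(Draw + "object")?.Attribute(XLink + "href")?.Value;
            if (href == null) continue; // e.g. an image frame: leave it alone

            // "./Object 1" or "./Object 1/" both become "Object 1".
            var key = href.Replace("./", string.Empty).TrimEnd('/');
            if (records.TryGetValue(key, out var tex))
                frame.ReplaceWith(new XElement(Text + "span", tex));
        }
        return doc;
    }

    static void Main()
    {
        var doc = XDocument.Parse(
            "<text:p xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\"" +
            " xmlns:draw=\"urn:oasis:names:tc:opendocument:xmlns:drawing:1.0\"" +
            " xmlns:xlink=\"http://www.w3.org/1999/xlink\">" +
            "<draw:frame><draw:object xlink:href=\"./Object 1\"/></draw:frame></text:p>");
        var records = new Dictionary<string, string> { ["Object 1"] = "$$E=mc^2$$" };

        // The draw:frame is gone and a text:span with the formula sits in its place.
        Console.WriteLine(ReplaceFrames(doc, records));
    }
}
```

Using XNamespace + "name" is just a shorter way to write the "{namespace}name" strings used above; both resolve to the same XName.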

Step 5: Repack everything to an odt file
Last but not least: we save the new XML:

doc.Save(path + "\\content.xml");

And repack everything:


System.IO.Compression.ZipFile.CreateFromDirectory(tempfolder, "mynewepicbook.odt");
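
The one-liner did the job for this book. If some other tool turns out to be pickier about your odt: the OpenDocument package format wants the mimetype file to be the first entry in the zip and to be uncompressed, and CreateFromDirectory gives you no say over either. Below is a hedged sketch of a manual repack that writes mimetype first and asks for no compression on it. OdtPacker and Pack are invented names, and I have not checked that every .NET version really writes a NoCompression entry with the zip "stored" method, so treat that part as an assumption to verify.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class OdtPacker
{
    // Hypothetical helper: zips tempfolder into target with "mimetype" as the first entry.
    public static void Pack(string tempfolder, string target)
    {
        string root = Path.GetFullPath(tempfolder)
            .TrimEnd(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar);

        // ZipArchiveMode.Create refuses to overwrite an existing file.
        if (File.Exists(target)) File.Delete(target);

        using (ZipArchive zip = ZipFile.Open(target, ZipArchiveMode.Create))
        {
            // Write mimetype first, requesting no compression.
            string mimetype = Path.Combine(root, "mimetype");
            if (File.Exists(mimetype))
                zip.CreateEntryFromFile(mimetype, "mimetype", CompressionLevel.NoCompression);

            foreach (string file in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
            {
                // Entry names are relative to the root and use forward slashes.
                string name = file.Substring(root.Length + 1).Replace('\\', '/');
                if (name == "mimetype") continue; // already written
                zip.CreateEntryFromFile(file, name, CompressionLevel.Optimal);
            }
        }
    }
}
```

Usage would be `OdtPacker.Pack(tempfolder, "mynewepicbook.odt");` in place of the CreateFromDirectory call. Empty folders are not copied by this sketch.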

One final step

We can now send this odt through gitbook-convert and all the equations, as if by magic, will be there and rendered in all their glory!

There’s only one final step to do. The formulas will only be rendered in gitbook if you add the mathjax plugin. Add a book.json file to the root of your gitbook folder and list the plugin in it:


{
"gitbook": "3.2.3",
"plugins": ["mathjax"]
}

Profit!

Please note that when you serve the book for the first time (using gitbook serve) the formulas might not render in your browser right away. There’s some delay there. Simply refresh, or wait a few seconds, and normally everything should show up as promised!

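And so we go from equations in the Word document to the same equations rendered in gitbook (before and after screenshots not reproduced here).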

PS Check out the epic course (in Dutch) on Electric Fundamentals here!
