Exporting OpenOffice documents with equations to Markdown with MathJax/TeX formulas

Well, that was the most un-inspirational title ever. Especially because this is the first blog post in ages on here. Anyhow….

The problem:

So a colleague of mine wrote a very hefty 300-page tome on electric fundamentals. It is written in Microsoft Word but, this being the 21st century and all, we would really love to have this syllabus as an online gitbook.com site as well. However, it contains over 700 equations that just won’t get converted.

When we ran gitbook-convert on the .docx, the output was okay-ish (though the image URIs needed some manual labor afterwards). However, there were no equations to be found whatsoever: it simply skipped them like a lazy student.

The solution:

After dicking around with several “solutions” from Stack Overflow we finally managed to work out a solution ourselves. As a fair warning: I suck at regex, so you will see some cringy stuff down here… but hey, it works and that’s what counts!

How we solved it:

  1. Save the .docx document as an OpenDocument text file (ODT) from within Microsoft Word.
  2. Send the odt through my epic code (seen below), which does the following:
    1. Unpack the odt file (it’s just a zip with lots of xml-files)
    2. Identify the equations in the document
    3. Transform the equations to MathJax-compatible versions
    4. Insert the transformed equations into the odt
    5. Repack everything to an odt file
  3. Send the odt through gitbook-convert
  4. Profit!

Gimme epic code!

Ok, so step 2 was of course the main problem. Here’s “a solution”, but as warned it’s just a quick ’n’ dirty fix.

Step 1: Unpack the odt file

  System.IO.Compression.ZipFile.ExtractToDirectory(source, tempfolder);

Easy huh 😀

Step 2: Identify the equations in the document
Using a very science-y method we discovered that all equations conveniently live in subfolders called “Object x” (x being a number). Each of these folders describes the actual equation in its own content.xml document using MathML, the OpenOffice way of describing equations.
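
To make that concrete, here is roughly what the relevant parts of an unpacked odt look like. This is an illustrative sketch, not output copied from the actual syllabus: the object numbers and the formula are made up, and real files carry more files, attributes and namespace declarations than shown here.

```xml
<!-- Layout of the extracted folder (only the parts that matter here):
       content.xml            main document, points at the equations
       Object 1/content.xml   MathML for the first equation
       Object 2/content.xml   MathML for the second equation
-->

<!-- In the main content.xml, each equation is an embedded object in a frame: -->
<draw:frame>
  <draw:object xlink:href="./Object 1" xlink:type="simple"/>
</draw:frame>

<!-- Object 1/content.xml then holds the formula itself, for example U/R: -->
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mfrac>
    <mi>U</mi>
    <mi>R</mi>
  </mfrac>
</math>
```

The href on the draw:object element is what ties a frame in the main document to its “Object x” folder, which is why the folder name is used as the lookup key later on.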

So we used a simple loop over all extracted folders to identify the xml-files we needed.

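            // sDir is the folder the odt was extracted to.
            List<String> files = new List<String>();
            try
            {
                foreach (string d in Directory.GetDirectories(sDir))
                {
                    // Only look at the folder name, so a parent path that happens to
                    // contain "Object" cannot cause false matches.
                    if (Path.GetFileName(d).Contains("Object"))
                    {
                        foreach (var item in Directory.GetFiles(d))
                        {
                            // Each equation folder stores its MathML in content.xml.
                            if (Path.GetFileName(item) == "content.xml")
                                files.Add(item);
                        }
                    }
                }
            }
            catch (System.Exception)
            {
                // Quick 'n dirty: on any I/O error we stop scanning and
                // return whatever we collected so far.
            }

            return files;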

Step 3: Replace the equations in the document with Mathjax compatible version
This is the meaty part of the solution.
There are several XSL files to be found online that transform MathML to MathJax/TeX format, so we used one of those (mmltex.xsl).

First we iterate over all the xml-files found in the previous step and transform them using the xsl-file:

                // Compile the stylesheet once and reuse it for every equation.
                var myXslTrans = new XslCompiledTransform();
                myXslTrans.Load(@"mmltotex\mmltex.xsl");

                foreach (var item in files)
                {
                    using (StringWriter sw = new StringWriter())
                    using (XmlWriter xwo = XmlWriter.Create(sw, myXslTrans.OutputSettings)) // use the output settings declared in the xsl, so non-XML output works
                    {
                        // item is the path to an "Object x/content.xml" file
                        myXslTrans.Transform(item, xwo);
Next we need to save the transformed equation so we can inject it into the actual odt document later on.

Before saving the result I also clean it up a bit so that gitbook won’t start crying like a little baby (it’s not very happy with two opening curly braces next to each other, nor with multiline equations). After the cleanup I store each equation in a dictionary with the folder name as the key, since that is the same id the main odt document (content.xml in the root of the odt/zip) uses to point to the xml-files in the Object folders.

And finally I add the much-needed extra dollar signs, since my xsl only adds one and we definitely need two at the start and end :(

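                        // The folder name ("Object x") is the key the main document uses.
                        string dirname = new DirectoryInfo(Path.GetDirectoryName(item)).Name;
                        string texform = sw.ToString();
                        if (!texform.StartsWith("$"))  // No leading dollar sign: flatten to a single line and add the first pair
                        {
                            texform = $"${texform.Replace(Environment.NewLine, String.Empty)}$";
                        }
                        // Add the second pair of dollar signs and split up "{{", which gitbook chokes on.
                        texform = ("$" + texform + "$").Replace("{{", "{ {");
                        // res collects one semicolon-separated log line per equation.
                        res += $"{dirname};{texform};{Environment.NewLine}";

                        eqslistres.Add(dirname, texform);
                    }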
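
If you want the cleanup in one place, it could be pulled into a small helper along these lines. This is only a sketch and not the code we actually ran: the names TexCleanup and ToDisplayMath are made up for this example, and it deviates from the snippet above in two ways. It replaces line breaks with a space instead of removing them, and it keeps splitting until no “{{” is left, so a run like “{{{” is handled too.

```csharp
using System;

public static class TexCleanup
{
    // Hypothetical helper (not part of the original code): turns the raw
    // converter output into a single-line formula wrapped in "$$ ... $$".
    public static string ToDisplayMath(string rawTex)
    {
        // Drop surrounding whitespace and any dollar signs the xsl already added.
        string body = rawTex.Trim().Trim('$').Trim();

        // Collapse multiline equations into a single line.
        body = body.Replace("\r\n", " ").Replace("\n", " ");

        // Split up every run of opening braces ("{{", "{{{", ...).
        while (body.Contains("{{"))
        {
            body = body.Replace("{{", "{ {");
        }

        return "$$" + body + "$$";
    }
}
```

With this in place, the dictionary entry would be filled with something like eqslistres.Add(dirname, TexCleanup.ToDisplayMath(sw.ToString())). Note the assumption that a formula never legitimately starts or ends with a dollar sign of its own.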

Step 4: Insert transformed equations into odt
In the next step we replace all equation xml-elements inside the main document with new text:span elements that contain our transformed equations:

            XNamespace draw = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0";
            XNamespace text = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
            XNamespace xlink = "http://www.w3.org/1999/xlink";

            var doc = XDocument.Load(Path.Combine(path, "content.xml"));

            // records is the dictionary built in step 3 (folder name -> TeX formula).
            // ToList() takes a snapshot so we can replace frames while looping.
            foreach (var item in doc.Descendants(draw + "frame").ToList())
            {
                var href = item.Descendants(draw + "object").FirstOrDefault()?.Attribute(xlink + "href");
                if (href != null)
                {
                    // "./Object 1" or "Object 1/" -> "Object 1"
                    var atr = href.Value.Replace("./", string.Empty).TrimEnd('/');
                    if (records.ContainsKey(atr))
                    {
                        Console.WriteLine($"{records[atr]}");
                        // Swap the whole frame for a text:span holding the formula.
                        item.ReplaceWith(new XElement(text + "span", records[atr]));
                    }
                }
            }

Step 5: Repack everything to an odt file
Last but not least: we save the new xml:

doc.Save(path + "\\content.xml");

And repack everything:


System.IO.Compression.ZipFile.CreateFromDirectory(tempfolder, "mynewepicbook.odt");

One final step

We can now send this odt through gitbook-convert and all the equations will, as if by magic, be there and rendered in all their glory!

There’s only one final step left. The formulas will only be rendered in gitbook if you add the mathjax plugin. To do so, add a book.json file to the root of your gitbook folder and list the plugin in it:


{
"gitbook": "3.2.3",
"plugins": ["mathjax"]
}

Profit!

Please note that when you serve the book for the first time (using gitbook serve) the formulas might not render in your browser straight away. There’s some delay there. Simply refresh, or wait a few seconds, and normally everything should show up as promised!

And so we go from equations in a Word document to rendered formulas in the gitbook. [Before and after screenshots]


PS Check out the epic course (in Dutch) on Electric Fundamentals here!

About timdams
C#, .NET, Microsoft, security, .... Read more on : https://timdams.com/
