Simple Word document templating using Ruby and XML

In my current project we have a requirement to merge simple data into Microsoft Word document templates. Ruby comes with the WIN32OLE library, which can manipulate Office documents, but it has a few major downsides: it only runs on Windows, it requires Microsoft Office to be installed, and it works by sending commands to Word itself to perform operations. Using Word as a back-end system for a web application used by 50 people made us nervous, so a different approach was needed. We came up with a combination that works: Ruby, the Office Open XML file format, XML processing with Nokogiri and native zip tools.

Office Open XML file formats

The new Office file formats (.docx, .xlsx, .pptx files) are basically zipped collections of XML files. We focused on Word files (.docx), but this approach would work with any of the other file types as well. The specification for the format weighs in at several thousand pages, so producing a file from scratch without a purpose-built library that handles all the intricacies of the format would be quite a task. Instead, we drafted the templates in Word and placed markers to tell our templating engine where to insert values. We created document properties that reference data values and added these as fields in the document at the places where the values should be inserted. For example, we could have fields like:

  • label_tag #{data[:user].name}
  • label_tag #{data[:user].address}
  • label_tag #{data[:booking].number}
  • label_tag #{data[:booking].items.collect{|i| i.name}.join(',')}

If it looks a bit like Ruby code, it’s because it is! The expressions get evaluated by our templating engine and the results are inserted into the document. Ruby in Word documents, a world first?
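
As an illustration only (this is not the project's actual code), evaluating such a field might look like the sketch below. The Struct classes, the sample data, the FIELD_PATTERN regular expression and the evaluate_field helper are all assumptions made for the example.

```ruby
# Hypothetical stand-ins for the application's model objects.
User    = Struct.new(:name, :address)
Item    = Struct.new(:name)
Booking = Struct.new(:number, :items)

# Captures the Ruby expression inside a field such as
#   label_tag #{data[:user].name}
# The greedy (.*) runs to the LAST closing brace, so expressions that
# contain their own braces (blocks, hashes) are captured whole.
FIELD_PATTERN = /label_tag\s+#\{(.*)\}/m

def evaluate_field(field_text, data)
  match = FIELD_PATTERN.match(field_text)
  return nil unless match

  # `data` is visible to the expression through this method's binding.
  # eval runs arbitrary code, so this only suits trusted templates.
  eval(match[1], binding).to_s
end

# Hypothetical sample data.
data = {
  user:    User.new("Jane Doe", "1 Example Street"),
  booking: Booking.new(1234, [Item.new("Tent"), Item.new("Stove")])
}

puts evaluate_field('label_tag #{data[:user].name}', data)
# prints "Jane Doe"
puts evaluate_field(%q(label_tag #{data[:booking].items.collect{|i| i.name}.join(',')}), data)
# prints "Tent,Stove"
```

Because the expressions run through eval, anyone who can edit a template can run code on the server, so a scheme like this is best kept to templates from trusted authors.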

Opening the documents

To read and create documents we need to unzip and re-zip the document. We had trouble using the RubyZip library: for some reason Word gave a nasty warning when opening files created with RubyZip. Our application has to run on Windows, Linux and Mac, so we created an adapter that delegates to standard operating system zip executables based on the host platform. To keep it fast, we extract and re-add only the files that we need to work on. This is important because some documents can become very large when they contain embedded objects such as images.
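
A rough sketch of what such an adapter could look like follows. The ZipAdapter module and its method names are hypothetical, and the choice of Info-ZIP's zip/unzip on Linux and Mac and 7-Zip's 7z on Windows is an assumption; the post does not say which executables were actually used.

```ruby
require 'rbconfig'

# Hypothetical adapter: builds the command line for the host platform's
# zip tooling. It only constructs argument arrays; running them is left
# to the caller.
module ZipAdapter
  # Note: "darwin" contains "win", so match the Windows names explicitly.
  WINDOWS = /mswin|mingw|cygwin/

  # Command to extract a single entry (e.g. "word/document.xml") into dest_dir.
  def self.extract_command(archive, entry, dest_dir, host_os = RbConfig::CONFIG["host_os"])
    if host_os =~ WINDOWS
      ["7z", "x", archive, "-o#{dest_dir}", entry, "-y"]   # assumes 7-Zip on PATH
    else
      ["unzip", "-o", archive, entry, "-d", dest_dir]      # assumes Info-ZIP unzip
    end
  end

  # Command to put a single modified entry back into the archive.
  # Must be run from the directory the entry path is relative to.
  def self.update_command(archive, entry, host_os = RbConfig::CONFIG["host_os"])
    if host_os =~ WINDOWS
      ["7z", "a", "-tzip", archive, entry]
    else
      ["zip", archive, entry]
    end
  end
end

# Example usage (commented out so the sketch has no side effects):
#   system(*ZipAdapter.extract_command("letter.docx", "word/document.xml", "work"))
#   Dir.chdir("work") do
#     system(*ZipAdapter.update_command("../letter.docx", "word/document.xml"))
#   end
```

Passing the command as an argument array to system avoids shell quoting problems with file names that contain spaces.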

Processing the template

The document content can be found in the file word/document.xml inside the zip archive. The fields in the template come out as fldSimple tags that look like this:



Template Field: User Name

To process the document.xml we simply need to find all the fields that have the text label_tag in the w:instr attribute:

xml.xpath("//w:fldSimple[contains(@w:instr, 'label_tag')]").each do |element|
  # process each element here
end

The rest is simple. We extract the expression in the element text using a regular expression, evaluate it and insert the result back into the XML, which ends up looking like this:



Tomas Varsavsky

We add the attribute fldLock with value true to make the field read-only so the user cannot change it when they open the document.
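
The whole step (find the fields, extract and evaluate the expression, write the result back, lock the field) can be sketched roughly as follows. This is an illustration under several assumptions, not the project's code: it swaps in REXML, which ships with Ruby, for Nokogiri so that it runs without extra gems; the sample XML fragment and the merge_fields helper are hypothetical; and it assumes the expression is carried in the w:instr attribute, as the XPath query above suggests.

```ruby
require 'rexml/document'

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, heavily simplified document.xml fragment. The quoted
# heredoc keeps Ruby from interpolating the #{...} inside the template.
SAMPLE_XML = <<~'XML'
  <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
      <w:p>
        <w:fldSimple w:instr=" DOCPROPERTY label_tag #{data[:user].name} \* MERGEFORMAT ">
          <w:r><w:t>Template Field: User Name</w:t></w:r>
        </w:fldSimple>
      </w:p>
    </w:body>
  </w:document>
XML

def merge_fields(xml_string, data)
  doc = REXML::Document.new(xml_string)

  REXML::XPath.each(doc, "//w:fldSimple", "w" => W_NS) do |field|
    # Assumption: the expression lives in w:instr after the label_tag marker.
    instr = field.attributes["w:instr"].to_s
    match = instr.match(/label_tag\s+#\{(.*)\}/)
    next unless match

    # eval sees `data` through the binding; trusted templates only.
    value = eval(match[1], binding).to_s

    # Replace the placeholder text with the evaluated value.
    text_node = REXML::XPath.first(field, ".//w:t", "w" => W_NS)
    text_node.text = value if text_node

    # Lock the field so Word treats it as read-only.
    field.add_attribute("w:fldLock", "true")
  end

  doc.to_s
end

User = Struct.new(:name)   # hypothetical model object
puts merge_fields(SAMPLE_XML, { user: User.new("Jane Doe") })
```

With Nokogiri the same shape applies: the xpath call shown earlier yields the elements, and the text and attribute updates are made on each element in turn.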

We also have tags to create lists, insert rows into tables and duplicate sections in the document. These are a bit more complicated in their XML manipulation. Beware: we had a few issues dealing with Word’s nasty XML, which can vary a bit between versions and sometimes does unexpected things with formatting.

Conclusion

This approach worked really well for us and I would recommend it for simple field merging.

16 Responses to “Simple Word document templating using Ruby and XML”

  1. Dan Milne Says:

    Hey – that’s pretty cool! Nice advantage of MS using XML for file formats. If you’re going to lock the document though, wouldn’t PDF have been easier to work with? I suppose it’s easier to give the template Word file to a User to modify and have things “just work”.

  2. Tomas Varsavsky Says:

    Our users want to edit the document after we generate it so PDF wouldn’t have worked.

  3. Doug Mahugh : Links for 05/04/2009 Says:

    [...] (#LanguageTrollBait)  I came across a post that Tomas Varsavsky wrote a month ago on how to generate DOCX files from a template that includes Ruby source code, using a technique that includes actual Ruby source code within fields.  Very [...]

  5. Gerald Clette Says:

    Yes, That sounds pretty cool.
    One question remains for me: how can your users easily add the fields to the docx if they want to?
    That is to say: can your user “design” a docx which contains these fields and then pass this docx into a “black box” that detects fldSimple and replaces it? If the answer is ‘yes’: how can they find out which field to create? Help file? Drag & drop from a list?
    Thank you

  6. Tomas Varsavsky Says:

    If they know ruby, yes. They create document properties with the right expression and can then insert them as fields. Our users don’t know ruby so we give them a catalogue of pre-canned document properties with appropriate expressions in the document. We make the name of the property human readable so when they do ‘insert field’ they can choose one from the list of available properties.

  7. jcran Says:

    tomas,

    any chance you could make the code for this public? i also have a requirement to drop data from a database into word format. this is the most straight-forward method i have found, but could definitely use a little bit of guidance in this method.

    jcran

  8. Reuben Says:

    Hi Tomas,

    looks like a great solution! Would you mind sharing how you generated/manipulated tables on the ruby side? Feel free to email a file my way if that is easier or saves you time…

    Thanks very much,
    Reuben.

  9. Ryan Says:

    I am very keen to implement this in an in-house rails application. I have seen other approaches but this is by far the best – it really is quite amazing.

    I do not have the regex skills though, and I am quite new to rails – would you be prepared to share some more code to set me on the correct path?

    Thanks again for documenting your method.

  10. Mirko Says:

    Hi Tomas,

    This is a great idea. I am looking to do something similar for an internal document app we use to print client docs. Would it be OK if we could see some of the code? Also, does it work in Ruby 1.9?

  11. Joshua Says:

    Hey Tomas

    What about images/pictures? Can they be embedded? A client asked if we could put in not just bio text but also bio picture for each individual. I suppose I should take a look inside a docx template (or document?) with an image embedded to see how it works?

    Thanks for sharing!

  12. Tomas Varsavsky Says:

    Yes it can be done, images are stored inside the zip file and referenced by the document.xml.

  13. Tiago Says:

    Hello tomas
    I am doing an internship at a company and was given this task: a Ruby web application that reads a document in the (odt) format, which should already contain Ruby code to generate numbers and the document name, and then lets the user pick the type and save it as DOCX.

    Would this help me?

  14. Steve Says:

    Hi, Can you upload a simple example template and code?

    That would help me get started.

    Cheers.

  15. Jason Says:

    You can do a lot with fields, and your own “tags” for lists, table rows etc.

    But I think content control data binding is a better approach, because it gives you a two-way binding between data (in an XML format of your own choosing), and the document surface.

    See http://www.opendope.org for suggestions on how to use Open XML’s content controls for a complete solution.
