In my current project we need to merge simple data into Microsoft Word document templates. Ruby comes with the WIN32OLE library, which can manipulate Office documents, but it has a few major downsides: it only runs on Windows, it requires Microsoft Office to be installed, and it works by sending commands to Word itself to perform operations. Using Word as a back-end system for a web application used by 50 people made us nervous, so a different approach was needed. We came up with a combination that works: Ruby, the Office Open XML file format, XML processing with Nokogiri and native zip tools.
Office Open XML file formats
The new Office file formats (.docx, .xlsx, .pptx files) are basically zipped collections of XML files. We focused on Word files (.docx), but this approach would work with the other file types as well. The specification for the format weighs in at several thousand pages, so producing a file from scratch without a purpose-built library that handles all the intricacies of the format would be quite a task. Instead, we drafted the templates in Word and placed markers to tell our templating engine where to insert values. We created document properties that reference data values and added them as fields at the places in the document where the values should be inserted. For example, we could have fields like:
- label_tag #{data[:user].name}
- label_tag #{data[:user].address}
- label_tag #{data[:booking].number}
- label_tag #{data[:booking].items.collect{|i| i.name}.join(',')}
If it looks a bit like Ruby code, it’s because it is! The expressions get evaluated by our templating engine and the results are inserted into the document. Ruby in Word documents, a world first?
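A minimal sketch of how such a marker could be evaluated, not the actual engine from the project. The `User`, `Item` and `Booking` structs, the sample values and the helper names `extract_expression` and `evaluate_field` are all assumptions made for illustration; the post only tells us that expressions index into `data`. Because the sketch uses `eval`, it is only reasonable when the templates come from trusted authors.

```ruby
# Hypothetical sample data; the real application's objects are not shown in the post.
User    = Struct.new(:name, :address)
Item    = Struct.new(:name)
Booking = Struct.new(:number, :items)

data = {
  user:    User.new("Ada Lovelace", "12 St James's Square"),
  booking: Booking.new(42, [Item.new("Desk"), Item.new("Chair")])
}

# Pull the Ruby expression out of a marker such as 'label_tag #{data[:user].name}'.
# The greedy match runs to the last closing brace, so block bodies stay intact.
def extract_expression(field_text)
  match = field_text.match(/#\{(.*)\}/m)
  match && match[1]
end

# Evaluate the expression with the local variable `data` in scope.
# Returns nil when the text contains no expression.
def evaluate_field(field_text, data)
  expression = extract_expression(field_text)
  return nil if expression.nil?

  eval(expression, binding).to_s
end

# Single quotes keep Ruby from interpolating the markers before we parse them.
puts evaluate_field('label_tag #{data[:user].name}', data)
puts evaluate_field('label_tag #{data[:booking].items.collect{|i| i.name}.join(",")}', data)
```

Passing `binding` to `eval` is what makes `data` visible to the expression; anything else in scope is visible too, which is another reason to keep template authoring restricted.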
Opening the documents
To read and create documents we need to unzip and re-zip the document. We had trouble using the RubyZip library: for some reason Word gave a nasty warning when opening files created with it. Our application has to run on Windows, Linux and Mac, so we created an adapter that delegates to the operating system's zip executables based on the host platform. To keep it fast, we extract and re-add only the files that we need to work on. This is important because some documents can become very large when they contain embedded objects such as images.
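A rough sketch of what such an adapter could look like. The class name `ZipCommands` is hypothetical, and so is the choice of tools: the post does not say which executables were used, so this assumes Info-ZIP's `zip` and `unzip` on Linux and Mac and 7-Zip's `7z` on Windows. The sketch only builds the command lines; it does not run them.

```ruby
require "rbconfig"

# Hypothetical adapter: builds the commands that pull a single entry out of a
# .docx archive and put it back, depending on the host platform.
class ZipCommands
  def initialize(host_os = RbConfig::CONFIG["host_os"])
    @windows = !!(host_os =~ /mswin|mingw/i)
  end

  # Command that extracts only `entry` from `archive` into `dir`.
  def extract(archive, entry, dir)
    if @windows
      ["7z", "x", archive, "-o#{dir}", entry, "-y"]      # assumes 7-Zip is installed
    else
      ["unzip", "-o", archive, entry, "-d", dir]          # assumes Info-ZIP unzip
    end
  end

  # Command that adds or replaces `entry` in `archive`. It is meant to be run
  # from the extraction directory so the stored path stays e.g. word/document.xml.
  def update(archive, entry)
    if @windows
      ["7z", "u", archive, entry]
    else
      ["zip", archive, entry]
    end
  end
end

commands = ZipCommands.new
p commands.extract("report.docx", "word/document.xml", "tmp")
p commands.update("report.docx", "word/document.xml")
```

Each array can be handed to `system(*command)`. Returning arrays rather than strings avoids shell quoting problems with file names that contain spaces.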
Processing the template
The document content can be found in the file word/document.xml inside the zip archive. The fields in the template come out as fldSimple elements, with the field instruction in the w:instr attribute and the displayed text inside the element.
To process the document.xml we simply need to find all the fields that have the text label_tag in the w:instr attribute:
xml.xpath("//w:fldSimple[contains(@w:instr, 'label_tag')]").each do |element|
  # process each element here
end
The rest is simple. We extract the expression from the element text using a regular expression, evaluate it and write the result back into the XML in place of the expression.
We add the attribute fldLock with value true to lock the field, so Word does not recalculate it and replace the merged value when the user opens the document or updates its fields.
We also have tags to create lists, insert rows into tables and duplicate sections in the document. These are a bit more complicated in their XML manipulation. Beware: we had a few issues dealing with Word’s nasty XML, which can vary a bit between versions and sometimes does unexpected things with formatting.
Conclusion
This approach worked really well for us and I would recommend it for simple field merging.
April 4th, 2009 at 8:54 pm
Hey – that’s pretty cool! Nice advantage of MS using XML for file formats. If you’re going to lock the document though, wouldn’t PDF have been easier to work with? I suppose it’s easier to give the template Word file to a User to modify and have things “just work”.
April 5th, 2009 at 8:34 am
Our users want to edit the document after we generate it so PDF wouldn’t have worked.
May 5th, 2009 at 9:22 am
[...] (#LanguageTrollBait) I came across a post that Tomas Varsavsky wrote a month ago on how to generate DOCX files from a template that includes Ruby source code, using a technique that includes actual Ruby source code within fields. Very [...]
May 14th, 2009 at 6:31 pm
Yes, that sounds pretty cool.
One question remains for me: how can your users easily add the fields to the docx if they want to?
That is to say: can your user “design” a docx that contains these fields and then pass it to a “black box” that detects the fldSimple elements and replaces them? If the answer is ‘yes’, how do they find out which fields to create? A help file? Drag and drop from a list?
Thank you
May 14th, 2009 at 10:48 pm
If they know Ruby, yes. They create document properties with the right expression and can then insert them as fields. Our users don’t know Ruby, so we give them a catalogue of pre-canned document properties with appropriate expressions in the document. We make the name of each property human-readable, so when they do ‘insert field’ they can choose one from the list of available properties.
September 30th, 2009 at 1:24 pm
tomas,
any chance you could make the code for this public? i also have a requirement to drop data from a database into word format. this is the most straight-forward method i have found, but could definitely use a little bit of guidance in this method.
jcran
October 30th, 2009 at 10:56 pm
Hi Tomas,
looks like a great solution! Would you mind sharing how you generated/manipulated tables on the ruby side? Feel free to email a file my way if that is easier or saves you time…
Thanks very much,
Reuben.
October 30th, 2010 at 9:10 pm
I am very keen to implement this in an in-house rails application. I have seen other approaches but this is by far the best – it really is quite amazing.
I do not have the regex skills though, and I am quite new to rails – would you be prepared to share some more code to set me on the correct path?
Thanks again for documenting your method.
November 15th, 2010 at 3:16 pm
Hi Tomas,
This is a great idea. I am looking to do something similar for an internal document app we use to print client docs. Would it be OK if we could see some of the code, and does it work in Ruby 1.9?
November 30th, 2010 at 3:50 am
Hey Tomas
What about images/pictures? Can they be embedded? A client asked if we could put in not just bio text but also bio picture for each individual. I suppose I should take a look inside a docx template (or document?) with an image embedded to see how it works?
Thanks for sharing!
November 30th, 2010 at 6:12 am
Yes it can be done, images are stored inside the zip file and referenced by the document.xml.
February 4th, 2011 at 3:04 am
Hello tomas
I am doing an internship at a company and was given this task: a Ruby web application that reads a document in (odt) format, uses Ruby code to generate the numbers and the document name, then picks the type and saves it for the user as DOCX.
Can this help me?
March 12th, 2011 at 2:14 am
Hi, Can you upload a simple example template and code?
That would help me get started.
Cheers.
May 17th, 2011 at 10:51 pm
You can do a lot with fields, and your own “tags” for lists, table rows etc.
But I think content control data binding is a better approach, because it gives you a two-way binding between data (in an XML format of your own choosing), and the document surface.
See http://www.opendope.org for suggestions on how to use Open XML’s content controls for a complete solution.