Making a Better Word Importer ...

Submitted by bnwest. on 2006-12-06 10:00. Development
Making a Better Word Importer ...

Add a reply with your suggested improvements.


Brian

Re: Making a Better Word Importer ...

Posted by bnwest at 2006-12-06 10:04
From the desk of Kathi Fletcher:

Notes on making a module out of Three Special Events by Burrus and Baraniuk

* Metadata -- title, author, etc. Not only did I have to type all of those into the interface, but I then had to remove all the paragraphs they generated.

* Headings in the paper were just bold -- I added sections by hand, which is hard to do when you have to scroll down to find and insert the end of the section in the right place. It would have been much easier to do this in the Word file -- switch them to "heading" etc.

* Losing all indication of bold and italic was a huge pain.

Things that the style chooser would have helped with

* Cites needed to be added for book titles.
* Quotes had to be added.

* Bib entries all came out as individual lists of one item each (Word was inserting "continue numbering" styles).

* Editing bib entries was a big pain. I think actually that I used CiteULike and went and found as many of them as I could on Google Scholar and then had those tools generate BibXML. Can't remember whether it completely matched what CNX wanted.

Re: Making a Better Word Importer ...

Posted by bnwest at 2006-12-11 11:55
Manpreet's UETaskApril06-DLPK.doc fails to import.

The problem is that one of the table cells contains a list, which is not permitted by CNXML 0.5. The OOo XML has a list in a table cell, the XSL transform creates CNXML with a list in a table cell, and the generated CNXML fails to parse.

Thus, the XSL transform produces CNXML which cannot be parsed. The parse message is not useful to the end user. Even if it were, the user may very well expect us to translate the entire Word document minus the problem, which is clearly beyond the scope of our Word importer.

FWIW the problem list in the above document is not likely the desired structural effect the author wanted. My bet is that the list was a "user error" (they did not want a list here at all) which they did not correct since what was displayed was fine.
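
One way to give the user a more useful message than the raw parse error would be to check the generated CNXML for this specific construct before parsing it. The following is only a minimal sketch, not what the importer does: the function names are made up for illustration, the sample document is invented, and it assumes table cells are elements named 'entry' and lists are elements named 'list' (matched by local name, ignoring namespaces).

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def lists_in_table_cells(cnxml_text, cell_names=("entry",)):
    """Count <list> elements that sit anywhere inside a table cell.

    Assumption: cells are named 'entry' and lists are named 'list'.
    A list inside nested cells would be counted once per enclosing cell.
    """
    root = ET.fromstring(cnxml_text)
    count = 0
    for elem in root.iter():
        if local_name(elem.tag) in cell_names:
            count += sum(1 for d in elem.iter() if local_name(d.tag) == "list")
    return count


# Invented sample: one list inside a table cell, one list outside any table.
SAMPLE_CNXML = """<document xmlns="http://cnx.rice.edu/cnxml">
  <content>
    <table><tgroup cols="1"><tbody><row>
      <entry><list><item>one</item><item>two</item></list></entry>
    </row></tbody></tgroup></table>
    <list><item>a list outside any table is fine</item></list>
  </content>
</document>"""

if __name__ == "__main__":
    n = lists_in_table_cells(SAMPLE_CNXML)
    if n:
        print("Import problem: %d list(s) found inside table cells." % n)
```

A check like this only reports the problem; it does not decide whether the list should be dropped, flattened, or kept.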

Re: Re: Making a Better Word Importer ...

Posted by maxwell at 2006-12-12 15:05
The XSL solution would be to do something like this:

<xsl:template match="list[ancestor::tablecell]">
<xsl:comment>Bad author! Er, I mean, we regret to inform you that lists cannot be put inside tables.</xsl:comment>
<xsl:apply-templates />
</xsl:template>

and then

<xsl:template match="listitem[ancestor::tablecell]">
<xsl:apply-templates />
</xsl:template>

That way, they should just get a long string of text inside the table cell, which is better than nothing. Of course, this is going to look bad if, say, the entire Word document was put in some kind of table for formatting reasons. But I guess it's still better than choking.

Re: Re: Making a Better Word Importer ...

Posted by cbearden at 2006-12-18 09:29
In CNXML 0.6, the content model for CALS 'entry' includes 'para', which in turn may contain 'list'.

I think we want to permit lists in table cells--it's a legitimate construct. For an example of a reasonable use of (enumerated) lists in tables, see the Notes column of the table on <http://www.imsglobal.org/metadata/imsmdv1p2p1/imsmd_infov1p2p1.html>

Re: Making a Better Word Importer ...

Posted by bnwest at 2007-02-02 16:12
I am seeing OOo XML instances of

<text:p text:style-name="P3">
<text:span text:style-name="T1">The Last Word or Two</text:span>
</text:p>

which visually looks like a section. Our add-section logic looks for <text:h> and <text:section>, so the above is not sectioned. In the above example, style "P3" is defined as

<style:style style:name="P3" style:family="paragraph"
style:parent-style-name="Standard">
<style:properties fo:line-height="200%"
fo:text-align="center" style:justify-single-word="false" />
</style:style>

which contains enough hints to determine that this style is section-like. To take advantage of this, we would need to add two more passes to the OOo to CNXML conversion process: one pass which reads the defined styles looking for section candidates, and then another pass which replaces the section-like tags with <text:section>.

What makes the above look like a section candidate is how it gets styled, which in turn is triggered by the text:style-name attributes. Their values are defined at the beginning of the OOo XML document, i.e. they are document specific.
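
The first of those passes (reading the styles and finding section candidates) could look roughly like the sketch below. This is an illustration only: the heuristic (a centered paragraph style applied to a short paragraph), the max_words cutoff, and the function names are all assumptions, and the namespace URIs are the ones I believe OOo 1.0 files declare.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for OOo 1.0 content.xml files.
NS = {
    "office": "http://openoffice.org/2000/office",
    "style": "http://openoffice.org/2000/style",
    "text": "http://openoffice.org/2000/text",
    "fo": "http://www.w3.org/1999/XSL/Format",
}


def q(prefix, name):
    """Build an ElementTree '{uri}name' tag from a prefix and a local name."""
    return "{%s}%s" % (NS[prefix], name)


def section_like_styles(root):
    """Pass 1a: names of paragraph styles that are centered (hypothetical hint)."""
    names = set()
    for style in root.iter(q("style", "style")):
        if style.get(q("style", "family")) != "paragraph":
            continue
        props = style.find(q("style", "properties"))
        if props is not None and props.get(q("fo", "text-align")) == "center":
            names.add(style.get(q("style", "name")))
    return names


def section_candidates(root, max_words=8):
    """Pass 1b: text of short paragraphs that use a section-like style."""
    styles = section_like_styles(root)
    titles = []
    for para in root.iter(q("text", "p")):
        if para.get(q("text", "style-name")) not in styles:
            continue
        title = " ".join("".join(para.itertext()).split())
        if title and len(title.split()) <= max_words:
            titles.append(title)
    return titles


SAMPLE = """<office:document-content
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:style="http://openoffice.org/2000/style"
    xmlns:text="http://openoffice.org/2000/text"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <office:automatic-styles>
    <style:style style:name="P3" style:family="paragraph"
                 style:parent-style-name="Standard">
      <style:properties fo:line-height="200%"
                        fo:text-align="center" style:justify-single-word="false"/>
    </style:style>
  </office:automatic-styles>
  <office:body>
    <text:p text:style-name="P3">
      <text:span text:style-name="T1">The Last Word or Two</text:span>
    </text:p>
    <text:p text:style-name="Standard">Ordinary body text.</text:p>
  </office:body>
</office:document-content>"""

if __name__ == "__main__":
    print(section_candidates(ET.fromstring(SAMPLE)))
```

The second pass, which would wrap each candidate and the paragraphs that follow it in <text:section>, is left out here. A centered short paragraph is not always a heading, so a real version would likely need more hints than this.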

Re: Making a Better Word Importer ...

Posted by bnwest at 2007-02-02 16:16
OOo XML styles can also be used to implement subscripts and superscripts in Math. Since we ignore styles in the OOo to CNXML XSL transform, x squared is imported as 'x2'.

If x squared had been put in by the MS Word math editor (assuming that it was not) and MS Word had saved that math as MathML, we would have done the right thing in import.
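
For illustration, here is a rough sketch of how a style-aware pass could keep the superscript instead of flattening it to 'x2'. It rests on assumptions: that superscript text styles carry a style:text-position attribute whose value starts with "super", that the namespace URIs are the OOo 1.0 ones, and that a plain "^{...}" marker stands in for whatever markup the importer would really emit. The function names are made up.

```python
import xml.etree.ElementTree as ET

STYLE = "{http://openoffice.org/2000/style}"  # assumed OOo 1.0 namespace
TEXT = "{http://openoffice.org/2000/text}"    # assumed OOo 1.0 namespace


def superscript_styles(root):
    """Names of styles whose text-position starts with 'super' (assumption)."""
    names = set()
    for style in root.iter(STYLE + "style"):
        props = style.find(STYLE + "properties")
        pos = props.get(STYLE + "text-position", "") if props is not None else ""
        if pos.split()[:1] == ["super"]:
            names.add(style.get(STYLE + "name"))
    return names


def render_paragraph(para, supers):
    """Flatten a paragraph to text, marking superscript spans as ^{...}."""
    parts = [para.text or ""]
    for child in para:
        text = "".join(child.itertext())
        if child.tag == TEXT + "span" and child.get(TEXT + "style-name") in supers:
            parts.append("^{" + text + "}")
        else:
            parts.append(text)
        parts.append(child.tail or "")
    return "".join(parts)


SAMPLE = """<doc xmlns:style="http://openoffice.org/2000/style"
     xmlns:text="http://openoffice.org/2000/text">
  <style:style style:name="T2" style:family="text">
    <style:properties style:text-position="super 58%"/>
  </style:style>
  <text:p>x<text:span text:style-name="T2">2</text:span> + 1</text:p>
</doc>"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    para = root.find(TEXT + "p")
    print(render_paragraph(para, superscript_styles(root)))
```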

Re: Making a Better Word Importer ...

Posted by bnwest at 2007-02-02 16:23
Some math from the MS Math Editor gets saved to the <draw:object-ole> tag. The external object associated with this tag is not MathML but a binary file, which we do nothing with while importing.

We do the right thing when we encounter <draw:object> tags, since their external object is a MathML file.

We do not know how/why MS Word (or the user) gets one or the other. This is a nut that would have to be cracked on Windows (if possible), since dealing with an OLE binary object on a Linux platform is not tenable (assuming the open source world has not gone there).
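
Until that nut is cracked, the importer could at least report which embedded objects it will lose. The sketch below is only an illustration of that idea: the function name is made up, the sample is invented, and the drawing and xlink namespace URIs and the xlink:href attribute are what I believe OOo 1.0 files use.

```python
import xml.etree.ElementTree as ET

DRAW = "{http://openoffice.org/2000/drawing}"  # assumed OOo 1.0 namespace
XLINK = "{http://www.w3.org/1999/xlink}"


def classify_embedded_objects(root):
    """Split embedded objects into importable math and OLE binaries."""
    report = {"mathml": [], "ole": []}
    for elem in root.iter():
        if elem.tag == DRAW + "object":
            report["mathml"].append(elem.get(XLINK + "href"))
        elif elem.tag == DRAW + "object-ole":
            report["ole"].append(elem.get(XLINK + "href"))
    return report


SAMPLE = """<body xmlns:draw="http://openoffice.org/2000/drawing"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <draw:object xlink:href="#Obj1"/>
  <draw:object-ole xlink:href="#Obj2"/>
  <draw:object-ole xlink:href="#Obj3"/>
</body>"""

if __name__ == "__main__":
    report = classify_embedded_objects(ET.fromstring(SAMPLE))
    print("importable math objects:", report["mathml"])
    print("OLE objects that will be dropped:", report["ole"])
```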

Re: Re: Making a Better Word Importer ...

Posted by bnwest at 2007-02-12 11:57
The <draw:object-ole> tag appears to come only from the MathType 5.0 Equation Editor, while the <draw:object> tag appears to come from the Equation Editor 3.1. Both are Design Science products.

One can edit the <draw:object> math within OOo. An edit box allows a micro math language (i.e. not MathML) to be edited.

One cannot natively edit MathType 5.0 Equation Editor math within MS Word. Double clicking on this math causes an external app (MathType 5.0) to be launched. MathML can be obtained from MathType 5.0 by setting up the translator and performing a copy to the system Clipboard.

Note that OOo can display/read the <draw:object-ole> tag (MathType 5.0 Equation Editor output) but cannot edit it. Thus, the OLE object has display information which OOo can understand. Note that MathType 5.0 appears to be the only app that can edit the math.

For our purposes, we would like to convert the OLE object into MathML, on Linux via Python bindings, as part of our MS Word doc to OOo XML to CNXML conversion process on the live server.

Re: Making a Better Word Importer ...

Posted by bnwest at 2007-02-06 16:43
The new Word importer uses OOo version 2.0 to write out an OOo 1.0 file (a zipped archive with a content.xml at the root). The output file is not necessarily the same file we would get directly from OOo 1.0.

This has been seen to be a problem. There are cases where <text:h> nodes are no longer output, which gives the impression that we have a regression in the logic. We can handle some cases (by beefing up our OO 2 OO XSL xform) but not all.
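
A quick way to spot this kind of difference is to count the <text:h> nodes in the content.xml of each archive and compare. The sketch below is an assumption-laden illustration: count_headings and make_archive are made-up names, the archives are built in memory rather than written by OOo, and the text namespace URI is the one I believe OOo 1.0 files use.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

TEXT = "{http://openoffice.org/2000/text}"  # assumed OOo 1.0 namespace


def count_headings(archive):
    """Count <text:h> nodes in the content.xml at the root of a zipped OOo file.

    'archive' may be a path or a file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    return sum(1 for _ in root.iter(TEXT + "h"))


def make_archive(content_xml):
    """Build an in-memory stand-in for an OOo file (for this example only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", content_xml)
    buf.seek(0)
    return buf


WITH_HEADING = """<body xmlns:text="http://openoffice.org/2000/text">
  <text:h text:level="1">Introduction</text:h>
  <text:p>Some text.</text:p>
</body>"""

WITHOUT_HEADING = """<body xmlns:text="http://openoffice.org/2000/text">
  <text:p>Introduction</text:p>
  <text:p>Some text.</text:p>
</body>"""

if __name__ == "__main__":
    old = count_headings(make_archive(WITH_HEADING))
    new = count_headings(make_archive(WITHOUT_HEADING))
    if new < old:
        print("heading count dropped from %d to %d" % (old, new))
```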