Rhaptos Software Development

Importing .doc files

Submitted by easgarov. on 2007-06-19 17:46. Documentation
My experiences playing around with the Word Importer

I first imported a Word file that had some tables and text, but no math. The tables imported perfectly. There was a problem with titles, though: in the Word file they were marked only with a bigger font and bold typeface, which the Word Importer apparently did not recognize, so the titles were displayed as plain text.

Importing another document that had titles was successful, because the author had used the Heading 1 (H1) and Heading 2 (H2) styles. But there is a problem with how these titles are closed. The document was structured as follows: first an H1, then an H2 under it, and references at the end. The references were treated as part of the H2. The Word Importer also treated the H2 as part of the H1, which may not be what that Word document intended, so authors need to take care of the title hierarchy.
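
The behavior described above is what a plain nest-by-heading-level scheme would produce. Here is a minimal sketch, assuming the importer nests purely by heading level. It is an illustration only, not the importer's actual code, and `Section` and `nest_by_heading` are hypothetical names:

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    level: int
    title: str
    paragraphs: list = field(default_factory=list)
    children: list = field(default_factory=list)


def heading_level(style):
    """Return 1 for "H1", 2 for "H2", ...; return None for any other style."""
    if style.startswith("H") and style[1:].isdigit() and int(style[1:]) >= 1:
        return int(style[1:])
    return None


def nest_by_heading(items):
    """Nest a flat list of (style, text) pairs by heading level (assumed scheme)."""
    root = Section(level=0, title="(document)")
    stack = [root]
    for style, text in items:
        level = heading_level(style)
        if level is None:
            # Body text always lands in the innermost section that is still open.
            stack[-1].paragraphs.append(text)
            continue
        # A new heading closes every open section at the same or a deeper level.
        while stack[-1].level >= level:
            stack.pop()
        section = Section(level=level, title=text)
        stack[-1].children.append(section)
        stack.append(section)
    return root


doc = [
    ("H1", "Introduction"),
    ("H2", "Background"),
    ("Normal", "Some body text."),
    ("Normal", "[1] A reference entry."),
]
tree = nest_by_heading(doc)
h2 = tree.children[0].children[0]
print(h2.title, h2.paragraphs)
```

Under this scheme nothing closes the H2 before the references, so they end up inside it. If that is really how the importer works, giving the references their own H1 heading in the Word file should keep them out of the H2.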

I imported a two-column Word document, and it seems the Word Importer can handle it, but again there were problems with the title hierarchy. Some pictures were also missing: their titles were there, but the pictures themselves were not. The pictures that did get imported were in .png format; pictures in .wmf format were not imported.
Another issue is that a picture that was embedded in the text of the Word document was centered after import.

A table that was bigger than one page in the Word document was imported successfully.

Links were successfully imported.

MathType 5.0 equations were not imported. The following message appeared in their place:
***SORRY, THIS MEDIA TYPE IS NOT SUPPORTED.***
There were also problems with subscripts and superscripts: they were not imported, and the text near them was imported with errors. Also, there was no line wrapping after import, so some documents ended up with a lot of really long lines.
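
The missing line wraps can be spotted quickly in the imported source. A minimal sketch, assuming the imported markup is available as a plain string; `long_lines` and the 200-character limit are hypothetical choices, not part of the importer:

```python
def long_lines(text, limit=200):
    """Return (line_number, length) for every line longer than limit."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


# Made-up sample: one short line and one very long paragraph line.
sample = "<para>short</para>\n<para>" + "word " * 100 + "</para>"
for number, length in long_lines(sample):
    print(f"line {number} is {length} characters long")
```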

A list that started on one page and continued on the next was imported successfully. In the same document there were problems with importing characters written in the 'Symbol' font.

"Continued lists", as they were discussed, are not supported. Fancy bullet signs and numbering styles in numbered lists become uniform because of the cnxml format. Nested lists are imported successfully.

It looks like quotes are not handled. And "Continued lists" are actually used in one of the papers.

Embedded Microsoft Equation formulas seem to work. Embedded PBrush pictures were not imported, though.

NOTE ON REIMPORTING: If you have imported one Word file into a module and then import another one, don't forget to delete all the files in the module first, or discard the module. Not doing this can cause problems, for example with the names of pictures; I ran into this myself.

Text in the 'Courier New' font failed to import.
One part of a Word document had the following content:
4x+6y=16
4x+6y=16
0=0
This was not imported properly because the Importer is unable to display underlined text. I guess authors should replace such things with MathML.

Some of the MathML contained the following:
<m:annotation encoding="StarMath 5.0"> size 12{ { {5} over {2 cdot 2 cdot 3} } + { {7} over {2 cdot 3 cdot 5} } } {}</m:annotation>
And the following MathML was not displayed:

<m:math>
    <m:semantics>
        <m:mrow>
            <m:mstyle fontsize="12pt">
                  <m:mrow/>
            </m:mstyle>
            <m:mrow/>
        </m:mrow>
        <m:annotation encoding="StarMath 5.0"> size 12{ {4} wideslash {3} } {}</m:annotation>
    </m:semantics>
</m:math>

The following MathML was displayed as an error message saying "Invalid Markup":
<m:math>
    <m:semantics>
        <m:mrow>
            <m:mstyle fontsize="12pt">
                <m:mrow>
                    <m:mfrac/>
                </m:mrow>
            </m:mstyle>
            <m:mrow/>
        </m:mrow>
        <m:annotation encoding="StarMath 5.0"> size 12{ {  { {x rSup { size 8{2} }  - 3x} wideslash {2x rSup { size 8{2} }  - "13"x+6} }  over  { {x rSup { size 8{3} } +4x} wideslash {x rSup { size 8{2} }  - "12"x+"36"} } } } {}</m:annotation>
    </m:semantics>
</m:math>

<m:annotation> is the MathML tag for including non-XML annotations in MathML. What is inside this tag in our case is in StarMath encoding, which, as far as I understand, is a math encoding used by OpenOffice. I suppose that this document was first written in Word and then modified in OpenOffice, which is something authors should not do.
Playing with OpenOffice, I saw that it stores math in MathML and then adds an annotation in StarMath encoding that encodes the same expression, so it effectively stores the same thing twice, once as MathML and once as StarMath.
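
Both failing examples above share one trait: the presentation part of the MathML has no actual content (only empty `m:mrow` or `m:mfrac` elements), and the formula survives only in the StarMath annotation. A minimal sketch for finding such cases in imported markup, using Python's standard `xml.etree.ElementTree`. The function names and the wrapping `content` element are hypothetical, and the `m:` prefix is assumed to be bound to the standard MathML namespace:

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
ANNOTATION = f"{{{MATHML_NS}}}annotation"
# MathML token elements, i.e. the elements that carry visible content.
TOKENS = {f"{{{MATHML_NS}}}{name}" for name in ("mi", "mn", "mo", "mtext", "ms")}


def has_token(element):
    """True if element or a descendant outside m:annotation is a token element."""
    if element.tag == ANNOTATION:
        return False
    if element.tag in TOKENS:
        return True
    return any(has_token(child) for child in element)


def find_empty_math(xml_text):
    """Yield the annotation text of every m:math whose presentation is empty."""
    root = ET.fromstring(xml_text)
    for math in root.iter(f"{{{MATHML_NS}}}math"):
        if not has_token(math):
            note = math.find(f".//{ANNOTATION}")
            yield (note.text or "").strip() if note is not None else ""


sample = """<content xmlns:m="http://www.w3.org/1998/Math/MathML">
  <m:math><m:semantics>
    <m:mrow><m:mstyle fontsize="12pt"><m:mrow/></m:mstyle><m:mrow/></m:mrow>
    <m:annotation encoding="StarMath 5.0"> size 12{ {4} wideslash {3} } {}</m:annotation>
  </m:semantics></m:math>
  <m:math><m:mn>4</m:mn></m:math>
</content>"""

for annotation in find_empty_math(sample):
    print("empty presentation, StarMath source:", annotation)
```

Only the first formula in the sample is reported; the second has a real `m:mn` token and is left alone. The sketch only detects the problem; the formula still has to be re-entered as proper MathML.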

If you upload a Word document that was edited with change tracking turned on (the mode where Word gives you a menu to accept or reject each change; I believe it is called Track Changes), it is treated as if all the changes had been accepted, which means that the latest version of the document is imported.

Embedded Excel tables failed to import.
WordArt objects are not imported as pictures; only the text inside them is imported, and it comes in as plain text.

Embedded images are not always imported very well. They are sometimes imported as figures.
