Word-import transform improvements
Deferred to post-March 1
List of test modules
http://mountainbunker.org/~jenn/word-side-by-side
Testing Files
http://mountainbunker.org/~jenn/word-side-by-side/Harvested.xls
Vickie's notes: VickiesWordNotes.doc - VickiesWordNotes-2-23.doc - FixedIssuesVerification.doc
Installation Path Issues
Issues
Failure to import: Generated CNXML is invalid
- bnwest__Felder_Algebra__01._Functions [CB: sectioning step fails, leaving some text:heading unhandled as CDATA in 'content' element]
Felder_Algebra__05._Exponents [CB: sectioning step fails, leaving some text:heading unhandled as CDATA in 'content' element]
[Brian: OOo version problem rears its head again. updated add section logic to handle new-to-me OOo XML. FIXED.]
burtoma__Peterhouse_Cambridge [CB: Fixed in r16562]
Spacing issues
bnwest__lizzardg__Texas_Gov02 [KEF 2/23: This seems to have not been reimported -- the errors reported were gone upon reimport. This Word doc has CNXML Cite styles and they seem to be working well now.]
ncpea_Revised_Article_59_AH [KEF 2/23: Still has space problems -- continues throughout doc-- bold added by KEF. "Knight (2002) stated thatexport ofhigher education is a billion dollar industry, including recruitment of international students, establishment of university campuses abroad, franchised provision and online learning. The supply of transnational (or cross-border)education plays an increasinglyimportant role for the economic growth of a country(van der Wende, 2003)." Various missing spaces. The Word doc does have style transitions where the spaces are missing and they are not visible (the transitions) in the .doc file. Weird.]
[Brian: the following OOo XML, '<text:span> </text:span>', failed to import as a single blank. FIXED.]
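A minimal sketch of the whitespace case above, written in Python rather than the XSL the transform actually uses. The namespace URI, the function name flatten_paragraph, and the sample paragraph are assumptions for illustration only. The point is that a span whose only content is a single blank has to contribute that blank to the output instead of being normalized away.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the 'text' prefix; the real import may bind another.
TEXT_NS = "http://openoffice.org/2000/text"

def flatten_paragraph(p):
    """Concatenate the text of a paragraph, keeping whitespace-only spans."""
    parts = [p.text or ""]
    for child in p:
        # itertext() yields the child's text verbatim, including a lone blank.
        parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)

# Hypothetical paragraph modeled on the ncpea example: the only spaces
# between some words live inside otherwise empty spans.
SAMPLE = (
    '<text:p xmlns:text="%s">stated that'
    '<text:span> </text:span>export of'
    '<text:span> </text:span>higher education</text:p>'
) % TEXT_NS

print(flatten_paragraph(ET.fromstring(SAMPLE)))
```

If the blank spans were dropped, the same input would come out as "stated thatexport ofhigher education", which matches the symptom reported above.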
Bullets becoming numbers
[Brian: on cnx.org OOo 1.0 server writes out a 1.0 document with the node <text:unordered-list> for the list in question. on depot OOo 2.0 server writes out a 1.0 document with the node <text:ordered-list> instead. OOo appears to hide the bullet/number info in the text:style-name attribute. Chuck FIXED it.]
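A rough sketch of deciding bullet versus number from the list's style rather than from the element name. This is not the actual fix. It assumes (not confirmed by these notes) that the name in text:style-name points at a text:list-style definition whose level entries are named list-level-style-bullet or list-level-style-number. The namespace URIs and the helper names list_kinds and kind_of are also assumptions.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "http://openoffice.org/2000/text"    # assumed namespace URI
STYLE_NS = "http://openoffice.org/2000/style"  # assumed namespace URI

def list_kinds(root):
    """Map each list style name to 'bullet' or 'number' from its first level entry."""
    kinds = {}
    for style in root.iter("{%s}list-style" % TEXT_NS):
        name = style.get("{%s}name" % STYLE_NS)
        for level in style:
            local = level.tag.rsplit("}", 1)[-1]
            if local == "list-level-style-bullet":
                kinds[name] = "bullet"
                break
            if local == "list-level-style-number":
                kinds[name] = "number"
                break
    return kinds

def kind_of(list_elem, kinds, default):
    """Prefer the style's verdict; fall back to a caller-supplied default."""
    return kinds.get(list_elem.get("{%s}style-name" % TEXT_NS), default)

# Hypothetical fragment: the element says "ordered" but the style says bullets.
SAMPLE = """<doc xmlns:text="http://openoffice.org/2000/text"
     xmlns:style="http://openoffice.org/2000/style">
  <text:list-style style:name="L1">
    <text:list-level-style-bullet text:level="1"/>
  </text:list-style>
  <text:ordered-list text:style-name="L1">
    <text:list-item><text:p>first</text:p></text:list-item>
  </text:ordered-list>
</doc>"""

root = ET.fromstring(SAMPLE)
kinds = list_kinds(root)
ordered = root.find("{%s}ordered-list" % TEXT_NS)
print(kind_of(ordered, kinds, "number"))
```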
[KEF: 2/23 FIXED]
cpasich_Problems [KEF: image size display improved]
Chuck__scientific_process_psyc221_headings.doc [KEF: initial heading appears (was removed on cnx)]
xsands__intro.doc
Felder_Algebra__Teachers_Guide [KEF: Does have a 2 item numbered list coming out as two 1. lists. I think it is a CNXML limitation because the list items have paras in them? Chuck? Search for "The last problem on the homework"]
jpsf__MoodleFCTUNCL_Roadmap_v2 [KEF: This document has text numbered lists also. Additionally, they have inserted a non-commercial license in the midst]
Chuck__Range [KEF: One empty bullet in the module is from the original doc (thus not a bug)]
mwise__appendcharge [KEF: FIXED]
lstewart_betty_joe_monk_draft [KEF: 2/23 FIXED -- Does have one doubled bullet -- the very first list -- we can live with it.]
mhusband_texasgovWHeading3 [KEF: 2/23 NOTBROKEN -- Weird numbering caused by Roman Numeral numbering in the original interspersed with Roman Numeral text -- author couldn't make Word connect the lists]
kef_dw_tutorial [KEF: FIXED]
[KEF: 2/23 NOT FIXED -- PROBABLY CNXML limitations]
yangmm_Authoring_Interface_of_Lulu [KEF: 1st a, b, c list comes out bulleted, numerous others come out as single 1. lists. Those all have images in list items. Chuck -- CNXML limitation? Why, though, does the first set become bulleted, and the subsequent ones numbered?]
Felder_C03_SimultaneousConcepts [Lots and lots of single item 1. lists because the lists within the documents have paras. Chuck -- this is a CNXML limitation, right?]
Word Table of Contents (post Mar1)
yangmm_Authoring_Interface_of_Lulu [KEF: 2/23 -- Word's Table of Contents imports as links into the module -- did this work before? Actually the links don't work http://depot.cnx.rice.edu:8080/GroupWorkspaces/wg641/bad/module.2007-02-14.0986947235/module_view#3.How%20to%20Publish%20a%20Paperback%20Book|outline]
Image Size Problem
yangmm_Authoring_Interface_of_Lulu [KEF: 2/23 -- Images in Word file are nicely scaled. Images as resized in CNXML are not readable and look squished.]
[Brian: ljubisa_CAGD1.doc is the counterpoint. Images here are larger than screen width, which forces half of the text and images of the module off the screen. If we don't resize, the user will have to resize by hand. If we resize using the Word image's approximate bounding box, the user may also have to resize to get the desired effect. Thus, in both cases resize is necessary. I do not know which case would be more probable and which case would cause the most angst with the user.]
[Brian: Removing the height and width params from the image is a simple thing after the import via EiP. In this case, the image displayed in perfect clarity but does not fit the screen. Either way we handle it, the user will want to resize.]
[Brian: Image in OOo has been cropped, most likely via its "styling". The cropping takes out the browser frame (image is a png of the Firefox browser displaying a page) and just displays the browser page. We thus on resizing try to display more than OOo did in the same real estate.]
[Brian: Resizing the browser page portion to match the size as seen in OOo yields non optimal results. I posit that the display engine (or the resizing/cropping engine) is better in OOo than in Firefox.]
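One possible middle ground between the two cases above, sketched in Python: leave images that already fit alone, and scale only the oversized ones down to a maximum width while keeping the aspect ratio. The function name fit_to_width and the 600 pixel limit are assumptions for illustration, not what the importer does, and this does nothing about the cropping issue.

```python
def fit_to_width(width, height, max_width=600):
    """Scale (width, height) down to max_width, preserving the aspect ratio.

    max_width=600 is a hypothetical page width, not a value from the importer.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if width <= max_width:
        return width, height  # already fits; leave the author's size alone
    scale = max_width / width
    return max_width, max(1, round(height * scale))

print(fit_to_width(1200, 900))  # oversized image is halved
print(fit_to_width(400, 300))   # image that fits is returned unchanged
```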
List structure
Numbered lists show odd nesting behavior, and often have more levels than they should.
Felder_Algebra__09._Imaginary.doc
[Brian: Word allows all kinds of nonsense with lists that have absolutely no chance of translating into CNXML. NOT FIXED.]
ncpea__Final_Edit_55_4
[Brian: odd nesting also appears in the Word doc. nothing to fix.]
jsack_BlockIt_game_setup_and_rules
Superior to before -- numbered list comes out as 1 2 instead of 1 1.
peterpetrake2000__Protocol_ATC_.doc
Bulleted list coming through as a series of one-item lists.
[Brian: Bullet lists are actually a series of one item bulleted lists. nothing to fix.]
Figures
yangmm__Benefit_change_by_sample_size.doc
Extraneous empty figure at the bottom of the module
[Brian: last figure is a .wmf image, which we cannot import. we purposely left the empty figure as a placemarker. nothing to fix.]
yw4630__5experimental_data_for_FFT_matrix_approach
sstarks_Fig1
Nice that you now get empty placeholder figures for drawing objects that
don't come through.
Equations
Generally pretty darned good, but some docs have things that seem to be
Equation Editor objects and yet don't convert. I'd be surprised if all of
them were using MathType.
[Brian:
not having looked but a possible explanation is... once you have installed
MathType 5.0, all equations will be edited via its interface and thus
become MathType 5.0 equations.]
[KEF: This is confirmed by Elizabeth and we will attach more documentation about how the Math is behaving]
[Brian: we are now importing clear text when possible, else a comment. FIXED.]
rlortman__RobertGauss.doc
Felder_Algebra__09._Imaginary.doc
objects missing from the last five list items above the first section
break, and many more later in the doc
[Brian: the pi symbol aint importing.]
[Brian: notation x to a power aint importing; superscripts are styled.]
Footnotes
Footnote items in the module text have spaces before them instead of after
them, so they look like they belong with the following text instead of the
preceding text. Also, the footnotes themselves have extra index numbers.
les561__Business_Valuation.dot
mwise__1cultural.doc
Elizabeth__quick2.doc
tyang__stochastic
mwise__2cultureNEW
[Brian: preceding space is not a regression. the space shows up due to us running tidy as our last step in our import. the trade off is properly spaced footnotes (and all other inline tags) versus having a one line editable string for our users to do a full source edit. Chuck and I will investigate other tidy options available.]
[Brian: extra index numbers. FIXED]
Line breaks
Single-line paragraphs separated by only a single line break get crammed
all into one paragraph, in both versions of the transform. That may be
deliberate. It's nice that at least now the items are separated by spaces;
let's not lose that with the spacing fixes.
yangmm__Probability.doc.doc
mhusband__Evidence_for_Continental_Drift.doc
felder_Algebra__C03._Simultaneous concepts.
[Brian: failing to properly deal with <text:line-break>, which has the effect of stopping then restarting a paragraph. I speculate that authors add line breaks to get the display as they like. We could force a para break, but I am very unsure if that is the right thing to do.]
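For discussion only, a sketch of what the "force a para break" option could look like: split one paragraph into separate segments at each line break. It is written in Python, not in the transform's XSL. The namespace URI and the function name split_on_line_breaks are assumptions, and the sketch only handles line breaks that are direct children of the paragraph.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "http://openoffice.org/2000/text"  # assumed namespace URI
LINE_BREAK = "{%s}line-break" % TEXT_NS

def split_on_line_breaks(p):
    """Return the paragraph's text as one segment per top-level line break."""
    segments = [p.text or ""]
    for child in p:
        if child.tag == LINE_BREAK:
            # A break ends the current segment and starts a new one.
            segments.append(child.tail or "")
        else:
            # Any other inline element stays inside the current segment.
            segments[-1] += "".join(child.itertext()) + (child.tail or "")
    return [s.strip() for s in segments if s.strip()]

# Hypothetical paragraph with two line breaks and one inline span.
SAMPLE = (
    '<text:p xmlns:text="%s">First line<text:line-break/>'
    'Second <text:span>bold</text:span> line<text:line-break/>'
    'Third line</text:p>'
) % TEXT_NS

for segment in split_on_line_breaks(ET.fromstring(SAMPLE)):
    print(segment)
```

Each returned segment could become its own para. Whether that matches what authors intend is still the open question.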
Confirmed Fixed
Failed imports
Several files so far now give "Generated CNXML is invalid", whereas before the XSL fix they imported on depot with truncation.
Jennifer__paper029.doc
Maxwell__cnxWhitePaper11mar04.doc
sstarks_Low_Pass_Filter
[Brian:
root problem is that depot is using a later version of OOo 2.0 than I
used in development. different xml is generated and is not handled
properly. both files will now import. FIXED]
Truncation
Four examples of things that still truncate, in each case right before a table that includes math
Felder_Algebra__C11._Data_Concepts.doc
Felder_Algebra__C07._Rational_Concepts.doc
jpsf__MoodleFCTUNCL_Roadmap_v2
Felder_Algebra__C04._Quadratic_Concepts
[Brian: oo2cnxml xsl xform has syntax errors. the problem xsl code is not new. guessing a newer version of libxml2 is pickier than before. FIXED.]
Missing Stuff (added Feb21)
ncpea_AngelWings -- title and following paragraph
ncpea_angeltwowings -- same.
bnwest_ncpea_angeltwowings
[Brian:
missing were both empty headers. in Word this is a Header 1 that was
followed directly by another Header 1. The second one was a complete
paragraph, likely not the intention of the user. We could import empty
sections, if we added an empty paragraph. This would look good in
preview; not so much in EiP. FIXED.]
[Brian: more dropped empty sections.]
Spacing issues
Almost every module has an example of this, but particularly meaningful ones appear in:
ajarocho__PreTest_IM.doc
bolded initial letters in words come out as "E thical" or "N o"
bnwest__Felder_Algebra__C09._Imaginary_Concepts.doc
"i+5i" becomes "i +5 i". Space around an operator is okay, but it's asymmetrical, and the space between 5 and i is just bad. "...one false premise , such as..." "illegal operationsin math"
Chuck__Capretteworkshop_handout.doc
"T = Theory ( scientificdefinition)"
rlortman__RobertGauss.doc
s-sub-m comes out as "s m" in the new one, as opposed to "sm" in the old one. Neither is great, but the new one is worse.
Figures
felder_Algebra__C03._Simultaneous concepts
The new plan of making every image a figure has odd consequences for essentially inline images, like the checkboxes and x-marks in Felder's docs.
[KEF: Is this the right decision? Is it possible to recognize the inline and treat it differently?]
[Brian: Not a regression, fwiw. we can determine if an image is inlined => don't wrap those with <figure>. FIXED]
ljubisa__CAGD1.doc
[Brian: figures were importing
impossibly big which made the module hard to read. we need to import
the height and width for each figure. FIXED.]
mwise__2cultureNEW
les561__Business_Valuation.dot
see p. 24 of the Word doc for an odd list that turns even odder when transformed
[Brian:
I see nested lists, bulleted inside of a numbered list. The container
numbered list is not even a list, just plain text. The interior bullet
list is being xformed into a numbered list, which is a problem.]
[Brian:
I also see an instance of some funky additional nesting which
replicates the internal list structure as seen in OOo XML but not the
displayed Word doc. funkiness == a single list having one entry which
is a multi-entry list. The container list here is just for display
purposes in Word and has been removed. FIXED.]
[Brian: The answer list contains bolded True or False answers. boldness is not importing. ]
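The "funkiness" described above (a single list having one entry which is itself a multi-entry list) can be illustrated with a small Python sketch. It uses simplified, un-namespaced list and item elements rather than real OOo XML, and the function name unwrap_display_list is hypothetical. It is an illustration of the idea, not the code that went into the transform.

```python
import xml.etree.ElementTree as ET

def unwrap_display_list(lst):
    """Return the innermost list when lst merely wraps a single nested list."""
    current = lst
    while len(current) == 1 and current[0].tag == "item":
        item = current[0]
        has_text = bool((item.text or "").strip())
        # Stop unless the only entry holds nothing but one nested list.
        if has_text or len(item) != 1 or item[0].tag != "list":
            break
        if (item[0].tail or "").strip():
            break
        current = item[0]
    return current

# Hypothetical input: a display-only container around the real three-item list.
WRAPPED = ET.fromstring(
    "<list><item><list>"
    "<item>a</item><item>b</item><item>c</item>"
    "</list></item></list>"
)
# A genuine one-item list must be left alone.
GENUINE = ET.fromstring("<list><item>only</item></list>")

print([i.text for i in unwrap_display_list(WRAPPED)])
print([i.text for i in unwrap_display_list(GENUINE)])
```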
Missing half of section title
Maxwell__cnxWhitePaper11mar04.doc
"Inefficient methods of knowledge development and transmission -- An
educator’s perspective" loses everything before the dash.
[Brian:
Chuck says USER ERROR :) Sections that went away were empty. CNXML 0.5
does not allow empty sections. We could put an empty <para> in
the empty sections, which would look BAD in EiP and GOOD in preview.
FIXED.]
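The empty-section option discussed above (and under Missing Stuff) amounts to padding each empty section with an empty para. A minimal Python sketch of that idea follows, using simplified CNXML without namespaces. The function name pad_empty_sections and the sample markup are assumptions, and the id attribute that a real para would need is left out.

```python
import xml.etree.ElementTree as ET

def pad_empty_sections(root):
    """Append an empty <para> to every <section> that has only a <name>.

    Returns the number of sections that were padded.
    """
    padded = 0
    # Materialize the list first so the tree is not changed while walking it.
    for section in list(root.iter("section")):
        body = [child for child in section if child.tag != "name"]
        if not body:
            ET.SubElement(section, "para")  # real CNXML would also need an id
            padded += 1
    return padded

# Hypothetical content: one heading-only section, one with body text.
SAMPLE = (
    '<content>'
    '<section id="s1"><name>Heading one</name></section>'
    '<section id="s2"><name>Heading two</name>'
    '<para id="p1">Body text.</para></section>'
    '</content>'
)

root = ET.fromstring(SAMPLE)
print(pad_empty_sections(root))
print(ET.tostring(root, encoding="unicode"))
```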
