Importing math into CNX author story
Note that I have uploaded an Excel spreadsheet to this blog entry. See the contents tab. The spreadsheet contains a list of mathematical and MathML products and a comparison of their features.
Word and LaTeX Input
From Design Science's website (http://www.dessci.com/en/reference/white_papers/mathml_workflows.htm),
70% of manuscript submissions to STM publishers were in Microsoft Word's document format.
The implication is that the remaining 30% is some variant of TeX.
Our current MS Word author story:
For math created with the Equation Editor package in any MS Word release prior to Office 2007, our Word importer is able to import the math as Presentation-MathML.
Our current LaTeX author story:
We do not have one. We do support an undocumented import path from Scientific Workplace/Word into XHTML+Presentation-MathML into CNX. SW is a WYSIWYG math document editor that uses LaTeX as its input and output format and that is available at many institutions via a site license (i.e. it is likely available and affordable to many of our authors).
Different Requirements for New Versus Existing Content
Another important dichotomy for authors is between those who wish to import existing content into CNX and those who are entering new content.
I am assuming that authors with existing content are looking for an easy import path, will consider loading an 80 page document into a single CNX module a quality idea, and will be nonplussed at having to make changes in both the original document and the CNX module.
New content authors will likely be more flexible about their import path and about structuring their content to fit the CNX module paradigm. We plausibly could suggest that they use MS Word and EE-3.1 for their math equations (or that they could use Scientific Workplace).
Need to Develop and Document Author Workflow Best Practices
As I mentioned above, creating a module from an 80 page Word document may not be the best way to organize content in CNX, from either the author's or the reader's perspective. We need to establish Best Practices for the author workflow path and focus our energy on providing a quality experience to those authors who stay on that path.
Concentrating on creating modules around single learning objectives might force authors into a different approach to their content development. A byproduct might be that authors with existing content choose to rework that content to fit CNX better, rather than bulk uploading it.
Recent comment from Sidney Burrus
CNX needs to offer a "win" for authors putting their content into CNX.
Recent comment from Don Johnson
Don only wants to maintain his content in one place, versus keeping his content current in both LaTeX and CNXML. (Due to the current and likely future state of our inexact importing into CNXML, Don and authors with similar reasoning have only one choice: maintain the CNXML. This may not be seen as a "win".)
MathML Workflows in STM Publishing
http://www.dessci.com/en/reference/white_papers/mathml_workflows.htm
"One of the chief characteristics of STM (Scientific, Technical, Medical) content is the presence of mathematical equations. ... Now that many publishers are moving to a production system built around a central XML-based repository, describing mathematics using MathML, the XML-based language for mathematics, is a better choice. And, support for MathML is virtually an absolute requirement for targeting the new HTML+MathML delivery medium for publishing web content containing math. "
"MathType 5 has the ability to convert equations to MathML. This can be done one equation at a time, but usually an entire Word document is converted to HTML+MathML using MathType's MathPage feature. A web page produced in this manner can be published directly to the web but in an XML-based production system, a transformation could be done on the page to get it into the repository. MathType's conversion capabilities are also available through a scripting API, which can be used to automate conversion of equations to MathML in a workflow setting."
"MathType is not a true MathML editor as it contains features that have no analog in MathML. MathType produces MathML as a translation process and, like most translations, the results are not 100% equivalent to the original. MathType is also only capable of producing Presentation MathML, rather than Content MathML."
"The MathFlow and WebEQ Editors, on the other hand, are MathML editors from the start — they represent the equation internally in MathML and read and save in MathML."
"As TeX must always be converted to MathML, there's not much to say about the authoring process for TeX except perhaps that such authors should consider MacKichan's Scientific Word and Workplace as these are TeX-oriented authoring systems that can save as MathML."
"The mathematics representation of both TeX and LaTeX is equivalent to Presentation MathML. Except for trivial mathematics, conversion to Content MathML is problematic."
"Computer algebra systems, like Waterloo Maple's Maple [14] and Wolfram Research's Mathematica [15], are built around a "notebook" interface and contain conversion to and from MathML. This should make it possible for a scientist or engineer to do their basic research and documentation within the interactive environment of these products. They can then submit such a notebook document to the publisher for conversion and integration into the XML repository. In fact, Mathematica version 4.2's notebook files are XML-based so, in theory, an XML-to-XML transformation is all that would be needed to incorporate a notebook into a publisher's XML repository."
"TeX4ht is a free software package for converting TeX and LaTeX documents into various HTML and XML formats. It is highly configurable and supports conversion of equations into MathML."
"If the output medium for the publication is HTML+MathML, the conversion is fairly simple. XSLT can be used to transform the repository XML items into HTML elements. "
"If the output medium is PDF or PostScript for print, the process is a bit more involved. One route is with XyEnterprise's XPP product [21] which combines a content management system with PDF and print output processing and MathML support."
"Another possible route to print or PDF is by converting the repository XML to XSL-FO, using XSLT or other text-to-text conversion tools. XSL-FO (FO = Formatting Objects) [22] is an XML-based language for describing the layout of paginated documents. ... Unfortunately, XSL-FO does not support mathematical notation. However, Design Science is working with vendors of XSL-FO formatters to combine our MathPlayer MathML formatting technology with their products to create a comprehensive output solution."
Great resource for finding MathML software
http://www.w3.org/Math/Software/mathml_software.html
Categories include:
* Browsers
* Browser plugins, scripts and extensions
* Editors
* Scientific Computation
* Composition and Rendering Engines
* Converters
* Authoring Systems
* Stylesheets to/from MathML
* DTDs and Schemas using MathML
* Components and SDKs
* Research Projects
* Accessibility
W3C Recommendation on How to Include MathML in Web Pages
http://www.w3.org/Math/XSL/
Including MathML in Web pages
In order to maximize the number of platforms it will be viewable on, a document should be written using the rules below.
1. Create the page using XHTML with inlined MathML.
2. Add a stylesheet processing instruction.
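To make the two rules above concrete, here is a minimal Python sketch that writes an XHTML page with inlined MathML and a stylesheet processing instruction. The stylesheet URL is my assumption of what the W3C page points to and should be checked there; the function name `build_page` and the sample equation are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

XHTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"
# Assumed stylesheet location; verify against the W3C page before relying on it.
STYLESHEET_HREF = "http://www.w3.org/Math/XSL/mathml.xsl"


def build_page(title, mathml_fragment):
    """Return an XHTML document string with inlined MathML and a stylesheet PI."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        # Rule 2: the stylesheet processing instruction.
        f'<?xml-stylesheet type="text/xsl" href="{STYLESHEET_HREF}"?>\n'
        # Rule 1: XHTML with the MathML inlined in the body.
        f'<html xmlns="{XHTML_NS}">\n'
        f'  <head><title>{escape(title)}</title></head>\n'
        '  <body>\n'
        f'    <p>{mathml_fragment}</p>\n'
        '  </body>\n'
        '</html>\n'
    )


# Hypothetical sample equation: x squared, as Presentation MathML.
SAMPLE_MATH = (
    f'<math xmlns="{MATHML_NS}">'
    '<msup><mi>x</mi><mn>2</mn></msup>'
    '</math>'
)

page = build_page("Sample", SAMPLE_MATH)
# Parsing the result confirms that the page is well-formed XML.
root = ET.fromstring(page.encode("utf-8"))
```

The MathML fragment is passed through unescaped on purpose, since it is markup; the caller is responsible for it being well formed.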
--------
Some now believe that XHTML+MathML is THE math web publishing standard. Most if not all of the STM publishing client side software (Scientific WorkPlace, Mathematica + Publicon, Maple, MS Word + MathType, WebEQ, SciWriter, and XmetaL + MathFlow) has an option to publish as XHTML+MathML.
Re: W3C Recommendation on How to Include MathML in Web Pages
Design Science MathType 5.0's MathPage
The MathPage exported XHTML does look very nice in the browser (a claim we cannot always make).
Unfortunately, the MathPage exported XHTML does not look as good in a simple text editor. Some structural information, such as list structure, is missing from the XHTML. That structural information is essential for converting the XHTML into CNXML. I saw cases where the CNXML importer did a better job with the structure than MathPage.
MathPage did transform all of the image files to file types that can be displayed on the web.
Formulator MathML Weaver 3.7
Without reading the fine manual, I was able to create Presentation and Content MathML. Formulator provided an expression view, a MathML tree view, a MathML markup view and finally an HTML view. The MathML can be edited via the first three views.
After creating several equations without the manual, I did read the manual, a PDF that I downloaded from the Formulator site. I found the documentation both good and thorough.
How would Formulator integrate into the CNXML author workflow?
1. MS Word authors could swap out the Presentation-MathML for Content-MathML created in Formulator, as one of their many copy edit steps after importing.
2. New authors could create Content-MathML in Formulator. If the author's starting point is MS Word, the MathML equations would be developed in separate documents and workflow steps. If the author is writing the CNXML directly, again the MathML would be created and maintained separately.
Separate MathML editing is obviously not the preferred "one stop" solution from the author's perspective.
SciWriter 2.0
Text styles include Paragraph, Heading{1,2,3,4,5,6}, List, Ordered List, Title, Abstract, and Footnote. These styles appear to be mostly a subset of CNXML which implies that the xform (at least at the conceptual level) from SciWriter XHTML to CNXML should be doable.
The XHTML generated from SciWriter is minimalistic and straightforward. Translating to CNXML appears to be doable.
SciWriter also has some styles associated with the math called Theorem Environments which include Acknowledgement, Algorithm, Axiom, Case, Claim, Comment, Conclusion, Condition, Conjecture, Corollary, Criterion, Definition, Example, Explanation, Exercise, Lemma, Notation, Problem, Proof, Proposition, Remark, Solution, Summary, and Theorem. Not all of these have counterparts in CNXML. Translating into CNXML will likely remove some of the semantic meaning.
IMO SciWriter may be a good fit for our project, for authors who are looking to add content with math from scratch. Given the minimal set of text styles, new non-math authors might also find SciWriter appealing.
Re: SciWriter 2.0
Note for all XHTML import strategies ...
In our existing Scientific Workplace/Word import pathway, the image files have to be added by hand after the XHTML has been imported.
A comprehensive, one step solution would be to bundle the XHTML and image files in a zip file and import the zip file.
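A rough Python sketch of the author-side half of that idea: bundle an exported XHTML file and the local images it references into one zip. The function name `bundle_xhtml` is hypothetical, the regular expression is a shortcut rather than a real XHTML parser, and nothing here reflects how a CNX importer would actually consume the archive.

```python
import re
import zipfile
from pathlib import Path

# Matches src="..." on <img> tags; a shortcut, not a full XHTML parser.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc="([^"]+)"', re.IGNORECASE)


def bundle_xhtml(xhtml_path, zip_path):
    """Write the XHTML file and every local image it references into one zip.

    Returns the list of archive member names that were written.
    """
    xhtml_path = Path(xhtml_path)
    base_dir = xhtml_path.parent
    text = xhtml_path.read_text(encoding="utf-8")
    bundled = [xhtml_path.name]
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(xhtml_path, arcname=xhtml_path.name)
        for src in sorted(set(IMG_SRC.findall(text))):
            if "://" in src or Path(src).is_absolute():
                continue  # only bundle images stored relative to the page
            image = base_dir / src
            if image.is_file():
                # Keep the relative path so the src attributes still resolve.
                archive.write(image, arcname=src)
                bundled.append(src)
    return bundled
```

Keeping the relative paths inside the archive means the XHTML does not need to be rewritten before or after import.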
We may need a separate XHTML importer per exporter. For example, SciWriter exports a figure as:
<div class="s4s-table-center">
<table class="s4s-figure">
<tbody>
<tr>
<td align="center">
<img width="50%" alt="WallaceAndGrommet" src="brian_files/WallaceAndGrommet.jpg" />
</td>
</tr>
<tr>
<td class="s4s-figure-numbered" align="center">
<span class="s4s-figure-number">Figure 1.2: </span>[Wallace and Gromit]</td>
</tr>
</tbody>
</table>
</div>
To get the desired presentation in XHTML, a table was used with one row for the image and one for the caption. An XSL xform could key off of the XHTML node class names to differentiate this table from a real table. The XHTML node class names are specific to the exporter, i.e. SciWriter.
I think it is a safe generalization that each of the different XHTML exporters has its own pattern of XHTML usage. Each XHTML exporter would then likely need its own CNX importer.
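The class-name idea can be sketched in Python instead of XSL, purely as an illustration. The sketch assumes the fragment has no XML namespace, as in the snippet shown above, and that SciWriter always uses the `s4s-figure` and `s4s-figure-numbered` class names for figures, which I have only seen in this one export. The function name `extract_figures` is made up.

```python
import xml.etree.ElementTree as ET


def extract_figures(xhtml_fragment):
    """Return (src, caption) pairs for tables that the exporter uses as figures."""
    root = ET.fromstring(xhtml_fragment)
    figures = []
    for table in root.iter("table"):
        if table.get("class") != "s4s-figure":
            continue  # a real table; leave it to the normal table handling
        img = table.find(".//img")
        caption_cell = table.find(".//td[@class='s4s-figure-numbered']")
        src = img.get("src") if img is not None else None
        caption = ""
        if caption_cell is not None:
            # Join the figure number span and the caption text that follows it.
            caption = "".join(caption_cell.itertext()).strip()
        figures.append((src, caption))
    return figures
```

An importer for a different exporter would need its own version of the class-name test, which is the per-exporter cost described above.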
Edit MathML with free XML editor, XMLmind XML Editor (XXE)
I also took a CNXML module, both with and without MathML, and edited it in XXE. I was able to edit both via the tree view. XXE also knew contextually what CNXML and MathML elements could be inserted where.
Finally I used XXE to edit a Simplified DocBook file. XXE had WYSIWYG editing for Simplified DocBook. FWIW XXE supports WYSIWYG Simplified DocBook editing via a plug-in (optionally provided by XXE).
Edit MathML with free XML editor, Exchanger XML Editor
For DocBook files, Exchanger has a feature ("Insert Fragment") which eases the editing pain by allowing the user to easily add a DocBook start tag and end tag, based on what the DTD allows at that point.
For MathML files, I could not get the "Insert Fragment" feature to work.
Re: Edit MathML with free XML editor, Exchanger XML Editor
The helper tab had the correct children for the DocBook, MathML, and CNXML files that I tested. The "Insert Fragment" feature appears to be a keyboard shortcut to the helper functionality.
Reviews for the two free XML editors mentioned above ...
http://www.winwriters.com/articles/xmlmind/index.html
http://www.hyperwrite.com/features/XMLmind_review.html
Review of Cladonia Exchanger XML Editor ver 3.2
http://www.winwriters.com/articles/exchanger/index.html
XXE WYSIWYG editing
XXE WYSIWYG DocBook editing still does not rule out the end user abusing the structure to get the right presentation.
(A motivating example for the above is that in Word the end user can center a list by enclosing the list of note with one or more single item lists. Each new enclosing list pushes the list of note closer to the center of the page.)
A truism may be that an XML author necessarily will need to be more conscious of the document structure.
Science and Nature just say no to Office 2007
http://www.robweir.com/blog/2007/04/math-markup-marked-down.html
The magazine Science says ...
"Because of changes Microsoft has made in its recent Word release that are incompatible with our internal workflow, which was built around previous versions of the software, Science cannot at present accept any files in the new .docx format produced through Microsoft Word 2007, either for initial submission or for revision. Users of this release of Word should convert these files to a format compatible with Word 2003 or Word for Macintosh 2004 (or, for initial submission, to a PDF file) before submitting to Science."
"Users of Word 2007 should also be aware that equations created with the default equation editor included in Microsoft Word 2007 will be unacceptable in revision, even if the file is converted to a format compatible with earlier versions of Word; this is because conversion will render equations as graphics and prevent electronic printing of equations, and because the default equation editor packaged with Word 2007 -- for reasons that, quite frankly, utterly baffle us -- was not designed to be compatible with MathML. Regrettably, we will be forced to return any revised manuscript created with the Word 2007 default equation editor to authors for re-editing. To get around this, please use the MathType equation editor or the equation editor included in previous versions of Microsoft Word."
MathType and Design Science may live to see another day.
The magazine Nature says ...
"We currently cannot accept files saved in Microsoft Office 2007 formats. Equations and special characters (for example, Greek letters) cannot be edited and are incompatible with Nature's own editing and typesetting programs."
Strong words indeed from Science and Nature.
The author of the blog (Rob Weir) had this to say ...
"The choice to invent a new "Open Math Markup Language" rather then use the well-established existing standard, MathML, appears to be a serious flaw."
Linked reply from the above blog entry ...
XHTML and MathML from Office 2007
http://dpcarlisle.blogspot.com/2007/04/xhtml-and-mathml-from-office-20007.html
Word 2007 has MathML input/output (via an XSL stylesheet installed with the system), and has HTML input/output (via its save as web page file menu), so the plan of action is: save the document as html, clean it up to xhtml, using the stylesheet to convert the mathematics to MathML at the same time.
1. Write your document in Word 2007, save as web page file.htm .
2. Use tagsoup to get some usable XML from this output. java -jar tagsoup-1.1.jar --lexical --output-encoding=iso-8859-1 file.htm > temp.xml
3. Use the supplied xhtml-mathml stylesheet to do some further cleanup and apply the Microsoft supplied omml2mml.xsl stylesheet to the math fragments. java -jar saxon8.jar -o file.xml temp.xml xhtml-mathml.xsl
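Steps 2 and 3 above could be wrapped in a small Python script so that an author only runs one command. This is a sketch that simply replays the two command lines quoted above; it assumes Java is installed and that `tagsoup-1.1.jar`, `saxon8.jar` and `xhtml-mathml.xsl` sit in the working directory. The function names are made up.

```python
import subprocess


def build_commands(htm_file, temp_xml="temp.xml", out_xml="file.xml"):
    """Return the two command lines from steps 2 and 3 as argument lists."""
    tagsoup = ["java", "-jar", "tagsoup-1.1.jar", "--lexical",
               "--output-encoding=iso-8859-1", htm_file]
    saxon = ["java", "-jar", "saxon8.jar", "-o", out_xml,
             temp_xml, "xhtml-mathml.xsl"]
    return tagsoup, saxon


def convert(htm_file, temp_xml="temp.xml", out_xml="file.xml"):
    """Run both steps; needs Java plus the jars and stylesheet in the working directory."""
    tagsoup, saxon = build_commands(htm_file, temp_xml, out_xml)
    with open(temp_xml, "wb") as sink:
        # Equivalent of the shell redirection "> temp.xml" in step 2.
        subprocess.run(tagsoup, stdout=sink, check=True)
    subprocess.run(saxon, check=True)
    return out_xml
```

Calling `convert("file.htm")` would then produce `file.xml`, provided the tools behave as described in the blog entry.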
The same author also has an entry on how to convert OpenOffice to XHTML+MathML:
XHTML and MathML from OpenOffice.org 2.2
http://dpcarlisle.blogspot.com/2007/04/xhtml-and-mathml-from-openofficeorg-22.html
Hat Tip to Ross ...
TeXmacs: WYSIWYG Math editor ... Hat tip to Chuck
http://www.texmacs.org/Samples/texmacs.pdf
"Above all, TeXmacs can be used to write structured documents with mathematical formulas. As the name suggests, we have been inspired by the TeX-LaTeX system [Knu84, Lam94, GMS93] from the typesetting point of view and the emphasis on content rather than presentation. However, we do believe that presentation is important during the writing phase. Much like high quality typesetting allows the reader to concentrate on what he is reading, a good editor should allow the author to concentrate on what is written and not on how it is written. In particular, TeXmacs does not rely on TeX-LaTeX, but has been written from scratch to be both structured and wysiwyg (what-you-see-is-what-you-get). This makes the program more efficient to learn and use than TeX-LaTeX, without sacrificing its major advantages."
---
Redux
1. WYSIWYG MathML editor
2. Exports XHTML+MathML and LaTeX
3. Creates structured documents
4. Emphasis on content rather than presentation
5. Runs on both Windows and Linux (where the Windows app is a Linux port)
6. LaTeX commands can be entered instead of WYSIWYG
7. Positioning itself as a scientific office suite
Re: TeXmacs: WYSIWYG Math editor ... Hat tip to Chuck
The UI is clunky and may be intuitive only to emacs users, vis-a-vis generic Windows users.
To get MathML out, you need to set preferences:
Edit→Preferences→Converters→TeXmacs→Html→Use MathML
and then
File→Export→Html.
The promise of structural/semantic documents was not met. Sections acted more like Word's Heading1 (or Html's <h2>) than CNXML's <section>, i.e. the section did not contain its semantic children.
The MathML generated also was not semantic, i.e. Presentation-MathML was generated and not Content-MathML. The document can also be exported as "Xml" which does contain the math semantically. Go figure.

http://www.idealliance.org/proceedings/xml05/ship/17/17-Mathews.HTML
"There are many good tools for converting Word documents into XML, and a number of them support conversion of Equation Editor/MathType equations into MathML. All of them ultimately rely on MathType conversion facilities. MathType can convert all the equations in a document into a textual form (either MathML or TeX/LaTeX)."
"The main problem with converting Word documents is not software but authors. Documents that have been carefully authored to be regular in their format tend to convert well. However, most Word documents are not carefully authored. The author often enters simple math expressions as a mix of symbols entered directly into Word, and Equation Editor or MathType equations for the more complicated bits. Similarly, the document text may look uniform, but is really a mix of many styles. Consequently, a number of production workflows have manuscripts cleaned up or re-keyed, and lock them down to an XML schema."
"LaTeX conversion is harder. There are three or four main LaTeX-to-XML+MathML converters: TtM, TeX4ht, Hermes, and WebEQ."
"A new LaTeX converter called Hermes [7] is still in development, but seems promising. It is currently best run on Linux hardware. If authors use special macro packages, Hermes can output content MathML markup, which is basically unique among TeX converters."
"In general, both Word and LaTeX conversions are difficult, time- and labor-intensive tasks, except in rare situations where the authoring is rigorously controlled. Consequently, virtually all of the production workflows with which we are familiar outsource the conversion of manuscripts."
"Like all XML languages, MathML is verbose and not intended for direct human editing, therefore WYSIWYG tools are essential for productivity."
"Another common workflow pattern, particularly in enterprise settings, arises when output formatting is fairly regular and not too complex (e.g., regular updates to technical manuals). In these situations, the emphasis is usually on coming up with a robust, high-volume, hands-off composition process. Such workflows are more likely to rely on techniques such as using XSL to produce HTML, or XSL-FO engines."
http://www.cse.ohio-state.edu/~gurari/TeX4ht/mn.html
TeX4ht uses the TeX compilation system => If TeX can handle the macros thrown at it, so can TeX4ht.
The OpenOffice.org conversion creates a 1.0 OOo file (.sxw file), which our Word importer also takes. Thus, the output of a client-side conversion to OOo via TeX4ht can be fed directly into the Word importer.
This site http://www.tug.org/utilities/texconv/textopc.html (Converters from LaTeX to PC Textprocessors - Overview) claims that
"The most successful path is using TeX4ht to convert to the OpenOffice format (.sxw, which actually is a zip compressed archive containing the document and vector graphics as XML and the bitmap graphics as bitmap files)"
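Since the .sxw file is a zip archive, a quick Python sanity check on what TeX4ht produced, before handing it to the Word importer, might look like the sketch below. It assumes the document body lives in a member named `content.xml`, which is what I would expect from a 1.0 OOo file; the function name `inspect_sxw` is made up.

```python
import zipfile
import xml.etree.ElementTree as ET


def inspect_sxw(path):
    """Return (member names, root tag of content.xml) for an .sxw archive.

    `path` may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        # Assumption: the document body is stored as content.xml.
        with archive.open("content.xml") as stream:
            root = ET.parse(stream).getroot()
    return names, root.tag
```

A missing `content.xml` raises `KeyError`, which is a reasonable signal that the conversion did not produce a usable file.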