Friday, December 5, 2008

XML, Structure, and DocBook

In my last post, I mentioned that all I really need to be able to produce is PDF and XML. PDF can be generated from suitable XML, and a colleague who works extensively with the publishing industry remarked that my leaning towards LaTeX was out of step with the industry's move towards XML-based content representation. In and of itself, that didn't strike me as terribly interesting. As I wrote to my colleague:
I don't doubt your observation that publishers are converging on XML as a representation format, but I don't think it's very meaningful. XML is just text in a particular syntactic format, and anything can be translated into XML: FrameMaker, Word, LaTeX, reST, HTML, you name it. Making sense of an XML document requires knowing the schema that assigns semantics to the document's elements, and my sense is that the publishing world is not converging on a common schema. O'Reilly uses DocBook. The Pragmatic Programmers use PML. Presumably Pearson uses something else. If I take my book in DocBook/XML format and give it to somebody whose tool chain expects XML using a different schema, that tool chain will be unable to do anything interesting with the document until it's been translated from DocBook XML to OtherSchema XML. Such a translation may be easy or difficult, depending on how well the semantic elements of the two schemas correspond.
Still, I started poking into information about generating XML from FrameMaker (what I've used for my previous books), and that led into a detour about the difference between Unstructured FrameMaker (the variant I've been using, where there is no document schema) and Structured FrameMaker (the variant that uses document schemas). Both can generate XML, but then I read an article at scriptorium.com that yielded an XML epiphany. The XML generated by Unstructured FrameMaker consists of a flat sequence of paragraphs identified by their styles, e.g.,
Book Title
Book Author
Book Chapter
Chapter Intro
Chapter Section
Section Para
Section Para
Chapter Section
Section Para
Book Chapter
Chapter Intro
...
For purposes of generating PDF, this is fine, because all we need to know is how to format each paragraph style. But the flat sequence of paragraphs fails to reflect the underlying structure of the document. That looks more like this:
Book Title
Book Author
Book Chapter
    Chapter Intro
    Chapter Section
        Section Para
        Section Para
    Chapter Section
        Section Para
Book Chapter
    Chapter Intro
    ...
The structural information isn't needed for typesetting, but it's present in my head as I write, and it's reflected in the eventual formatting (e.g., chapter titles are typeset bigger than section titles, which are typeset bigger than subsection titles, etc.), so having it present in the XML seems like a pretty reasonable notion. Furthermore, XML schemas used by the publishing industry for book representation are doubtless going to contain such information, so if I want to facilitate transformation of my book's XML into whatever XML a publisher might want, my XML needs to have the structural information the target XML will require.
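To make the flat-versus-nested distinction concrete, here is a minimal sketch of what recovering the structure involves. It assumes the flat export is available as (style, text) pairs in document order. The sample data, the `nest` function, and the DocBook-flavored element names are all illustrative assumptions: this is not what FrameMaker actually emits, and the output is not validated against the DocBook schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat export: (paragraph style, text) pairs in document order.
flat = [
    ("Book Title", "My Book"),
    ("Book Author", "A. Writer"),
    ("Book Chapter", "Chapter One"),
    ("Chapter Intro", "Intro to chapter one."),
    ("Chapter Section", "First Section"),
    ("Section Para", "Some text."),
    ("Section Para", "More text."),
    ("Chapter Section", "Second Section"),
    ("Section Para", "Yet more text."),
    ("Book Chapter", "Chapter Two"),
    ("Chapter Intro", "Intro to chapter two."),
]


def nest(paragraphs):
    """Rebuild a book/chapter/section hierarchy from a flat style sequence.

    Assumes the styles arrive in a sensible order (a chapter before its
    intro, a section before its paragraphs).
    """
    book = ET.Element("book")
    chapter = None  # chapter currently being filled
    section = None  # section currently being filled
    for style, text in paragraphs:
        if style == "Book Title":
            ET.SubElement(book, "title").text = text
        elif style == "Book Author":
            ET.SubElement(book, "author").text = text
        elif style == "Book Chapter":
            chapter = ET.SubElement(book, "chapter")
            section = None  # a new chapter closes any open section
            ET.SubElement(chapter, "title").text = text
        elif style == "Chapter Intro":
            ET.SubElement(chapter, "para").text = text
        elif style == "Chapter Section":
            section = ET.SubElement(chapter, "section")
            ET.SubElement(section, "title").text = text
        elif style == "Section Para":
            ET.SubElement(section, "para").text = text
        else:
            raise ValueError(f"unknown paragraph style: {style!r}")
    return book


book = nest(flat)
print(ET.tostring(book, encoding="unicode"))
```

The point of the sketch is that the nesting has to be inferred from the order of the paragraph styles. That works for a tidy example like this one, but it is exactly the information a structured document would carry explicitly.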

In short, there's XML and there's XML, and XML without structural information about the book content it represents almost certainly imposes serious restrictions on what can be done with it. Going down that road seems foolish.

My need, then, is to be able to generate XML that reflects the logical structure of my book. I thus need an XML schema that defines that structure. I could come up with one from scratch, but I'm not so naïve as to believe that that's a simple task, or, more precisely, a simple task to do well. Call me a reuse buff, but I want to pick up a pre-fab book schema, assume that the people who developed it knew what they were doing, and get on to the real work of producing content. Which pretty much takes me back to DocBook and the search for a DocBook-aware XML editor.

5 comments:

Allan said...

Regarding a DocBook-aware XML editor, Altova's XMLSpy can load the DocBook XML Schema, and it offers an "Authentic" view for editing the content without having to deal with the markup:

DocBook in XMLSpy

I have no experience with this approach -- your comments prompted me to poke around (just now) for a potential solution on the Altova web site. But it seems promising on paper... er... on the screen...

Anonymous said...

Have you considered LyX? LyX is a WYSIWYM editor (What You See Is What You Mean). It combines TeX/LaTeX with a GUI. One can also export the document to DocBook SGML. YMMV; I have never used the SGML export feature.

Keith Fahlgren said...

oXygen is getting better with every release.

One consideration: Unless your publisher really screws up, your content will be viewed by fewer people in the printed form than via other media. Does that change your perspective on the problem?

Matt Doar said...

I used LaTeX and Emacs to write my Ph.D. thesis in 1992, and then more recently I wrote "Practical Development Environments" for O'Reilly (2005) and used DocBook and Emacs to do so. PDF and HTML were generated using FOP.

I found LaTeX and DocBook to be about the same amount of effort to use, but the output from LaTeX was generally superior to my eyes.
O'Reilly convert everything into Frame files for internal use and final presentation, and that's a one-way trip for them.

~Matt

Keith Fahlgren said...

"""O'Reilly convert everything into Frame files for internal use and final presentation, and that's a one-way trip for them."""

O'Reilly now has a DocBook-based workflow (and has for a couple of years), so we'd no longer go to Frame in this case.