Tools of Change Notes: XML in Practice
Mike's notes from the XML in Practice session at the O'Reilly Tools of Change Conference 2/9/09
This was the second-half of a day-long set of XML tutorials. In the morning, there was an Introduction to XML for Publishers talk that I did not attend. If you're interested, you can find the description of that session and the presenter's slides here.
The descriptions, slides, handouts, and working files for XML in Practice can be found here.
Speaker 1: Bill Kasdorf (Apex Content Solutions)
Bill spoke about XML Models for books. He gave quick-hit looks at several XML schema for describing book content: ISO 12083, TEI, NLM, DocBook, DITA, and DTBook. He described how there often is no easily identifiable target schema for most publishers because "books are messy," meaning they are often more complex than they seem and require some degree of customization, no matter what schema you choose for your XML workflow. I wholeheartedly agree. Here's my 2 cents on two of the most popular schema for books, DocBook and DITA:
The advantage of working with off-the-shelf schema is that they are widely understood and a number of XML tools support them out-of-the-box. If you can avoid customization, you'll save time, work, and money. But there is no magic. Books are messy.
DocBook was designed to describe technical documentation. Though it describes narrative documents (books, of course), it may not (probably won't) fit your content precisely, and you'll either have to make do or end up doing other sorts of customization (and paying for it somewhere else along the line). Still, it has been around a long time (since long before there was such a thing as XML). It wouldn't have survived and thrived if it hadn't been very useful for publishers.
DITA (Darwin Information Typing Architecture) was also designed to describe technical documentation, but in a very different way. It describes independent chunks of content called topics, linked together to make documents. It's relatively new and has become very popular very quickly. Lots of tools support DITA out of the box. You just press the DITA button and start writing. It is wonderfully efficient for creating those chunks. But it is very hard to bring existing narrative content into DITA, (or to make nicely-flowing narrative content out of DITA), because it is so granular and modular. One of the keys for choosing a schema is finding one not only with appropriate tags for your content, but also right level of granularity.
If you want a wider view of the issues involved with creating an XML publishing workflow, you might want to check out the ongoing series of articles on that topic by Eric Damitz, one of my partners on Publicious.net. OK, now back to the O'Reilly show.
Bill Kasdorf also spoke a little about epub. epub is an XML standard for producing eBooks. It is composed of three open standards, the Open Publication Structure (OPS), Open Packaging Format (OPF) and Open Container Format (OCF), all produced by the IDPF.
IDFP maintains user forums where you can ask questions and learn more about epub.
Font rights are an issue. (I believe all fonts are born with certain inalienable rights: life, ligatures, and the pursuit of happiness. But I digress.) IDPF is working on font-mangling spec for epub to prevent pirates (Arrr!) from extracting fonts from epub.
Speakers 2 & 3: Bob Kelly, The American Physical Society and John Gardner, ViewPlus Technologies, Inc.
Their topic was Universally Usable Mainstream Online Publishing. They spoke about the challenges and benefits of working in XML to make content accessible. Here are a few interesting technologies they referenced.
ONIX is an XML standard for marketing and distribution information about published material.
Fire Vox is a free open source, talking browser extension for the Firefox. Install it and it acts a screen reader for content in Firefox.
The IVEO learning system by ViewPlus enables you to output an SVG file in three sensory modalities: sight, sound, and touch (utilizing an embosser). If this interests you, hang on to your copy of InDesign CS3, since CS4 does not export SVG.
Speaker 4: Norman Walsh (Mark Logic Co.)
Norm's topic was "Where is the New When." He spoke about the possibilities of adding geospatial information to content. He cited new opportunities for delivering (pushing) content to people with devices (basically phones with GPS) that know where they are. The example: you can get a message that the bar down the block has half-price drinks right now. You can also seek out geospatial content. So you can search for a sushi restaurant nearby that has a movie theater and an ATM within certain radius. Clearly we're talking Big City stuff, but you get the point.
What is meant by "were is the new when?" One example is that instead of having our search results ordered by most recent first, we can have them ordered by distance from our location.
Some digital cameras have GPS chips to tag the images taken with them. Norm showed a map that aggregated data from hundreds of geotagged Photoshop images. It looked like one of those "Earth at night" images. The datapoints made a perfect outline of the US, with cities, and even major highways visible (apparently people take a lot of pictures as they drive down the road). Not sure exactly how this relates to publishing books, magazines, journals, etc, but it was neat.
Speaker 5: Marisa DeMeglio (DAISY Consortium)
In XML circles, DAISY stands for Digital Accessible Information SYstem. It is the stage name for the NISO Z39.86 standard. (You can see why they needed to come up with something catchy like "DAISY"). It is a standard for making digital content accessible to blind, visual impaired, print-disabled, and learning-disabled people. When you follow the DAISY standard, you produce a Digital Talking Book (dtbook). The dtbook DTD is the set of tags and rules for creating a valid dtbook XML file. A subset of the DAISY format comprises the NIMAS (National Instructional Material Accessibility Standard) for K-12 core student-facing content. There are several tools worth knowing about for working with DAISY.
The Save as DAISY add-in for Microsoft Word allows you to save content from Word XP, Word 2003, and Word 2007 as DAISY output.
Odt2dtbook is an OpenOffice.org Writer extension, that allows you to save content from OpenOffice documents as DAISY output.
The DAISY Pipeline is a very cool, open source package for converting files to and from DTBook format.
Design Science makes an product called called MathDaisy (currently in beta, PC-only : p), which handles the math content for either the DAISY Pipeline or the Save As DAISY add-in for Word.
And some obscure application called InDesign CS4 can export dtbook XML in an epub document (via the Export to Digital Editions feature).
Next up: Google Book Search, Adobe, and Quark.