Real World XML: XML in Business (Part 2)

March 7, 2009 09:35 by Brian

Challenges with XML Implementations

In the previous article in this series, I introduced the origins and benefits of XML. In this article, I will discuss the common challenges associated with XML implementations (focusing on the common business uses for XML that I identified in the previous article).

Content Storage

Storing information in XML promised an open, non-proprietary format, unlike relational databases (e.g. Oracle, SQL Server, MySQL) or binary formats (e.g. Microsoft Excel, Microsoft Word). However, deployment of XML content storage hasn’t been as widespread as originally expected. There are a number of reasons for this.

First, there is an abundance of data stored in relational databases, binary files, and other systems. Migrating legacy data to a new system is complex and requires significant quality assurance controls.

Second, storing large amounts of data in XML does not scale or perform as well as relational databases and binary formats, which have matured over decades.

Third, for many corporations, the cost of ownership of XML storage solutions is a significant constraint. Most enterprises have an internal DBA staff to support their existing database systems. There is also no shortage of expert DBA consultants, which keeps consulting fees down. Since XML storage systems are highly specialized, most corporations don’t have the in-house expertise to support them, and highly specialized skills often command higher consulting fees.

Fourth, re-architecting applications to support XML storage systems is also complex and expensive. Most business users would rather enhance existing systems with more business process automation or functionality than undertake a riskier and more costly change to the technical architecture whose ROI is questionable.

Native XML Content Editing

Editing content natively in XML (e.g. with XML editors) also has its own set of challenges.

First and most significant, large-scale migrations of “legacy data” have slowed adoption of XML content editing implementations. For example, over the past five years, the pharmaceutical industry has been forced to spend significant effort and money (planning and implementing) to convert existing labels to XML. In the United States, the XML format is SPL. In Europe, the XML standard is PIM. These two schemas differ dramatically because their original goals and visions differed.

In addition, many companies are waiting for XML-based standards to mature. Companies often embrace line-of-business (LOB) XML schemas, such as XBRL, SPL, and PIM, only when mandated by governments and regulatory authorities. There are often competing standards because the XML schemas were developed with different design goals. In order to speed up XML adoption, there needs to be harmonization among competing standards.

In contrast to the various line-of-business (LOB) XML schemas, general-purpose XML schemas have emerged over the past five to six years. The two most important are OpenDocument and OpenXML. These XML schemas hold the most promise, as they provide more features and capabilities than line-of-business (LOB) schemas. The learning curve and costs to migrate to OpenDocument or OpenXML are also significantly lower than for line-of-business (LOB) XML schemas, because office suites (e.g. OpenOffice and Microsoft Office) have built-in conversion capabilities.

OpenDocument and OpenXML are likely to coexist as document formats, and proprietary (often binary) formats will die off (but not soon enough). The coexistence of the OpenDocument and OpenXML standards is actually not a bad thing, as competition will force these standards to evolve at a faster rate than if only one standard existed (e.g. the slow evolution of HTML). Storing content in these formats will allow it to be converted more easily to line-of-business (LOB) schemas on demand, while authoring is still done in traditional and familiar office suite applications (e.g. OpenOffice and Microsoft Office).

OpenDocument and OpenXML are a better solution than editing line-of-business (LOB) schemas or even generalized schemas. They are more robust and, most importantly, more familiar to end users. It is often easier to convert OpenDocument and OpenXML to line-of-business (LOB) schemas than to author in those schemas directly.

For example, it would be better for the pharmaceutical industry to keep labeling documents in OpenDocument or OpenXML than to use line-of-business (LOB) or "generalized labeling" schemas. By embracing OpenXML and OpenDocument for labeling documents, the industry can develop universal schema converters that are able to convert OpenXML and OpenDocument to PIM, SPL, etc., thus providing better ROI and efficiency gains than pushing the pharmaceutical industry to embrace line-of-business (LOB) schemas.

Many pharmaceutical companies have struggled, or are struggling, to implement XML editing solutions for labeling, without seeing true ROI. On the usability front, users aren’t familiar with the new XML editors, which are simply not as user-friendly and robust as Microsoft Word, which has matured over decades. On the document management front, supporting multiple formats for labeling documents, or converting to specific schemas based on geographical regulatory authorities, is expensive and cumbersome.

Content Publishing

Content publishing is one of the strongest benefits of XML. Stylesheets (XSLT) make it possible to publish content from a single source to a variety of output formats (e.g. HTML, XML, PDF, RTF, SpreadsheetML, WordML, OpenDocument, OpenXML).

For example, you can publish product information content to a website (HTML) using XSLT or to PDF for printing using XSL-FO. You can publish the content in a variety of layouts and styles based on business requirements, such as a brochure, a flyer, or a website. The possibilities are endless, but not without challenges. In this section, I will discuss a few of the common challenges and pitfalls.
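
To make the single-source idea concrete, here is a minimal sketch in Python. Python's standard library has no XSLT processor, so plain functions stand in for the stylesheets here; the `product` document and its element names are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single-source product record; element names are illustrative.
SOURCE = """<product>
  <name>Widget &amp; Co. Starter Kit</name>
  <summary>A small kit for evaluation.</summary>
  <feature>Lightweight</feature>
  <feature>Reusable</feature>
</product>"""


def to_html(product: ET.Element) -> str:
    """Render the product as an HTML fragment for a website."""
    items = "".join(
        f"<li>{escape(feature.text or '')}</li>"
        for feature in product.findall("feature")
    )
    return (
        f"<h1>{escape(product.findtext('name', ''))}</h1>"
        f"<p>{escape(product.findtext('summary', ''))}</p>"
        f"<ul>{items}</ul>"
    )


def to_text(product: ET.Element) -> str:
    """Render the same product as plain text, e.g. for a flyer draft."""
    lines = [product.findtext("name", "").upper(), product.findtext("summary", "")]
    lines += [f"* {feature.text}" for feature in product.findall("feature")]
    return "\n".join(lines)


# One parsed source, two independent renderings.
product = ET.fromstring(SOURCE)
html_output = to_html(product)
text_output = to_text(product)
```

Each output needs its own escaping and layout rules even in this tiny case, which hints at why real stylesheets grow quickly.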

First and foremost, each output format handles layout, style inheritance, and pagination differently. HTML has no concept of pagination. HTML style inheritance is handled differently than XSL-FO style inheritance. This can complicate stylesheet and template development, in that no assumptions can be made. Creating a mirror image of content in HTML and PDF can be a formidable challenge, especially when the documents are printed. The stylesheets require significant testing due to the vast number of permutations.

Second, the source XML doesn’t (or at least shouldn’t) contain layout information, such as “keep this section on the same page as the next” or “don’t split these words”. These are design layout criteria, not data. For this reason, the XSLT stylesheets and templates can get cumbersome or complicated quickly when trying to format the information correctly.

Third, the more special cases that you create in the stylesheets and templates, the more complex they are to maintain. Content re-use through publishing to various outputs can be slowed considerably by complexities with formatting and layout.

Data Exchange and Enterprise Application Integration

XML as a data exchange format is where XML truly shines. But again, it is not without its own set of challenges.

Before we delve into XML as a data exchange format, we need some historical context. Prior to XML, one of the most common file formats for data exchange was the comma-separated values (CSV) file. The source system would simply export table data from a proprietary database to a flat file. Typically the file would be FTP’d to a second server, where a data feed would be configured to import the CSV file into another database system. This presented three common problems. First, it worked quite well for flat (tabular) data, but data is often hierarchical and structured. Second, it required knowledge of the file layout, as the CSV file was not self-describing. A definition file had to accompany the CSV file in order to understand the data. Third, there was no validation, so a single misplaced comma could throw off the entire data feed.
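
The misplaced-comma problem is easy to reproduce. The sketch below is only an illustration: the field layout and sample rows are hypothetical, and the loader maps values to field names purely by position, as a simple data feed might.

```python
import csv
import io

# The layout is not in the file; the receiver must know it out of band.
FIELDS = ["id", "name", "city", "amount"]

good_row = "1001,Acme Corp,Boston,250.00"
# The same record with an unquoted comma inside the name field.
bad_row = "1001,Acme, Corp,Boston,250.00"


def load(line: str) -> dict:
    """Map one CSV line onto the agreed field names by position."""
    values = next(csv.reader(io.StringIO(line)))
    # zip() stops at the shorter sequence, so an extra value is dropped silently.
    return dict(zip(FIELDS, values))


good = load(good_row)
bad = load(bad_row)
```

In `bad`, every field after the stray comma shifts one position to the right: the city becomes part of the name, the amount field holds the city, and the real amount is lost without any error being raised.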

XML solves all of these challenges. It is self-describing and hierarchical. It is structured, rather than a linear dump of data records. It provides schema validation to ensure data integrity. However, there are a few challenges often associated with XML as a data exchange format. Many of them aren’t necessarily limited to XML.
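
As a rough illustration of these properties, the following sketch parses a hypothetical `order` document with Python's standard library. Note the assumption: the standard library parser only checks that a document is well-formed. Validating against an XML schema would need a third-party library, which is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical order message; element and attribute names are illustrative.
ORDER = """<order id="1001">
  <customer>
    <name>Acme, Corp</name>
    <city>Boston</city>
  </customer>
  <lines>
    <line sku="A-1" quantity="2"/>
    <line sku="B-7" quantity="1"/>
  </lines>
</order>"""

order = ET.fromstring(ORDER)

# Fields are addressed by name, so a comma in the data is harmless.
customer_name = order.findtext("customer/name")

# The hierarchy (an order with many lines) is part of the document itself.
quantities = {
    line.get("sku"): int(line.get("quantity")) for line in order.iter("line")
}


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# A truncated transfer is rejected outright instead of being half-imported.
truncated = ORDER.replace("</order>", "")
```

Here `is_well_formed(ORDER)` is true and `is_well_formed(truncated)` is false, so a damaged file fails loudly before any data reaches the target system.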

First, XML is rather verbose. This provides the self-describing benefits of XML, but comes at a cost. XML documents are considerably larger than equivalent flat files. The decreasing cost of storage, increasing network bandwidth, and increasing processing power help offset this cost, but for large-scale integration projects it cannot be ignored, or performance issues will likely surface.
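
To get a feel for the overhead, the sketch below serializes the same made-up records both ways and compares the byte counts. The record shape and element names are assumptions for illustration, and the exact ratio will vary with real data and tag names.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["id", "name", "city", "amount"]

# 1,000 synthetic records, purely for size comparison.
records = [
    {"id": str(i), "name": f"Customer {i}", "city": "Boston", "amount": "250.00"}
    for i in range(1000)
]


def as_csv(rows: list) -> bytes:
    """Serialize the rows as CSV with a single header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def as_xml(rows: list) -> bytes:
    """Serialize the rows as XML, one child element per field."""
    root = ET.Element("customers")
    for row in rows:
        customer = ET.SubElement(root, "customer")
        for field, value in row.items():
            ET.SubElement(customer, field).text = value
    return ET.tostring(root, encoding="utf-8")


csv_bytes = as_csv(records)
xml_bytes = as_xml(records)
print(f"CSV: {len(csv_bytes)} bytes, XML: {len(xml_bytes)} bytes")
```

CSV names each field once in the header, while XML repeats the opening and closing tags for every value in every record, which is where the extra size comes from.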

Second, XML schemas can become challenging to version and maintain, especially when both the sender and receiver perform schema validation (as they should). Both source and target need to ensure they are using the same XML schema.
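
One common way to keep sender and receiver in step is to carry a version marker in each message and check it before validating or importing. The sketch below assumes a hypothetical `schemaVersion` attribute on the root element and a hypothetical set of supported versions. It is one possible convention, not a feature of XML Schema itself.

```python
import xml.etree.ElementTree as ET

# Versions this receiver knows how to validate and import (hypothetical).
SUPPORTED_VERSIONS = {"1.0", "1.1"}


class UnsupportedSchemaVersion(Exception):
    """Raised when sender and receiver do not share a schema version."""


def check_version(document: str) -> str:
    """Return the message's schema version, or raise if it is unsupported."""
    root = ET.fromstring(document)
    # A missing attribute yields None, which is also rejected below.
    version = root.get("schemaVersion")
    if version not in SUPPORTED_VERSIONS:
        raise UnsupportedSchemaVersion(f"unsupported schema version: {version!r}")
    return version
```

For example, `check_version('<order schemaVersion="1.1"/>')` returns `"1.1"`, while a message marked `2.0`, or one with no version at all, is rejected before any schema validation is attempted.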

Third, XML schema provides validation, but the target and source still need to agree upon formats and data types. For example, not all dates are stored or interpreted the same way. Field-level conversions may still be required on the receiver to convert a date string to a format that can be understood by the target database.
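
The date problem can be sketched in a few lines. The helper name `to_iso` and the two source formats are assumptions for illustration. The point is that the same string yields different dates depending on which convention the sender used, so the format has to be agreed on up front.

```python
from datetime import datetime


def to_iso(value: str, source_format: str) -> str:
    """Convert a date string in the sender's format to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(value.strip(), source_format).date().isoformat()


raw = "03/07/2009"

# Interpreted as month/day/year (common in the United States).
us_date = to_iso(raw, "%m/%d/%Y")

# Interpreted as day/month/year (common in Europe).
eu_date = to_iso(raw, "%d/%m/%Y")
```

Here `us_date` is `2009-03-07` and `eu_date` is `2009-07-03`. Both parse without error, so this kind of mismatch can pass unnoticed if the schema types the field as a plain string.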



Copyright (c) 2007-2009 Brian J. Stewart, Copyright Policy
