OASIS Open Architecture for XML Authoring and Localization Reference Model (OAXAL)

Reference Model for Open Architecture for XML Authoring and Localization Version 1.0

The Open Architecture for XML Authoring and Localization (OAXAL) provides a comprehensive, efficient, and cost-effective model for building an XML lifecycle production framework based completely on Open Standards from OASIS, LISA OSCAR, and W3C.

This document was last revised or approved by the OAXAL TC on the above date. The level of approval is also listed above. Check the "Latest Version" or "Latest Approved Version" location noted above for possible later revisions of this document.

Technical Committee members should send comments on this specification to the Technical Committee's email list. Others should send comments to the Technical Committee by using the "Send A Comment" button on the Technical Committee's web page at http://www.oasis-open.org/committees/oaxal/.

For information on whether any patents have been disclosed that may be essential to implementing this specification, and any offers of patent licensing terms, please refer to the Intellectual Property Rights section of the Technical Committee web page (http://www.oasis-open.org/committees/oaxal/ipr.php).

Notices

All capitalized terms in the following text have the meanings assigned to them in the OASIS Intellectual Property Rights Policy (the "OASIS IPR Policy"). The full Policy may be found at the OASIS website.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published, and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this section are included on all such copies and derivative works. However, this document itself may not be modified in any way, including by removing the copyright notice or references to OASIS, except as needed for the purpose of developing any document or deliverable produced by an OASIS Technical Committee (in which case the rules applicable to copyrights, as set forth in the OASIS IPR Policy, must be followed) or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by OASIS or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and OASIS DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY OWNERSHIP RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

OASIS requests that any OASIS Party or any other party that believes it has patent claims that would necessarily be infringed by implementations of this OASIS Committee Specification or OASIS Standard, to notify OASIS TC Administrator and provide an indication of its willingness to grant patent licenses to such patent claims in a manner consistent with the IPR Mode of the OASIS Technical Committee that produced this specification.

OASIS invites any party to contact the OASIS TC Administrator if it is aware of a claim of ownership of any patent claims that would necessarily be infringed by implementations of this specification by a patent holder that is not willing to provide a license to such patent claims in a manner consistent with the IPR Mode of the OASIS Technical Committee that produced this specification. OASIS may include such claims on its website, but disclaims any obligation to do so.

OASIS takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on OASIS' procedures with respect to rights in any document or deliverable produced by an OASIS Technical Committee can be found on the OASIS website. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this OASIS Committee Specification or OASIS Standard, can be obtained from the OASIS TC Administrator. OASIS makes no representation that any information or list of intellectual property rights will at any time be complete, or that any claims in such list are, in fact, Essential Claims.

The names "OASIS", [insert specific trademarked names, abbreviations, etc. here] are trademarks of OASIS, the owner and developer of this specification, and should be used only to refer to the organization and its official outputs. OASIS welcomes reference to, and implementation and use of, specifications, while reserving the right to enforce its marks against misleading uses. Please see http://www.oasis-open.org/who/trademark.php for above guidance.

The Open Architecture for XML Authoring and Localization (OAXAL) represents a comprehensive, efficient, and cost-effective model regarding the authoring and translation aspects of XML publishing. OAXAL encompasses the following key Open Standards:

This diagram is annotated and described in detail in subsequent parts of this document. OAXAL is designed to cope with the common requirements for XML authoring and Localization. The authoring plus Localization aspects of OAXAL are most effective within a Content Management System (CMS) environment. For a translation-only workflow, OAXAL can be implemented without a CMS.

OAXAL is designed to integrate tightly and transparently within the document-life-cycle workflow model which includes:

For the translation-only environment, OAXAL provides an elegant and open architecture for processing XML documents for translation.

A reference model is an abstract framework for understanding significant relationships among the entities of some environment. It enables the development of specific reference or concrete architectures using consistent standards or specifications supporting that environment. A reference model consists of a minimal set of unifying concepts, axioms, and relationships within a particular problem domain and is independent of specific standards, technologies, implementations, or other concrete details.

As an illustration of the relationship between a reference model and the architectures that can derive from such a model, consider what might be involved in modeling important aspects of residential housing. In the context of a reference model, we know that concepts such as eating areas, hygiene areas, and sleeping areas are all important in understanding what goes into a house. There are relationships among these concepts and constraints on their implementation. For example, there may be a physical separation between eating areas and hygiene areas.

The role of a reference architecture for housing would be to identify abstract solutions to the problems of providing housing. A general pattern for housing, one that addresses the needs of its occupants in the sense of, say, noting that there are bedrooms, kitchens, hallways, and so on is a good basis for an abstract reference architecture. The concept of "eating area" is a reference model concept; a kitchen is a realization of "eating area" in the context of the reference architecture.

There may be more than one reference architecture that addresses how to design housing; for example, there may be a reference architecture to address the requirements for developing housing solutions in large apartment complexes, another to address suburban single family houses, and another for space stations. In the context of high-density housing, there may not be a separate kitchen but rather a shared cooking space or even a communal kitchen used by many families.

An actual – or concrete – architecture would introduce additional elements. It would incorporate particular architectural styles, particular arrangements of windows, construction materials to be used, and so on. A blueprint of a particular house represents a specific architecture as it applies to a proposed or an actual constructed dwelling.

The reference model for housing is, therefore, at least three levels of abstraction away from a physical entity that can be lived in. The purpose of a reference model is to provide a common conceptual framework that can be used consistently across different implementations and is of particular use in modeling specific solutions.

The goal of this reference model is to define the component parts of XML publishing with respect to the authoring and Localization aspects of the process. It provides a normative reference that remains relevant for OAXAL as a comprehensive model.

The OAXAL standards components stack shows how the reference model for OAXAL is constructed from its constituent Open Standards. The concepts and relationships defined by the reference model are the basis for describing the reference architecture.

Architecture must account for the goals, motivation, and requirements that define the actual problems being addressed. While reference architectures can form the basis of classes of solutions, concrete architectures will define specific solution approaches.

Architecture is often developed in the context of a pre-defined environment, such as the protocols, profiles, specifications, and standards that are pertinent.

OAXAL implementations combine all of these elements, from the more generic architectural principles and infrastructure to the specifics that define the current needs, and represent specific implementations that will be built and used in an operational environment.

New readers are encouraged to read this reference model in its entirety. Concepts are presented in an order that the authors hope will promote rapid understanding.

This section introduces the conventions, defines the audience, and sets the stage for the rest of the document. Non-technical readers are encouraged to read this information because it provides background material necessary to understand the nature and use of reference models.

The glossary provides definitions of terms within the reference-model specification but does not necessarily form part of the specification itself. Terms that are defined in the glossary are marked in bold at their first occurrence in this document.

Note that while the concepts and relationships described in this reference model may apply to other "service" environments, the definitions and descriptions contained herein focus on the field of software architecture and make no attempt to completely account for use outside of the software domain. Examples included in this document that are taken from other domains are used strictly for illustrative purposes.

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC2119.

Open Architecture for XML Authoring and Localization (OAXAL) is a reference model of how to construct an effective and efficient system for XML authoring and Localization based on Open Standards. OAXAL comprises the following standards:

This Reference Model will demonstrate the integration of the standards listed above to present a complete automated package from authoring through translation. Additional standards may be added by the Technical Committee (TC), and the TC may elect not to include (or to make optional) any of the standards listed above that prove, upon review, not to be feasible or useful to integrate in its profiles. Authors are provided with a systematic way to identify and store all previously authored sentences. OAXAL allows for some variation in how the standards are used and integrated, and OAXAL variants need not use all of the standards enumerated above.

Key to the concept of OAXAL are Authoring and Localization. Authoring in OAXAL implies XML-based source and the concept of a document lifecycle centered around some form of content management. Content management may be achieved by means of a fully fledged Content Management System (CMS) or a Source Control System (SCS). The document lifecycle implies one or more of the following stages:

OAXAL is designed to provide an effective and elegant solution to these requirements within an authoring/localization workflow:

OAXAL can also be used to design Localization-only solutions. In this instance, a document is submitted for Localization. The format of the document may not be XML; nevertheless, OAXAL assumes a conversion to an XML form of the data prior to processing and a conversion back into the original format on completion. Translation/Localization comprises the following steps:

Unicode provides the underlying character encoding for OAXAL. Although XML allows various encoding schemes based on 7 and 8 bits, OAXAL mandates full Unicode encoding, preferably using UTF-8 or UTF-16 encoding. The benefits of using Unicode character encoding for OAXAL are as follows:

The key characteristic of OAXAL is the use of an Open Architecture based on Open Standards with XML as the source format for both the document format and, in most cases, the vocabulary of the standards. XML underpins the foundations of OAXAL. The XML source content provides semantic and structured text that can be localized. XML provides many benefits regarding authoring and Localization:

At the center of authoring and Localization is the actual XML document text to be authored and/or localized. OAXAL encompasses all publishing-oriented Open Standard XML vocabularies such as DITA, DocBook, XHTML, SVG, ODF, and others that may emerge as standards. OAXAL may also be used with proprietary XML vocabularies or with non-XML based documents that are converted into a non-Open Standard XML format.

W3C ITS is the Internationalization Tag Set Recommendation. ITS allows for the declaration of Document Rules for Localization. In effect, it provides a vocabulary that allows the declaration of the following for a given XML document type:

W3C ITS provides many more features, including a namespace vocabulary that allows for fine-tuning Localization for individual instances of elements within a document instance. W3C ITS is therefore at the core of XML Localization processing.

In OAXAL, W3C ITS stipulates the rules by which an XML document is localized in terms of its translatable content.

Unicode TR29 is the Unicode standard defining word and sentence boundaries. It allows for a uniform way of defining word boundaries for OAXAL and, as such, is used by SRX and GMX/V to tokenize text into individual words. It plays a fundamental role in OAXAL.

SRX - Segmentation Rules eXchange is an Open Standard XML vocabulary for defining the segmentation rules for a given language published by LISA OSCAR. Segmentation is an important aspect of both the authoring and Localization processes. SRX allows OAXAL to have a sentence-level granularity. SRX depends on Unicode TR29 in order to provide the basis for tokenizing text into individual words.
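As a non-normative illustration, the rule mechanism can be sketched in Python. The sketch assumes the SRX model in which each rule pairs a before-break and an after-break regular expression and is flagged either as a break rule or as a no-break (exception) rule, with the first rule that matches at a position deciding the outcome. The two rules shown and the function name `segment` are hypothetical; real SRX rule sets are language specific and considerably larger.

```python
import re

# Hypothetical rules as (is_break, before_break_regex, after_break_regex).
# Rules are tried in order; the first one that matches at a position wins.
RULES = [
    (False, r"\b(?:Mr|Mrs|Dr|etc)\.", r"\s"),  # exception: common abbreviations
    (True, r"[.?!]", r"\s+"),                   # break after sentence-final punctuation
]


def segment(text, rules=RULES):
    """Split text into segments using ordered break / no-break rules."""
    compiled = [
        (is_break, re.compile(before + r"\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # "before" must end exactly at pos, "after" must start at pos.
            if before.search(text, 0, pos) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides; skip the rest
    segments.append(text[start:])
    return segments
```

In this sketch the white space after a break stays with the following segment, so joining the segments reproduces the input text exactly.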

xml:tm is a namespace vocabulary providing a LISA OSCAR standard for author and translation memory. xml:tm is a key component of OAXAL. xml:tm introduces the concept of XML-based text memory that encompasses both in-document author memory and translation memory. In the xml:tm scenario, author and translation memory are embedded within the XML document, providing both an edit-change history of the document as well as the mechanism for 'In Context Exact' (ICE) translation-memory matching. ICE matching guarantees that the text-unit matches are from exactly the same source as the previous iteration of an updated document, as opposed to leveraged matching which cannot guarantee the provenance of a 100% match.
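The distinction between ICE, leveraged, and fuzzy matching can be illustrated with a short, non-normative Python sketch. It assumes that the previous version of the document is available as a mapping from text-unit identifier to a (source, target) pair, and that a general translation memory is a mapping from source text to target text. The function name `classify`, both data structures, and the 0.75 similarity threshold are hypothetical and are not taken from the xml:tm specification.

```python
import difflib


def classify(unit_id, text, previous, memory, threshold=0.75):
    """Return (match_type, translation or None) for one text unit.

    previous: {unit_id: (source, target)} from the prior version of the same document
    memory:   {source: target} from a general translation memory
    """
    # ICE: same identifier, same text, same document, so provenance is known.
    if unit_id in previous and previous[unit_id][0] == text:
        return "ICE", previous[unit_id][1]
    # Leveraged: identical text found elsewhere; provenance is not guaranteed.
    if text in memory:
        return "leveraged", memory[text]
    # Fuzzy: the most similar source text, if it is similar enough.
    best, best_ratio = None, 0.0
    for source in memory:
        ratio = difflib.SequenceMatcher(None, source, text).ratio()
        if ratio > best_ratio:
            best, best_ratio = source, ratio
    if best is not None and best_ratio >= threshold:
        return "fuzzy", memory[best]
    return "none", None
```

The order of the checks reflects the point made above: an ICE match is preferred because the identifier ties the text unit to the same place in the same document, whereas a leveraged match relies on the text alone.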

Given a W3C ITS rule set for a given XML vocabulary and the SRX segmentation rules for a given language, it is possible to construct a totally generic process for embedding the xml:tm text-memory namespace within the source document. xml:tm relies on W3C ITS and SRX. xml:tm allocates immutable unique identifiers to each translatable text content or a subdivision of such text content, resulting in identifiable individual sentences.

The key role of xml:tm within OAXAL is in preparing an XML document for further processing as well as providing the syntactical basis for ICE matching.

GMX - Global Information Management Metrics Exchange is a LISA OSCAR standard for word and character count and metrics exchange. GMX is a tri-partite set of standards:

Currently only GMX/V has been defined. GMX/V is a key component of OAXAL in terms of providing a uniform and consistent way of calculating the word- and character-count metrics for a given document or set of documents, as well as providing a way of embedding and exchanging such information. GMX/V depends on Unicode TR29 in order to provide the basis for tokenizing text into individual words. GMX/V also uses XLIFF as the canonical form for counting.

TMX - Translation Memory eXchange is a LISA OSCAR standard for exchanging translation memories. TMX is a key component of OAXAL, allowing for the free exchange of translation memories.

XLIFF - XML Localization Interchange File Format is an OASIS standard for exchanging Localization data. Within OAXAL, the previously described standards help prepare the XML document for translation. The xml:tm version of the document contains all of the information required for extraction and both ICE and in-document leveraged and fuzzy matching. The transformation and matching process that goes into creating an XLIFF version of the document creates a document that can be processed and exchanged by any software that can read and understand an XLIFF file. XLIFF provides an important element of protection regarding the original XML document as well as a means to embed matching information.

The key concept of OAXAL concerns how to build an efficient and effective systems architecture based on its constituent standards. The most important aspect of this architecture is how the standards interact with one another.

Unicode TR29 is used by SRX and GMX/V to tokenize text into white space, words, and punctuation. This tokenization is key to processing text for segmentation (SRX) and metrics (GMX/V).
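A rough, non-normative Python sketch of this tokenization follows. The regular expression is a deliberately simplified stand-in for the Unicode TR29 word-boundary rules, which cover many more cases, and the function names `tokenize` and `word_count` are hypothetical. The counting function only shows where a metrics step would plug in; it does not implement the GMX/V counting rules.

```python
import re

# Simplified stand-in for TR29 word boundaries: runs of word characters
# (allowing internal apostrophes), runs of white space, and single
# punctuation or symbol characters.
TOKEN = re.compile(r"\w+(?:'\w+)*|\s+|[^\w\s]")


def tokenize(text):
    """Return (kind, token) pairs where kind is 'word', 'space' or 'punct'."""
    tokens = []
    for match in TOKEN.finditer(text):
        tok = match.group()
        if tok[0].isspace():
            kind = "space"
        elif tok[0].isalnum() or tok[0] == "_":
            kind = "word"
        else:
            kind = "punct"
        tokens.append((kind, tok))
    return tokens


def word_count(text):
    """Count the word tokens produced by tokenize()."""
    return sum(1 for kind, _ in tokenize(text) if kind == "word")
```

Because segmentation and metrics would both consume the same token stream, the two processes agree on what constitutes a word.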

W3C ITS provides the rule set and in-document namespace directives for identifying translatable text within document elements and attributes. It is sufficient to create a W3C ITS rules file for a given XML vocabulary such as DITA or ODF to allow all such documents to be processed by OAXAL. There is no need to write separate filter programs for each XML vocabulary. W3C ITS is used by xml:tm to identify translatable text and segment it using SRX.
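The following non-normative Python sketch shows how a single generic routine might be driven by an ITS rules file instead of a vocabulary-specific filter. It assumes ITS 1.0 global rules using `its:translateRule` with `selector` and `translate` attributes. The sample vocabulary, the rules, and the function names are hypothetical, and only simple `//name` selectors are handled, because the Python standard library supports a limited XPath subset.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

# Hypothetical rules file for a hypothetical document vocabulary.
RULES_XML = f"""
<its:rules xmlns:its="{ITS_NS}" version="1.0">
  <its:translateRule selector="//code" translate="no"/>
  <its:translateRule selector="//note" translate="no"/>
</its:rules>
"""

DOC_XML = "<doc><p>Run <code>make</code> first.</p><note>internal</note><p>Done.</p></doc>"


def untranslatable(doc_root, rules_root):
    """Collect the elements that the rules mark as translate="no"."""
    excluded = set()
    for rule in rules_root.findall(f"{{{ITS_NS}}}translateRule"):
        if rule.get("translate") == "no":
            # ElementTree accepts relative paths only, so "//x" becomes ".//x".
            excluded.update(doc_root.findall("." + rule.get("selector")))
    return excluded


def translatable_text(elem, excluded):
    """Yield non-empty text chunks in document order, skipping excluded subtrees."""
    if elem in excluded:
        return
    if elem.text and elem.text.strip():
        yield elem.text.strip()
    for child in elem:
        yield from translatable_text(child, excluded)
        # Tail text belongs to the parent, so it is kept even after an excluded child.
        if child.tail and child.tail.strip():
            yield child.tail.strip()
```

Supporting a different vocabulary in this sketch means supplying a different rules file; the two functions stay unchanged.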

xml:tm provides the basis of sentence-based text extraction by XLIFF as well as the foundation for ICE matching and all in-document leveraged and fuzzy matching. xml:tm can also be used to create TMX files from the aligned source and target versions of the document.

GMX/V is used to provide all of the metrics for XLIFF extraction and matching, as well as xml:tm authoring metrics during the document life cycle.

XLIFF is used by GMX/V as the canonical form for metric-counting purposes, as well as providing the basis for TMX files based on the source and translated segments. Within OAXAL, XLIFF uses xml:tm to identify text units requiring translation, as well as GMX/V for metrics in terms of how many words/characters require translation.

The true benefits of OAXAL accrue from the ability to produce a generic processing model for XML Authoring and Localization. The xml:tm and XLIFF operations are conducted by general-purpose programs which are completely parameter driven by input from the other standards. This parameterization allows for an elegant and very efficient process that is totally generic and easy to maintain. This benefit can be extended to non-XML document formats by converting them to an XML form and then processing them via OAXAL.

In the traditional Localization scenario, there is little or no automation of the Localization process. A file, or group of files, is handed over to a Localization facility, and the subsequent workflow is made up of the following activities:

Each of the arrows in this workflow model represents a potential point of failure as well as manual intervention. Not only is this process very error prone, it also adds significantly to the cost of Localization. The following cost model for this scenario was presented by Prof. Reinhard Schäler of the Limerick University Localisation Research Centre at the Aslib Conference in London in 2002:

The lack of an automated workflow has a very detrimental effect on the Localization process. Without automation, considerable manual intervention is required, as is evidenced in the figure Traditional Localization Workflow. This lack of automation accounts for up to 50% of the total cost of Localization.

The Localization workflow using OAXAL significantly reduces the processing costs:

All of the processing steps, apart from the actual translation and QA activities, are completely automated. In addition, the use of XLIFF as the interchange standard means that translation can be presented via a browser interface, thus significantly simplifying the whole process.

These results lead in the long term to reduced translation and authoring costs as well as improvements in the quality of the documentation.

OAXAL is fundamentally rooted in the concept of a document life cycle. The life-cycle steps comprise the following:

Thus, a document is created. It is authored by one or more writers and submitted to editorial review and correction. The document is subsequently published in the source language and localized into one or more target languages for publication. The document is then subjected to further modifications according to the requirements of the business unit that is charged with maintaining it. The updated document then requires localization again to translate any new or modified text, and so on during its existence. This paradigm is typical of the vast majority of technical documentation life-cycle processes.

The unit of granularity defined by OAXAL is the text unit. A text unit is either of the following:

OAXAL can be viewed in terms of a workflow comprising a series of processes that interact with the source text:

The whole OAXAL environment is best viewed in terms of an authoring and localization workflow, encompassing the following:

An alternative translation-only workflow is possible, without the use of the xml:tm namespace:

The OAXAL processes described in the above-mentioned use cases can be broken down into the following fundamental operations:

This operation involves updating an XML document with the xml:tm namespace. The required standards used are Unicode TR29, SRX, and W3C ITS. The xml:tm namespace is used to allocate a unique identifier to each translatable sentence or individual translatable standalone text segment. These are referred to as text units. The identifier is immutable for each text unit for the lifespan of the document.
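A minimal, non-normative Python sketch of this operation follows. The namespace URI `urn:example:text-memory` and the element name `tu` are placeholders chosen for illustration and are not taken from the xml:tm specification. Sentence splitting is reduced to a single regular expression where a real implementation would apply SRX rules, and only elements without child markup are handled.

```python
import itertools
import re
import xml.etree.ElementTree as ET

TM_NS = "urn:example:text-memory"  # placeholder, not the real xml:tm namespace URI
ET.register_namespace("tm", TM_NS)


def seed(root, counter=None):
    """Wrap each sentence of every text-only element in an identified text unit."""
    if counter is None:
        counter = itertools.count(1)
    # Snapshot the elements first so the newly added units are not revisited.
    for elem in list(root.iter()):
        if len(elem) or not (elem.text and elem.text.strip()):
            continue  # this sketch skips elements with child markup or no text
        sentences = re.split(r"(?<=[.?!])\s+", elem.text.strip())
        elem.text = None
        for sentence in sentences:
            unit = ET.SubElement(elem, f"{{{TM_NS}}}tu", id=f"tu{next(counter)}")
            unit.text = sentence
    return root
```

Passing a shared counter allows identifiers to stay unique across several calls, for example when new material is added to a document later in its life cycle.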

For each update stage of a document, the original version of the document in its xml:tm form is required, as well as the updated version with a fresh xml:tm namespace. The two documents are compared, and any unchanged xml:tm text elements inherit the identifiers from the original version, thus maintaining the immutable identifiers.
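The comparison step can be sketched, non-normatively, in Python. The sketch assumes that each version of the document has already been reduced to an ordered list of text units, and it uses the standard library `difflib` module as a stand-in for whatever differencing method an implementation chooses. The function name `inherit_ids`, the `tuN` identifier format, and the status labels are hypothetical.

```python
import difflib


def inherit_ids(old_units, new_texts, next_id):
    """Carry identifiers over from the original version to the updated one.

    old_units: list of (identifier, text) pairs from the original version
    new_texts: list of text-unit strings from the updated version
    next_id:   first unused identifier number
    Returns a list of (identifier, text, status) for the updated version.
    """
    old_texts = [text for _, text in old_units]
    matcher = difflib.SequenceMatcher(a=old_texts, b=new_texts, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged text units keep their original identifiers.
            for uid, text in old_units[i1:i2]:
                result.append((uid, text, "unchanged"))
        elif tag in ("replace", "insert"):
            # New or modified text units receive fresh identifiers.
            for text in new_texts[j1:j2]:
                result.append((f"tu{next_id}", text, "new"))
                next_id += 1
        # "delete": original units with no counterpart are simply dropped.
    return result
```

Only the units marked as new need translation in this sketch; the unchanged units retain identifiers under which earlier translations can be found.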

The extraction process involves the identification of each translatable text unit, transferring it to an XLIFF document, and replacing the text unit with an identifier marking the location for the resultant translated text unit. A skeleton file is thus created with the placeholders for the translated text. The extraction process also involves extracting all translatable text and implementing all document-centered matching (ICE, leveraged, and fuzzy) as well as database matching, resulting in the creation of an XLIFF file along with a commensurate skeleton file.

If no database matching is attempted, then the whole extraction process can be implemented as an XSLT transformation.
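The extraction step without any matching can be sketched, non-normatively, in Python (used here in place of XSLT purely for illustration). The sketch assumes XLIFF 1.2 as the target format. The `%%%n%%%` placeholder syntax, the `original` file name, and the function name `extract` are hypothetical, and tail text in mixed content is ignored for brevity.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def extract(root, src_lang="en"):
    """Move element text into XLIFF trans-units, leaving placeholders behind.

    Returns (xliff_root, skeleton_root); the skeleton is the input tree, modified.
    """
    xliff = ET.Element(f"{{{XLIFF_NS}}}xliff", version="1.2")
    file_el = ET.SubElement(
        xliff,
        f"{{{XLIFF_NS}}}file",
        {"original": "doc.xml", "source-language": src_lang, "datatype": "xml"},
    )
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    n = 0
    for elem in root.iter():
        if elem.text and elem.text.strip():
            n += 1
            unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", id=str(n))
            ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = elem.text.strip()
            elem.text = f"%%%{n}%%%"  # hypothetical placeholder format
    return xliff, root
```

The placeholder number doubles as the trans-unit identifier, which is what later allows translated text to be returned to the correct location.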

The alternative extraction process does not use xml:tm versions of the document and does not implement document-centric matching. It involves the identification of translatable text, segmenting the text into sentences, transferring the resultant text units to an XLIFF document, and replacing the text units with an identifier marking the location for the resultant translated text unit. A skeleton file is thus created with placeholders for the translated text. The only matching available in this process is database matching, and the result is an XLIFF file along with a commensurate skeleton file.

If no database matching is attempted, then the whole extraction process can be implemented as an XSLT transformation.

MERGING involves recreating the target file from the translated XLIFF file and original skeleton file. The translated text is 'merged' with the skeleton file, replacing the placeholders for the translated text.
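A minimal, non-normative Python sketch of merging follows. It assumes that the skeleton is available as a string containing placeholders of the hypothetical form `%%%id%%%` and that the translated text units have already been read from the XLIFF file into a mapping from identifier to target text. The function name `merge` is also hypothetical.

```python
import re
from xml.sax.saxutils import escape


def merge(skeleton, translations):
    """Replace each %%%id%%% placeholder in the skeleton with its translation."""
    def substitute(match):
        # A KeyError here signals a text unit that has no translation.
        return escape(translations[match.group(1)])

    return re.sub(r"%%%(\w+)%%%", substitute, skeleton)
```

Escaping the translated text keeps the target file well-formed when a translation contains characters such as `&` or `<`.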

An optional stage is used to pack the xml:tm version of the document into the existing document as a zipped, base-64 encoded processing instruction. PACKING takes place in the following operations:

STRIPPING is the process of removing the xml:tm namespace from the document. STRIPPING takes place at any stage that requires a non-xml:tm version of the document.

The whole of the STRIPPING process can be implemented as an XSLT transformation. The result of this process is then suitable for output to a variety of output formats by using XSLT.
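For illustration only, the same unwrapping logic can be sketched in Python rather than XSLT. The namespace URI `urn:example:text-memory` is a placeholder and is not the real xml:tm namespace; the function names are hypothetical. Each element in that namespace is removed while its text, child elements, and trailing text are kept in their original order.

```python
import xml.etree.ElementTree as ET

TM_NS = "urn:example:text-memory"  # placeholder, not the real xml:tm namespace URI


def _append_text(parent, index, text):
    """Attach text just before position index in parent."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def strip_tm(parent):
    """Unwrap every descendant element that belongs to the TM namespace."""
    i = 0
    while i < len(parent):
        child = parent[i]
        strip_tm(child)  # clean the subtree first
        if child.tag.startswith("{" + TM_NS + "}"):
            grandchildren = list(child)
            parent.remove(child)
            _append_text(parent, i, child.text)
            for offset, grandchild in enumerate(grandchildren):
                parent.insert(i + offset, grandchild)
            _append_text(parent, i + len(grandchildren), child.tail)
            i += len(grandchildren)
        else:
            i += 1
    return parent
```

The result contains no elements from the placeholder namespace, so it can be passed to ordinary publishing transformations.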

The authors of this reference model envision that architects may wish to declare that their work is conformant with this reference model. Conforming to a reference model is not generally an easily automatable task, given that the reference model’s role is primarily to define concepts that are important to OAXAL rather than to give guidelines for implementing systems.

We do expect, however, that any given OAXAL-based architecture will reference the concepts outlined in this specification. As such, we expect that any design for a system that adopts the OAXAL approach will:

It is not appropriate for this specification to identify best practices with respect to building OAXAL-based systems. The ease with which the above elements can be identified within a given OAXAL-based system, however, could have significant impact on the scalability, maintainability, and ease of use of the system.

[CRC] - AUTODIN II Polynomial Cyclical Redundancy Check signature for a byte sequence. Provides a unique 32-bit signature for a given byte sequence.

[Localization] - Localization is the process of adapting a product or service to a particular language and culture. Translation usually forms a large part of localization, where the target language is different from the source language.

[Text Unit] - A text unit is either the complete text content of a document element or a subdivision of the same into identifiable sentences if possible.

[Workflow] - Workflow is a term used to describe the tasks, procedural steps, organizations, or people involved; required input and output information; and tools needed for each step in a business process.

The following Technical Committees provided the constituent Open Standards for OAXAL:

The following individuals were members of the committee during the development of this specification and are gratefully acknowledged: