Draft Guidelines for Filenames, URIs, Namespaces [and Metadata]
Editor: Robin Cover
Review Comments: send to email@example.com
Current Activity: produce condensed format with just the rules; add examples and commentary
Between February 2003 and February 2006, at least twenty-one (21) numbered drafts relating to "OASIS Naming Guidelines" were produced under an informal process, for review by OASIS Members, Chairs, TAB, Board, and others. These drafts have several different titles, suggesting variable focus and scope.
This document seeks closure on a few key decisions that need to be made in order to proceed with design and development of new document management facilities to support resources in the OASIS Open Library. The document editor recognizes that consensus will never be reached on a much longer list of naming issues about which stakeholders have strong opinions and disagreements.
The document provides a summary of key issues raised by members of the TAB in its recent production of draft documents on "artifact" guidelines, as well as some issues raised by reviewers of AIR/ASIS, including OASIS Staff. The issues are presented under one of two labels, reflecting the emerging consensus of opinion as of 2006-06, or lack thereof: (a) "Issues Resolved, Near Resolution, or with Substantial Agreement" and (b) "Issues Requiring Further Discussion". We believe that the conclusions reached in "(a)" are reasonable consensus positions, at least suitable for trial application. Further work on issues in "(b)" will seek to discover solutions that need to be provided in order to give guidance to TCs and to OASIS Staff programmers who will create document management software to support the naming guidelines.
As consensus emerges, we anticipate a phase-wise publication of the minimum guidelines and rules on a dynamic, evolving web site. Initially, we are inclined not to characterize these "Guidelines for Filenames, URIs, Namespaces, and Metadata" (under some title) as hardened "policy" but as a collection of guidelines which need to be tested and refined through use by TCs in connection with new document management software. Similarly, the provisional rules and guidelines will be tested by common practice and new use cases.
Not all topics addressed by the TAB during the period 2003-02 through 2006-02 are mentioned in this summary document: additional background, including theoretical matters and formal notations, is available in early drafts and supplemental materials, most recently in Artifact Standard Identification Scheme for Metadata 1.0.
Note on nomenclature: this summary document does not feature the term "artifact" (nor "requirements" nor "deliverable"), as feedback from reviewers of ASIS indicated that artifact probably does not represent a central concept, even if some of its defined characteristics are useful. This document uses the more familiar Web terms "resource" and "URI", along with "file" and sometimes "document", "specification", and "directory" (as a hierarchical element matching a URI path component). A taxonomy of resource types (previously in ASIS, "artifact types") will be considered separately as part of the metadata design effort.
- The rules and guidelines presented in this document and its predecessors are concerned chiefly with resources targeted for publication in the OASIS Open Library (http://docs.oasis-open.org/), and not with resources submitted to a TC's Kavi repository. In many cases, Kavi [as currently installed] does not or cannot support the requirements implicit in this design effort. Nor are the rules envisioned as applicable to resources published on other OASIS-owned Internet domains (e.g., uddi.org, dcml.org, cgmopen.org, psi.org, pkiforum.org, legalxml.org, topicmaps.org [being transferred to ISO/IEC JTC 1/SC 34], ebxml.org, www.oasis-open.org, etc.).
- The guidelines are not concerned with names used in XML markup constructs (e.g., names in elements, attributes, entities, PITargets, etc.)
- No retroactive application of rules is envisioned for resources already accessible via the OASIS Open Library; URIs and resources as currently exist will be grandfathered
- While the guidelines are intended to apply to any new resource uploaded to the OASIS Open Library, they are concerned mainly with specification-track documents and related resources (Working Drafts, Committee Drafts, Committee Specifications, Public Review Drafts, OASIS Standards) as defined by the TC Process document, and announced in November 2005.
- All resources installed in the OASIS Open Library are governed by the naming guidelines — whether documents produced by Technical Committees or materials contributed to OASIS from external sources, whether targeted for direct access on the file system or embedded within package files (.ZIP, .tgz).
- The guidelines are intended to fulfill the expectation articulated in the TC Process "2.18 Specification Quality": "All documents and other files produced by the TC, including specifications at any level of approval, must use the OASIS file naming scheme"
"Name Characters" here refers to characters used in URIs — including filenames, directory names, colon- or slash-delimited components within namespace URIs, delimiters, and possibly other URI subcomponents as may be labeled.
Beginning with one of the earliest drafts (Proposed Rules for OASIS Document File Naming, Working Draft 02, 18-February-2003), contributors to the twenty-some versions of the OASIS naming guidelines have agreed that a restricted character inventory for published names would best serve the needs of the organization. Experience using the Kavi system has confirmed that users (unconstrained) are likely to publish documents using problematic characters and character patterns in filenames/URIs, creating risks to interoperability and data integrity. Some such characters require hex (escape) representation because they are "Reserved Characters" in URI syntax, while others present risks because they are meaningful to the shell. Some of these potentially problematic characters include the at-sign (@), ampersand (&), left and right parenthesis, tilde (~), hash/pound-sign (#), dollar-sign ($), left and right square-bracket, plus-sign (+), colon (:), semicolon (;), etc.
While technical solutions are available to minimize problems arising from potentially problematic characters and character sequences, common best practice guidance urges avoiding them altogether; this conclusion has been supported in all twenty-some AIR/ASIS drafts and in the two OASIS member reviews.
- Ignoring usage constraints: The complete character set for naming OASIS resources is [0-9A-Za-z] plus "."(period), "-" (hyphen), "_" (underscore), "/" (slash), "#" (hash), and ":" (colon).
- In most contexts, allowable name characters include [0-9A-Za-z] plus "."(period) and "-" (hyphen)
- Underscore ("_") is allowed in designated contexts where use of hyphen is impractical or undesirable
- Hyphen is the preferred character for use as an internal delimiter between name subelements if an explicit character is used; juncture may also be marked by camel case orthography
- The slash ("/") or hash ("#") character is allowable in [HTTP scheme namespace] URIs according to the rules for URIs and XML namespaces.
- Colon (":") is expected in URN scheme namespaces
- Rules and rationalization for the (non-)use of punctuation characters (period, hyphen, underscore) in tcShortName, productName, and similar contexts
- Consideration of the PERCENT (%) character for use in IRIs
- Broad consideration of Unicode character representations for TC documents using non-roman alphabets/scripts (including characters in metadata values, etc) and a range of I18N issues
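The character inventory listed above can be checked mechanically. The following is a minimal sketch, assuming a check applied to one file or directory name at a time (not to a whole URI); the function name and the `allow_underscore` switch are hypothetical and are not part of any OASIS tooling. The guidelines do not say which contexts are "designated" for underscore, so that decision is left to the caller.

```python
import re

# Characters allowed "in most contexts": ASCII letters, digits, period, hyphen.
DEFAULT_NAME = re.compile(r"[0-9A-Za-z.-]+")
# Designated contexts may additionally allow the underscore.
UNDERSCORE_NAME = re.compile(r"[0-9A-Za-z._-]+")


def has_allowed_characters(name, allow_underscore=False):
    """Return True if every character of a single name is in the inventory.

    Slash, hash, and colon are deliberately excluded here: the guidelines
    allow them only as URI or namespace delimiters, not inside a name.
    """
    pattern = UNDERSCORE_NAME if allow_underscore else DEFAULT_NAME
    return pattern.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["saml-core-2.0-os.pdf", "draft_01.txt", "a+b.txt"]:
        print(candidate, has_allowed_characters(candidate))
```

An empty string is rejected, since the pattern requires at least one character.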
"Name construction" here refers to the lexical and syntactic structure of names, given the restricted character inventory. Motivations for the constraints include concerns for fidelity of interchange across file systems, minimizing the risks of common text-processing errors, usability (visual clarity), and other data QA. In other cases, arbitrary restriction of unbounded variability serves the goal of simplicity through uniformity.
- Mixed case in names is generally allowed, including camel case
- Case-sensitive interpretation by the OASIS server and users is to be expected
- Creating two or more file/directory names differing ONLY in case (at the same hierarchical level) is deprecated; for example, in directory FOO, one should not create Bar and BAR
- Componentized filenames (e.g., IETF-style filenames, using hyphen-separated metadata factoids) are allowed but not required
- Filenames and directory names should neither begin nor end with a punctuation character (period, hyphen, underscore).
- Filenames and directory names should not contain multiple consecutive punctuation characters
- Filenames should not have two or more (identical) filename extensions (example: NOT foo.xsd.xsd or bar.pdf.pdf).
- Filenames should normally have a terminal PERIOD + filenameExtension unless the media/mime type is "text/plain" and the filename is one of a recognized set of extensionless names in common use (e.g., CATALOG/catalog, README, ChangeLog).
- Filename extensions should conform to industry best practice, matching well-known MIME Media Types for resources commonly shared in Web space open systems
- TC members must not create filenames that compete with any reserved filenames used by the system/server or by OASIS staff for administrative purposes
- TC members should not use naming constructs that, in the judgment of the TC Administration, are likely to infringe, embarrass, confuse, shock, or otherwise fall outside the boundaries of social norms
- URIs for OASIS specifications should not contain the trademarked names of products, companies, and other corporate entities
- If the (trademarked) names of products, companies, and other corporate entities [other than "OASIS" or OASIS-owned products] are to be prohibited or deprecated as filename components, other than possibly in standard filename extensions, how can trademark search be facilitated? Parallel: OASIS TC Process document Section 2.2: "The name of the TC... such name not to have been previously used for an OASIS TC and not to include any trademarks or service marks not owned by OASIS..." and 2.18: "The name of any specification may not include any trademarks or service marks not owned by OASIS..."
- What guidance, if any, should be given about avoiding excessive length (character count) in filenames, in directory names, and [in aggregate] in URIs? Usability is the concern: tabular displays, email clients and email message archiving software that wrap long lines by introducing newLine and thus "break" clickable hyperlinks or lead to subsequent creation of broken hyperlinks; etc.
- (How) should we attempt to extensionally define "reasonable" assignment of names, covering e.g., MUST NOT be misleading, confusing, alarming, embarrassing, nor infringe on known intellectual property rights of others... etc
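Several of the construction rules above are purely lexical and could be checked by software before a file is installed. The sketch below is one possible reading of three of them (leading or trailing punctuation, consecutive punctuation, repeated identical extensions) plus the case-only collision rule. The function names are hypothetical, and the repeated-extension test is a simple heuristic intended for filenames only.

```python
import re
from collections import defaultdict

PUNCTUATION = ".-_"


def construction_problems(name):
    """Return a list of rule violations found in one filename."""
    problems = []
    if name and (name[0] in PUNCTUATION or name[-1] in PUNCTUATION):
        problems.append("begins or ends with punctuation")
    if re.search(r"[._-]{2,}", name):
        problems.append("consecutive punctuation characters")
    parts = name.split(".")
    # Heuristic: the last two period-separated parts are identical.
    if len(parts) >= 3 and parts[-1] == parts[-2]:
        problems.append("repeated filename extension")
    return problems


def case_collisions(names):
    """Group names in one directory that differ only in case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    return [sorted(group) for group in groups.values() if len(set(group)) > 1]


if __name__ == "__main__":
    print(construction_problems("foo.xsd.xsd"))
    print(case_collisions(["Bar", "BAR", "baz"]))
```

The rules about reserved names, trademarks, and "reasonable" naming call for human judgment or external lists and are not modeled here.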
- URIs for OASIS resources, as well as for HTTP scheme namespace URIs, should begin with the [exact, case-specific] 27 characters: http://docs.oasis-open.org/ unless specified otherwise by the OASIS TC Administration as an allowable exception
- Below the hierarchical level http://docs.oasis-open.org/[tcShortName]/, TCs are allowed relative freedom to create directories for their own use; all such directories and their contents will be publicly viewable via standard indexes and other navigation/browse facilities.
- URI aliasing using strategies approved by the TC Administration and supported by OASIS IT should be used for creating a persistent "Latest Version: " URI at which the most recent document version [of a certain type] may always be found
- URI aliasing using mechanisms other than those explicitly approved and for purposes other than "Latest Version: " URI requires consultation with the TC Administration
- Arbitrary URI aliasing (by any means) is forbidden, including, for example, unauthorized: (a) use of META-refresh elements (b) construction of URIs for canonical OASIS resources by using redirects from other Internet domains (e.g., http://tinyurl.com/, http://purl.oclc.org/)
- Assignment of URIs to resources is considered to be permanent except for a small class of approved exceptions, documented by the TC Administration (e.g., "Latest version: " URI), so files may be revisioned but not overwritten. This rule applies to secondary resources identified by fragment identifier.
- What guidance should be given to TCs about the method of lexical/syntactic construction for URIs that may or (by policy) WILL be used for the "current" or "latest" version-agnostic version of a specification, vis-à-vis the URI for a particular dated version?
- Should OASIS commit to supporting DNS+HTTP resolution of URIs used for functions, or properties, in accordance with or independent of a request from a Technical Committee? See for example http://www.w3.org/2005/08/ws-polling/HoldResponse, belonging to the collection of names in the http://www.w3.org/2005/08/ws-polling namespace. Similarly, a "Destination" property: http://www.w3.org/2005/08/addressing/feature/Destination.
- How should metadata and/or URI conventions be used to model and express relationships between multiple "files" that make up or are used to generate a compound document?
- In the event that TCs embed (meta-)data information in a path element or filename (or both), how can we normalize (synchronize) that information with a resource metadata record (DB), as well as with displayed body text in a document instance, embedded non-displayed markup constructs in a document instance, etc.?
- By what means can we avoid information redundancy in URIs (e.g., in path and filename portions: http://docs.oasis-open.org/security/saml/v3.0/spec-wd/r01/en/security-saml-v3.0-spec-wd-r01-en.html)
- Should a standard model be enforced for the location of specifications below the hierarchical level of a tcShortName? One proposal was [Hirsch/Clark v05]: "The first segment of the path following that domain name MUST be the TC Short Name specified as metadata for the document by Section 4. Example: "http://docs.oasis-open.org/security/..." The second segment of the path MUST be the Product Name specified as metadata for the document by Section 4, for all documents mentioned in the TC Process" [e.g., Working Draft, Committee Draft, Public Review Draft, Committee Specification, OASIS Standard]. This rule seems to work well for SSTC but not for DITA: http://docs.oasis-open.org/security/saml/ vs. http://docs.oasis-open.org/dita/dita/. This needs further discussion.
- Granting that TCs are allowed discretion in the creation of directories (URI path elements), what model hierarchical structures and "standard" directory names — if any — should be recommended so as to support some uniformity across TCs?
- (How) should we approach the matter of creating new URIs for legacy resources (conforming to the new rules/guidelines) without violating the Web Architecture "Good practice: Avoiding URI aliases"?
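The requirement that URIs begin with the exact, case-specific 27-character root can be illustrated as follows. This is a minimal sketch; the function name is hypothetical, and it assumes that the first path segment is the tcShortName, which matches the examples in this document but is not yet a settled rule. Approved exceptions granted by the TC Administration are not modeled.

```python
OPEN_LIBRARY_ROOT = "http://docs.oasis-open.org/"  # exactly 27 characters


def tc_short_name(uri):
    """Return the first path segment if the URI starts with the exact root.

    The comparison is case-sensitive on purpose: a URI that differs from
    the root only in case is treated as non-conforming. Returns None when
    the root does not match or no path segment follows it.
    """
    if not uri.startswith(OPEN_LIBRARY_ROOT):
        return None
    remainder = uri[len(OPEN_LIBRARY_ROOT):]
    segment = remainder.split("/", 1)[0]
    return segment or None


if __name__ == "__main__":
    print(len(OPEN_LIBRARY_ROOT))
    print(tc_short_name("http://docs.oasis-open.org/security/saml/"))
```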
- Any of the three common types of namespace names (URI references) are allowed: hash type, slash type, and simple (no-trailing-delimiter) type
- URN-based namespaces are also allowed
- HTTP scheme namespace URIs should be rooted at http://docs.oasis-open.org/[TC-shortName]/ or (preferably) at [TC-shortName]/[productName]/ — or possibly otherwise, as negotiated with TC Administration
- HTTP scheme namespace URIs must resolve to some informative resource, ideally meeting the requirements for a namespace document; OASIS will supply a default namespace document if the TC designates/supplies no resource for resolution
- URIs intended for use as HTTP scheme URI namespace names should be formally identified by the TC (as early in the specification design process as possible) so that the OASIS TC Administration may check for possible naming collisions, approve the proposed resolution target resource [namespace document], and properly reserve the URI — including possibly reservation of (all) space below the hierarchical level of the candidate NS URI
- In accordance with Disposition of Names in an XML Namespace (edited by Norman Walsh for the W3C Technical Architecture Group - TAG), TCs should provide information about change policies for XML namespaces
- Given the possible dual use of an HTTP scheme URI as (a) a namespace URI and (b) an identifier for a directory node, it seems reasonable to clarify expectations about server behaviors with respect to dereferencing HTTP scheme namespace URIs and about possible conflicts arising from contention/overloading. For example:
- If a TC elects to define a namespace name (URI reference) based upon "Type 3: Simple Namespace HTTP scheme URI," e.g., http://www.w3.org/2000/svg — what should happen when a user dereferences the namespace URI with an appended slash character ("/")?
- Should the TC be prohibited or strongly discouraged from installing files in the directory pathElementN/PathElementM/FinalNS-Element/ in the case where the namespace URI is (slashless) pathElementN/PathElementM/FinalNS-Element?
- Granting that TCs may designate a specific resource that is fetched when an HTTP scheme namespace URI is dereferenced: should OASIS set a policy or recommend a best-practice rule about server behavior with respect to this trait: (a) resolution delivers the designated resource to the client, and browser address window retains the NS URI in the window, vs. (b) resolution has the effect of changing the URI in the browser address window to the URI matching the delivered resource (NOT matching the NS URI)?
- Most issues relating to versioning remain unresolved. There is universal agreement that versioned resources should be identifiable as members of a sequence, and that document management interfaces should make version identification easy. From among several proposals about identifier notation (enumerators) and labeling, no proposal has been accepted as the correct solution.
- How should the terms "version", "draft", "revision", and "edition" [see ODF "Second Edition"] be used in a manner consistent with the OASIS TC Process terminology and with users' expectations?
- How may the terms be optimized for precise specification use and also applied to non-specification-track documents?
- How should the terms "version", "draft", "revision", "edition" (or their canonical abbreviations), along with specification-stage identifiers (wd, cd, pr, cs, os) and numeric identifiers (cardinal and ordinal numbers) be used consistently:
- abstractly, in the titles of specifications
- as displayed on specification cover pages [possibly separate from use in titles]
- in URI path elements above the level of the filename
- in filenames
ASIS sections on metadata have been removed in this document, as metadata design has been targeted for work as a separate design effort, to be revisited following the conclusion of OASIS Staff design on specification templates, search requirements, and other functional requirements that are part of the document management system design. Results from this design will be incorporated into the "Guidelines for Filenames, URIs, Namespaces, [and Metadata]" document at a later stage.
A significant conclusion emerged from the two public reviews of AIR (July 2005) and ASIS (February 2006): the OASIS membership does not welcome a policy mandating the use of structured filenames which use hyphen-delimited metadata components, IETF-style. In some cases it may be natural and desirable to use some "metadata" information in filenames, but we heard strong negative reaction against the early proposal to make a componentized schema required. The current plan is to coordinate investigation about metadata requirements around site-wide search functionality, then to align the metadata model(s) with usage in specification templates and (other) markup embedding guidelines. Further support for (optional use of) componentized filenames might be reconsidered later (e.g., when it would make sense to generate suggested filenames from a metadata record).
Design document titles: An incomplete listing for various versions of "AIR", variously titled:
- Specification Template Instructions
- OASIS Document File Naming
- Object Naming Guidelines (ONG)
- Artifact Naming Guidelines (ANG)
- Artifact Identification Guidelines (AIG)
- Artifact Identification Requirements (AIR)
- Artifact Standard Identification Scheme for Metadata (ASIS)
- OASIS Document Policy
- OASIS Deliverable Policy
- OASIS File Naming Scheme
- OASIS URI, Filename, and Metadata Policy
- OASIS Filenaming and URI Rules
See Eve Maler in Proposed Rules for OASIS Document File Naming, February 2003: "Hyphens must be used as separators of the major portions of a file name. Spaces must not be used. Hyphens are recommended between words within the description and extended description portions, though underscores may be used. Hyphens are preferred because they are easier to see in displayed URIs and easier to type. Lowercase spelling is recommended..."
While the collection of naming rules is intended to apply to all resources deposited into the OASIS Open Library, rules may apply variably to different document genres, file formats, and as a function of specification status. Thus, while rules for allowable characters in file and directory names would apply universally, rules governing namespace definition would be applied differently to contributed specifications vs. TC-approved specifications.
The characters "?" (question-mark) and "=" (equals) may be recommended at some future time for use in the query component of a URI, should OASIS provide implementations that use such query elements.
TC members involved in naming are encouraged to consider the context in which URIs are likely to be used; in some print media, the UNDERSCORE character is indistinguishable from other "blank" characters, and in the context of common Web practice, may be ambiguous.
File names reserved for (future) administrative use include any files significant to the Apache server (e.g., .htaccess; *.cgi; *.conf or matching any Apache config files; mime.types) and files used by Staff for uniform browsing/navigation (e.g., index.html, index.htm, etc). A complete list must be provided.
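A check against reserved names might look like the sketch below. The pattern list contains only the examples named above and is therefore incomplete by construction; the function name is hypothetical. Matching is case-sensitive, on the assumption that the server treats names case-sensitively.

```python
from fnmatch import fnmatchcase

# Only the examples given in the text; a complete list is still to be provided.
RESERVED_PATTERNS = [
    ".htaccess",
    "*.cgi",
    "*.conf",
    "mime.types",
    "index.html",
    "index.htm",
]


def is_reserved(filename):
    """Return True if the filename matches any reserved pattern (case-sensitive)."""
    return any(fnmatchcase(filename, pattern) for pattern in RESERVED_PATTERNS)


if __name__ == "__main__":
    for candidate in ["index.html", "httpd.conf", "saml-core.pdf"]:
        print(candidate, is_reserved(candidate))
```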
The goal of the naming guidelines is to provide a set of loose constraints under which TCs can adopt naming practices suitable to their application. In boundary cases, where some naming construct is judged problematic for technical, political, or social reasons, the TC Administration will attempt to negotiate an acceptable solution that avoids the problem, but in some cases, may need to exercise authority, which may be appealed by a TC.
See for example DocBook : "Historically, DocBook was in no namespace. Starting with DocBook V5.0, DocBook is in a namespace: http://docbook.org/ns/docbook, the namespace name for DocBook. In time, other modules may also have their own namespace."
Once assigned to a resource, an identifier (URI) should never be retired and re-assigned to some other resource: the relationship between identifier and resource should be considered fixed and unseverable. This applies to primary resources (e.g., a conceptual whole document) and to secondary resources associated with a fragment identifier component of a URI [post-pound # fragment portion]. Some CMS products [Moin Wiki] will rewrite the value of an (X)HTML ID attribute when a document is saved, breaking URI references that link to internal document components.
Some TCs want to hard-link to XML schemas from namespace URIs rather than to separate "namespace documents". We can honor that option by saying in the rules that the namespace URI MUST resolve rather than return a harsh 404 status code. The WebArch document's section on Namespace documents notes that there are many methods of accomplishing the effect of a namespace document, including documents based upon XML Schema (XSDs), to follow "Good practice: Namespace documents": "The owner of an XML namespace name SHOULD make available material intended for people to read and material optimized for software agents in order to meet the needs of those who will use the namespace vocabulary." The W3C document says: "the following are examples of data formats for namespace documents: OWL Web Ontology Language Reference (OWL), Resource Directory Description Language (RDDL), XML Schema Part 1: Structures (XML Schema), and XHTML 1.1-Module-based XHTML. Each of these formats meets different requirements described above for satisfying the needs of an agent that wants more information about the namespace..." It seems quite reasonable that an HTTP scheme namespace URI (namespace name, URI reference) should resolve to something useful and informative, whether an XML schema or other representation which fulfills the general requirement of "useful information." TCs should be able to indicate the resource to be delivered when the URI is dereferenced; we expect that resource to be located under the TC's web site root. If the TC designates/provides no such resource, OASIS TC Administration would do so.
Several solutions have been offered for a required or recommended practice of identifying "versions" of specifications using words and enumerators. One scheme would apply the term "revision" to any intermediate non-approved documents between major status levels — where '#' is a digit:
- Working Draft ## - any number of working drafts
- Committee Draft ## - a draft approved as CD by ballot
- Committee Draft ## R## (revision number) - revisions to an approved CD
- Committee Specification ##
- Committee Specification ## R##
- OASIS Standard
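To make the proposed notation concrete, the labels listed above could be recognized with a small parser. This is only a sketch of one proposal that has not been accepted; the function name and the returned tuple shape are hypothetical. It assumes that "##" means exactly two digits and that Working Drafts carry no "R##" suffix, since the list shows none.

```python
import re

LABEL = re.compile(
    r"(?P<stage>Working Draft|Committee Draft|Committee Specification)"
    r" (?P<number>\d{2})"
    r"(?: R(?P<revision>\d{2}))?"
)


def parse_label(label):
    """Parse a status label into (stage, number, revision), or return None."""
    if label == "OASIS Standard":
        return ("OASIS Standard", None, None)
    match = LABEL.fullmatch(label)
    if match is None:
        return None
    revision = match.group("revision")
    # Assumption: the scheme lists no revisions for Working Drafts.
    if match.group("stage") == "Working Draft" and revision is not None:
        return None
    return (
        match.group("stage"),
        int(match.group("number")),
        int(revision) if revision is not None else None,
    )


if __name__ == "__main__":
    print(parse_label("Committee Draft 02 R01"))
```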
Methods for uploading and installing resources in the OASIS Open Library include use of compressed archives or packages like ZIP and tar+gzip. It is expected that filenames and directory names created by a package extract operation will conform to the naming rules just as if the files were uploaded individually. In order to make all resources directly visible to human users (not requiring a download + extract-on-local-machine operation) and accessible to indexing for search purposes, all files in packages will be extracted and installed in the named directories. All package files uploaded to the OASIS Open Library will be retained at the canonical URI.
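Because extracted names must conform just as individually uploaded names do, a package could be screened before extraction. The sketch below inspects a ZIP archive held in memory and reports members whose path segments contain characters outside the default inventory. The function name is hypothetical, the underscore is treated as not allowed (an assumption, since the designated contexts for it are undefined), and tar+gzip packages are not covered.

```python
import io
import re
import zipfile

# Default inventory for one path segment: letters, digits, period, hyphen.
ALLOWED_SEGMENT = re.compile(r"[0-9A-Za-z.-]+")


def nonconforming_members(archive_bytes):
    """Return archive member paths containing a non-conforming segment."""
    bad = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as package:
        for member in package.namelist():
            # Member paths in ZIP archives use "/" as the separator;
            # directory entries end with "/" and yield an empty last segment.
            segments = [s for s in member.split("/") if s]
            if not all(ALLOWED_SEGMENT.fullmatch(s) for s in segments):
                bad.append(member)
    return bad


if __name__ == "__main__":
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("schemas/core.xsd", "placeholder")
        package.writestr("schemas/my file(1).xsd", "placeholder")
    print(nonconforming_members(buffer.getvalue()))
```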
The OASIS web server(s) used for resources in the OASIS Open Library will respect the authoritative, canonical, exact (mixed-case) spelling used in official OASIS URIs, viz., in the path and filename components of the URI. Protecting the quality of URIs and the identity of URI-addressable resources depends critically upon respecting case: Unicode, used in XML and almost all modern applications, is case-sensitive, so that 'foo' and 'Foo' as identifiers are different. Most XML processing depends upon respect for case (XML schema, XML DTDs, etc). Therefore, using the exact (normative, canonical, authoritative) correct character tokens, including correct case, with respect to subdirectories and files is critical. The Apache server as currently configured is doing the right thing: rejecting requests for approximate URIs. We may provide assistance for 404s but will not deliver mis-identified documents silently.
See Upgrading OASIS document and file management services, posted by Peter Roden November 18, 2005: "We plan exclusively to use the [Internet] domain 'docs.oasis-open.org' for public access to approved work product of its technical committees. The 'docs' subdomain is in optional use today. By 'approved', we mean all work that has been approved under our TC Process rules as a Committee Draft, Public Review Draft, Committee Specification or OASIS Standard..."
Type 1: Slash Namespace HTTP scheme URI
Type 2: Hash Namespace HTTP scheme URI
Type 3: Simple Namespace HTTP scheme URI
Type 4: URN-based Namespaces
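The four types above can be told apart from the form of the namespace name alone. A minimal sketch follows, assuming that the trailing delimiter is the only distinguishing feature of the three HTTP scheme types; the function name and the returned labels are illustrative only.

```python
def namespace_type(namespace_name):
    """Classify a namespace name as one of the four types, or return None."""
    # The "urn" scheme name is case-insensitive.
    if namespace_name.lower().startswith("urn:"):
        return "Type 4: URN-based"
    if not namespace_name.startswith("http://"):
        return None
    if namespace_name.endswith("#"):
        return "Type 2: Hash"
    if namespace_name.endswith("/"):
        return "Type 1: Slash"
    return "Type 3: Simple"


if __name__ == "__main__":
    for name in [
        "http://www.w3.org/2000/svg",
        "http://docbook.org/ns/docbook",
        "urn:oasis:names:tc:SAML:2.0:assertion",
    ]:
        print(name, "->", namespace_type(name))
```

Both HTTP examples are taken from this document and classify as Type 3, since neither ends in a slash or hash.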