Search Web Services Version 1.0

Discussion Document

2 November 2007

 

 

 

 

 

 

 

 

 

 

URIs:

http://docs.oasis-open.org/search-ws/v1.0/DiscussionDocument.doc

http://docs.oasis-open.org/search-ws/v1.0/DiscussionDocument.pdf

http://docs.oasis-open.org/search-ws/v1.0/DiscussionDocument.html

Technical Committee:

OASIS Search Web Services TC

Chair(s):

            Ray Denenberg

            Matthew Dovey

Related work:

This specification replaces or supersedes:

·                     SRU 1.2

 

This specification is related to:

·         ISO 23950

·         NISO Z39.92

 

 

Status:

This document has no official status. It was prepared by the OASIS Search Web Services TC as a strawman proposal, for public review, intended to generate discussion.  It is not a Committee Draft.

 

Purpose of this Document

This specification is based on the SRU (Search Retrieve via URL) specification which can be found at http://www.loc.gov/standards/sru/.  It is expected that this standard, when published, will deviate from SRU. How much it will deviate cannot be predicted at this time. The fact that the SRU spec is used as a starting point for development should not be cause for concern that this might be an effort to fast track SRU.  The committee hopes to preserve the useful features of SRU, but not to preserve those that are not considered useful.

 

The OASIS Technical Committee developing this standard has decided to request OASIS to release this as a discussion document.  Detailed review of this document is premature at this point, but feedback on the functionality and approach is solicited.

 

Open Issues

There are several current open issues before the committee not reflected in the body of the document.

 There is a wiki for the committee at http://wiki.oasis-open.org/search-ws/FrontPage, and an issues list at http://wiki.oasis-open.org/search-ws/issues

These issues are summarized here:

 

  1. Binary representation within records
    The protocol must support the inclusion of binary objects within records. External mechanisms exist to provide this support. The issue is whether the standard needs to define an explicit mechanism.

  2. Parameterized query support
    The protocol should support parameterized queries. Should they be supported within CQL, should CQL be a special case of parameterized query, or should the two be defined separately?

  3. OpenSearch
    The specification is intended to subsume the OpenSearch functionality. The existing OpenSearch specification is regarded as a legacy specification, and this standard will also show how the protocol interoperates with that specification. This has not been sufficiently addressed in this draft.

  4. XML/WSDL
    The committee determined that it is premature to write XML/WSDL for the protocol, so there is a stub section with a pointer to the current SRU XML. XML/WSDL will be written later.

  5. Operation Parameter
    There is a suggestion to eliminate the operation parameter, incorporating it instead in the base URL in some fashion. (This is not done in this draft.) The reason for the suggestion is that this parameter is not consistent with REST principles.

  6. ATOM (or RSS) as a response schema
    There is a proposal to replace the SRU response schema with ATOM or RSS. The current draft adds a parameter allowing the client to request an alternative schema. One schema should be singled out in the standard as mandatory. Currently that would be the SRU response schema, and the proposal is to make ATOM or RSS the single required schema instead.

  7. Scan
    There is a suggestion to eliminate the Scan operation, and instead represent this functionality via search/retrieve.

  8. XCQL
    There is a suggestion to eliminate XCQL, which is an XML representation of the CQL query. It is not used in a request, only in the echoed response. Some implementors find it useful to have the query echoed in a parsed form. However, its existence causes confusion.

  9. State
    There is discussion within the committee over how stateful the protocol (as currently defined) is. Some say it is not stateful at all. Others feel that the result set model is stateful. There are in fact two points of debate: whether the protocol is stateful, and whether it should be.



 

Notices

Copyright © OASIS® 2007. All Rights Reserved.

 

All capitalized terms in the following text have the meanings assigned to them in the OASIS Intellectual Property Rights Policy (the "OASIS IPR Policy"). The full Policy may be found at the OASIS website.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published, and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this section are included on all such copies and derivative works. However, this document itself may not be modified in any way, including by removing the copyright notice or references to OASIS, except as needed for the purpose of developing any document or deliverable produced by an OASIS Technical Committee (in which case the rules applicable to copyrights, as set forth in the OASIS IPR Policy, must be followed) or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by OASIS or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and OASIS DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY OWNERSHIP RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

OASIS requests that any OASIS Party or any other party that believes it has patent claims that would necessarily be infringed by implementations of this OASIS Committee Specification or OASIS Standard, to notify OASIS TC Administrator and provide an indication of its willingness to grant patent licenses to such patent claims in a manner consistent with the IPR Mode of the OASIS Technical Committee that produced this specification.

OASIS invites any party to contact the OASIS TC Administrator if it is aware of a claim of ownership of any patent claims that would necessarily be infringed by implementations of this specification by a patent holder that is not willing to provide a license to such patent claims in a manner consistent with the IPR Mode of the OASIS Technical Committee that produced this specification. OASIS may include such claims on its website, but disclaims any obligation to do so.

OASIS takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on OASIS' procedures with respect to rights in any document or deliverable produced by an OASIS Technical Committee can be found on the OASIS website. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this OASIS Committee Specification or OASIS Standard, can be obtained from the OASIS TC Administrator. OASIS makes no representation that any information or list of intellectual property rights will at any time be complete, or that any claims in such list are, in fact, Essential Claims.

The names "OASIS", [insert specific trademarked names, abbreviations, etc. here] are trademarks of OASIS, the owner and developer of this specification, and should be used only to refer to the organization and its official outputs. OASIS welcomes reference to, and implementation and use of, specifications, while reserving the right to enforce its marks against misleading uses. Please see http://www.oasis-open.org/who/trademark.php for above guidance.

Table of Contents

1 Introduction

1.1 Terminology

1.2 Normative References

1.3 Non-Normative References

2 Search Web Service Overview

3 Contextual Query Language

3.1 Query Syntax

3.1.1 Basic Query Structure

3.1.2 Search Clause

3.1.3 Search Term

3.1.4 Index Name

3.1.5 Relation

3.1.6 Relation Modifiers

3.1.7 Boolean Operators

3.1.8 Boolean Modifiers

3.1.9 Proximity Modifiers

3.1.10 Sorting

3.1.11 Prefix Assignment

3.1.12 Case Sensitivity

3.2 BNF

3.3 Context Sets

4 The searchRetrieve operation

4.1 Request Parameters

4.2 Response Parameters

4.3 Version: the “version” Parameter

4.4 Records

4.4.1 Record Parameters

4.4.2 Record Packing

4.5 Result Sets

4.5.1 Result Set Model

4.5.2 resultSetId

4.5.3 ResultSet Idle Time

4.6 Diagnostics

4.6.1 Diagnostic Categories: Fatal vs. Non-fatal, and Surrogate vs. Non-Surrogate

4.6.2 Diagnostic Schema

4.7 Extensions: the “extraRequestData”, “extraResponseData”, and “extraRecordData” Parameters

4.8 Echoing the Request: the “echoedSearchRetrieveRequest” Parameter

4.8.1 xQuery

4.8.2 baseUrl

4.9 Stylesheets: the “stylesheet” Parameter

5 Scan Operation

5.1 Request Parameters

5.2 Response Parameters

5.3 Terms

5.4 Example Scan Response

6 The Explain Facility

6.1 Explain Operation

6.1.1 Request Parameters

7 XML and WSDL Files

8 Transports

8.1 HTTP Get Binding

8.1.1 Syntax

8.1.2 Encoding Issues

8.1.3 Server Procedure

8.2 HTTP Post Binding

8.3 SOAP Binding

8.3.1 SOAP Requirements

8.3.2 SOAP Parameter Differences

8.3.3 Extension Parameters via SOAP

A. The CQL Context Set

A.1 Indexes

A.2 Relations

A.2.1 Implicit Relations

A.2.2 Defined Relations

A.3 Relation Modifiers

A.3.1 Functional Modifiers

A.3.2 Term-format Modifiers

A.3.3 Masking

A.4 Booleans

A.5 Boolean Modifiers

Note about Proximity Units

B. Diagnostics

C. NISO Z39.92 (ZeeRex)

D. OpenSearch

D.1 OpenSearch Description Document

D.2 OpenSearch URL Template

D.3 OpenSearch Response Elements

E. Authentication, Authorization, and Access Control

E.1 Authentication

E.2 Authorization and Access Control

E.3 IP Address

E.4 Basic Authentication

E.5 Secure Sockets

E.6 Additional Message Data

E.7 Web Services Security and Security Assertion Markup Language (SAML) Security Tokens

 


1      Introduction

1.1 Terminology

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in [RFC2119].

1.2 Normative References

[RFC2119]               S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, http://www.ietf.org/rfc/rfc2119.txt, IETF RFC 2119, March 1997.

           

1.3 Non-Normative References

           

2      Search Web Service Overview

 

The Search web service is a means of opening a database to external enquiry in a standardized manner that facilitates discovery of query and response possibilities and makes it possible for heterogeneous databases to be queried simultaneously with the same or similar queries.  Client software can be easily configured using a standardized XML explain document that is accessible from the base URL or via the explain operation.  In contrast with protocols such as SQL and XQuery, detailed knowledge of a database’s structure is not necessary as the explain document contains parsable information on server defaults, searchable indexes and record schemas that are returned in the response.

 

Context sets can be made for use with the search web service that define standard index names and search attributes thus facilitating multi-database searching via either a single or similar searches.   Profiles can be registered combining context sets and record schemas and so ensure inter-operability in a variety of domains.

 

Two kinds of enquiry access are defined: search via keywords or phrases, which returns a result set of records, and scan via terms, which returns a list of terms in an index.

 

A search or scan can be expressed in a simple URL, enabling a search to be embedded in any web page. The server may send the results with an accompanying XML style sheet, so the service can be used widely in web pages without any underlying programming.
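As a rough illustration of a search expressed in a simple URL, the following Python sketch assembles a request URL. The base URL `http://example.org/search` is hypothetical, and the parameter names `operation`, `version`, and `query` are assumed here from SRU 1.2; this draft may change them (see the open issue on the operation parameter).

```python
from urllib.parse import urlencode

def build_search_url(base_url, cql_query, version="1.2"):
    """Return a searchRetrieve URL; parameter names are assumed from SRU 1.2."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,  # urlencode percent-encodes the CQL query
    }
    return base_url + "?" + urlencode(params)

# Hypothetical server; the query is a CQL search clause.
url = build_search_url("http://example.org/search", 'dc.title = "monkey house"')
print(url)
```

The resulting string can be placed directly in an HTML link, which is what allows a search to be embedded in a web page.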

 

 

3      Contextual Query Language

CQL, the Contextual Query Language, is a formal language for representing queries to information retrieval systems such as web indexes, bibliographic catalogs and museum collection information. The design objective is that queries be human readable and writable, and that the language be intuitive while maintaining the expressiveness of more complex languages.

Traditionally, query languages have fallen into two camps: powerful, expressive languages that are not easily read or written by non-experts (e.g. SQL, PQF, and XQuery); or simple and intuitive languages that are not powerful enough to express complex concepts (e.g. CCL and Google). CQL tries to combine simplicity and intuitiveness of expression for simple, everyday queries with the richness of more expressive languages to accommodate complex concepts when necessary.

3.1 Query Syntax

3.1.1 Basic Query Structure

A CQL query consists of either a single search clause [example a], or multiple search clauses connected by boolean operators [example b]. It may have a sort specification at the end, following the 'sortBy' keyword [example c]. In addition it may include prefix assignments which assign short names to context set identifiers [example d].

 

Examples:

a.             dc.title = fish

b.            dc.title = fish or dc.creator = sanderson

c.             dc.title = fish sortBy dc.date/sort.ascending

d.            > dc = "info:srw/context-sets/1/dc-v1.1" dc.title any fish

 

3.1.2 Search Clause

A search clause consists of either an index, a relation and a search term [example a], or a search term by itself [example b]. If the clause consists of just a term, then the index is treated as 'cql.serverChoice', and the relation is treated as '=' [example c]. (Therefore examples b and c are semantically equivalent.)

 

Examples:

a.     dc.title = fish

b.     fish

c.     cql.serverChoice = fish
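The defaulting rule for a bare search term can be sketched as follows. The function name `normalize_clause` and the tuple representation are illustrative assumptions, not part of the specification.

```python
def normalize_clause(term, index=None, relation=None):
    """Return (index, relation, term), filling in the defaults for a bare term."""
    # A clause is either index + relation + term, or a term on its own.
    if (index is None) != (relation is None):
        raise ValueError("index and relation must be supplied together")
    if index is None:
        index, relation = "cql.serverChoice", "="
    return (index, relation, term)

bare = normalize_clause("fish")
explicit = normalize_clause("fish", "cql.serverChoice", "=")
```

Both calls yield the same triple, mirroring the statement that a bare term and its fully spelled-out form are semantically equivalent.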

 

3.1.3 Search Term

Search terms MAY be enclosed in double quotes [example a], though need not be [example b]. Search terms MUST be enclosed in double quotes if they contain any of the following characters: < > = / ( ) and whitespace [example c]. The search term may be an empty string [example d], but must be present in a search clause. The empty search term has no defined semantics.

 

Examples:

a.     "fish"

b.     fish

c.     "squirrels fish"

d.     ""
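A client that builds queries from user input might apply the quoting rule along these lines. This is a minimal sketch: the function name `quote_term` is hypothetical, and escaping an embedded double quote with a backslash is taken from the charString2 definition in section 3.2.

```python
import re

# Characters that force quoting: < > = / ( ) and any whitespace.
_NEEDS_QUOTES = re.compile(r'[<>=/()\s]')

def quote_term(term):
    """Return the term in a form that is safe to place in a CQL search clause."""
    if term == "" or '"' in term or _NEEDS_QUOTES.search(term):
        # Embedded double quotes are escaped with a backslash (see 3.2).
        return '"' + term.replace('"', '\\"') + '"'
    return term
```

For instance, `quote_term("fish")` leaves the term unchanged, while `quote_term("squirrels fish")` wraps it in double quotes.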

 

3.1.4 Index Name

An index name always includes a base name [example a] and may also include a prefix [example b], which determines the context set of which the index is a part. The base name and the prefix are separated by a dot character ('.'). If multiple '.' characters are present, then the first should be treated as the prefix/base name delimiter. If the prefix is not supplied, it is determined by the server.

Examples:

a.     title any "fish dog"

b.     dc.title any "fish dog"
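The rule that the first '.' separates prefix from base name can be sketched as below. The function name `split_index_name` is an illustrative assumption; a missing prefix is reported as `None` because the server, not the client, determines it.

```python
def split_index_name(name):
    """Split an index name into (prefix, base name) at the first '.' character."""
    prefix, dot, base = name.partition(".")
    if not dot:
        # No prefix supplied: the context set is left for the server to decide.
        return (None, name)
    return (prefix, base)
```

With this sketch, `split_index_name("dc.title")` gives `("dc", "title")`, and a name with several dots such as `"a.b.c"` gives `("a", "b.c")`.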

 

3.1.5 Relation

The relation in a search clause specifies the relationship between the index and search term. Like an index name, a relation always includes a base name [example a] and may also include a prefix providing a context for the relation [example b]. If a relation does not have a prefix, the context set is 'cql'. If no relation is supplied in a search clause, then '=' is assumed, which means that the relation is determined by the server. (As noted above, if the relation is omitted then the index MUST also be omitted; the relation is then assumed to be "=" and the index is assumed to be cql.serverChoice; that is, the server chooses both the index and the relation.)

 

Examples:

  a. dc.title any "fish frog"
    Find records where the title (as defined by the "dc" context set) contains one of the words "fish", "frog".
  b. dc.title cql.any "fish frog"
    This query has the same meaning as the previous one, since the default context set for the relation is "cql".
  c. dc.title cql.all "fish frog"
    Find records where the title contains all of the words "fish", "frog".

 

3.1.6 Relation Modifiers

Relations may be modified by one or more relation modifiers. Relation modifiers always include a base name, and may include a prefix for a context set [example a] as above. If a prefix is not supplied, the context set is 'cql'. Relation modifiers are separated from each other and from the relation by forward slash characters('/'). Whitespace may be present on either side of a '/' character, but the relation plus modifiers group may not end in a '/' [example b]. Relation modifiers may also have a comparison symbol and a value. The comparison symbol is any of = < <= > >= <>. The value must obey the same rules for quoting as search terms, above [example c].

Examples:

  a. dc.title any/relevant fish
    The relation modifier "relevant" means that the server should use a relevancy algorithm for determining matches and the order of the result set. When the relevant modifier is used, the actual relation is often not significant.

 

  b. dc.title any/ relevant /cql.string fish

    (we need to explain this one or drop it.)

 

  c. title any/rel.algorithm=cori fish
    This example is distinguished from example a, in which the modifier "relevant" is from the CQL context set. In this case the modifier is "algorithm=cori", from the rel context set, in essence meaning: use the relevance algorithm "cori". A description of this context set is available at http://srw.cheshire3.org/contextSets/rel/
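The modifier syntax described in this section can be illustrated with a small parser. It is a sketch under simplifying assumptions: the name `parse_relation` and the tuple layout are hypothetical, and quoted modifier values that themselves contain '/' are not handled.

```python
import re

# Modifier name, optional comparison symbol, optional value.
_MODIFIER = re.compile(r'^([^\s<>=]+)\s*(<=|>=|<>|=|<|>)?\s*(.*)$')

def parse_relation(text):
    """Split 'relation/modifier/...' into (relation, [(prefix, name, symbol, value), ...])."""
    parts = [part.strip() for part in text.split("/")]
    if "" in parts:
        raise ValueError("empty relation or modifier (for example a trailing '/')")
    modifiers = []
    for part in parts[1:]:
        match = _MODIFIER.match(part)
        if match is None:
            raise ValueError("malformed modifier: " + part)
        name, symbol, value = match.groups()
        prefix, dot, base = name.partition(".")
        if not dot:
            prefix, base = "cql", name  # default context set for modifiers
        modifiers.append((prefix, base, symbol, value if symbol else None))
    return parts[0], modifiers
```

Whitespace around '/' is tolerated, a missing prefix defaults to the 'cql' context set, and a relation group ending in '/' is rejected, as the text above requires.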

 

3.1.7 Boolean Operators

Search clauses may be linked by boolean operators. These are: and, or, not and prox [example in 3.1.8]. Note that not is 'and-not' and must not be used as a unary operator. Boolean operators all have the same precedence; they are evaluated left-to-right. Parentheses may be used to override left-to-right evaluation [example d].

 

Examples:

a.     dc.title = "monkey house" and dc.creator = vonnegut

b.    dc.title = "monkey house" not dc.creator = vonnegut

c.     dc.title = fish or dc.creator = sanderson

d.    dc.title = fish or (dc.creator = sanderson and dc.identifier = "id:1234567")
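The effect of equal precedence and left-to-right evaluation can be shown with Python sets standing in for result sets. This is only a sketch: the record identifiers are made up, the function names are hypothetical, and prox is omitted because it needs positional information.

```python
def combine(left, operator, right):
    """Apply one boolean operator to two sets of record identifiers."""
    op = operator.lower()
    if op == "and":
        return left & right
    if op == "or":
        return left | right
    if op == "not":  # 'not' is and-not, never a unary operator
        return left - right
    raise ValueError("unsupported operator: " + operator)

def evaluate(operands, operators):
    """Evaluate operand0 op0 operand1 op1 operand2 ... strictly left to right."""
    result = operands[0]
    for operator, operand in zip(operators, operands[1:]):
        result = combine(result, operator, operand)
    return result

a, b, c = {1, 2}, {2, 3}, {3, 4}
flat = evaluate([a, b, c], ["or", "and"])   # a or b and c  ==  (a or b) and c
grouped = combine(a, "or", combine(b, "and", c))  # a or (b and c)
```

Here `flat` and `grouped` differ, which is why parentheses are needed whenever the intended grouping is not the left-to-right one.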

3.1.8 Boolean Modifiers

Booleans may be modified by one or more boolean modifiers, separated as per relation modifiers with '/' characters. Again, boolean modifiers consist of a base name and may include a prefix determining the modifier's context set [example a]. If not supplied, then the context set is 'cql'. As per relation modifiers, they may also have a comparison symbol and a value [example b].

Examples:

  a. dc.title = fish or/rel.combine=sum dc.creator any sanderson

    [We need an explanation here of what relevance means when applied to a boolean (as opposed to a relation). We never have understood this. If we can't describe it then delete this example.]
  b. dc.title = monkey prox/unit=word/distance>1 dc.title = house
    Find records where both "monkey" and "house" are in the title, separated by at least one intervening word.

 

3.1.9 Proximity Modifiers

Basic proximity modifiers are defined in the CQL context set [reference]. Proximity units 'word', 'sentence', 'paragraph', and 'element' are defined there and may also be defined in other context sets. Within the CQL set they are explicitly undefined. When defined in another context set they may be assigned specific meaning.

 

Thus compare "prox/unit=word" with "prox/xyz.unit=word". In the first, 'unit' is a prox modifier from the CQL set, and as such its values are undefined, so 'word' is subject to interpretation by the server. In the second, 'unit' is a prox modifier defined by the xyz context set, which may assign the unit 'word' a specific meaning.

 

The context set xyz may define additional units, for example, 'street':

 

 prox/xyz.unit="street"

 

This approach, 'prox/xyz.unit="street"', is chosen rather than 'prox/unit=xyz.street' for the following reason. In the first case, 'unit' is a modifier defined in the xyz context set, and 'street' is a value defined for that modifier. In the second, 'unit' is a modifier from the cql context set, so its value would have to be one that is defined in the cql context set, whereas 'street' is defined in a different set. The first approach avoids pairing a modifier from one set with a value from another, which can lead to unpredictable results.

 

3.1.10 Sorting

Queries may include explicit information on how to sort the result set generated by the search. (See result set model [reference].)

The sort specification is included at the end, and is separated by a 'sortBy' keyword. The specification consists of an ordered list of indexes, potentially with modifiers, to use as keys on which to sort the result set. If multiple keys are given, then the second and subsequent keys should be used to determine the order of items that would otherwise sort together. Each index used as a sort key has the same semantics as when it is used to search.

 

Modifiers may be attached to the index in the same way as to booleans and relations in the main part of the query. These modifiers may be part of any context set, including the CQL context set and the Sort context set [reference]. This is the only time when a modifier may be attached to an index.  If a modifier may be used in this way it should be stated in the description of its semantics.  As many types of search also require specification of term order (for example the <, > and within relations), these modifiers are often specified as relation modifiers.

 

Examples:

  1. "cat" sortBy dc.title
  2. "dinosaur" sortBy dc.date/sort.descending dc.title/sort.ascending
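The way second and subsequent keys break ties can be sketched with an in-memory sort. The records and field names below are invented for illustration, and a real server would sort on its own indexes; the sketch relies on Python's sort being stable.

```python
# Hypothetical records; "date" and "title" stand in for dc.date and dc.title.
records = [
    {"title": "Raptors", "date": "2005"},
    {"title": "Allosaurus", "date": "1999"},
    {"title": "Brontosaurus", "date": "2005"},
]

def sort_records(records, keys):
    """Sort by a list of (field, descending) pairs, most significant key first."""
    result = list(records)
    # Sorting by the least significant key first lets each later, more
    # significant pass keep the earlier order among equal items.
    for field, descending in reversed(keys):
        result.sort(key=lambda record: record[field], reverse=descending)
    return result

# Analogous to: sortBy dc.date/sort.descending dc.title/sort.ascending
ordered = sort_records(records, [("date", True), ("title", False)])
titles = [record["title"] for record in ordered]
```

The two 2005 records sort together on date, so the title key decides their relative order.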

 

3.1.11 Prefix Assignment

 Note: The use of Prefix Maps is expected to be uncommon.

 A Prefix Map may be used to assign context set names to specific identifiers in order to be sure that the server maps them in a desired fashion. It may occur at any place in the query and applies to anything below the map in the query tree. A prefix assignment is specified by: '>' shortname '=' identifier [example a]. The shortname and '=' sign may be omitted, in which case it sets a default context set for indexes [example b].

 

Examples:

a.     > dc = "info:units/direct-current" dc.voltage > 12
This example illustrates that while "dc" is almost always used as the prefix for the Dublin Core context set, this is not always so: in this case it is bound to a different context set, the one identified by "info:units/direct-current".

b.     >  "info:units/direct-current" voltage > 12
This query has the same meaning as example a.
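A simplified reading of prefix assignments might look like the following. The sketch assumes that assignments appear only at the start of the query and that identifiers are always double-quoted; the function name and the use of `None` as the key for the default context set are illustrative choices.

```python
import re

# '>' [shortname '='] "identifier"
_ASSIGNMENT = re.compile(r'\s*>\s*(?:([^\s"=<>/()]+)\s*=\s*)?"([^"]*)"\s*')

def strip_prefix_assignments(query):
    """Return (prefix map, remaining query) for assignments at the start of a query."""
    prefixes = {}
    position = 0
    while True:
        match = _ASSIGNMENT.match(query, position)
        if match is None:
            break
        shortname, identifier = match.groups()
        prefixes[shortname] = identifier  # key None means the default context set
        position = match.end()
    return prefixes, query[position:]
```

Applied to the two examples above, the first yields a map from "dc" to the identifier, and the second yields the same identifier under the `None` key.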

3.1.12 Case Sensitivity

All parts of CQL are case insensitive apart from user supplied search terms, values for modifiers and prefix map identifiers, which may or may not be case sensitive. If any case insensitive part of CQL is specified with  mixed upper and lower case, it is for aesthetic purposes only.

 

3.2 BNF

Following is the Backus Naur Form (BNF) definition for CQL. ( "::=" represents "is defined as".)

 

sortedQuery      ::= prefixAssignment sortedQuery
                   | scopedClause ['sortby' sortSpec]

sortSpec         ::= sortSpec singleSpec | singleSpec

singleSpec       ::= index [modifierList]

cqlQuery         ::= prefixAssignment cqlQuery
                   | scopedClause

prefixAssignment ::= '>' prefix '=' uri
                   | '>' uri

scopedClause     ::= scopedClause booleanGroup searchClause
                   | searchClause

booleanGroup     ::= boolean [modifierList]

boolean          ::= 'and' | 'or' | 'not' | 'prox'

searchClause     ::= '(' cqlQuery ')'
                   | index relation searchTerm
                   | searchTerm

relation         ::= comparitor [modifierList]

comparitor       ::= comparitorSymbol | namedComparitor

comparitorSymbol ::= '=' | '>' | '<' | '>=' | '<=' | '<>' | '=='

namedComparitor  ::= identifier

modifierList     ::= modifierList modifier | modifier

modifier         ::= '/' modifierName [comparitorSymbol modifierValue]

prefix, uri, modifierName, modifierValue, searchTerm, index ::= term

term             ::= identifier | 'and' | 'or' | 'not' | 'prox' | 'sortby'

identifier       ::= charString1 | charString2

charString1      ::= Any sequence of characters that does not include any of the following:

                       whitespace
                       ( (open parenthesis)
                       ) (close parenthesis)
                       =
                       <
                       >
                       '"' (double quote)
                       /

                     If the final sequence is a reserved word, that token is returned instead. Note that '.' (period) may be included, and a sequence of digits is also permitted. Reserved words are 'and', 'or', 'not', and 'prox' (case insensitive). When a reserved word is used in a search term, case is preserved.

charString2      ::= Double quotes enclosing a sequence of any characters except double quote (unless preceded by backslash (\)). Backslash escapes the character following it. The resultant value includes all backslash characters except those releasing a double quote (this allows other systems to interpret the backslash character). The surrounding double quotes are not included.
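The lexical rules for charString1 and charString2 can be illustrated with a small tokenizer. This is a sketch rather than a reference lexer: the token labels and function names are hypothetical, and 'sortby' is treated as a keyword here because it appears in the grammar's 'term' rule, although the reserved-word list above does not name it.

```python
import re

_TOKEN = re.compile(r'''
    \s*(?:
        (?P<quoted>"(?:[^"\\]|\\.)*")       # charString2: double-quoted string
      | (?P<symbol><=|>=|<>|==|[=<>/()])    # comparison symbols, slash, parentheses
      | (?P<bare>[^\s()=<>"/]+)             # charString1: unquoted run of characters
    )''', re.VERBOSE)

KEYWORDS = {"and", "or", "not", "prox", "sortby"}

def unquote(quoted):
    """Drop the surrounding quotes and the backslashes that release a double quote."""
    return quoted[1:-1].replace('\\"', '"')

def tokenize(query):
    """Return a list of (kind, value) pairs, kind being 'term', 'symbol' or 'keyword'."""
    query = query.rstrip()
    tokens, position = [], 0
    while position < len(query):
        match = _TOKEN.match(query, position)
        if match is None:
            raise ValueError("cannot tokenize at offset %d" % position)
        position = match.end()
        if match.group("quoted") is not None:
            tokens.append(("term", unquote(match.group("quoted"))))
        elif match.group("symbol") is not None:
            tokens.append(("symbol", match.group("symbol")))
        elif match.group("bare").lower() in KEYWORDS:
            tokens.append(("keyword", match.group("bare").lower()))
        else:
            tokens.append(("term", match.group("bare")))
    return tokens

tokens = tokenize('dc.title any "fish frog" and dc.date >= 2000')
```

Note that a quoted reserved word such as "and" comes back as an ordinary term with its case preserved, while the unquoted form is reported as a keyword.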