InstanceToSchema


Description

InstanceToSchema generates a RELAX NG schema from XML instances. It is a command line tool with no graphical user interface. It is written in Java and requires J2SE 1.3 or 1.4 and a JAXP-compliant SAX parser to run.

A typical use case is to obtain a description of the structure of one or several (combined) XML files. Note that the tool uses only a small part of the RELAX NG language.

InstanceToSchema is developed within the xmloperator project and shares its license, but it is packaged separately and can be used independently of the XML editor.

Download

G  R  C  Date        Executable            Source
1  0  2  2003-10-30  i2s_1_0_2.zip (51Kb)  i2s_1_0_2_src.zip (94Kb)

Running

There is one Java archive: i2s.jar. The main class is org.xmloperator.i2s.SchemaGenerator.

Command line arguments are the XML instance URIs. The resulting RELAX NG schema is written to standard output.

Options. The first argument may be preceded by a "-stat" option, which adds to each element definition of the resulting schema a comment containing the occurrence count.

As an example, the following Windows command builds a schema.rng file (with statistics) from a test.xml file, using a Xerces parser:

java.exe -cp i2s.jar;xml-apis.jar;xercesImpl.jar org.xmloperator.i2s.SchemaGenerator -stat test.xml > schema.rng
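The command above uses the Windows classpath separator (";"). As a rough, portable sketch, the same invocation could be launched from Java. The class name RunI2sSketch and the helper command() are hypothetical, and the three jars are assumed to sit in the working directory; only the main class name, the "-stat" option and the use of standard output come from the description above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher, not part of the i2s distribution.
public class RunI2sSketch {

    /** Builds the java command line, joining the jars with the platform's path separator. */
    static List<String> command(boolean stat, String... instanceUris) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        // Assumption: the jars are in the current working directory.
        cmd.add(String.join(File.pathSeparator, "i2s.jar", "xml-apis.jar", "xercesImpl.jar"));
        cmd.add("org.xmloperator.i2s.SchemaGenerator");
        if (stat) {
            cmd.add("-stat"); // the option precedes the first instance URI
        }
        cmd.addAll(Arrays.asList(instanceUris));
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        Process process = new ProcessBuilder(command(true, "test.xml"))
                .redirectOutput(new File("schema.rng"))          // the schema arrives on standard output
                .redirectError(ProcessBuilder.Redirect.INHERIT)  // keep error messages visible
                .start();
        System.exit(process.waitFor());
    }
}
```

On Unix-like systems File.pathSeparator is ":", so the same code yields the equivalent of the command above with ":" between the jars.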

Design

The software is based on pattern categories. A pattern category represents a set of RELAX NG patterns. For each element name, the tool builds a pattern category that is compatible with all the input XML instances and is as precise as possible.

The following pattern category types are implemented:

All these pattern categories treat elements and attributes as independent. However, the tool framework does not require that: new pattern categories could correlate elements and attributes. The tool also does not infer datatypes.
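To illustrate the idea of a category that is "compatible with all instances and as precise as possible", here is a minimal sketch that tracks only how often each child element occurs. It is not the tool's actual implementation: the names CardinalitySketch, Cardinality and ElementCategory are hypothetical, and the sketch uses current Java features (generics, Map.of) rather than J2SE 1.3/1.4.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the real pattern category classes are not shown in this document.
public class CardinalitySketch {

    /** How often a child element may occur inside its parent. */
    enum Cardinality {
        ONE, OPTIONAL, ONE_OR_MORE, ZERO_OR_MORE;

        /** Most precise cardinality that accepts one observed occurrence count. */
        static Cardinality of(int count) {
            if (count == 0) return OPTIONAL;
            return count == 1 ? ONE : ONE_OR_MORE;
        }

        boolean allowsZero() { return this == OPTIONAL || this == ZERO_OR_MORE; }
        boolean allowsMany() { return this == ONE_OR_MORE || this == ZERO_OR_MORE; }

        /** Most precise cardinality that accepts everything both operands accept. */
        Cardinality merge(Cardinality other) {
            boolean mayBeAbsent = allowsZero() || other.allowsZero();
            boolean mayRepeat = allowsMany() || other.allowsMany();
            if (mayBeAbsent) return mayRepeat ? ZERO_OR_MORE : OPTIONAL;
            return mayRepeat ? ONE_OR_MORE : ONE;
        }
    }

    /** Child cardinalities collected for one element name. */
    static class ElementCategory {
        private final Map<String, Cardinality> children = new HashMap<>();
        private int occurrences = 0;

        /** Widens the category just enough to accept one more occurrence of the element. */
        void observe(Map<String, Integer> childCounts) {
            // Children known from earlier occurrences: merge with the new count (0 if absent).
            for (Map.Entry<String, Cardinality> e : children.entrySet()) {
                int count = childCounts.getOrDefault(e.getKey(), 0);
                e.setValue(e.getValue().merge(Cardinality.of(count)));
            }
            // Children seen for the first time were absent from every earlier occurrence.
            for (Map.Entry<String, Integer> e : childCounts.entrySet()) {
                if (!children.containsKey(e.getKey())) {
                    Cardinality c = Cardinality.of(e.getValue());
                    children.put(e.getKey(), occurrences == 0 ? c : c.merge(Cardinality.OPTIONAL));
                }
            }
            occurrences++;
        }

        Cardinality cardinalityOf(String child) { return children.get(child); }
    }

    public static void main(String[] args) {
        ElementCategory book = new ElementCategory();
        book.observe(Map.of("title", 1, "author", 2)); // first <book>
        book.observe(Map.of("title", 1));              // second <book>, no author
        System.out.println(book.cardinalityOf("title"));  // prints ONE
        System.out.println(book.cardinalityOf("author")); // prints ZERO_OR_MORE
    }
}
```

Each observation can only widen the category, so the result accepts every occurrence seen so far while staying as narrow as this simple four-value model allows.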

The tool is suitable for processing large documents. It uses only one SAX parsing pass. The required memory space depends on the element name count and the complexity of patterns, not the document size.
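The memory claim can be illustrated with a small single-pass SAX handler that keeps one counter per element name, much like the occurrence counts reported with "-stat". This is only a sketch under assumptions: the class ElementCounter is hypothetical and is not taken from the i2s sources.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of single-pass processing; not the tool's own handler.
public class ElementCounter extends DefaultHandler {

    // One entry per distinct element name, however large the document is.
    private final Map<String, Integer> counts = new TreeMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        counts.merge(qName, 1, Integer::sum);
    }

    /** Parses the source once and returns the occurrence count of each element name. */
    static Map<String, Integer> count(InputSource source) throws Exception {
        ElementCounter handler = new ElementCounter();
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return handler.counts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<list><item/><item/><note/></list>";
        System.out.println(count(new InputSource(new StringReader(xml))));
        // prints {item=2, list=1, note=1}
    }
}
```

Because the handler never builds a tree of the document, its state grows with the number of distinct element names, not with the number of elements parsed.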

The set of pairs (element name, pattern category) is translated to a RELAX NG simple syntax data model (the same one used by the XML editor), which is converted to a more readable full syntax and written out with indentation.
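The last step, writing an indented schema, might look roughly like the sketch below. It bypasses the simple syntax data model entirely and emits one define per element name from a plain map. The class RngWriterSketch and the map layout are hypothetical; only the RELAX NG element names and namespace URI are standard. It also assumes element names without namespace prefixes, treats several children as a sequence, and writes an empty element as <empty/>, which may differ from what the tool does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the tool's own writer and data model are not shown in this document.
public class RngWriterSketch {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0;

    /** Appends one line at the current indentation (two spaces per level). */
    private void line(String text) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(text).append('\n');
    }

    private void open(String tagWithAttributes) { line("<" + tagWithAttributes + ">"); depth++; }

    private void close(String tagName) { depth--; line("</" + tagName + ">"); }

    /**
     * model: element name -> (child name -> wrapper), where wrapper is "",
     * "optional", "zeroOrMore" or "oneOrMore".
     */
    static String write(String root, Map<String, Map<String, String>> model) {
        RngWriterSketch w = new RngWriterSketch();
        w.open("grammar xmlns=\"http://relaxng.org/ns/structure/1.0\"");
        w.open("start");
        w.line("<ref name=\"" + root + "\"/>");
        w.close("start");
        for (Map.Entry<String, Map<String, String>> def : model.entrySet()) {
            w.open("define name=\"" + def.getKey() + "\"");
            w.open("element name=\"" + def.getKey() + "\"");
            if (def.getValue().isEmpty()) {
                w.line("<empty/>"); // assumption: no children means empty content
            }
            for (Map.Entry<String, String> child : def.getValue().entrySet()) {
                String wrapper = child.getValue();
                if (!wrapper.isEmpty()) w.open(wrapper);
                w.line("<ref name=\"" + child.getKey() + "\"/>");
                if (!wrapper.isEmpty()) w.close(wrapper);
            }
            w.close("element");
            w.close("define");
        }
        w.close("grammar");
        return w.out.toString();
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> model = new LinkedHashMap<>();
        Map<String, String> listChildren = new LinkedHashMap<>();
        listChildren.put("item", "zeroOrMore");
        model.put("list", listChildren);
        model.put("item", new LinkedHashMap<>());
        System.out.print(write("list", model));
    }
}
```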

History

1.0.2 (2003-10-30)

This release fixes a bug in PatternExtractor: the last character of any piece of text was not taken into account.

Thanks again to Mike for the test case.

1.0.1 (2003-10-22)

This release fixes two bugs in GroupPatternCategoryImpl.

Thanks to Mike Brown, whose test case revealed both bugs.

1.0.0 (2003-02-13)

First public release.


Last update: 2003-10-30. Copyright (c) 2000-2003 The_xmloperator_project