java.lang.Object
   |
   +----com.jclark.xml.tok.Encoding
An Encoding object corresponds to a possible
encoding (a mapping from characters to sequences of bytes).
It provides operations on byte arrays
that represent all or part of a parsed XML entity in that encoding.
The ASCII characters excluding $@\^`{}~
have a special status; these are called XML significant
characters.
This class imposes certain restrictions on an encoding:
Several methods operate on byte subarrays. The subarray is specified
by a byte array buf and two integers,
off and end; off
gives the index in buf of the first byte of the subarray
and end gives the
index in buf of the byte immediately after the last byte.
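The half-open convention described above (off inclusive, end exclusive) can be illustrated with a minimal sketch. The class SubarrayDemo and the method countByte are hypothetical helpers written for this illustration; they are not part of com.jclark.xml.tok.

```java
// Hypothetical helper, not part of com.jclark.xml.tok.
final class SubarrayDemo {
    /** Counts occurrences of target in buf[off..end), where end is exclusive. */
    static int countByte(byte[] buf, int off, int end, byte target) {
        if (off < 0 || end > buf.length || off > end) {
            throw new IndexOutOfBoundsException("off=" + off + ", end=" + end);
        }
        int count = 0;
        for (int i = off; i < end; i++) {
            if (buf[i] == target) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] buf = {'<', 'a', '>', '<', 'b', '>'};
        // The subarray covers indices 0, 1 and 2 only.
        System.out.println(countByte(buf, 0, 3, (byte) '<'));
    }
}
```

With this convention the length of the subarray is always end - off, and an empty subarray is written as off == end.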
Use the getInitialEncoding method to get an
Encoding object to use to start parsing an entity.
The main operations provided by Encoding are
tokenizeProlog, tokenizeContent and
tokenizeCdataSection;
these are used to divide up an XML entity into tokens.
tokenizeProlog is used for the prolog of an XML document
as well as for the external subset and parameter entities (except
when referenced in an EntityValue);
it can also be used for parsing the Misc* that follows
the document element.
tokenizeContent is used for the document element and for
parsed general entities that are referenced in content
except for CDATA sections.
tokenizeCdataSection is used for CDATA sections, following
the <![CDATA[ up to and including the ]]>.
tokenizeAttributeValue and tokenizeEntityValue
are used to further divide up tokens returned by tokenizeProlog
and tokenizeContent; they are also used to divide up entities
referenced in attribute values or entity values.
public static final int TOK_DATA_CHARS
public static final int TOK_DATA_NEWLINE
public static final int TOK_START_TAG_NO_ATTS
Represents a start-tag <name>
that doesn't have any attribute specifications.
public static final int TOK_START_TAG_WITH_ATTS
Represents a start-tag <name att="val">
that contains one or more attribute specifications.
public static final int TOK_EMPTY_ELEMENT_NO_ATTS
Represents an empty-element tag <name/>
that doesn't have any attribute specifications.
public static final int TOK_EMPTY_ELEMENT_WITH_ATTS
Represents an empty-element tag <name att="val"/>
that contains one or more attribute specifications.
public static final int TOK_END_TAG
Represents an end-tag </name>.
public static final int TOK_CDATA_SECT_OPEN
Represents the start of a CDATA section <![CDATA[.
public static final int TOK_CDATA_SECT_CLOSE
Represents the end of a CDATA section ]]>.
public static final int TOK_ENTITY_REF
public static final int TOK_MAGIC_ENTITY_REF
Represents a reference to one of the five predefined entities amp, lt, gt,
quot, apos.
public static final int TOK_CHAR_REF
public static final int TOK_CHAR_PAIR_REF
public static final int TOK_PI
public static final int TOK_XML_DECL
Represents an XML declaration or text declaration (a
processing instruction whose target is xml).
public static final int TOK_COMMENT
Represents a comment <!-- comment -->.
This can occur both in the prolog and in content.
public static final int TOK_ATTRIBUTE_VALUE_S
public static final int TOK_PARAM_ENTITY_REF
public static final int TOK_PROLOG_S
public static final int TOK_DECL_OPEN
Represents <!NAME in the prolog.
public static final int TOK_DECL_CLOSE
Represents > in the prolog.
public static final int TOK_NAME
public static final int TOK_NMTOKEN
public static final int TOK_POUND_NAME
Represents #NAME in the prolog.
public static final int TOK_OR
Represents | in the prolog.
public static final int TOK_PERCENT
Represents % in the prolog that does not start
a parameter entity reference.
This can occur in an entity declaration.
public static final int TOK_OPEN_PAREN
Represents ( in the prolog.
public static final int TOK_CLOSE_PAREN
Represents ) in the prolog that is not
followed immediately by any of
*, + or ?.
public static final int TOK_OPEN_BRACKET
Represents [ in the prolog.
public static final int TOK_CLOSE_BRACKET
Represents ] in the prolog.
public static final int TOK_LITERAL
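public static final int TOK_NAME_QUESTION
Represents a name followed immediately by ?.
public static final int TOK_NAME_ASTERISK
Represents a name followed immediately by *.
public static final int TOK_NAME_PLUS
Represents a name followed immediately by +.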
public static final int TOK_COND_SECT_OPEN
Represents <![ in the prolog.
public static final int TOK_COND_SECT_CLOSE
Represents ]]> in the prolog.
public static final int TOK_CLOSE_PAREN_QUESTION
Represents )? in the prolog.
public static final int TOK_CLOSE_PAREN_ASTERISK
Represents )* in the prolog.
public static final int TOK_CLOSE_PAREN_PLUS
Represents )+ in the prolog.
public static final int TOK_COMMA
Represents , in the prolog.
public abstract int convert(byte sourceBuf[],
int sourceStart,
int sourceEnd,
char targetBuf[],
int targetStart)
The bytes in sourceBuf between sourceStart
and sourceEnd are converted to characters and stored
in targetBuf starting at targetStart.
(targetBuf.length - targetStart) * getMinBytesPerChar()
must be greater than or equal to
sourceEnd - sourceStart.
If getFixedBytesPerChar returns a value greater than 0,
then the return value will be equal to
(sourceEnd - sourceStart)/getFixedBytesPerChar().
Returns the number of characters stored into targetBuf.
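The capacity precondition and the fixed-width return value can be checked with a small standalone sketch. ConvertSketch, requiredTargetChars and convertUtf16BE are hypothetical names used only for illustration; the second method assumes a fixed two-byte big-endian encoding and is not the library's implementation of convert.

```java
// Hypothetical sketch, not the library's implementation.
final class ConvertSketch {
    /**
     * Smallest number of free chars in the target that satisfies
     * free * minBytesPerChar >= sourceEnd - sourceStart.
     */
    static int requiredTargetChars(int sourceStart, int sourceEnd, int minBytesPerChar) {
        int byteCount = sourceEnd - sourceStart;
        return (byteCount + minBytesPerChar - 1) / minBytesPerChar; // ceiling division
    }

    /** Converts big-endian two-byte units; returns the number of chars stored. */
    static int convertUtf16BE(byte[] sourceBuf, int sourceStart, int sourceEnd,
                              char[] targetBuf, int targetStart) {
        int n = 0;
        for (int i = sourceStart; i + 1 < sourceEnd; i += 2) {
            int high = sourceBuf[i] & 0xFF;     // mask because Java bytes are signed
            int low = sourceBuf[i + 1] & 0xFF;
            targetBuf[targetStart + n] = (char) ((high << 8) | low);
            n++;
        }
        return n;
    }
}
```

For a fixed width of 2 bytes, the value returned by the sketch equals (sourceEnd - sourceStart)/2, which matches the relationship stated above.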
public abstract int getFixedBytesPerChar()
Returns the number of bytes required to represent each char,
or zero if different chars are represented by different
numbers of bytes. The value returned will be 0, 1, 2, or 4.
public abstract void movePosition(byte buf[],
int off,
int end,
Position pos)
On entry, pos gives the position of the byte at index
off in buf.
On exit, pos will give the position of the byte at index
end, which must be greater than or equal to off.
The bytes between off and end must encode
one or more complete characters.
A carriage return followed by a line feed will be treated as a single
line delimiter provided that they are given to movePosition
together.
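The line-delimiter rule can be sketched with a standalone tracker. PositionSketch is a hypothetical class, not the library's Position; it assumes one byte per character, a 1-based line number and a 0-based column, none of which is stated on this page.

```java
// Hypothetical sketch; assumes a single-byte, ASCII-compatible encoding.
final class PositionSketch {
    int line = 1;    // assumed 1-based
    int column = 0;  // assumed 0-based

    /** Advances the position over buf[off..end). */
    void move(byte[] buf, int off, int end) {
        for (int i = off; i < end; i++) {
            byte b = buf[i];
            if (b == '\r') {
                // CR LF counts as one delimiter only when both bytes are in this call.
                if (i + 1 < end && buf[i + 1] == '\n') {
                    i++;
                }
                line++;
                column = 0;
            } else if (b == '\n') {
                line++;
                column = 0;
            } else {
                column++;
            }
        }
    }
}
```

Because the sketch keeps no state between calls, a carriage return at the end of one call and a line feed at the start of the next are counted as two delimiters, which is the situation the text above warns about.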
public final int tokenizeCdataSection(byte buf[],
int off,
int end,
Token token) throws EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException
Returns one of the following token types:
TOK_DATA_CHARS
TOK_DATA_NEWLINE
TOK_CDATA_SECT_CLOSE
Information about the token is stored in token.
After TOK_CDATA_SECT_CLOSE is returned, the application
should use tokenizeContent.
public final int tokenizeContent(byte buf[],
int off,
int end,
ContentToken token) throws PartialTokenException, InvalidTokenException, EmptyTokenException, ExtensibleTokenException
Returns one of the following token types:
TOK_START_TAG_NO_ATTS
TOK_START_TAG_WITH_ATTS
TOK_EMPTY_ELEMENT_NO_ATTS
TOK_EMPTY_ELEMENT_WITH_ATTS
TOK_END_TAG
TOK_DATA_CHARS
TOK_DATA_NEWLINE
TOK_CDATA_SECT_OPEN
TOK_ENTITY_REF
TOK_MAGIC_ENTITY_REF
TOK_CHAR_REF
TOK_CHAR_PAIR_REF
TOK_PI
TOK_XML_DECL
TOK_COMMENT
Information about the token is stored in token.
When TOK_CDATA_SECT_OPEN is returned,
tokenizeCdataSection should be called until
it returns TOK_CDATA_SECT_CLOSE.
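The hand-off between tokenizeContent and tokenizeCdataSection amounts to a two-state machine, sketched here. ContentModeSketch, Mode and next are hypothetical, and the integer values given to the constants are placeholders, since this page does not list the real values.

```java
// Hypothetical sketch of which tokenizer to call next.
final class ContentModeSketch {
    // Placeholder values; the real constants live in Encoding.
    static final int TOK_DATA_CHARS = 0;
    static final int TOK_CDATA_SECT_OPEN = 1;
    static final int TOK_CDATA_SECT_CLOSE = 2;

    enum Mode { CONTENT, CDATA_SECTION }

    /** Returns the mode to use for the next token, given the token just returned. */
    static Mode next(Mode mode, int tokenType) {
        if (mode == Mode.CONTENT && tokenType == TOK_CDATA_SECT_OPEN) {
            return Mode.CDATA_SECTION;   // switch to tokenizeCdataSection
        }
        if (mode == Mode.CDATA_SECTION && tokenType == TOK_CDATA_SECT_CLOSE) {
            return Mode.CONTENT;         // switch back to tokenizeContent
        }
        return mode;
    }
}
```

A caller would keep the current Mode alongside its buffer offset and choose the tokenizer method from it on each iteration.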
public static final Encoding getInitialEncoding(byte buf[],
int off,
int end,
Token token)
off is the index in buf of the first byte of the entity.
end is the index in buf following the last available
byte of the entity; end - off must be greater than or equal
to 4 unless the entity has fewer than 4 bytes, in which case it must
be equal to the length of the entity.
If the entity starts with a byte order mark, token.getTokenEnd()
will return off + 2, otherwise it will return
off.
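A return of off + 2 is consistent with skipping a two-byte UTF-16 byte order mark. The sketch below illustrates that reading only; it is an assumption about the behaviour, and BomSketch and skipUtf16Bom are hypothetical names rather than library code.

```java
// Hypothetical sketch; assumes the two skipped bytes are a UTF-16 byte order mark.
final class BomSketch {
    /** Returns off + 2 if buf[off..end) starts with FE FF or FF FE, otherwise off. */
    static int skipUtf16Bom(byte[] buf, int off, int end) {
        if (end - off >= 2) {
            int b0 = buf[off] & 0xFF;      // mask because Java bytes are signed
            int b1 = buf[off + 1] & 0xFF;
            boolean bigEndian = b0 == 0xFE && b1 == 0xFF;
            boolean littleEndian = b0 == 0xFF && b1 == 0xFE;
            if (bigEndian || littleEndian) {
                return off + 2;
            }
        }
        return off;
    }
}
```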
public final Encoding getEncoding(String name)
Returns an Encoding corresponding to
the specified IANA character set name.
Returns this Encoding if the name is null.
Returns null if the specified encoding is not supported.
Note that there are two distinct Encoding objects
associated with the name UTF-16, one for
each possible byte order; if this Encoding
is UTF-16 with little-endian byte ordering, then
getEncoding("UTF-16") will return this,
otherwise it will return an Encoding for
UTF-16 with big-endian byte ordering.
public final Encoding getSingleByteEncoding(String map)
Returns an Encoding for entities encoded with
a single-byte encoding (an encoding in which each byte represents
exactly one character).
map.charAt(b)
specifies the character encoded by byte b; bytes that do
not represent any character should be mapped to ?
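The role of map can be illustrated with a standalone decoder. SingleByteSketch, identityMap and decode are hypothetical; the sketch assumes map has one entry for each of the 256 byte values, which this page does not state explicitly.

```java
// Hypothetical sketch of decoding through a byte-to-char map.
final class SingleByteSketch {
    /** Builds a map where byte value b stands for the char with code b (as in ISO-8859-1). */
    static String identityMap() {
        StringBuilder sb = new StringBuilder(256);
        for (int i = 0; i < 256; i++) {
            sb.append((char) i);
        }
        return sb.toString();
    }

    /** Decodes buf[off..end) into target; returns the number of chars written. */
    static int decode(String map, byte[] buf, int off, int end, char[] target, int targetStart) {
        for (int i = off; i < end; i++) {
            // Java bytes are signed, so mask to get an index in 0..255.
            target[targetStart + (i - off)] = map.charAt(buf[i] & 0xFF);
        }
        return end - off;
    }
}
```

The mask with 0xFF matters: without it, bytes above 0x7F would become negative indices and charAt would throw.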
public static final Encoding getInternalEncoding()
Returns an Encoding object for use with internal entities.
This is a UTF-16 big endian encoding, except that newlines
are assumed to have been normalized into line feed,
so carriage return is treated like a space.
public final int tokenizeProlog(byte buf[],
int off,
int end,
Token token) throws PartialTokenException, InvalidTokenException, EmptyTokenException, ExtensibleTokenException, EndOfPrologException
Returns one of the following token types:
TOK_PI
TOK_XML_DECL
TOK_COMMENT
TOK_PARAM_ENTITY_REF
TOK_PROLOG_S
TOK_DECL_OPEN
TOK_DECL_CLOSE
TOK_NAME
TOK_NMTOKEN
TOK_POUND_NAME
TOK_OR
TOK_PERCENT
TOK_OPEN_PAREN
TOK_CLOSE_PAREN
TOK_OPEN_BRACKET
TOK_CLOSE_BRACKET
TOK_LITERAL
TOK_NAME_QUESTION
TOK_NAME_ASTERISK
TOK_NAME_PLUS
TOK_COND_SECT_OPEN
TOK_COND_SECT_CLOSE
TOK_CLOSE_PAREN_QUESTION
TOK_CLOSE_PAREN_ASTERISK
TOK_CLOSE_PAREN_PLUS
TOK_COMMA
When EndOfPrologException is thrown,
tokenizeContent should be used on the remainder
of the entity.
public final int tokenizeAttributeValue(byte buf[],
int off,
int end,
Token token) throws PartialTokenException, InvalidTokenException, EmptyTokenException, ExtensibleTokenException
Returns one of the following token types:
TOK_DATA_CHARS
TOK_DATA_NEWLINE
TOK_ATTRIBUTE_VALUE_S
TOK_MAGIC_ENTITY_REF
TOK_ENTITY_REF
TOK_CHAR_REF
TOK_CHAR_PAIR_REF
public final int tokenizeEntityValue(byte buf[],
int off,
int end,
Token token) throws PartialTokenException, InvalidTokenException, EmptyTokenException, ExtensibleTokenException
Returns one of the following token types:
TOK_DATA_CHARS
TOK_DATA_NEWLINE
TOK_PARAM_ENTITY_REF
TOK_MAGIC_ENTITY_REF
TOK_ENTITY_REF
TOK_CHAR_REF
TOK_CHAR_PAIR_REF
public final int skipIgnoreSect(byte buf[],
int off,
int end) throws PartialTokenException, InvalidTokenException
The subarray starts following the <![ IGNORE [.
Returns the index following the closing ]]>.
public final String getPublicId(byte buf[],
int off,
int end) throws InvalidTokenException
public final boolean matchesXMLString(byte buf[],
int off,
int end,
String str)
public final int skipS(byte buf[],
int off,
int end)
Returns the index of the first non-whitespace character, or
end if the subarray is all whitespace.
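The return convention can be sketched for a single-byte, ASCII-compatible encoding. SkipWhitespaceSketch is a hypothetical standalone class, not the library method; it treats space, tab, carriage return and line feed as whitespace, following the S production of XML 1.0.

```java
// Hypothetical sketch; assumes one byte per character.
final class SkipWhitespaceSketch {
    /** Returns the index of the first non-whitespace byte in buf[off..end), or end. */
    static int skipS(byte[] buf, int off, int end) {
        while (off < end) {
            byte b = buf[off];
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                break;
            }
            off++;
        }
        return off;
    }
}
```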
public final int getMinBytesPerChar()