Class CharacterReference
- All Implemented Interfaces:
CharSequence, Comparable<Segment>
- Direct Known Subclasses:
CharacterEntityReference, NumericCharacterReference
Represents an HTML character reference, implemented by the subclasses CharacterEntityReference and NumericCharacterReference.
This class, together with its subclasses, contains static methods to perform most required operations without having to instantiate an object.
Instances of this class are useful when the positions of character references in a source document are required, or to replace the found character references with customised text.
CharacterReference instances are obtained using one of the following methods:
-
Field Summary
Fields
Modifier and Type | Field | Description
static final int | INVALID_CODE_POINT | Represents an invalid unicode code point.
-
Method Summary
Modifier and Type | Method | Description
final void | appendCharTo(Appendable appendable) | Appends the character represented by this character reference to the specified appendable object.
static String | decode(CharSequence encodedText) | Decodes the specified HTML encoded text into normal text.
static String | decode(CharSequence encodedText, boolean insideAttributeValue) | Decodes the specified HTML encoded text into normal text.
static String | decodeCollapseWhiteSpace(CharSequence text) | Decodes the specified text after collapsing its white space.
static String | encode(char ch) | Encodes the specified character into a character reference if required.
static String | encode(CharSequence unencodedText) | Encodes the specified text, escaping special characters into character references.
static String | encodeWithWhiteSpaceFormatting(CharSequence unencodedText) | Encodes the specified text, preserving line breaks, tabs and spaces for rendering by converting them to markup.
char | getChar() | Returns the character represented by this character reference.
abstract String | getCharacterReferenceString() | Returns the encoded form of this character reference.
static String | getCharacterReferenceString(int codePoint) | Returns the encoded form of the specified unicode code point.
int | getCodePoint() | Returns the unicode code point represented by this character reference.
static int | getCodePointFromCharacterReferenceString(CharSequence characterReferenceText) | Parses a single encoded character reference text into a unicode code point.
String | getDecimalCharacterReferenceString() | Returns the decimal encoded form of this character reference.
static String | getDecimalCharacterReferenceString(int codePoint) | Returns the decimal encoded form of the specified unicode code point.
static Writer | getEncodingFilterWriter(Writer writer) | Returns a filter Writer that encodes all text before passing it through to the specified Writer.
String | getHexadecimalCharacterReferenceString() | Returns the hexadecimal encoded form of this character reference.
static String | getHexadecimalCharacterReferenceString(int codePoint) | Returns the hexadecimal encoded form of the specified unicode code point.
String | getUnicodeText() | Returns the unicode code point of this character reference in U+ notation.
static String | getUnicodeText(int codePoint) | Returns the specified unicode code point in U+ notation.
boolean | isTerminated() | Indicates whether this character reference is terminated by a semicolon (;).
static CharacterReference | parse(CharSequence characterReferenceText) | Parses a single encoded character reference text into a CharacterReference object.
static String | reencode(CharSequence encodedText) | Re-encodes the specified text, equivalent to decoding and then encoding again.
static final boolean | requiresEncoding(char ch) | Indicates whether the specified character would need to be encoded in HTML text.
Methods inherited from class Segment
charAt, compareTo, encloses, encloses, equals, getAllCharacterReferences, getAllElements, getAllElements, getAllElements, getAllElements, getAllElements, getAllElementsByClass, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTags, getAllStartTagsByClass, getAllTags, getAllTags, getBegin, getChildElements, getDebugInfo, getEnd, getFirstElement, getFirstElement, getFirstElement, getFirstElement, getFirstElementByClass, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTag, getFirstStartTagByClass, getFormControls, getFormFields, getMaxDepthIndicator, getNodeIterator, getRenderer, getRowColumnVector, getSource, getStyleURISegments, getTextExtractor, getURIAttributes, hashCode, ignoreWhenParsing, isWhiteSpace, isWhiteSpace, length, parseAttributes, subSequence, toString
Methods inherited from interface CharSequence
chars, codePoints, getChars, isEmpty
-
Field Details
-
INVALID_CODE_POINT
public static final int INVALID_CODE_POINT
Represents an invalid unicode code point.
This can be the result of parsing a numeric character reference outside of the valid unicode range of 0x000000-0x10FFFF, or any other invalid character reference.
-
Method Details
-
getCodePoint
public int getCodePoint()
Returns the unicode code point represented by this character reference.
- Returns:
- the unicode code point represented by this character reference.
-
getChar
public char getChar()
Returns the character represented by this character reference.
If this character reference represents a unicode supplementary code point, any bits outside of the least significant 16 bits of the code point are truncated, yielding an incorrect result.
To ensure that the character is correctly appended to an Appendable object such as a Writer, use the code:
characterReference.appendCharTo(appendable)
instead of:
appendable.append(characterReference.getChar())
- Returns:
- the character represented by this character reference.
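The truncation described for getChar() can be reproduced with plain Java, independently of this library. The following is a minimal sketch; the class and method names are hypothetical and exist only to show why a single char cannot hold a supplementary code point.

```java
final class SupplementaryTruncationSketch {
    // Narrowing an int code point to char keeps only the low 16 bits.
    static char truncate(int codePoint) {
        return (char) codePoint;
    }

    // Character.toChars yields one char for BMP code points and a surrogate pair otherwise.
    static String asString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600; // a supplementary code point
        System.out.println(Integer.toHexString(truncate(grinningFace))); // prints "f600"
        System.out.println(asString(grinningFace).length());             // prints "2"
    }
}
```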
-
appendCharTo
Appends the character represented by this character reference to the specified appendable object.
If this character is a unicode supplementary character, then both the UTF-16 high/low surrogate char values of the character are appended, as described in the Unicode character representations section of the java.lang.Character class.
If the static Config.ConvertNonBreakingSpaces property is set to true (the default), then calling this method on a non-breaking space character reference (&nbsp;) results in a normal space being appended.
- Parameters:
appendable - the object to append this character reference to.
- Throws:
IOException
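The behaviour described for appendCharTo can be approximated without the library as follows. This is a sketch under stated assumptions, not the library's implementation: the class, the method names and the boolean parameter standing in for Config.ConvertNonBreakingSpaces are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

final class AppendCodePointSketch {
    static final int NBSP = 0x00A0;

    // Appends a code point to any Appendable, writing both surrogates when needed.
    // convertNonBreakingSpaces is a stand-in for the library's static configuration property.
    static void appendCodePoint(Appendable out, int codePoint, boolean convertNonBreakingSpaces)
            throws IOException {
        if (convertNonBreakingSpaces && codePoint == NBSP) {
            out.append(' ');
            return;
        }
        for (char c : Character.toChars(codePoint)) {
            out.append(c);
        }
    }

    // Convenience wrapper for callers that only want a String.
    static String render(int codePoint, boolean convertNonBreakingSpaces) {
        StringBuilder sb = new StringBuilder();
        try {
            appendCodePoint(sb, codePoint, convertNonBreakingSpaces);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(0x1F600, true).length()); // prints "2"
        System.out.println(render(NBSP, true).equals(" ")); // prints "true"
    }
}
```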
-
isTerminated
public boolean isTerminated()
Indicates whether this character reference is terminated by a semicolon (;).
Conversely, this library defines an unterminated character reference as one which does not end with a semicolon.
The SGML specification allows unterminated character references in some circumstances, and because the HTML 4.01 specification states simply that "authors may use SGML character references", it follows that they are also valid in HTML documents, although their use is strongly discouraged.
Unterminated character references are not allowed in XHTML documents.
- Returns:
- true if this character reference is terminated by a semicolon, otherwise false.
-
encode
Encodes the specified text, escaping special characters into character references.
Each character is encoded only if the requiresEncoding(char) method would return true for that character, using its CharacterEntityReference if available, or a decimal NumericCharacterReference if its unicode code point is greater than U+007F.
The only exception to this is an apostrophe (U+0027), which, depending on the current setting of the static Config.IsApostropheEncoded property, is either left unencoded (default setting) or encoded as the numeric character reference "&#39;".
This method never encodes an apostrophe into its character entity reference &apos; as this entity is not defined for use in HTML. See the comments in the CharacterEntityReference class for more information.
To encode text using only numeric character references, use the NumericCharacterReference.encode(CharSequence) method instead.
- Parameters:
unencodedText - the text to encode.
- Returns:
- the encoded string.
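The encoding rule described for encode(CharSequence) can be illustrated with a small stand-alone sketch. It is not the library's code: the entity table is a deliberately tiny hypothetical subset, the boolean parameter stands in for Config.IsApostropheEncoded, and iterating by code point rather than by char is a choice made here so that supplementary characters yield a single reference.

```java
import java.util.Map;

final class EncodeSketch {
    // Hypothetical subset of the entity table; the real table is far larger.
    private static final Map<Integer, String> ENTITY_NAMES = Map.of(
            (int) '&', "amp", (int) '<', "lt", (int) '>', "gt", (int) '"', "quot", 0x00A0, "nbsp");

    static String encode(CharSequence text, boolean encodeApostrophe) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String name = ENTITY_NAMES.get(cp);
            if (name != null) {
                out.append('&').append(name).append(';');  // entity reference if available
            } else if (cp == '\'' && encodeApostrophe) {
                out.append("&#39;");                        // never &apos;
            } else if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');    // decimal numeric reference
            } else {
                out.appendCodePoint(cp);                    // plain ASCII passes through
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a < b & 'c'", false)); // prints "a &lt; b &amp; 'c'"
    }
}
```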
-
encode
Encodes the specified character into a character reference if required.
The encoding of the character follows the same rules as for each character in the encode(CharSequence unencodedText) method.
- Parameters:
ch - the character to encode.
- Returns:
- a character reference if appropriate, otherwise a string containing the original character.
-
encodeWithWhiteSpaceFormatting
Encodes the specified text, preserving line breaks, tabs and spaces for rendering by converting them to markup.
This performs the same encoding as the encode(CharSequence) method, but also performs the following conversions:
- Line breaks, being Carriage Return (U+000D) or Line Feed (U+000A) characters, and Form Feed characters (U+000C) are converted to "<br />". CR/LF pairs are treated as a single line break.
- Multiple consecutive spaces are converted so that every second space is converted to "&nbsp;" while ensuring the last is always a normal space.
- Tab characters (U+0009) are converted as if they were four consecutive spaces.
The conversion of multiple consecutive spaces to alternating space/non-breaking-space allows the correct number of spaces to be rendered, but also allows the line to wrap in the middle of it.
Note that zero-width spaces (U+200B) are converted to the numeric character reference "&#8203;" through the normal encoding process, but IE6 does not render them properly either encoded or unencoded.
There is no method provided to reverse this encoding.
- Parameters:
unencodedText - the text to encode.
- Returns:
- the encoded string with white space formatting converted to markup.
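The white space conversions listed for encodeWithWhiteSpaceFormatting can be sketched as below. This is one plausible reading of the description, not the library's implementation: it assumes that a run of spaces and tabs is handled as a whole, that an odd run starts with a normal space, and it omits the entity encoding step entirely. All names are hypothetical.

```java
final class WhiteSpaceFormattingSketch {
    static String format(CharSequence text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = text.length();
        while (i < n) {
            char ch = text.charAt(i);
            if (ch == ' ' || ch == '\t') {
                // Measure the whole run, counting each tab as four spaces.
                int spaces = 0;
                while (i < n && (text.charAt(i) == ' ' || text.charAt(i) == '\t')) {
                    spaces += text.charAt(i) == '\t' ? 4 : 1;
                    i++;
                }
                if (spaces % 2 == 1) {
                    out.append(' ');
                    spaces--;
                }
                // Emit the rest as "&nbsp; " pairs so the run always ends in a normal space.
                for (; spaces >= 2; spaces -= 2) {
                    out.append("&nbsp; ");
                }
            } else if (ch == '\r' || ch == '\n' || ch == '\f') {
                // Treat a CR/LF pair as a single line break.
                if (ch == '\r' && i + 1 < n && text.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append("<br />");
                i++;
            } else {
                out.append(ch);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("a  b\r\nc")); // prints "a&nbsp; b<br />c"
    }
}
```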
-
decode
Decodes the specified HTML encoded text into normal text.
All character entity references and numeric character references are converted to their respective characters.
This is equivalent to decode(encodedText,false).
Unterminated character references are dealt with according to the rules for text outside of attribute values in the current compatibility mode.
If the static Config.ConvertNonBreakingSpaces property is set to true (the default), then all non-breaking space (&nbsp;) character entity references are converted to normal spaces.
Although character entity reference names are case sensitive, and in some cases differ from other entity references only by their case, some browsers also recognise them in a case-insensitive way. For this reason, all decoding methods in this library recognise character entity reference names even if they are in the wrong case.
- Parameters:
encodedText - the text to decode.
- Returns:
- the decoded string.
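The decoding behaviour described for decode(CharSequence) can be illustrated with a simplified, library-independent sketch. It assumes a tiny hypothetical entity table, handles only terminated references, and uses a boolean parameter in place of Config.ConvertNonBreakingSpaces; the exact-match-then-lowercase lookup is one simple way to tolerate names in the wrong case.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DecodeSketch {
    // Hypothetical subset of the entity table.
    private static final Map<String, Integer> ENTITIES =
            Map.of("amp", 0x26, "lt", 0x3C, "gt", 0x3E, "quot", 0x22, "nbsp", 0xA0);
    private static final Pattern REFERENCE =
            Pattern.compile("&(?:#(\\d+)|#[xX]([0-9a-fA-F]+)|([A-Za-z][A-Za-z0-9]*));");

    static String decode(CharSequence text, boolean convertNonBreakingSpaces) {
        Matcher m = REFERENCE.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            int cp = codePointOf(m);
            if (cp < 0) {
                out.append(m.group());            // unknown reference: leave untouched
            } else if (cp == 0xA0 && convertNonBreakingSpaces) {
                out.append(' ');
            } else {
                out.appendCodePoint(cp);
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    // Returns the referenced code point, or -1 if the reference is not recognised.
    private static int codePointOf(Matcher m) {
        try {
            if (m.group(1) != null) return valid(Integer.parseInt(m.group(1)));
            if (m.group(2) != null) return valid(Integer.parseInt(m.group(2), 16));
        } catch (NumberFormatException e) {
            return -1;
        }
        Integer named = ENTITIES.get(m.group(3));
        if (named == null) named = ENTITIES.get(m.group(3).toLowerCase(Locale.ROOT));
        return named == null ? -1 : named;
    }

    private static int valid(int cp) {
        return Character.isValidCodePoint(cp) ? cp : -1;
    }

    public static void main(String[] args) {
        System.out.println(decode("a &lt; b &AMP; c&#62;&#x3e;&nbsp;d", true)); // prints "a < b & c>> d"
    }
}
```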
-
decode
Decodes the specified HTML encoded text into normal text.
All character entity references and numeric character references are converted to their respective characters.
Unterminated character references are dealt with according to the value of the insideAttributeValue parameter and the current compatibility mode.
If the static Config.ConvertNonBreakingSpaces property is set to true (the default), then all non-breaking space (&nbsp;) character entity references are converted to normal spaces.
Although character entity reference names are case sensitive, and in some cases differ from other entity references only by their case, some browsers also recognise them in a case-insensitive way. For this reason, all decoding methods in this library recognise character entity reference names even if they are in the wrong case.
- Parameters:
encodedText - the text to decode.
insideAttributeValue - specifies whether the encoded text is inside an attribute value.
- Returns:
- the decoded string.
-
decodeCollapseWhiteSpace
Decodes the specified text after collapsing its white space.
All leading and trailing white space is omitted, and any sections of internal white space are replaced by a single space.
The result is how the text would normally be rendered by a user agent, assuming it does not contain any tags.
If the static Config.ConvertNonBreakingSpaces property is set to true (the default), then all non-breaking space (&nbsp;) character entity references are converted to normal spaces.
Unterminated character references are dealt with according to the rules for text outside of attribute values in the current compatibility mode. See the discussion of the insideAttributeValue parameter of the decode(CharSequence, boolean insideAttributeValue) method for a more detailed explanation of this topic.
- Parameters:
text - the source text
- Returns:
- the decoded text with collapsed white space.
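The white space collapsing step described for decodeCollapseWhiteSpace can be sketched on its own as follows. The sketch assumes that Character.isWhitespace is an acceptable definition of white space, which may differ from the set of characters the library treats as white space, and it does not perform any decoding. The names are hypothetical.

```java
final class CollapseWhiteSpaceSketch {
    static String collapse(CharSequence text) {
        StringBuilder out = new StringBuilder();
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isWhitespace(ch)) {
                // Remember a separator only once some text has been emitted,
                // which drops leading white space.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append(ch);
            }
        }
        // A pending space at the end is never written, which drops trailing white space.
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + collapse("  a \n\t b  ") + "]"); // prints "[a b]"
    }
}
```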
-
reencode
Re-encodes the specified text, equivalent to decoding and then encoding again.
This process ensures that the specified encoded text does not contain any remaining unencoded characters.
IMPLEMENTATION NOTE: At present this method simply calls the decode method followed by the encode method, but a more efficient implementation may be used in future.
- Parameters:
encodedText - the text to re-encode.
- Returns:
- the re-encoded string.
-
getCharacterReferenceString
Returns the encoded form of this character reference.
The exact behaviour of this method depends on the class of this object. See the CharacterEntityReference.getCharacterReferenceString() and NumericCharacterReference.getCharacterReferenceString() methods for more details.
- Examples:
CharacterReference.parse("&gt;").getCharacterReferenceString() returns "&gt;"
CharacterReference.parse("&#x3E;").getCharacterReferenceString() returns "&#x3e;"
- Returns:
- the encoded form of this character reference.
-
getCharacterReferenceString
Returns the encoded form of the specified unicode code point.
This method returns the character entity reference encoded form of the unicode code point if one exists, otherwise it returns the decimal character reference encoded form.
The only exception to this is an apostrophe (U+0027), which is encoded as the numeric character reference "&#39;" instead of its character entity reference "&apos;".
- Examples:
CharacterReference.getCharacterReferenceString(62) returns "&gt;"
CharacterReference.getCharacterReferenceString('>') returns "&gt;"
CharacterReference.getCharacterReferenceString('☺') returns "&#9786;"
- Parameters:
codePoint - the unicode code point to encode.
- Returns:
- the encoded form of the specified unicode code point.
-
getDecimalCharacterReferenceString
Returns the decimal encoded form of this character reference.
This is equivalent to getDecimalCharacterReferenceString(getCodePoint()).
- Example:
CharacterReference.parse("&gt;").getDecimalCharacterReferenceString() returns "&#62;"
- Returns:
- the decimal encoded form of this character reference.
-
getDecimalCharacterReferenceString
Returns the decimal encoded form of the specified unicode code point.
- Example:
CharacterReference.getDecimalCharacterReferenceString('>') returns "&#62;"
- Parameters:
codePoint - the unicode code point to encode.
- Returns:
- the decimal encoded form of the specified unicode code point.
-
getHexadecimalCharacterReferenceString
Returns the hexadecimal encoded form of this character reference.
This is equivalent to getHexadecimalCharacterReferenceString(getCodePoint()).
- Example:
CharacterReference.parse("&gt;").getHexadecimalCharacterReferenceString() returns "&#x3e;"
- Returns:
- the hexadecimal encoded form of this character reference.
-
getHexadecimalCharacterReferenceString
Returns the hexadecimal encoded form of the specified unicode code point.
- Example:
CharacterReference.getHexadecimalCharacterReferenceString('>') returns "&#x3e;"
- Parameters:
codePoint - the unicode code point to encode.
- Returns:
- the hexadecimal encoded form of the specified unicode code point.
-
getUnicodeText
Returns the unicode code point of this character reference in U+ notation.
This is equivalent to getUnicodeText(getCodePoint()).
- Example:
CharacterReference.parse("&gt;").getUnicodeText() returns "U+003E"
- Returns:
- the unicode code point of this character reference in U+ notation.
-
getUnicodeText
Returns the specified unicode code point in U+ notation.
- Example:
CharacterReference.getUnicodeText('>') returns "U+003E"
- Parameters:
codePoint - the unicode code point.
- Returns:
- the specified unicode code point in U+ notation.
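The decimal, hexadecimal and U+ forms shown in the examples for these methods can be produced with standard Java formatting. The following sketch is an illustration only; the class and method names are hypothetical, and the lowercase hexadecimal digits and four-digit minimum width are assumptions taken from the examples rather than from the library's source.

```java
final class ReferenceFormatSketch {
    // Decimal numeric character reference, e.g. "&#62;".
    static String decimal(int codePoint) {
        return "&#" + codePoint + ";";
    }

    // Hexadecimal numeric character reference with lowercase digits, e.g. "&#x3e;".
    static String hexadecimal(int codePoint) {
        return "&#x" + Integer.toString(codePoint, 16) + ";";
    }

    // U+ notation, zero-padded to at least four uppercase hexadecimal digits.
    static String unicodeText(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(decimal('>'));     // prints "&#62;"
        System.out.println(hexadecimal('>')); // prints "&#x3e;"
        System.out.println(unicodeText('>')); // prints "U+003E"
    }
}
```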
-
parse
Parses a single encoded character reference text into a CharacterReference object.
The character reference must be at the start of the given text, but the text may contain other characters after it. The getEnd() method can be used on the resulting object to determine at which character position the character reference ended.
If the text does not represent a valid character reference, this method returns null.
Unterminated character references are always accepted, regardless of the settings in the current compatibility mode.
To decode all character references in a given text, use the decode(CharSequence) method instead.
- Example:
CharacterReference.parse("&gt;").getChar() returns '>'
- Parameters:
characterReferenceText - the text containing a single encoded character reference.
- Returns:
- a CharacterReference object representing the specified text, or null if the text does not represent a valid character reference.
-
getCodePointFromCharacterReferenceString
Parses a single encoded character reference text into a unicode code point.
The character reference must be at the start of the given text, but the text may contain other characters after it.
If the text does not represent a valid character reference, this method returns INVALID_CODE_POINT.
This is equivalent to parse(characterReferenceText).getCodePoint(), except that it returns INVALID_CODE_POINT if an invalid character reference is specified instead of throwing a NullPointerException.
- Example:
CharacterReference.getCodePointFromCharacterReferenceString("&gt;") returns 62
- Parameters:
characterReferenceText - the text containing a single encoded character reference.
- Returns:
- the unicode code point representing the specified text, or INVALID_CODE_POINT if the text does not represent a valid character reference.
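The idea of parsing a reference anchored at the start of the text and returning a sentinel instead of failing can be sketched without the library. This sketch handles numeric references only, accepts unterminated ones, and uses -1 as a hypothetical sentinel; the actual value of INVALID_CODE_POINT is not stated in this document.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LeadingReferenceSketch {
    static final int INVALID = -1; // hypothetical sentinel, standing in for INVALID_CODE_POINT

    // Numeric reference with an optional terminating semicolon.
    private static final Pattern NUMERIC =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|(\\d+));?");

    static int codePointAtStart(CharSequence text) {
        Matcher m = NUMERIC.matcher(text);
        if (!m.lookingAt()) {
            return INVALID; // the reference must begin at index 0
        }
        try {
            int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            // Reject values outside the unicode range 0x000000-0x10FFFF.
            return cp <= 0x10FFFF ? cp : INVALID;
        } catch (NumberFormatException e) {
            return INVALID; // too many digits to fit in an int
        }
    }

    public static void main(String[] args) {
        System.out.println(codePointAtStart("&#62; trailing text")); // prints "62"
        System.out.println(codePointAtStart("x&#62;"));              // prints "-1"
    }
}
```

Returning a sentinel lets callers test the result with a simple comparison rather than guarding against a null object before asking for its code point.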
-
requiresEncoding
public static final boolean requiresEncoding(char ch)
Indicates whether the specified character would need to be encoded in HTML text.
This is the case if a character entity reference exists for the character, or the unicode code point is greater than U+007F.
The only exception to this is an apostrophe (U+0027), which only returns true if the static Config.IsApostropheEncoded property is currently set to true.
- Parameters:
ch - the character to test.
- Returns:
- true if the specified character would need to be encoded in HTML text, otherwise false.
-
getEncodingFilterWriter
- Parameters:
writer - the destination for the encoded text
- Returns:
- a filter Writer that encodes all text before passing it through to the specified Writer.
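A filter Writer of the kind returned by getEncodingFilterWriter can be sketched with java.io.FilterWriter. This is not the library's implementation: the class name is hypothetical and, to stay short, the sketch escapes only the four markup-significant characters instead of applying the full encoding rules.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class EncodingFilterWriterSketch extends FilterWriter {
    EncodingFilterWriterSketch(Writer out) {
        super(out);
    }

    @Override
    public void write(int c) throws IOException {
        out.write(escape((char) c)); // encode, then pass through to the wrapped writer
    }

    @Override
    public void write(char[] buf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) write(buf[i]);
    }

    @Override
    public void write(String s, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) write(s.charAt(i));
    }

    // Simplified escaping; a full encoder would also handle characters above U+007F.
    private static String escape(char ch) {
        switch (ch) {
            case '&': return "&amp;";
            case '<': return "&lt;";
            case '>': return "&gt;";
            case '"': return "&quot;";
            default:  return String.valueOf(ch);
        }
    }

    // Convenience helper showing the filter in use.
    static String encodeThroughFilter(String text) {
        StringWriter sink = new StringWriter();
        try (Writer w = new EncodingFilterWriterSketch(sink)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeThroughFilter("<b>&</b>")); // prints "&lt;b&gt;&amp;&lt;/b&gt;"
    }
}
```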