Class StreamedSource
- All Implemented Interfaces:
Closeable, AutoCloseable, Iterable<Segment>
This class provides a means, via the iterator() method, of sequentially parsing every tag, character reference
and plain text segment contained within the source document using a minimum amount of memory.
In contrast, the standard Source class stores the entire source text in memory and caches every tag parsed,
resulting in memory problems when attempting to parse very large files.
The iterator parses and returns each segment as the source text is streamed in.
Previous segments are discarded for garbage collection.
Source documents up to 2GB in size can be processed, a limit which is imposed by the Java language because of its use of the int data type to index string operations.
There is however a significant trade-off in functionality when using the StreamedSource class as opposed to the Source class.
The Tag.getElement() method is not supported on tags that are returned by the iterator, nor are any methods that use the Element class in any way.
The Segment.getSource() method is also not supported.
Most of the methods and constructors in this class mirror similarly named methods in the Source class where the same functionality is available.
See the description of the iterator() method for a typical usage example of this class.
In contrast to a Source object, the Reader or InputStream specified in the constructor or created implicitly by the constructor
remains open for the life of the StreamedSource object. If the stream is created internally, it is automatically closed
when the end of the stream is reached or the StreamedSource object is finalized.
However, a Reader or InputStream that is specified directly in a constructor is never closed automatically, as it cannot be assumed
that the application has no further use for it. It is the user's responsibility to ensure it is closed in this case.
Explicitly calling the close() method on the StreamedSource object ensures that all resources used by it are closed, regardless of whether
they were created internally or supplied externally.
The functionality provided by StreamedSource is similar to a StAX parser,
but with some important benefits:
- The source document does not have to be valid XML. It can be plain HTML, can contain invalid syntax, undefined entities, incorrectly nested elements, server tags, or anything else that is commonly found in "tag soup".
- Every single syntactical construct in the source document's original text is included in the iterator, including the XML declaration, character references, comments, CDATA sections and server tags, each providing the segment's begin and end position in the source document. This allows an exact copy of the original document to be generated, allowing modifications to be made only where they are explicitly required. This is not possible with either SAX or StAX, which to some extent provide interpretations of the content of the XML instead of the syntactical structures used in the original source document.
The following table summarises the differences between the StreamedSource, StAX and SAX interfaces.
Note that some of the available features are documented as optional and may not be supported by all implementations of StAX and SAX.
| Feature | StreamedSource | StAX | SAX |
|---|---|---|---|
| Parse XML | ● | ● | ● |
| Parse entities without DTD | ● | | |
| Automatically validate XML | | ● | ● |
| Parse HTML | ● | | |
| Tolerant of syntax or nesting errors | ● | | |
| Provide begin and end character positions of each event | ● | ○ | |
| Provide source text of each event | ● | | |
| Handle server tag events | ● | | |
| Handle XML declaration event | ● | | |
| Handle comment events | ● | ● | ● |
| Handle CDATA section events | ● | ● | ● |
| Handle document type declaration event | ● | ● | ● |
| Handle character reference events | ● | | |
| Allow chunking of plain text | ● | ● | ● |
| Allow chunking of comment text | | | |
| Allow chunking of CDATA section text | ● | ||
| Allow specification of maximum buffer size | ● | | |
Note that the OutputDocument class cannot be used to create a modified version of a streamed source document.
Instead, the output document must be constructed manually from the segments provided by the iterator.
StreamedSource objects are not thread safe.
-
Constructor Summary
| Constructor | Description |
|---|---|
| StreamedSource(InputStream inputStream) | Constructs a new StreamedSource object by loading the content from the specified InputStream. |
| StreamedSource(Reader reader) | Constructs a new StreamedSource object by loading the content from the specified Reader. |
| StreamedSource(CharSequence text) | Constructs a new StreamedSource object from the specified text. |
| StreamedSource(URL url) | Constructs a new StreamedSource object by loading the content from the specified URL. |
| StreamedSource(URLConnection urlConnection) | Constructs a new StreamedSource object by loading the content from the specified URLConnection. |
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| void | close() | Closes the underlying Reader or InputStream and releases any system resources associated with it. |
| protected void | finalize() | Called by the garbage collector on an object when garbage collection determines that there are no more references to the object. |
| int | getBufferSize() | Returns the current size of the internal character buffer. |
| Segment | getCurrentSegment() | Returns the current Segment from the iterator(). |
| CharBuffer | getCurrentSegmentCharBuffer() | Returns a CharBuffer containing the source text of the current segment. |
| String | getEncoding() | Returns the character encoding scheme of the source byte stream used to create this object. |
| String | getEncodingSpecificationInfo() | Returns a concise description of how the encoding of the source document was determined. |
| Logger | getLogger() | Returns the Logger that handles log messages. |
| String | getPreliminaryEncodingInfo() | Returns the preliminary encoding of the source document together with a concise description of how it was determined. |
| boolean | isXML() | Indicates whether the source document is likely to be XML. |
| Iterator<Segment> | iterator() | Returns an iterator over every tag, character reference and plain text segment contained within the source document. |
| StreamedSource | setBuffer(char[] buffer) | Specifies an existing character array to use for buffering the incoming character stream. |
| StreamedSource | setCoalescing(boolean coalescing) | Specifies whether an unbroken section of plain text in the source document should always be coalesced into a single Segment by the iterator. |
| void | setLogger(Logger logger) | Sets the Logger that handles log messages. |
| String | toString() | Returns a string representation of the object as generated by the default Object.toString() implementation. |

Methods inherited from class Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface Iterable
forEach, spliterator
-
Constructor Details
-
StreamedSource
Constructs a new StreamedSource object by loading the content from the specified Reader.
If the specified reader is an instance of InputStreamReader, the getEncoding() method of the created StreamedSource object returns the encoding from InputStreamReader.getEncoding().
- Parameters:
reader - the java.io.Reader from which to load the source text.
- Throws:
IOException - if an I/O error occurs.
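To illustrate the value that would be passed through in this case, the following JDK-only sketch shows what InputStreamReader.getEncoding() reports for a reader created with an explicit charset. The class and method names (ReaderEncodingDemo, reportedEncoding) are hypothetical, and the StreamedSource constructor call appears only as a comment because it needs the library on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the library.
final class ReaderEncodingDemo {

    // Returns the encoding name reported by an InputStreamReader over the given bytes.
    static String reportedEncoding(byte[] bytes, Charset charset) {
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            // A StreamedSource built from this reader is documented to report the same value:
            // StreamedSource streamedSource = new StreamedSource(reader);
            // getEncoding() must be called while the reader is still open.
            return reader.getEncoding();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        String name = reportedEncoding(bytes, StandardCharsets.UTF_8);
        // The JDK may report a historical name, so resolve it to a Charset before comparing.
        System.out.println(name + " resolves to " + Charset.forName(name));
    }
}
```

Because the reported name may be a historical alias rather than the canonical charset name, comparing through Charset.forName is usually safer than comparing strings.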
-
StreamedSource
Constructs a new StreamedSource object by loading the content from the specified InputStream.
The algorithm for detecting the character encoding of the source document from the raw bytes of the specified input stream is the same as that for the Source(URLConnection) constructor of the Source class, except that the first step is not possible as there is no Content-Type header to check.
If the specified InputStream does not support the mark method, the algorithm that determines the encoding may have to wrap it in a BufferedInputStream in order to look ahead at the encoding metadata. This extra layer of buffering will then remain in place for the life of the StreamedSource, possibly impacting memory usage and/or degrading performance. It is always preferable to use the StreamedSource(Reader) constructor if the encoding is known in advance.
- Parameters:
inputStream - the java.io.InputStream from which to load the source text.
- Throws:
IOException - if an I/O error occurs.
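The look-ahead described above can be pictured with a minimal JDK-only sketch. It is not the library's detection algorithm: it only checks for a UTF-8 byte order mark, and the names EncodingLookAhead, ensureMarkSupported and startsWithUtf8Bom are hypothetical. It shows why a stream without mark support needs a BufferedInputStream wrapper before its first bytes can be inspected and then rewound.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of the library.
final class EncodingLookAhead {

    // Wraps the stream in a BufferedInputStream only if it cannot rewind by itself.
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    // Peeks at the first three bytes for a UTF-8 byte order mark, then rewinds.
    // The stream passed in must support mark/reset.
    static boolean startsWithUtf8Bom(InputStream in) {
        try {
            in.mark(3);
            int b0 = in.read();
            int b1 = in.read();
            int b2 = in.read();
            in.reset(); // leave the stream positioned at its first byte again
            return b0 == 0xEF && b1 == 0xBB && b2 == 0xBF;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'p', '>'};
        InputStream in = ensureMarkSupported(new ByteArrayInputStream(data));
        System.out.println("UTF-8 BOM present: " + startsWithUtf8Bom(in));
    }
}
```

When the encoding is already known, wrapping the stream in an InputStreamReader with an explicit charset and passing that reader to the constructor avoids this extra buffering layer.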
-
StreamedSource
Constructs a new StreamedSource object by loading the content from the specified URL.
This is equivalent to StreamedSource(url.openConnection()).
- Parameters:
url - the URL from which to load the source text.
- Throws:
IOException - if an I/O error occurs.
-
StreamedSource
Constructs a new StreamedSource object by loading the content from the specified URLConnection.
The algorithm for detecting the character encoding of the source document is identical to that described in the Source(URLConnection) constructor of the Source class.
The algorithm that determines the encoding may have to wrap the input stream in a BufferedInputStream in order to look ahead at the encoding metadata if the encoding is not specified in the HTTP headers. This extra layer of buffering will then remain in place for the life of the StreamedSource, possibly impacting memory usage and/or degrading performance. It is always preferable to use the StreamedSource(Reader) constructor if the encoding is known in advance.
- Parameters:
urlConnection - the URL connection from which to load the source text.
- Throws:
IOException - if an I/O error occurs.
-
StreamedSource
Constructs a new StreamedSource object from the specified text.
Although the CharSequence argument of this constructor apparently contradicts the notion of streaming in the source text, it can still have benefits over the equivalent use of the standard Source class.
Firstly, using the StreamedSource class to iterate the nodes of an in-memory CharSequence source document still requires much less memory than the equivalent operation using the standard Source class.
Secondly, the specified CharSequence object could possibly implement its own paging mechanism to minimise memory usage.
If the specified CharSequence is mutable, its state must not be modified while the StreamedSource is in use.
- Parameters:
text - the source text.
-
-
Method Details
-
setBuffer
Specifies an existing character array to use for buffering the incoming character stream.
The specified buffer is fixed for the life of the StreamedSource object, in contrast to the default buffer, which can be automatically replaced by a larger buffer as needed. This means that if a tag (including a comment or CDATA section) is encountered that is larger than the specified buffer, an unrecoverable BufferOverflowException is thrown. This exception is also thrown if coalescing has been enabled and a plain text segment is encountered that is larger than the specified buffer.
In general this method should only be used if there needs to be an absolute maximum memory limit imposed on the parser, where that requirement is more important than the ability to parse any source document successfully.
This method can only be called before the iterator() method has been called.
- Parameters:
buffer - an existing character array to use for buffering the incoming character stream, must not be null.
- Returns:
this StreamedSource instance, allowing multiple property setting methods to be chained in a single statement.
- Throws:
IllegalStateException - if the iterator() method has already been called.
-
setCoalescing
Specifies whether an unbroken section of plain text in the source document should always be coalesced into a single Segment by the iterator.
If this property is set to the default value of false, and a section of plain text is encountered in the document that is larger than the current buffer size, the text is chunked into multiple consecutive plain text segments in order to minimise memory usage.
If this property is set to true then chunking is disabled, ensuring that consecutive plain text segments are never generated, but instead forcing the internal buffer to expand to fit the largest section of plain text.
Note that CharacterReference segments are always handled separately from plain text, regardless of the value of this property. For this reason, algorithms that process element content almost always have to be designed to expect the text in multiple segments in order to handle character references, so there is usually no advantage in coalescing plain text segments.
- Parameters:
coalescing - the new value of the coalescing property.
- Returns:
this StreamedSource instance, allowing multiple property setting methods to be chained in a single statement.
- Throws:
IllegalStateException - if the iterator() method has already been called.
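The difference between chunked and coalesced plain text can be sketched without the library. The following JDK-only example is an illustration of the idea, not the parser's implementation: PlainTextChunker and its segments method are hypothetical names, the input is treated as pure plain text, and a StringBuilder stands in for the expanding internal buffer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the library.
final class PlainTextChunker {

    // Splits the text into segments of at most bufferSize characters,
    // or returns it as one segment when coalescing is enabled.
    static List<String> segments(Reader reader, int bufferSize, boolean coalescing) {
        if (bufferSize <= 0) throw new IllegalArgumentException("bufferSize must be positive");
        char[] buffer = new char[bufferSize];
        List<String> result = new ArrayList<>();
        StringBuilder coalesced = new StringBuilder(); // stand-in for an expanding buffer
        try {
            int filled = 0;
            int n;
            while ((n = reader.read(buffer, filled, buffer.length - filled)) != -1) {
                filled += n;
                if (filled == buffer.length) { // buffer is full: emit or accumulate
                    flush(buffer, filled, coalescing, coalesced, result);
                    filled = 0;
                }
            }
            if (filled > 0) flush(buffer, filled, coalescing, coalesced, result);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (coalesced.length() > 0) result.add(coalesced.toString());
        return result;
    }

    private static void flush(char[] buffer, int length, boolean coalescing,
                              StringBuilder coalesced, List<String> result) {
        if (coalescing) {
            coalesced.append(buffer, 0, length);
        } else {
            result.add(new String(buffer, 0, length));
        }
    }

    public static void main(String[] args) {
        System.out.println(segments(new StringReader("abcdefgh"), 3, false));
        System.out.println(segments(new StringReader("abcdefgh"), 3, true));
    }
}
```

With chunking, memory use is bounded by the buffer size at the cost of receiving the text in pieces; with coalescing, the consumer sees one segment but the memory needed grows with the longest run of plain text.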
-
close
Closes the underlying Reader or InputStream and releases any system resources associated with it.
If the stream is already closed then invoking this method has no effect.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Throws:
IOException - if an I/O error occurs.
-
getEncoding
Returns the character encoding scheme of the source byte stream used to create this object.
This method works in essentially the same way as the Source.getEncoding() method.
If the byte stream used to create this object does not support the mark method, the algorithm that determines the encoding may have to wrap it in a BufferedInputStream in order to look ahead at the encoding metadata. This extra layer of buffering will then remain in place for the life of the StreamedSource, possibly impacting memory usage and/or degrading performance. It is always preferable to use the StreamedSource(Reader) constructor if the encoding is known in advance.
The getEncodingSpecificationInfo() method returns a simple description of how the value of this method was determined.
- Returns:
the character encoding scheme of the source byte stream used to create this object, or null if the encoding is not known.
-
getEncodingSpecificationInfo
Returns a concise description of how the encoding of the source document was determined.
The description is intended for informational purposes only. It is not guaranteed to have any particular format and cannot be reliably parsed.
- Returns:
a concise description of how the encoding of the source document was determined.
-
getPreliminaryEncodingInfo
Returns the preliminary encoding of the source document together with a concise description of how it was determined.
This method works in essentially the same way as the Source.getPreliminaryEncodingInfo() method.
The description returned by this method is intended for informational purposes only. It is not guaranteed to have any particular format and cannot be reliably parsed.
- Returns:
the preliminary encoding of the source document together with a concise description of how it was determined, or null if no preliminary encoding was required.
-
iterator
Returns an iterator over every tag, character reference and plain text segment contained within the source document.
Plain text is defined as all text that is not part of a Tag or CharacterReference.
This results in a sequential walk-through of the entire source document. The end position of each segment should correspond with the begin position of the subsequent segment, unless any of the tags are enclosed by other tags. This could happen if there are server tags present in the document, or in rare circumstances where the document type declaration contains markup declarations.
Each segment generated by the iterator is parsed as the source text is streamed in. Previous segments are discarded for garbage collection.
If a section of plain text is encountered in the document that is larger than the current buffer size, the text is chunked into multiple consecutive plain text segments in order to minimise memory usage. Setting the Coalescing property to true disables chunking, ensuring that consecutive plain text segments are never generated, but instead forcing the internal buffer to expand to fit the largest section of plain text. Note that CharacterReference segments are always handled separately from plain text, regardless of whether coalescing is enabled. For this reason, algorithms that process element content almost always have to be designed to expect the text in multiple segments in order to handle character references, so there is usually no advantage in coalescing plain text segments.
Character references that are found inside tags, such as those present inside attribute values, do not generate separate segments from the iterator.
This method may only be called once on any particular StreamedSource instance.
- Example:
-
The following code demonstrates the typical (implied) usage of this method through the Iterable interface to make an exact copy of the document from reader to writer (assuming no server tags are present):

```java
StreamedSource streamedSource=new StreamedSource(reader);
for (Segment segment : streamedSource) {
  if (segment instanceof Tag) {
    Tag tag=(Tag)segment;
    // HANDLE TAG
    // Uncomment the following line to ensure each tag is valid XML:
    // writer.write(tag.tidy()); continue;
  } else if (segment instanceof CharacterReference) {
    CharacterReference characterReference=(CharacterReference)segment;
    // HANDLE CHARACTER REFERENCE
    // Uncomment the following line to decode all character references instead of copying them verbatim:
    // characterReference.appendCharTo(writer); continue;
  } else {
    // HANDLE PLAIN TEXT
  }
  // unless specific handling has prevented getting to here, simply output the segment as is:
  writer.write(segment.toString());
}
```

Note that the last line writer.write(segment.toString()) in the above code can be replaced with the following for improved performance:

```java
CharBuffer charBuffer=streamedSource.getCurrentSegmentCharBuffer();
writer.write(charBuffer.array(),charBuffer.position(),charBuffer.length());
```
-
The following code demonstrates how to process the plain text content of a specific element, in this case to print the content of every paragraph element:

```java
StreamedSource streamedSource=new StreamedSource(reader);
StringBuilder sb=new StringBuilder();
boolean insideParagraphElement=false;
for (Segment segment : streamedSource) {
  if (segment instanceof Tag) {
    Tag tag=(Tag)segment;
    if (tag.getName().equals("p")) {
      if (tag instanceof StartTag) {
        insideParagraphElement=true;
        sb.setLength(0);
      } else { // tag instanceof EndTag
        insideParagraphElement=false;
        System.out.println(sb.toString());
      }
    }
  } else if (insideParagraphElement) {
    if (segment instanceof CharacterReference) {
      ((CharacterReference)segment).appendCharTo(sb);
    } else {
      sb.append(segment);
    }
  }
}
```
- Specified by:
iterator in interface Iterable<Segment>
- Returns:
an iterator over every tag, character reference and plain text segment contained within the source document.
-
getCurrentSegment
Returns the current Segment from the iterator().
This is defined as the last Segment returned from the iterator's next() method.
This method returns null if the iterator's next() method has never been called, or its hasNext() method has returned the value false.
- Returns:
the current Segment from the iterator().
-
getCurrentSegmentCharBuffer
Returns a CharBuffer containing the source text of the current segment.
The returned CharBuffer provides a window into the internal char[] buffer including the position and length that spans the current segment.
For example, the following code writes the source text of the current segment to writer:

```java
CharBuffer charBuffer=streamedSource.getCurrentSegmentCharBuffer();
writer.write(charBuffer.array(),charBuffer.position(),charBuffer.length());
```

This may provide a performance benefit over the standard way of accessing the source text of the current segment, which is to use the CharSequence interface of the segment directly, or to call Segment.toString().
Because this CharBuffer is a direct window into the internal buffer of the StreamedSource, the contents of the CharBuffer.array() must not be modified, and the array is only guaranteed to hold the segment source text until the iterator's hasNext() or next() method is next called.
- Returns:
a CharBuffer containing the source text of the current segment.
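The window semantics can be tried out with plain JDK classes. In this sketch a CharBuffer created with CharBuffer.wrap(array, offset, length) stands in for the buffer returned by the method; that the library creates its buffer this way is an assumption, and the names CharBufferWindowDemo and copyWindow are hypothetical.

```java
import java.io.StringWriter;
import java.nio.CharBuffer;

// Hypothetical helper, not part of the library.
final class CharBufferWindowDemo {

    // Copies the characters covered by a window over a larger array, using the
    // same array()/position()/length() pattern as the example above.
    static String copyWindow(char[] internalBuffer, int begin, int length) {
        // wrap(array, offset, length) shares the array: position is the offset,
        // the remaining length is the given length, and arrayOffset() is zero.
        CharBuffer window = CharBuffer.wrap(internalBuffer, begin, length);
        StringWriter writer = new StringWriter();
        writer.write(window.array(), window.position(), window.length());
        return writer.toString();
    }

    public static void main(String[] args) {
        char[] buffer = "<p>Hello</p>".toCharArray();
        // Characters 3..7 of the array hold the text between the two tags.
        System.out.println(copyWindow(buffer, 3, 5));
    }
}
```

Since the window shares its array with the underlying buffer rather than copying it, any text that must outlive the current iteration step should be copied out first, for example with toString().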
-
isXML
public boolean isXML()
Indicates whether the source document is likely to be XML.
The algorithm used to determine this is designed to be relatively inexpensive and to provide an accurate result in most normal situations. An exact determination of whether the source document is XML would require a much more complex analysis of the text.
The algorithm is as follows:
- If the document begins with an XML declaration, it is an XML document.
- If the document begins with a document type declaration that contains the text "xhtml", it is an XHTML document, and hence also an XML document.
- If none of the above conditions are met, assume the document is normal HTML, and therefore not an XML document.
This method can only be called after the iterator() method has been called.
- Returns:
true if the source document is likely to be XML, otherwise false.
- Throws:
IllegalStateException - if the iterator() method has not yet been called.
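The three rules listed above can be approximated on a plain string as follows. This is a rough sketch rather than the library's implementation: the names XmlSniffer and looksLikeXml are hypothetical, and skipping leading whitespace, matching case-insensitively, and ending the document type declaration at the first '>' are all simplifying assumptions.

```java
import java.util.Locale;

// Hypothetical helper, not part of the library.
final class XmlSniffer {

    static boolean looksLikeXml(String documentText) {
        // Assumption: leading whitespace is ignored.
        String head = documentText.replaceFirst("^\\s+", "");
        // Rule 1: the document begins with an XML declaration.
        if (head.startsWith("<?xml")) return true;
        // Rule 2: the document begins with a document type declaration containing "xhtml".
        if (head.regionMatches(true, 0, "<!DOCTYPE", 0, 9)) {
            // Assumption: the declaration ends at the first '>' (no internal subset).
            int end = head.indexOf('>');
            String doctype = end < 0 ? head : head.substring(0, end + 1);
            return doctype.toLowerCase(Locale.ROOT).contains("xhtml");
        }
        // Rule 3: otherwise assume normal HTML.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeXml("<?xml version=\"1.0\"?><root/>"));
        System.out.println(looksLikeXml("<!DOCTYPE html><html></html>"));
    }
}
```

A check of this kind is cheap because it only inspects the start of the document, which is also why it can only indicate that a document is likely to be XML.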
-
setLogger
Sets the Logger that handles log messages.
Specifying a null argument disables logging completely for operations performed on this StreamedSource object.
A logger instance is created automatically for each StreamedSource object in the same way as is described in the Source.setLogger(Logger) method.
- Parameters:
logger - the logger that will handle log messages, or null to disable logging.
-
getLogger
Returns the Logger that handles log messages.
A logger instance is created automatically for each StreamedSource object using the LoggerProvider specified by the static Config.LoggerProvider property. This can be overridden by calling the setLogger(Logger) method. The name used for all automatically created logger instances is "net.htmlparser.jericho".
- Returns:
the Logger that handles log messages, or null if logging is disabled.
-
getBufferSize
public int getBufferSize()
Returns the current size of the internal character buffer.
This information is generally useful only for investigating memory and performance issues.
- Returns:
the current size of the internal character buffer.
-
toString
Returns a string representation of the object as generated by the default Object.toString() implementation.
In contrast to the Source.toString() implementation, it is generally not possible for this method to return the entire source text.
finalize
protected void finalize()
Called by the garbage collector on an object when garbage collection determines that there are no more references to the object.
This implementation calls the close() method if the underlying Reader or InputStream was created internally.
-