Class COSParser

    • Field Detail

      • ENDSTREAM

        public static final byte[] ENDSTREAM
      • ENDOBJ

        public static final byte[] ENDOBJ
      • SYSPROP_PARSEMINIMAL

        public static final String SYSPROP_PARSEMINIMAL
        Only parse the PDF file minimally allowing access to basic information.
        See Also:
        Constant Field Values
      • SYSPROP_EOFLOOKUPRANGE

        public static final String SYSPROP_EOFLOOKUPRANGE
        The range within which the %%EOF marker will be searched. Useful if there are additional characters after %%EOF within the PDF.
        See Also:
        Constant Field Values
      • EOF_MARKER

        protected static final char[] EOF_MARKER
        EOF-marker.
      • OBJ_MARKER

        protected static final char[] OBJ_MARKER
        obj-marker.
      • fileLen

        protected long fileLen
        file length.
      • initialParseDone

        protected boolean initialParseDone
      • securityHandler

        protected SecurityHandler securityHandler
        The security handler.
      • xrefTrailerResolver

        protected XrefTrailerResolver xrefTrailerResolver
        Collects all Xref/trailer objects and resolves them into a single object using the startxref reference.
    • Constructor Detail

    • Method Detail

      • setEOFLookupRange

        public void setEOFLookupRange​(int byteCount)
        Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.

        We check that the new value is at least 16. However, for practical use cases this value should not be lower than 1000; even 2000 was found to be insufficient in some cases where trailing garbage such as HTML snippets followed the EOF marker.

        If the system property SYSPROP_EOFLOOKUPRANGE is defined, its value is applied on initialization but can be overridden later.

        Parameters:
        byteCount - number of trailing bytes
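
        The rules above can be sketched in a few lines. This is only an illustration, not the class's actual source: the default of 2048, the property key "example.eofLookupRange", the getter, and the choice to silently ignore values below 16 are all assumptions made for the example.

```java
final class EofLookupRangeSketch {
    // Hypothetical default; this section does not state the real DEFAULT_TRAIL_BYTECOUNT.
    static final int DEFAULT_TRAIL_BYTECOUNT = 2048;
    // Hypothetical key standing in for the value of SYSPROP_EOFLOOKUPRANGE.
    static final String SYSPROP_EOFLOOKUPRANGE = "example.eofLookupRange";

    private int readTrailBytes = DEFAULT_TRAIL_BYTECOUNT;

    EofLookupRangeSketch() {
        // Apply the system property on initialization, if it is defined and numeric.
        String prop = System.getProperty(SYSPROP_EOFLOOKUPRANGE);
        if (prop != null) {
            try {
                setEOFLookupRange(Integer.parseInt(prop));
            } catch (NumberFormatException e) {
                // Not a number: keep the default.
            }
        }
    }

    // Accepts only values of at least 16; smaller values are ignored (assumption).
    public void setEOFLookupRange(int byteCount) {
        if (byteCount >= 16) {
            readTrailBytes = byteCount;
        }
    }

    // Hypothetical accessor, added so the effect can be observed.
    public int getEOFLookupRange() {
        return readTrailBytes;
    }
}
```

        A later call to setEOFLookupRange always wins over the system property, which matches the "can be overridden later" behaviour described above.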
      • parseXref

        protected COSDictionary parseXref​(long startXRefOffset)
                                   throws IOException
        Parses cross reference tables.
        Parameters:
        startXRefOffset - start offset of the first table
        Returns:
        the trailer dictionary
        Throws:
        IOException - if something went wrong
      • getStartxrefOffset

        protected final long getStartxrefOffset()
                                         throws IOException
        Looks for and parses startxref. We first look for the last '%%EOF' marker (within the last DEFAULT_TRAIL_BYTECOUNT bytes, or the range set via setEOFLookupRange(int)) and then go back to find startxref.
        Returns:
        the offset of StartXref
        Throws:
        IOException - If something went wrong.
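
        A minimal sketch of this two-step backward lookup, working on an in-memory byte array instead of the parser's input source. The class name, method name and error messages are hypothetical, and the sketch assumes that the returned offset is the file position of the 'startxref' keyword itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class StartxrefSketch {
    // Returns the file offset of the last 'startxref' keyword that precedes
    // the last '%%EOF' marker found in the trailing trailByteCount bytes.
    static long findStartxrefOffset(byte[] file, int trailByteCount) throws IOException {
        int tailStart = Math.max(0, file.length - trailByteCount);
        // ISO-8859-1 maps every byte to exactly one char, so string indices equal byte offsets.
        String tail = new String(file, tailStart, file.length - tailStart,
                StandardCharsets.ISO_8859_1);

        // Step 1: last '%%EOF' inside the lookup range.
        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            throw new IOException("Missing '%%EOF' marker in the last " + trailByteCount + " bytes");
        }
        // Step 2: go back from there to the nearest 'startxref'.
        int keyword = tail.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            throw new IOException("Missing 'startxref' marker");
        }
        return (long) tailStart + keyword;
    }
}
```

        Trailing garbage after '%%EOF' shrinks the useful part of the lookup range, which is why a larger range may be needed for such files.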
      • lastIndexOf

        protected int lastIndexOf​(char[] pattern,
                                  byte[] buf,
                                  int endOff)
        Searches for the last occurrence of pattern within the buffer. The lookup starts before endOff and proceeds backwards to offset 0.
        Parameters:
        pattern - pattern to search for
        buf - buffer to search pattern in
        endOff - offset (exclusive) where lookup starts at
        Returns:
        start offset of pattern within buffer or -1 if pattern could not be found
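
        The contract above can be met by a simple backward scan. The following is a sketch written from the description only, not the method's actual body; it assumes endOff is no larger than buf.length and that the pattern consists of single-byte (ASCII) characters.

```java
final class LastIndexOfSketch {
    // Returns the start offset of the last occurrence of pattern that ends
    // before endOff (exclusive), or -1 if there is none.
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        // The latest possible start leaves room for the whole pattern before endOff.
        for (int start = endOff - pattern.length; start >= 0; start--) {
            int i = 0;
            while (i < pattern.length && buf[start + i] == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start; // full match found
            }
        }
        return -1;
    }
}
```

        Passing the offset of a previous hit as endOff continues the search with the next earlier occurrence.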
      • isLenient

        public boolean isLenient()
        Returns true if the parser is lenient, meaning that the auto-healing capabilities of the parser are used.
        Returns:
        true if parser is lenient
      • setLenient

        public void setLenient​(boolean lenient)
        Change the parser leniency flag. This method can only be called before the parsing of the file.
        Parameters:
        lenient - try to handle malformed PDFs.
      • parseDictObjects

        protected void parseDictObjects​(COSDictionary dict,
                                        COSName... excludeObjects)
                                 throws IOException
        Will parse every object necessary to load a single page from the pdf document. We try our best to order objects according to offset in file before reading to minimize seek operations.
        Parameters:
        dict - the COSObject from the parent pages.
        excludeObjects - dictionary object reference entries with these names will not be parsed
        Throws:
        IOException - if something went wrong
      • parseObjectDynamically

        protected final COSBase parseObjectDynamically​(COSObject obj,
                                                       boolean requireExistingNotCompressedObj)
                                                throws IOException
        This will parse the next object from the stream and add it to the local state.
        Parameters:
        obj - object to be parsed (we only take object number and generation number for lookup start offset)
        requireExistingNotCompressedObj - if true object to be parsed must not be contained within compressed stream
        Returns:
        the parsed object (which is also added to document object)
        Throws:
        IOException - If an IO error occurs.
      • parseObjectDynamically

        protected COSBase parseObjectDynamically​(long objNr,
                                                 int objGenNr,
                                                 boolean requireExistingNotCompressedObj)
                                          throws IOException
        This will parse the next object from the stream and add it to the local state. It's reduced to parsing an indirect object.
        Parameters:
        objNr - object number of object to be parsed
        objGenNr - object generation number of object to be parsed
        requireExistingNotCompressedObj - if true the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within object stream (this is used to circumvent being stuck in a loop in a malicious PDF)
        Returns:
        the parsed object (which is also added to document object)
        Throws:
        IOException - If an IO error occurs.
      • parseCOSStream

        protected COSStream parseCOSStream​(COSDictionary dic)
                                    throws IOException
        This will read a COSStream from the input stream using the length attribute within the dictionary. If the length attribute is an indirect reference, it is first resolved to get the stream length. This means we copy stream data without testing for 'endstream' or 'endobj', and thus it is no problem if these keywords occur within the stream. We require 'endstream' to be found after the stream data is read.
        Parameters:
        dic - dictionary that goes with this stream.
        Returns:
        parsed pdf stream.
        Throws:
        IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
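
        The length-driven approach can be illustrated with plain java.io streams. This is a sketch under simplifying assumptions, not the method's implementation: the length is passed in already resolved, the class and method names are hypothetical, and only CR, LF and space are skipped before the keyword.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class StreamByLengthSketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);

    // Copies exactly 'length' bytes without inspecting them, then requires
    // the 'endstream' keyword (after optional whitespace).
    static byte[] readStreamData(InputStream in, int length) throws IOException {
        byte[] data = new byte[length];
        // readFully throws EOFException (an IOException) if the stream is too short.
        new DataInputStream(in).readFully(data);

        // Skip end-of-line bytes between the data and the keyword.
        int b;
        do {
            b = in.read();
        } while (b == '\r' || b == '\n' || b == ' ');

        // The keyword must follow immediately; a wrong length shows up here.
        for (int i = 0; i < ENDSTREAM.length; i++) {
            if (b != ENDSTREAM[i]) {
                throw new IOException("Stream does not end with 'endstream' after " + length + " bytes");
            }
            if (i + 1 < ENDSTREAM.length) {
                b = in.read();
            }
        }
        return data;
    }
}
```

        Because the data is never scanned, an 'endstream' that appears inside the payload is copied like any other bytes; only the keyword after the declared length is checked.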
      • rebuildTrailer

        protected final COSDictionary rebuildTrailer()
                                              throws IOException
        Rebuild the trailer dictionary if startxref can't be found.
        Returns:
        the rebuilt trailer dictionary
        Throws:
        IOException - if something went wrong
      • parseStartXref

        protected long parseStartXref()
                               throws IOException
        This will parse the startxref section from the stream. The startxref value is ignored.
        Returns:
        the startxref value or -1 on parsing error
        Throws:
        IOException - If an IO error occurs.
      • parseTrailer

        protected boolean parseTrailer()
                                throws IOException
        This will parse the trailer from the stream and add it to the state.
        Returns:
        false on parsing error
        Throws:
        IOException - If an IO error occurs.
      • parsePDFHeader

        protected boolean parsePDFHeader()
                                  throws IOException
        Parse the header of a PDF.
        Returns:
        true if a PDF header was found
        Throws:
        IOException - if something went wrong
      • parseFDFHeader

        protected boolean parseFDFHeader()
                                  throws IOException
        Parse the header of an FDF.
        Returns:
        true if an FDF header was found
        Throws:
        IOException - if something went wrong
      • parseXrefTable

        protected boolean parseXrefTable​(long startByteOffset)
                                  throws IOException
        This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored.
        Parameters:
        startByteOffset - the offset to start at
        Returns:
        false on parsing error
        Throws:
        IOException - If an IO error occurs.
      • parseXrefStream

        public void parseXrefStream​(COSStream stream,
                                    long objByteOffset,
                                    boolean isStandalone)
                             throws IOException
        Fills XRefTrailerResolver with data of given stream. Stream must be of type XRef.
        Parameters:
        stream - the stream to be read
        objByteOffset - the offset to start at
        isStandalone - should be set to true if the stream is not part of a hybrid xref table
        Throws:
        IOException - if there is an error parsing the stream
      • getDocument

        public COSDocument getDocument()
                                throws IOException
        This will get the document that was parsed. parse() must be called before this is called. When you are done with this document you must call close() on it to release resources.
        Returns:
        The document that was parsed.
        Throws:
        IOException - If there is an error getting the document.
      • parseTrailerValuesDynamically

        protected COSBase parseTrailerValuesDynamically​(COSDictionary trailer)
                                                 throws IOException
        Parse the values of the trailer dictionary and return the root object.
        Parameters:
        trailer - The trailer dictionary.
        Returns:
        The parsed root object
        Throws:
        IOException - If an IO error occurs or if the root object is missing in the trailer dictionary