Class PDFTextStripperByArea


  • public class PDFTextStripperByArea
    extends PDFTextStripper
    This will extract text from a specified region in the PDF.
    • Constructor Detail

      • PDFTextStripperByArea

        public PDFTextStripperByArea()
                              throws IOException
        Constructor.
        Throws:
        IOException - If there is an error loading properties.
    • Method Detail

      • setShouldSeparateByBeads

        public final void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
        This method does nothing in this derived class, because beads and regions are incompatible. Beads are ignored when stripping by area.
        Overrides:
        setShouldSeparateByBeads in class PDFTextStripper
        Parameters:
        aShouldSeparateByBeads - The new grouping of beads (ignored by this class).
      • addRegion

        public void addRegion(String regionName,
                              RectF rect)
        Add a new region to group text by.
        Parameters:
        regionName - The name of the region.
        rect - The rectangular area to retrieve the text from. The y-coordinates are Java coordinates (y == 0 is the top of the page), not PDF coordinates (y == 0 is the bottom).
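        As a sketch of the coordinate note above: when a rectangle is known in PDF coordinates, its top-based y can be obtained by subtracting the PDF y from the page height. The helper below is hypothetical (RegionCoordinates and toTopBasedY are not part of the library) and assumes an unrotated page whose lower-left corner sits at the origin; the 792-point page height is only an example value.

```java
/** Hypothetical helper for illustration; not part of the library. */
final class RegionCoordinates {
    private RegionCoordinates() {
    }

    /**
     * Converts a y value measured upward from the page bottom (PDF style)
     * into a y value measured downward from the page top (Java style).
     * Assumes an unrotated page with its lower-left corner at (0, 0).
     */
    static float toTopBasedY(float pageHeight, float pdfY) {
        return pageHeight - pdfY;
    }

    public static void main(String[] args) {
        float pageHeight = 792f;   // example value: US Letter height in points
        float pdfUpperEdge = 692f; // upper edge of the box, measured from the page bottom
        float boxHeight = 50f;

        float top = toTopBasedY(pageHeight, pdfUpperEdge);
        float bottom = top + boxHeight; // in top-based coordinates, bottom > top
        System.out.println("top=" + top + ", bottom=" + bottom);
    }
}
```

        The resulting top and bottom values, together with the unchanged x-coordinates, would then describe the rect passed to addRegion.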
      • removeRegion

        public void removeRegion(String regionName)
        Remove a region from the set that text is grouped by. If the region does not exist, this method does nothing.
        Parameters:
        regionName - The name of the region to delete.
      • getRegions

        public List<String> getRegions()
        Get the list of regions that have been set up.
        Returns:
        A list of the region names, as java.lang.String objects.
      • getTextForRegion

        public String getTextForRegion(String regionName)
        Get the text for the region. Call this only after extractRegions().
        Parameters:
        regionName - The name of the region to get the text from.
        Returns:
        The text that was identified in that region.
      • extractRegions

        public void extractRegions(PDPage page)
                            throws IOException
        Process the page to extract the region text.
        Parameters:
        page - The page to extract the regions from.
        Throws:
        IOException - If there is an error while extracting text.
      • processTextPosition

        protected void processTextPosition(TextPosition text)
        This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
        Overrides:
        processTextPosition in class PDFTextStripper
        Parameters:
        text - The text to process.
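        To illustrate how text can be grouped by region, the following toy model appends each glyph to every region whose rectangle contains the glyph's position. It is an assumption about the general approach, not the library's code: RegionGroupingSketch, Rect and processGlyph are hypothetical names, and real TextPosition handling (overlap detection, ordering, spacing) is omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model with hypothetical names; it does not use the library's classes. */
final class RegionGroupingSketch {

    /** Axis-aligned rectangle in top-based coordinates (y == 0 is the page top). */
    static final class Rect {
        final float left;
        final float top;
        final float right;
        final float bottom;

        Rect(float left, float top, float right, float bottom) {
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }

        boolean contains(float x, float y) {
            return x >= left && x < right && y >= top && y < bottom;
        }
    }

    private final Map<String, Rect> regions = new LinkedHashMap<>();
    private final Map<String, StringBuilder> textByRegion = new LinkedHashMap<>();

    void addRegion(String name, Rect rect) {
        regions.put(name, rect);
        textByRegion.put(name, new StringBuilder());
    }

    /** Appends the glyph to every region whose rectangle contains (x, y). */
    void processGlyph(String unicode, float x, float y) {
        for (Map.Entry<String, Rect> entry : regions.entrySet()) {
            if (entry.getValue().contains(x, y)) {
                textByRegion.get(entry.getKey()).append(unicode);
            }
        }
    }

    /** Returns the collected text, or null if the region was never added. */
    String getTextForRegion(String name) {
        StringBuilder text = textByRegion.get(name);
        return text == null ? null : text.toString();
    }

    public static void main(String[] args) {
        RegionGroupingSketch sketch = new RegionGroupingSketch();
        sketch.addRegion("header", new Rect(0f, 0f, 600f, 100f));
        sketch.processGlyph("H", 10f, 20f);  // inside the header rectangle
        sketch.processGlyph("i", 20f, 20f);  // inside the header rectangle
        sketch.processGlyph("x", 10f, 500f); // outside, so it is dropped
        System.out.println(sketch.getTextForRegion("header"));
    }
}
```

        In this model a glyph that falls inside two overlapping rectangles is recorded in both regions, and a glyph outside every rectangle is discarded.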
      • writePage

        protected void writePage()
                          throws IOException
        This will print the processed page text to the output stream.
        Overrides:
        writePage in class PDFTextStripper
        Throws:
        IOException - If there is an error writing the text.
      • showGlyph

        protected void showGlyph(Matrix textRenderingMatrix,
                                 PDFont font,
                                 int code,
                                 String unicode,
                                 Vector displacement)
                          throws IOException
        Called when a glyph is to be processed. The heuristic calculations here were originally written by Ben Litchfield for PDFStreamEngine.
        Overrides:
        showGlyph in class PDFStreamEngine
        Parameters:
        textRenderingMatrix - the current text rendering matrix, Trm
        font - the current font
        code - internal PDF character code for the glyph
        unicode - the Unicode text for this glyph, or null if the PDF does not provide it
        displacement - the displacement (i.e. advance) of the glyph in text space
        Throws:
        IOException - if the glyph cannot be processed
      • computeFontHeight

        protected float computeFontHeight(PDFont font)
                                   throws IOException
        Compute the font height. Override this method if you want to use your own calculations.
        Parameters:
        font - the font.
        Returns:
        the font height.
        Throws:
        IOException - if there is an error while getting the font bounding box.