Package com.tom_roush.pdfbox.text
Class PDFMarkedContentExtractor
java.lang.Object
  com.tom_roush.pdfbox.contentstream.PDFStreamEngine
    com.tom_roush.pdfbox.text.PDFMarkedContentExtractor
public class PDFMarkedContentExtractor extends PDFStreamEngine
This is a stream engine that extracts the marked content of a PDF.
-
-
Constructor Summary
Constructors
Constructor | Description
PDFMarkedContentExtractor() | Instantiate a new PDFMarkedContentExtractor object.
PDFMarkedContentExtractor(String encoding) | Constructor.
-
Method Summary
Modifier and Type | Method | Description
void | beginMarkedContentSequence(COSName tag, COSDictionary properties) | Called when a marked content group begins
protected float | computeFontHeight(PDFont font) | Compute the font height.
void | endMarkedContentSequence() | Called when a marked content group ends
List<PDMarkedContent> | getMarkedContents() |
boolean | isSuppressDuplicateOverlappingText() |
void | processPage(PDPage page) | This will initialize and process the contents of the stream.
protected void | processTextPosition(TextPosition text) | This will process a TextPosition object and add the text to the list of characters on a page.
void | setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText) | By default the class will attempt to remove text that overlaps each other.
protected void | showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) | Called when a glyph is to be processed.
void | xobject(PDXObject xobject) |
Methods inherited from class com.tom_roush.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
-
-
-
Constructor Detail
-
PDFMarkedContentExtractor
public PDFMarkedContentExtractor() throws IOException
Instantiate a new PDFMarkedContentExtractor object.
- Throws:
IOException
-
PDFMarkedContentExtractor
public PDFMarkedContentExtractor(String encoding) throws IOException
Constructor. Will apply encoding-specific conversions to the output text.
- Parameters:
encoding - The encoding that the output will be written in.
- Throws:
IOException
-
-
Method Detail
-
isSuppressDuplicateOverlappingText
public boolean isSuppressDuplicateOverlappingText()
- Returns:
- the suppressDuplicateOverlappingText setting.
-
setSuppressDuplicateOverlappingText
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
By default, the class attempts to remove duplicate text that overlaps. Word, for example, paints the same character several times to make it look bold. Setting this to false extracts all text, so some sections will be duplicated, but extraction is faster.
- Parameters:
suppressDuplicateOverlappingText - The suppressDuplicateOverlappingText setting to set.
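The effect of this setting can be illustrated with a small, library-independent sketch. It is not the PDFBox implementation: `Glyph` and `OverlapFilter` are hypothetical stand-ins, and the tolerance of one third of the glyph width is an assumption chosen for illustration. The idea is that a glyph is dropped when the same character has already been seen at nearly the same position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a positioned glyph; not a PDFBox type. */
final class Glyph {
    final String unicode;
    final float x;
    final float y;
    final float width;

    Glyph(String unicode, float x, float y, float width) {
        this.unicode = unicode;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

/** Simplified model of duplicate-overlap suppression (illustrative only). */
final class OverlapFilter {
    // Previously accepted glyphs, grouped by their Unicode text.
    private final Map<String, List<Glyph>> seen = new HashMap<>();
    private final boolean suppress;

    OverlapFilter(boolean suppress) {
        this.suppress = suppress;
    }

    /** Returns true if the glyph should be kept in the output. */
    boolean accept(Glyph g) {
        if (!suppress) {
            return true; // no bookkeeping at all, which is why this path is cheaper
        }
        List<Glyph> same = seen.computeIfAbsent(g.unicode, k -> new ArrayList<>());
        float tolerance = g.width / 3.0f; // assumed tolerance, for illustration
        for (Glyph prev : same) {
            if (Math.abs(prev.x - g.x) < tolerance && Math.abs(prev.y - g.y) < tolerance) {
                return false; // same character at (almost) the same place: a duplicate
            }
        }
        same.add(g);
        return true;
    }

    /** Concatenates the text of all glyphs that pass the filter. */
    static String extract(List<Glyph> glyphs, boolean suppress) {
        OverlapFilter filter = new OverlapFilter(suppress);
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (filter.accept(g)) {
                sb.append(g.unicode);
            }
        }
        return sb.toString();
    }
}
```

With a simulated bold "H" painted twice at x = 10.0 and x = 10.3, followed by "i" and a second, distant "H", `extract(glyphs, true)` yields "HiH" while `extract(glyphs, false)` yields "HHiH".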
-
beginMarkedContentSequence
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
Description copied from class: PDFStreamEngine
Called when a marked content group begins
- Overrides:
beginMarkedContentSequence in class PDFStreamEngine
- Parameters:
tag - indicates the role or significance of the sequence
properties - optional properties
-
endMarkedContentSequence
public void endMarkedContentSequence()
Description copied from class: PDFStreamEngine
Called when a marked content group ends
- Overrides:
endMarkedContentSequence in class PDFStreamEngine
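One way to picture how paired begin/end callbacks can build a nested list such as the one returned by getMarkedContents() is a stack of open groups. The sketch below is a plain-Java model under stated assumptions, not the library's code: `MarkedNode` and `MarkedContentCollector` are hypothetical names, tags are plain strings rather than COSName objects, properties are omitted, and text that arrives outside any open group is assumed to be ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for a marked-content node; not the PDFBox PDMarkedContent class. */
final class MarkedNode {
    final String tag;
    // Holds text strings and nested MarkedNode children, in content-stream order.
    final List<Object> contents = new ArrayList<>();

    MarkedNode(String tag) {
        this.tag = tag;
    }
}

/** Illustrative collector that nests groups using a stack of open sequences. */
final class MarkedContentCollector {
    private final List<MarkedNode> topLevel = new ArrayList<>();
    private final Deque<MarkedNode> open = new ArrayDeque<>();

    /** Models beginMarkedContentSequence: open a new group inside the current one. */
    void begin(String tag) {
        MarkedNode node = new MarkedNode(tag);
        if (open.isEmpty()) {
            topLevel.add(node);              // no enclosing group: a top-level entry
        } else {
            open.peek().contents.add(node);  // nested inside the innermost open group
        }
        open.push(node);
    }

    /** Models endMarkedContentSequence: close the innermost open group. */
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    /** Models text arriving between begin and end (assumption: dropped if no group is open). */
    void text(String s) {
        if (!open.isEmpty()) {
            open.peek().contents.add(s);
        }
    }

    List<MarkedNode> getMarkedContents() {
        return topLevel;
    }
}
```

Calling `begin("P")`, `text("Hello")`, `begin("Span")`, `text("world")`, `end()`, `end()` leaves one top-level "P" node whose contents are the string "Hello" followed by a nested "Span" node.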
-
xobject
public void xobject(PDXObject xobject)
-
processTextPosition
protected void processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
- Parameters:
text - The text to process.
-
getMarkedContents
public List<PDMarkedContent> getMarkedContents()
-
processPage
public void processPage(PDPage page) throws IOException
This will initialize and process the contents of the stream.
- Overrides:
processPage in class PDFStreamEngine
- Parameters:
page - the page to process
- Throws:
IOException - if there is an error accessing the stream.
-
showGlyph
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
Called when a glyph is to be processed. The heuristic calculations here were originally written by Ben Litchfield for PDFStreamEngine.
- Overrides:
showGlyph in class PDFStreamEngine
- Parameters:
textRenderingMatrix - the current text rendering matrix, Trm
font - the current font
code - internal PDF character code for the glyph
unicode - the Unicode text for this glyph, or null if the PDF does not provide it
displacement - the displacement (i.e. advance) of the glyph in text space
- Throws:
IOException - if the glyph cannot be processed
-
computeFontHeight
protected float computeFontHeight(PDFont font) throws IOException
Compute the font height. Override this if you want to use your own calculations.
- Parameters:
font - the font.
- Returns:
- the font height.
- Throws:
IOException - if there is an error while getting the font bounding box.
-
-