org.opencms.search.extractors
Class CmsExtractorMsOfficeOOXML

java.lang.Object
  extended by org.opencms.search.extractors.A_CmsTextExtractor
      extended by org.opencms.search.extractors.CmsExtractorMsOfficeOOXML
All Implemented Interfaces:
I_CmsTextExtractor

public final class CmsExtractorMsOfficeOOXML
extends A_CmsTextExtractor

Extracts text data from a VFS resource that is an OOXML MS Office document.

Supported formats are MS Word (.docx), MS PowerPoint (.pptx) and MS Excel (.xlsx).

The OLE 2 format was introduced in Microsoft Office version 97 and remained the default format until Office version 2007 introduced the new XML-based OOXML format.
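
As a rough illustration of the supported formats listed above, the following sketch maps a file name to one of the three OOXML document types by its extension. The class name OoxmlFormatSketch and the method formatOf are hypothetical and not part of OpenCms; how OpenCms itself selects an extractor for a resource is not shown here.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of OpenCms: maps file extensions to OOXML formats. */
final class OoxmlFormatSketch {

    // The three formats named in the class description above.
    private static final Map<String, String> FORMATS = Map.of(
        "docx", "MS Word",
        "pptx", "MS PowerPoint",
        "xlsx", "MS Excel");

    private OoxmlFormatSketch() {
        // static helper only
    }

    /** Returns the format name for the given file name, or null if it is not a supported OOXML type. */
    static String formatOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null; // no extension at all
        }
        // Compare case-insensitively so "Budget.XLSX" is recognized too.
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return FORMATS.get(extension);
    }

    public static void main(String[] args) {
        System.out.println(formatOf("report.docx")); // prints "MS Word"
        System.out.println(formatOf("legacy.doc"));  // prints "null" (OLE 2, not OOXML)
    }
}
```

Older OLE 2 files such as .doc are deliberately left out of the map, since this extractor only covers the OOXML formats.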

Since:
8.0.1

Method Summary
 I_CmsExtractionResult extractText(java.io.InputStream in)
          Extracts the text and meta information from the document on the input stream.
static I_CmsTextExtractor getExtractor()
          Returns an instance of this text extractor.
 
Methods inherited from class org.opencms.search.extractors.A_CmsTextExtractor
combineContentItem, extractText, extractText, extractText, extractText, removeControlChars
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

getExtractor

public static I_CmsTextExtractor getExtractor()
Returns an instance of this text extractor.

Returns:
an instance of this text extractor

extractText

public I_CmsExtractionResult extractText(java.io.InputStream in)
                                  throws java.lang.Exception
Description copied from interface: I_CmsTextExtractor
Extracts the text and meta information from the document on the input stream.

The encoding of the input stream is either not required (the document type may have one common default encoding) or the extractor is able to determine the encoding from the provided input stream automatically.

Delivers the same result as calling I_CmsTextExtractor.extractText(InputStream, String) when String == null.

Specified by:
extractText in interface I_CmsTextExtractor
Overrides:
extractText in class A_CmsTextExtractor
Parameters:
in - the input stream for the document to extract the text from
Returns:
the extracted text and meta information
Throws:
java.lang.Exception - if the text extraction fails
See Also:
I_CmsTextExtractor.extractText(java.io.InputStream, java.lang.String)
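
The documented contract, that the one-argument call delivers the same result as the two-argument call with a null encoding, can be sketched with stand-in types. Everything in this block is hypothetical: TextExtractorSketch and PlainTextExtractorSketch are not OpenCms classes, the return type is simplified to a String instead of I_CmsExtractionResult, and the UTF-8 fallback is an assumption made only for this plain-text illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical plain-text extractor used only to illustrate the overload contract. */
final class PlainTextExtractorSketch implements TextExtractorSketch {

    @Override
    public String extractText(InputStream in) throws Exception {
        // No encoding supplied: behave exactly like the two-argument call with null.
        return extractText(in, null);
    }

    @Override
    public String extractText(InputStream in, String encoding) throws Exception {
        // Assumption for this sketch: fall back to UTF-8 when no encoding is given.
        Charset charset = (encoding == null) ? StandardCharsets.UTF_8 : Charset.forName(encoding);
        return new String(in.readAllBytes(), charset);
    }

    /** Runs both overloads on the same bytes and reports whether the results match. */
    static boolean overloadsAgree(byte[] data) {
        TextExtractorSketch extractor = new PlainTextExtractorSketch();
        try {
            String viaShortForm = extractor.extractText(new ByteArrayInputStream(data));
            String viaNullEncoding = extractor.extractText(new ByteArrayInputStream(data), null);
            return viaShortForm.equals(viaNullEncoding);
        } catch (Exception e) {
            throw new IllegalStateException("extraction failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "Quarterly report".getBytes(StandardCharsets.UTF_8);
        System.out.println(overloadsAgree(data)); // prints "true"
    }
}

/** Hypothetical stand-in for the two extractText overloads; NOT the OpenCms interface. */
interface TextExtractorSketch {

    String extractText(InputStream in) throws Exception;

    String extractText(InputStream in, String encoding) throws Exception;
}
```

With OpenCms on the classpath, the corresponding call based on the signatures documented on this page would be CmsExtractorMsOfficeOOXML.getExtractor().extractText(in), which returns an I_CmsExtractionResult and may throw java.lang.Exception.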