Class ApacheTikaDocumentParser
java.lang.Object
dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParser
- All Implemented Interfaces:
dev.langchain4j.data.document.DocumentParser
public class ApacheTikaDocumentParser
extends Object
implements dev.langchain4j.data.document.DocumentParser
Parses files into Documents using the Apache Tika library, automatically detecting the file format.
This parser supports various file formats, including PDF, DOC, PPT, and XLS.
For detailed information on supported formats, please refer to the Apache Tika documentation.
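As an illustration of the description above, here is a minimal sketch that parses one file with the default constructor. The class name `TikaParseExample` and the helper `parseFile` are hypothetical names chosen for this example; the sketch assumes the langchain4j Apache Tika parser module and its Tika dependencies are on the classpath.

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParser;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the library.
class TikaParseExample {

    // Opens the file, lets Tika detect the format, and returns the parsed Document.
    static Document parseFile(Path path) throws IOException {
        DocumentParser parser = new ApacheTikaDocumentParser();
        try (InputStream in = Files.newInputStream(path)) {
            return parser.parse(in);
        }
    }
}
```

The extracted text is then available through `Document.text()`. No format is passed in anywhere; detection is left to Tika.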
Field Summary
Fields
Modifier and Type                                                   Field
static final Supplier<ContentHandler>                               DEFAULT_CONTENT_HANDLER_SUPPLIER
static final Supplier<org.apache.tika.metadata.Metadata>            DEFAULT_METADATA_SUPPLIER
static final Supplier<org.apache.tika.parser.ParseContext>          DEFAULT_PARSE_CONTEXT_SUPPLIER
static final Supplier<org.apache.tika.parser.Parser>                DEFAULT_PARSER_SUPPLIER
Constructor Summary
Constructors
Constructor and Description
ApacheTikaDocumentParser()
Creates an instance of an ApacheTikaDocumentParser with the default Tika components.
ApacheTikaDocumentParser(Supplier<org.apache.tika.parser.Parser> parserSupplier, Supplier<ContentHandler> contentHandlerSupplier, Supplier<org.apache.tika.metadata.Metadata> metadataSupplier, Supplier<org.apache.tika.parser.ParseContext> parseContextSupplier)
Creates an instance of an ApacheTikaDocumentParser with the provided suppliers for Tika components.
ApacheTikaDocumentParser(org.apache.tika.parser.Parser parser, ContentHandler contentHandler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext parseContext)
Deprecated. Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files.
Method Summary
Modifier and Type                             Method
dev.langchain4j.data.document.Document        parse(InputStream inputStream)
-
Field Details
-
DEFAULT_PARSER_SUPPLIER
-
DEFAULT_METADATA_SUPPLIER
-
DEFAULT_PARSE_CONTEXT_SUPPLIER
-
DEFAULT_CONTENT_HANDLER_SUPPLIER
-
-
Constructor Details
-
ApacheTikaDocumentParser
public ApacheTikaDocumentParser()
Creates an instance of an ApacheTikaDocumentParser with the default Tika components. It uses an AutoDetectParser, a BodyContentHandler without a write limit, an empty Metadata, and an empty ParseContext.
ApacheTikaDocumentParser
@Deprecated
public ApacheTikaDocumentParser(org.apache.tika.parser.Parser parser, ContentHandler contentHandler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext parseContext)
Deprecated. Use the constructor with suppliers for Tika components if you intend to use this parser for multiple files.
Creates an instance of an ApacheTikaDocumentParser with the provided Tika components. If some of the components are not provided (null), the defaults will be used.
- Parameters:
parser - Tika parser to use. Default: AutoDetectParser
contentHandler - Tika content handler. Default: BodyContentHandler without write limit
metadata - Tika metadata. Default: empty Metadata
parseContext - Tika parse context. Default: empty ParseContext
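The deprecation note recommends suppliers when one parser handles multiple files, but does not spell out why. One plausible reading is that components such as a content handler accumulate state during a parse, so a single shared instance carries text from one file into the next. The sketch below illustrates that general pattern with JDK types only. All three classes are hypothetical stand-ins written for this example; they are not Tika or langchain4j classes and do not reproduce the library's implementation.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a stateful component that buffers everything it receives.
final class AccumulatingHandler {
    private final StringBuilder buffer = new StringBuilder();

    void characters(String text) {
        buffer.append(text);
    }

    @Override
    public String toString() {
        return buffer.toString();
    }
}

// Holds one handler instance for its whole lifetime, so output leaks between calls.
final class SharedInstanceParser {
    private final AccumulatingHandler handler;

    SharedInstanceParser(AccumulatingHandler handler) {
        this.handler = handler;
    }

    String parse(String input) {
        handler.characters(input);
        return handler.toString();
    }
}

// Asks a supplier for a fresh handler on every call, so each parse starts clean.
final class SupplierBackedParser {
    private final Supplier<AccumulatingHandler> handlerSupplier;

    SupplierBackedParser(Supplier<AccumulatingHandler> handlerSupplier) {
        this.handlerSupplier = handlerSupplier;
    }

    String parse(String input) {
        AccumulatingHandler handler = handlerSupplier.get();
        handler.characters(input);
        return handler.toString();
    }
}
```

With these stand-ins, parsing "first" and then "second" through `SharedInstanceParser` yields "firstsecond" on the second call, while `SupplierBackedParser` yields just "second".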
-
ApacheTikaDocumentParser
public ApacheTikaDocumentParser(Supplier<org.apache.tika.parser.Parser> parserSupplier, Supplier<ContentHandler> contentHandlerSupplier, Supplier<org.apache.tika.metadata.Metadata> metadataSupplier, Supplier<org.apache.tika.parser.ParseContext> parseContextSupplier)
Creates an instance of an ApacheTikaDocumentParser with the provided suppliers for Tika components. If some of the suppliers are not provided (null), the defaults will be used.
- Parameters:
parserSupplier - Supplier for Tika parser to use. Default: AutoDetectParser
contentHandlerSupplier - Supplier for Tika content handler. Default: BodyContentHandler without write limit
metadataSupplier - Supplier for Tika metadata. Default: empty Metadata
parseContextSupplier - Supplier for Tika parse context. Default: empty ParseContext
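A minimal sketch of calling the supplier-based constructor follows. It passes suppliers that mirror the documented defaults explicitly, then the resulting parser can be reused for several input streams. The class name `TikaSupplierExample` and the method `newParser` are hypothetical; the write limit value `-1` is Tika's convention for "no limit" in `BodyContentHandler(int)`. The sketch assumes the langchain4j Apache Tika parser module and Tika are on the classpath.

```java
import dev.langchain4j.data.document.DocumentParser;
import dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class, not part of the library.
class TikaSupplierExample {

    // Builds a parser whose Tika components come from the given suppliers.
    static DocumentParser newParser() {
        return new ApacheTikaDocumentParser(
                AutoDetectParser::new,            // parser supplier
                () -> new BodyContentHandler(-1), // content handler supplier, no write limit
                Metadata::new,                    // metadata supplier
                ParseContext::new);               // parse context supplier
    }
}
```

Any of the four arguments can be swapped for a supplier of a differently configured component; per the parameter descriptions above, a `null` supplier falls back to the default.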
-
-
Method Details
-
parse
- Specified by:
parse in interface dev.langchain4j.data.document.DocumentParser
-