Class Icu4jEncodingDetector

  • All Implemented Interfaces:
    Serializable, org.apache.tika.detect.EncodingDetector

    public class Icu4jEncodingDetector
    extends Object
    implements org.apache.tika.detect.EncodingDetector
    See Also:
    Serialized Form
    • Constructor Detail

      • Icu4jEncodingDetector

        public Icu4jEncodingDetector()
    • Method Detail

      • detect

        public Charset detect​(InputStream input,
                              org.apache.tika.metadata.Metadata metadata)
                       throws IOException
        Specified by:
        detect in interface org.apache.tika.detect.EncodingDetector
        Throws:
        IOException
      • isStripMarkup

        public boolean isStripMarkup()
      • setStripMarkup

        @Field
        public void setStripMarkup​(boolean stripMarkup)
        Sets whether to attempt to strip HTML-like markup from the stream before passing it to the underlying detector.

        The underlying detector may still apply its own stripping if this is set to false.

        Parameters:
        stripMarkup - whether or not to attempt to strip markup before sending the stream to the underlying detector
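
        To make the idea of stripping markup before detection concrete, here is a minimal sketch of a naive tag stripper that works on raw bytes. It is not the stripping this class performs; the class name StripMarkupSketch and the method stripTags are hypothetical, and the approach assumes an ASCII-compatible encoding in which '<' and '>' are single bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Tika or ICU4J.
final class StripMarkupSketch {

    /**
     * Drops every byte between '<' and the next '>' (inclusive).
     * A '>' outside a tag is kept; an unclosed '<' drops the rest of the input.
     * Assumes an ASCII-compatible encoding, since the charset is not yet known.
     */
    static byte[] stripTags(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        boolean inTag = false;
        for (byte b : input) {
            if (b == '<') {
                inTag = true;          // start skipping
            } else if (b == '>' && inTag) {
                inTag = false;         // end of tag, resume copying
            } else if (!inTag) {
                out.write(b);          // plain content byte
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] html = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        byte[] text = stripTags(html);
        System.out.println(new String(text, StandardCharsets.ISO_8859_1));
    }
}
```

        Working on bytes rather than characters reflects the situation at detection time: the charset is still unknown, so only the markup delimiters themselves can be recognized, and the remaining bytes are what a statistical detector would examine.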
      • getMarkLimit

        public int getMarkLimit()
      • setMarkLimit

        @Field
        public void setMarkLimit​(int markLimit)
        Sets how far into the stream to read for charset detection. The default is 12000.

        Parameters:
        markLimit - how far into the stream to read for charset detection
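
        The sketch below shows one way a bounded read of this kind can work with the standard mark/reset mechanism of java.io.InputStream. It is an assumption-laden illustration, not this class's implementation: the name MarkLimitSketch and the method peek are hypothetical, and the limit is assumed to be counted in bytes.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of Tika or ICU4J.
final class MarkLimitSketch {

    // Default value quoted in the documentation above (assumed here to be bytes).
    static final int DEFAULT_MARK_LIMIT = 12000;

    /**
     * Reads at most markLimit bytes from the stream, then rewinds it so the
     * caller can still consume the stream from its original position.
     */
    static byte[] peek(InputStream input, int markLimit) throws IOException {
        if (!input.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        input.mark(markLimit);
        try {
            byte[] buffer = new byte[markLimit];
            int total = 0;
            while (total < markLimit) {
                int n = input.read(buffer, total, markLimit - total);
                if (n < 0) {
                    break; // end of stream before the limit was reached
                }
                total += n;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            input.reset(); // rewind to the marked position
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello, world".getBytes(StandardCharsets.UTF_8);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] sample = peek(in, 5);
        System.out.println(new String(sample, StandardCharsets.UTF_8)); // prints "hello"
        System.out.println(in.read() == 'h');                           // prints "true"
    }
}
```

        A larger limit gives a detector more data to examine at the cost of more buffering; a smaller limit reads less but leaves less evidence to base a guess on.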
      • getMarkLimt

        public int getMarkLimt()
      • setIgnoreCharsets

        @Field
        public void setIgnoreCharsets​(List<String> charsetsToIgnore)
      • getIgnoreCharsets

        public List<String> getIgnoreCharsets()