Class HtmlEncodingDetector

  • All Implemented Interfaces:
    Serializable, org.apache.tika.detect.EncodingDetector

    public class HtmlEncodingDetector
    extends Object
    implements org.apache.tika.detect.EncodingDetector
    Character encoding detector that determines the character encoding of an HTML document from the charset parameter of a Content-Type http-equiv meta tag found near the beginning of the document. It is especially useful for distinguishing among closely related encodings (ISO-8859-*), for which other encoding detection methods are unreliable.
    Since:
    Apache Tika 1.2
    See Also:
    Serialized Form
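The approach described above can be illustrated with a small JDK-only sketch. This is not Tika's implementation: the class name `MetaCharsetSketch`, the method `sniff`, and the simplified regular expression are all hypothetical and exist only to show the idea of reading a bounded prefix of the stream, looking for a charset parameter in a Content-Type http-equiv meta tag, and then resetting the stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only (hypothetical names); not the Tika implementation.
final class MetaCharsetSketch {

    // Simplified pattern (an assumption): http-equiv must precede content,
    // and the charset name is captured in group 1.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*[\"']?content-type[\"']?\\s+content\\s*=\\s*[\"'][^\"'>]*charset\\s*=\\s*([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or null if none is found in the first `limit` bytes.
    static Charset sniff(InputStream input, int limit) throws IOException {
        if (input == null || !input.markSupported()) {
            return null;
        }
        input.mark(limit);
        byte[] buf = new byte[limit];
        int n = 0;
        try {
            // Read at most `limit` bytes so the mark stays valid.
            while (n < limit) {
                int r = input.read(buf, n, limit - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        } finally {
            // Rewind so the caller can still parse the document from the start.
            input.reset();
        }
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding.
        String head = new String(buf, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat as "not detected".
            return null;
        }
    }
}
```

In this sketch, a document whose head contains `<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">` yields the ISO-8859-1 charset, and the stream is left positioned at its first byte. With Tika itself, the documented entry point is the `detect(InputStream, Metadata)` method listed below.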
    • Constructor Detail

      • HtmlEncodingDetector

        public HtmlEncodingDetector()
    • Method Detail

      • detect

        public Charset detect(InputStream input,
                              org.apache.tika.metadata.Metadata metadata)
                       throws IOException
        Specified by:
        detect in interface org.apache.tika.detect.EncodingDetector
        Throws:
        IOException
      • getMarkLimit

        public int getMarkLimit()
      • setMarkLimit

        @Field
        public void setMarkLimit(int markLimit)
        Sets how far into the stream to read when looking for the charset declaration. The default is 8192.
        Parameters:
        markLimit - the maximum number of bytes to read from the start of the stream
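The effect of the mark limit can be sketched with plain `java.io` streams. The example below is an assumption-laden illustration, not Tika code: `MarkLimitSketch`, `peek`, and `declarationVisible` are hypothetical names. It shows that a charset declaration located beyond the limit falls outside the inspected window, which is the trade-off this setting controls.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustrative sketch only (hypothetical names); not the Tika implementation.
final class MarkLimitSketch {

    // Reads up to markLimit bytes, then rewinds the stream to where it started.
    static String peek(InputStream in, int markLimit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(markLimit);
        byte[] buf = new byte[markLimit];
        int n = 0;
        try {
            while (n < markLimit) {
                int r = in.read(buf, n, markLimit - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        } finally {
            in.reset();
        }
        return new String(buf, 0, n, StandardCharsets.ISO_8859_1);
    }

    // True if a "charset=" parameter appears within the first markLimit bytes.
    static boolean declarationVisible(byte[] document, int markLimit) throws IOException {
        try (InputStream in = new ByteArrayInputStream(document)) {
            return peek(in, markLimit).toLowerCase(Locale.ROOT).contains("charset=");
        }
    }
}
```

In this sketch, a document that begins with 100 bytes of padding before its meta tag is not matched with a limit of 64, but is matched with the default of 8192. A larger limit finds declarations that appear later in the document, at the cost of buffering more of the stream.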