Interface IEncoder

  • All Known Subinterfaces:
    ILayerProvider
    All Known Implementing Classes:
    CDATAEncoder, CsvEncoder, DefaultEncoder, DTDEncoder, EncoderManager, HtmlEncoder, JSONEncoder, LayerProvider, MarkdownEncoder, MIFEncoder, MosesTextEncoder, OpenXMLEncoder, PHPContentEncoder, POEncoder, PropertiesEncoder, RegexEncoder, TEXEncoder, TSEncoder, XMLEncoder, YamlEncoder

    public interface IEncoder
    Provides common methods to encode/escape text to a specific format.

    Important: Each class implementing this interface must have a nullary (no-argument) constructor, so that the EncoderManager can instantiate the object using the Class.forName() method.
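
    The constructor requirement above follows from how reflective instantiation works in Java. The sketch below is not the EncoderManager code; SampleEncoder and NullaryConstructorDemo are hypothetical names used only to show why a no-argument constructor is needed.

```java
// Hypothetical stand-in for an encoder class. Any class with an accessible
// no-argument constructor can be created the same way.
class SampleEncoder {
    public SampleEncoder() {
        // Nullary constructor: the only one reflection can call without arguments.
    }
}

class NullaryConstructorDemo {
    // Loads a class by its fully qualified name and creates an instance
    // through its no-argument constructor.
    static Object instantiate(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Thrown, for example, when the class has no nullary constructor.
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }
}
```

    A class that only declares constructors with arguments would fail in getDeclaredConstructor() with a NoSuchMethodException, which is why the rule applies to every implementation.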

    1. Filters (and subfilters) decode any special sequences for their format. The goal is 100% Unicode inside Okapi. This includes normalizing newlines to \n.
    2. The exceptions to #1 are Skeleton and Code.data. This content should remain unaltered to the extent possible. Some alteration is unavoidable: for example, XML processors will decode everything, and this is out of our control.
    3. For non-problematic formats that use GenericFilterWriter/GenericSkeleton, only an IEncoder implementation is needed. Encoding is handled automatically in this case based on MimeTypeMapper.
    4. IEncoder implementations should reside in the encoders package in core.
    5. The IEncoder implementation should take into account EncoderContext. Normally the encoder shouldn't be run on SKELETON or INLINE content - or only run with a small subset of cases as compared to TEXT (the goal is to keep SKELETON/INLINE as close to the original as possible).
    6. Special IParameters can be passed to the IEncoder if more configuration is needed.
    7. QuoteMode is also provided to help guide logic around double and single quotes. Some encoders take more parameters in their constructor (for example, XMLEncoder).
    8. For "problematic" formats an IFilterWriter should be implemented. This will give the full context of the TextUnit and surrounding Events, so encoding can be applied with more nuance. However, an IEncoder can still be implemented for default cases.
    9. ALL encoder logic should reside within the IFilterWriter and/or IEncoder implementations. It should not be handled in ad hoc ways.
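
    Point 5 can be illustrated with a minimal sketch of context-sensitive escaping. This is not the code of any Okapi encoder: the Context enum is a stand-in for EncoderContext, declared locally so the example is self-contained, and the escaping rules are a simplified XML-like assumption.

```java
class ContextAwareEscaper {
    // Stand-in for Okapi's EncoderContext, declared here only for illustration.
    enum Context { TEXT, SKELETON, INLINE }

    // Escapes markup characters in TEXT and leaves SKELETON and INLINE
    // content untouched, keeping it as close to the original as possible.
    static String encode(String text, Context context) {
        if (context != Context.TEXT) {
            return text;
        }
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

    A real implementation may still need to handle a small subset of cases for SKELETON or INLINE content, as point 5 notes. The sketch takes the simplest option and passes that content through unchanged.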
    • Method Detail

      • reset

        void reset()
        Reset state in this encoder in preparation for processing new content.
      • setOptions

        void setOptions​(IParameters params,
                        String encoding,
                        String lineBreak)
        Sets the options for this encoder.
        Parameters:
        params - the parameters object with all the configuration information specific to this encoder.
        encoding - the name of the charset encoding to use.
        lineBreak - the type of line break to use.
      • encode

        String encode​(String text,
                      EncoderContext context)
        Encodes a given text with this encoder.
        Parameters:
        text - the text to encode.
        context - the context of the text: TEXT, SKELETON, or INLINE.
        Returns:
        the encoded text.
      • encode

        String encode​(int codePoint,
                      EncoderContext context)
        Encodes a given code-point with this encoding. If this method is called from a loop, it is assumed that the caller tests whether the code point is a supplemental one, and that any index update needed to skip the low surrogate part of the pair is done on the caller side.
        Parameters:
        codePoint - the code-point to encode.
        context - the context of the character: TEXT, SKELETON, or INLINE.
        Returns:
        the encoded character (as a string since it can now be made up of more than one character).
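
        The caller-side handling described for this method can be sketched as follows. The IntFunction parameter stands in for a call to encode(int, EncoderContext), and toNcr is a hypothetical encoder used only for the demonstration; neither is part of the Okapi API.

```java
import java.util.function.IntFunction;

class CodePointLoop {
    // Walks a string by code point. The caller, not the encoder, advances the
    // index by two chars when the code point is a supplemental one.
    static String encodeAll(String text, IntFunction<String> encodeCodePoint) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.append(encodeCodePoint.apply(cp));
            // charCount is 2 for a surrogate pair, so the low surrogate is skipped.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Hypothetical encoder: ASCII is kept, anything else becomes a
    // hexadecimal numeric character reference.
    static String toNcr(int cp) {
        return cp < 0x80 ? String.valueOf((char) cp) : "&#x" + Integer.toHexString(cp) + ";";
    }
}
```

        With this loop, a supplemental character such as U+1F600 reaches the encoder once as a single code point instead of twice as two separate surrogate chars.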
      • encode

        String encode​(char value,
                      EncoderContext context)
        Encodes a given character with this encoding.
        Parameters:
        value - the character to encode.
        context - the context of the character: TEXT, SKELETON, or INLINE.
        Returns:
        the encoded character (as a string since it can now be made up of more than one character).
      • toNative

        default String toNative​(String propertyName,
                                String value)
        Converts a property value from its standard representation to the native representation for this encoder.
        Parameters:
        propertyName - the name of the property.
        value - the standard value to convert.
        Returns:
        the native representation of the given value.
      • getLineBreak

        default String getLineBreak()
        Gets the line-break to use for this encoder.
        Returns:
        the line-break used for this encoder.
      • getEncoding

        default String getEncoding()
        Gets the name of the charset encoding to use.
        Returns:
        the charset encoding used for this encoder.
      • getCharsetEncoder

        default CharsetEncoder getCharsetEncoder()
        Gets the character set encoder used for this encoder.
        Returns:
        the character set encoder used for this encoder. This can be null.
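
        This page does not say how implementations use the returned CharsetEncoder. One common use of a java.nio.charset.CharsetEncoder is to test whether a character can be represented in the output charset and to escape it otherwise. The sketch below assumes that use, with a hypothetical CharsetFallback class and numeric character references as the escape form. It also shows the null check that callers need.

```java
import java.nio.charset.CharsetEncoder;

class CharsetFallback {
    // Keeps characters the charset can represent and escapes the rest.
    // A null encoder is treated as "no restriction" (an assumption of this sketch).
    static String encode(String text, CharsetEncoder charsetEncoder) {
        if (charsetEncoder == null) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String s = new String(Character.toChars(cp));
            if (charsetEncoder.canEncode(s)) {
                out.append(s);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```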
      • getParameters

        IParameters getParameters()
        Gets the parameters object with all the configuration information specific to this encoder.
        Returns:
        the parameters object used for this encoder. This can be null.