UTF8-hul Module

1 Introduction

The UTF8-hul module recognizes and validates content streams encoded with the Unicode UTF-8 encoding.

The module is invoked with the following command-line option:

jhove ... -m UTF8-hul ...

This module can be configured with the following parameters:

  • withTextMD=true: requests the output of a textMD block in the text technical properties.

Coverage

  • UTF-8 encoded content streams [Unicode]

Well-Formedness

The following criteria must be met by a UTF-8 content stream for JHOVE to consider it well-formed:

  • The stream consists of an optional three-octet encoded Byte Order Mark (BOM) character, 0xEFBBBF, followed by an arbitrary number of the following one- to four-octet sequences:

       
    Single octet: 0xxxxxxx
    Two octets: 110yyyyy 10xxxxxx
    Three octets: 1110zzzz 10yyyyyy 10xxxxxx
    Four octets: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
  • An initial Byte Order Mark (BOM) character in the form of any of the following two- or four-octet sequences automatically marks the content stream as not well-formed UTF-8:

    Two octets: 0xFEFF UTF-16 big-endian encoding
                0xFFFE UTF-16 little-endian encoding
    Four octets: 0x0000FEFF UCS-4 big-endian encoding
                 0xFFFE0000 UCS-4 little-endian encoding
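
The well-formedness criteria above can be sketched in a few lines of Python. This is an illustrative sketch only, not JHOVE's actual implementation: the function name is_well_formed_utf8 and the constants are hypothetical. It checks only the bit patterns and BOM rules listed above. It does not reject overlong encodings or surrogate code points, and whether JHOVE performs such checks is not stated here.

```python
# Illustrative sketch; names are hypothetical and not part of JHOVE.
UTF8_BOM = b"\xef\xbb\xbf"
FOREIGN_BOMS = (
    b"\x00\x00\xfe\xff",  # UCS-4 big-endian
    b"\xff\xfe\x00\x00",  # UCS-4 little-endian
    b"\xfe\xff",          # UTF-16 big-endian
    b"\xff\xfe",          # UTF-16 little-endian
)


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data matches the octet patterns described above."""
    # A UTF-16 or UCS-4 BOM at the start rules the stream out immediately.
    if data.startswith(FOREIGN_BOMS):
        return False

    # Skip the optional three-octet UTF-8 BOM.
    i = 3 if data.startswith(UTF8_BOM) else 0
    n = len(data)

    while i < n:
        lead = data[i]
        if (lead & 0x80) == 0x00:    # 0xxxxxxx
            trailing = 0
        elif (lead & 0xE0) == 0xC0:  # 110yyyyy
            trailing = 1
        elif (lead & 0xF0) == 0xE0:  # 1110zzzz
            trailing = 2
        elif (lead & 0xF8) == 0xF0:  # 11110uuu
            trailing = 3
        else:
            return False             # not a valid lead octet

        # The sequence must not run past the end of the stream.
        if i + trailing >= n:
            return False

        # Every trailing octet must have the form 10xxxxxx.
        for j in range(i + 1, i + 1 + trailing):
            if (data[j] & 0xC0) != 0x80:
                return False

        i += trailing + 1

    return True
```

For example, is_well_formed_utf8(b"\xef\xbb\xbfhi") accepts a stream with a UTF-8 BOM, while is_well_formed_utf8(b"\xfe\xff\x00h") rejects a stream that begins with a UTF-16 big-endian BOM.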

Validity

The following criteria must be met by a UTF-8 encoded file for JHOVE to consider it valid:

  • The UTF-8 encoded file is well-formed

Representation Information

The MIME type is reported as: text/plain; charset=UTF-8

In addition to the standard JHOVE representation information, the module defines the following properties:

  • Property “UTF8Metadata” of type PROPERTY and arity LIST
    • Property “Characters” of type LONG and arity SCALAR containing the number of characters
    • Property “UnicodeBlocks” of type STRING and arity LIST containing Unicode 6.0.0 code blocks [Unicode Code Blocks]
    • Property “LineEndings” of type STRING and arity LIST containing: CR, CRLF, or LF
    • If withTextMD=true is set, Property “TextMDMetadata” of type TextMDMetadata and arity SCALAR
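
To make the “Characters” and “LineEndings” properties concrete, the following Python sketch derives comparable values from a byte string. It is a hypothetical illustration, not JHOVE code: the function name summarize_text is invented, and excluding a leading BOM from the character count is an assumption rather than documented module behaviour. “UnicodeBlocks” is omitted because the Python standard library offers no block lookup.

```python
# Illustrative sketch; summarize_text is a hypothetical name, not a JHOVE API.
def summarize_text(data: bytes) -> dict:
    """Count characters and collect the distinct line-ending styles."""
    # "utf-8-sig" strips a leading BOM if present (assumption: the BOM
    # is not counted as a character).
    text = data.decode("utf-8-sig")

    endings = []  # distinct styles, in order of first appearance
    i = 0
    while i < len(text):
        kind = None
        if text[i] == "\r":
            if i + 1 < len(text) and text[i + 1] == "\n":
                kind = "CRLF"
                i += 1  # consume the LF that belongs to this CRLF pair
            else:
                kind = "CR"
        elif text[i] == "\n":
            kind = "LF"
        if kind is not None and kind not in endings:
            endings.append(kind)
        i += 1

    return {"Characters": len(text), "LineEndings": endings}
```

Calling summarize_text(b"a\r\nb\nc\rd") on a stream that mixes all three styles yields a character count of 8 and the list CRLF, LF, CR.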