gov.lanl.archive.rewrite.charset
Class CharsetDetector

java.lang.Object
  extended by gov.lanl.archive.rewrite.charset.CharsetDetector
Direct Known Subclasses:
StandardCharsetDetector

public abstract class CharsetDetector
extends Object

Abstract base class providing common methods for determining the character encoding of a text Resource. Most of these methods should be refactored into a Util package.

Author:
brad

Field Summary
protected static String CHARSET_TOKEN
           
static String DEFAULT_CHARSET
          The default charset name, used when detection gives up.
protected static String HTTP_CONTENT_TYPE_HEADER
           
protected static int MAX_CHARSET_READAHEAD
           
 
Constructor Summary
CharsetDetector()
           
 
Method Summary
protected  String contentTypeToCharset(String contentType)
           
abstract  String getCharset(InputStream resource, String ctype)
           
protected  String getCharsetFromBytes(InputStream resource)
          Attempts to figure out the character set of the document using the excellent juniversalchardet library.
protected  String getCharsetFromHeaders(String ctype)
          Attempts to determine the character encoding of the document from the "charset=" parameter of the Content-Type HTTP header.
protected  String getCharsetFromMeta(InputStream resource)
          Attempts to find a META tag in the HTML that hints at the character set used to write the document.
protected  boolean isCharsetSupported(String charsetName)
           
protected  String mapCharset(String orig)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAX_CHARSET_READAHEAD

protected static final int MAX_CHARSET_READAHEAD
See Also:
Constant Field Values

CHARSET_TOKEN

protected static final String CHARSET_TOKEN
See Also:
Constant Field Values

HTTP_CONTENT_TYPE_HEADER

protected static final String HTTP_CONTENT_TYPE_HEADER
See Also:
Constant Field Values

DEFAULT_CHARSET

public static final String DEFAULT_CHARSET
The default charset name, used when detection gives up.

See Also:
Constant Field Values
Constructor Detail

CharsetDetector

public CharsetDetector()
Method Detail

isCharsetSupported

protected boolean isCharsetSupported(String charsetName)
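
This page gives no description for isCharsetSupported. One plausible reading, offered only as an assumption, is that it reports whether the running JVM can decode the named charset. A minimal sketch of such a check using the standard java.nio.charset API follows; the class and method names are hypothetical and not part of CharsetDetector.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper for illustration; not the actual CharsetDetector code.
final class CharsetSupportSketch {

    // Returns true only if the JVM recognizes the charset name.
    static boolean isSupported(String charsetName) {
        if (charsetName == null || charsetName.isEmpty()) {
            return false;
        }
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalCharsetNameException e) {
            // Names with illegal characters (spaces, "!", ...) land here.
            return false;
        }
    }
}
```

Catching IllegalCharsetNameException matters because charset names taken from headers or META tags are untrusted input and may not be syntactically legal.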

mapCharset

protected String mapCharset(String orig)

contentTypeToCharset

protected String contentTypeToCharset(String contentType)

getCharsetFromHeaders

protected String getCharsetFromHeaders(String ctype)
                                throws IOException
Attempts to determine the character encoding of the document from the "charset=" parameter of the Content-Type HTTP header.

Parameters:
ctype - value of the Content-Type HTTP header
Returns:
String character set found or null if the header was not present
Throws:
IOException
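
The header lookup described above can be sketched as follows. This is a minimal illustration, assuming the header value looks like `text/html; charset=UTF-8`; the class name, method name, and regular expression are hypothetical, and the real CHARSET_TOKEN value is not shown on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the actual CharsetDetector code.
final class ContentTypeCharsetSketch {

    // Matches charset=value or charset="value", case-insensitively.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the Content-Type value, or null if absent.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }
}
```

The sketch returns the name exactly as written in the header, so a caller would still need to check that the JVM supports it before decoding.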

getCharsetFromMeta

protected String getCharsetFromMeta(InputStream resource)
                             throws IOException
Attempts to find a META tag in the HTML that hints at the character set used to write the document.

Parameters:
resource - InputStream of the HTML document to scan
Returns:
String character set found from META tags in the HTML
Throws:
IOException
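
A META scan of this kind can be sketched as below. The sketch assumes the stream supports mark/reset (for example a BufferedInputStream), reads a bounded prefix, and rewinds so later consumers see the whole document. The read-ahead limit, class name, and regular expression are assumptions; the actual MAX_CHARSET_READAHEAD value and parsing logic are not shown on this page.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the actual CharsetDetector code.
final class MetaCharsetSketch {

    // Assumed limit on how many bytes to inspect.
    private static final int MAX_READAHEAD = 65536;

    // Covers <meta charset="..."> and <meta ... content="...; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    static String charsetFromMeta(InputStream resource) throws IOException {
        if (!resource.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        resource.mark(MAX_READAHEAD);
        try {
            byte[] buf = new byte[MAX_READAHEAD];
            int total = 0;
            int n;
            // Fill the buffer until it is full or the stream ends.
            while (total < buf.length
                    && (n = resource.read(buf, total, buf.length - total)) > 0) {
                total += n;
            }
            // ISO-8859-1 maps every byte to one char, so ASCII markup survives.
            String head = new String(buf, 0, total, StandardCharsets.ISO_8859_1);
            Matcher m = META_CHARSET.matcher(head);
            return m.find() ? m.group(1) : null;
        } finally {
            // Rewind so the caller can read the document from the start.
            resource.reset();
        }
    }
}
```

Decoding the prefix as ISO-8859-1 is a deliberate choice in this sketch: the real encoding is still unknown at this point, and a single-byte decoding cannot fail on arbitrary input.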

getCharsetFromBytes

protected String getCharsetFromBytes(InputStream resource)
                              throws IOException
Attempts to figure out the character set of the document using the excellent juniversalchardet library.

Parameters:
resource - InputStream of the document to inspect
Returns:
String character encoding found, or null if nothing looked good.
Throws:
IOException

getCharset

public abstract String getCharset(InputStream resource,
                                  String ctype)
                           throws IOException
Parameters:
resource - (presumably text) Resource whose charset is to be determined
ctype - Content-Type value for the Resource, which may contain a charset hint
Returns:
String charset name for the Resource
Throws:
IOException - if there are problems reading the Resource
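
Because getCharset is abstract, each subclass decides how to combine the helper methods. One way a subclass might do so, sketched here purely as an assumption, is to try several strategies in order and fall back to a default when none yields a supported name. The class name, the strategy order, and the "UTF-8" default below are hypothetical; the actual DEFAULT_CHARSET value and the behavior of StandardCharsetDetector are not documented on this page.

```java
import java.nio.charset.Charset;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not the actual CharsetDetector code.
final class CharsetFallbackSketch {

    // Assumed default; the real DEFAULT_CHARSET value is not listed here.
    static final String DEFAULT_CHARSET = "UTF-8";

    // Returns the first non-null, JVM-supported name, else the default.
    static String firstSupported(List<Supplier<String>> strategies) {
        for (Supplier<String> strategy : strategies) {
            String name = strategy.get();
            if (name != null && isSupported(name)) {
                return name;
            }
        }
        return DEFAULT_CHARSET;
    }

    private static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Covers syntactically illegal charset names.
            return false;
        }
    }
}
```

A caller could pass suppliers that wrap the header, META, and byte-level lookups in whichever order it trusts most; skipping unsupported names keeps a bad hint from masking a usable one further down the list.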


Copyright © 2013. All Rights Reserved.