java.lang.Object
  gov.lanl.archive.rewrite.charset.CharsetDetector
public abstract class CharsetDetector
Abstract class containing common methods for determining the character encoding of a text Resource, most of which should be refactored into a Util package.
Field Summary | |
---|---|
protected static String | CHARSET_TOKEN |
static String | DEFAULT_CHARSET: the default charset name to use when giving up |
protected static String | HTTP_CONTENT_TYPE_HEADER |
protected static int | MAX_CHARSET_READAHEAD |
Constructor Summary | |
---|---|
CharsetDetector() | |
Method Summary | |
---|---|
protected String | contentTypeToCharset(String contentType) |
abstract String | getCharset(InputStream resource, String ctype) |
protected String | getCharsetFromBytes(InputStream resource): Attempts to figure out the character set of the document using the excellent juniversalchardet library. |
protected String | getCharsetFromHeaders(String ctype): Attempts to determine the character encoding of the document from the "charset=" parameter of the Content-Type HTTP header. |
protected String | getCharsetFromMeta(InputStream resource): Attempts to find a META tag in the HTML that hints at the character set used to write the document. |
protected boolean | isCharsetSupported(String charsetName) |
protected String | mapCharset(String orig) |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static final int MAX_CHARSET_READAHEAD
protected static final String CHARSET_TOKEN
protected static final String HTTP_CONTENT_TYPE_HEADER
public static final String DEFAULT_CHARSET
Constructor Detail |
---|
public CharsetDetector()
Method Detail |
---|
protected boolean isCharsetSupported(String charsetName)
protected String mapCharset(String orig)
protected String contentTypeToCharset(String contentType)
protected String getCharsetFromHeaders(String ctype) throws IOException
Parameters:
ctype - the Content-Type HTTP header value
Throws:
IOException
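A minimal sketch of how a "charset=" parameter might be pulled out of a Content-Type value, for illustration only. The class name ContentTypeCharsetSketch, the method charsetOf, and the regular expression are hypothetical and are not taken from CharsetDetector. The sketch also assumes that an unsupported charset name should be treated as "not found".

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of gov.lanl.archive.rewrite.charset.
final class ContentTypeCharsetSketch {
    // Assumed pattern: "charset", optional whitespace around "=", optional quotes.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if none is usable. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return null;
        }
        String name = m.group(1);
        return isSupported(name) ? name : null;
    }

    /** True if this JVM can decode the named charset; illegal names count as unsupported. */
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException is a subclass of IllegalArgumentException.
            return false;
        }
    }
}
```

Returning null rather than a default leaves the caller free to try another strategy before giving up.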
protected String getCharsetFromMeta(InputStream resource) throws IOException
Parameters:
resource - the HTML document to search for a META tag
Throws:
IOException
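One way META sniffing could work is sketched below, under stated assumptions. It reads a bounded prefix of the stream, searches it for a META tag carrying "charset=", and rewinds the stream so the caller can still read the whole document. The class MetaCharsetSketch, the 64 KiB read-ahead limit, and the regular expression are hypothetical. The documentation above does not give the value of MAX_CHARSET_READAHEAD or the matching rules actually used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of gov.lanl.archive.rewrite.charset.
final class MetaCharsetSketch {
    // Assumed limit; the real constant's value is not documented above.
    static final int MAX_CHARSET_READAHEAD = 65536;

    // Matches both <meta charset="..."> and the http-equiv/content form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a META tag, or null. The stream is rewound afterwards. */
    static String charsetFromMeta(InputStream resource) throws IOException {
        if (!resource.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        resource.mark(MAX_CHARSET_READAHEAD);
        byte[] buf = new byte[MAX_CHARSET_READAHEAD];
        int total = 0;
        try {
            int n;
            while (total < buf.length
                    && (n = resource.read(buf, total, buf.length - total)) > 0) {
                total += n;
            }
        } finally {
            resource.reset(); // leave the stream where the caller had it
        }
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives intact.
        String head = new String(buf, 0, total, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

A regular expression is only an approximation of HTML parsing. For example, it can match "charset=" inside an unrelated attribute value.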
protected String getCharsetFromBytes(InputStream resource) throws IOException
Parameters:
resource - the document whose bytes are examined
Throws:
IOException
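The method above relies on juniversalchardet for statistical detection. As a dependency-free stand-in, the sketch below shows only the general shape of byte-level sniffing: check for a byte-order mark, then test whether the prefix is strictly valid UTF-8. It is a much weaker technique than juniversalchardet and is not what CharsetDetector does. The class ByteSniffSketch and the read-ahead limit are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in: BOM check plus strict UTF-8 validation only.
final class ByteSniffSketch {
    // Assumed limit; the real constant's value is not documented above.
    static final int MAX_CHARSET_READAHEAD = 65536;

    /** Returns a charset name guessed from the leading bytes, or null. Rewinds the stream. */
    static String charsetFromBytes(InputStream resource) throws IOException {
        if (!resource.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        resource.mark(MAX_CHARSET_READAHEAD);
        byte[] buf = new byte[MAX_CHARSET_READAHEAD];
        int total = 0;
        try {
            int n;
            while (total < buf.length
                    && (n = resource.read(buf, total, buf.length - total)) > 0) {
                total += n;
            }
        } finally {
            resource.reset();
        }
        if (total == 0) {
            return null;
        }
        int b0 = buf[0] & 0xFF;
        int b1 = total > 1 ? buf[1] & 0xFF : -1;
        int b2 = total > 2 ? buf[2] & 0xFF : -1;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            // Caveat: a prefix cut in the middle of a multi-byte sequence also fails here.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(buf, 0, total));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return null; // not valid UTF-8; a real detector would keep guessing
        }
    }
}
```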
public abstract String getCharset(InputStream resource, String ctype) throws IOException
Parameters:
resource - (presumably text) Resource whose charset is to be determined
request - WaybackRequest which may contain additional hints for processing
Throws:
IOException - if there are problems reading the Resource
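Because getCharset is abstract, each subclass decides how to combine the protected helpers. A minimal sketch of one plausible arrangement follows: try each strategy in order, take the first non-null answer, and fall back to a default when every strategy gives up. The class FallbackChainSketch, the Strategy interface, and the "UTF-8" default are hypothetical. The documentation above does not state the value of DEFAULT_CHARSET or the order in which any subclass tries the helpers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Hypothetical illustration of a first-match fallback chain.
final class FallbackChainSketch {
    // Assumed value; the real DEFAULT_CHARSET is not documented above.
    static final String DEFAULT_CHARSET = "UTF-8";

    /** One detection strategy; returns null when it cannot decide. */
    interface Strategy {
        String detect(InputStream resource, String ctype) throws IOException;
    }

    /** Returns the first non-null result, or DEFAULT_CHARSET when all strategies give up. */
    static String getCharset(List<Strategy> strategies, InputStream resource, String ctype)
            throws IOException {
        for (Strategy strategy : strategies) {
            String charset = strategy.detect(resource, ctype);
            if (charset != null) {
                return charset;
            }
        }
        return DEFAULT_CHARSET;
    }
}
```

A subclass following this pattern might, for example, consult the headers first, then the META tag, then the raw bytes. That ordering is an assumption about trust, not something the class prescribes.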