java.lang.Object
  gov.lanl.archive.rewrite.charset.CharsetDetector
public abstract class CharsetDetector
Abstract class containing common methods for determining the character encoding of a text Resource, most of which should be refactored into a Util package.
| Field Summary | |
|---|---|
| protected static String | CHARSET_TOKEN |
| static String | DEFAULT_CHARSET: the default charset name to use when giving up |
| protected static String | HTTP_CONTENT_TYPE_HEADER |
| protected static int | MAX_CHARSET_READAHEAD |
| Constructor Summary | |
|---|---|
| CharsetDetector() | |
| Method Summary | |
|---|---|
| protected String | contentTypeToCharset(String contentType) |
| abstract String | getCharset(InputStream resource, String ctype) |
| protected String | getCharsetFromBytes(InputStream resource): Attempts to figure out the character set of the document using the excellent juniversalchardet library. |
| protected String | getCharsetFromHeaders(String ctype): Attempts to determine the character encoding of the document from the "charset=" parameter of the Content-Type HTTP header. |
| protected String | getCharsetFromMeta(InputStream resource): Attempts to find a META tag in the HTML that hints at the character set used to write the document. |
| protected boolean | isCharsetSupported(String charsetName) |
| protected String | mapCharset(String orig) |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
protected static final int MAX_CHARSET_READAHEAD
protected static final String CHARSET_TOKEN
protected static final String HTTP_CONTENT_TYPE_HEADER
public static final String DEFAULT_CHARSET
| Constructor Detail |
|---|
public CharsetDetector()
| Method Detail |
|---|
protected boolean isCharsetSupported(String charsetName)
protected String mapCharset(String orig)
protected String contentTypeToCharset(String contentType)
protected String getCharsetFromHeaders(String ctype) throws IOException
Parameters:
resource -
Throws:
IOException
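The method summary describes getCharsetFromHeaders as reading the "charset=" parameter of the Content-Type header. The following is a minimal, standalone sketch of that idea and not the class's actual implementation: the class name `ContentTypeCharsetSketch`, the method `charsetFromContentType`, and the regular expression are all hypothetical, and the support check is only a guess at what isCharsetSupported might do.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names and parsing rules are assumptions, not the real CharsetDetector code.
final class ContentTypeCharsetSketch {
    // Matches charset=VALUE or charset="VALUE", case-insensitively.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if none is usable. */
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return null; // no charset parameter present
        }
        String name = m.group(1);
        return isSupported(name) ? name : null;
    }

    /** True if this JVM can decode the named charset; illegal names count as unsupported. */
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) { // includes IllegalCharsetNameException
            return false;
        }
    }

    public static void main(String[] args) {
        // prints "ISO-8859-1"
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1"));
    }
}
```

Returning null for a missing or unsupported charset lets a caller move on to another detection strategy instead of failing later when a decoder is created.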
protected String getCharsetFromMeta(InputStream resource) throws IOException
Parameters:
resource -
Throws:
IOException
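The method summary says getCharsetFromMeta looks for a META tag that hints at the character set, and the MAX_CHARSET_READAHEAD field suggests a bounded read-ahead. One way this could work is sketched below. This is an assumption about the approach, not the real code: the class `MetaCharsetSketch`, the 64 KiB limit, the regular expression, and the use of mark/reset are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the real getCharsetFromMeta may work differently.
final class MetaCharsetSketch {
    // Assumed read-ahead limit; the actual MAX_CHARSET_READAHEAD value is not documented here.
    static final int MAX_CHARSET_READAHEAD = 65536;

    // Covers <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)", Pattern.CASE_INSENSITIVE);

    /** Scans the head of the stream for a META charset hint, then rewinds the stream. */
    static String charsetFromMeta(InputStream resource) throws IOException {
        if (!resource.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        resource.mark(MAX_CHARSET_READAHEAD);
        try {
            byte[] buf = new byte[MAX_CHARSET_READAHEAD];
            int total = 0;
            int n;
            // Read up to the limit; read() returns -1 at end of stream.
            while (total < buf.length
                    && (n = resource.read(buf, total, buf.length - total)) > 0) {
                total += n;
            }
            // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding.
            String head = new String(buf, 0, total, StandardCharsets.ISO_8859_1);
            Matcher m = META_CHARSET.matcher(head);
            return m.find() ? m.group(1) : null;
        } finally {
            resource.reset(); // leave the stream positioned where the caller had it
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        // prints "utf-8"
        System.out.println(charsetFromMeta(new ByteArrayInputStream(html)));
    }
}
```

Decoding the sniffed bytes as ISO-8859-1 is a deliberate choice in this sketch: the real encoding is not yet known, and the tag and attribute names of interest are plain ASCII.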
protected String getCharsetFromBytes(InputStream resource) throws IOException
Parameters:
resource -
Throws:
IOException
public abstract String getCharset(InputStream resource, String ctype) throws IOException
Parameters:
resource - (presumably text) Resource to determine the charset
request - WaybackRequest which may contain additional hints to processing
Throws:
IOException - if there are problems reading the Resource
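Because getCharset is abstract, each subclass decides how to combine the helper methods. A plausible pattern, sketched here purely as an assumption, is to try each strategy in turn and fall back to DEFAULT_CHARSET when none produces an answer. The class `CharsetFallbackSketch`, the `Step` interface, the ordering of steps, and the "UTF-8" default are all hypothetical; the page does not document the real value of DEFAULT_CHARSET or any subclass.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a fallback chain; not taken from any real CharsetDetector subclass.
final class CharsetFallbackSketch {
    // Assumed value; the real DEFAULT_CHARSET constant is not shown on this page.
    static final String DEFAULT_CHARSET = "UTF-8";

    /** One detection strategy; returns null when it cannot decide. */
    interface Step {
        String detect() throws IOException;
    }

    /** Returns the first non-null result, or DEFAULT_CHARSET when every step gives up. */
    static String firstDetected(List<Step> steps) throws IOException {
        for (Step step : steps) {
            String charset = step.detect();
            if (charset != null) {
                return charset;
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for header, META, and byte-level detection, in an assumed order.
        List<Step> steps = Arrays.asList(
                () -> null,          // e.g. no charset in the Content-Type header
                () -> "ISO-8859-1",  // e.g. a META tag hint
                () -> "UTF-16");     // never reached in this run
        // prints "ISO-8859-1"
        System.out.println(firstDetected(steps));
    }
}
```

In a real subclass the steps would call getCharsetFromHeaders, getCharsetFromMeta, and getCharsetFromBytes; which order is appropriate depends on how much the headers of the archived resources can be trusted.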