
Class UnicodeDammit



A class for detecting the encoding of a *ML document (HTML, XML, or similar markup) and converting it to a Unicode string. If the source encoding is windows-1252, it can also replace MS smart quotes with their HTML or XML equivalents.
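
The general idea of converting markup of unknown encoding can be sketched as follows. This is a minimal Python 3 illustration, not the class's actual code: the function name `convert_to_unicode`, the fallback order (`utf-8`, then `windows-1252`), and the return shape are assumptions made for the example.

```python
def convert_to_unicode(markup, override_encodings=()):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: candidate order and fallbacks are assumptions.
    """
    # Caller-supplied encodings are tried first, then two common fallbacks.
    candidates = list(override_encodings) + ["utf-8", "windows-1252"]
    for encoding in candidates:
        try:
            return markup.decode(encoding), encoding
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or undecodable bytes: try the next candidate.
            continue
    return None, None
```

Trying stricter encodings first matters here: UTF-8 rejects most byte sequences that are not valid UTF-8, so it makes a reasonable early candidate, while single-byte encodings accept almost anything and belong at the end.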

Instance Methods
 
__init__(self, markup, overrideEncodings=[], smartQuotesTo='xml')
 
_subMSChar(self, orig)
Changes a MS smart quote character to an XML or HTML entity.
 
_convertFrom(self, proposed)
 
_toUnicode(self, data, encoding)
Given a string and its encoding, decodes the string into Unicode.
 
_detectEncoding(self, xml_data)
Given a document, tries to detect its XML encoding.
 
find_codec(self, charset)
 
_codec(self, charset)
 
_ebcdic_to_ascii(self, s)
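
As an illustration of what detecting an XML encoding can involve (see `_detectEncoding` above), the sketch below reads the `encoding` pseudo-attribute from an XML declaration. It is a simplified Python 3 example under stated assumptions: the name `declared_xml_encoding` and the regular expression are hypothetical, and it only handles ASCII-compatible input.

```python
import re

# Matches an XML declaration at the start of the document and captures
# the value of its encoding pseudo-attribute.
XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)


def declared_xml_encoding(xml_data):
    """Return the declared encoding in lowercase, or None if absent."""
    match = XML_DECL.match(xml_data)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

Documents stored in UTF-16, UTF-32, or EBCDIC would need a byte-pattern check before this regular expression could match, which this sketch leaves out.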
Class Variables
  CHARSET_ALIASES = {'macintosh': 'mac-roman', 'x-sjis': 'shift-...
  EBCDIC_TO_ASCII_MAP = <CHEM.DB.rdb.search.NameRxnPatternMatchi...
  MS_CHARS = {'\x80': ('euro', '20AC'), '\x81': ' ', '\x82': ('s...
Method Details

_toUnicode(self, data, encoding)

 
Given a string and its encoding, decodes the string into Unicode. The encoding argument is a string recognized by encodings.aliases.
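
A decoding step of this kind might look like the Python 3 sketch below. The byte-order-mark handling is an assumption that this page does not state, and the names `BOMS` and `to_unicode` are hypothetical.

```python
import codecs

# UTF-32 marks are listed before UTF-16 because the UTF-32-LE mark
# begins with the same two bytes as the UTF-16-LE mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_unicode(data, encoding):
    """Decode bytes to str; a leading BOM overrides the given encoding."""
    for bom, bom_encoding in BOMS:
        if data.startswith(bom):
            # Assumption: trust the BOM over the caller's encoding.
            data = data[len(bom):]
            encoding = bom_encoding
            break
    return data.decode(encoding)
```

Letting the byte-order mark win over the caller's guess is a design choice: the mark is part of the data itself, so it is usually more reliable than an external hint.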

Class Variable Details

CHARSET_ALIASES

Value:
{'macintosh': 'mac-roman', 'x-sjis': 'shift-jis'}
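
An alias table like this is typically consulted before asking Python's codec registry for a codec. The sketch below shows one way that lookup could work in Python 3. The fallback spellings (hyphens removed, hyphens replaced by underscores) are assumptions, and this `find_codec` is an illustrative stand-in, not the method documented on this page.

```python
import codecs

CHARSET_ALIASES = {"macintosh": "mac-roman", "x-sjis": "shift-jis"}


def find_codec(charset):
    """Return the registry name of a codec for charset, or None."""
    if not charset:
        return None
    name = charset.lower()
    # Try the aliased name first, then two alternative spellings.
    candidates = [
        CHARSET_ALIASES.get(name, name),
        name.replace("-", ""),
        name.replace("-", "_"),
    ]
    for candidate in candidates:
        try:
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return None
```

For example, `find_codec("x-sjis")` resolves through the alias table to Python's Shift JIS codec, while an unrecognized name yields `None`.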

EBCDIC_TO_ASCII_MAP

Value:
None

MS_CHARS

Value:
{'\x80': ('euro', '20AC'),
 '\x81': ' ',
 '\x82': ('sbquo', '201A'),
 '\x83': ('fnof', '192'),
 '\x84': ('bdquo', '201E'),
 '\x85': ('hellip', '2026'),
 '\x86': ('dagger', '2020'),
 '\x87': ('Dagger', '2021'),
...
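
To show how a table of this shape can drive smart-quote replacement, here is a Python 3 sketch that uses only a few of the entries listed above. It assumes each tuple holds an HTML entity name followed by a hexadecimal code point. The function names and the `smart_quotes_to` parameter are hypothetical, loosely modeled on `_subMSChar` and `smartQuotesTo`.

```python
import re

# Subset of the table above, keyed by bytes for Python 3.
MS_CHARS = {
    b"\x80": ("euro", "20AC"),
    b"\x82": ("sbquo", "201A"),
    b"\x84": ("bdquo", "201E"),
    b"\x85": ("hellip", "2026"),
}


def sub_ms_char(orig, smart_quotes_to="xml"):
    """Map one windows-1252 byte to an XML or HTML entity."""
    sub = MS_CHARS.get(orig)
    if sub is None:
        return orig  # not in this subset: leave the byte unchanged
    name, codepoint = sub
    if smart_quotes_to == "xml":
        return ("&#x%s;" % codepoint).encode("ascii")
    return ("&%s;" % name).encode("ascii")


def replace_smart_quotes(data, smart_quotes_to="xml"):
    """Replace every byte in the range 0x80-0x9f using sub_ms_char."""
    return re.sub(
        rb"[\x80-\x9f]",
        lambda m: sub_ms_char(m.group(0), smart_quotes_to),
        data,
    )
```

With `smart_quotes_to="xml"` the output uses numeric character references, which any XML parser accepts; the `"html"` mode uses named entities, which depend on the HTML entity set.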