Decoding problems on websites with a high proportion of mathematical symbols

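When crawling websites that contain many mathematical symbols, using Python's chardet module to detect the page encoding is error-prone: the Latin letters on the page lead chardet to pick a Latin-script encoding that cannot decode all of the page's mathematical symbols. In the example below, the site declares its encoding as utf-8, but because it uses text recognition (OCR), some special mathematical symbols are automatically inserted into the page, and decoding in Python fails.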

from urllib import request
import chardet

url = "https://www.chegg.com/homework-help/questions-and-answers/statistics-and-probability-archive-2020-april-03?page=4"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
req = request.Request(url, headers=headers)
response = request.urlopen(req)
html = response.read()  # raw bytes of the page
charset = chardet.detect(html)  # guess the encoding from the bytes
print(charset)
html = html.decode(charset["encoding"])  # fails: the guessed encoding is wrong
'''
{'encoding': 'Windows-1254', 'confidence': 0.444726265432767, 'language': 'Turkish'}
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 179517: character maps to <undefined>
'''

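As the output shows, chardet identifies the encoding as Windows-1254, but with a confidence of only 0.44, and it is in fact not the correct encoding. Windows-1254 is the character set for Turkish; it also contains the Latin letters, which is why chardet settles on it. The decode still fails because Windows-1254 leaves some byte values unassigned, 0x9d among them (the byte named in the traceback), so math-related pages whose symbols are stored as multi-byte sequences containing those bytes cannot be decoded completely.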
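The traceback above points at byte 0x9d. The following is a minimal sketch, using a made-up sample string rather than the real page, of how that byte can end up in UTF-8 data and why the cp1254 codec rejects it. The sample text and variable names are assumptions for illustration only.

```python
# Made-up sample: a right double quotation mark (U+201D) next to plain text.
sample = "mean \u201d = 5"
raw = sample.encode("utf-8")

# U+201D is encoded in UTF-8 as the bytes E2 80 9D, so 0x9d is in the data.
has_9d = b"\x9d" in raw
print(raw, has_9d)

# Python's cp1254 codec has no character assigned to 0x9d,
# so a strict decode raises UnicodeDecodeError.
try:
    raw.decode("cp1254")
    cp1254_ok = True
except UnicodeDecodeError as exc:
    cp1254_ok = False
    print(exc)

# Decoding with the encoding the bytes were written in recovers the text.
text = raw.decode("utf-8")
```

The same check can be run on the real response bytes to see whether the declared encoding decodes them cleanly before falling back to anything else.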

Solution

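For math- or science-themed sites like this one, neither the declared utf-8 nor chardet's guess could be relied on, so I suggest decoding such pages directly with a fully populated single-byte character set: ISO-8859-1 (Latin-1) or cp1125 (IBM CP1125, Ukrainian). ISO-8859-1 assigns a character to every byte value, so the decode cannot raise; the trade-off is that symbols stored as multi-byte UTF-8 sequences come out as mojibake instead of the original characters. So far, pages with many mathematical symbols are the only case where I deliberately use these two encodings. Because neither is compatible with UTF-8 outside the ASCII range, they are best reserved for special cases, as a fallback in exception handling.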

from urllib import request

url = "https://www.chegg.com/homework-help/questions-and-answers/statistics-and-probability-archive-2020-april-03?page=4"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
req = request.Request(url, headers=headers)
response = request.urlopen(req)
html = response.read()  # raw bytes of the page
html = html.decode("ISO-8859-1")  # every byte value is mapped, so this does not raise
print(html)
'''
<!DOCTYPE html>
<html lang="en" class="wf-active">
    <!--  7c630212f682 | twig:1.41.0-DEV | cluster: cheggstudy | fullConf: prod-cheggstudy -->
<head>
        <meta name="format-detection" content="telephone=no">
'''
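To keep ISO-8859-1 as an exception-handling fallback rather than the default, the decode step can be wrapped in a small helper. This is only a sketch under stated assumptions: `decode_with_fallback` and its `declared` parameter are hypothetical names, not part of urllib or chardet, and the order of attempts (declared encoding, then utf-8, then ISO-8859-1) is one possible choice.

```python
from typing import Optional, Tuple


def decode_with_fallback(raw: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used), trying strict decoders before Latin-1."""
    candidates = []
    if declared:
        candidates.append(declared)
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # LookupError covers an unknown encoding name in the declaration.
            continue
    # Last resort: ISO-8859-1 maps every byte value, so this cannot raise,
    # but multi-byte UTF-8 symbols will come out as mojibake.
    return raw.decode("iso-8859-1"), "iso-8859-1"


# Usage with in-memory bytes (no network access needed).
text, used = decode_with_fallback("x \u00b5 y".encode("utf-8"), "utf-8")
print(used, text)
```

With urllib, the declared charset can usually be read from the response via `response.headers.get_content_charset()` and passed in as `declared`; it returns `None` when the server sends no charset, which the helper treats as "no declaration".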

Please credit the source when reposting.