
Throw out all characters that can't be interpreted as ASCII:
<pre><code>def remove_non_ascii(s):
    return &quot;&quot;.join(c for c in s if ord(c) &lt; 128)
</code></pre>
Keep in mind that this is guaranteed to work with UTF-8 encoded byte strings, because every byte of a multi-byte character has its highest bit set to 1 and so can never be mistaken for an ASCII character.
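The claim about UTF-8 can be checked by encoding a few characters and looking at the bits. This is only an illustrative sketch for Python 3; the helper name <code>utf8_bits</code> and the sample characters are made up for the example.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as 8-bit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

print(utf8_bits("a"))       # ASCII: a single byte whose highest bit is 0
print(utf8_bits("\xe5"))    # å: two bytes, both with the highest bit set
print(utf8_bits("\u20ac"))  # €: three bytes, all with the highest bit set
```

Because no byte of a multi-byte sequence falls below 128, filtering on <code>ord(c) &lt; 128</code> never keeps half of a character.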


<pre><code>&gt;&gt;&gt; unicode_string = u"hello aåbäcö"
&gt;&gt;&gt; unicode_string.encode("ascii", "ignore")
'hello abc'
</code></pre>
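In Python 3 the same call returns a bytes object, so a final decode is needed to get text back. A minimal sketch, assuming Python 3; the helper name <code>to_ascii</code> is hypothetical.

```python
def to_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    # encode() with the "ignore" handler silently skips unencodable
    # characters; decode() turns the resulting bytes back into a str.
    return text.encode("ascii", "ignore").decode("ascii")

print(to_ascii("hello aåbäcö"))  # hello abc
```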


The following code replaces every non-ASCII character with a question mark:

<pre><code>"".join([x if ord(x) &lt; 128 else '?' for x in s])
</code></pre>
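The codec machinery can do the same substitution through its "replace" error handler. A sketch, assuming Python 3 and a text (str) input; the function name <code>mask_non_ascii</code> is introduced only for illustration.

```python
def mask_non_ascii(text):
    """Replace each non-ASCII character with a question mark."""
    # The "replace" error handler substitutes "?" for every character
    # that the ASCII codec cannot encode.
    return text.encode("ascii", "replace").decode("ascii")

# "\xe5", "\xe4" and "\xf6" are å, ä and ö written as escapes.
print(mask_non_ascii("a\xe5b\xe4c\xf6"))  # a?b?c?
```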


Using a regular expression:

<pre><code>import re

strip_unicode = re.compile("([^-_a-zA-Z0-9!@#%&amp;=,/'\";:~`\$\^\*\+\[\]\.\{\}\|\?\&lt;\&gt;\\]+|[^\s]+)")
print strip_unicode.sub('', u'6Â 918Â 417Â 712')
</code></pre>
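If the goal is only to drop characters outside the ASCII range, a much smaller pattern may be enough. This is a separate sketch for Python 3, not a rewrite of the pattern above; the names <code>NON_ASCII</code> and <code>strip_non_ascii</code> are assumptions made for the example.

```python
import re

# Matches one or more consecutive characters outside U+0000..U+007F.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def strip_non_ascii(text):
    """Remove every run of non-ASCII characters from text."""
    return NON_ASCII.sub("", text)

# "\xc2" is Â and "\xa0" is NO-BREAK SPACE, written as escapes for clarity.
print(strip_non_ascii("6\xc2\xa0918\xc2\xa0417\xc2\xa0712"))  # 6918417712
```

Note that this also removes the no-break spaces, so the digits end up joined together.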


Way too late for an answer, but the original string was in UTF-8, and '\xc2\xa0' is the UTF-8 encoding of NO-BREAK SPACE. Simply decode the original string with <code>s.decode('utf-8')</code>. When the bytes are decoded incorrectly as Windows-1252 or latin-1, \xc2 shows up as Â and \xa0 displays as a space:

<h3>Example (Python 3)</h3>

<pre><code>s = b'6\xc2\xa0918\xc2\xa0417\xc2\xa0712'
print(s.decode('latin-1')) # incorrectly decoded
u = s.decode('utf8') # correctly decoded
print(u)
print(u.replace('\N{NO-BREAK SPACE}','_'))
print(u.replace('\xa0','-')) # \xa0 is Unicode for NO-BREAK SPACE
</code></pre>

<h3>Output</h3>

<pre><code>6Â 918Â 417Â 712
6 918 417 712
6_918_417_712
6-918-417-712
</code></pre>
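If the text has already been decoded with the wrong codec, the damage can often be undone by reversing that step. A sketch, assuming Python 3, that the wrong codec was latin-1, and that the data was really UTF-8; the variable names are made up for the example.

```python
# Text that was UTF-8 but got decoded as latin-1
# ("\xc2" is Â, "\xa0" is NO-BREAK SPACE).
garbled = "6\xc2\xa0918\xc2\xa0417\xc2\xa0712"

# Re-encode with the wrong codec to recover the original bytes,
# then decode those bytes with the right one.
fixed = garbled.encode("latin-1").decode("utf-8")

# Swap the no-break spaces for ordinary spaces before printing.
print(fixed.replace("\xa0", " "))  # 6 918 417 712
```

This round trip only works when the wrong decoding was lossless, which is the case for latin-1 because it maps every byte to exactly one character.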


<pre><code>#!/usr/bin/env python
# -*- coding: utf-8 -*-

s = u"6Â 918Â 417Â 712"
s = s.replace(u"Â", "") 
print s
</code></pre>

This will print out <code>6 918 417 712</code>


<pre><code>s.replace(u'Â ', '') # u before string is important
</code></pre>

and save your <code>.py</code> file in a Unicode encoding such as UTF-8.


This is a dirty hack, but it may work.

<pre><code>s2 = ""
for i in s:
    if ord(i) &lt; 128:
        s2 += i
</code></pre>


My two cents, for text scraped with Beautiful Soup:
<pre><code>string='&lt;span style=&quot;width: 0px&gt; dirty text begin ( ĀĒēāæśḍṣ \xa0 ) dtext end &lt;/span&gt;&lt;/span&gt;'
string=string.encode().decode('ascii',errors='ignore')
print(string)
</code></pre>
will give
<pre><code>&lt;span style=&quot;width: 0px&gt; dirty text begin (   ) dtext end &lt;/span&gt;&lt;/span&gt;
</code></pre>

I have a string that looks like so:

<pre><code>6Â 918Â 417Â 712
</code></pre>

The clear-cut way to trim this string (as I understand Python), assuming the string is in a variable called <code>s</code>, is simply:

<pre><code>s.replace('Â ', '')
</code></pre>

That should do the trick. But of course it complains that the non-ASCII character <code>'\xc2'</code> in file blabla.py has no declared encoding.

I never quite could understand how to switch between different encodings.

Here's the code. It is really the same as above, but now in context. The file is saved as UTF-8 in Notepad and has the following header:

<pre><code>#!/usr/bin/python2.4
# -*- coding: utf-8 -*-
</code></pre>

The code:

<pre><code>f = urllib.urlopen(url)

soup = BeautifulSoup(f)

s = soup.find('div', {'id':'main_count'})

# Printing s here works fine; it shows 6Â 918Â 417Â 712

s.replace('Â ','')

save_main_count(s)
</code></pre>

It gets no further than <code>s.replace</code>...

How do I make the Python interpreter correctly handle non-ASCII characters in string operations?

Python


9 answers
