
Before you use <code>scan</code>, make sure that the requested page's <code>Content-Type</code> header is <code>text/html</code>, since links can point to things like images, which are not UTF-8 text. The page can also be non-HTML if you picked up an <code>href</code> from something like a <code>&lt;link&gt;</code> element. How you check this depends on the HTTP library you are using. Then make sure the body is ASCII-only with <code>String#ascii_only?</code> (ASCII rather than UTF-8, because HTML is only supposed to use ASCII; anything else can be written with entities). If both tests pass, it is safe to use <code>scan</code>.
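
A rough sketch of those two checks, assuming you already have the header value and the body as strings. The helper names <code>scannable_html?</code> and <code>extract_hrefs</code> are made up for illustration:

```ruby
# Hypothetical helper: true only when the response is declared as HTML
# and the body contains nothing but ASCII bytes.
def scannable_html?(content_type, body)
  return false if content_type.nil? || body.nil?

  # Drop parameters such as "; charset=utf-8" before comparing.
  media_type = content_type.split(';').first.to_s.strip.downcase
  media_type == 'text/html' && body.ascii_only?
end

# Hypothetical helper: returns the href values, or [] when the checks fail.
def extract_hrefs(content_type, body)
  return [] unless scannable_html?(content_type, body)

  body.scan(/href="(.*?)"/i).flatten
end
```

With <code>net/http</code>, the header value should be available as <code>response['Content-Type']</code>. Pages that fail the ASCII check are simply skipped here, so this trades coverage for safety.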


I recommend using an HTML parser. Just find the fastest one.

Parsing HTML is not as easy as it may seem.

Browsers handle invalid UTF-8 sequences in UTF-8 HTML documents by substituting the "�" (U+FFFD) replacement character. So once the HTML has been parsed, the resulting text is a valid string even if the source contained invalid sequences.

Even inside attribute values you have to decode HTML entities such as <code>&amp;amp;</code>.

Here is a great question that sums up why you cannot reliably parse HTML with a regular expression:
<a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags">RegEx match open tags except XHTML self-contained tags</a>
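
To make the two points above concrete, here is a small sketch using only the standard library. The sample HTML and variable names are invented for illustration, and <code>String#scrub</code> needs Ruby 2.1 or later, so it does not apply to 1.9 as-is:

```ruby
require 'cgi'

html = '<a href="/search?q=ruby&amp;page=2">next</a>'

# The regex returns the attribute exactly as written in the source,
# with the entity still encoded.
raw_href = html.scan(/href="(.*?)"/i).flatten.first

# CGI.unescapeHTML decodes the most common named entities and numeric
# references; a real HTML parser handles the full set.
decoded_href = CGI.unescapeHTML(raw_href)

# String#scrub (Ruby 2.1+) mimics the browser behaviour described above:
# each invalid byte sequence becomes U+FFFD by default.
repaired = "caf\xE9 au lait".scrub
```

A parser does both of these steps for you, which is part of why it is slower than a bare regex.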


I ran into a string that mixed English, Russian, and some other alphabets, which caused the exception. I only need Russian and English, and this currently works for me:

<pre><code>ec1 = Encoding::Converter.new "UTF-8","Windows-1251",:invalid=&gt;:replace,:undef=&gt;:replace,:replace=&gt;""
ec2 = Encoding::Converter.new "Windows-1251","UTF-8",:invalid=&gt;:replace,:undef=&gt;:replace,:replace=&gt;""
t = ec2.convert ec1.convert t
</code></pre>
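
To show the effect of that round trip, here is a sketch that does the same thing with <code>String#encode</code> instead of <code>Encoding::Converter</code>. The helper name <code>keep_cp1251_chars</code> and the sample string are made up for illustration:

```ruby
# Hypothetical helper: keep only what Windows-1251 can represent
# (ASCII plus Cyrillic, roughly) and return it as UTF-8 again.
def keep_cp1251_chars(text)
  text.encode('Windows-1251', 'UTF-8', invalid: :replace, undef: :replace, replace: '')
      .encode('UTF-8', 'Windows-1251')
end

# English, Russian ("Привет"), and two Chinese characters.
mixed = "Hello \u041F\u0440\u0438\u0432\u0435\u0442 \u4F60\u597D"
cleaned = keep_cp1251_chars(mixed)
```

The Chinese characters have no Windows-1251 equivalent, so they are dropped on the way out, along with any invalid bytes. Everything else survives the trip back to UTF-8.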


My current solution is to run: 

<pre><code>my_string.unpack("C*").pack("U*")
</code></pre>

This at least gets rid of the exceptions, which were my main problem.
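
For anyone weighing this approach: it turns each byte into the code point with the same number, which amounts to reading the input as ISO-8859-1. The sketch below (variable names are illustrative) shows that this repairs Latin-1 input but double-encodes text that was already valid UTF-8:

```ruby
# "café" as ISO-8859-1 bytes: not valid UTF-8 on its own.
latin1_bytes = "caf\xE9".dup.force_encoding(Encoding::BINARY)
from_latin1 = latin1_bytes.unpack('C*').pack('U*')

# "café" as proper UTF-8: the two bytes of "é" become two separate characters.
utf8_text = "caf\u00E9"
from_utf8 = utf8_text.unpack('C*').pack('U*')
```

So the exceptions go away either way, but non-ASCII characters from UTF-8 pages come out as mojibake.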


In Ruby 1.9.3 it is possible to use <code>String#encode</code> to "ignore" the invalid UTF-8 sequences. Here is a snippet that works in both 1.8 (<a href="http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/" rel="noreferrer">iconv</a>) and 1.9 (<a href="http://www.ruby-doc.org/core-1.9.3/String.html#method-i-encode-21" rel="noreferrer">String#encode</a>):

<pre><code>require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-8', 'UTF-8', :invalid =&gt; :replace)
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
</code></pre>

Or, if you have really troublesome input, you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:

<pre><code>require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-16', 'UTF-8', :invalid =&gt; :replace, :replace =&gt; '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
</code></pre>
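
On Ruby 2.1 and later there is also <code>String#scrub</code>, which is not part of the answer above but does the same job directly. A minimal sketch that prefers it and falls back to the UTF-16 round trip; the helper name <code>strip_invalid_utf8</code> is made up:

```ruby
# Hypothetical helper: returns a copy of `text` with invalid UTF-8 bytes removed.
def strip_invalid_utf8(text)
  utf8 = text.dup.force_encoding(Encoding::UTF_8)

  if utf8.respond_to?(:scrub)
    # Ruby 2.1+: replace every invalid sequence with the empty string.
    utf8.scrub('')
  else
    # Older Rubies: the UTF-16 round trip from the snippet above.
    utf8.encode('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
        .encode('UTF-8', 'UTF-16')
  end
end
```

Unlike <code>encode!</code>, this works on a copy, so the caller's string is left untouched.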


Nakilon's solution works, at least as far as getting past the error. In my case, though, I had a weird messed-up character that came from a Microsoft Excel file converted to CSV, and it was registering in Ruby as (get this) a Cyrillic K, which in Ruby showed up as a bolded K. To fix this I used 'iso-8859-1', i.e. <code>CSV.parse(f, :encoding =&gt; "iso-8859-1")</code>, which turned my freaky-deaky Cyrillic K's into a much more manageable <code>/\xCA/</code>, which I could then remove with <code>string.gsub!(/\xCA/, '')</code>.


This seems to work:

<pre><code>def sanitize_utf8(string)
  return nil if string.nil?
  return string if string.valid_encoding?
  string.chars.select { |c| c.valid_encoding? }.join
end
</code></pre>


Neither the accepted answer nor the other answers worked for me. I found <a href="http://robots.thoughtbot.com/post/42664369166/fight-back-utf-8-invalid-byte-sequences">this post</a>, which suggested:

<pre><code>string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
</code></pre>

This fixed the problem for me.
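
One thing to be aware of with this call: because the source encoding is given as <code>'binary'</code>, every byte above 0x7F counts as undefined and is dropped, so valid non-ASCII characters disappear along with the broken ones. A small sketch (sample input invented, non-destructive <code>encode</code> used so the original stays around for comparison):

```ruby
# A valid "é" followed by one stray byte.
input = "caf\u00E9" + "\xFF"

# Same options as above: everything outside ASCII is removed.
from_binary = input.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

# For comparison, String#scrub (Ruby 2.1+) removes only the invalid byte.
scrubbed = input.scrub('')
```

If the data is expected to be ASCII anyway, that is harmless. For international text it is probably more aggressive than you want.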


If you don't "care" about the data you can just do something like: 

<code>search_params = params[:search].valid_encoding? ? params[:search].gsub(/\W+/, '') : "nothing"</code>

I just used <code>valid_encoding?</code> to get past it. Mine is a search field, and I kept running into the same weirdness over and over, so I used something like the above just to keep the system from breaking. Since I don't control the user experience and can't validate the input before it is sent (for example with automatic feedback saying "dummy up!"), I can just take it in, strip it out, and return blank results.


Try this:

<pre><code>def to_utf8(str)
  str = str.force_encoding('UTF-8')
  return str if str.valid_encoding?
  str.encode("UTF-8", 'binary', invalid: :replace, undef: :replace, replace: '')
end
</code></pre>
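
Note that <code>force_encoding</code> changes the string it is called on, so the method above also retags the caller's string. If that matters, a variant that works on a copy might look like this (the name <code>to_utf8_copy</code> is made up):

```ruby
# Hypothetical variant of to_utf8 that leaves the argument untouched.
def to_utf8_copy(str)
  copy = str.dup.force_encoding(Encoding::UTF_8)
  return copy if copy.valid_encoding?

  # Not valid UTF-8: keep only the ASCII bytes, as in the method above.
  copy.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end

original = "ok\xFF".dup.force_encoding(Encoding::BINARY)
result = to_utf8_copy(original)
```

Here <code>original</code> keeps its binary tag, while <code>result</code> is a valid UTF-8 string.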


<pre><code>attachment = file.read

begin
  # Try it as UTF-8 directly
  cleaned = attachment.dup.force_encoding('UTF-8')
  unless cleaned.valid_encoding?
    # Some of it might be old Windows code page
    cleaned = attachment.encode('UTF-8', 'Windows-1252')
  end
  attachment = cleaned
rescue EncodingError
  # Force it to UTF-8, throwing out invalid bits
  attachment = attachment.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
end
</code></pre>
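
The same cascade can be wrapped in a method so it is easy to reuse and test. This is only a sketch: the name <code>decode_attachment</code> is made up, and it assumes the input is the raw bytes read from the file:

```ruby
# Hypothetical wrapper around the cascade above; `bytes` is the raw file content.
def decode_attachment(bytes)
  cleaned = bytes.dup.force_encoding(Encoding::UTF_8)
  return cleaned if cleaned.valid_encoding?

  begin
    # Not valid UTF-8: assume the old Windows code page.
    bytes.encode('UTF-8', 'Windows-1252')
  rescue EncodingError
    # Last resort: ISO-8859-1 maps every byte to a code point, so this cannot fail.
    bytes.dup.force_encoding(Encoding::ISO_8859_1).encode('UTF-8')
  end
end
```

The order matters: valid UTF-8 is returned as-is, and the single-byte encodings are only guessed at when that check fails.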

I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites. 
When trying to extract links, I decided to just use <code>.scan(/href="(.*?)"/i)</code> instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "<code>invalid byte sequence in UTF-8</code>" errors. 
From what I understand, the <code>net/http</code> library doesn't have any encoding-specific options, and the data that comes in is basically not tagged with the right encoding.
What would be the best way to actually work with that incoming data? I tried <code>.encode</code> with the <code>:invalid</code> and <code>:replace</code> options set, but no success so far...

ruby 1.9: invalid byte sequence in UTF-8
