Trying to read a UTF-8 TSV file with Hebrew characters on an HDInsight cluster running Spark on Linux, and I'm getting an encoding error. Any suggestions?
Here is my PySpark notebook code:
from pyspark.sql import *
# Create an RDD from sample data
transactionsText = sc.textFile("/people.txt")
header = transactionsText.first()
# Create a schema for our data
Entry = Row('id','name','age')
# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x: x != header).map(lambda l: l.encode('utf-8').split("\t"))
transactions = transactionsParts.map(lambda p: Entry(str(p[0]),str(p[1]),int(p[2])))
# Infer the schema and create a table
transactionsTable = sqlContext.createDataFrame(transactions)
# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM transactionsTempTable")
# The results of SQL queries are RDDs and support all the normal RDD operations.
names = results.map(lambda p: "name: " + p.name)
for name in names.collect():
    print(name)
Error:
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-11: ordinal not in range(128)
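For what it's worth, the same class of error shows up in plain Python 2 without Spark (a minimal repro; the byte positions differ from my traceback):

# Python 2: str() implicitly encodes with the ascii codec, which fails
# on any non-ASCII character, e.g. the Hebrew name from the file
str(u'\u05d2\u05d9\u05d0')
# UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2:
# ordinal not in range(128)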
Hebrew text file contents:
id name age
1 גיא 37
2 maor 32
3 danny 55
When I try an English file, it works fine:
English text file contents:
id name age
1 guy 37
2 maor 32
3 danny 55
Output:
name: guy
name: maor
name: danny
Answered on 2016-06-10 02:41:23
If you run the following code with the Hebrew text:
from pyspark.sql import *
path = "/people.txt"
transactionsText = sc.textFile(path)
header = transactionsText.first()
# Create a schema for our data
Entry = Row('id','name','age')
# Parse the data and create a schema
transactionsParts = transactionsText.filter(lambda x: x != header).map(lambda l: l.split("\t"))
transactions = transactionsParts.map(lambda p: Entry(unicode(p[0]), unicode(p[1]), unicode(p[2])))
transactions.collect()
You will notice that you get the names back as a list of unicode objects:
[Row(id=u'1', name=u'\u05d2\u05d9\u05d0', age=u'37'), Row(id=u'2', name=u'maor', age=u'32 '), Row(id=u'3', name=u'danny', age=u'55')]
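The \u escapes are only the repr of the strings; printing a value renders the Hebrew characters (a quick check, assuming a UTF-8 notebook or terminal):

# repr() shows escape sequences; print renders the actual characters
first = transactions.first()
print(first.name)   # גיא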
Now, let's register a temp table over the transactions RDD:
table_name = "transactionsTempTable"
# Infer the schema and create a table
transactionsDf = sqlContext.createDataFrame(transactions)
transactionsDf.registerTempTable(table_name)
# SQL can be run over DataFrames that have been registered as a table.
results = sqlContext.sql("SELECT name FROM {}".format(table_name))
results.collect()
You will notice that all the strings returned from the DataFrame are of the Python unicode type:
[Row(name=u'\u05d2\u05d9\u05d0'), Row(name=u'maor'), Row(name=u'danny')]
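To get the "name: ..." output from the original code, the fix to its final step is to build the strings as unicode (u"...") rather than through str() (a minimal sketch of one way to do it):

# Concatenating unicode with unicode avoids the implicit ascii encoding
names = results.map(lambda p: u"name: " + p.name)
for name in names.collect():
    print(name)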
Now, running:
%%sql
SELECT * FROM transactionsTempTable
will give the expected results:
name: גיא
name: maor
name: danny
Note that if you want to do any further work with these names, you will probably want to handle them as unicode strings. From this post:
When you're dealing with text manipulations (finding the number of characters in a string or cutting a string on word boundaries), you should be dealing with unicode strings, as they abstract characters in a manner that's appropriate for thinking of them as a sequence of letters that you will see on a page. When dealing with I/O, reading to and from the disk, printing to a terminal, sending something over a network link, etc., you should be dealing with byte str, because those devices are going to need to deal with concrete implementations of what bytes represent your abstract characters.
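A short illustration of that distinction (assuming Python 2; the variable names are mine):

name = u'\u05d2\u05d9\u05d0'    # unicode: three abstract characters
encoded = name.encode('utf-8')  # str (bytes): six bytes on disk or the wire

len(name)     # 3 -- character count, what text manipulation cares about
len(encoded)  # 6 -- byte count, what the disk/terminal/network cares about

# Decode at the I/O boundary, do all the text work in unicode in between
assert encoded.decode('utf-8') == name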
https://stackoverflow.com/questions/37698276