Hi, I'm running a SparkR program through a shell script. When I point the input file at the local filesystem it works fine, but when I point it at HDFS it throws this error:
Exception in thread "delete Spark local dirs" java.lang.NullPointerException
at org.apache.spark.storage.DiskBlockManager.org$apache$spark$s
I can't parallelize a list in Scala; I'm getting a java.lang.NullPointerException:
messages.foreachRDD { rdd =>
  for (avroLine <- rdd) {
    // Decode the Avro-encoded bytes back into a record
    val record = Injection.injection.invert(avroLine.getBytes).get
    val field1Value = record.get("username")
    val jsonStrings = Seq(record.toString())
  }
}
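For anyone hitting this: the body of for (avroLine <- rdd) runs on the executors, where driver-side state captured in the closure can be null. A minimal sketch of the usual restructuring, assuming the same Injection.injection helper from the question is serializable and the results are small enough to collect:

messages.foreachRDD { rdd =>
  // Do the decoding as a distributed map instead of iterating;
  // nothing driver-side is captured except the injection itself
  val jsonStrings = rdd.map { avroLine =>
    val record = Injection.injection.invert(avroLine.getBytes).get
    record.toString()
  }
  // Bring the (assumed small) results back to the driver to inspect
  jsonStrings.collect().foreach(println)
}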
I'm trying to save a Spark RDD as a gzipped text file (or several text files) to an S3 bucket. The S3 bucket is mounted on DBFS, and I'm trying to save the file using the following command:
rddDataset.saveAsTextFile("/mnt/mymount/myfolder/")
But when I try that, I keep getting this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 32 in stage 18.0 failed 4 times, most recent failure: Lost task 32.3
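As an aside, gzip output is requested by passing a Hadoop compression codec class to saveAsTextFile; a minimal sketch using the mount path from the question:

import org.apache.hadoop.io.compress.GzipCodec

// Each partition is written as a gzipped part-* file under the mount
rddDataset.saveAsTextFile("/mnt/mymount/myfolder/", classOf[GzipCodec])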
sessionIdList is of type:
scala> sessionIdList
res19: org.apache.spark.rdd.RDD[String] = MappedRDD[17] at distinct at <console>:30
When I try to run the code below:
val x = sc.parallelize(List(1,2,3))
val cartesianComp = x.cartesian(x).map(x => (x))
val kDistanceNeighbourhood = sessionIdList.map(s => {
ca
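The NPE here is almost certainly caused by referencing one RDD (cartesianComp) inside a transformation of another (sessionIdList.map); RDDs only exist on the driver, so they are null inside executor-side closures. A minimal sketch of one workaround, assuming the cartesian product is small enough to broadcast:

val x = sc.parallelize(List(1, 2, 3))

// Materialize the cartesian product on the driver and broadcast it,
// so the closure below captures a plain array rather than an RDD
val cartesianBc = sc.broadcast(x.cartesian(x).collect())

val kDistanceNeighbourhood = sessionIdList.map { s =>
  // Use the broadcast value; touching the RDD itself here would NPE
  cartesianBc.value.length
}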
I trained a logistic regression model in Apache Spark (PySpark) and used it to evaluate some test data... like this:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# Split into train and test sets
train, test = data.randomSplit([.8, .2], seed=1337)
# Train a model
model = LogisticRegressionWithLBFGS.train(train)
# Print the coefficients
print(model.weights)
# Evaluate the test data
predictions =
I'm trying to run a large number of k-means jobs in parallel. I have rooms, each with a lot of data, and I want to compute the clusters for each room. So I have:
roomsSignals[(room: String, signals: List[org.apache.spark.mllib.linalg.Vector])]
roomsSignals.map{l=>
val data=sc.parallelize(l.signals)
val clusterCenters=2
val model = KMeans.train(data, clusterCenters, 5)
model.clusterCenters.map { r =>
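The failure mode here is the same nested-RDD problem: sc only exists on the driver, so calling sc.parallelize inside roomsSignals.map throws an NPE on the executors. A minimal sketch of one workaround, assuming the per-room signal lists fit on the driver:

import org.apache.spark.mllib.clustering.KMeans

// Loop on the driver, where sc is available; each KMeans.train call
// is itself a distributed job over that room's signals
val roomModels = roomsSignals.collect().map { case (room, signals) =>
  val data = sc.parallelize(signals)
  val model = KMeans.train(data, 2, 5) // k = 2, maxIterations = 5
  room -> model.clusterCenters
}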
I'm trying to pass messages around a graph to compute recursive features. I get an error when I define a graph whose vertices are the output of aggregateMessages. Code for context:
> val newGraph = Graph(newVertices, edges)
newGraph: org.apache.spark.graphx.Graph[List[Double],Int] = org.apache.spark.graphx.impl.GraphImpl@2091594b
//This is the RDD that causes the problem
> val result = newGraph.aggregat
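For context, aggregateMessages takes a sendMsg function over edge contexts and a commutative mergeMsg reducer. A minimal sketch against the graph above (the Double message type and the send/merge logic are illustrative assumptions, not the question's actual code):

import org.apache.spark.graphx._

// Send the head of each source vertex's List[Double] to the destination
// and merge incoming messages by summing them
val result: VertexRDD[Double] = newGraph.aggregateMessages[Double](
  ctx => ctx.sendToDst(ctx.srcAttr.headOption.getOrElse(0.0)),
  (a, b) => a + b
)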
I have an RDD that I want to loop over. I did it like this:
pointsMap.foreach({ p =>
val pointsWithCoordinatesWithDistance = pointsMap.leftOuterJoin(xCoordinatesWithDistance)
pointsWithCoordinatesWithDistance.foreach(println)
println("---")
})
However, a NullPointerException occurs:
java.lang.NullPointerException
at org.apache.
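The cause is that leftOuterJoin is itself an RDD operation and cannot run inside pointsMap.foreach, whose body executes on the executors. A minimal sketch of the restructured version, doing the join once on the driver:

// Join once on the driver side; both operands are RDDs, so this is legal
val pointsWithCoordinatesWithDistance =
  pointsMap.leftOuterJoin(xCoordinatesWithDistance)

// Bring the results back (assumed small) and print them
pointsWithCoordinatesWithDistance.collect().foreach(println)
println("---")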
Spark doesn't give a very descriptive error message in this case, but for future reference, this question applies to anyone getting a NullPointerException that looks like this:
java.lang.NullPointerException
at org.apache.spark.rdd.RDD.take(RDD.scala:850)
at org.apache.spark.rdd.RDD.first(RDD.scala:862)
at modelBuilding$$anonfun$3.apply(modelBuilding.scala:46)
at modelBuilding$$anonfun$3.apply(mo
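The trace shows RDD.first (which calls RDD.take) being invoked from inside a closure at modelBuilding.scala:46, i.e. an action nested inside another RDD operation. A schematic sketch of the anti-pattern and the fix (someRdd and otherRdd are hypothetical names):

// Anti-pattern: an action inside a transformation NPEs on the executors
// val broken = someRdd.map { x => (x, otherRdd.first()) }

// Fix: run the action on the driver first, then capture the plain value
val firstValue = otherRdd.first()
val fixed = someRdd.map { x => (x, firstValue) }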
On Dataproc the error can be seen several times in the logs, but the job does not exit and keeps running for hours.
Any help resolving this would be greatly appreciated.
The data the job runs on is also very small.
Sometimes the job runs fine after a re-run, but it hits this problem at random:
Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
executor.scala:318)
at org.apache.spark
I'm taking some small steps with Spark. My exercise is to load a JSON file into an RDD, select a column, and then use distinct to get the unique values. The column I'm filtering on contains multiple values (a CSV row) that have to be split.
val sqlContext = spark.sqlContext
import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)
import hiveCtx.implicits._
val bizDF = hiveCtx.jsonFile("/home/xpto/Documents/PersonalProjects
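Once the DataFrame is loaded, the select/split/distinct step can look like the following sketch (the column name "categories" and the comma separator are assumptions about the data):

// Pull out the multi-valued column, split each CSV row into individual
// values, trim whitespace, and keep only the distinct ones
val uniqueValues = bizDF
  .select("categories")
  .rdd
  .flatMap(row => row.getString(0).split(","))
  .map(_.trim)
  .distinct()

uniqueValues.collect().foreach(println)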
I have a JSON file that looks like the following:
{"name":"method2","name1":"test","parameter1":"C:/Users/test/Desktop/Online.csv","parameter2": 1.0}
I'm loading my JSON file like this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("C:/Users/test/Deskto
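After the read, the schema and the parsed fields can be checked with a couple of standard DataFrame calls; a small sketch:

// Each top-level JSON key becomes a column
df.printSchema()
df.select("name", "parameter1", "parameter2").show()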