
Compiling, Installing, and Deploying Spark 1.6.0 (Scala 2.11)

Published 2022-05-07 14:11:32

On January 4, 2016, Spark 1.6.0 was published on the official Spark website, so I downloaded and compiled it.


Building for Scala 2.11 still takes a single command:

    build/sbt -Dscala-2.11 -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver assembly

Testing the new features in Spark 1.6 (Dataset):


Spark Core/SQL

  • API Updates
    • SPARK-9999  Dataset API - A new Spark API, similar to RDDs, that allows users to work with custom objects and lambda functions while still gaining the benefits of the Spark SQL execution engine.
    • SPARK-10810 Session Management - Different users can share a cluster while having different configuration and temporary tables.
    • SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
    • SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
    • SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
    • SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns
    • SPARK-4849  Advanced Layout of Cached Data - Stores partitioning and ordering schemes in the in-memory table scan, and adds distributeBy and localSort to the DataFrame API.
    • SPARK-11778  - DataFrameReader.table supports specifying a database name. For example, sqlContext.read.table("dbName.tableName") can be used to create a DataFrame from a table called "tableName" in the database "dbName".
    • SPARK-10947  - With schema inference from JSON into a DataFrame, users can set primitivesAsString to true (in data source options) to infer all primitive value types as Strings. The default value of primitivesAsString is false.
  • Performance
    • SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
    • SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
    • SPARK-9241  Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
    • SPARK-9858  Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
    • SPARK-10978 Avoiding double filters in Data Source API - When implementing a data source with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
    • SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a Cartesian product.
    • SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
    • SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
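
The Dataset API listed above (SPARK-9999) can be sketched roughly as follows. This is a minimal illustration, assuming a Spark 1.6.0 build for Scala 2.11 on the classpath and a local master; the Person case class, the isAdult helper, and the sample rows are hypothetical and do not come from the release notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Long)

object DatasetSketch {
  // Plain Scala predicate on a custom object; no Column expressions needed.
  def isAdult(p: Person): Boolean = p.age >= 18

  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two threads.
    val conf = new SparkConf().setAppName("DatasetSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // brings toDS() and the implicit encoders into scope

    // Build a typed Dataset[Person] from a local collection.
    val people = Seq(Person("Ann", 31), Person("Bob", 17)).toDS()

    // Lambdas operate on Person objects, while execution goes through Spark SQL.
    val adultNames = people.filter(p => isAdult(p)).map(_.name)
    adultNames.collect().foreach(println)

    sc.stop()
  }
}
```

The point of the sketch is that filter and map take ordinary Scala functions over Person, as with RDDs, while the data is still encoded and executed by the Spark SQL engine.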

Spark Streaming

  • API Updates
    • SPARK-2629  New improved state management - mapWithState - a DStream transformation for stateful stream processing, supersedes updateStateByKey in functionality and performance.
    • SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
    • SPARK-10891 Kinesis message handler function - Allows an arbitrary function to be applied to a Kinesis record in the Kinesis receiver, to customize what data is stored in memory.
    • SPARK-6328  Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in Python streaming applications.
  • UI Improvements
    • Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
    • Made output operations visible in the streaming tab as progress bars.
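
The mapWithState transformation listed above (SPARK-2629) can be sketched as a running word count. This is a minimal example under stated assumptions: a text server is listening on localhost:9999 (for instance one started with nc -lk 9999), and the checkpoint directory, the object name, and the addCount helper are all hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateSketch {
  // Pure helper: new running total from the stored total and the incoming value.
  def addCount(stored: Option[Int], incoming: Option[Int]): Int =
    stored.getOrElse(0) + incoming.getOrElse(0)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapWithStateSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    // Stateful transformations need a checkpoint directory (hypothetical path).
    ssc.checkpoint("/tmp/map-with-state-checkpoint")

    // Assumption: lines of text arrive on localhost:9999.
    val wordPairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Called once per record: read the state for the key, update it, emit a result.
    val trackCount = (word: String, one: Option[Int], state: State[Int]) => {
      val total = addCount(state.getOption, one)
      state.update(total)
      (word, total)
    }

    wordPairs.mapWithState(StateSpec.function(trackCount)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Unlike updateStateByKey, the mapping function here is invoked only for keys that receive data in a batch, which is one reason the release notes describe it as faster.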


MLlib

  • New algorithms/models
    • SPARK-8518  Survival analysis - Log-linear model for survival analysis
    • SPARK-9834  Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
    • SPARK-3147  Online hypothesis testing - A/B testing in the Spark Streaming framework
    • SPARK-9930  New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
    • SPARK-6517  Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
  • API improvements
    • ML Pipelines
      • SPARK-6725  Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
      • SPARK-5565  LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
    • R API
      • SPARK-9836  R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
      • SPARK-9681  Feature interactions in R formula - Interaction operator “:” in R formula
    • Python API - Many improvements to Python API to approach feature parity
  • Misc improvements
  • Documentation improvements
    • SPARK-7751  @since versions - Documentation includes initial version when classes and methods were added
    • SPARK-11337 Testable example code - Automated testing for code in user guide examples
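
Pipeline persistence (SPARK-6725, listed above) can be sketched as saving and reloading an unfitted pipeline. This is only an illustration, assuming Spark 1.6.0 with spark-mllib on the classpath and stages that support persistence in that release; the column names, the save path, and the buildPipeline helper are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

object PipelinePersistenceSketch {
  // Two feature transformers chained into an (unfitted) pipeline.
  def buildPipeline(): Pipeline = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    new Pipeline().setStages(Array[PipelineStage](tokenizer, hashingTF))
  }

  def main(args: Array[String]): Unit = {
    // An active SparkContext is needed because the writer stores metadata through Spark.
    val sc = new SparkContext(
      new SparkConf().setAppName("PipelinePersistenceSketch").setMaster("local[2]"))

    val path = "/tmp/spark-1.6-unfit-pipeline" // hypothetical location
    // overwrite() avoids a failure when the path already exists.
    buildPipeline().write.overwrite().save(path)

    // Load the pipeline back and list the identifiers of the restored stages.
    val restored = Pipeline.load(path)
    println(restored.getStages.map(_.uid).mkString(", "))

    sc.stop()
  }
}
```

Because the release notes describe coverage of spark.ml algorithms as partial, a pipeline containing a stage without save/load support would be expected to fail at the save step.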
Originally published: 2016-01-06
