Big Data ETL in Practice ---- Written-Test and Interview Topics

流川疯

Published 2020-10-09 11:21:34


spark

4. What are Spark's advantages as a computing framework?

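1. Spark keeps intermediate data in memory, which makes iterative computation more efficient.
2. Spark is more general-purpose than Hadoop MapReduce.
3. Spark provides a unified programming interface.
4. Fault tolerance: when computing over distributed datasets, Spark achieves fault tolerance through checkpointing (alongside recomputation from RDD lineage).
5. Usability: Spark offers rich Scala, Java and Python APIs plus an interactive shell.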
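The first point can be illustrated without Spark itself. The sketch below is plain Python, not the Spark API: `load_from_disk` is a hypothetical stand-in for an expensive read from storage, and the counter shows how many reads an iterative job issues with and without keeping the dataset in memory.

```python
# Toy illustration only: plain Python, not the Spark API.
# load_from_disk() is a hypothetical stand-in for an expensive read from HDFS.

disk_reads = 0


def load_from_disk():
    """Pretend to read the dataset from disk and count each read."""
    global disk_reads
    disk_reads += 1
    return [1.0, 2.0, 3.0, 4.0]


def iterate_without_cache(iterations):
    """MapReduce-style: every iteration re-reads its input from disk."""
    total = 0.0
    for _ in range(iterations):
        data = load_from_disk()
        total += sum(data)
    return total


def iterate_with_cache(iterations):
    """Spark-style: read once, keep the data in memory, reuse it."""
    data = load_from_disk()  # analogous to caching a dataset in memory
    total = 0.0
    for _ in range(iterations):
        total += sum(data)
    return total


disk_reads = 0
uncached_total = iterate_without_cache(10)
uncached_reads = disk_reads

disk_reads = 0
cached_total = iterate_with_cache(10)
cached_reads = disk_reads

print(uncached_reads, cached_reads)
```

In PySpark the corresponding step would be calling `cache()` or `persist()` on an RDD or DataFrame before the loop; the toy version only shows why that helps.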

hadoop

hdfs

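Explain Partition and Shuffle in MapReduce.

Partition: the map output has to be divided among the reduce tasks, and the partitioner decides which reducer receives each key (by default, by hashing the key). Shuffle: describes how data travels from the map output to the reduce input. In Hadoop, most map tasks and reduce tasks run on different nodes, so the main costs are network overhead and disk I/O. The shuffle therefore has two main goals: deliver the data completely from the map task side to the reduce side, and consume as little bandwidth as possible when transferring data across nodes.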
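The two steps can be sketched as a single-process word count in plain Python. This is an illustration, not Hadoop code: the function names are made up, and `partition` uses a simple byte sum in place of Hadoop's default `HashPartitioner`, which computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit (word, 1) for every word in the input split."""
    for line in lines:
        for word in line.split():
            yield word, 1


def partition(key, num_reducers):
    """Pick the reducer for a key.

    Uses a deterministic byte sum rather than Python's hash(), which is
    randomized per process for strings.
    """
    return sum(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Shuffle: route each pair to its partition and group values by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        partitions[partition(key, num_reducers)][key].append(value)
    return partitions


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Two input splits, as if handled by two map tasks.
splits = [["a b a"], ["b c a"]]
map_outputs = [pair for split in splits for pair in map_phase(split)]
partitions = shuffle(map_outputs, num_reducers=2)
results = [reduce_phase(p) for p in partitions]
print(results)
```

Every occurrence of a key lands in the same partition, which is what lets each reducer produce a final count on its own. In a real cluster the routing step is where the network and disk I/O cost is paid.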

2. List the big data middleware you know and what each is used for, e.g. HDFS, the distributed file system.

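(a) HDFS: the Hadoop Distributed File System, the widely used distributed file system of the Hadoop ecosystem. Many other components are built on top of it, such as Hadoop MapReduce and HBase. HDFS looks like a traditional hierarchical file system: you can create, delete, move or rename files.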

Spark's RDD is also very useful middleware: it gives the various Spark components a basic format for representing data in memory.

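(b) MapReduce: a programming model for processing large collections of semi-structured data.
(c) HBase: a distributed NoSQL column-oriented database similar to Google BigTable.
(d) Hive: a data warehouse tool, contributed by Facebook.
(e) ZooKeeper: a distributed coordination and locking service providing functionality similar to Google Chubby, originally developed at Yahoo!.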

hdfs shell

e.g. the cat command

Usage: hadoop fs -cat [-ignoreCrc] URI [URI ...]

Copies source paths to stdout.

Options

The -ignoreCrc option disables checksum verification.

Example:

hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -cat file:///file3 /user/hadoop/file4

Exit Code:

Returns 0 on success and -1 on error.
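When the command has to be called from an ETL script, one option is to wrap it with Python's `subprocess` module. The sketch below is only one way to do it: the helper names `build_cat_command` and `hdfs_cat` are hypothetical, and it assumes a `hadoop` executable is on the `PATH` of the machine that runs it.

```python
import subprocess


def build_cat_command(uris, ignore_crc=False):
    """Build the argument list for `hadoop fs -cat [-ignoreCrc] URI [URI ...]`."""
    if not uris:
        raise ValueError("at least one URI is required")
    cmd = ["hadoop", "fs", "-cat"]
    if ignore_crc:
        cmd.append("-ignoreCrc")
    cmd.extend(uris)
    return cmd


def hdfs_cat(uris, ignore_crc=False):
    """Run the command and return its stdout as bytes.

    Assumes the hadoop CLI is installed; a non-zero exit code raises
    subprocess.CalledProcessError because check=True is set.
    """
    completed = subprocess.run(
        build_cat_command(uris, ignore_crc),
        check=True,
        capture_output=True,
    )
    return completed.stdout


# Only builds the argument list; nothing is executed here.
example = build_cat_command(["file:///file3", "/user/hadoop/file4"])
print(example)
```

Passing the arguments as a list avoids shell quoting problems with unusual paths.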

FileSystemShell.html

YARN

The following passage comes from the official site; it is worth learning well enough to write out from memory: The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.

The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

The ResourceManager has two main components: Scheduler and ApplicationsManager.

The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is pure scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees about restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a resource Container which incorporates elements such as memory, cpu, disk, network etc.

The Scheduler has a pluggable policy which is responsible for partitioning the cluster resources among the various queues, applications etc. The current schedulers such as the CapacityScheduler and the FairScheduler would be some examples of plug-ins.

The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure. The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.

MapReduce in hadoop-2.x maintains API compatibility with previous stable release (hadoop-1.x). This means that all MapReduce jobs should still run unchanged on top of YARN with just a recompile.

YARN supports the notion of resource reservation via the ReservationSystem, a component that allows users to specify a profile of resources over-time and temporal constraints (e.g., deadlines), and reserve resources to ensure the predictable execution of important jobs. The ReservationSystem tracks resources over-time, performs admission control for reservations, and dynamically instructs the underlying scheduler to ensure that the reservation is fulfilled.

In order to scale YARN beyond few thousands nodes, YARN supports the notion of Federation via the YARN Federation feature. Federation allows to transparently wire together multiple yarn (sub-)clusters, and make them appear as a single massive cluster. This can be used to achieve larger scale, and/or to allow multiple independent clusters to be used together for very large jobs, or for tenants who have capacity across all of them.
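The Scheduler's job of "partitioning the cluster resources among the various queues" can be made concrete with a toy sketch in plain Python. It is not the CapacityScheduler or any YARN API: the queue names, percentages, and the functions `partition_cluster` and `can_allocate` are all hypothetical, and the real schedulers are considerably more elaborate.

```python
def partition_cluster(total_memory_mb, queue_capacities):
    """Split cluster memory among queues by percentage share.

    queue_capacities maps a queue name to its percent of the cluster;
    the shares must sum to 100.
    """
    if sum(queue_capacities.values()) != 100:
        raise ValueError("queue capacities must sum to 100 percent")
    return {
        queue: total_memory_mb * percent // 100
        for queue, percent in queue_capacities.items()
    }


def can_allocate(queue_limits, queue_usage, queue, request_mb):
    """Grant a container request only if the queue stays within its share."""
    return queue_usage.get(queue, 0) + request_mb <= queue_limits[queue]


# Hypothetical cluster with 100,000 MB of memory and two queues.
limits = partition_cluster(100_000, {"prod": 70, "dev": 30})
usage = {"prod": 69_000, "dev": 0}

prod_ok = can_allocate(limits, usage, "prod", 1024)  # would exceed prod's share
dev_ok = can_allocate(limits, usage, "dev", 1024)    # fits within dev's share
print(limits, prod_ok, dev_ok)
```

Note that the sketch only decides whether a request fits. As the quoted text says, the Scheduler itself does no monitoring or restarting of tasks, so those concerns are left out here as well.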

Apache Hadoop YARN architecture

References

HDFS Architecture

hive

How do you understand the differences between Hive and a traditional database? What scenarios is each suited to?

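1. Data storage location. Hive is built on top of Hadoop, and all Hive data is stored in HDFS. A database can keep its data on block devices or in the local file system.
2. Data format. Hive does not define a dedicated data format; the user specifies it through three attributes: the column delimiter, the row delimiter, and the method used to read the file data. In a database, the storage engine defines its own data format, and all data is stored according to that organization.
3. Data updates. Hive workloads are read-heavy and write-light, so Hive traditionally does not support rewriting or deleting data; the data is fixed at load time. Data in a database usually needs to be modified frequently.
4. Execution latency. A Hive query has to scan the whole table (or partition), so latency is high; Hive only has an advantage when processing large data. A database has low execution latency when processing small data.
5. Indexes. Hive has no indexes comparable to a database's; databases do.
6. Execution. Hive queries run as MapReduce jobs; a database uses its own executor.
7. Scalability. Hive: high. Databases: low.
8. Data scale. Hive: large. Databases: small.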
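The second point (data format) amounts to applying a schema when the file is read rather than when it is written. The plain-Python sketch below illustrates the idea only; `read_rows` and the sample schema are hypothetical. In Hive itself the same three attributes are declared in the table DDL, via `ROW FORMAT DELIMITED FIELDS TERMINATED BY ... LINES TERMINATED BY ...` and a `STORED AS` clause or SerDe.

```python
def read_rows(raw_text, field_delim, line_delim, columns):
    """Apply a schema at read time.

    Splits raw text into rows with the given delimiters and casts each
    field; columns is a list of (name, cast_function) pairs.
    """
    rows = []
    for line in raw_text.split(line_delim):
        if not line:
            continue  # skip the empty piece after a trailing delimiter
        fields = line.split(field_delim)
        rows.append(
            {name: cast(value) for (name, cast), value in zip(columns, fields)}
        )
    return rows


# Raw file content using Ctrl-A (\x01) between fields and \n between rows,
# which are Hive's default delimiters for text tables.
raw = "1\x01alice\x0130\n2\x01bob\x0125\n"
schema = [("id", int), ("name", str), ("age", int)]

rows = read_rows(raw, field_delim="\x01", line_delim="\n", columns=schema)
print(rows)
```

The file itself stays an ordinary delimited text file; changing the schema changes how it is interpreted without rewriting any data.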

Typical use cases for Hive:
1. Data ingestion
2. Data discovery
3. Data analytics
4. Data visualization and collaboration