hadoop源码解析之hdfs内部结构分析

大数据技术与应用实战

发布于 2020-09-15 14:26:54

1K0

发布于 2020-09-15 14:26:54

文件目录管理
- i-node介绍
- INodeFile
- INodeDirectory
块管理
- 数据块BlockInfoContiguous
- 集群中所有的块的管理
- LightWeightGSet 介绍

概述

hdfs的内部的文件和目录是如何以树的结构存储的，每个文件对应的块是如何存储的，每个块对应的怎么对应到每一个datanode的，这些结构在hdfs的内部源码是用哪些变量存储的，整体结构是怎么连接起来的，下面我们通过Hadoop的最新稳定版代码（2.7.3）来学习一下。

命名空间Namesystem

org.apache.hadoop.hdfs.server.namenode.FSNamesystem类是hdfs的核心，存储了hdfs的命名空间相关的很多东西，具体来说就是为DataNode做簿记工作，所有datanode的请求都经过FSNamesystem类来处理。

具体他做了哪些重要的工作，我们来看相关注释


/***************************************************
 * FSNamesystem does the actual bookkeeping work for the
 * DataNode.
 *
 * It tracks several important tables.
 *
        文件名 -> 数据块(存放在磁盘，也就是FSImage和日志中)
 * 1)  valid fsname --> blocklist  (kept on disk, logged)
        合法的数据块列表(上面关系的逆关系)
 * 2)  Set of all valid blocks (inverted #1)
        数据块 -> DataNode(只保存在内存中，根据DataNode发过来的信息动态建立)
 * 3)  block --> machinelist (kept in memory, rebuilt dynamically from reports)
       DataNode上保存的数据块(上面关系的逆关系)
 * 4)  machine --> blocklist (inverted #2)
      通过LRU算法保存最近发送过心跳信息的DataNode
 * 5)  LRU cache of updated-heartbeat machines
 ***************************************************/

文件目录管理

  /** The namespace tree. */
  FSDirectory dir;
  private final BlockManager blockManager;

在namenode中，对hdfs的目录以及下面的子目录、文件的操作相当的复杂，会涉及到内存、硬盘的相关操作，还需要和很多对象打交道，FSDirectory主要就是对这个内部的复杂的操作做了一个简单的封装，对外提供一个简单的操作接口。

FSDirectory中很多重要的操作都和ClientProcotol协议中定义的相关方法一一对应,如addBlock等.

FSDirectory对象中有一个非常重要的INodeDirectory类型的变量rootDir，这个变量在内存中保存这个整个hdfs的目录树，下面我们对INode相关的知识做一下介绍。

i-node介绍

linux i-node介绍

在Linux中，inode是一个非常重要的设计，具体我这里就不介绍了，大家可以参考下这个文章。

http://www.ruanyifeng.com/blog/2011/12/inode.html

hdfs的 INode介绍

org.apache.hadoop.hdfs.server.namenode.INode类是整个文件的一个抽象，其子类INodeDirectory和INodeFile分别表示目录和相应的文件，我们来看下具体的类之间的关系

这里写图片描述

其中，INode的直接子类INodeWithAdditionalFields存储了一个文件或者目录共有的一些东西，如id、名字、访问权限、访问时间等

  /** The inode id. */
  final private long id;
  /**
   *  The inode name is in java UTF8 encoding; 
   *  The name in HdfsFileStatus should keep the same encoding as this.
   *  if this encoding is changed, implicitly getFileInfo and listStatus in
   *  clientProtocol are changed; The decoding at the client
   *  side should change accordingly.
   */
  private byte[] name = null;
  /** 
   * Permission encoded using {@link PermissionStatusFormat}.
   * Codes other than {@link #clonePermissionStatus(INodeWithAdditionalFields)}
   * and {@link #updatePermissionStatus(PermissionStatusFormat, long)}
   * should not modify it.
   */
  private long permission = 0L;
  /** The last modification time*/
  private long modificationTime = 0L;
  /** The last access time*/
  private long accessTime = 0L;

INodeFile

在INodeFile类的内部，有一个非常重要的变量blocks，以一个数组的的形式保存着这个文件对应的块，具体的BlockInfoContiguous的相关信息将在下节来讲述。

private BlockInfoContiguous[] blocks;

INodeDirectory

INodeDirectory重要就是表示的hdfs中的目录树，和其他文件系统类似，一个目录最重要的就是他的子目录，在该类中，用一个INode类型的集合变量children表示。

private List<INode> children = null;

块管理

hdfs中所有块的管理通过FSNamesystem中的BlockManager变量来管理，主要可以提供一些常用的的服务，比如通过blockid查询块，某一个块位于哪个datanode等等

private final BlockManager blockManager;

BlockManager通过BlocksMap来存储信息


private final Namesystem namesystem;

private final DatanodeManager datanodeManager;
  

 /**
   * Mapping: Block -> { BlockCollection, datanodes, self ref }
   * Updated only in response to client-sent information.
 */
final BlocksMap blocksMap;

数据块BlockInfoContiguous

接下来我们说说hdfs中块的最重要的一个类，这个类里最核心的变量是一个Object的数组triplets。


  /**
   * This array contains triplets of references. For each i-th storage, the
   * block belongs to triplets[3*i] is the reference to the
   * {@link DatanodeStorageInfo} and triplets[3*i+1] and triplets[3*i+2] are
   * references to the previous and the next blocks, respectively, in the list
   * of blocks belonging to this storage.
   * 
   * Using previous and next in Object triplets is done instead of a
   * {@link LinkedList} list to efficiently use memory. With LinkedList the cost
   * per replica is 42 bytes (LinkedList#Entry object per replica) versus 16
   * bytes using the triplets.
   */
  private Object[] triplets;

  /**
   * Construct an entry for blocksmap
   * @param replication the block's replication factor
   */
  public BlockInfoContiguous(short replication) {
    this.triplets = new Object[3*replication];
    this.bc = null;
  }

我们通过代码看到，object数组的长度就是3*replication（3乘以block的副本数），

triplets[i]：存储DatanodeStorageInfo对象，Block所在的DataNode;

triplets[i+1]：存储BlockInfoContiguous对象，该DataNode上该block前一个Block；

triplets[i+2]：存储BlockInfoContiguous，该DataNode上该block后一个Block；

这样通过存储块在DataNode上的上一个和下一个块实现了一个双向链表。就能快速的遍历该datanode上的所有的block，其实就是一个linkedlist，但是按照注释中的说法，要比linkedlist节省空间,对于hdfs这些一个巨大的文件系统来说，每个块存储多一点空间，整体上将会是巨大的开销。

集群中所有的块的管理

集群中所有的块的管理通过BlockManager中BlocksMap类型的对象blocksMap来管理

BlocksMap通过GSet存储映射

private GSet<Block, BlockInfoContiguous> blocks;

GSet是自定义的一个set集合,如下代码所示，E必须是K的子类

/**
 * A {@link GSet} is set,
 * which supports the {@link #get(Object)} operation.
 * The {@link #get(Object)} operation uses a key to lookup an element.
 * 
 * Null element is not supported.
 * 
 * @param <K> The type of the keys.
 * @param <E> The type of the elements, which must be a subclass of the keys.
 */
@InterfaceAudience.Private
public interface GSet<K, E extends K> extends Iterable<E>

LightWeightGSet 介绍

保存hdfs的块的管理用的是GSet的子类LightWeightGSet. LightWeightGSet的特点如其注释所示：

/**
 
  一个低内存的GSet的实现类，使用数据存储数据，使用链表解决冲突
 * A low memory footprint {@link GSet} implementation,
 * which uses an array for storing the elements
 * and linked lists for collision resolution.
 *
 没有重新哈希操作
 * No rehash will be performed.
 * Therefore, the internal array will never be resized.
 *
 不支持空元素
 * This class does not support null element.
 *
 非线程安全
 * This class is not thread safe.
 *
 * @param <K> Key type for looking up the elements
 * @param <E> Element type, which must be
 *       (1) a subclass of K, and  K的子类
 *       (2) implementing {@link LinkedElement} interface. 实现LinkedElement接口
 */
@InterfaceAudience.Private
public class LightWeightGSet<K, E extends K> implements GSet<K, E> {

其实它的内部实现有点像hashmap，大概的原理如图所示

这里写图片描述

内部是一个数组，数组的每一个元素是一个链表。看一下里面的一些变量


  //最大长度，2的30次方
  static final int MAX_ARRAY_LENGTH = 1 << 30; //prevent int overflow problem
  static final int MIN_ARRAY_LENGTH = 1;

  /**
   * An internal array of entries, which are the rows of the hash table.
   * The size must be a power of two.
   */
  private final LinkedElement[] entries;//用来存数据的数组
  /** A mask for computing the array index from the hash value of an element. */
  private final int hash_mask;//hash掩码
  /** The size of the set (not the entry array). */
  private int size = 0;//长度
  /** Modification version for fail-fast.
   * @see ConcurrentModificationException
   */
  private int modification = 0;

既然是集合，我们来看其中最重要的put和get方法


  @Override
  public E put(final E element) {
    //validate element
    if (element == null) {
      throw new NullPointerException("Null element is not supported.");
    }
    if (!(element instanceof LinkedElement)) {
      throw new HadoopIllegalArgumentException(
          "!(element instanceof LinkedElement), element.getClass()="
          + element.getClass());
    }
    final LinkedElement e = (LinkedElement)element;

     //查找索引，也就是找到数据要存储的数组的位置
    //find index
    final int index = getIndex(element);

    //remove if it already exists 存在则删除
    final E existing = remove(index, element);

    //insert the element to the head of the linked list
    modification++;
    size++;
    e.setNext(entries[index]);
    entries[index] = e;

    return existing;
  }

put方法在开始做了两个判断，首先不能为空，其次必须是LinkedElement类的子类，然后通过getIndex方法获取元素要存的位置。

  private int getIndex(final K key) {
    return key.hashCode() & hash_mask;
  }

获取位置的方法，用hash_mask和key的hashcode进行了与运算，在构造方法里，我们看到hash_mask的值是数据的长度减去一，hash_mask = entries.length - 1，这样通过与操作，将key的高位全部舍去，这样能保证获取的key的存放位置一定会落在数组的长度之内。

然后通过下面的代码将新插入的元素放到了链表的头部，

e.setNext(entries[index]);
entries[index] = e;

get方法相对简单一些了,先找到数组的索引，然后再遍历链表，找到则返回，没有的话就返回null，代码如下

   @Override
  public E get(final K key) {
    //validate key
    if (key == null) {
      throw new NullPointerException("key == null");
    }

    //find element
    final int index = getIndex(key);
    for(LinkedElement e = entries[index]; e != null; e = e.getNext()) {
      if (e.equals(key)) {
        return convert(e);
      }
    }
    //element not found
    return null;
  }

DatanodeStorageInfo 数据节点存储

在介绍BlockInfoContiguous的时候，我们知道object数组第一项存储的是该块所在的datanode信息，即DatanodeStorageInfo类型的对象。


/**
 * A Datanode has one or more storages. A storage in the Datanode is represented
 * by this class.
 */
public class DatanodeStorageInfo{
    ........................
    
  private final DatanodeDescriptor dn;
  private final String storageID;
  private StorageType storageType; //存储类型，disk，ssd等
  private State state;

  private long capacity; 
  private long dfsUsed;
  private volatile long remaining;
  private long blockPoolUsed;

  private volatile BlockInfoContiguous blockList = null;
  private int numBlocks = 0;

  // The ID of the last full block report which updated this storage.
  private long lastBlockReportId = 0;

  /** The number of block reports received */
  private int blockReportCount = 0;
  
  ..............................
    
}

通过注释，我们了解到一个datanode有不止一个存储，该类所表示的是datanode相关的存储信息。

其中DatanodeDescriptor对象是namenode对datanode的抽象，里面封装了一些datanode基本信息.

这里写图片描述

DatanodeDescriptor相关的类的继承关系如上图，DatanodeID封装了datanode相关的一些基本信息，如ip，host等。

  private String ipAddr;     // IP address
  private String hostName;   // hostname claimed by datanode
  private String peerHostName; // hostname from the actual connection
  private int xferPort;      // data streaming port
  private int infoPort;      // info server port
  private int infoSecurePort; // info server port
  private int ipcPort;       // IPC server port
  private String xferAddr;