FID
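: In a Lustre file system, every object (a file, or each stripe object of a striped file) has a unique FID. Lustre uses the FID to identify each object uniquely.
// Definition of the FID in Lustre
struct lu_fid {
// sequence number of the FID
__u64 f_seq;
// object ID within the sequence
__u32 f_oid;
// version number
__u32 f_ver;
} __attribute__((packed));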
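FIDs are usually printed in the form [seq:oid:ver], for example [0x200000401:0x1:0x0]. The following is a minimal user-space sketch of parsing and formatting that notation. The struct fid type and the helpers fid_parse and fid_format are hypothetical names made up for this illustration, not Lustre APIs.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* User-space mirror of struct lu_fid (illustrative, fixed-width types). */
struct fid {
	uint64_t f_seq;	/* sequence number */
	uint32_t f_oid;	/* object ID within the sequence */
	uint32_t f_ver;	/* version */
};

/* Parse "[0xSEQ:0xOID:0xVER]". Returns 0 on success, -1 on malformed input. */
static int fid_parse(const char *s, struct fid *fid)
{
	int n = 0;

	/* %n is only stored if the closing ']' matched. */
	if (sscanf(s, "[%" SCNx64 ":%" SCNx32 ":%" SCNx32 "]%n",
		   &fid->f_seq, &fid->f_oid, &fid->f_ver, &n) != 3)
		return -1;
	return (n > 0 && s[n] == '\0') ? 0 : -1;
}

/* Format a FID back into the bracketed notation. Returns 0 or -1. */
static int fid_format(const struct fid *fid, char *buf, size_t len)
{
	int n = snprintf(buf, len, "[0x%" PRIx64 ":0x%" PRIx32 ":0x%" PRIx32 "]",
			 fid->f_seq, fid->f_oid, fid->f_ver);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

Calling fid_parse on the lmm_fid value printed by lfs getstripe yields the three fields separately, and fid_format turns them back into the same string.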
Object Index(OI)
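: The OI maps a global Lustre FID (every file, and every object produced by striping a file, has a unique FID) to the object's unique identifier in the backend storage. On an ldiskfs backend, a FID maps to the pair <inode number, generation>.
If the OI mapping on an OST is missing, the file's stripe data becomes invisible, because the MDS takes the FID to the corresponding OST and cannot find the stripe object there. If the OI table on an OST is corrupted, applications may be handed the wrong object's data, which leads to unpredictable behavior in the application.
FID-In-Dirent (directory entry)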
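: In the Lustre 2.x architecture, a FID is allocated when a file is created and is stored as part of the file's name entry in the parent directory. This is FID-In-Dirent. To speed up directory access, the object's FID is appended after the object's name in each directory entry, so a readdir on the MDT gets the FID straight from the directory page and avoids looking up the object's LMA (Lustre metadata attributes). This is why du and ls run fast. If FID-In-Dirent is missing, an extra lookup is needed to obtain the object's FID. If FID-In-Dirent is corrupted, unexpected data may be accessed.
linkEA
: The link extended attribute. When a file is created or hard-linked, the file's name and the parent directory's FID are recorded in this extended attribute (link extended attribute), which is stored on the MDT.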
LMA
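: Lustre metadata attributes. The LMA records Lustre-specific attributes such as the HSM state and the object's own FID. It is stored as an extended attribute on the objects that make up a striped file.
// Get the stripe information of the data file: database.dat has objid=2 on the OST with index 0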
[root@CentOS-Lustre-Client ~]$ lfs getstripe -v /mnt/lustre/database.dat
/mnt/lustre/database.dat
lmm_magic: 0x0BD10BD0
lmm_seq: 0x200000401
lmm_object_id: 0x1
lmm_fid: [0x200000401:0x1:0x0]
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 0
obdidx objid objid group
0 2 0x2 0
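The getstripe output reports a raid0 pattern with lmm_stripe_size and lmm_stripe_count. As a rough sketch of what those two values mean, the code below applies the standard RAID0 arithmetic to find which stripe holds a given file offset and where inside that stripe object the byte lives. The struct stripe_loc type and raid0_locate function are hypothetical names for this illustration, and the sketch ignores composite (PFL) layouts.

```c
#include <assert.h>
#include <stdint.h>

/* Result of mapping a file offset onto a RAID0 layout (illustrative type). */
struct stripe_loc {
	uint32_t stripe_index;	/* index into the file's stripe list */
	uint64_t object_offset;	/* byte offset inside that stripe object */
};

/* Standard RAID0 mapping: chunks of stripe_size are dealt round-robin. */
static struct stripe_loc raid0_locate(uint64_t file_offset,
				      uint32_t stripe_size,
				      uint32_t stripe_count)
{
	uint64_t chunk = file_offset / stripe_size;	/* chunk number in the file */
	struct stripe_loc loc;

	loc.stripe_index = (uint32_t)(chunk % stripe_count);
	loc.object_offset = (chunk / stripe_count) * stripe_size +
			    file_offset % stripe_size;
	return loc;
}
```

With stripe_count = 1, as for database.dat, every offset lands in stripe 0 and the object offset equals the file offset, which matches the OST object having the same size as the file.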
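// Dump all extended attributes of the file: link stores the parent directory FID and the file name; lma stores the object's own FID;
// lov records the object layout (which OSTs and which objects hold the data)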
[root@CentOS-Lustre-Client ~]$ getfattr -d -m ".*" /mnt/lustre/database.dat
getfattr: Removing leading '/' from absolute path names
# file: mnt/lustre/database.dat
lustre.lov=0s0AvRCwEAAAABAAAAAAAAAAEEAAACAAAAAAAQAAEAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
trusted.link=0s3/HqEQEAAAA2AAAAAAAAAAAAAAAAAAAAAB4AAAACAAAABwAAAAEAAAAAZGF0YWJhc2UuZGF0
trusted.lma=0sAAAAAAAAAAABBAAAAgAAAAEAAAAAAAAA
trusted.lov=0s0AvRCwEAAAABAAAAAAAAAAEEAAACAAAAAAAQAAEAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
trusted.som=0sBAAAAAAAAAAAACADAAAAAEBnAQAAAAAA
// Inspect this file's stripe object on the OST with index 0 (the obdidx reported by lfs getstripe)
[root@CentOS-Lustre-OSS-1 /mnt/ost/O/0]$ debugfs -c -R "stat /O/0/d$((2 % 32))/2" /dev/sdb
debugfs 1.46.2.wc3 (18-Jun-2021)
/dev/sdb: catastrophic mode - not reading inode or group bitmaps
Inode: 231 Type: regular Mode: 0666 Flags: 0x80000
Generation: 1620110536 Version: 0x00000001:00000010
User: 0 Group: 0 Project: 0 Size: 52428800
File ACL: 0
Links: 1 Blockcount: 102400
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x619e3c69:00000000 -- Wed Nov 24 08:21:45 2021
atime: 0x00000000:00000000 -- Wed Dec 31 19:00:00 1969
mtime: 0x619e3c69:00000000 -- Wed Nov 24 08:21:45 2021
crtime: 0x619e30ee:8508d540 -- Wed Nov 24 07:32:46 2021
Size of extra inode fields: 32
Extended attributes:
lma: fid=[0x100000000:0x2:0x0] compat=8 incompat=0
fid: parent=[0x200000401:0x1:0x0] stripe=0 stripe_size=1048576 stripe_count=1 layout_version=0 range=0
EXTENTS:
(0-6143):102400-108543, (6144-8191):109568-111615, (8192-12799):112640-117247
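The debugfs command above finds object 2 under /O/0/d$((2 % 32))/2, that is, objects of group 0 are spread over subdirectories named d<objid % 32>. A small sketch of that path computation follows. The helper name ost_object_path_group0 and the constant OST_SUBDIR_COUNT are made up for this illustration, and the sketch only covers the group 0 layout seen in this example, since other sequences may be named differently.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Taken from the "2 % 32" in the debugfs command of this example. */
#define OST_SUBDIR_COUNT 32

/*
 * Build the ldiskfs path of an OST object in group 0.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int ost_object_path_group0(uint64_t objid, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/O/0/d%" PRIu64 "/%" PRIu64,
			 objid % OST_SUBDIR_COUNT, objid);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

For objid 2 this produces /O/0/d2/2, the path passed to debugfs stat above.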
// Object 2 on the OST is one stripe of the file database.dat
[root@CentOS-Lustre-OSS-1 /mnt/ost/O/0/d2]$ stat 2
File: 2
Size: 52428800 Blocks: 102400 IO Block: 4096 regular file
Device: 810h/2064d Inode: 231 Links: 1
Access: (0666/-rw-rw-rw-) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 1969-12-31 19:00:00.000000000 -0500
Modify: 2021-11-24 08:21:45.000000000 -0500
Change: 2021-11-24 08:21:45.000000000 -0500
Birth: -
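// With the backend OST mounted, enter its mount directory and dump the attributes of this stripe object: it stores the FID of its parent MDT object (trusted.fid) and the lma extended attribute, which holds the object's own FID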
[root@CentOS-Lustre-OSS-1 /mnt/ost/O/0/d2]$ getfattr -d -m ".*" 2
# file: 2
trusted.fid=0sAQQAAAIAAAABAAAAAAAAAAAAEAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
trusted.lma=0sCAAAAAAAAAAAAAAAAQAAAAIAAAAAAAAA
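The trusted.lma value printed by getfattr is base64 (the 0s prefix) over the binary LMA. As a sketch of how it relates to the lma line printed by debugfs, the code below decodes the value and reads the first 24 bytes as compat, incompat and the self FID. It assumes the fields are stored little-endian in the order of struct lustre_mdt_attrs. The names struct lma, b64_decode and lma_decode are hypothetical helpers for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decoded view of the first 24 bytes of trusted.lma (illustrative type). */
struct lma {
	uint32_t compat;
	uint32_t incompat;
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

/* Map one base64 character to its 6-bit value, or -1 if invalid. */
static int b64_val(char c)
{
	if (c >= 'A' && c <= 'Z') return c - 'A';
	if (c >= 'a' && c <= 'z') return c - 'a' + 26;
	if (c >= '0' && c <= '9') return c - '0' + 52;
	if (c == '+') return 62;
	if (c == '/') return 63;
	return -1;
}

/* Decode base64 text up to '=' or end of string; returns byte count or -1. */
static int b64_decode(const char *in, uint8_t *out, size_t cap)
{
	uint32_t acc = 0;
	int bits = 0;
	size_t n = 0;

	for (; *in && *in != '='; in++) {
		int v = b64_val(*in);

		if (v < 0)
			return -1;
		acc = (acc << 6) | (uint32_t)v;
		bits += 6;
		if (bits >= 8) {
			bits -= 8;
			if (n == cap)
				return -1;
			out[n++] = (uint8_t)(acc >> bits);
		}
	}
	return (int)n;
}

/* Little-endian readers (assumption: the on-disk LMA is little-endian). */
static uint32_t le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t le64(const uint8_t *p)
{
	return (uint64_t)le32(p) | (uint64_t)le32(p + 4) << 32;
}

/* Decode a getfattr value such as "0sCAAA..." into struct lma. */
static int lma_decode(const char *value, struct lma *lma)
{
	uint8_t raw[64];
	int n;

	if (strncmp(value, "0s", 2) == 0)	/* getfattr marks base64 with "0s" */
		value += 2;
	n = b64_decode(value, raw, sizeof(raw));
	if (n < 24)
		return -1;
	lma->compat = le32(raw);
	lma->incompat = le32(raw + 4);
	lma->seq = le64(raw + 8);
	lma->oid = le32(raw + 16);
	lma->ver = le32(raw + 20);
	return 0;
}
```

Under these assumptions, decoding the OST object's trusted.lma gives compat=8 and FID [0x100000000:0x2:0x0], the same values debugfs printed, while the file's trusted.lma on the client side decodes to [0x200000401:0x1:0x0], the lmm_fid from lfs getstripe.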
FID-In-LMA
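: Every Lustre object has a unique FID, which is stored in an extended attribute (XATTR_NAME_LMA). The FID is used to check whether the object exists; if it does not, the error -EREMCHG is returned. FID-In-LMA can also be used to rebuild Lustre's OI table.
linkEA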
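: The MDT, the backend storage of the MDS, stores each object's location information, which consists of the name and the parent FID. It is kept in the inode as the XATTR_NAME_LINK extended attribute. Given any FID, the full path of the target file can be resolved from the root directory. linkEA can be used to rebuild the entire Lustre namespace.
parent FID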
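: Every object on an OST stores the FID of its parent MDT object. It is stored on the OST object as the XATTR_NAME_FID extended attribute. The parent FID of the OST objects is used to rebuild the LOV extended attribute of the MDT object.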
// The trusted.fid value here is the XATTR_NAME_FID attribute, which holds the parent MDT object information
[root@CentOS-Lustre-OSS-1 /mnt/ost/O/0/d2]$ getfattr -d -m ".*" 2
# file: 2
trusted.fid=0sAQQAAAIAAAABAAAAAAAAAAAAEAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
trusted.lma=0sCAAAAAAAAAAAAAAAAQAAAAIAAAAAAAAA
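OI Scrub component
: OI Scrub maintains the consistency of the ldiskfs OI mapping and rebuilds the OI mapping when an OST is damaged. It can repair a lost OI table on an OST, release the disk space held by objects with invalid FIDs, and repair the FIDs on the OSTs.
layout LFSCK component
: Layout LFSCK verifies that the layout of every object is consistent between the MDT and the OSTs. When it finds an inconsistency, it automatically triggers a repair of the inconsistent data. An MDT object references its stripe objects (the objects on the OSTs) through the LOV EA, and every OST object references its MDT object through the parent FID. On the MDT, LFSCK checks the LOV extended attribute of every MDT object. On the OSTs, LFSCK records all OST objects that were not verified, that is, the orphan OST objects.
namespace LFSCK component
: Namespace LFSCK deals with the namespace of the whole Lustre file system and works across one or more MDTs. It repairs namespace consistency globally or within a single MDS. It traverses the namespace inside the MDT, checks the linkEA of every MDT object, and makes sure each name entry refers to the correct MDT object.
// In this example FSNAME is bigfs and MDTNAME=MDT0000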
lctl get_param -n mdd.$FSNAME-$MDTNAME.lfsck_layout
lctl get_param -n mdd.$FSNAME-$MDTNAME.lfsck_namespace
lctl get_param -n obdfilter.$FSNAME-$OSTNAME.lfsck_layout
lctl get_param -n osd-ldiskfs.$FSNAME-$TARGETNAME.oi_scrub
// LFSCK logs are recorded in the Lustre debug log; dump them with the following command
[root@CentOS-Lustre-OSS-1 ~]$ lctl debug_kernel /tmp/debug.lfsck
// Enable LFSCK messages in the printk mask
[root@CentOS-Lustre-OSS-1 ~]$ lctl set_param printk=+lfsck
printk=+lfsck
// After starting an LFSCK task, check its status with the following command
[root@CentOS-Lustre-OSS-1 ~]$ lctl get_param -n osd-ldiskfs.bigfs-OST0000.oi_scrub
// Usage: parameter paths
/proc/fs/lustre/osd-ldiskfs/${FSNAME}-MDTxxxx/oi_scrub
/proc/fs/lustre/osd-ldiskfs/${FSNAME}-OSTxxxx/oi_scrub
// List the current OSTs from the client
[root@CentOS-Lustre-Client ~]$ lfs osts
OBDS:
0: bigfs-OST0000_UUID ACTIVE
1: bigfs-OST0001_UUID ACTIVE
// Run a scrub on the OST with index 0
[root@CentOS-Lustre-OSS-1 /proc/fs/lustre]$ lctl lfsck_start -t scrub -M bigfs-OST0000
Started LFSCK on the device bigfs-OST0000: scrub
// Stop the LFSCK task
[root@CentOS-Lustre-OSS-1 ~]$ lctl lfsck_stop -M bigfs-OST0000
Stopped LFSCK on the device bigfs-OST0000.
// Example
[root@CentOS-Lustre-OSS-1 /proc/fs/lustre]$ cat /proc/fs/lustre/osd-ldiskfs/bigfs-OST0000/oi_scrub
// Related LFSCK parameter paths
/proc/fs/lustre/mdd/${FSNAME}-MDTxxxx/lfsck_layout
/proc/fs/lustre/obdfilter/${FSNAME}-OSTxxxx/lfsck_layout
/proc/fs/lustre/mdd/${FSNAME}-MDTxxxx/lfsck_namespace
// Some of the core extended attributes defined in Lustre
#define XATTR_NAME_LOV "trusted.lov"
#define XATTR_NAME_LMA "trusted.lma"
#define XATTR_NAME_LMV "trusted.lmv"
#define XATTR_NAME_DEFAULT_LMV "trusted.dmv"
#define XATTR_NAME_LINK "trusted.link"
#define XATTR_NAME_FID "trusted.fid"
#define XATTR_NAME_VERSION "trusted.version"
#define XATTR_NAME_SOM "trusted.som"
#define XATTR_NAME_HSM "trusted.hsm"
#define XATTR_NAME_LFSCK_BITMAP "trusted.lfsck_bitmap"
#define XATTR_NAME_DUMMY "trusted.dummy"
/*************** Definition of the XATTR_NAME_LMV attribute value *********************/
/* LMV layout EA, and it will be stored both in master and slave object */
struct lmv_mds_md_v1 {
__u32 lmv_magic;
// number of stripes of the current object
__u32 lmv_stripe_count;
__u32 lmv_master_mdt_index; /* On master object, it is master
* MDT index, on slave object, it
* is stripe index of the slave obj */
__u32 lmv_hash_type; /* dir stripe policy, i.e. indicate
* which hash function to be used,
* Note: only lower 16 bits is being
* used for now. Higher 16 bits will
* be used to mark the object status,
* for example migrating or dead. */
__u32 lmv_layout_version; /* increased each time layout changed,
* by directory migration, restripe
* and LFSCK. */
__u32 lmv_migrate_offset; /* once this is set, it means this
* directory is been migrated, stripes
* before this offset belong to target,
* from this to source. */
__u32 lmv_migrate_hash; /* hash type of source stripes of
* migrating directory */
__u32 lmv_padding2;
__u64 lmv_padding3;
// pool name
char lmv_pool_name[LOV_MAXPOOLNAME + 1];
// FID of each stripe object
struct lu_fid lmv_stripe_fids[0];
};
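The comment on lmv_hash_type says that only the lower 16 bits select the hash function and that the higher 16 bits are reserved for object status such as migrating or dead. A minimal sketch of splitting the field along that boundary follows. The macro and function names are hypothetical and chosen for this illustration; they are not the identifiers used by Lustre.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical masks following the comment on lmv_hash_type. */
#define HASH_TYPE_MASK  0x0000ffffU	/* low 16 bits: hash function */
#define HASH_FLAGS_MASK 0xffff0000U	/* high 16 bits: status flags */

/* Which hash function the directory stripes use. */
static uint32_t lmv_hash_function(uint32_t hash_type)
{
	return hash_type & HASH_TYPE_MASK;
}

/* Status bits (for example migrating or dead), left in their position. */
static uint32_t lmv_hash_flags(uint32_t hash_type)
{
	return hash_type & HASH_FLAGS_MASK;
}
```

Keeping the two halves separate means a status bit can be set or cleared without disturbing the hash function recorded in the low half.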
/******************* Definition of the XATTR_NAME_LINK attribute value ***********/
struct linkea_data {
/**
* Buffer to keep link EA body.
*/
struct lu_buf *ld_buf;
/**
* The matched header, entry and its length in the EA
*/
struct link_ea_header *ld_leh;
struct link_ea_entry *ld_lee;
int ld_reclen;
};
// FID of the object's parent directory (pfid) and the object's name (cname)
sname = lod_name_get(env, stripe_name, strlen(stripe_name));
int linkea_links_new(struct linkea_data *ldata, struct lu_buf *buf,
const struct lu_name *cname, const struct lu_fid *pfid)
{
int rc;
rc = linkea_data_new(ldata, buf);
if (!rc)
rc = linkea_add_buf(ldata, cname, pfid, false);
return rc;
}
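Because each linkEA entry records a (parent FID, name) pair, the full path of an object can be found by following parent FIDs until the root is reached. The sketch below shows that walk over a simplified in-memory table. The types struct fid and struct link_rec, the function fid_to_path and the idea of passing the root FID as a parameter are assumptions for this illustration; it handles one link per object and is not how Lustre implements fid2path.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct fid { uint64_t seq; uint32_t oid; uint32_t ver; };

/* One simplified linkEA record: object FID -> (parent FID, name in parent). */
struct link_rec {
	struct fid self;
	struct fid parent;
	const char *name;
};

static int fid_eq(const struct fid *a, const struct fid *b)
{
	return a->seq == b->seq && a->oid == b->oid && a->ver == b->ver;
}

static const struct link_rec *find_rec(const struct link_rec *t, size_t n,
				       const struct fid *f)
{
	for (size_t i = 0; i < n; i++)
		if (fid_eq(&t[i].self, f))
			return &t[i];
	return NULL;
}

/* Build "/a/b/c" for target by following parent FIDs up to root. */
static int fid_to_path(const struct link_rec *t, size_t n,
		       const struct fid *root, const struct fid *target,
		       char *buf, size_t len)
{
	char tmp[512];
	struct fid cur = *target;
	size_t depth = 0;

	if (len < 2)
		return -1;
	buf[0] = '\0';
	while (!fid_eq(&cur, root)) {
		const struct link_rec *r = find_rec(t, n, &cur);
		int w;

		if (r == NULL || depth++ > n)	/* missing linkEA or a cycle */
			return -1;
		/* Prepend "/name" to the path collected so far. */
		w = snprintf(tmp, sizeof(tmp), "/%s%s", r->name, buf);
		if (w < 0 || (size_t)w >= sizeof(tmp) || (size_t)w >= len)
			return -1;
		memcpy(buf, tmp, (size_t)w + 1);
		cur = r->parent;
	}
	if (buf[0] == '\0')
		strcpy(buf, "/");	/* the target is the root itself */
	return 0;
}
```

A missing record makes the walk fail, which mirrors why a lost linkEA prevents resolving a FID back to its path.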
/******************** Definition of the XATTR_NAME_LMA attribute value ***************************/
// Data type stored in the lma attribute of every object on the OST
struct lustre_mdt_attrs {
__u32 lma_compat;
__u32 lma_incompat;
struct lu_fid lma_self_fid;
};
// Attributes of each data object on the OST; this is the information shown by the getstripe command
struct lustre_ost_attrs {
// LMA of the current object
struct lustre_mdt_attrs loa_lma;
// FID of the parent MDT object
struct lu_fid loa_parent_fid;
__u32 loa_stripe_size;
__u32 loa_comp_id;
__u64 loa_comp_start;
__u64 loa_comp_end;
};
/******************************* Definition of the XATTR_NAME_LOV attribute value *********************/
// Value of the XATTR_NAME_LOV attribute describing the object on each OST; this information is stored on the MDT
#define lov_user_ost_data lov_user_ost_data_v1
struct lov_user_ost_data_v1 { /* per-stripe data structure */
struct ost_id l_ost_oi; /* OST object ID */
__u32 l_ost_gen; /* generation of this OST index */
__u32 l_ost_idx; /* OST index in LOV */
} __attribute__((packed));
#define lov_user_md lov_user_md_v1
struct lov_user_md_v1 { /* LOV EA user data (host-endian) */
__u32 lmm_magic; /* magic number = LOV_USER_MAGIC_V1 */
__u32 lmm_pattern; /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */
struct ost_id lmm_oi; /* MDT parent inode id/seq (id/0 for 1.x) */
__u32 lmm_stripe_size; /* size of stripe in bytes */
__u16 lmm_stripe_count; /* num stripes in use for this object */
union {
__u16 lmm_stripe_offset; /* starting stripe offset in
* lmm_objects, use when writing */
__u16 lmm_layout_gen; /* layout generation number
* used when reading */
};
struct lov_user_ost_data_v1 lmm_objects[0]; /* per-stripe data */
} __attribute__((packed, __may_alias__));
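The LOV EA lists one entry per stripe, and each OST object points back to its MDT object through the parent FID. The layout check described earlier comes down to confirming that both directions agree. The sketch below shows that cross-check on simplified in-memory records. All type and function names (struct lov_entry, struct ost_object, check_stripe, the layout_state values) are hypothetical and only illustrate the idea; they do not reproduce the LFSCK implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fid { uint64_t seq; uint32_t oid; uint32_t ver; };

/* One per-stripe entry of the LOV EA on the MDT object (simplified). */
struct lov_entry {
	uint32_t ost_idx;	/* OST index, like l_ost_idx */
	uint64_t objid;		/* object ID on that OST */
};

/* Simplified OST object with the back-reference kept in trusted.fid. */
struct ost_object {
	uint32_t ost_idx;
	uint64_t objid;
	struct fid parent;	/* FID of the owning MDT object */
	uint32_t stripe;	/* stripe index within the parent */
};

enum layout_state { LAYOUT_OK, LAYOUT_DANGLING, LAYOUT_MISMATCH };

/*
 * Check stripe number `stripe` of the MDT object `mdt_fid`:
 * - LAYOUT_DANGLING: the LOV EA references an object that does not exist
 * - LAYOUT_MISMATCH: the object exists but points back to someone else
 */
static enum layout_state check_stripe(const struct fid *mdt_fid,
				      const struct lov_entry *e, uint32_t stripe,
				      const struct ost_object *objs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		const struct ost_object *o = &objs[i];

		if (o->ost_idx != e->ost_idx || o->objid != e->objid)
			continue;
		if (o->parent.seq == mdt_fid->seq &&
		    o->parent.oid == mdt_fid->oid && o->stripe == stripe)
			return LAYOUT_OK;
		return LAYOUT_MISMATCH;
	}
	return LAYOUT_DANGLING;
}
```

With the values from this example (MDT object [0x200000401:0x1:0x0], stripe 0 on OST 0 with objid 2, and an OST object whose parent is the same FID with stripe=0), the check reports a consistent layout.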
/************************** Definition of the XATTR_NAME_FID attribute value ********************/
struct lu_fid {
/**
* FID sequence. Sequence is a unit of migration: all files (objects)
* with FIDs from a given sequence are stored on the same server.
* Lustre should support 2^64 objects, so even if each sequence
* has only a single object we can still enumerate 2^64 objects.
**/
__u64 f_seq;
/* FID number within sequence. */
__u32 f_oid;
/**
* FID version, used to distinguish different versions (in the sense
* of snapshots, etc.) of the same file system object. Not currently
* used.
**/
__u32 f_ver;
} __attribute__((packed));
// In-memory object of the LOD layer, covering both files and directories
struct lod_object {
/* common fields for both files and directories */
struct dt_object ldo_obj;
struct mutex ldo_layout_mutex;
union {
/* file stripe (LOV) */
struct {
__u32 ldo_layout_gen;
/* Layout component count for a regular file.
* It equals to 1 for non-composite layout. */
__u16 ldo_comp_cnt;
/* Layout mirror count for a PFLR file.
* It's 0 for files with non-composite layout. */
__u16 ldo_mirror_count;
struct lod_mirror_entry *ldo_mirrors;
__u32 ldo_is_composite:1,
ldo_flr_state:2,
ldo_comp_cached:1,
ldo_is_foreign:1;
};
/* directory stripe (LMV) */
struct {
/* Slave stripe count for striped directory. */
__u16 ldo_dir_stripe_count;
/* How many stripes allocated for a striped directory */
__u16 ldo_dir_stripes_allocated;
__u32 ldo_dir_stripe_offset;
__u32 ldo_dir_hash_type;
__u32 ldo_dir_migrate_offset;
__u32 ldo_dir_migrate_hash;
__u32 ldo_dir_layout_version;
/* Is a slave stripe of striped directory? */
__u32 ldo_dir_slave_stripe:1,
ldo_dir_striped:1,
/* the stripe has been loaded */
ldo_dir_stripe_loaded:1,
/* foreign directory */
ldo_dir_is_foreign;
/*
* This default LMV is parent default LMV, which will be
* used in child creation, and it's not cached, so this
* field is invalid after create, make sure it's used by
* lod_dir_striping_create_internal() only.
*/
struct lod_default_striping *ldo_def_striping;
};
};
union {
struct {
/* foreign/raw format LOV */
char *ldo_foreign_lov;
size_t ldo_foreign_lov_size;
};
struct {
/* foreign/raw format LMV */
char *ldo_foreign_lmv;
size_t ldo_foreign_lmv_size;
};
struct {
/* file stripe (LOV) */
struct lod_layout_component *ldo_comp_entries;
/* slave stripes of striped directory (LMV) */
struct dt_object **ldo_stripe;
};
};
};