首页
学习
活动
专区
圈层
工具
发布
首页
学习
活动
专区
圈层
工具
MCP广场
社区首页 >问答首页 >数据仓库和数据湖更新的最优索引策略

数据仓库和数据湖更新的最优索引策略
EN

Database Administration用户
提问于 2022-06-15 14:38:44
回答 2查看 254关注 0票数 2

我们有一个sql服务器数据库,用作数据池和数据仓库。数据库中的每个表都有一些标准化的定义,因为我们现在有大约600个表,所以维护需要一定程度的自动化。

加载每个表的一般过程是首先将表的副本加载到changeLog模式中的哈希表中(有时只有更改的记录(如果我们能够确定更改的记录是什么),然后将changeLog表与目标表进行比较。目标表用于报告,因此这种changeLog方法允许我们持久化目标表,并且只应用最小的更新/插入操作。

每个目标表都有一个唯一的键/业务密钥,可以通过配置表来标识,并且每个表中的名称都是相同的标准化审计列。审计栏告诉我们

  1. 当记录被添加到数据仓库时
  2. 当数据仓库中的记录最后更新时
  3. 是否已删除源中的记录。
  4. 使用HASHBYTES(‘SHA2 2_256’,CONCAT())更改的记录标识符

更改后的记录标识符以前是CHECKSUM(),但我们发现CHECKSUM()的碰撞率太高,无法信赖。我刚刚在每个表中添加了一个HASHBYTES()列并填充了它。

我将HASHBYTES()列创建为VARBINARY(MAX)。现在,每次加载表时,我们都可以通过将changeLog表中计算的新HASHBYTES()值与目标表中持久化的HASHBYTES()值进行比较,判断是否需要更新记录。

我立即注意到,从INT ()切换到VARBINARY(MAX) HASHBYTES()导致更新检查过程明显减慢。我在每个校验和列上没有索引,但在我刚才添加的HASHBYTES列上没有索引。每个表的唯一键也有聚集索引。

  1. 什么是理想的索引添加,以检查更新?
  2. 是否有一个标准化的索引,我可以添加到每个表?
  3. VARBINARY(最大)是合适的数据类型,还是我可以安全地将它降到更小的大小?

希望这是一个足够的背景,使这个问题有意义。我需要尽快加快这一进程。

编辑:我正在添加一个大型的SQL脚本作为示例,它具有表的changeLog版本和目标版本的示例表定义以及为更新目标版本而运行的查询。

代码语言:javascript
运行
复制
--OBJECT DEFINITIONS
CREATE TABLE [changeLog].[DimSalesOffice](
    [Sales Office Code] [varchar](3) NOT NULL,
    [CCN_Key] [uniqueidentifier] NOT NULL,
    [Sales Office Name (Short)] [nvarchar](14) NOT NULL,
    [Sales Office Name (Long)] [nvarchar](50) NOT NULL,
    [Sales Office City] [nvarchar](50) NOT NULL,
    [Sales Office StateProvince] [nvarchar](50) NOT NULL,
    [Sales Office Postal Code] [nvarchar](20) NOT NULL,
    [Sales Office Country Code] [varchar](3) NOT NULL,
    [Sales Office Address] [nvarchar](65) NOT NULL,
    [Sales Office Address (Line 2)] [nvarchar](65) NOT NULL,
    [Sales Office Name (Short - Native Language)] [nvarchar](14) NOT NULL,
    [Sales Office Name (Long - Native Language)] [nvarchar](50) NOT NULL,
    [Sales Office Address (Native Language)] [nvarchar](65) NOT NULL,
    [Sales Office Address (Line 2 - Native Language)] [nvarchar](65) NOT NULL,
    [Sales Office City (Native Language)] [nvarchar](100) NOT NULL,
    [Sales Office StateProvince (Native Language)] [nvarchar](100) NOT NULL,
    [Sales Office Region] [nvarchar](100) NULL,
    [Native Language Code] [varchar](2) NOT NULL
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[DimSalesOffice](
    [SalesOffice_Key] [uniqueidentifier] NOT NULL,
    [Sales Office Code] [varchar](3) NOT NULL,
    [CCN_Key] [uniqueidentifier] NOT NULL,
    [Sales Office Name (Short)] [nvarchar](14) NOT NULL,
    [Sales Office Name (Long)] [nvarchar](50) NOT NULL,
    [Sales Office City] [nvarchar](50) NOT NULL,
    [Sales Office StateProvince] [nvarchar](50) NOT NULL,
    [Sales Office Postal Code] [nvarchar](20) NOT NULL,
    [Sales Office Country Code] [varchar](3) NOT NULL,
    [Sales Office Address] [nvarchar](65) NOT NULL,
    [Sales Office Address (Line 2)] [nvarchar](65) NOT NULL,
    [Sales Office Name (Short - Native Language)] [nvarchar](14) NOT NULL,
    [Sales Office Name (Long - Native Language)] [nvarchar](50) NOT NULL,
    [Sales Office Address (Native Language)] [nvarchar](65) NOT NULL,
    [Sales Office Address (Line 2 - Native Language)] [nvarchar](65) NOT NULL,
    [Sales Office City (Native Language)] [nvarchar](100) NOT NULL,
    [Sales Office StateProvince (Native Language)] [nvarchar](100) NOT NULL,
    [Sales Office Region] [nvarchar](100) NULL,
    [Native Language Code] [varchar](2) NOT NULL,
    [DW_CreatedOn] [datetime2](7) NULL,
    [DW_ModifiedOn] [datetime2](7) NULL,
    [DW_IsDeleted?] [bit] NULL,
    [DW_Checksum] [int] NULL,
    [Source_ModifiedOn] [datetime2](7) NULL,
    [DW_Hashbytes] [varbinary](max) NULL,
PRIMARY KEY NONCLUSTERED 
(
    [SalesOffice_Key] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

ALTER TABLE [dbo].[DimSalesOffice] ADD  DEFAULT (newsequentialid()) FOR [SalesOffice_Key]
GO

CREATE UNIQUE CLUSTERED INDEX [IX_UK_DimSalesOffice] ON [dbo].[DimSalesOffice]
(
    [Sales Office Code] ASC
)
GO

--MERGE QUERY
DECLARE @InsertRecordCount INT, @UpdateRecordCount INT;

/*****UPDATE*****/
UPDATE [dbo].[DimSalesOffice] SET
    [CCN_Key] = [Source].[CCN_Key],
    [Sales Office Name (Short)] = [Source].[Sales Office Name (Short)],
    [Sales Office Name (Long)] = [Source].[Sales Office Name (Long)],
    [Sales Office City] = [Source].[Sales Office City],
    [Sales Office StateProvince] = [Source].[Sales Office StateProvince],
    [Sales Office Postal Code] = [Source].[Sales Office Postal Code],
    [Sales Office Country Code] = [Source].[Sales Office Country Code],
    [Sales Office Address] = [Source].[Sales Office Address],
    [Sales Office Address (Line 2)] = [Source].[Sales Office Address (Line 2)],
    [Sales Office Name (Short - Native Language)] = [Source].[Sales Office Name (Short - Native Language)],
    [Sales Office Name (Long - Native Language)] = [Source].[Sales Office Name (Long - Native Language)],
    [Sales Office Address (Native Language)] = [Source].[Sales Office Address (Native Language)],
    [Sales Office Address (Line 2 - Native Language)] = [Source].[Sales Office Address (Line 2 - Native Language)],
    [Sales Office City (Native Language)] = [Source].[Sales Office City (Native Language)],
    [Sales Office StateProvince (Native Language)] = [Source].[Sales Office StateProvince (Native Language)],
    [Sales Office Region] = [Source].[Sales Office Region],
    [Native Language Code] = [Source].[Native Language Code],
    [DW_Checksum] =
        CHECKSUM(
            [Source].[CCN_Key],
            [Source].[Sales Office Name (Short)],
            [Source].[Sales Office Name (Long)],
            [Source].[Sales Office City],
            [Source].[Sales Office StateProvince],
            [Source].[Sales Office Postal Code],
            [Source].[Sales Office Country Code],
            [Source].[Sales Office Address],
            [Source].[Sales Office Address (Line 2)],
            [Source].[Sales Office Name (Short - Native Language)],
            [Source].[Sales Office Name (Long - Native Language)],
            [Source].[Sales Office Address (Native Language)],
            [Source].[Sales Office Address (Line 2 - Native Language)],
            [Source].[Sales Office City (Native Language)],
            [Source].[Sales Office StateProvince (Native Language)],
            [Source].[Sales Office Region],
            [Source].[Native Language Code],
            0
        ),
    [DW_Hashbytes] = 
        HASHBYTES(
            'SHA2_256',
            ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
            + '0'
        ),
    [Source_ModifiedOn] = NULL,
    [DW_ModifiedOn] = GETUTCDATE(),
    [DW_IsDeleted?] = 0
FROM [changeLog].[DimSalesOffice] [Source]
JOIN [dbo].[DimSalesOffice]
    ON [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code]
    AND ISNULL([DimSalesOffice].[DW_Hashbytes], HASHBYTES('SHA2_256', '')) <> HASHBYTES(
            'SHA2_256',
            ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
            + '0'
        )
SET @UpdateRecordCount = @@ROWCOUNT;

/*****Soft Deletes*****/
UPDATE [dbo].[DimSalesOffice] SET
    [DW_Checksum] = 0,
    [DW_Hashbytes] = 
        HASHBYTES(
            'SHA2_256',
            ISNULL(CAST([DimSalesOffice].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
            + '1'
        ),
    [Source_ModifiedOn] = NULL,
    [DW_ModifiedOn] = GETUTCDATE(),
    [DW_IsDeleted?] = 1
FROM [dbo].[DimSalesOffice]
WHERE NOT EXISTS
    (
        SELECT 1
        FROM [changeLog].[DimSalesOffice] [Source]
        WHERE [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code]
    )
SET @UpdateRecordCount = @UpdateRecordCount + @@ROWCOUNT;

/*****INSERT*****/
INSERT INTO [dbo].[DimSalesOffice]
    (
        [Sales Office Code],
        [CCN_Key],
        [Sales Office Name (Short)],
        [Sales Office Name (Long)],
        [Sales Office City],
        [Sales Office StateProvince],
        [Sales Office Postal Code],
        [Sales Office Country Code],
        [Sales Office Address],
        [Sales Office Address (Line 2)],
        [Sales Office Name (Short - Native Language)],
        [Sales Office Name (Long - Native Language)],
        [Sales Office Address (Native Language)],
        [Sales Office Address (Line 2 - Native Language)],
        [Sales Office City (Native Language)],
        [Sales Office StateProvince (Native Language)],
        [Sales Office Region],
        [Native Language Code],
        [DW_Checksum],
        [DW_Hashbytes],
        [Source_ModifiedOn],
        [DW_ModifiedOn],
        [DW_IsDeleted?],
        [DW_CreatedOn]
    )
SELECT
    [Source].[Sales Office Code],
    [Source].[CCN_Key],
    [Source].[Sales Office Name (Short)],
    [Source].[Sales Office Name (Long)],
    [Source].[Sales Office City],
    [Source].[Sales Office StateProvince],
    [Source].[Sales Office Postal Code],
    [Source].[Sales Office Country Code],
    [Source].[Sales Office Address],
    [Source].[Sales Office Address (Line 2)],
    [Source].[Sales Office Name (Short - Native Language)],
    [Source].[Sales Office Name (Long - Native Language)],
    [Source].[Sales Office Address (Native Language)],
    [Source].[Sales Office Address (Line 2 - Native Language)],
    [Source].[Sales Office City (Native Language)],
    [Source].[Sales Office StateProvince (Native Language)],
    [Source].[Sales Office Region],
    [Source].[Native Language Code],
    [DW_Checksum] = 
        CHECKSUM(
            [Source].[CCN_Key],
            [Source].[Sales Office Name (Short)],
            [Source].[Sales Office Name (Long)],
            [Source].[Sales Office City],
            [Source].[Sales Office StateProvince],
            [Source].[Sales Office Postal Code],
            [Source].[Sales Office Country Code],
            [Source].[Sales Office Address],
            [Source].[Sales Office Address (Line 2)],
            [Source].[Sales Office Name (Short - Native Language)],
            [Source].[Sales Office Name (Long - Native Language)],
            [Source].[Sales Office Address (Native Language)],
            [Source].[Sales Office Address (Line 2 - Native Language)],
            [Source].[Sales Office City (Native Language)],
            [Source].[Sales Office StateProvince (Native Language)],
            [Source].[Sales Office Region],
            [Source].[Native Language Code],
            0
        ),
    [DW_Hashbytes] = 
        HASHBYTES(
            'SHA2_256',
            ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
            + '0'
        ),
    [Source_ModifiedOn] = NULL,
    [DW_ModifiedOn] = GETUTCDATE(),
    [DW_IsDeleted?] = 0,
    [DW_CreatedOn] = GETUTCDATE()
FROM [changeLog].[DimSalesOffice] [Source]
WHERE NOT EXISTS
    (
        SELECT 1
        FROM [dbo].[DimSalesOffice]
        WHERE [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code]
    )
SET @InsertRecordCount = @@ROWCOUNT;

SELECT [Update Record Count] = @UpdateRecordCount, [Insert Record Count] = @InsertRecordCount;
EN

回答 2

Database Administration用户

回答已采纳

发布于 2022-06-16 12:08:07

相反的回答,因为它有更多的语言意义:

  1. VARBINARY(最大)是合适的数据类型,还是我可以安全地将它降到更小的大小?

正如我在注释中提到的,SHA2_256 means它将输出哈希为256位,也就是32字节(8位1字节)的长度,这意味着您所需要的最大VARBINARY大小是VARBINARY(32)。在HASBYTES()的docs中提到了这一点:

输出符合算法标准: MD2、MD4和MD5为128位(16字节);SHA和SHA1为160位(20字节);SHA2_256为256位(32字节),SHA2_512为512位(64字节)。

因此,是的,您可以安全地将VARBINARY的大小降到VARBINARY(32)

  1. 什么是理想的索引添加,以检查更新?
  2. 是否有一个标准化的索引,我可以添加到每个表?

是的,在缩小VARBINARY字段的大小之后,可以将其添加到表中的索引中。我推荐一个非聚集索引,该索引以主键字段为首,然后在定义中将HASHBYTES()计算字段放在第二位。

之所以使其非聚集,是因为您通常希望避免索引中的热列,特别是与每个非聚集索引一起存储的聚集索引。频繁更新的热列会导致大量写入索引,而在聚集索引的情况下,这些写入也必须发生在每个非聚集索引上(因为聚集索引存储在其中)。所有列上的行散列肯定会频繁更改。

使用主键字段进行引导是有意义的,因为您首先希望通过该字段连接以匹配相同的行,然后通过HASHBYTES()计算字段来检查它们是否不同。

如何改进哈希字节计算?我在考虑建立一个计算列的索引。

是的,您可以使用计算列(甚至不需要持久化,但可以是)并对其进行索引,或者您也可以创建索引视图,这两者我都做过了。我首先为计算列拍摄,因为它比索引视图稍微灵活一些,将表本身的行中的实际行哈希耦合起来,并且是要管理的少一个对象。

票数 1
EN

Database Administration用户

发布于 2022-06-16 01:00:02

HASHBYTES的输出长度取决于所使用的算法。SHA2_256产生256个位或32个字节。这在文件里。将列声明为binary(32)是可以的。我目前工作的系统就是这样做的。

票数 1
EN
页面原文内容由Database Administration提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://dba.stackexchange.com/questions/313401

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档