I have data ingestion pipelines and a data warehouse holding roughly 5-10 TB of data in Azure ADLS Gen2, in CSV and Delta formats. The storage account is Performance/Tier = Standard/Hot, Replication = GRS, Type = StorageV2.
What is the best way to back up ADLS Gen2 data?
Considerations:
I mounted data_container to archive_container and tried copying the data with Databricks' dbutils.fs.cp, but it ran even more slowly than a plain Azure-side copy: about 3 GB in 10 minutes on a large 10-node, 30 DBU cluster. Why?
Posted on 2022-02-16 04:57:27
For raw data/folder backup, I use the Microsoft Azure Storage Data Movement library to copy blob directories from ADLS Gen2 to another storage account.
For this, I created a daily timer-triggered Azure Function that performs an incremental copy of the blob directories. You can configure something similar:
every Monday a new folder, named by date, receives a full backup, and incremental changes are saved into it through Sunday; backup folders older than one month are deleted. A sketch of such a timer trigger is shown below.
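As a rough illustration, here is a minimal sketch of such a daily timer trigger, assuming an in-process Azure Functions host (Microsoft.Azure.WebJobs); the function name, the 01:00 UTC schedule, and the Monday-based folder naming are my assumptions, not the original configuration:

using System;
using System.Globalization;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class DailyBackupFunction
{
    // NCRONTAB schedule (sec min hour day month day-of-week): every day at 01:00 UTC.
    [FunctionName("DailyBackup")]
    public static async Task Run(
        [TimerTrigger("0 0 1 * * *")] TimerInfo timer,
        ILogger log)
    {
        DateTime today = DateTime.UtcNow.Date;

        // Name the weekly backup folder after the Monday that starts the week:
        // Monday's run creates a fresh full backup, and the Tuesday-Sunday runs
        // copy incremental changes into the same folder.
        int daysSinceMonday = ((int)today.DayOfWeek + 6) % 7;
        string backupFolder = today.AddDays(-daysSinceMonday)
                                   .ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);

        log.LogInformation("Backing up into weekly folder {Folder}", backupFolder);

        // Hypothetical call into the copy method shown below; deleting folders
        // older than one month would be a further step here.
        // await backupService.CopyBlobDirectoryAsync(sourceConfig, destConfig, backupFolder);
    }
}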
Here is my implementation.
public async Task<string> CopyBlobDirectoryAsync(BlobConfiguration sourceBlobConfiguration, BlobConfiguration destBlobConfiguration, string blobDirectoryName)
{
    CloudBlobDirectory sourceBlobDir = await GetCloudBlobDirectoryAsync(sourceBlobConfiguration.ConnectionString, sourceBlobConfiguration.ContainerName, blobDirectoryName);
    CloudBlobDirectory destBlobDir = await GetCloudBlobDirectoryAsync(destBlobConfiguration.ConnectionString, destBlobConfiguration.ContainerName, destBlobConfiguration.BlobDirectoryPath + "/" + blobDirectoryName);

    // You can also replace the source directory with a CloudFileDirectory instance to copy data from Azure File Storage. If so:
    // 1. If Recursive is set to true, SearchPattern is not supported. The Data Movement library simply transfers all Azure files
    //    under the source CloudFileDirectory and its sub-directories.
    CopyDirectoryOptions options = new CopyDirectoryOptions()
    {
        Recursive = true
    };

    // Log every transferred, failed, and skipped file via the context callbacks.
    DirectoryTransferContext context = new DirectoryTransferContext();
    context.FileTransferred += FileTransferredCallback;
    context.FileFailed += FileFailedCallback;
    context.FileSkipped += FileSkippedCallback;

    // Create a CancellationTokenSource that can be used to cancel the transfer.
    CancellationTokenSource cancellationSource = new CancellationTokenSource();

    // ServiceSideAsyncCopy keeps the copy inside Azure instead of routing the data through this client.
    TransferStatus transferStatus = await TransferManager.CopyDirectoryAsync(sourceBlobDir, destBlobDir, CopyMethod.ServiceSideAsyncCopy, options, context, cancellationSource.Token);

    return TransferStatusToString(blobDirectoryName, transferStatus);
}

https://stackoverflow.com/questions/64693799
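The snippet references several helpers that are not shown. Below is a plausible sketch of them, assuming the legacy Microsoft.Azure.Storage.Blob client that the Data Movement library (v1.x) builds on; the BackupService class name and the exact implementations are my assumptions:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Blob;
using Microsoft.Azure.Storage.DataMovement;

public partial class BackupService
{
    // Resolves a directory reference inside the given container; the destination
    // container is created on first use.
    private async Task<CloudBlobDirectory> GetCloudBlobDirectoryAsync(
        string connectionString, string containerName, string directoryPath)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobClient client = account.CreateCloudBlobClient();
        CloudBlobContainer container = client.GetContainerReference(containerName);
        await container.CreateIfNotExistsAsync();
        return container.GetDirectoryReference(directoryPath);
    }

    // Progress callbacks raised by the DirectoryTransferContext for each file.
    private void FileTransferredCallback(object sender, TransferEventArgs e)
        => Console.WriteLine($"Transferred: {e.Source} -> {e.Destination}");

    private void FileFailedCallback(object sender, TransferEventArgs e)
        => Console.WriteLine($"Failed: {e.Source}, error: {e.Exception?.Message}");

    private void FileSkippedCallback(object sender, TransferEventArgs e)
        => Console.WriteLine($"Skipped: {e.Source}");

    // Summarizes the counters exposed by TransferStatus.
    private string TransferStatusToString(string directoryName, TransferStatus status)
        => $"{directoryName}: {status.NumberOfFilesTransferred} transferred, " +
           $"{status.NumberOfFilesFailed} failed, {status.NumberOfFilesSkipped} skipped, " +
           $"{status.BytesTransferred} bytes";
}

If throughput matters at the 5-10 TB scale in the question, TransferManager.Configurations.ParallelOperations can also be raised to increase copy parallelism.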