The backend that loads the data from S3 is s3fs, and it has a section on credentials <a href="https://s3fs.readthedocs.io/en/latest/#credentials" rel="noreferrer">here</a>, which mostly points you to boto3's documentation.

The short answer is that there are several ways to provide S3 credentials, some of which are automatic: a credentials file in the right place, environment variables (which must be accessible to all workers), or the cluster metadata service.

Alternatively, you can provide your key/secret directly in the call, but that of course means you must trust your execution platform and the communication between workers:

<pre><code>import dask.dataframe as dd
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options={'key': mykey, 'secret': mysecret})
</code></pre>
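
If you do pass the key/secret explicitly, one way to keep them out of your source is to read them from the environment of the client process. The sketch below is only an illustration: the helper names `storage_options_from_env` and `read_csv_from_s3` are made up for this answer, and it assumes the standard `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and (optionally) `AWS_SESSION_TOKEN` variables are set where the client runs. The `key`, `secret` and `token` entries are the s3fs parameter names.

```python
import os


def storage_options_from_env(environ=None):
    """Build an s3fs-style storage_options dict from the standard AWS variables.

    Hypothetical helper: raises KeyError if the key or secret is missing.
    """
    env = os.environ if environ is None else environ
    options = {
        "key": env["AWS_ACCESS_KEY_ID"],
        "secret": env["AWS_SECRET_ACCESS_KEY"],
    }
    # Temporary (STS) credentials also carry a session token.
    token = env.get("AWS_SESSION_TOKEN")
    if token:
        options["token"] = token
    return options


def read_csv_from_s3(url, environ=None):
    """Read a CSV from S3 with credentials taken from the environment."""
    import dask.dataframe as dd  # imported lazily; needs dask and s3fs installed

    return dd.read_csv(url, storage_options=storage_options_from_env(environ))
```

You would then call something like `read_csv_from_s3('s3://mybucket/some-big.csv')`. The credentials still travel to the workers inside `storage_options`, so the trust caveat above applies unchanged.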

The set of parameters you can pass in <code>storage_options</code> when using s3fs can be found in the <a href="https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem" rel="noreferrer">API docs</a>.

General reference <a href="http://docs.dask.org/en/latest/remote-data-services.html" rel="noreferrer">http://docs.dask.org/en/latest/remote-data-services.html</a>

If you're running inside your virtual private cloud (VPC), for example on an instance that has an IAM role attached, S3 access will likely already be credentialed and you can read the file without a key:

<pre><code>import dask.dataframe as dd
df = dd.read_csv('s3://&lt;bucket&gt;/&lt;path to file&gt;.csv')
</code></pre>

If you aren't credentialed, you can use the <code>storage_options</code> parameter and pass a key pair (key and secret):

<pre><code>import dask.dataframe as dd
storage_options = {'key': '&lt;s3 key&gt;', 'secret': '&lt;s3 secret&gt;'}
df = dd.read_csv('s3://&lt;bucket&gt;/&lt;path to file&gt;.csv', storage_options=storage_options)
</code></pre>

The full Dask documentation on remote data can be found <a href="http://docs.dask.org/en/latest/remote-data-services.html" rel="nofollow noreferrer">here</a>.

Under the hood, Dask uses boto3 (through s3fs), so you can set up your keys in pretty much any way boto3 supports, e.g. selecting a role-based profile with <code>export AWS_PROFILE=xxxx</code>, or explicitly exporting your access key and secret as environment variables. I would advise against hard-coding your keys, lest you expose them to the public by mistake.

<pre><code>$ export AWS_PROFILE=your_aws_cli_profile_name
</code></pre>

or configure credentials as described in the AWS documentation:

<a href="https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html" rel="nofollow noreferrer">https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html</a>
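
If you can't export the variable in your shell (a notebook, say), the same profile selection can be done from inside Python before the first S3 access. This is just a sketch: `use_aws_profile` and `read_csv_with_profile` are hypothetical names, and it only affects the current process and any child processes it starts afterwards. Workers on other machines still need their own profile or credentials.

```python
import os


def use_aws_profile(profile_name):
    """Mirror `export AWS_PROFILE=...` for the current Python process.

    Hypothetical helper: returns the value now stored in the environment.
    """
    os.environ["AWS_PROFILE"] = profile_name
    return os.environ["AWS_PROFILE"]


def read_csv_with_profile(s3_url, profile_name):
    """Select a named profile, then read the CSV with Dask."""
    import dask.dataframe as dd  # imported lazily; needs dask and s3fs installed

    use_aws_profile(profile_name)
    return dd.read_csv(s3_url)
```

For example, `read_csv_with_profile('s3://mybucket/some-big.csv', 'your_aws_cli_profile_name')`, where the profile is one already defined in your AWS CLI configuration.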

For S3 you can use a wildcard pattern to read multiple chunked files in one call:

<pre><code>import dask.dataframe as dd

# Given N CSV files in S3, read them all and compute the total record count
s3_url = 's3://&lt;bucket_name&gt;/dask-tutorial/data/accounts.*.csv'
df = dd.read_csv(s3_url)

print(df.head())
print(len(df))
</code></pre>
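
To get a feel for which objects a pattern like `accounts.*.csv` would pick up, you can try it against a list of key names locally. This is a rough sketch using the standard library's `fnmatch`, which only approximates the globbing s3fs does (for instance, `fnmatch`'s `*` also matches `/`). The key names below are invented for illustration.

```python
from fnmatch import fnmatchcase


def preview_matches(pattern, keys):
    """Return the keys that a shell-style wildcard pattern would match."""
    return [key for key in keys if fnmatchcase(key, pattern)]


# Made-up object keys, as they might appear under a bucket
keys = [
    "dask-tutorial/data/accounts.0.csv",
    "dask-tutorial/data/accounts.1.csv",
    "dask-tutorial/data/accounts.json",
    "dask-tutorial/data/README.txt",
]

matched = preview_matches("dask-tutorial/data/accounts.*.csv", keys)
print(matched)
```

Only the two `accounts.N.csv` keys match here; the `.json` and `.txt` objects are left out, so they would not end up in the dataframe.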

I can load the data only if I make the file public and then set the <code>anon</code> parameter to <code>True</code>.

<pre><code>df = dd.read_csv('s3://mybucket/some-big.csv', storage_options={'anon': True})
</code></pre>

This is not recommended for obvious reasons. How do I load the data from S3 securely?

Loading data from S3 to dask dataframe
