
If you want the data available in Hive so that you can run mostly aggregations on it, I would suggest one of the following approaches using Spark.

If you have multi-line JSON files (each file is one JSON document spanning several lines):

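<pre><code>// wholeTextFiles yields (path, content) pairs; .values keeps each file's content as one JSON record
val df = spark.read.json(sc.wholeTextFiles("hdfs://your/hdfs/path/*.json").values)
// Save as a Parquet-backed table registered in the Hive metastore
df.write.format("parquet").mode("overwrite").saveAsTable("yourhivedb.tablename")
</code></pre>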

If you have single-line JSON files (one JSON object per line):

<pre><code>// Each line of every matching file is parsed as one JSON record
val df = spark.read.json("hdfs://your/hdfs/path/*.json")
df.write.format("parquet").mode("overwrite").saveAsTable("yourhivedb.tablename")
</code></pre>

Spark will automatically infer the table schema for you. If you are using the Cloudera distribution, you will be able to read the data with Impala (depending on your Cloudera version, it may not support complex, nested structures).
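Schema inference means Spark has to read the JSON before it can write anything, which can be costly at this volume. The following is a minimal sketch, assuming Spark 2.2 or later with Hive support enabled, that supplies an explicit schema and uses the `multiLine` reader option in place of `wholeTextFiles`. The object name and the fields `id`, `ts`, and `payload` are hypothetical; replace them with the fields your JSON actually contains.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object JsonToHive {
  // Hypothetical fields, for illustration only.
  val eventSchema: StructType = StructType(Seq(
    StructField("id", StringType, nullable = true),
    StructField("ts", LongType, nullable = true),
    StructField("payload", StringType, nullable = true)
  ))

  def load(spark: SparkSession, path: String, multiLine: Boolean): DataFrame =
    spark.read
      .schema(eventSchema)            // explicit schema, so no inference pass is needed
      .option("multiLine", multiLine) // true when one JSON document spans several lines
      .json(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-to-hive")
      .enableHiveSupport()
      .getOrCreate()

    load(spark, "hdfs://your/hdfs/path/*.json", multiLine = false)
      .write.format("parquet").mode("overwrite").saveAsTable("yourhivedb.tablename")

    spark.stop()
  }
}
```

With an explicit schema, JSON fields that are not listed are dropped and listed fields that are missing from a record come back as null, so the schema should be checked against real samples first.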


<blockquote>
 I want to insert the JSON to Hadoop
</blockquote>

Just put it in HDFS. Since your data spans a time period, you'll want to lay it out in partition directories for Hive to read:

<pre><code>jsondata/dt=20180619/foo.json
jsondata/dt=20180620/bar.json
</code></pre>
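The layout above can be produced by whatever process writes the files. As a rough sketch, the helper below builds the `dt=yyyyMMdd` directory for a given day; the object and method names are hypothetical and only the directory convention comes from the example above.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object PartitionPaths {
  // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, matching dt=20180619.
  private val DtFormat = DateTimeFormatter.BASIC_ISO_DATE

  /** Directory that holds one day's files, e.g. jsondata/dt=20180619 */
  def partitionDir(baseDir: String, day: LocalDate): String =
    s"${baseDir.stripSuffix("/")}/dt=${day.format(DtFormat)}"

  /** Full path of a single file inside that day's partition. */
  def filePath(baseDir: String, day: LocalDate, fileName: String): String =
    s"${partitionDir(baseDir, day)}/$fileName"
}
```

For example, `PartitionPaths.filePath("jsondata", LocalDate.of(2018, 6, 19), "foo.json")` gives `jsondata/dt=20180619/foo.json`.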

<blockquote>
 Do I need to use Hive and create an Avro schema for my JSON?
</blockquote>

No. I'm not sure where Avro and JSON got mixed up. That said, if you can convert the JSON into Avro with a defined schema, that would improve Hive queries, since querying structured binary data is more efficient than parsing JSON text.

<blockquote>
 Do I need to insert the JSON as a string into a specific column?
</blockquote>

Not recommended. You could, but then you could not query its fields through Hive's <a href="https://github.com/rcongiu/Hive-JSON-Serde/blob/develop/README.md" rel="nofollow noreferrer">JSON SerDe support</a>.

Don't forget that with the directory structure above, the table needs <code>PARTITIONED BY (dt STRING)</code>. For partitions to be registered on the table for files that already exist, you'll need to run an <code>MSCK REPAIR TABLE</code> command manually (and daily).
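To show how those two statements fit together, here is a small sketch that only assembles the HiveQL as strings; it does not run anything against Hive. The table name, the columns `id` and `payload`, and the location are hypothetical. The SerDe class is the one provided by the linked Hive-JSON-Serde project, whose jar has to be available to Hive.

```scala
object HiveJsonDdl {
  // Hypothetical names: adjust the table, columns, and location to your data.
  val table = "jsondata"
  val location = "hdfs:///data/jsondata"

  // External table over the dt=... directories; columns map to JSON field names.
  val createTable: String =
    s"""CREATE EXTERNAL TABLE IF NOT EXISTS $table (
       |  id STRING,
       |  payload STRING
       |)
       |PARTITIONED BY (dt STRING)
       |ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
       |LOCATION '$location'""".stripMargin

  // Registers partition directories that exist in HDFS but not yet in the metastore.
  val repair: String = s"MSCK REPAIR TABLE $table"

  def main(args: Array[String]): Unit = {
    println(createTable + ";")
    println(repair + ";")
  }
}
```

The printed statements can be pasted into a Hive session; the repair statement is the one that would be scheduled daily as new `dt=` directories arrive.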

<blockquote>
 I have JSON as a string (from Kafka)
</blockquote>

Don't use Spark for that (at least, don't reinvent the wheel). My suggestion would be to use Confluent's HDFS connector for Kafka Connect, which comes with Hive table creation support.

I have a lot of data (JSON strings) per day (around 150-200B).

I want to insert the JSON into Hadoop. What is the best way to do it (I need fast inserts and fast queries on JSON fields)?

Do I need to use Hive and create an Avro schema for my JSON? Or do I need to insert the JSON as a string into a specific column?

Insert JSON into Hadoop
