
You can use the <code>head</code> command with Hadoop too! The syntax is:

<pre><code>hdfs dfs -cat &lt;hdfs_filename&gt; | head -n 3
</code></pre>

This prints only the first three lines of the file.
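
As a minimal sketch, the pipeline above can be wrapped in a small shell function. The name `hdfs_head` and its default of 10 lines are illustrative assumptions, not part of Hadoop. Because `head` exits once it has printed the requested lines, the upstream `hdfs dfs -cat` is normally cut short as well, so the whole file should not need to be streamed.

```shell
# hdfs_head PATH [N]: print the first N lines (default 10) of an HDFS file.
# The function name and the default are hypothetical, chosen for illustration.
hdfs_head() {
  hdfs dfs -cat "$1" | head -n "${2:-10}"
}

# Example call (the path is a placeholder):
# hdfs_head /path/on/hdfs/iris2.csv 3
```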


The <code>head</code> and <code>tail</code> commands on Linux display the first 10 and last 10 lines of a file by default. However, their output is not randomly sampled: the lines appear in the same order as in the file itself.

The Linux shuffle command, <code>shuf</code>, generates random permutations of its input lines. Combining it with the Hadoop commands is helpful here, like so:

<code>$ hadoop fs -cat &lt;file_path_on_hdfs&gt; | shuf -n &lt;N&gt;</code>

For example, if <code>iris2.csv</code> is a file on HDFS and you want 50 lines randomly sampled from the dataset:

<code>$ hadoop fs -cat /file_path_on_hdfs/iris2.csv | shuf -n 50</code>

Note: The Linux <code>sort</code> command (with its <code>-R</code> option) could also be used, but <code>shuf</code> is faster and does a better job of random sampling.
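
If the CSV has a header row (an assumption here; the answer does not say whether `iris2.csv` has one), a plain `shuf` would treat the header as just another line. A minimal sketch that keeps the first line and samples only from the rest; the function name `sample_keep_header` is made up for illustration:

```shell
# sample_keep_header N: read lines on stdin, print the first line unchanged,
# then print N lines sampled at random from the remainder.
# Assumes the first line is a header row and that GNU shuf is installed.
sample_keep_header() {
  # read consumes exactly one line from the pipe, leaving the rest for shuf
  IFS= read -r header
  printf '%s\n' "$header"
  shuf -n "$1"
}

# Example call (placeholder path):
# hadoop fs -cat /file_path_on_hdfs/iris2.csv | sample_keep_header 50
```

Using `read` rather than `head -n 1` for the header matters on a pipe: `head` may read ahead and swallow lines that `shuf` would then never see.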


<pre><code>hdfs dfs -cat yourFile | shuf -n &lt;number_of_line&gt;
</code></pre>

This will do the trick for you. Note that <code>shuf</code> is not available on macOS by default; you can get it by installing GNU coreutils.
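
Where `shuf` is not installed, one possible fallback is reservoir sampling in `awk`, which ships with macOS and most Linux systems. This is a sketch of a swapped-in technique, not something from the answer above, and the function name `reservoir_sample` is hypothetical. It streams the input once and keeps only N lines in memory.

```shell
# reservoir_sample N: print N lines chosen uniformly at random from stdin.
# If the input has fewer than N lines, all of them are printed.
# Note: srand() seeds from the clock, so two runs within the same second
# may return the same sample.
reservoir_sample() {
  awk -v n="$1" '
    BEGIN { srand() }
    NR <= n { r[NR] = $0; next }          # fill the reservoir first
    {
      i = int(rand() * NR) + 1            # random slot in 1..NR
      if (i <= n) r[i] = $0               # replace with probability n/NR
    }
    END {
      m = (NR < n) ? NR : n
      for (i = 1; i <= m; i++) print r[i]
    }'
}

# Example call (placeholder file name):
# hdfs dfs -cat yourFile | reservoir_sample 50
```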


My suggestion would be to load the data into a Hive table; then you can do something like this:

<pre><code>SELECT column1, column2 FROM (
    SELECT iris2.column1, iris2.column2, rand() AS r
    FROM iris2
    ORDER BY r
) t
LIMIT 50;
</code></pre>

EDIT: This is a simpler version of that query:

<pre><code>SELECT iris2.column1, iris2.column2
FROM iris2
ORDER BY rand()
LIMIT 50;
</code></pre>


Run this command:

<pre><code>sudo -u hdfs hdfs dfs -cat "path of csv file" | head -n 50
</code></pre>

Here 50 is the number of lines; adjust it to suit your needs.


Working code:
<pre><code>hadoop fs -cat /tmp/a/b/20200630.xls | head -n 10

hadoop fs -cat /tmp/a/b/20200630.xls | tail -n 3
</code></pre>


I was using <code>tail</code> and <code>cat</code> on an Avro file on an HDFS cluster, but the output was not printed in the correct encoding. I tried the following instead, and it worked well for me:
<pre><code>hdfs dfs -text hdfs://&lt;path_of_directory&gt;/part-m-00000.avro | head -n 1
</code></pre>
Change 1 to a larger integer to print more records from the Avro file.


<pre><code>hadoop fs -cat /user/hive/warehouse/vamshi_customers/* | tail
</code></pre>

I think the head part works fine as described in the answer posted by @Viacheslav Rodionov, but for the tail part, the command I posted above works well.
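
Piping `-cat` into `tail` streams the entire content through the pipe before any output appears. For a single file, Hadoop also has a built-in `-tail` option, which prints only the last kilobyte of the file. A minimal sketch combining the two; the function name `hdfs_tail_lines` is an illustrative assumption, and the result may contain fewer lines than requested (or a truncated first line) when the lines do not fit in that final kilobyte:

```shell
# hdfs_tail_lines PATH [N]: print up to the last N lines (default 10)
# found in the final kilobyte of a single HDFS file.
# The function name is hypothetical; -tail is limited to the last 1 KB.
hdfs_tail_lines() {
  hadoop fs -tail "$1" | tail -n "${2:-10}"
}

# Example call (placeholder path):
# hdfs_tail_lines /path/on/hdfs/part-00000 5
```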

I have <code>2 GB</code> of data in my <code>HDFS</code>.

Is it possible to get a few lines of that data at random, like we do on the Unix command line?

<pre><code>cat iris2.csv |head -n 50
</code></pre>

Get a few lines of HDFS data

Hadoop 

Linux

Hive

HDFS

Unix
