
The UNIX Bash Scripting blog <a href="http://unstableme.blogspot.com/2008/03/remove-duplicates-without-sorting-file.html" rel="noreferrer">suggests</a>:

<pre><code>awk '!x[$0]++'
</code></pre>

This command tells awk which lines to print. The variable <code>$0</code> holds the entire contents of a line, and square brackets are array access. So, for each line of the file, the element of the array <code>x</code> keyed by that line is incremented, and the line is printed only if that element was not (<code>!</code>) previously set.
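For readers more at home in Python, the same logic can be sketched roughly as follows. The function name `first_occurrences` is made up for illustration; the dictionary `x` stands in for awk's array.

```python
from collections import defaultdict

def first_occurrences(lines):
    """Yield each line the first time it appears, like awk '!x[$0]++'."""
    x = defaultdict(int)           # plays the role of awk's array x
    for line in lines:
        count_before = x[line]     # value before the increment (0 if unset)
        x[line] += 1               # the post-increment: x[$0]++
        if not count_before:       # the negation: !
            yield line

print(list(first_occurrences(["b", "a", "b", "c", "a"])))
```

Like the awk one-liner, this keeps every distinct line in memory, so memory use grows with the number of unique lines.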


Michael Hoffman's solution above is short and sweet. For larger files, a Schwartzian transform approach, which adds an index field with awk and then runs several rounds of sort and uniq, involves less memory overhead. The following snippet works in bash:

<pre><code>awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2 -k1,1n | uniq --skip-fields 1 | sort -t$'\t' -k1,1n | cut -f2- -d$'\t'
</code></pre>


Thanks, 1_CR! I needed a "uniq -u" (remove duplicated lines entirely) rather than plain uniq (keep one copy of each duplicate). The awk and perl solutions can't really be modified to do this, but yours can! I may also need the lower memory use, since I will be uniq'ing something like 100,000,000 lines 8-). In case anyone else needs it, I just put a "-u" in the uniq portion of the command:

<pre><code>awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2 | uniq -u --skip-fields 1 | sort -t$'\t' -k1,1n | cut -f2- -d$'\t'
</code></pre>


A late answer - I just ran into a duplicate of this - but perhaps worth adding...
The principle behind @1_CR's answer can be written more concisely, using <code>cat -n</code> instead of <code>awk</code> to add line numbers:
<pre><code>cat -n file_name | sort -uk2 | sort -n | cut -f2-
</code></pre>
<ul>
<li>Use <code>cat -n</code> to prepend line numbers</li>
<li>Use <code>sort -u</code> to remove duplicate data (<code>-k2</code> says 'start at field 2 for sort key')</li>
<li>Use <code>sort -n</code> to sort by prepended number</li>
<li>Use <code>cut</code> to remove the line numbering (<code>-f2-</code> says 'select field 2 till end')</li>
</ul>
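The four steps above can be mirrored in Python as a rough illustration of the number, sort, dedupe, re-sort, strip idea. The function name `dedupe_via_sort` is hypothetical, and unlike `sort`, which can spill to temporary files, this sketch holds everything in memory.

```python
def dedupe_via_sort(lines):
    """Decorate with line numbers, sort by content, drop repeats, restore order."""
    numbered = list(enumerate(lines, start=1))   # cat -n: prepend line numbers
    numbered.sort(key=lambda pair: pair[1])      # sort on the content (stable sort)
    kept = []
    for number, text in numbered:                # -u: keep one entry per content
        if not kept or kept[-1][1] != text:
            kept.append((number, text))
    kept.sort()                                  # sort -n: back to original order
    return [text for _, text in kept]            # cut -f2-: strip the numbers

print(dedupe_via_sort(["b", "a", "b", "c", "a"]))
```

Because Python's sort is stable, the entry kept for each distinct line is the one with the lowest line number, so the first occurrence survives.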


I just wanted to remove duplicates that appear on consecutive lines, not everywhere in the file. So I used:

<pre><code>awk '{
 if ($0 != PREVLINE) print $0;
 PREVLINE=$0;
}'
</code></pre>
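As a point of comparison, the same "only collapse adjacent repeats" behaviour can be sketched in Python with `itertools.groupby`. The function name `drop_adjacent_repeats` is invented for this example.

```python
from itertools import groupby

def drop_adjacent_repeats(lines):
    """Collapse each run of identical consecutive lines into a single line."""
    # groupby starts a new group whenever the value changes, so a line that
    # reappears later, after a different line, is kept again.
    return [key for key, _run in groupby(lines)]

print(drop_adjacent_repeats(["a", "a", "b", "a", "a"]))
```

Note that a line is only dropped when it equals the line immediately before it, which is why the later run of repeated lines still contributes one line to the output.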


To remove duplicates across two files:

<pre><code>awk '!a[$0]++' file1.csv file2.csv
</code></pre>


The <code>uniq</code> command works, even in an alias: <a href="http://man7.org/linux/man-pages/man1/uniq.1.html" rel="nofollow noreferrer">http://man7.org/linux/man-pages/man1/uniq.1.html</a>


Now you can check out this small tool written in Rust: <a href="https://github.com/lostutils/uq" rel="nofollow noreferrer">uq</a>.
It performs uniqueness filtering without having to sort the input first, so it can be applied to a continuous stream.
It has two advantages over the top-voted awk solution and other shell-based solutions:
<ol>
<li><code>uq</code> remembers which lines have occurred by storing their hash values, so it doesn't use as much memory when the lines are long.</li>
<li><code>uq</code> can keep memory usage constant by limiting the number of entries it stores (when the limit is reached, a flag controls whether to override or to die), while the <code>awk</code> solution could run out of memory when there are too many lines.</li>
</ol>
</ol>
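To make those two points concrete, here is a minimal Python sketch of the general idea. It is not how <code>uq</code> is implemented. The names `unique_by_hash`, `capacity` and `override` are assumptions made up for illustration and are not <code>uq</code>'s actual options, and `capacity` is assumed to be at least 1.

```python
import hashlib

def unique_by_hash(lines, capacity=None, override=False):
    """Yield lines whose 8-byte hash has not been seen yet.

    capacity: hypothetical cap on the number of stored hashes (None = no cap).
    override: if True, evict the oldest hash when full; if False, raise.
    """
    seen = {}  # digest -> None; a dict preserves insertion order for eviction
    for line in lines:
        # Store a fixed-size digest instead of the (possibly long) line itself.
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()
        if digest in seen:
            continue
        if capacity is not None and len(seen) >= capacity:
            if not override:
                raise RuntimeError("capacity reached")
            del seen[next(iter(seen))]  # drop the oldest stored digest
        seen[digest] = None
        yield line

print(list(unique_by_hash(["b", "a", "b", "c", "a"])))
```

The trade-offs are visible in the sketch: with a cap and eviction, a line whose hash was evicted can be emitted again later, and with any hash-only scheme two different lines that collide would be treated as duplicates.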

I have a utility script in Python:

<pre><code>#!/usr/bin/env python
import sys
unique_lines = []
duplicate_lines = []
for line in sys.stdin:
    if line in unique_lines:
        duplicate_lines.append(line)
    else:
        unique_lines.append(line)
        sys.stdout.write(line)
# optionally do something with duplicate_lines
</code></pre>

This simple functionality (<code>uniq</code> without needing to sort first, stable ordering) must be available as a simple UNIX utility, mustn't it? Maybe a combination of filters in a pipe?

Reason for asking: I need this functionality on a system on which I cannot execute Python from anywhere.

Remove duplicate lines without sorting
