Assuming that the words are one per line, and the file is already sorted:

<pre><code>uniq filename
</code></pre>

If the file's not sorted:

<pre><code>sort filename | uniq
</code></pre>

If they're not one per line, and you don't mind them being one per line:

<pre><code>tr -s '[:space:]' '\n' &lt; filename | sort | uniq
</code></pre>

That doesn't remove punctuation, though, so maybe you want:

<pre><code>tr -s '[:space:][:punct:]' '\n' &lt; filename | sort | uniq
</code></pre>

But that also splits hyphenated words at the hyphen. See <code>man tr</code> for more options.
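
If hyphenated words should survive, one option is to turn the approach around: instead of listing the characters to throw away, list the characters to keep. A minimal sketch, assuming a word is any run of letters, digits, and hyphens; the sample text and the <code>words.txt</code> name are made up for illustration:

```shell
# Hypothetical sample input (not from the original thread).
tmp=$(mktemp -d)
printf 'well-known, word1; word2\nword1 well-known\n' > "$tmp/words.txt"

# -c complements the set, so every character that is NOT alphanumeric or a
# hyphen is turned into a newline; -s squeezes runs of newlines into one.
result=$(tr -cs '[:alnum:]-' '\n' < "$tmp/words.txt" | sort | uniq)
printf '%s\n' "$result"
rm -r "$tmp"
```

The trade-off is that a stray hyphen used as a dash would now show up as a "word" of its own.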

<code>ruby -pi.bak -e '$_ = $_.chomp.split(/,\s*/).uniq.join(", ") + "\n"' filename</code> ?

I'll admit that mixing the two kinds of quotes is ugly.

Creating a unique list is pretty easy thanks to <code>uniq</code>. Most Unix commands expect one entry per line rather than a comma-separated list, though, so we have to start by converting the input to that form:

<pre><code>$ sed 's/, /\n/g' filename | sort | uniq
word1
word2
word3
word4
word5
word6
word7
</code></pre>

The harder part is putting this on one line again with commas as separators and not terminators. I used a perl one-liner to do this, but if someone has something more idiomatic, please edit me. :)

<pre><code>$ sed 's/, /\n/g' filename | sort | uniq | perl -e '@a = &lt;&gt;; chomp @a; print((join ", ", @a), "\n")'
word1, word2, word3, word4, word5, word6, word7
</code></pre>
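
One candidate for "more idiomatic" is <code>paste</code>: its <code>-s</code> option joins all input lines into a single line and <code>-d</code> sets the delimiter, so the separator never ends up as a terminator. A sketch, assuming a hypothetical <code>words.txt</code> and using <code>tr</code> for the split so it does not depend on GNU sed's handling of <code>\n</code>:

```shell
# Hypothetical sample input modelled on the question.
tmp=$(mktemp -d)
printf 'word1, word2, word3, word2, word4, word3\n' > "$tmp/words.txt"

# tr turns commas and spaces into newlines (squeezing repeats), sort -u
# removes the duplicates, paste joins the lines with commas, and the final
# sed puts the space back after each comma.
result=$(tr -s ', ' '\n' < "$tmp/words.txt" | sort -u | paste -s -d, - | sed 's/,/, /g')
printf '%s\n' "$result"
rm -r "$tmp"
```

Note that splitting on spaces as well as commas would break entries that contain a space themselves.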

Here's an awk script that will leave each line intact, only removing the duplicate words:

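<pre><code>BEGIN {
    FS = ", "
}
{
    out = ""
    for (i = 1; i &lt;= NF; i++) {
        if (!($i in used)) {          # first time this word appears on the line
            used[$i] = 1
            out = out (out == "" ? "" : ", ") $i
        }
    }
    print out                         # original order kept, no trailing comma
    split("", used)                   # clear the array for the next line
}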
</code></pre>

I had the very same problem today: a word list with 238,000 words, of which about 40,000 were duplicates. I already had them on individual lines, sorted, by doing

<pre><code>cat filename | tr " " "\n" | sort 
</code></pre>

To remove the duplicates I simply did

<pre><code>cat filename | uniq &gt; newfilename
</code></pre>

It worked perfectly, with no errors, and now my file is down from 1.45MB to 1.01MB.

I'd think you'll want to replace the spaces with newlines, use the <a href="http://www.computerhope.com/unix/uuniq.htm" rel="nofollow noreferrer">uniq</a> command to find unique lines, then replace the newlines with spaces again.
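
Spelled out, that idea might look like the sketch below. It assumes space-separated words in a hypothetical <code>words.txt</code>, and it adds a <code>sort</code> step because <code>uniq</code> only collapses duplicates that sit on adjacent lines:

```shell
# Hypothetical sample input: space-separated words on one line.
tmp=$(mktemp -d)
printf 'pear apple pear fig apple\n' > "$tmp/words.txt"

# 1. spaces -> newlines, 2. sort so duplicates become adjacent,
# 3. uniq drops the repeats, 4. newlines -> spaces again.
result=$(tr ' ' '\n' < "$tmp/words.txt" | sort | uniq | tr '\n' ' ')
printf '%s\n' "$result"
rm -r "$tmp"
```

The last <code>tr</code> also converts the final newline, so the output ends in a space rather than a line break, and the words come out sorted rather than in their original order.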

I presumed you wanted the words to be unique on a single line, rather than throughout the file. If this is the case, then the Perl script below will do the trick.

<pre><code>while (&lt;DATA&gt;)
{
    chomp;
    my %seen = ();
    my @words = split(m!,\s*!);
    @words = grep { $seen{$_} ? 0 : ($seen{$_} = 1) } @words;
    print join(", ", @words), "\n";
}

__DATA__
word1, word2, word3, word2, word4, word5, word3, word6, word7, word3
</code></pre>

If you want uniqueness over the whole file, you can just move the <code>%seen</code> hash declaration outside the <code>while (&lt;DATA&gt;) { ... }</code> loop.

Came across this thread while trying to solve much the same problem. I had concatenated several files containing passwords, so naturally there were a lot of doubles. Also, many non-standard characters. I didn't really need them sorted, but it seemed that was gonna be necessary for uniq.

I tried:

<pre><code>sort /Users/me/Documents/file.txt | uniq -u
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `t\203tonnement' and `t\203tonner'
</code></pre>

Tried:

<pre><code>sort -u /Users/me/Documents/file.txt &gt;&gt; /Users/me/Documents/file2.txt
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `t\203tonnement' and `t\203tonner'.
</code></pre>

And even tried passing it through cat first, just so I could see if we were getting a proper input.

<pre><code>cat /Users/me/Documents/file.txt | sort | uniq -u &gt; /Users/me/Documents/file2.txt
sort: string comparison failed: Illegal byte sequence
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were `zon\351s' and `zoologie'.
</code></pre>

I'm not sure what's happening. The strings "t\203tonnement" and "t\203tonner" aren't found in the file, though "t/203" and "tonnement" are found, but on separate, non-adjoining lines. Same with "zon\351s".

What finally worked for me was:

<pre><code>awk '!x[$0]++' /Users/me/Documents/file.txt &gt; /Users/me/Documents/file2.txt
</code></pre>

It also preserved words whose only difference was case, which is what I wanted. I didn't need the list sorted, so it was fine that it wasn't.
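
For anyone who does want sorted output, the error message itself points at a workaround: run <code>sort</code> in the C locale so that it compares raw bytes instead of trying to decode them as characters. Note also that <code>uniq -u</code> prints only the lines that are never repeated, so duplicated entries disappear entirely instead of being reduced to one copy; <code>sort -u</code> keeps one copy of each. A sketch on made-up data containing a Latin-1 byte (the file names are hypothetical):

```shell
tmp=$(mktemp -d)
# \351 is the Latin-1 byte for "e with acute accent"; on its own it is not
# valid UTF-8, which is the kind of input that triggers the error above.
printf 'zon\351s\nzoologie\nzon\351s\nZoologie\n' > "$tmp/file.txt"

# LC_ALL=C makes sort compare plain bytes, so undecodable sequences are
# no longer an error; -u keeps one copy of each distinct line.
LC_ALL=C sort -u "$tmp/file.txt" > "$tmp/file2.txt"

count=$(wc -l < "$tmp/file2.txt" | tr -d ' ')
echo "$count unique lines"
rm -r "$tmp"
```

Byte-wise comparison is case-sensitive, so words that differ only in case are kept apart here as well.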

And don't forget the <code>-c</code> option for the <code>uniq</code> utility if you're interested in getting a count of the words as well.
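
For instance, a quick way to see which words repeat and how often (the input and the <code>words.txt</code> name are made up for illustration):

```shell
tmp=$(mktemp -d)
printf 'word1, word2, word3, word2, word3, word2\n' > "$tmp/words.txt"

# uniq -c prefixes each distinct line with its number of occurrences;
# sort -rn then lists the most frequent words first.
counts=$(tr -s ', ' '\n' < "$tmp/words.txt" | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"

# The word in the first row, i.e. the most frequent one.
top=$(printf '%s\n' "$counts" | awk 'NR == 1 { print $2 }')
rm -r "$tmp"
```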

Open the file with vim (<code>vim filename</code>) and run the sort command with the unique flag (<code>:sort u</code>).

I have a plain text file with words separated by commas, for example:

<pre><code>word1, word2, word3, word2, word4, word5, word 3, word6, word7, word3
</code></pre>

I want to delete the duplicates so that it becomes:

<pre><code>word1, word2, word3, word4, word5, word6, word7
</code></pre>

Any ideas? I think egrep can help me, but I'm not sure exactly how to use it.

How to remove duplicate words from a plain text file using a Linux command
