I have a very large text file that looks like this:
Mitchel-2
Anna-2
Witold-4
Serena-3
Serena-9
Witros-3
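I need the first word, the part before the "-", to never repeat. Any method that deletes every occurrence except the first is fine. For example, if I have 3000 lines that start with "Serena" but always have a different number after the "-", is there a way to delete 2999 of the Serena lines and keep only the first one?
Serena is just an example; I have more than 200 repeated words.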
Answered 2016-03-14 03:10:55
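I don't think you can do this with Notepad++. You could use a regular expression for each name, but with more than 200 names that is impractical.
You can, however, write a program to do the job for you. It basically takes two steps:
1) Search for every unique name and store it in a set (a container that does not allow duplicate entries).
2) For each unique name in the set, search the file for its duplicates.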
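I wrote a simple C++ program that finds the duplicates in a string variable. You can adapt it to whatever language you prefer. I compiled it with Microsoft Visual Studio Community 2015 (it does not work on cpp.sh).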
#include "stdafx.h" // Visual Studio precompiled header; remove this line when building elsewhere
#include <iostream>
#include <regex>
#include <set>
#include <string>
using namespace std;

int main()
{
    set<string> names;
    string notepad_text = "Serena-1\nSerena-2\nSerena-3\nSerena-4\nAna-1\nSerena-5\nWilson-1\nAna-2\nJohn-1\nAna-3\nJohn-2\nWilson-2";

    // Note: this relies on ^ matching at the start of every line, which is how
    // Visual Studio's std::regex behaves; other standard libraries may match ^
    // only at the start of the whole string.
    regex regex_find_names("^\\w+"); // the backslash is doubled because this is a C++ string literal

    // 1) Find every name; the set silently ignores duplicates
    sregex_iterator find_names_it(notepad_text.begin(), notepad_text.end(), regex_find_names);
    sregex_iterator it_end; // a default-constructed iterator marks the end of the matches
    while (find_names_it != it_end) {
        names.insert(find_names_it->str());
        ++find_names_it;
    }

    // 2) For demonstration purposes, print the names we found
    cout << "---printing the names we've found:\n\n";
    for (const string& name : names) {
        cout << name << " ";
    }

    // 3) Find the duplicates of each name
    cout << "\n\n---printing the regex matches:\n";
    for (const string& name : names) {
        cout << "\n-Lets find duplicates of: " << name << endl;

        // Build a pattern like "^Serena-.*"; including the '-' keeps a name such as
        // "Ana" from also matching a longer name such as "Anabel"
        regex regex_obj("^" + name + "-.*");
        sregex_iterator it_beg(notepad_text.begin(), notepad_text.end(), regex_obj); // first match: the line to keep
        sregex_iterator it = it_beg; // iterates over all the matches

        // Print every match except the first one
        while (it != it_end) {
            if (it != it_beg) {
                cout << it->str() << endl;
            }
            ++it;
        }
    }

    int i; // depending on the compiler, waiting for input here keeps the console window open
    cin >> i;
    return 0;
}
The input string is:
Serena-1
Serena-2
Serena-3
Serena-4
Ana-1
Serena-5
Wilson-1
Ana-2
John-1
Ana-3
John-2
Wilson-2
and the program prints:
---printing the names we've found:
Ana John Serena Wilson
---printing the regex matches:
-Lets find duplicates of: Ana
Ana-2
Ana-3
-Lets find duplicates of: John
John-2
-Lets find duplicates of: Serena
Serena-2
Serena-3
Serena-4
Serena-5
-Lets find duplicates of: Wilson
Wilson-2
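The program above depends on how the regex engine treats `^` at line starts, which may be why it behaves differently outside Visual Studio. As a hedged alternative, the same grouping can be produced without any regex by splitting the text into lines. This is a sketch under the assumption that the name is everything before the first '-'; `find_duplicates` and `print_duplicates` are names invented for this example.

```cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using DuplicateMap = std::map<std::string, std::vector<std::string>>;

// Maps each name to the lines that repeat it, i.e. every line after the
// first one that starts with that name. Names without repeats map to an
// empty list.
DuplicateMap find_duplicates(const std::string& text) {
    DuplicateMap duplicates;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::string name = line.substr(0, line.find('-'));
        // emplace() only inserts when the name is new; .second tells us which case we hit.
        auto result = duplicates.emplace(name, std::vector<std::string>{});
        if (!result.second) {
            result.first->second.push_back(line); // name seen before: record the duplicate
        }
    }
    return duplicates;
}

// Prints the groups in the same layout as the listing above
// (std::map iterates its keys in sorted order).
void print_duplicates(const DuplicateMap& duplicates, std::ostream& out) {
    for (const auto& entry : duplicates) {
        out << "-Lets find duplicates of: " << entry.first << '\n';
        for (const auto& line : entry.second) {
            out << line << '\n';
        }
    }
}
```

Calling `print_duplicates(find_duplicates(notepad_text), std::cout)` on the input string shown earlier should list the same duplicate lines per name, in one pass over the text instead of one regex search per name.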
https://stackoverflow.com/questions/35972083