Detecting duplicate English names

Stack Overflow user
Asked 2012-01-11 12:33:18
Answers: 1 · Views: 178 · Followers: 0 · Votes: 0

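I am looking for an example that shows how Lucene, or some other kind of index, can check whether a combination of English first and last name is a likely duplicate. The duplicate check needs to account for common nicknames, such as Bob for Robert and Bill for William, as well as misspellings. Does anyone know of an example?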

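I plan to run the duplicate search during user registration. Each new user record needs to be checked against an index built from the database table that stores the user names.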

1 Answer

Stack Overflow user

Accepted answer

Answered 2012-01-13 08:49:58

I would use a SynonymFilter on firstName at index time, so that every possible variant is indexed (Bob -> Robert, Robert -> Bob, etc.). Index your existing users that way.
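To make the index-time expansion concrete, here is a minimal sketch in plain Java, without Lucene, of how nickname groups can be turned into a lookup in which every name points at all of its variants. The class name `NicknameGroups` and the methods `expand` and `indexTerms` are hypothetical, introduced only for illustration; the two groups are the ones used in this answer, not a complete nickname list.

```java
import java.util.*;

public class NicknameGroups {
    /** Builds a lookup in which every name maps to all names of its group (itself included). */
    public static Map<String, Set<String>> expand(List<List<String>> groups) {
        Map<String, Set<String>> lookup = new HashMap<>();
        for (List<String> group : groups) {
            for (String name : group) {
                lookup.computeIfAbsent(name, k -> new TreeSet<>()).addAll(group);
            }
        }
        return lookup;
    }

    /** Terms to index for a first name: the name itself plus every known variant. */
    public static Set<String> indexTerms(Map<String, Set<String>> lookup, String firstName) {
        Set<String> terms = new TreeSet<>();
        terms.add(firstName);
        terms.addAll(lookup.getOrDefault(firstName, Collections.emptySet()));
        return terms;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> lookup = expand(Arrays.asList(
                Arrays.asList("Bob", "Bobby", "Robert"),
                Arrays.asList("William", "Will", "Bill", "Billy")));
        System.out.println(indexTerms(lookup, "Robert")); // [Bob, Bobby, Robert]
        System.out.println(indexTerms(lookup, "Anton"));  // [Anton]
    }
}
```

Because the variants are written into the index, a later query for any single variant (for example Bob) can hit a user stored as Robert without expanding the query itself.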

Then use a QueryParser (without the SynonymFilter in its analyzer) to run some fuzzy queries.

Here is the code I came up with:

Code language: java
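import java.io.IOException;
import java.io.Reader;
import java.util.List;

import com.google.common.collect.HashMultimap;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.Multimap;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

// Assumption: SynonymFilter and SynonymEngine are not in Lucene 3.0 core. They look like
// the helper classes from the "Lucene in Action" sample code; adjust the package if yours differ.
import lia.analysis.synonym.SynonymEngine;
import lia.analysis.synonym.SynonymFilter;

public class NameDuplicateTests {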
    private Analyzer analyzer;
    private IndexSearcher searcher;
    private IndexReader reader;
    private QueryParser qp;

    private final static Multimap<String, String> firstNameSynonyms;
    static {
        firstNameSynonyms = HashMultimap.create();
        List<String> robertSynonyms = ImmutableList.of("Bob", "Bobby", "Robert");
        for (String name: robertSynonyms) {
            firstNameSynonyms.putAll(name, robertSynonyms);
        }
        List<String> willSynonyms = ImmutableList.of("William", "Will", "Bill", "Billy");
        for (String name: willSynonyms) {
            firstNameSynonyms.putAll(name, willSynonyms);
        }
    }

    public static Analyzer createAnalyzer() {
        return new Analyzer() {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream tokenizer = new WhitespaceTokenizer(reader);
                if (fieldName.equals("firstName")) {
                    tokenizer = new SynonymFilter(tokenizer, new SynonymEngine() {
                        @Override
                        public String[] getSynonyms(String s) throws IOException {
                            return firstNameSynonyms.get(s).toArray(new String[0]);
                        }
                    });
                }
                return tokenizer;
            }
        };
    }


    @Before
    public void setUp() throws Exception {
        Directory dir = new RAMDirectory();
        analyzer = createAnalyzer();

        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        ImmutableList<String> firstNames = ImmutableList.of("William", "Robert", "Bobby", "Will", "Anton");
        ImmutableList<String> lastNames = ImmutableList.of("Robert", "Williams", "Mayor", "Bob", "FunkyMother");

        for (int id = 0; id < firstNames.size(); id++) {
            Document doc = new Document();
            doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("firstName", firstNames.get(id), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("lastName", lastNames.get(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();

        // Query side uses a plain WhitespaceAnalyzer: synonyms were already expanded at index time.
        qp = new QueryParser(Version.LUCENE_30, "firstName", new WhitespaceAnalyzer());
        searcher = new IndexSearcher(dir);
        reader = searcher.getIndexReader();
    }

    @After
    public void tearDown() throws Exception {
        searcher.close();
    }

    @Test
    public void testNameFilter() throws Exception {
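        // Nickname match: "Bob" hits the "Robert" document through the indexed synonyms.
        search("+firstName:Bob +lastName:Williams");
        // Misspelled last name: the trailing ~ turns the term into a fuzzy query.
        search("+firstName:Bob +lastName:Wolliam~");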
    }

    private void search(String query) throws ParseException, IOException {
        Query q = qp.parse(query);
        System.out.println(q);
        TopDocs res = searcher.search(q, 3);
        for (ScoreDoc sd: res.scoreDocs) {
            Document doc = reader.document(sd.doc);
            System.out.println("Found " + doc.get("firstName") + " " + doc.get("lastName"));
        }
    }
}

This produces:

Code language: text
+firstName:Bob +lastName:Williams
Found Robert Williams
+firstName:Bob +lastName:wolliam~0.5
Found Robert Williams
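To see why the misspelled `wolliam~0.5` still finds Williams, the sketch below computes a Levenshtein-based similarity in plain Java. It assumes, as a simplification rather than a copy of Lucene's code, that a Lucene 3.x fuzzy match is scored roughly as 1 minus the edit distance divided by the length of the shorter term, and that a term matches when this score exceeds the threshold after the `~`. The class `FuzzySimilarity` and its methods are hypothetical names used only for this illustration.

```java
public class FuzzySimilarity {
    /** Classic Levenshtein distance: insertions, deletions and substitutions each cost 1. */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed scoring: 1 - distance / length of the shorter string (can be negative). */
    public static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1f : 0f;
        }
        return 1f - (float) editDistance(a, b) / shorter;
    }

    public static void main(String[] args) {
        // The parsed query term is lowercase, while lastName is indexed unanalyzed as "Williams".
        System.out.println(editDistance("wolliam", "Williams"));        // 3
        System.out.println(similarity("wolliam", "Williams") > 0.5f);   // true
        System.out.println(editDistance("wolliam", "williams"));        // 2
    }
}
```

Under this assumption the score is 1 - 3/7, about 0.57, which is just above 0.5. One of the three edits is the case difference between `w` and `W`, so lowercasing last names at index time would leave more room for genuine typos.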

Hope this helps!

Votes: 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/8814190
