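I am using dtSearch to highlight text search matches within a document. The code that does this, minus some details and cleanup, is roughly as follows: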
    SearchJob sj = new SearchJob();
    sj.Request = "\"audit trail\""; // the user query
    sj.FoldersToSearch.Add(path_to_src_document);
    sj.Execute();

    // Convert the first search result, wrapping each hit in an anchor plus bold tags
    FileConverter fileConverter = new FileConverter();
    fileConverter.SetInputItem(sj.Results, 0);
    fileConverter.BeforeHit = "<a name=\"HH_%%ThisHit%%\"/><b>";
    fileConverter.AfterHit = "</b>";
    fileConverter.Execute();
    string myHighlightedDoc = fileConverter.OutputString;
If I give dtSearch a quoted phrase query like
"audit trail"
then dtSearch will do the hit highlighting like this:
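An <b>audit</b> <b>trail</b> is a fun thing to have an <b>audit</b> <b>trail</b> about!
Note that each word in the phrase is highlighted separately. Instead, I would like the phrase to be highlighted as a whole unit, like this:
An <b>audit trail</b> is a fun thing to have an <b>audit trail</b> about!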
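This would A) make the highlighting look nicer, B) improve the behavior of my javascript that helps users navigate from hit to hit, and C) give more accurate counts of the total number of hits.
Is there a good way to make dtSearch highlight phrases this way?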
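Answered 2010-04-26 20:13:40
Note: I think the text and code here need more work. This could well become a community wiki if people want to help revise the answer or the code.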
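I asked dtSearch about this (4/26/2010). Their response had two parts:
First, it is not possible to get the desired highlighting behavior just by changing a flag.
Second, it is possible to get some lower-level hit information in which phrase matches are treated as wholes. In particular, if you set both the dtsSearchWantHitsByWord and dtsSearchWantHitsArray flags in your SearchJob, then your search results will be annotated with the word offsets at which each word or phrase in your query matches. For example, if your input document is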
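An audit trail is a fun thing to have an audit trail about!
and your query is
"audit trail"
then (in the .NET API) sj.Results.CurrentItem.HitsByWord will contain a string like this:
audit trail (2 11 )
indicating that the phrase "audit trail" is found starting at the 2nd word and at the 11th word of the document.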
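To make those offsets concrete, here is a minimal sketch that expands phrase start offsets into the word spans they cover. PhraseSpans and ToSpans are hypothetical names used only for illustration (they are not part of dtSearch), and the sketch assumes the phrase length in words is already known.

```csharp
using System;
using System.Collections.Generic;

public static class PhraseSpans
{
    // Expands phrase start offsets into inclusive { firstWord, lastWord } spans.
    // Offsets are word positions as reported in the HitsByWord string.
    public static List<int[]> ToSpans(IEnumerable<int> startOffsets, int phraseLengthInWords)
    {
        if (phraseLengthInWords < 1)
            throw new ArgumentOutOfRangeException("phraseLengthInWords");

        var spans = new List<int[]>();
        foreach (int start in startOffsets)
        {
            spans.Add(new[] { start, start + phraseLengthInWords - 1 });
        }
        return spans;
    }
}
```

For the example above, PhraseSpans.ToSpans(new[] { 2, 11 }, 2) yields the spans 2 to 3 and 11 to 12, i.e. both words of each "audit trail" occurrence.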
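One thing you can do with this information is to create a "skip list" indicating which of the dtSearch highlights are insignificant (i.e. which ones are phrase continuations, rather than the start of a word or phrase). For example, if your skip list were 4, 7, 9, that might mean that the 4th, 7th, and 9th hits are insignificant, whereas the other hits are legitimate. A "skip list" of this sort could be used in at least two ways:
1. In your hit navigation code, skip over the insignificant hits with a check such as skipList.contains(i).
2. [...] any whitespace appearing between [...]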
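The skipList.contains(i) check during hit navigation could look roughly like the sketch below. HitNavigation and NextSignificantHit are hypothetical names, not dtSearch APIs, and the sketch assumes 0-based hit numbers, which is the numbering used in the hit-order comments of the unit tests in this answer.

```csharp
using System;
using System.Collections.Generic;

public static class HitNavigation
{
    // Returns the number of the next hit after currentHit that is not in the
    // skip list, or -1 if no significant hit remains.
    // totalWordHits counts hits word by word, the way dtSearch highlights them.
    public static int NextSignificantHit(int currentHit, int totalWordHits, ICollection<int> skipList)
    {
        for (int i = currentHit + 1; i < totalWordHits; i++)
        {
            if (!skipList.Contains(i))
                return i;
        }
        return -1;
    }
}
```

Passing -1 as currentHit finds the first significant hit. A HashSet<int> built from the skip list keeps each Contains check cheap when a document has many hits.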
Assuming that these "skip lists" are indeed useful, how would you generate them? Here is some code that mostly works:
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;
    using NUnit.Framework;

    public class DtSearchUtil
    {
        /// <summary>
        /// Makes a "skip list" for the dtSearch result document with the specified
        /// WordArray data. The skip list indicates which hits in the dtSearch markup
        /// should be skipped during hit navigation. The reason to skip some hits
        /// is to allow navigation to be phrase aware, rather than forcing the user
        /// to visit each word in the phrase as if it were an independent hit.
        /// The skip list consists of 0-indexed hit numbers. 1, for example, would
        /// mean that the second hit should be skipped during hit navigation.
        /// </summary>
        /// <param name="dtsHitsByWordArray">dtSearch HitsByWord data. You'll get this from SearchResultItem.HitsByWord
        /// if you did your search with the dtsSearchWantHitsByWord and dtsSearchWantHitsArray
        /// SearchFlags.</param>
        /// <param name="userHitCount">How many total hits there are, if phrases are counted
        /// as one hit each.</param>
        /// <returns>The 0-indexed numbers of the hits to skip, in ascending order.</returns>
        public static List<int> MakeHitSkipList(string[] dtsHitsByWordArray, out int userHitCount)
        {
            List<int> skipList = new List<int>();
            userHitCount = 0;

            int curHitNum = 0; // like the dtSearch doc-level highlights, this counts hits word-by-word, rather than phrase by phrase
            List<PhraseRecord> hitRecords = new List<PhraseRecord>();
            foreach (string dtsHitsByWordString in dtsHitsByWordArray)
            {
                hitRecords.Add(PhraseRecord.ParseHitsByWordString(dtsHitsByWordString));
            }

            int prevEndOffset = -1;
            while (true)
            {
                // find the smallest offset that has not been consumed yet
                int nextOffset = int.MaxValue;
                foreach (PhraseRecord rec in hitRecords)
                {
                    if (rec.CurOffset >= rec.OffsetList.Count)
                        continue;
                    nextOffset = Math.Min(nextOffset, rec.OffsetList[rec.CurOffset]);
                }
                if (nextOffset == int.MaxValue)
                    break;
                userHitCount++;

                // of the records that match at this offset, prefer the longest phrase
                PhraseRecord longestMatch = null;
                for (int i = 0; i < hitRecords.Count; i++)
                {
                    PhraseRecord rec = hitRecords[i];
                    if (rec.CurOffset >= rec.OffsetList.Count)
                        continue;
                    if (nextOffset == rec.OffsetList[rec.CurOffset])
                    {
                        if (longestMatch == null ||
                            longestMatch.LengthInWords < rec.LengthInWords)
                        {
                            longestMatch = rec;
                        }
                    }
                }

                // skip subsequent words in the phrase
                for (int i = 1; i < longestMatch.LengthInWords; i++)
                {
                    skipList.Add(curHitNum + i);
                }
                prevEndOffset = longestMatch.OffsetList[longestMatch.CurOffset] +
                    (longestMatch.LengthInWords - 1);
                longestMatch.CurOffset++;
                curHitNum += longestMatch.LengthInWords;

                // skip over any unneeded, overlapping matches (i.e. those starting
                // at or before the last word of the match we just consumed)
                for (int i = 0; i < hitRecords.Count; i++)
                {
                    while (hitRecords[i].CurOffset < hitRecords[i].OffsetList.Count &&
                        hitRecords[i].OffsetList[hitRecords[i].CurOffset] <= prevEndOffset)
                    {
                        hitRecords[i].CurOffset++;
                    }
                }
            }
            return skipList;
        }

        // Parsed form of the phrase-aware hit offset stuff that dtSearch can give you
        private class PhraseRecord
        {
            public string PhraseText;

            /// <summary>
            /// Offsets into the source text at which this phrase matches. For example,
            /// offset 300 would mean that one of the places the phrase matches is
            /// starting at the 300th word in the document. (Words are counted according
            /// to dtSearch's internal word breaking algorithm.)
            /// See also:
            /// http://support.dtsearch.com/webhelp/dtSearchNetApi2/frames.html?frmname=topic&frmfile=dtSearch__Engine__SearchFlags.html
            /// </summary>
            public List<int> OffsetList;

            // BUG: We calculate this with a whitespace tokenizer. This will probably
            // cause bad results in some places. (Better to figure out how to count
            // the way dtSearch would.)
            public int LengthInWords
            {
                get
                {
                    return Regex.Matches(PhraseText, @"[^\s]+").Count;
                }
            }

            // Index into OffsetList of the next offset that has not been consumed
            public int CurOffset = 0;

            public static PhraseRecord ParseHitsByWordString(string dtsHitsByWordString)
            {
                // expected shape: "<phrase>, <count> (<offset> <offset> ... )", optionally followed by more text
                Match m = Regex.Match(dtsHitsByWordString, @"^([^,]*),\s*\d*\s*\(([^)]*)\).*");
                if (!m.Success)
                    throw new ArgumentException("Bad dtsHitsByWordString. Did you forget to set dtsSearchWantHitsByWord in dtSearch?");

                string phraseText = m.Groups[1].Value;
                string parenStuff = m.Groups[2].Value;

                PhraseRecord hitRecord = new PhraseRecord();
                hitRecord.PhraseText = phraseText;
                hitRecord.OffsetList = GetMatchOffsetsFromParenGroupString(parenStuff);
                return hitRecord;
            }

            static List<int> GetMatchOffsetsFromParenGroupString(string parenGroupString)
            {
                List<int> res = new List<int>();
                MatchCollection matchCollection = Regex.Matches(parenGroupString, @"\d+");
                foreach (Match match in matchCollection)
                {
                    string digitString = match.Groups[0].Value;
                    res.Add(int.Parse(digitString));
                }
                return res;
            }
        }
    }

    [TestFixture]
    public class DtSearchUtilTests
    {
        [Test]
        public void TestMultiPhrasesWithoutFieldName()
        {
            string[] foo = { @"apple pie, 7 (482 499 552 578 589 683 706 );",
                             @"bana*, 4 (490 505 689 713 )"
                           };

            // expected dtSearch hit order:
            // 0: apple@482
            // 1: pie@483 [should skip]
            // 2: banana-something@490
            // 3: apple@499
            // 4: pie@500 [should skip]
            // 5: banana-something@505
            // 6: apple@552
            // 7: pie@553 [should skip]
            // 8: apple@578
            // 9: pie@579 [should skip]
            // 10: apple@589
            // 11: pie@590 [should skip]
            // 12: apple@683
            // 13: pie@684 [skip]
            // 14: banana-something@689
            // 15: apple@706
            // 16: pie@707 [skip]
            // 17: banana-something@713

            int userHitCount;
            List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

            Assert.AreEqual(11, userHitCount);
            Assert.AreEqual(1, skipList[0]);
            Assert.AreEqual(4, skipList[1]);
            Assert.AreEqual(7, skipList[2]);
            Assert.AreEqual(9, skipList[3]);
            Assert.AreEqual(11, skipList[4]);
            Assert.AreEqual(13, skipList[5]);
            Assert.AreEqual(16, skipList[6]);
            Assert.AreEqual(7, skipList.Count);
        }

        [Test]
        public void TestPhraseOverlap1()
        {
            string[] foo = { @"apple pie, 7 (482 499 552 );",
                             @"apple, 4 (482 490 499 552)"
                           };

            // expected dtSearch hit order:
            // 0: apple@482
            // 1: pie@483 [should skip]
            // 2: apple@490
            // 3: apple@499
            // 4: pie@500 [should skip]
            // 5: apple@552
            // 6: pie@553 [should skip]

            int userHitCount;
            List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

            Assert.AreEqual(4, userHitCount);
            Assert.AreEqual(1, skipList[0]);
            Assert.AreEqual(4, skipList[1]);
            Assert.AreEqual(6, skipList[2]);
            Assert.AreEqual(3, skipList.Count);
        }

        [Test]
        public void TestPhraseOverlap2()
        {
            string[] foo = { @"apple pie, 7 (482 499 552 );",
                             @"pie, 4 (483 490 500 553)"
                           };

            // expected dtSearch hit order:
            // 0: apple@482
            // 1: pie@483 [should skip]
            // 2: pie@490
            // 3: apple@499
            // 4: pie@500 [should skip]
            // 5: apple@552
            // 6: pie@553 [should skip]

            int userHitCount;
            List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

            Assert.AreEqual(4, userHitCount);
            Assert.AreEqual(1, skipList[0]);
            Assert.AreEqual(4, skipList[1]);
            Assert.AreEqual(6, skipList[2]);
            Assert.AreEqual(3, skipList.Count);
        }

        // TODO: test "apple pie" and "apple", plus "apple pie" and "pie"

        // "subject" should not freak it out
        [Test]
        public void TestSinglePhraseWithFieldName()
        {
            string[] foo = { @"apple pie, 7 (482 499 552 578 589 683 706 ), subject" };

            int userHitCount;
            List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

            Assert.AreEqual(7, userHitCount);
            Assert.AreEqual(7, skipList.Count);
            Assert.AreEqual(1, skipList[0]);
            Assert.AreEqual(3, skipList[1]);
            Assert.AreEqual(5, skipList[2]);
            Assert.AreEqual(7, skipList[3]);
            Assert.AreEqual(9, skipList[4]);
            Assert.AreEqual(11, skipList[5]);
            Assert.AreEqual(13, skipList[6]);
        }
    }
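
With a skip list in hand, one possible way to get closer to phrase-level highlighting is to post-process the highlighted output and merge a skipped hit into the hit before it. The sketch below is an illustration only, not something dtSearch provides: HighlightMerger and MergePhraseHighlights are hypothetical names, the markup is assumed to match the BeforeHit/AfterHit strings from the question, and the value that %%ThisHit%% starts counting from is passed in as firstHitNumber because I have not verified it. It merges only when nothing but whitespace sits between the two highlights.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HighlightMerger
{
    // End of one highlight, optional whitespace, then the start of the next one.
    // Assumes BeforeHit = <a name="HH_%%ThisHit%%"/><b> and AfterHit = </b>.
    private static readonly Regex Seam =
        new Regex("</b>(\\s*)<a name=\"HH_(\\d+)\"/><b>");

    // skipList holds 0-based hit numbers; firstHitNumber is the number that the
    // first HH_ anchor carries in the markup (an assumption to check).
    public static string MergePhraseHighlights(string html, ICollection<int> skipList, int firstHitNumber)
    {
        return Seam.Replace(html, m =>
        {
            int hitIndex = int.Parse(m.Groups[2].Value) - firstHitNumber;
            // For a skipped hit, drop the seam and keep only the whitespace,
            // so the previous <b> runs on through this word.
            return skipList.Contains(hitIndex) ? m.Groups[1].Value : m.Value;
        });
    }
}
```

Because the anchors of skipped hits are removed, hit navigation that jumps between HH_ anchors would then land only on phrase starts, although the remaining anchor numbers are no longer consecutive.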
https://stackoverflow.com/questions/2716474