如何过滤包含字符串和子字符串的列表,以便只返回最长的字符串。(如果列表中的任何项是另一项的子字符串,则仅返回较长的字符串。)
我有这个功能。有没有更快的方法?
def filterSublist(lst):
uniq = lst
for elem in lst:
uniq = [x for x in uniq if (x == elem) or (x not in elem)]
return uniq
lst = ["a", "abc", "b", "d", "xy", "xyz"]
print filterSublist(lst)
> ['abc', 'd', 'xyz']
> Function time: 0.000011
发布于 2014-05-26 18:49:13
一个简单的二次时间解是这样的:
res = []
n = len(lst)
for i in xrange(n):
if not any(i != j and lst[i] in lst[j] for j in xrange(n)):
res.append(lst[i])
但我们可以做得更好:
假设$
是一个不会出现在任何字符串中的字符,并且其值小于所有实际字符。
假设S
是所有字符串的串联,$
在其间。在您的示例中,S = a$abc$b$d$xy$xyz
。
可以在线性时间内构建S
的suffix array。您还可以使用一个简单得多的O(nlog^2n)构造算法,正如我在in another answer中描述的那样。
现在,对于lst
中的每个字符串,检查它是否只在后缀数组中出现一次。您可以执行两个二进制搜索来查找子字符串的位置,它们在后缀数组中形成一个连续的范围。如果字符串多次出现,则将其删除。
通过预先计算LCP信息,这也可以在线性时间内完成。
O(nlog^2n)实现示例,改编自my suffix array answer
def findFirst(lo, hi, pred):
""" Find the first i in range(lo, hi) with pred(i) == True.
Requires pred to be a monotone. If there is no such i, return hi. """
while lo < hi:
mid = (lo + hi) // 2
if pred(mid): hi = mid;
else: lo = mid + 1
return lo
# uses the algorithm described in https://stackoverflow.com/a/21342145/916657
class SuffixArray(object):
def __init__(self, s):
""" build the suffix array of s in O(n log^2 n) where n = len(s). """
n = len(s)
log2 = 0
while (1<<log2) < n:
log2 += 1
rank = [[0]*n for _ in xrange(log2)]
for i in xrange(n):
rank[0][i] = s[i]
L = [0]*n
for step in xrange(1, log2):
length = 1 << step
for i in xrange(n):
L[i] = (rank[step - 1][i],
rank[step - 1][i + length // 2] if i + length // 2 < n else -1,
i)
L.sort()
for i in xrange(n):
rank[step][L[i][2]] = \
rank[step][L[i - 1][2]] if i > 0 and L[i][:2] == L[i-1][:2] else i
self.log2 = log2
self.rank = rank
self.sa = [l[2] for l in L]
self.s = s
self.rev = [0]*n
for i, j in enumerate(self.sa):
self.rev[j] = i
def lcp(self, x, y):
""" compute the longest common prefix of s[x:] and s[y:] in O(log n). """
n = len(self.s)
if x == y:
return n - x
ret = 0
for k in xrange(self.log2 - 1, -1, -1):
if x >= n or y >= n:
break
if self.rank[k][x] == self.rank[k][y]:
x += 1<<k
y += 1<<k
ret += 1<<k
return ret
def compareSubstrings(self, x, lx, y, ly):
""" compare substrings s[x:x+lx] and s[y:y+yl] in O(log n). """
l = min((self.lcp(x, y), lx, ly))
if l == lx == ly: return 0
if l == lx: return -1
if l == ly: return 1
return cmp(self.s[x + l], self.s[y + l])
def count(self, x, l):
""" count occurences of substring s[x:x+l] in O(log n). """
n = len(self.s)
cs = self.compareSubstrings
lo = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) >= 0)
hi = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) > 0)
return hi - lo
def debug(self):
""" print the suffix array for debugging purposes. """
for i, j in enumerate(self.sa):
print str(i).ljust(4), self.s[j:], self.lcp(self.sa[i], self.sa[i-1]) if i >0 else "n/a"
def filterSublist(lst):
splitter = "\x00"
s = splitter.join(lst) + splitter
sa = SuffixArray(s)
res = []
offset = 0
for x in lst:
if sa.count(offset, len(x)) == 1:
res.append(x)
offset += len(x) + 1
return res
但是,解释开销可能会导致这种方法比O(n^2)方法更慢,除非S
非常大(大约10^5个字符或更多)。
发布于 2014-05-26 19:28:51
顺序重要吗?如果不是,
a = ["a", "abc", "b", "d", "xy", "xyz"]
a.sort(key=len, reverse=True)
n = len(a)
for i in range(n - 1):
if a[i]:
for j in range(i + 1, n):
if a[j] in a[i]:
a[j] = ''
print filter(len, a) # ['abc', 'xyz', 'd']
效率不是很高,但很简单。
https://stackoverflow.com/questions/23868093
复制相似问题