Is there a generator version of `string.split()` in Python?

Stack Overflow user
Asked on 2010-10-05 16:31:01
11 answers, 32.4K views, 0 followers, 134 votes

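string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?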


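11 Answers

Stack Overflow user

Posted on 2012-03-19 23:38:09

The most efficient way I can think of is to write one using the offset parameter of the str.find() method. This avoids heavy memory use, and it avoids the overhead of a regexp when one isn't needed.

Edit 2016-8-2: updated to optionally support regex separators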

import re


def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        if not source:
            # str.split() returns [] for empty or all-whitespace input
            return
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
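        sepsize = len(sep)
        if sepsize == 0:
            # str.split() rejects this too; without the check the loop never advances
            raise ValueError("empty separator")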
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

You can use it however you like:

>>> print(list(isplit("abcb", "b")))
['a', 'c', '']

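There is a small cost for seeking within the string each time find() or a slice is performed, but it should be minimal because strings are stored as contiguous arrays in memory.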
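The memory benefit shows up when only part of the result is consumed. The following is a minimal sketch, not the answer's own code: `isplit_find` is a hypothetical, condensed version of the str.find() branch above, and `itertools.islice` is used to stop after the first three fields so the rest of the string is never scanned.

```python
import itertools


def isplit_find(source, sep):
    """Hypothetical helper: lazily yield the pieces of `source` split on `sep`."""
    if not sep:
        raise ValueError("empty separator")
    start = 0
    while True:
        idx = source.find(sep, start)
        if idx == -1:
            # No further separator: the remainder is the last piece.
            yield source[start:]
            return
        yield source[start:idx]
        start = idx + len(sep)


# A large comma-separated string; only the first three fields are needed.
big = ",".join(str(n) for n in range(100000))
first_three = list(itertools.islice(isplit_find(big, ","), 3))
print(first_three)  # ['0', '1', '2']
```

With `big.split(",")` the whole list of 100000 strings would be built before the first three could be taken.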

Votes: 18

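Stack Overflow user

Posted on 2012-10-07 06:34:21

Here is my implementation, which is much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.

I'll just copy the docstring of the main str_split function: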

str_split(s, *delims, empty=None)

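Split the string `s` by the rest of the arguments, possibly omitting empty parts (the `empty` keyword argument is responsible for that). This is a generator function.

When only one delimiter is supplied, the string is simply split by it. `empty` is then `True` by default.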

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

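When multiple delimiters are supplied, the string is split by the longest possible sequences of those delimiters by default, or, if `empty` is set to `True`, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.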

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''
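For comparison, the default multi-delimiter behaviour (runs of delimiter characters collapse and empty parts are dropped) can also be sketched with a negated character class and `re.finditer`. This is an alternative technique, not what `str_split` does internally, and `split_on_chars` is a hypothetical name used only for illustration.

```python
import re


def split_on_chars(s, chars):
    """Hypothetical helper: lazily yield the runs of `s` that contain none of `chars`."""
    # Escape each character so that ']', '^', '-' or '\\' stay literal inside the class.
    pattern = "[^" + "".join(re.escape(c) for c in chars) + "]+"
    return (m.group(0) for m in re.finditer(pattern, s))


print(list(split_on_chars("aaa, bb : c;", " ,:;")))  # ['aaa', 'bb', 'c']
```

This sketch assumes `chars` is non-empty, and it cannot reproduce the `empty=True` variant because a `+` match never yields an empty string.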

When no delimiters are supplied, `string.whitespace` is used, so the effect is the same as `str.split()`, except that this function is a generator.

str_split('aaa\t  bb c \n')
    -> 'aaa', 'bb', 'c'

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
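    # Validate before collapsing the delimiters into a set or a string;
    # afterwards every element is a single character and the check cannot fail.
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    delims = set(delims) if len(delims)>=4 else ''.join(delims)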
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

This function works in Python 3, and a simple, though rather ugly, fix makes it work in both Python 2 and 3. The first lines of the function should be changed to:

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')
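One side effect of the `**kwargs` form is that a misspelled keyword such as `emtpy=True` is silently ignored, whereas the keyword-only signature would raise a `TypeError`. A minimal sketch of how that check could be restored, assuming a stand-in function named `str_split_compat` (hypothetical, showing only the argument handling and not the splitting):

```python
def str_split_compat(s, *delims, **kwargs):
    """Hypothetical stand-in showing only the keyword handling, not the splitting."""
    # pop() removes the recognised option so that leftovers can be detected.
    empty = kwargs.pop('empty', None)
    if kwargs:
        # Mirror what a real keyword-only signature would do on a typo.
        raise TypeError("unexpected keyword argument(s): " + ", ".join(sorted(kwargs)))
    return s, delims, empty


print(str_split_compat('a b', ' ', empty=False))  # ('a b', (' ',), False)
```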
Votes: 6

Stack Overflow user

Posted on 2015-04-17 19:43:02

I wrote a version of @ninjagecko's answer that behaves more like string.split (i.e. it splits on whitespace by default, and you can specify a delimiter).

import re


def isplit(string, delimiter=None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                        delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

Here are the tests I used (in both Python 3 and Python 2):

# Wrapper to make it a list
def helper(*args,  **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2  3  ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass
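One case the tests above do not exercise is consecutive or trailing delimiters. Because the regex matches runs of non-delimiter characters, empty fields are dropped, which differs from `str.split(",")`. A small sketch of the difference, using a hypothetical condensed helper `isplit_runs` built on the same pattern:

```python
import re


def isplit_runs(string, delimiter):
    """Hypothetical condensed form of the regex approach, for a single-character delimiter."""
    pattern = r"[^{}]+".format(re.escape(delimiter))
    return (m.group(0) for m in re.finditer(pattern, string))


text = "1,,2,"
lazy = list(isplit_runs(text, ","))
eager = text.split(",")
print(lazy)   # ['1', '2']
print(eager)  # ['1', '', '2', '']
```

If empty fields matter (CSV-like data, for example), a find()-based generator that yields the text between separators is the closer match to `str.split(sep)`.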

Python's regex module (`re`) says that it does "the right thing" for Unicode whitespace, but I haven't actually tested it.
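A quick way to check this, assuming Python 3, where `\s` and `\S` in str patterns follow Unicode by default, is to compare a regex-based split against `str.split()` on a string containing non-ASCII whitespace. `isplit_ws` is a hypothetical helper for this check only; `\S+` is equivalent to the `[^\s]+` pattern used above.

```python
import re


def isplit_ws(string):
    """Hypothetical helper: lazily yield whitespace-separated words."""
    return (m.group(0) for m in re.finditer(r"\S+", string))


# U+00A0 is a no-break space and U+2003 is an em space.
sample = u"1\u00a02\u20033"
words = list(isplit_ws(sample))
print(words == sample.split())  # True on Python 3
```

On Python 2 the result depends on whether the `re.UNICODE` flag is passed, so this check says nothing about that version.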

Also available as a gist.

Votes: 3
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/3862010
