Q: Extracting hashtags from Unicode (foreign-language) paragraphs in Python

Stack Overflow user

Asked 2013-08-15 10:13:30

Answers: 1  Views: 892  Followers: 0  Votes: 2

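I am trying to write a function that extracts hashtags from paragraphs, basically words that start with # (#cool #life #cars #سيارات).

I have tried several approaches, such as split() and regular expressions, but I have not managed to make them include Unicode characters such as Arabic, Russian, etc.

I tried split(), which works fine, but it includes every word. In my case I cannot include words that contain special characters such as ,.%$]{}{)(.. and I also want to add some validation, for example that a word is not more than 15 characters long.

I tried this approach: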

def _strip_hash_tags(self, ):
    """tags should not be more than 15 characters"""
    hash_tags = re.compile(r'(?i)(?<=\#)\w+')
    return [i for i in hash_tags.findall(self.content) if len(i) < 15]

This works only for English, not for foreign languages. Any suggestions?

python

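1 Answer

Stack Overflow user

Accepted answer

Posted on 2013-08-15 10:17:48

As discussed here: python regular expression with utf8 issue.

First, you should use re.compile(ur'<unicode string>'). Adding the re.UNICODE flag is also a good idea (though I am not sure it is really needed here).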
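To illustrate what Unicode-aware matching changes, here is a minimal sketch, assuming Python 3, where the ur'' prefix is no longer valid syntax and str patterns are Unicode-aware by default. The re.ASCII flag is used only to mimic the ASCII-only behaviour described in the question; the variable names are illustrative and not part of the original answer.

```python
import re

text = "words that start with # (#cool #life #cars #سيارات)"

# re.ASCII restricts \w to [a-zA-Z0-9_], so the Arabic tag is skipped.
ascii_tags = re.findall(r"(?<=#)\w+", text, flags=re.ASCII)

# Without re.ASCII, \w in a str pattern matches word characters in any script.
unicode_tags = re.findall(r"(?<=#)\w+", text)

print(ascii_tags)    # three English tags only
print(unicode_tags)  # the same three tags plus the Arabic one
```

In both cases the lone "#" followed by a space produces no match, because \w+ needs at least one word character after it.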

# -*- coding: utf-8 -*-
# Python 2 code: the ur'' prefix and the u'...' reprs below are Python 2 only.
from __future__ import unicode_literals
import re


def strip_hash_tags(content):
    """tags should not be more than 15 characters"""
    # re.UNICODE makes \w match word characters in any script, not just ASCII
    hash_tags = re.compile(ur'(?i)(?<=\#)\w+', re.UNICODE)
    return [i for i in hash_tags.findall(content) if len(i) < 15]

# named text rather than str, to avoid shadowing the built-in
text = u"am trying to work on a function to extract hashtags from paragraphs, basically words that starts with # (#cool #life #cars #سيارات)"

print(strip_hash_tags(text))

# [u'cool', u'life', u'cars', u'\xd8\xb3\xd9']
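
For readers on Python 3, a rough equivalent of the function above could look like the sketch below. The names extract_hashtags, HASHTAG_RE and MAX_TAG_LENGTH are hypothetical, and the sketch assumes that "not more than 15 characters" means a tag of exactly 15 characters is still allowed, hence <= rather than <.

```python
import re

# Assumption: a hashtag is '#' followed by one or more word characters (\w).
# In Python 3 str patterns, \w covers letters, digits and underscore in any script.
HASHTAG_RE = re.compile(r"(?<=#)\w+")

MAX_TAG_LENGTH = 15  # limit taken from the docstring in the question


def extract_hashtags(content, max_length=MAX_TAG_LENGTH):
    """Return hashtags (without the leading '#') of at most max_length characters."""
    return [tag for tag in HASHTAG_RE.findall(content) if len(tag) <= max_length]


sample = "#cool, #life. #cars% #سيارات #привет #averyveryverylonghashtag"
print(extract_hashtags(sample))
```

Punctuation such as the comma, full stop and percent sign ends a tag because none of them is a word character, and the final 24-character tag is dropped by the length check.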

Votes: 3

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/18250593
