Q: Replacing custom "HTML" tags in a Python string

Stack Overflow user

Asked 2019-03-05 06:21:34

2 answers · 598 views · 0 followers · 0 votes

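I want to be able to include a custom "HTML" tag in a string, for example: "This is a <photo id="4" /> string".

In this case the custom tag is <photo id="4" />. I am also open to writing the tag differently if that makes things easier, e.g. [photo id:4] or something similar.

I want to pass this string to a function that extracts the tag <photo id="4" /> and lets me turn it into a more complex template such as <div class="photo"><img src="...." alt="..."></div>, which I can then use to replace the tag in the original string.

I imagine it working something like this: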

>>> content = 'This is a <photo id="4" /> string'
# Pass the string to a function that returns all the tags with the given name.
>>> tags = parse_tags('photo', content)
>>> print(tags)
[{'tag': 'photo', 'id': 4, 'raw': '<photo id="4" />'}]
# Now I know I need to render a photo with ID 4, so I can pass that to some sort of template function
>>> rendered = render_photo(id=tags[0]['id'])
>>> print(rendered)
<div class="photo"><img src="...." alt="..."></div>
>>> content = content.replace(tags[0]['raw'], rendered)
>>> print(content)
This is a <div class="photo"><img src="...." alt="..."></div> string

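I think this is a fairly common pattern, for example when putting a photo in a blog post, so I am wondering whether there is a library that does something like the parse_tags function in the example above. Or do I need to write it myself?

The photo tag is just one example. I would like to support tags with different names. As another example, maybe I have a database of people and want a tag like <person name="John Doe" />. In that case the output I want is something like {'tag': 'person', 'name': 'John Doe', 'raw': '<person name="John Doe" />'}. I can then use the name to look up the person and return a rendered template of that person's vCard or similar.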

python

html

parsing

2 Answers

Stack Overflow user

Posted 2019-03-05 07:16:18

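If you are working with HTML5, I suggest looking into the xml module (etree). It lets you parse the whole document into a tree structure and manipulate the tags individually (and then convert the result back into an HTML document).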
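As a rough illustration of the tree-based approach, here is a minimal sketch using the standard library's xml.etree.ElementTree. It assumes the fragment is well-formed XML (an unclosed <br> or an entity such as &nbsp; would raise a ParseError), and it wraps the fragment in a made-up <root> element because the parser needs a single root. The names render_photo and replace_photos and the /photos/ path are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET

def render_photo(photo_id):
    # Hypothetical renderer: the /photos/ path and alt text are placeholders.
    div = ET.Element("div", {"class": "photo"})
    ET.SubElement(div, "img", {"src": f"/photos/{photo_id}.jpg",
                               "alt": f"photo {photo_id}"})
    return div

def replace_photos(fragment):
    # ElementTree needs a single root element, so wrap the fragment.
    root = ET.fromstring(f"<root>{fragment}</root>")
    # Snapshot the elements first so replacing children does not disturb iteration.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "photo":
                replacement = render_photo(child.get("id"))
                replacement.tail = child.tail  # keep the text that followed the tag
                parent[index] = replacement
    # method="html" writes <img ...> without a closing tag.
    html = ET.tostring(root, encoding="unicode", method="html")
    return html[len("<root>"):-len("</root>")]

print(replace_photos('This is a <photo id="4" /> string'))
```

Copying the tail matters because ElementTree stores the text that follows an element on that element, so dropping it would lose " string" in this example.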

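You can also use regular expressions to do the text substitution. If you do not have many changes to make, this is probably faster than loading an XML tree structure.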

    import re
    text = """<html><body>some text <photo> and tags <photo id="4"> more text <person name="John Doe"> yet more text"""
    tags = ["photo","person","abc"]
    patterns = "|".join([ f"(<{tag} .*?>)|(<{tag}>)" for tag in tags ])
    matches = list(re.finditer(patterns,text))
    for match in reversed(matches):
        tag = text[match.start():match.end()]
        print(match.start(),match.end(),tag)
        # substitute what you need for that tag
        text = text[:match.start()] + "***" + text[match.end():]
    print(text)

This prints:

    64 88 <person name="John Doe">
    39 53 <photo id="4">
    22 29 <photo>
    <html><body>some text *** and tags *** more text *** yet more text

Performing the replacements in reverse order ensures that the ranges found by finditer() remain valid as the text changes with each substitution.
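Building on the same regex idea, the sketch below shows one way to get the dictionaries the question asks for, and to do the replacement with re.sub and a callback so that no index bookkeeping is needed. It is only a sketch under some assumptions: attribute values are double-quoted and contain no escaped quotes or angle brackets, and the names tag_pattern, parse_tags, replace_tags and render_photo (and the /photos/ path) are hypothetical. Note that attribute values come back as strings ('4'), not integers.

```python
import re

# Matches name="value" pairs; values with escaped quotes are not handled.
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

def tag_pattern(name):
    # <name ...> or <name ... />; \b keeps "photo" from matching "<photos>".
    return re.compile(rf'<{re.escape(name)}\b([^<>]*?)/?>')

def parse_tags(name, text):
    """Return one dict per tag: its attributes plus 'tag' and 'raw'."""
    return [
        {**dict(ATTR_RE.findall(m.group(1))), "tag": name, "raw": m.group(0)}
        for m in tag_pattern(name).finditer(text)
    ]

def replace_tags(name, render, text):
    """Replace each tag with the string returned by render(**attributes)."""
    def substitute(m):
        return render(**dict(ATTR_RE.findall(m.group(1))))
    return tag_pattern(name).sub(substitute, text)

def render_photo(id):
    # Hypothetical template; the path is a placeholder.
    return f'<div class="photo"><img src="/photos/{id}.jpg" alt="photo {id}"></div>'

content = 'This is a <photo id="4" /> string'
print(parse_tags("photo", content))
print(replace_tags("photo", render_photo, content))
```

Because re.sub calls the function once per match and assembles the result itself, the order of replacement no longer matters. A renderer that must accept attribute names that are not valid Python identifiers (for example data-id) would need to take **kwargs.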

0 votes

Stack Overflow user

Posted 2019-03-05 11:12:57

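For this kind of "surgical" parsing (where you want to isolate specific tags rather than build a complete hierarchical document), pyparsing's makeHTMLTags method is very useful.

See the annotated script below, which shows how the parser is created and then used in the parseTag and replaceTag functions: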

import pyparsing as pp

def make_tag_parser(tag):
    # makeHTMLTags returns 2 parsers, one for the opening tag and one for the
    # closing tag - we only need the opening tag; the parser will return parsed
    # fields of the tag itself
    tag_parser = pp.makeHTMLTags(tag)[0]

    # instead of returning parsed bits of the tag, use originalTextFor to
    # return the raw tag as token[0] (specifying asString=False will retain
    # the parsed attributes and tag name as attributes)
    parser = pp.originalTextFor(tag_parser, asString=False)

    # add one more callback to define the 'raw' attribute, copied from t[0]
    def add_raw_attr(t):
        t['raw'] = t[0]
    parser.addParseAction(add_raw_attr)

    return parser

# parseTag to find all the matches and report their attributes
def parseTag(tag, s):
    return make_tag_parser(tag).searchString(s)


content = """This is a <photo id="4" /> string"""

tag_matches = parseTag("photo", content)
for match in tag_matches:
    print(match.dump())
    print("raw: {!r}".format(match.raw))
    print("tag: {!r}".format(match.tag))
    print("id:  {!r}".format(match.id))


# transform tag to perform tag->div transforms
def replaceTag(tag, transform, s):
    parser = make_tag_parser(tag)

    # add one more parse action to do transform
    parser.addParseAction(lambda t: transform.format(**t))
    return parser.transformString(s)

print(replaceTag("photo", 
                   '<div class="{tag}"><img src="<src_path>/img_{id}.jpg." alt="{tag}_{id}"></div>', 
                   content))

Prints:

['<photo id="4" />']
- empty: True
- id: '4'
- raw: '<photo id="4" />'
- startPhoto: ['photo', ['id', '4'], True]
  [0]:
    photo
  [1]:
    ['id', '4']
  [2]:
    True
- tag: 'photo'
raw: '<photo id="4" />'
tag: 'photo'
id:  '4'
This is a <div class="photo"><img src="<src_path>/img_4.jpg." alt="photo_4"></div> string
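The transform step in replaceTag comes down to str.format filling named fields from the parsed attributes. The standard-library sketch below isolates that step, with a plain dict standing in for the pyparsing result. It also escapes the values with html.escape before substitution, which the script above does not do; that is an added precaution for attribute values that come from untrusted text. The names render and PHOTO_TEMPLATE and the /img/ path are hypothetical.

```python
from html import escape

# Hypothetical template; {tag} and {id} are filled from the parsed attributes.
PHOTO_TEMPLATE = '<div class="{tag}"><img src="/img/img_{id}.jpg" alt="{tag}_{id}"></div>'

def render(template, attrs):
    # Escape every value so quotes or angle brackets cannot break out of the markup.
    safe = {key: escape(str(value), quote=True) for key, value in attrs.items()}
    # Keys the template does not mention (such as 'raw') are simply ignored.
    return template.format(**safe)

print(render(PHOTO_TEMPLATE, {"tag": "photo", "id": "4", "raw": '<photo id="4" />'}))
```

Keeping the template as data rather than code makes it straightforward to hold one template per tag name, for example in a dict keyed by 'photo' and 'person'.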

0 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/54992490
