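I am trying to extract the author's name and the publication date from this HTML text.
Here is what I have so far: (authorName) =(".")
However, this only works for this specific case, and I am looking for a general approach. Can you give me some advice on how to handle this?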
omni_bizObjectId = "13560483";var omni_publicationDate = "2019-01-25T12:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = " SF ";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";
Posted on 2019-02-20 06:03:12
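You can use this regular expression to capture the author name in group 1:
authorName\s+=\s+"([^"]*)"
This regex matches the literal text authorName, then one or more whitespace characters, an equals sign, one or more whitespace characters, and an opening double quote ". It then captures everything up to the next double quote in group 1, which you can read with m.group(1).
The Python code below shows how to read the captured value from group 1: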
import re
s = 'teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";'
m = re.search(r'authorName\s+=\s+"([^"]*)"',s)
if m:
    print(m.group(1))
This prints only the author's name:
Heather Knight
Edit: Thanks to Onyambu for pointing out publicationDate.
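Since the question asks for a general approach, the pattern above can be parameterized by variable name. The following is only a minimal sketch: the helper name extract_var is hypothetical, and it assumes values are always double-quoted and contain no escaped quotes. It uses \s* rather than \s+ so that assignments written without a space after the equals sign (such as omni_sourceSite ="sfgate" in the sample) also match.

```python
import re

def extract_var(text, name):
    """Return the quoted value assigned to a variable ending in `name`, or None."""
    # re.escape guards against regex metacharacters in the name;
    # \s* (instead of \s+) also accepts assignments with no space around '='.
    pattern = re.escape(name) + r'\s*=\s*"([^"]*)"'
    m = re.search(pattern, text)
    return m.group(1) if m else None

sample = 'var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";'
print(extract_var(sample, "authorName"))  # Heather Knight
print(extract_var(sample, "sourceSite"))  # sfgate
print(extract_var(sample, "missing"))     # None
```

Returning None for a missing variable lets the caller decide how to handle pages that lack a field.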
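As with authorName, you can take the regex above, replace authorName with publicationDate, and use it to capture the publication date:
publicationDate\s+=\s+"([^"]*)"
If you want to extract both values with a single regular expression, you can use this one, which relies on publicationDate appearing before authorName in the text:
(?i).*publicationdate\s+=\s+"([^"]*)".*authorName\s+=\s+"([^"]*)"
Python code: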
import re
s = 'teacher a prime example of where SF should invest windfall";var omni_bizObjectId = "13560483";var omni_className = "article";var omni_publicationDate = "2019-01-25T12:00:00+00:00";var omni_sourceSite ="sfgate";var omni_authorName = "Heather Knight";var omni_authorTitle = "";var omni_premiumStatus = "isPremium";var omni_premiumEndDate = "1893506400";var omni_originalSource = "SF";var omni_pageNumber = "1";var omni_breakingNewsFlag = "0";var omni_localNewsFlag = "1";var omni_isListView = "0";var omni_paywallSite = "1";var omni_displayTemplate = "ard";'
m = re.search(r'(?i).*publicationdate\s+=\s+"([^"]*)".*authorName\s+=\s+"([^"]*)"',s)
if m:
    print('Publication Date:', m.group(1))
    print('Author Name:', m.group(2))
Prints:
Publication Date: 2019-01-25T12:00:00+00:00
Author Name: Heather Knight
https://stackoverflow.com/questions/54779598
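The single combined regex depends on the order in which the two variables appear. As an alternative that does not, one could collect every assignment into a dictionary and look up whichever fields are needed. This is a sketch rather than a definitive solution: the names OMNI_ASSIGNMENT and parse_omni_vars are hypothetical, the omni_ prefix is taken from the sample text, and values are assumed to be double-quoted with no escaped quotes inside.

```python
import re

# Matches omni_<name> = "<value>", with optional whitespace around '='.
OMNI_ASSIGNMENT = re.compile(r'\bomni_(\w+)\s*=\s*"([^"]*)"')

def parse_omni_vars(text):
    """Collect every omni_<name> = "<value>" pair into a dict keyed by <name>."""
    # findall returns a list of (name, value) tuples, which dict() accepts directly.
    return dict(OMNI_ASSIGNMENT.findall(text))

# authorName deliberately comes before publicationDate here.
sample = ('var omni_authorName = "Heather Knight";'
          'var omni_publicationDate = "2019-01-25T12:00:00+00:00";'
          'var omni_sourceSite ="sfgate";')
fields = parse_omni_vars(sample)
print('Publication Date:', fields.get("publicationDate"))
print('Author Name:', fields.get("authorName"))
```

Using fields.get(...) returns None for any variable the page does not define, so the same code can be run over pages with differing sets of fields.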