Q: How do I parse out exact data without including the surrounding text?

Stack Overflow user

Asked 2016-11-09 04:17:24

Answers: 2 | Views: 132 | Followers: 0 | Votes: 0

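My code is very close to working, but I need a little help.

I have 100 pages of data, but I am getting the parsing right on a single page before I apply it to the others. That page is an email, and I need to extract several things from it: date, sector, fish species, pounds, and money. So far I have successfully used regular expressions to recognize certain words and pull data from the line they appear on: for example, looking for "Sent", because I know the date always follows it, and looking for "Pounds" or "lbs", because the pounds figure always comes right before that.

The problem is that my code grabs the entire line the data is on, not just the numeric data. For example, I only want the numeric value for pounds, but I realize this will be hard because each of the 100 emails is worded differently. I am not even sure the code can be made foolproof, because I need the regex to recognize the text around the data without including that text in what I export. So should I just blindly grab the characters that follow certain recognized words?

Here is the piece of code I use to extract the pounds data: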

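import os
import re

# `path` is the directory holding the email files (defined elsewhere).
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            sector_result = []
            # Matches "Pounds " or " lbs", ignoring case.
            pattern = re.compile("Pounds | lbs", re.IGNORECASE)
            for linenum, line in enumerate(f):
                if pattern.search(line) is not None:
                    sector_result.append((linenum, line.rstrip('\n')))
                    # Re-prints every match collected so far, so earlier
                    # lines are printed again on each new match.
                    for linenum, line in sector_result:
                        print("Pounds:", line)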

Here is what it prints:

Pounds: -GOM Cod up to 5,000 lbs (live wt) @ 1.40 lbs
Pounds: -GOM Cod up to 5,000 lbs (live wt) @ 1.40 lbs
Pounds: -American Plaice 2,000 lbs      .60 lbs or best offer

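Ideally I only want the numeric value 5,000 exported, but I am not sure how I would go about grabbing that number.

Here is the original email text I need to parse: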

From: 
Sent: Friday, November 15, 2013 2:43pm
To: 

Subject: NEFS 11 fish for lease

Greetings,

NEFS 11 has the following fish for lease:

-GOM Cod up to 5,000 lbs (live wt) @ 1.40 lbs
-American Plaice 2,000 lbs      .60 lbs or best offer

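Here is another, separate email that needs to be parsed. This is why the code is hard to write: it has to cope with differently worded emails, since each one was written by a different person: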

From:
Sent: Monday, December 09, 2013 1:13pm
To:

Subject: NEFS 6 Stocks for lease October 28 2013

Hi All,

The following is available from NEFS VI:

4,000  lbs. GBE COD (live wt)

10,000 lbs. SNE Winter Flounder

10,000 lbs. SNE Yellowtail

10,000 lbs GB Winter Flounder

Will lease for cash or trade for GOM YT, GOM Cod, Dabs, Grey sole stocks on equitable basis.  

Please forward all offers.

Thank you,

Thanks for any and all help, as well as any criticism of how I asked the question.

python

regex

extract

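2 Answers

Stack Overflow user

Answered 2016-11-09 04:36:10

A regular expression can match the surrounding text without capturing it; this is done with a non-capturing group. For example: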

Pounds: -GOM Cod up to 5,000 lbs (live wt) @ 1.40 lbs

To match up to, the value you want, and (live wt), you could write a regex like this:

(?: up to).(\d+,\d+.lbs).(?:\(live wt\))

Essentially, (?:) is a group that is matched but not captured, so the regex only captures the group in the middle set of parentheses.
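To make the difference between capturing and non-capturing groups concrete, here is a minimal sketch that runs the pattern above through Python's `re` module. The variable names are only for illustration; `re.compile`, `search` and `groups()` are standard library calls.

```python
import re

line = "Pounds: -GOM Cod up to 5,000 lbs (live wt) @ 1.40 lbs"

# The two (?:...) groups must match, but they are not captured;
# only the middle (...) group ends up in the result.
pattern = re.compile(r"(?: up to).(\d+,\d+.lbs).(?:\(live wt\))")

match = pattern.search(line)
captured = match.groups() if match else ()
print(captured)  # ('5,000 lbs',)
```

Note that this capture still includes the unit, because lbs sits inside the capturing parentheses. Moving it outside them would leave only the number.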

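If you give me the exact surrounding text you want to match, I can be more specific.

Edit:

Having looked over your new examples, I can see that the only thing they all have in common is a number (in the thousands, so it contains a ,), followed by some whitespace, then lbs. So your regex would look like this: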

(?:(\d+,\d+)\s+lbs)

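This returns a match on the number itself. You can see an example of it in use here. Note that this regex excludes smaller values, because it ignores anything that is not in the thousands (i.e. any value that does not contain a ,).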
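As a quick check of that behavior, the sketch below applies the regex to a few lines taken from the two emails in the question. The list name `email_lines` and the conversion to `int` are assumptions added for illustration.

```python
import re

# Sample lines copied from the two emails in the question.
email_lines = [
    "-GOM Cod up to 5,000 lbs (live wt) @ 1.40 lbs",
    "-American Plaice 2,000 lbs      .60 lbs or best offer",
    "4,000  lbs. GBE COD (live wt)",
    "10,000 lbs GB Winter Flounder",
]

pattern = re.compile(r"(?:(\d+,\d+)\s+lbs)")

# With a single capturing group, findall returns just the captured numbers.
amounts = [pattern.findall(line) for line in email_lines]
print(amounts)  # [['5,000'], ['2,000'], ['4,000'], ['10,000']]

# Strip the thousands separator to get plain integers.
as_ints = [int(a.replace(",", "")) for found in amounts for a in found]
print(as_ints)  # [5000, 2000, 4000, 10000]
```

The prices (1.40 and .60) are skipped because they contain no comma, which is the exclusion described above.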

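Edit 2:

I also want to point out that this can be done entirely without regular expressions, using str.split(). Instead of trying to find a particular word pattern, you can rely on the fact that the number you want is the word just before lbs: if lbs is at position i, your number is at position i-1.

The only other thing to consider is how to handle several values on the same line. The two obvious choices are:

The biggest value.
The first value.

Here is how both cases would work with the original code: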

import os
import re

def max_pounds(line):
    # Maps each numeric value to its original string representation.
    pound_values = {}
    words = line.split()
    for i, word in enumerate(words):
        if word.lower() == 'lbs':
            # Convert the number into a float
            # and save the original string representation.
            pound_values[float(words[i-1].replace(',', ''))] = words[i-1]
    # Print the biggest number, if any 'lbs' token was found.
    if pound_values:
        print(pound_values[max(pound_values)])

def first_pounds(line):
    words = line.split()
    for i, word in enumerate(words):
        if word.lower() == 'lbs':
            # Print the number and exit.
            print(words[i-1])
            return

# `path` is the directory holding the email files (defined elsewhere).
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            sector_result = []
            pattern = re.compile("Pounds | lbs", re.IGNORECASE)
            for linenum, line in enumerate(f):
                if pattern.search(line) is not None:
                    sector_result.append((linenum, line.rstrip('\n')))
                    for linenum, line in sector_result:
                        print("Pounds:", line)
                        # Only one of the two functions is required.
                        max_pounds(line)
                        first_pounds(line)

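One thing to note is that the code does not handle the edge case where lbs is the first word on the line, but that is easily handled with a try/except.

If the value before lbs is not a number, neither the regex nor the split approach will work. If you run into that, I suggest searching your data for the offending emails and, if there are few enough of them, editing them by hand.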
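The edge cases mentioned above could be handled along the following lines. This is only a sketch: the function name `pounds_before_lbs` is hypothetical, and accepting "lbs." with a trailing period is an assumption based on the second email in the question. In Python, `words[i-1]` with `i == 0` wraps around to the last word instead of raising, so the first-word case gets an explicit check, while non-numeric tokens are caught with try/except.

```python
def pounds_before_lbs(line):
    """Return, as floats, the numbers that directly precede an 'lbs' token."""
    words = line.split()
    values = []
    for i, word in enumerate(words):
        # Accept 'lbs' and 'lbs.' (trailing period), ignoring case.
        if word.lower().rstrip('.') != 'lbs':
            continue
        if i == 0:
            # 'lbs' is the first word, so nothing precedes it.
            continue
        try:
            values.append(float(words[i - 1].replace(',', '')))
        except ValueError:
            # The preceding token is not a number; skip it.
            continue
    return values


print(pounds_before_lbs("-GOM Cod up to 5,000 lbs (live wt) @ 1.40 lbs"))
# [5000.0, 1.4]
print(pounds_before_lbs("lbs of cod available"))
# []
```

Returning a list rather than printing leaves the choice between `max(values)` and `values[0]` to the caller.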

Votes: 1

Stack Overflow user

Answered 2016-11-09 05:10:02

Here is a regex that should be flexible enough:

import os
import re

# `path` is the directory holding the email files (defined elsewhere).
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            pattern = r'(\d[\d,.]+)\s*(?:lbs|[Pp]ounds)'
            content = f.read()

            ### if you want only the first match ###
            match = re.search(pattern, content)
            if match:
                print(match.group(1))

            ### if you want all the matches ###
            matches = re.findall(pattern, content)
            if matches:
                print(matches)

You can make the regex more thorough if you need to.

Hope this helps!

Update

The main part here is the regex (\d[\d,.]+)\s*(?:lbs|[Pp]ounds). It is a basic one, explained below:

(                      
    \d                 -> Start with any digit character
    [\d,.]+            -> Followed by either other digits or commas or dots
)                      
\s*                    -> Followed by zero or more spaces
(?:                    
    lbs|[Pp]ounds      -> Followed by either 'lbs' or 'Pounds' or 'pounds'
)

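Parentheses define a capturing group, so (\d[\d,.]+) is what gets captured, which is basically the numeric part.

Parentheses that start with ?: define a non-capturing group.

This regex will match:

2,890 pounds (captures '2,890')
3.6 pounds (captures '3.6')
5678829 pounds
23 pounds

as well as unwanted things such as:

2.....lbs
3,4,6,7,8 Pounds

It will not match: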

7,423
23m lbs
45 ppounds
2.8 pound

You can build a more elaborate regex depending on how complex your content is. I think this one is good enough for your purposes.
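For illustration, one possible stricter variant is sketched below. It is not part of the answer above: the name `STRICT_POUNDS`, the helper `extract_pounds`, and the assumption that numbers use US-style thousands separators (groups of three digits) are all hypothetical choices. It rejects inputs such as 2.....lbs and 3,4,6,7,8 pounds while still accepting values like 5,000, 23, 3.6 and .60.

```python
import re

# Hypothetical stricter pattern; assumes US-style number formatting.
STRICT_POUNDS = re.compile(
    r"(?<![\d.,])"                                   # not inside another number
    r"((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?|\.\d+)"  # 5,000 / 23 / 3.6 / .60
    r"\s*(?:lbs|pounds)\b",                          # unit as a whole word
    re.IGNORECASE,
)


def extract_pounds(text):
    """Return every number that is directly followed by 'lbs' or 'pounds'."""
    return STRICT_POUNDS.findall(text)


print(extract_pounds("-GOM Cod up to 5,000 lbs (live wt) @ 1.40 lbs"))
# ['5,000', '1.40']
print(extract_pounds("2.....lbs"))
# []
```

This variant also picks up the price after the @ sign, since that is followed by lbs as well, so choosing between the first value and the largest value is still up to the caller.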

Hope this helps clear things up.

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/40495652
