How to write a Python script that finds XPaths in HTML for any given class?

Content sourced from Stack Overflow, translated and used under the CC BY-SA 3.0 license.

In Python, I want the user to enter a URL at a console prompt (take the input and store it in a variable), and then check whether that URL serves some HTML.

If the page contains some content, parse it and build a CSV of the XPaths for every class.
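A minimal sketch of the "does this URL serve HTML" check, using only the standard library (testing the Content-Type header is my own assumption about how to interpret "contains some HTML"; the question does not specify it):

```python
from urllib.request import urlopen
from urllib.error import URLError

def looks_like_html(url):
    """Heuristic: treat the URL as HTML if the server's Content-Type says so."""
    try:
        with urlopen(url, timeout=10) as response:
            return 'text/html' in response.headers.get('Content-Type', '')
    except (URLError, ValueError):
        # Unreachable hosts and malformed URLs both count as "no HTML here".
        return False
```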

For example, if the page contains this HTML:

<html>
<head>
</head>
    <body>
        <div>
            <h1 class="class_one">First heading</h1>
                <p>Some text</p>
            <div class="class_two">
                <div class="class_three">
                    <div class="class_one">
                        <center class="class_two">
                            <h3 class="class_three">
                            </h3>
                        </center>
                        <center>
                            <h3 class="find_first_class">
                                Some text
                            </h3>
                        </center>
                    </div>
                </div>
            </div>
            <div class="class_two">
                <div class="class_three">
                    <div class="class_one">
                        <center class="class_two">
                            <h2 class="find_second_class">
                            </h2>
                        </center>
                    </div>
                </div>
            </div>
        </div>
    </body>
</html>

Then the CSV should contain a row for each class in the page's HTML (since a class can occur more than once, there may be multiple paths for any given class).

A sample CSV might look like this:

find_first_class,/div[1]/div[1]/div[1]/div[1]/center[2]/h3[1],second path,third path
find_second_class,/div[1]/div[1]/div[1]/div[1]/center[1]/h2[1],second path,third path

Now I want to generate the XPaths for every class on the page. This is what I have written so far:

import urllib2
from bs4 import BeautifulSoup

result = {}
user_url_list = raw_input("Please enter your urls separated by spaces : \n")
url_list = map(str, user_url_list.split())
for url in url_list:
    try:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, 'html.parser')
        user_class_list = raw_input("Please enter the classes to parse for " + url + " separated by spaces : \n")
        class_list = map(str, user_class_list.split())
        for find_class in class_list:
            try:
                name_box = soup.find(attrs={'class': find_class})
                print(xpath_soup(name_box))
                break
            except:
                print("There was some error getting the xpath of class : " + find_class + " for url : " + url + "\n..trying next class now \n")
                continue
    except:
        print(url + " is not valid, please enter correct full url \n")
        continue
print(result)
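Note that the snippet above calls `xpath_soup`, which is never defined. A minimal sketch of what such a helper could look like (my own reconstruction, not part of the question: it walks a BeautifulSoup tag's parents and counts same-named siblings to build 1-based positional steps):

```python
def xpath_soup(tag):
    """Build an absolute, positional XPath for a BeautifulSoup Tag."""
    components = []
    target = tag
    for parent in tag.parents:  # walk from the tag up to the document root
        siblings = parent.find_all(target.name, recursive=False)
        # XPath positions are 1-based
        components.append('%s[%d]' % (target.name, siblings.index(target) + 1))
        target = parent
    components.reverse()
    return '/' + '/'.join(components)
```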
Answer:
import csv
import requests
from lxml import etree

target_url = input('Which url is to be scraped?')

page = '''
<html>
<head>
</head>
    <body>
        <div>
            <h1 class="class_one">First heading</h1>
                <p>Some text</p>
            <div class="class_two">
                <div class="class_three">
                    <div class="class_one">
                        <center class="class_two">
                            <h3 class="class_three">
                            </h3>
                        </center>
                        <center>
                            <h3 class="find_first_class">
                                Some text
                            </h3>
                        </center>
                    </div>
                </div>
            </div>
            <div class="class_two">
                <div class="class_three">
                    <div class="class_one">
                        <center class="class_two">
                            <h2 class="find_second_class">
                            </h2>
                        </center>
                    </div>
                </div>
            </div>
        </div>
    </body>
</html>
'''

#response = requests.get(target_url)
#document = etree.fromstring(response.content)
classes_list = ['find_first_class', 'find_second_class']
expressions = []

document = etree.fromstring(page)

for element in document.xpath('//*'):
    try:
        ele_class = element.xpath("@class")[0]
        print(ele_class)
        if ele_class in classes_list:
            # build the path against the document tree, not a tree rooted at the element itself
            tree = element.getroottree()
            expressions.append((ele_class, tree.getpath(element)))
    except IndexError:
        print("No class in this element.")
        continue

with open('test.csv', 'w', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f, delimiter=',')
    writer.writerows(expressions)
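One caveat worth adding (my note, not the original answerer's): `element.xpath("@class")[0]` yields the entire attribute string, so an element declared as, say, `class="find_first_class extra"` never compares equal to a single class name. A sketch that splits the attribute before matching, using the same lxml approach:

```python
from lxml import etree

def class_xpaths(html_text, wanted_classes):
    """Return (class, xpath) pairs for every element carrying a wanted class."""
    root = etree.fromstring(html_text)
    tree = root.getroottree()
    rows = []
    for element in root.iter('*'):  # '*' restricts iteration to real elements
        classes = (element.get('class') or '').split()  # handles multi-class attributes
        for cls in classes:
            if cls in wanted_classes:
                rows.append((cls, tree.getpath(element)))
    return rows
```

The resulting pairs can be fed straight to `csv.writer.writerows`, as in the answer above.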
