I'm trying to grab all of the URLs in a CSV file, then open the file and read each URL, opening each one to search for and pull the source, author, and license information. It then needs to follow the respective git link to check whether a license file exists. If a license file exists, it should be downloaded and saved to a CSV file.
I have the code below, but I'm getting the following error on reading the first URL in the file: No connection adapters were found for "['https://tools.kali.org/information-gathering/ace-voip']"
The actual error:
File "omitted", line 742, in get_adapter
    raise InvalidSchema("No connection adapters were found for {!r}".format(url))
InvalidSchema: No connection adapters were found for "['https://tools.kali.org/information-gathering/ace-voip']"
I think this is because a "[" is being added in front of my URL; however, it doesn't exist in my file of listed URLs.
I'm new to Python, so any help with this is much appreciated.
import urllib.request, urllib.parse, urllib.error
import ssl
import zlib
from bs4 import BeautifulSoup
import csv
from urllib.request import urlopen
import urllib
import urllib.parse
import requests

#Testing ssl and reading url
#urllib.request.urlopen('https://google.com').read()
ctx = ssl._create_default_https_context()
# Establish chrome driver and go to report site URL
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://tools.kali.org/tools-listing'
html = urllib.request.urlopen(url, context=ctx)#.read().decode('utf-8')
de_data = zlib.decompress(html.read(), 16 + zlib.MAX_WBITS)
print(de_data)
soup = BeautifulSoup(de_data, 'lxml')

data = []
for url in soup.find_all('a', href=True, text=True):
    print(url['href'])
    data.append(url['href'])
print(data)

####New Replacement for above that works removing spaces########
with open('kalitools.csv', 'w') as file:
    for url in data:
        file.write(str(url) + '\n')

# loading csv file with URLS and parsing each
######TESTING Reading URLS########
with open('E:/KaliScrape/kalitools.txt', 'r') as f_urls, open('ommitted/output.txt', 'w', newline='') as f_output:
    csv_urls = csv.reader(f_urls)
    csv_output = csv.writer(f_output)
    csv_output.writerow(['Source', 'Author', 'License'])
    print(csv_urls)
    for line in csv_urls:
        r = requests.get(line)#.text
        soup = BeautifulSoup(r, 'lxml')
        #r = requests.get(line[0], verify=False)#.text
        #for line in csv_urls:
        #    line = 'https://' + line if 'https' not in line else line
        #    source = urlopen(line).read()
        src = soup.find('li')
        print('Source:', src.text)
        auth = soup.find('li')
        print('Author:', auth.text)
        lic = soup.find('li')
        print('License:', lic.text)
        csv_output.writerow([src.text, auth.text, lic.text])
Answered on 2021-06-24 17:10:52
So, the problem is that you are getting back a list; you just need to pick the list element at index zero:
for line in csv_urls:
    r = requests.get(line[0])#.text
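For context, csv.reader yields each row as a list of strings, so line is ['https://...'] and requests.get() refuses it; the brackets in the error message are the list's repr, not characters in the file. Below is a minimal sketch of the corrected reading loop. The file names and the li lookup are carried over from the question; passing r.text instead of the Response object to BeautifulSoup fixes a second bug that would surface right after this one.

import csv
import requests
from bs4 import BeautifulSoup

with open('kalitools.csv', 'r') as f_urls, open('output.csv', 'w', newline='') as f_output:
    csv_urls = csv.reader(f_urls)
    csv_output = csv.writer(f_output)
    csv_output.writerow(['Source', 'Author', 'License'])
    for line in csv_urls:
        if not line:                 # skip any blank rows in the CSV
            continue
        url = line[0]                # each row is a list of cells; the URL is cell 0
        r = requests.get(url)        # now a plain string, so requests accepts it
        soup = BeautifulSoup(r.text, 'lxml')   # parse the body text, not the Response object
        # Placeholder extraction: the question's code reads the same first <li> three
        # times; the real page structure determines the right selectors here.
        fields = [li.get_text(strip=True) for li in soup.find_all('li')[:3]]
        csv_output.writerow(fields)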
https://stackoverflow.com/questions/68118374
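The question also wants to follow each tool's respective git link and save the license file if one exists. The thread never gets that far, but a rough sketch is possible, assuming the link is a GitHub repository URL and probing GitHub's raw-file endpoint; the branch names and license filename candidates below are assumptions, not something from the original.

import requests

def fetch_license(repo_url):
    # repo_url is assumed to look like https://github.com/<owner>/<repo>
    raw_base = repo_url.rstrip('/').replace('github.com', 'raw.githubusercontent.com')
    for branch in ('main', 'master'):
        for name in ('LICENSE', 'LICENSE.txt', 'LICENSE.md', 'COPYING'):
            r = requests.get(f'{raw_base}/{branch}/{name}')
            if r.status_code == 200:
                return r.text        # license file found; caller can write it out
    return None                      # no license file under the usual names

# Hypothetical usage: append whatever was found to the output row,
# e.g. text = fetch_license('https://github.com/owner/repo')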