I can't seem to extract the text of the following HTML element as a string:
[<address class="styles_address__zrPvy"><svg class="styles_addressIcon__3Pu3L" height="42" viewbox="0 0 32 42" width="32" xmlns="http://www.w3.org/2000/svg"><path d="M14.381 41.153C2.462 23.873.25 22.1.25 15.75.25 7.051 7.301 0 16 0s15.75 7.051 15.75 15.75c0 6.35-2.212 8.124-14.131 25.403a1.97 1.97 0 01-3.238 0zM16 22.313a6.562 6.562 0 100-13.125 6.562 6.562 0 000 13.124z"></path></svg>Level 1 44 Market Street<!-- -->, <!-- -->Sydney</address>]
Extracting "title" works fine, but "address" cannot be extracted.
path = "C:\\Users\\mpeter\\Downloads\\lksd\\"
titleList = []
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
    title = soup.find_all("title")
    title = soup.title.string
    titleList.append(title)
    
streetAddressList = []
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
    address = soup.find_all("address", class_={"styles_address__zrPvy"})
    address = soup.address.string
    streetAddressList.append(address)
  
with open('output2.csv', 'w') as myfile:
   writer = csv.writer(myfile)
   writer.writerows((titleList, streetAddressList))
When I remove address = soup.address.string it works, but it extracts the whole element.
Posted on 2021-01-21 15:29:56
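find_all returns every matching element, so you need to iterate over the result set and read the text of each one. Note that .string is None here because the address tag has more than one child (the svg, the text nodes and the comments), whereas .text joins the text nodes together.
Use the following code: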
from bs4 import BeautifulSoup
import pandas as pd
strs = '[<address class="styles_address__zrPvy"><svg class="styles_addressIcon__3Pu3L" height="42" viewbox="0 0 32 42" width="32" xmlns="http://www.w3.org/2000/svg"><path d="M14.381 41.153C2.462 23.873.25 22.1.25 15.75.25 7.051 7.301 0 16 0s15.75 7.051 15.75 15.75c0 6.35-2.212 8.124-14.131 25.403a1.97 1.97 0 01-3.238 0zM16 22.313a6.562 6.562 0 100-13.125 6.562 6.562 0 000 13.124z"></path></svg>Level 1 44 Market Street<!-- -->, <!-- -->Sydney</address>] [<address class="styles_address__zrPvy"><svg class="styles_addressIcon__3Pu3L" height="42" viewbox="0 0 32 42" width="32" xmlns="http://www.w3.org/2000/svg"><path d="M14.381 41.153C2.462 23.873.25 22.1.25 15.75.25 7.051 7.301 0 16 0s15.75 7.051 15.75 15.75c0 6.35-2.212 8.124-14.131 25.403a1.97 1.97 0 01-3.238 0zM16 22.313a6.562 6.562 0 100-13.125 6.562 6.562 0 000 13.124z"></path></svg>14, Bengaluru<!-- -->, <!-- -->India</address>]'
soup = BeautifulSoup(strs, 'lxml')
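# find_all returns a ResultSet of every tag carrying this class
addresses = soup.find_all(class_="styles_address__zrPvy")
addr = []
df = pd.DataFrame()
for address in addresses:
    # .text joins the text nodes of the tag; the svg and the comments contribute nothing
    addr.append(address.text)
df['address'] = addr
df
Output: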
   address
0   Level 1 44 Market Street, Sydney
1   14, Bengaluru, India
The address list is now in a DataFrame. You can write this DataFrame to a CSV file with df.to_csv().
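The same idea can be applied to the per-file loop from the question. The following is only a minimal sketch under some assumptions: the helper names extract_title_and_address and write_addresses are hypothetical, the files are assumed to be UTF-8, and the built-in "html.parser" is used instead of 'lxml' so the example has no extra dependency. It reads the text with get_text() rather than .string, and writes one row per file instead of two long rows.

```python
import csv
import glob
import os

from bs4 import BeautifulSoup


def extract_title_and_address(html):
    """Return (title, address) as plain text; None where a tag is missing."""
    # "html.parser" is an assumption made to avoid the lxml dependency.
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    address_tag = soup.find("address", class_="styles_address__zrPvy")
    # get_text() joins all text nodes; .string would be None for the address
    # tag because it has several children (svg, text nodes, comments).
    title = title_tag.get_text().strip() if title_tag else None
    address = address_tag.get_text().strip() if address_tag else None
    return title, address


def write_addresses(folder, out_path):
    """Hypothetical helper: parse every *.html file in folder, write a CSV."""
    rows = []
    for infile in sorted(glob.glob(os.path.join(folder, "*.html"))):
        # UTF-8 is assumed here; adjust to the real encoding of the files.
        with open(infile, "r", encoding="utf-8") as f:
            rows.append(extract_title_and_address(f.read()))
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["title", "address"])
        writer.writerows(rows)  # one (title, address) row per file
    return rows


# Small in-memory sample shaped like the markup in the question.
sample = (
    "<html><head><title>Office</title></head><body>"
    '<address class="styles_address__zrPvy"><svg></svg>'
    "Level 1 44 Market Street<!-- -->, <!-- -->Sydney</address>"
    "</body></html>"
)
print(extract_title_and_address(sample))
```

Doing both lookups in a single pass keeps each title on the same row as its address, and a file without an address tag yields None instead of raising an exception.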
https://stackoverflow.com/questions/65821352