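I want to scrape congressional stock trades from Capitol Trades. I can scrape the data, but the column containing the stock ticker has a span tag that separates the company name from the ticker. pandas.read_html() drops this span tag, so the company name and the ticker are concatenated and the ticker is hard to recover.
For example, a company name ending in the "INC" suffix runs straight into the ticker, which is also in uppercase. See the "INC" and "AE" example below.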

This is where I found the span tag:

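Tickers are 1 to 5 characters long, and I have not been able to split them off reliably because company suffixes vary widely (for example "INC", "CORP", "PLC", "SE") and not every company name has a suffix.
How can I replace the span tag with a space to separate the company name from the ticker, or parse the span into a separate column?
Here is my code: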
import pandas as pd
import yfinance as yf
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import datetime
def get_url(page=1, pageSize=50, assetType='stock'):
    if page == 1:
        return f'https://www.capitoltrades.com/trades?assetType={assetType}&pageSize={pageSize}'
    elif page > 1:
        return f'https://www.capitoltrades.com/trades?assetType={assetType}&page={page}&pageSize={pageSize}'
    else:
        return None
driver = webdriver.Firefox()
driver.get(get_url(page=1))
driver.implicitly_wait(10)
time.sleep(1)
soup = BeautifulSoup(driver.page_source, 'lxml')
tables = soup.find_all('table')
table = pd.read_html(str(tables))[0]
driver.close()

Posted on 2022-11-12 04:46:37
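To separate the company name from the ticker, or to parse the span into its own column and get a tidy result set overall, you can change your choice of tools slightly. In this case it is better to use bs4 together with a pandas DataFrame instead of the pd.read_html() method.
A full working example: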
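As a smaller illustration of the idea, here is a minimal sketch on static HTML. The markup is a hypothetical fragment modelled on the class names used in the selectors of this answer, so the real page may differ, and `split_issuer_cells` is a helper name made up for this example. It shows why flattening the cell glues the name to the ticker, how a separator passed to `get_text()` puts a space between them, and how selecting the span on its own yields a separate column.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not copied from the live site):
# the issuer name sits in a link, the ticker in a sibling <span>.
html = """
<table>
  <tbody>
    <tr>
      <td><div class="q-fieldset issuer-name"><a href="#">ADAMS RESOURCES &amp; ENERGY INC</a></div><span class="q-field issuer-ticker">AE:US</span></td>
    </tr>
    <tr>
      <td><div class="q-fieldset issuer-name"><a href="#">3M Co</a></div><span class="q-field issuer-ticker">MMM:US</span></td>
    </tr>
  </tbody>
</table>
"""


def split_issuer_cells(markup):
    """Return one dict per issuer cell with the name and ticker kept apart."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for cell in soup.select("tbody tr td"):
        rows.append({
            # Each element is read on its own, so nothing gets concatenated.
            "name": cell.select_one(".issuer-name").get_text(strip=True),
            "ticker": cell.select_one("span.issuer-ticker").get_text(strip=True),
            # Flattening the whole cell without a separator joins the two strings.
            "joined": cell.get_text(strip=True),
            # Passing a separator puts a space where the tag boundary was.
            "spaced": cell.get_text(" ", strip=True),
        })
    return rows


if __name__ == "__main__":
    for row in split_issuer_cells(html):
        print(row)
```

The list of dicts can be passed directly to `pd.DataFrame(...)`, which is the same pattern the longer Selenium example uses.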
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
#The base url: https://www.capitoltrades.com/trades?assetType=stock&pageSize=50
data = []
driver.maximize_window()
for page in range(1, 5):
    driver.get(f'https://www.capitoltrades.com/trades?assetType=stock&pageSize=50&page={page}')
    time.sleep(3)  # crude wait for the JavaScript-rendered table
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for row in soup.select('table.q-table.trades-table > tbody tr'):
        Politician = row.select_one('[class="q-fieldset politician-name"] > a').text.strip()
        Politician_info = row.select_one('[class="q-fieldset politician-info"]').get_text(' ', strip=True)
        # The issuer name and the ticker are separate elements, so read them separately
        Traded_Issuer = row.select_one('[class="q-fieldset issuer-name"] > a').text.strip()
        Issuer_ticker = row.select_one('span[class="q-field issuer-ticker"]').text.strip()
        Published = row.select_one('[class="q-td q-column--pubDate"] .q-value').text.strip()
        # The traded date can be missing, so guard against None
        Traded = row.select_one('[class="q-td q-column--txbDate"] .q-value')
        Traded = Traded.text.strip() if Traded else None
        Filed_after = row.select_one('[class="q-td q-column--reportingGap"] .q-value').text.strip()
        Owner = row.select_one('[class="svg-image owner-icon"]+span').text.strip()
        _type = row.select_one('[class="q-data-cell tx-type"]').get_text(strip=True)
        Size = row.select_one('[class="q-td q-column--value"] > div').get_text(strip=True)
        Price = row.select_one('[class="q-field trade-price"]').text.strip()
        data.append({
            'Politician': Politician,
            'Politician_info': Politician_info,
            'Traded_Issuer': Traded_Issuer,
            'Issuer_ticker': Issuer_ticker,
            'Published': Published,
            'Traded': Traded,
            'Filed_after': Filed_after,
            'Owner': Owner,
            'Type': _type,
            'Size': Size,
            'Price': Price
        })
driver.quit()
# Build the DataFrame once, after all pages have been collected
df = pd.DataFrame(data)
print(df)

Output:
     Politician                Politician_info      Traded_Issuer                 Issuer_ticker  ...  Owner        Type  Size      Price
0    Debbie Wasserman Schultz  Democrat House FL    ADAMS RESOURCES & ENERGY INC  AE:US          ...  Child        sell  1K - 15K  32.27
1    Kathy Manning             Democrat House NC    3M Co                         MMM:US         ...  Spouse       buy   1K - 15K  108.95
2    Kathy Manning             Democrat House NC    Accenture PLC                 ACN:US         ...  Spouse       buy   1K - 15K  250.07
3    Kathy Manning             Democrat House NC    Adobe Inc                     ADBE:US        ...  Spouse       sell  1K - 15K  286.15
4    Kathy Manning             Democrat House NC    Alphabet Inc                  GOOGL:US       ...  Spouse       buy   1K - 15K  97.56
..   ...                       ...                  ...                           ...            ...  ...          ...   ...       ...
195  Diana Harshbarger         Republican House TN  CME Group Inc                 CME:US         ...  Joint        sell  1K - 15K  176.26
196  Diana Harshbarger         Republican House TN  CME Group Inc                 CME:US         ...  Spouse       sell  1K - 15K  176.26
197  Diana Harshbarger         Republican House TN  The Home Depot Inc            HD:US          ...  Undisclosed  sell  1K - 15K  268.69
198  Diana Harshbarger         Republican House TN  The Home Depot Inc            HD:US          ...  Undisclosed  sell  1K - 15K  268.69
199  Diana Harshbarger         Republican House TN  The Home Depot Inc            HD:US          ...  Joint        sell  1K - 15K  268.69
[200 rows x 11 columns]

https://stackoverflow.com/questions/74409839