Replace a span tag with a space, or parse its contents into a new column with pandas.read_html

Stack Overflow user
Asked 2022-11-12 01:56:35
Answers: 1 | Views: 47 | Followers: 0 | Votes: 0

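I want to scrape congressional stock trades from Capitol Trades. I can scrape the data, but the column that holds the stock ticker contains a span tag that separates the company name from the ticker. pandas.read_html() drops this span tag, which concatenates the company name and the ticker and makes the ticker hard to recover.

For example, a company name that ends in the "INC" suffix runs straight into the ticker, which is also uppercase, as in the case of "INC" and "AE".

Here is where I found the span tag:

Tickers are 1 to 5 characters long, and I have not managed to split them off after the fact, because company suffixes vary widely (e.g. "INC", "CORP", "PLC", "SE") and not every company name has a suffix.

How can I replace the span tag with a space to separate the company name from the ticker, or parse the span into a separate column?

Here is my code: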

import pandas as pd
import yfinance as yf
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import datetime

def get_url(page=1, pageSize=50, assetType='stock'):
    if page == 1:
        return f'https://www.capitoltrades.com/trades?assetType={assetType}&pageSize={pageSize}'
    elif page > 1:
        return f'https://www.capitoltrades.com/trades?assetType={assetType}&page={page}&pageSize={pageSize}'
    else:
        return None

driver = webdriver.Firefox()
driver.get(get_url(page=1))
driver.implicitly_wait(10)
time.sleep(1)
soup = BeautifulSoup(driver.page_source, 'lxml')
tables = soup.find_all('table')
table = pd.read_html(str(tables))[0]
driver.close()

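1 Answer

Stack Overflow user

Accepted answer

Answered 2022-11-12 04:46:37

To separate the company name from the ticker, or to parse the span into its own column and end up with a tidy result set overall, you can change your choice of tools slightly. In this case it is better to use bs4 together with a pandas DataFrame than the pd.read_html() method, because selecting each element yourself keeps the link text and the span text apart.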
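The idea can be sketched on a static snippet first. This is only an illustration under stated assumptions: the HTML string and the `issuer-name` / `issuer-ticker` class names are simplified stand-ins invented for the example, not the real page markup. Reading the link and the span separately gives two clean columns:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page's structure may differ.
html = """
<table>
  <tbody>
    <tr><td><a class="issuer-name">ADAMS RESOURCES &amp; ENERGY INC</a><span class="issuer-ticker">AE:US</span></td></tr>
    <tr><td><a class="issuer-name">3M Co</a><span class="issuer-ticker">MMM:US</span></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("tbody tr"):
    rows.append({
        # Company name comes from the <a>, ticker from the <span>,
        # so the two strings are never concatenated.
        "Traded_Issuer": tr.select_one("a.issuer-name").get_text(strip=True),
        "Issuer_ticker": tr.select_one("span.issuer-ticker").get_text(strip=True),
    })

df = pd.DataFrame(rows)
print(df)
```

Building a list of dicts and passing it to `pd.DataFrame` keeps the column names explicit, which is the same pattern the full script uses.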

A complete working example:

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()

# The base url: https://www.capitoltrades.com/trades?assetType=stock&pageSize=50
data = []
try:
    for page in range(1, 5):  # pages 1 to 4, 50 rows per page
        driver.get(f'https://www.capitoltrades.com/trades?assetType=stock&pageSize=50&page={page}')
        time.sleep(3)  # give the JavaScript-rendered table time to load
        soup = BeautifulSoup(driver.page_source, 'lxml')
        for row in soup.select('table.q-table.trades-table > tbody tr'):
            politician = row.select_one('[class="q-fieldset politician-name"] > a').text.strip()
            politician_info = row.select_one('[class="q-fieldset politician-info"]').get_text(' ', strip=True)
            # The issuer name (<a>) and the ticker (<span>) are read separately,
            # so they are never concatenated.
            traded_issuer = row.select_one('[class="q-fieldset issuer-name"] > a').text.strip()
            issuer_ticker = row.select_one('span[class="q-field issuer-ticker"]').text.strip()
            published = row.select_one('[class="q-td q-column--pubDate"] .q-value').text.strip()
            # Guard against rows that have no traded-date element.
            traded = row.select_one('[class="q-td q-column--txbDate"] .q-value')
            traded = traded.text.strip() if traded else None
            filed_after = row.select_one('[class="q-td q-column--reportingGap"] .q-value').text.strip()
            owner = row.select_one('[class="svg-image owner-icon"]+span').text.strip()
            tx_type = row.select_one('[class="q-data-cell tx-type"]').get_text(strip=True)
            size = row.select_one('[class="q-td q-column--value"] > div').get_text(strip=True)
            price = row.select_one('[class="q-field trade-price"]').text.strip()

            data.append({
                'Politician': politician,
                'Politician_info': politician_info,
                'Traded_Issuer': traded_issuer,
                'Issuer_ticker': issuer_ticker,
                'Published': published,
                'Traded': traded,
                'Filed_after': filed_after,
                'Owner': owner,
                'Type': tx_type,
                'Size': size,
                'Price': price,
            })
finally:
    driver.quit()  # close the browser even if a selector fails

df = pd.DataFrame(data)
print(df)

Output:

                   Politician      Politician_info                 Traded_Issuer Issuer_ticker  ...        Owner  Type      Size   Price
0    Debbie Wasserman Schultz    Democrat House FL  ADAMS RESOURCES & ENERGY INC         AE:US  ...        Child  sell  1K - 15K   32.27     
1               Kathy Manning    Democrat House NC                         3M Co        MMM:US  ...       Spouse   buy  1K - 15K  108.95     
2               Kathy Manning    Democrat House NC                 Accenture PLC        ACN:US  ...       Spouse   buy  1K - 15K  250.07     
3               Kathy Manning    Democrat House NC                     Adobe Inc       ADBE:US  ...       Spouse  sell  1K - 15K  286.15     
4               Kathy Manning    Democrat House NC                  Alphabet Inc      GOOGL:US  ...       Spouse   buy  1K - 15K   97.56     
..                        ...                  ...                           ...           ...  ...          ...   ...       ...     ...       
195         Diana Harshbarger  Republican House TN                 CME Group Inc        CME:US  ...        Joint  sell  1K - 15K  176.26       
196         Diana Harshbarger  Republican House TN                 CME Group Inc        CME:US  ...       Spouse  sell  1K - 15K  176.26       
197         Diana Harshbarger  Republican House TN            The Home Depot Inc         HD:US  ...  Undisclosed  sell  1K - 15K  268.69       
198         Diana Harshbarger  Republican House TN            The Home Depot Inc         HD:US  ...  Undisclosed  sell  1K - 15K  268.69       
199         Diana Harshbarger  Republican House TN            The Home Depot Inc         HD:US  ...        Joint  sell  1K - 15K  268.69       

[200 rows x 11 columns]
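The script above passes a separator to `get_text(' ', strip=True)` for `Politician_info`. The same argument covers the "replace the span with a space" half of the question when a single joined string is enough. The sketch below uses a made-up cell (the markup is an assumption, not the real page) and relies on the ticker containing no spaces:

```python
from bs4 import BeautifulSoup

# Hypothetical cell: company name in an <a>, ticker in a <span>, no whitespace between.
cell_html = '<td><a>ADAMS RESOURCES &amp; ENERGY INC</a><span>AE:US</span></td>'
cell = BeautifulSoup(cell_html, "html.parser").td

# Without a separator the two text nodes are glued together.
joined = cell.get_text(strip=True)

# With a separator, a space is placed between the <a> text and the <span> text.
spaced = cell.get_text(" ", strip=True)

# Assuming the ticker itself has no spaces, split it off from the right.
name, ticker = spaced.rsplit(" ", 1)

print(repr(joined))
print(repr(spaced))
print(name, "|", ticker)
```

Selecting the span directly, as the full script does, is the more robust choice; the separator approach is mainly useful when the cell structure is not known in advance.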
Votes: 1
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/74409839
