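I want to use my web scraper to collect all tweets about Apple, going back to a date that I specify. At the moment I am only collecting tweets from today or the last few days. However, my goal is to scrape all tweets from the past three years. When I run my code, it takes a few hours just to cover a few days. Does anyone have suggestions on how I can optimize my code to run faster? Apologies for this possibly trivial question, but I am a beginner and I am trying to get started step by step.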
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
options = Options()
options.headless = True
options.add_argument('window-size=1920x1080')
web = "https://stocktwits.com/search"
driver = webdriver.Chrome(r"C:\Users\veron\Downloads\chromedriver\chromedriver.exe", options=options)
driver.get(web)
#driver.maximize_window()
username = driver.find_element_by_xpath('//input[@placeholder = "Symbol or @Username"]')
username.send_keys("AAPL")
time.sleep(2)
username.send_keys(Keys.ENTER)
time.sleep(2)
def get_tweet(element):
    try:
        user = element.find_element_by_xpath('.//span[@class = "st_2JY3sEE"]/a[contains(@href, "/")]/span[text()]').text
        text = element.find_element_by_xpath('.//div[@class="st_3SL2gug"]').text
        date = element.find_element_by_xpath('.//a[@class ="st_28bQfzV st_1E79qOs st_3TuKxmZ st_1VMMH6S"]').text
        # date = date_old.replace("\n", "")
        tweet_data = [user, text, date]
    except:
        tweet_data = ['user', 'text', 'date']
    return tweet_data
user_data = []
text_data = []
date_data = []
scrolling = True
while scrolling:
    tweets = WebDriverWait(driver, 5).until(
        EC.presence_of_all_elements_located((By.XPATH, '//div[@class = "st_2o0zabc st_jGV698i st_PLa30pM"]')))
    # print(len(tweets))
    for tweet in tweets:
        tweet_list = get_tweet(tweet)
        user_data.append(tweet_list[0])
        text_data.append(" ".join(tweet_list[1].split()))
        date_data.append(tweet_list[2])
    # Get the initial scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    # Specified date
    str1 = "4/30/22"
    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(2)
        # Calculate new scroll height and compare it with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        # check if the date substring from above is in the date list, condition 1
        res = any(str1 in string for string in date_data)
        if res is True:
            scrolling = False
            break
        # condition 2
        if new_height == last_height:
            scrolling = False
            break
        else:
            last_height = new_height
            break
driver.quit()
df_tweets = pd.DataFrame({'user': user_data, 'text': text_data, 'date': date_data}) # , 'date': date_data
df_tweets.to_csv('stocktwits_tweets.csv', index=False)
print(df_tweets)
Posted on 2022-05-03 12:15:02
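Does anyone have suggestions on how I can optimize my code to run faster?
Measure what is slow. A typical way to do that is:
Once you know what is slowest, you may already have a solution. If not, then go ahead and ask others. By doing this investigation yourself, you show more respect for people's time, and you are more likely to get an answer because you have demonstrated that you did the initial work.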
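As one possible way to do the measurement suggested above, here is a minimal sketch of a timing helper. The names `timed`, `totals`, `counts`, and `report` are made up for illustration and are not part of the original code, and the two workloads in the demo are only stand-ins for the real scraper phases (parsing the loaded elements, and the fixed `time.sleep(2)` wait after each scroll).

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds and number of calls per labelled phase.
# (Hypothetical helper names, not part of the question's code.)
totals = defaultdict(float)
counts = defaultdict(int)


@contextmanager
def timed(label):
    """Add the wall-clock time spent inside the with-block to totals[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[label] += time.perf_counter() - start
        counts[label] += 1


def report():
    """Print the phases sorted from slowest to fastest."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    for label, seconds in ranked:
        print(f"{label:<10} {seconds:8.3f} s over {counts[label]} call(s)")


# Demo with stand-in workloads; in the scraper, the with-blocks would wrap
# the element lookup, the get_tweet() loop, and the scroll-and-wait step.
for _ in range(3):
    with timed("parse"):
        sum(i * i for i in range(10_000))
    with timed("wait"):
        time.sleep(0.05)

report()
```

Wrapping each phase of the scraping loop in `with timed("..."):` and calling `report()` before `driver.quit()` should show where the hours actually go. For a function-level breakdown without editing the script, the standard-library profiler can also be run from the command line, for example `python -m cProfile -s cumtime your_script.py`, where `your_script.py` is a placeholder for the scraper file.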
https://codereview.stackexchange.com/questions/276203