Q: Scraping Twitter accounts with a Python web crawler

Stack Overflow user

Asked 2020-09-24 05:52:17

Answers: 2 · Views: 1.1K · Followers: 0 · Votes: 0

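I am writing this program for my A-Level Computer Science coursework, and I am trying to get a crawler to scrape every user found in a given user's followers/following lists.

The start of the script looks like this: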

import requests
# import database as db
from bs4 import BeautifulSoup

debug = True


def getStartNode():  # Get the Twitter profile of the starting node
    global startNodeFollowing  # Declare the nodes vars as global for use in external functions
    global startNodeFollowers
    global startNodeLink
    if not debug:  # If debugging == False, allow the user to enter any starting node Twitter profile
        startNodeLink = input("Enter a link to the starting users Twitter profile\n[URL]: ")[:-1]  # Get profile link, remove the last char from input (space char, needed to enter link in terminal)
    else:  # If debugging == True, have predetermined starting node to save time during development
        startNodeLink = ("https://twitter.com/ckjellberg03")
    startNodeFollowers = (startNodeLink + "/followers")  # Create a new var using the starting node's Twitter profile, append for followers and following URL pages
    startNodeFollowing = (startNodeLink + "/following")

The crawler itself is here:

def spider():  # Web Crawler
    getStartNode()
    print("\nUsing:", startNodeLink)

    urlFollowers = startNodeFollowers
    sourceCode = requests.get(urlFollowers)
    plainText = sourceCode.text  # Source code of the URL (urlFollowers) in plain text format
    soup = BeautifulSoup(plainText,'lxml')  # BeautifulSoup object to search through plainText for specific items/classes etc
    for link in soup.findAll('a', {'class': 'css-4rbku5 css-18t94o4 css-1dbjc4n r-1loqt21 r-1wbh5a2 r-dnmrzs r-1ny4l3l'}):  # 'a' is a link in HTML (anchor), class is the Twitter class for a profile
        href = link.get('href')  # The attribute name must be passed as a string
        print(href) # Display everything found (development purposes)

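I am fairly sure that the class identifier for the users linked to a Twitter profile from the /followers page is 'css-4rbku5 css-18t94o4 css-1dbjc4n r-1loqt21 r-1wbh5a2 r-dnmrzs r-1ny4l3l', but the print statement shows nothing.

Any suggestions to point me in the right direction?

Thanks!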
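One way to narrow this down is to check whether the HTML that `requests` receives contains any anchors with those classes at all. A possible cause, stated as an assumption rather than a confirmed diagnosis, is that the class names were copied from the browser's rendered DOM, while `requests.get` returns only the HTML the server sends and runs no JavaScript. The sketch below uses only the standard library; `AnchorCollector`, `find_profile_links`, and the sample markup are hypothetical names and data made up for illustration, not real Twitter output.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values of <a> tags that carry every class in `required`."""

    def __init__(self, required):
        super().__init__()
        self.required = set(required)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute is a space-separated list; compare it as a set
        classes = set((attr_map.get("class") or "").split())
        if self.required <= classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def find_profile_links(html_text, class_string):
    """Return hrefs of anchors whose classes include all of class_string."""
    collector = AnchorCollector(class_string.split())
    collector.feed(html_text)
    return collector.hrefs


# Hypothetical markup for illustration only, not real Twitter output
sample_html = (
    '<a class="css-4rbku5 r-1loqt21 extra" href="/example_user">x</a>'
    '<a class="css-4rbku5" href="/other">y</a>'
)
print(find_profile_links(sample_html, "css-4rbku5 r-1loqt21"))
```

Passing `requests.get(url).text` as `html_text` and getting an empty list back would suggest that the anchors are simply not present in the server's response, so no selector would match them.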

python

web

web-crawler

Stack Overflow user

Answered 2021-05-09 16:13:16

Here is how to do it without the API. Part of the difficulty comes from sending the right browser string in the User-Agent header:

import re, requests

headers = { 'User-Agent': 'UCWEB/2.0 (compatible; Googlebot/2.1; +google.com/bot.html)'}


def cleanhtml(raw_html):
  cleanr = re.compile('<.*?>')
  cleantext = re.sub(cleanr, '', raw_html)
  return cleantext

content = ""
for user in ['billgates']:
    content += "============================\n\n"
    content += user + "\n\n"
    content += "============================\n\n"
    url_twitter = 'https://twitter.com/%s' % user
    resp = requests.get(url_twitter, headers=headers)  # Send request
    res = re.findall(r'<p class="TweetTextSize.*?tweet-text.*?>(.*?)</p>',resp.text)
    for x in res:
        x = cleanhtml(x)
        x = x.replace("&#39;","'")
        x = x.replace('&quot;','"')
        x = x.replace("&nbsp;"," ")
        content += x 
        content += "\n\n"
        content += "---"
        content += "\n\n"
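The chained `replace` calls above handle only three entities. A minimal sketch of an alternative, assuming the standard library's `html.unescape` is acceptable here, decodes all named and numeric entities in one step. `clean_tweet_html` and the sample string are hypothetical names and data introduced for illustration; they are not part of the answer's code.

```python
import html
import re

# Same non-greedy tag pattern as cleanhtml; DOTALL also covers tags that span lines
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def clean_tweet_html(raw_html):
    """Strip tags, then decode HTML entities such as &#39; and &quot;."""
    text = TAG_RE.sub("", raw_html)
    text = html.unescape(text)
    # &nbsp; decodes to a non-breaking space (U+00A0); map it to a plain space
    return text.replace("\xa0", " ")


# Made-up sample input for illustration
sample = 'It&#39;s a &quot;test&quot;&nbsp;of <a href="/x">links</a> &amp; entities'
print(clean_tweet_html(sample))
```

Stripping tags before decoding means that an escaped `&lt;` in the tweet text is not mistaken for the start of a tag.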

Votes: 0


Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/64036776
