Q: Reddit bot that checks for reposts

Code Review user

Asked 2015-07-30 07:06:12

Answers: 1 | Views: 1.4K | Followers: 0 | Votes: 7

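I am a moderator of /r/sweepstakes on Reddit, which lets users post their referral links to contests and sweepstakes. One of the main rules is that a user may not post a link to a contest if another user has already done so. Because every referral link has a different URL (e.g. contest.com/?ref=Kevin and contest.com/?ref=Steve), checking for reposts is not straightforward.

I figured a good way to find reposts would be to retrieve the title of the web page (the <title> tag) and store it in a database along with some other important information.

The bot scans for new posts every 15 minutes. For each post, it does the following: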

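1. Check whether we have already looked at the post by searching the DB for its pid (post ID). If we have, skip it and move on to the next post.
2. Use urllib to get the final URL, since some URLs redirect to another web page (e.g. bit.ly links).
3. Use BeautifulSoup to get the title of the web page (the <title> tag).
4. Search the DB for the title. If the title is in the database, the submitted post is a repost, and we retrieve some information about the original post (permalink, submitter). We add this information to a string that will be sent to the moderators.
5. If the title of the submitted post is not in the database, the post is unique and we add it to the database.
6. Once all posts have been processed, send the message listing all reposts to the moderators so they can check them manually.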
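The title lookup at the heart of these steps can be sketched in isolation as follows. This is only an illustration under stated assumptions: the helper name `record_or_find_original`, the reduced four-column table and the sample posts are hypothetical, and an in-memory database stands in for Reddit_DB.db.

```python
import sqlite3


def record_or_find_original(conn, pid, permalink, title, submitter):
    """Return (permalink, submitter) of the original post if this title
    has been seen before; otherwise store the post and return None.

    Hypothetical helper, not part of the bot.
    """
    cur = conn.cursor()
    cur.execute('SELECT permalink, submitter FROM duplicates WHERE title=?',
                (title,))
    row = cur.fetchone()
    if row:
        return row  # repost: report the original
    cur.execute('INSERT INTO duplicates (id, permalink, title, submitter) '
                'VALUES (?,?,?,?)', (pid, permalink, title, submitter))
    conn.commit()
    return None


# In-memory database with a reduced version of the bot's table (assumption).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE duplicates('
             'id TEXT, permalink TEXT, title TEXT, submitter TEXT)')

# Two made-up posts whose referral links lead to the same page title.
first = record_or_find_original(conn, 'a1', '/r/x/a1', 'Win a Car', 'Kevin')
second = record_or_find_original(conn, 'b2', '/r/x/b2', 'Win a Car', 'Steve')
```

Here the first call stores the post, and the second call returns Kevin's permalink and name because the title already exists.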

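I ran into a lot of problems, mostly related to finding the final URL of a post and finding the title of the page. To keep things simple, I may end up removing the function that finds the final URL, since it isn't very important.

I also ran into ASCII/Unicode problems and kept getting UnicodeEncodeError/UnicodeDecodeError exceptions.

Any suggestions on how to improve the code would be appreciated.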

import traceback
import praw # simple interface to the reddit API, also handles rate limiting of requests
import time
import sqlite3
import re
from urlparse import urlparse
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
import requests

'''USER CONFIGURATION'''

APP_ID = 'XXXX'
APP_SECRET = 'XXXX'
APP_URI = 'XXXX'
APP_REFRESH = 'XXXX'
USERAGENT = 'XXXX'
SUBREDDIT = "XXXX"
MAXPOSTS = 30
WAIT = 900 #15m This is how many seconds you will wait between cycles. The bot is completely inactive during this time.

# Resolve redirects for a URL. i.e. bit.ly/XXXX --> somesite.com/blahblah
# Also input # of retries in case rate-limit
def resolve_redirects(url, tries):
    tries -= 1
    try:
        req = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.125 Safari/537.36"}) # User agent since some sites block python/urllib2 useragent
        return urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        print('HTTPError: ' + str(e.code) + ': ' + domain)
        if (e.code == 403 or e.code == 429) and tries > 0:
            time.sleep(5)
            resolve_redirects(url, tries)
    except urllib2.URLError, e:
        print('URLError: ' + str(e.reason) + ': ' + domain)
    except Exception:
        import traceback
        print('Generic Exception: ' + traceback.format_exc())

# Get title of webpage if possible. Otherwise just set the page title equal to the pages URL        
def get_title(url):
    try:
        title = BeautifulSoup(url).title.string.strip()
    except AttributeError:
        title = url.geturl()
    return title.encode('utf-8').strip()

# Load Database
sql = sqlite3.connect('Reddit_DB.db')
print('Loaded SQL Database')
cur = sql.cursor()

# Create Table and Login to Reddit
cur.execute('CREATE TABLE IF NOT EXISTS duplicates(id TEXT, permalink TEXT, domain TEXT, url TEXT, title TEXT, submitter TEXT)')
sql.commit()
print('Logging in...')
r = praw.Reddit(USERAGENT)
r.set_oauth_app_info(APP_ID, APP_SECRET, APP_URI)
r.refresh_access_information(APP_REFRESH)

# Main portion of code
def replybot():
    print('Searching %s @ %s' % (SUBREDDIT, time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))))
    subreddit = r.get_subreddit(SUBREDDIT)
    submissions = list(subreddit.get_new(limit=MAXPOSTS))
    msg = ""
    for post in submissions:
        global domain # Need to be global to use in resolve_redirects()
        pid = post.id

        try:
            author = post.author.name
        except AttributeError:
            print "AttributeError: Author is deleted"
            continue

        # See if we have already looked at this post before. If we have, skip it.
        cur.execute('SELECT * FROM duplicates WHERE ID=?', [pid])
        sql.commit()
        if cur.fetchone(): # Post is already in the database
            continue

        url = post.url
        domain = post.domain
        if domain == "self." + str(SUBREDDIT): # Skip self posts
            continue

        # Get the final url after redirects (i.e. in case URL redirects to a different URL)
        try:
            post_url = resolve_redirects(url, 3)
            effective_url = post_url.geturl()
        except AttributeError:
            print "AttributeError: Post URL/Effective URL"
            continue

        # Get Title of webpage in Final URL
        try:    
            post_title = get_title(post_url).encode('utf-8').strip()
        except UnicodeDecodeError:
            post_title = unicode(get_title(post_url).strip(),"utf-8")
        except UnicodeEncodeError:
            print "UnicodeError: " + post.title
            continue

        # Check if the post is a repost by seeing if the Title already exists. If it does, get the Repost's permalink, title, submitter and create the message. Otherwise post is unique and is added to DB
        cur.execute('SELECT * FROM duplicates where TITLE=?', [post_title])
        sql.commit()
        row = cur.fetchone()
        if row:
            repost_permalink = row[1]
            repost_title = row[4]
            repost_submitter = row[5]
            print "Found repost of %s by %s" % (post.title, author)
            msg += 'Repost: [%s](%s) by /u/%s. Original: [Here](%s) by /u/%s.\n\n' % (post.title, post.permalink, author, repost_permalink, repost_submitter)
        else:
            cur.execute('INSERT INTO duplicates VALUES(?,?,?,?,?,?)', [pid, post.permalink, domain, effective_url, post_title, author])
            sql.commit()

    # If message exists (meaning there was a repost), send message to moderators
    if len(msg) > 0:
        r.send_message('/r/sweepstakes', 'Possible Repost', msg)
        print "Sent message"
    else:
        print "Nothing to send"

cycles = 0
while True:
    try:
        # Keep refresh alive by refreshing every 45m
        if cycles % 3 == 0:
            r.refresh_access_information(APP_REFRESH)
            print "Refreshed OAuth"
        replybot()
        cycles += 1
    except Exception as e:
        traceback.print_exc()
    time.sleep(WAIT)

Tags: python, beautifulsoup

1 Answer

Code Review user

Answered 2015-07-30 11:28:01

Use a modern version of Python

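The most obvious thing here is to use Python 3, which will go a long way towards fixing your Unicode problems, because Python 3 keeps a stricter separation between things that Python 2 conflates (text and bytes). In some cases your errors are just artifacts of Python 2's approach and will go away. In other cases you will get errors that give you a better idea of what the problem is.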
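To make that separation concrete, here is a small Python 3 sketch. The sample title is made up. It shows that text (str) and encoded data (bytes) are distinct types, and that mixing them fails immediately with a clear error instead of triggering an implicit ASCII conversion.

```python
title = 'Gewinnspiel: 1.000 € gewinnen'  # text (str); made-up page title
encoded = title.encode('utf-8')          # bytes, e.g. for storage or the wire
decoded = encoded.decode('utf-8')        # back to text

mixing_error = None
try:
    title + encoded                      # str + bytes is refused in Python 3
except TypeError as exc:
    mixing_error = exc                   # raised right where the mix-up occurs
```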

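In general, the only reason to use Python 2 in new code these days is that you have to use one of the dwindling number of libraries that haven't been ported yet. You use three non-stdlib packages; of these, requests and praw both support Python 3.

That leaves BeautifulSoup. The fact that you import it as from BeautifulSoup import BeautifulSoup means you are using bs3, which only works on Python 2.x and hasn't been updated since 2012. Upgrade to BeautifulSoup 4: it is actively maintained (at the time of this post, the latest release was less than 4 weeks old) and supports all current versions of Python.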

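Use requests

You import requests, but also urllib and urllib2. Of these, the easiest to use for what you want is requests, yet the only one you actually use is urllib2.

General Pythonisms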

e.code == 403 or e.code == 429

can be shortened to:

e.code in (403, 429)
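One detail worth spelling out: the tuple on the right of in needs its parentheses. Without them, Python applies in to the first number only and builds a tuple out of the result. A short sketch, using a hypothetical helper name `is_retryable`:

```python
RETRYABLE = (403, 429)  # status codes the bot retries on


def is_retryable(code):
    """Membership test against a tuple of status codes."""
    return code in RETRYABLE


# Without parentheses, `code in first, second` is parsed as
# `(code in first), second`, and `code in first` fails for an int.
first, second = 403, 429
code = 403
unparenthesized_fails = False
try:
    code in first, second
except TypeError:
    unparenthesized_fails = True
```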

In general, Python style prefers iteration over recursion. So instead of retrying like this:

def resolve_redirects(url, tries):
    tries -= 1
    # Several lines of code unrelated to tries
    ...
    except urllib2.HTTPError, e:
        time.sleep(5)
        resolve_redirects(url, tries)

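Do it like this (also converted to use requests, and string formatting rather than concatenation):

def resolve_redirects(url, tries):
    for _ in range(tries):
        response = requests.get(url, headers=...)
        if response.status_code in (403, 429):
            # Refused or rate-limited: report it and try again
            print('HTTP Error: {}'.format(response.status_code))
            continue
        elif response.status_code != 200:
            # Generic error
            response.raise_for_status()
        else:
            return response

Here I've also removed the exception handling for generic errors, because I don't think this is the appropriate place to handle them. Instead, let them bubble up to the main line and handle them there.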
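The retry loop can be exercised without touching the network by passing the fetching function in as a parameter. The sketch below is an assumption-laden variant rather than the code above: `fetch_with_retries`, `FakeResponse` and `make_flaky_fetch` are hypothetical names, and the pause between tries is carried over from the original five-second sleep.

```python
import time

RETRYABLE_STATUSES = (403, 429)


def fetch_with_retries(fetch, url, tries, delay=5.0):
    """Call fetch(url) up to `tries` times.

    Returns the first response whose status code is not retryable,
    or None if every try was refused or rate-limited.
    """
    for attempt in range(tries):
        response = fetch(url)
        if response.status_code not in RETRYABLE_STATUSES:
            return response
        if attempt < tries - 1:
            time.sleep(delay)  # back off before the next try
    return None


class FakeResponse:
    """Stand-in for a real response object; only carries a status code."""

    def __init__(self, status_code):
        self.status_code = status_code


def make_flaky_fetch(statuses):
    """Build a fetch function that returns the given statuses in order."""
    remaining = iter(statuses)

    def fetch(url):
        return FakeResponse(next(remaining))

    return fetch


response = fetch_with_retries(make_flaky_fetch([429, 200]),
                              'http://example.com', tries=3, delay=0)
exhausted = fetch_with_retries(make_flaky_fetch([403, 403]),
                               'http://example.com', tries=2, delay=0)
```

In the bot itself, requests.get (with the User-Agent header bound in) could be passed as fetch. The explicit return None makes the "all tries used up" case visible to the caller instead of leaving it implicit.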

There's an implicit flow here:

try:
    post_url = resolve_redirects(url, 3)
    effective_url = post_url.geturl()
except AttributeError:
    print "AttributeError: Post URL/Effective URL"
    continue

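The AttributeError almost certainly arises because of your earlier exception handling. You print the error, then ignore it and carry on, which makes resolve_redirects fall off the end and return None. Now you can change this guard to except URLError:, which gives you a better idea of what is actually going on.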
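The mechanism is easy to reproduce in a few lines. In this sketch the function `resolve` and the simulated failure are made up for illustration. A function that swallows an exception and has no return statement on that path hands back None, and the AttributeError only shows up later, far from the real cause.

```python
def resolve(url):
    """Made-up stand-in for resolve_redirects with a simulated failure."""
    try:
        raise OSError('simulated network failure')
    except OSError as exc:
        print('Error: {}'.format(exc))
    # No return on this path: the function falls off the end -> None


result = resolve('http://example.com')

message = None
try:
    result.geturl()
except AttributeError as exc:
    message = str(exc)  # complains about NoneType, not about the network
```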

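You should probably also rename post_url, since it isn't really a URL any more (it's a Response, so for want of a better name, let's call it post_response).

This is the right place to handle that error. But rather than calling print here, consider using the logging module.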
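A minimal sketch of what that could look like with the standard logging module. The logger name `repostbot`, the format string and the helper `describe_failure` are assumptions chosen for illustration.

```python
import logging

# Configure logging once at program start; format and level are assumptions.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s',
)
log = logging.getLogger('repostbot')  # hypothetical logger name


def describe_failure(url, error):
    """Log a failed lookup; logging fills in the arguments lazily."""
    log.warning('Could not resolve %s: %s', url, error)


describe_failure('http://example.com/some-link', 'timed out')
```

Compared with print, this adds timestamps and severity levels for free, and the output can later be redirected to a file without changing the calling code.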

Further up:

submissions = list(subreddit.get_new(limit=MAXPOSTS))

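There's no need to turn this result into a list. Anything you can pass to list can be iterated over directly. You only need to convert it to a list if you need to iterate over it more than once (which you don't).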

url = post.url
domain = post.domain

Just use post.url and post.domain directly.

try:    
    post_title = get_title(post_url).encode('utf-8').strip()
except UnicodeDecodeError:
    post_title = unicode(get_title(post_url).strip(),"utf-8")
except UnicodeEncodeError:
    print "UnicodeError: " + post.title
    continue

This is a lovely abomination. It... looks like you're trying to handle arbitrarily encoded pages and normalize them to UTF-8? If so, just do this:

title = get_title(post_response.text).strip().encode('utf8')

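In Python 3, encode won't raise UnicodeDecodeError, because someone realized that was a bit weird. Encoding to utf8 should never raise UnicodeEncodeError, because there are no Unicode code points that utf8 cannot encode.

If you're happy with however the raw bytes happen to be encoded, do this: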

title = get_title(post_response.content).strip()
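The difference between the two options can be shown with the standard library alone. In requests, response.text is the decoded text and response.content is the raw bytes; the sketch below imitates both with a made-up title that a server sends as Latin-1.

```python
# Bytes as a server might send them (stands in for response.content).
raw = 'Café Gewinnspiel'.encode('latin-1')

# Decoded text (stands in for response.text once the charset is known).
text = raw.decode('latin-1')

# Normalized form: the same title always yields the same UTF-8 bytes.
utf8_title = text.strip().encode('utf8')
```

Since the bot compares titles to detect reposts, the normalized form is probably the safer thing to store: the same title served in two different encodings produces different raw bytes, as raw and utf8_title do here, and would not match.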

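For the reposts, you incrementally build up a string message to send to someone. It would be nicer (and probably a little faster) to build up a list of the relevant information instead:

reposts = []
for post in posts:
    ...
    if row:
        # There's a repost: collect the fields in the order the template uses them
        # (row[1] is the original's permalink, row[5] its submitter)
        reposts.append((post.title, post.permalink, author, row[1], row[5]))
    ...
if reposts:
    template = 'Repost: [{}]({}) by /u/{}. Original: [Here]({}) by /u/{}.'
    msg = '\n\n'.join(template.format(*repost) for repost in reposts)
    r.send_message(...)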

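The row object can be accessed by column name (this requires setting row_factory = sqlite3.Row on the connection). Rename the row variable to repost and you can write, for example, repost['permalink'], rather than having to create variables to keep track of what each column is.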
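A short sketch of name-based access, using an in-memory database and a made-up row in place of Reddit_DB.db. The table layout is the one created by the bot.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # in-memory stand-in for Reddit_DB.db
conn.row_factory = sqlite3.Row       # rows now support access by column name
conn.execute('CREATE TABLE duplicates(id TEXT, permalink TEXT, domain TEXT, '
             'url TEXT, title TEXT, submitter TEXT)')

# Made-up sample row for illustration.
conn.execute('INSERT INTO duplicates VALUES (?,?,?,?,?,?)',
             ('a1', '/r/x/a1', 'contest.com',
              'http://contest.com/?ref=Kevin', 'Win a Car', 'Kevin'))

repost = conn.execute('SELECT * FROM duplicates WHERE title=?',
                      ('Win a Car',)).fetchone()
link = repost['permalink']           # instead of row[1]
who = repost['submitter']            # instead of row[5]
```

Index access such as repost[1] keeps working with sqlite3.Row, so existing code does not break when the row factory is switched.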

Here's a neater way of managing the cycles counter:

import itertools as it

for cycle in it.count(1):
    ...
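Combined with the "refresh every third cycle" check, this could look like the sketch below. It starts counting at 0 (an assumption, to keep the original schedule of refreshing on the very first pass), and itertools.islice is only there to keep the demonstration finite; the bot itself would loop over the bare counter.

```python
import itertools

REFRESH_EVERY = 3  # cycles between OAuth refreshes (3 x 15 min = 45 min)

refreshed_on = []
for cycle in itertools.islice(itertools.count(), 7):  # islice bounds the demo
    if cycle % REFRESH_EVERY == 0:
        refreshed_on.append(cycle)  # the bot would refresh OAuth here
    # ... the bot would call replybot() and sleep here
```

A side effect worth noting: the counter now advances on every pass, whereas in the original loop cycles += 1 is skipped whenever replybot() raises.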

Votes: 7

Original content provided by Code Review.

Original link: https://codereview.stackexchange.com/questions/98552
