I'm a moderator of /r/sweepstakes on Reddit, which allows users to post their referral links to contests/giveaways. A major rule is that a user is not allowed to post a link to a contest if another user has already done so. Because every referral link has a distinct URL (i.e. contest.com/?ref=Kevin and contest.com/?ref=Steve), checking for reposts is not so trivial.
I figured a good way to find reposts would be to retrieve the webpage's title (the <title> tag) and store it in a database along with some other important information.
The bot scans for new posts every 15 minutes. For each post it does the following:
- Check the database for the post's pid (post ID). If we already have it, skip and move on to the next post.
- Get the final URL using urllib, since some URLs redirect to another webpage (i.e. bit.ly links).
- Get the webpage's title (<title>) using BeautifulSoup.
- Check whether the title is already in the database. If it is, the post is a repost: fetch some information (permalink, submitter) about the original post and add it to a string that will be messaged to the moderators.
I've run into quite a few issues, mostly related to finding the final URL of a post and finding the page's title. To keep things simple, I may end up removing the function that resolves the final URL, since it isn't really important.
I've also had ASCII/Unicode issues and keep getting UnicodeEncodeError/UnicodeDecodeError exceptions.
Suggestions on how to improve the code would be appreciated.
import traceback
import praw # simple interface to the reddit API, also handles rate limiting of requests
import time
import sqlite3
import re
from urlparse import urlparse
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
import requests
'''USER CONFIGURATION'''
APP_ID = 'XXXX'
APP_SECRET = 'XXXX'
APP_URI = 'XXXX'
APP_REFRESH = 'XXXX'
USERAGENT = 'XXXX'
SUBREDDIT = "XXXX"
MAXPOSTS = 30
WAIT = 900 #15m This is how many seconds you will wait between cycles. The bot is completely inactive during this time.
# Resolve redirects for a URL. i.e. bit.ly/XXXX --> somesite.com/blahblah
# Also input # of retries in case rate-limit
def resolve_redirects(url, tries):
    tries -= 1
    try:
        req = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.125 Safari/537.36"}) # User agent since some sites block python/urllib2 useragent
        return urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        print('HTTPError: ' + str(e.code) + ': ' + domain)
        if (e.code == 403 or e.code == 429) and tries > 0:
            time.sleep(5)
            resolve_redirects(url, tries)
    except urllib2.URLError, e:
        print('URLError: ' + str(e.reason) + ': ' + domain)
    except Exception:
        import traceback
        print('Generic Exception: ' + traceback.format_exc())
# Get title of webpage if possible. Otherwise just set the page title equal to the pages URL
def get_title(url):
    try:
        title = BeautifulSoup(url).title.string.strip()
    except AttributeError:
        title = url.geturl()
    return title.encode('utf-8').strip()
# Load Database
sql = sqlite3.connect('Reddit_DB.db')
print('Loaded SQL Database')
cur = sql.cursor()
# Create Table and Login to Reddit
cur.execute('CREATE TABLE IF NOT EXISTS duplicates(id TEXT, permalink TEXT, domain TEXT, url TEXT, title TEXT, submitter TEXT)')
sql.commit()
print('Logging in...')
r = praw.Reddit(USERAGENT)
r.set_oauth_app_info(APP_ID, APP_SECRET, APP_URI)
r.refresh_access_information(APP_REFRESH)
# Main portion of code
def replybot():
    print('Searching %s @ %s' % (SUBREDDIT, time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))))
    subreddit = r.get_subreddit(SUBREDDIT)
    submissions = list(subreddit.get_new(limit=MAXPOSTS))
    msg = ""
    for post in submissions:
        global domain # Need to be global to use in resolve_redirects()
        pid = post.id
        try:
            author = post.author.name
        except AttributeError:
            print "AttributeError: Author is deleted"
            continue
        # See if we have already looked at this post before. If we have, skip it.
        cur.execute('SELECT * FROM duplicates WHERE ID=?', [pid])
        sql.commit()
        if cur.fetchone(): # Post is already in the database
            continue
        url = post.url
        domain = post.domain
        if domain == "self." + str(SUBREDDIT): # Skip self posts
            continue
        # Get the final url after redirects (i.e. in case URL redirects to a different URL)
        try:
            post_url = resolve_redirects(url, 3)
            effective_url = post_url.geturl()
        except AttributeError:
            print "AttributeError: Post URL/Effective URL"
            continue
        # Get Title of webpage in Final URL
        try:
            post_title = get_title(post_url).encode('utf-8').strip()
        except UnicodeDecodeError:
            post_title = unicode(get_title(post_url).strip(),"utf-8")
        except UnicodeEncodeError:
            print "UnicodeError: " + post.title
            continue
        # Check if the post is a repost by seeing if the Title already exists. If it does, get the Repost's permalink, title, submitter and create the message. Otherwise post is unique and is added to DB
        cur.execute('SELECT * FROM duplicates where TITLE=?', [post_title])
        sql.commit()
        row = cur.fetchone()
        if row:
            repost_permalink = row[1]
            repost_title = row[4]
            repost_submitter = row[5]
            print "Found repost of %s by %s" % (post.title, author)
            msg += 'Repost: [%s](%s) by /u/%s. Original: [Here](%s) by /u/%s.\n\n' % (post.title, post.permalink, author, repost_permalink, repost_submitter)
        else:
            cur.execute('INSERT INTO duplicates VALUES(?,?,?,?,?,?)', [pid, post.permalink, domain, effective_url, post_title, author])
            sql.commit()
    # If message exists (meaning there was a repost), send message to moderators
    if len(msg) > 0:
        r.send_message('/r/sweepstakes', 'Possible Repost', msg)
        print "Sent message"
    else:
        print "Nothing to send"
cycles = 0
while True:
    try:
        # Keep refresh alive by refreshing every 45m
        if cycles % 3 == 0:
            r.refresh_access_information(APP_REFRESH)
            print "Refreshed OAuth"
        replybot()
        cycles += 1
    except Exception as e:
        traceback.print_exc()
    time.sleep(WAIT)
Posted on 2015-07-30 11:28:01
A modern version of Python
The most obvious thing here is to use Python 3. That will go a long way towards helping with your Unicode problems, because Python 3 keeps a much stricter separation between things that Python 2 conflated. In some cases, your errors are just artifacts of Python 2's approach and will simply disappear. In others, you will get errors that give you a better idea of what the problem actually is.
Generally, the only reason to use Python 2 in new code these days is that you have to use one of a dwindling number of libraries that haven't been ported yet. You use three non-stdlib packages: requests and praw both support Python 3.
That leaves BeautifulSoup. The fact that you import it as BeautifulSoup means you are using bs3, which only works on Python 2.x and hasn't been updated since 2012. Upgrade to BeautifulSoup 4 - it is actively maintained (at the time of this post, the latest release was less than 4 weeks old) and supports all current versions of Python.
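To make the change concrete, here is a minimal sketch of the bs4 equivalent (the sample markup is just an illustration; 'html.parser' is the stdlib parser that works without extra dependencies):

# bs3, Python 2 only:
# from BeautifulSoup import BeautifulSoup
# bs4, Python 2 and 3:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><head><title>Example contest</title></head></html>',
                     'html.parser')
print(soup.title.string)  # Example contest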
You import requests, but you also import urllib and urllib2. Of these, the easiest to use for what you want is requests, yet the only one you actually use is urllib2.
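As a quick sketch of what the swap buys you (the URL and user-agent string here are placeholder values, not your configuration): requests follows redirects by default, so finding the effective URL needs no helper at all:

import requests

response = requests.get('http://bit.ly/XXXX',
                        headers={'User-Agent': 'repost-checker/1.0'})
print(response.url)          # the final URL after any redirects
print(response.status_code)  # 200 if everything went fine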
On a smaller note,
e.code == 403 or e.code == 429
can be shortened to:
e.code in (403, 429)
Generally, Python style prefers iteration over recursion. So rather than retrying like this:
def resolve_redirects(url, tries):
    tries -= 1
    # Several lines of code unrelated to tries
    ...
    except urllib2.HTTPError, e:
        time.sleep(5)
        resolve_redirects(url, tries)
do it like this (also converted to use requests, and string formatting instead of concatenation):
def resolve_redirects(url, tries):
    for _ in range(tries):
        response = requests.get(url, headers=...)
        if response.status_code in (403, 429):
            print('HTTP Error: {}'.format(response.status_code))
            continue
        elif response.status_code != 200:
            # Generic error
            response.raise_for_status()
        else:
            return response
Here I have also dropped the exception handling for generic errors, because I don't think this is the appropriate place to handle them. Instead, let them bubble up to the main line and handle them there.
There is a sneaky bit of flow hidden in here:
try:
    post_url = resolve_redirects(url, 3)
    effective_url = post_url.geturl()
except AttributeError:
    print "AttributeError: Post URL/Effective URL"
    continue
The AttributeError almost certainly shows up because of your earlier exception handling: you print the error, then ignore it and carry on, which lets resolve_redirects fall off the end and return None. For now, you could change this guard to except URLError: so that you get a better picture of what is actually going on.
You should probably also rename post_url, since it is not really a URL any more (it is a Response, so for lack of a better name, let's call it post_response).
This is the right place to handle that error. Rather than calling print here, though, consider using the logging module.
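A minimal sketch of what that could look like, assuming the requests-based resolve_redirects above (so failures surface as requests.exceptions.RequestException; the logger name is arbitrary):

import logging
import requests

logging.basicConfig(format='%(asctime)s %(levelname)s %(message)s',
                    level=logging.INFO)
log = logging.getLogger('repost_bot')  # arbitrary logger name

try:
    post_response = resolve_redirects(url, 3)
    effective_url = post_response.url
except requests.exceptions.RequestException:
    # logs the message plus the full traceback
    log.exception('Could not resolve final URL for %s', url)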
Further up:
submissions = list(subreddit.get_new(limit=MAXPOSTS))
There is no need to turn this result into a list. Anything you can pass to list you can iterate over directly. You would only need to convert it to a list if you had to iterate over it more than once (you don't).
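That is, simply:

for post in subreddit.get_new(limit=MAXPOSTS):
    ...

Similarly, for these two lines: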
url = post.url
domain = post.domain
Just use post.url and post.domain directly.
try:
    post_title = get_title(post_url).encode('utf-8').strip()
except UnicodeDecodeError:
    post_title = unicode(get_title(post_url).strip(),"utf-8")
except UnicodeEncodeError:
    print "UnicodeError: " + post.title
    continue
This is an adorable abomination. It... looks like you are trying to handle pages in arbitrary encodings, and normalise them all to UTF-8? If so, just do:
title = get_title(post_response.text).strip().encode('utf8')
In Python 3, encode will not raise UnicodeDecodeError, because someone realised that was rather odd. And encoding to UTF-8 should never raise UnicodeEncodeError, because there is no Unicode code point that UTF-8 cannot encode.
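A quick demonstration of that Python 3 behaviour (the title text is an arbitrary example):

title = 'Café sweepstakes'        # str, i.e. text, in Python 3
data = title.encode('utf-8')      # bytes; encoding text to UTF-8 always succeeds
assert data.decode('utf-8') == title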
If you are happy with the raw bytes in whatever encoding they arrived in, do this instead:
title = get_title(post_response.content).strip()
For the reposts, you build up a string message piece by piece to send to someone. It is nicer (and probably a little faster) to build up a list of the relevant information instead:
reposts = []
for post in posts:
    ...
    if row:
        # There's a repost
        reposts.append((tuple of the things you currently make a string for))
    ...
if reposts:
    msg = 'Repost: [{}]({}) by /u/{}. Original: [Here]({}) by /u/{}.'
    msg = '\n\n'.join(msg.format(*post) for post in reposts)
    r.send_message(...)
The row object can be accessed by column name - rename the row variable to repost and you can write, for example, repost['permalink'], instead of creating a separate variable to keep track of what each field is.
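Note that this requires opting in to sqlite3's Row class; a minimal sketch against the duplicates table from the question (post_title reuses your existing variable):

import sqlite3

sql = sqlite3.connect('Reddit_DB.db')
sql.row_factory = sqlite3.Row  # rows become addressable by column name
cur = sql.cursor()

cur.execute('SELECT * FROM duplicates WHERE title=?', [post_title])
repost = cur.fetchone()
if repost:
    print(repost['permalink'], repost['submitter'])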
Finally, here is a neater way of managing the cycles counter:
import itertools as it
for cycle in it.count(1):
    ...
https://codereview.stackexchange.com/questions/98552