Automatically Scraping the Latest Tax Regulations

Source: 企鹅号 (Penguin Account) - 微税搜

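Background:

In day-to-day study and work, the number of financial and tax regulations we need to keep up with keeps growing. In this article we use Python to write a small program that tracks the regulations most recently published on the tax authority's website. It is meant as a starting point to prompt further ideas.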

Preparation:

Install Python

Write the program in Jupyter Notebook (installing the Anaconda platform is one option)

Use requests / BeautifulSoup / Pandas to fetch and parse the website content
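
Before running the program, it can help to confirm that the three libraries are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and not part of the original program.

```python
from importlib.util import find_spec

# Import names of the libraries the article relies on; "bs4" is the
# import name of the BeautifulSoup package.
REQUIRED = ["requests", "bs4", "pandas"]


def missing_packages(names):
    """Return the names from `names` that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```

If anything is reported missing, the corresponding PyPI packages are requests, beautifulsoup4 and pandas.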

Sample program:

Suppose we need to keep up with the latest regulations published by the State Taxation Administration.

#--------- Source code ---------

# Scrape the latest regulations from chinatax.gov.cn

from urllib.parse import urljoin

import pandas
import requests
from bs4 import BeautifulSoup

LIST_URL = 'http://www.chinatax.gov.cn/n810341/n810755/index.html'


def chinataxfile(url):
    """Fetch one regulation page and return its title, document number and body text."""
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # Build a fresh dict on every call so earlier results are never overwritten
    return {
        'linkage': url,
        'article': soup.select('.sv_texth1_red')[0].text,
        'filenu': soup.select('.sv_black14_30')[0].text,
        'maintext': ''.join(p.text.strip() for p in soup.select('.sv_texth3 p')),
    }


# Fetch the listing page
res = requests.get(LIST_URL)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

# Collect every link in the listing block and scrape each linked page
links = soup.select('#comp_2420064 a')
details = []
for link in links:
    # The hrefs are relative ('../../...'); resolve them against the listing URL
    details.append(chinataxfile(urljoin(LIST_URL, link.get('href'))))

df = pandas.DataFrame(details)
df.head(len(links))

#---------- End --------
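
The links on the listing page are relative (the program expects them to begin with '../../'), so each one has to be turned into an absolute URL before it can be fetched. The sketch below shows one way to do that with urljoin from the standard library. The helper name absolute_link and the sample href are hypothetical, chosen only to illustrate the shape of the links.

```python
from urllib.parse import urljoin

# Listing page URL taken from the program above.
LIST_URL = "http://www.chinatax.gov.cn/n810341/n810755/index.html"


def absolute_link(href, base=LIST_URL):
    """Resolve an href from the listing page against the page's own URL."""
    return urljoin(base, href.strip())


# Hypothetical href in the '../../' form the program expects.
sample = "../../n810341/n810755/c0000001/content.html"
print(absolute_link(sample))
# http://www.chinatax.gov.cn/n810341/n810755/c0000001/content.html

# str.strip removes any of the listed characters from both ends, not a
# prefix, so a path ending in '/' or '.' loses those characters as well.
print("../../n810341/n810755/".strip("../../"))
# n810341/n810755
```

Resolving against the listing URL also keeps working if the site switches to root-relative or absolute links, which plain string concatenation would not.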

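Output: a table with one row per regulation and the columns linkage, article, filenu and maintext.

P.S. The results can also be exported to a local Excel file, and a similar approach can be used to parse other information sources.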
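
As a rough sketch of the export step, the records can be written to a CSV file that Excel opens directly. This uses only the standard library rather than pandas; the function name export_rows, the output file name and the sample record are all hypothetical.

```python
import csv

# Column names match the keys produced by chinataxfile.
COLUMNS = ["linkage", "article", "filenu", "maintext"]


def export_rows(rows, path):
    """Write a list of result dicts to a CSV file that Excel can open."""
    # utf-8-sig adds a byte order mark so Excel detects the encoding
    # of Chinese text correctly.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical record standing in for one scraped regulation.
rows = [{
    "linkage": "http://www.chinatax.gov.cn/example/content.html",
    "article": "Example title",
    "filenu": "Example document number",
    "maintext": "Example body text",
}]

if __name__ == "__main__":
    export_rows(rows, "chinatax_latest.csv")
```

With the DataFrame from the program, df.to_excel('chinatax_latest.xlsx', index=False) writes a native .xlsx file instead, provided an Excel writer engine such as openpyxl is installed.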

Published: 2018-04-16 18:27:24
Original link: http://kuaibao.qq.com/s/20180416G1ASTB00?refer=cp_1026