Shell Script Practice 1 (Web Scraper)

Author: 醉生萌死
Last modified: 2018-11-05 22:58:18

System Environment

[root@m01 scripts]# uname -r
2.6.32-696.el6.x86_64
[root@m01 scripts]# uname -m
x86_64
[root@m01 scripts]# cat /etc/redhat-release 
CentOS release 6.9 (Final)

Shell Exercise 1

#!/bin/bash
# date: 2018-03-xx
# author: yk
# description: scrape a 51cto blogger's post list
# version: 0.1

source /etc/profile
. /etc/init.d/functions

# Temporary working file
TmpFile="/tmp/.$(date +%Y%m%d_%H%M%S).log.tmp"
touch "$TmpFile"
# File that stores the fetched page content
BlogFile="/tmp/$(date +%Y%m%d_%H%M%S)_blog.html"
touch "$BlogFile"

# Ask the user for the 51cto blogger's homepage URL
read -p 'please input website: ' Website
# Fetch the blogger's homepage
wget -q -O "$TmpFile" "$Website" &>/dev/null
[ $? -ne 0 ] && echo "the URL you entered does not exist" && exit 1

# The "last page" link on the homepage contains the total page count
MainURL=$(sed -n '/class="last".*末页.*/p' "$TmpFile" | egrep -o 'http:.*p[0-9]{1,}')

# e.g. 28 pages
Pages=$(echo "$MainURL" | sed -n 's#^.*p##gp')

# If the URL is not a blogger's homepage, the extracted value will not be
# a number, so the numeric test below fails (its error output is discarded)
if [ "$Pages" -gt 0 ] &>/dev/null
then
	echo "please wait ......"
else
	echo "the URL you entered is not a blogger's homepage"
	rm -f "$TmpFile"
	rm -f "$BlogFile"
	exit 1
fi



# Base URL: everything except the trailing page number
UR=$(echo "$MainURL" | sed -rn 's#[0-9]{1,}$##gp')

# Visit every page
for ((i=1; i<=Pages; i++))
do
	# Append the page number to get the full page URL
	wget -q -O "$TmpFile" "${UR}${i}" &>/dev/null
	# Extract each post's time, title and link
	egrep -A 1 '<a class="tit" | class="time' "$TmpFile" | sed '/^\-\-/d' | sed -r 's#[ ]+# #g' >>"$BlogFile"
	# Pause 0.05 s so we do not hit the server too fast
	sleep 0.05
done

# Empty the temp file for reuse
>"$TmpFile"


# ===============================================================
action "The blogger's posts have been downloaded locally" /bin/true
echo "Extracting the required data from the downloaded pages ......"
echo "please wait ....."
# ===============================================================


i=0
# Extract the wanted fields line by line
while read -r line
do
	# Every post occupies 4 lines, so cycle a counter from 1 to 4
	((++i))
	case "$i" in
		1)
			# posting time
			Time=$(echo "$line" | sed -r 's#^.*>发布于:(.*)</a>#\1#g')
			;;
		3)
			# post link
			Href=$(echo "$line" | sed -r 's#^.*href=\"(.*)\">#\1#g')
			;;
		4)
			# post title
			Title=$(echo "$line" | sed -r 's#^(.*)<.*$#\1#g')
			;;
		*)
			;;
	esac
	# Every 4th line completes one post; append the collected fields
	if [ "$i" -eq "4" ]
	then
		i=0
		echo "<a href=\"$Href\">$Time---$Title</a><br/>" >>"$TmpFile"
	fi
done < "$BlogFile"
# Empty the output file
>"$BlogFile"
# Sort by time (field 2 after splitting on ">"), newest first
sort -rt '>' -k2 "$TmpFile" >>"$BlogFile"
rm -f "$TmpFile"

action "success" /bin/true
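
Two tricks in the script are worth isolating. First, the page count is carved out of the "last page" URL with a greedy sed substitution; second, `[ ... -gt 0 ]` doubles as an is-it-a-number check, because `[` fails loudly (to stderr) on a non-numeric operand. A minimal sketch, using a made-up URL rather than live 51cto data:

```shell
#!/bin/bash
# Hypothetical example URL, for illustration only.
MainURL="http://blog.51cto.com/example/p28"

# Greedy ".*p" eats everything up to the LAST "p", leaving the page count.
Pages=$(echo "$MainURL" | sed -n 's#^.*p##gp')
echo "$Pages"

# The base URL is the same string with the trailing digits stripped.
BaseURL=$(echo "$MainURL" | sed -rn 's#[0-9]+$##gp')
echo "$BaseURL"

# Numeric test doubles as validation: on a non-number, "[" writes an
# error to stderr (discarded here) and returns non-zero.
if [ "$Pages" -gt 0 ] 2>/dev/null
then
    echo "looks like a page count"
fi
```

Note this is why the script checks `-gt 0` instead of pattern-matching the string: a failed comparison covers both "not a number" and "zero pages" in one branch.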

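The final `sort -rt '>' -k2` relies on the line format the loop writes out: since each line is `<a href="...">TIME---TITLE</a><br/>`, splitting on `>` puts the timestamp at the start of field 2, and a reverse text sort on that field orders posts newest-first. A small demo with two fabricated lines (file name and contents are made up):

```shell
#!/bin/bash
# Two sample lines in the same format the script produces.
printf '%s\n' \
  '<a href="url1">2018-01-05---older post</a><br/>' \
  '<a href="url2">2018-03-20---newer post</a><br/>' \
  > /tmp/demo_blog.html

# -t '>' splits fields on ">", so the date leads field 2;
# -r reverses the order, putting the newest post first.
sort -rt '>' -k2 /tmp/demo_blog.html
```

Because the dates are zero-padded `YYYY-MM-DD`, plain lexicographic order matches chronological order, so no `-n` or date parsing is needed.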
Note: for reference only.
