Python Code for Scraping Movie Information from 电影天堂 (dytt8.net)

Published 2024/12/30 22:26:30
Environment: Python 2.7, Mac OS
The script scrapes the latest-movies page of 电影天堂. URL: http://www.dytt8.net/html/gndy/dyzz/index.html
Getting the movie detail-page links from the list page
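    # -*- coding: utf-8 -*-
    import urllib2
    import re

    # Collected movie detail-page URLs
    movieurls = []

    # NOTE: the HTML tags inside the two regex patterns below were stripped
    # when this page was extracted. The patterns are kept as they survive;
    # restore the tags from the list page's HTML source before running.

    # Fetch the movie list
    def querymovielist():
        url = 'http://www.dytt8.net/html/gndy/dyzz/index.html'
        content = urllib2.urlopen(url).read()
        # Decode the page as GB2312 and re-encode it as UTF-8,
        # dropping any bytes that cannot be converted
        content = content.decode('gb2312', 'ignore').encode('utf-8', 'ignore')
        # First pass over the whole page
        pattern = re.compile('.*?>' + '(.*?) ', re.S)
        items = re.findall(pattern, content)
        joined = ''.join(items)
        # Second pass: the first group of each match is the detail-page path
        pattern = re.compile('(.*?).*?(.*?) ', re.S)
        news = re.findall(pattern, joined)
        for j in news:
            movieurls.append('http://www.dytt8.net' + j[0])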
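As a rough illustration of the two steps in querymovielist() (decode the GB2312 response, then pull out the detail-page links), here is a minimal offline sketch. It is written for Python 3, whereas the script in this post uses Python 2.7's urllib2. The sample markup, the `ulink` class name, the pattern, and the helper name `extract_detail_urls` are all assumptions made for the example; the real page's markup may differ.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'http://www.dytt8.net'

# Hypothetical list-page fragment; the real page's markup may differ.
SAMPLE_HTML = (
    '<table><tr><td><b>'
    '<a href="/html/gndy/dyzz/sample1.html" class="ulink">《示例一》</a>'
    '</b></td></tr></table>'
    '<table><tr><td><b>'
    '<a href="/html/gndy/dyzz/sample2.html" class="ulink">《示例二》</a>'
    '</b></td></tr></table>'
)

# Assumed pattern: capture the href and the link text of each "ulink" anchor.
LINK_PATTERN = re.compile(r'<a href="(.*?)" class="ulink">(.*?)</a>', re.S)


def extract_detail_urls(raw_bytes, base_url=BASE_URL):
    """Decode a GB2312 response body and return absolute detail-page URLs."""
    # 'ignore' drops bytes that are not valid GB2312 instead of raising.
    html = raw_bytes.decode('gb2312', 'ignore')
    return [urljoin(base_url, href) for href, _title in LINK_PATTERN.findall(html)]


# Simulate what a server would send: GB2312-encoded bytes.
urls = extract_detail_urls(SAMPLE_HTML.encode('gb2312'))
for url in urls:
    print(url)
```

Using `urljoin` instead of string concatenation keeps the result correct whether the captured href is a site-relative path or already an absolute URL.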
Scraping the movie data from each detail page
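    # NOTE: the HTML tags inside the regex patterns below were stripped when
    # this page was extracted, along with part of the basic-info step. The
    # patterns are kept as they survive; restore the tags from a detail
    # page's HTML source before running.
    def querymovieinfo(movieurls):
        for index, item in enumerate(movieurls):
            print('Movie URL: ' + item)
            content = urllib2.urlopen(item).read()
            content = content.decode('gb2312', 'ignore').encode('utf-8', 'ignore')

            # Movie title
            moviename = re.findall(r'(.*?)', content, re.S)
            if len(moviename) > 0:
                moviename = moviename[0] + ''
                # Keep the text between 《 and 》 (《 is 3 bytes in UTF-8)
                moviename = moviename[moviename.find('《') + 3:moviename.find('》')]
            else:
                moviename = ''
            print('Movie title: ' + moviename.strip())

            # Main content block of the detail page
            moviecontent = re.findall(r'(.*?)', content, re.S)
            if len(moviecontent) == 0:
                # Nothing matched, so skip this page instead of failing on [0]
                continue

            # Publication time: the last 10 characters of the match
            pattern = re.compile('(.*?)', re.S)
            moviedate = re.findall(pattern, moviecontent[0])
            if len(moviedate) > 0:
                moviedate = moviedate[0].strip() + ''
            else:
                moviedate = ''
            print('Published: ' + moviedate[-10:])

            # Basic information. The findall call for this step was lost in
            # extraction; searching moviecontent[0] is assumed here, matching
            # the other steps.
            pattern = re.compile('(.*?)', re.S)
            movieinfo = re.findall(pattern, moviecontent[0])
            if len(movieinfo) > 0:
                movieinfo = movieinfo[0] + ''
                # Remove line-break tags (the tag itself was stripped from
                # this page; '<br />' is assumed)
                movieinfo = movieinfo.replace('<br />', '')
                # Split on the ◎ marker
                movieinfo = movieinfo.split('◎')
            else:
                movieinfo = ''
            print('Basic information: ')
            for field in movieinfo:
                print(field)

            # Movie poster
            pattern = re.compile('', re.S)
            movieimg = re.findall(pattern, moviecontent[0])
            if len(movieimg) > 0:
                movieimg = movieimg[0]
            else:
                movieimg = ''
            print('Poster: ' + movieimg)

            # Download link
            pattern = re.compile('.*? ', re.S)
            moviedownurl = re.findall(pattern, moviecontent[0])
            if len(moviedownurl) > 0:
                moviedownurl = moviedownurl[0]
            else:
                moviedownurl = ''
            print('Download link: ' + moviedownurl)
            print('------------------------------------------------\n\n\n')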
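The title extraction between 《 and 》 and the split on ◎ used in querymovieinfo() can be tried on a small string without touching the network. This is a minimal Python 3 sketch: the sample text and the helper names `extract_title` and `split_fields` are made up for the example, and the tag being removed is assumed to be `<br />`.

```python
def extract_title(heading):
    """Return the text between 《 and 》, or '' if either mark is missing."""
    start = heading.find('《')
    end = heading.find('》')
    if start == -1 or end == -1 or end < start:
        return ''
    # On Python 3 text strings 《 is one character, so skip 1 (not 3 bytes).
    return heading[start + 1:end].strip()


def split_fields(info_html):
    """Strip <br /> tags (assumed), split on ◎, and drop empty pieces."""
    text = info_html.replace('<br />', '')
    return [part.strip() for part in text.split('◎') if part.strip()]


# Hypothetical sample data, not copied from a real detail page.
sample_heading = '2017年剧情片《示例电影》BD中字'
sample_info = '◎译名 Sample Movie<br />◎年代 2017<br />◎产地 中国<br />'

title = extract_title(sample_heading)
fields = split_fields(sample_info)
print(title)
for field in fields:
    print(field)
```

The offset differs from the script above because Python 2.7 slices the UTF-8 byte string, where 《 occupies three bytes, while Python 3 slices by character.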
Running the scraper
    if __name__ == '__main__':
        print('Starting to scrape movie data')
        querymovielist()
        print(len(movieurls))
        querymovieinfo(movieurls)
        print('Finished scraping movie data')
Summary
Learning regular expressions well is important, important, important! Python's syntax feels great compared with Java …
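Since every pattern in the script is a non-greedy `(.*?)` compiled with the `re.S` flag, a tiny Python 3 example may help show what those two pieces do. The HTML string is invented for the illustration.

```python
import re

# Invented fragment; a newline sits inside the first <p>.
html = '<p>first\nline</p><p>second</p>'

# Greedy: .* runs to the LAST </p>, so both paragraphs merge into one match.
greedy = re.findall(r'<p>(.*)</p>', html, re.S)

# Non-greedy: .*? stops at the FIRST </p>, giving one match per paragraph.
lazy = re.findall(r'<p>(.*?)</p>', html, re.S)

# Without re.S, '.' does not match '\n', so the first paragraph is skipped.
no_dotall = re.findall(r'<p>(.*?)</p>', html)

print(greedy)
print(lazy)
print(no_dotall)
```

Scraped HTML usually spans many lines, which is why the flag matters here: without it, any block containing a line break silently fails to match.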