Recently I built a Maoyan crawler and did some data analysis on the results. I learned a lot, so I am recording it here. The crawler and the data analysis are two separate modules; see the table of contents:
Table of Contents
Data analysis is the second part; I will update it later when I have time.
Part 1: Maoyan Crawler
1. Maoyan crawler, step 1: find the data we need
Open the Maoyan Top 100 board page: https://www.maoyan.com/board/4 — this is the page we want to crawl.
Press F12 to open the developer tools so that we can locate the tags of the elements we need.
Tip: you can also press Ctrl+U to open the raw HTML source of the page, which gives a clearer view for locating elements. Use Ctrl+F inside it to jump straight to the elements we need. Opening this view is not required; it just makes the relationships between the tags easier to see.
In this way we can determine what data we need and where it lives, and then start fetching it.
2. Maoyan crawler, step 2: fetch the data
First, import the libraries we need:
# for data scraping
#encoding:utf-8
import requests
from bs4 import BeautifulSoup
import time as ti
import csv
from lxml import etree
import re
# for data analyzing
import pandas as pd
Next, fetch the data. The code is as follows:
# Part 1: disguise the request and handle the response
def get_html(url):
    # Many pages have anti-crawler measures, so we add a headers disguise.
    # On the Maoyan movie page: F12 -- Network -- All -- pick a request -- Headers -- find the User-Agent,
    # then copy and paste the content.
    headers = {  # set the headers
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36 Edg/94.0.992.50',
        'referer': 'https://passport.meituan.com/',
        'Cookie': '__mta=42753434.1633656738499.1634781127005.1634781128998.34; uuid_n_v=v1; _lxsdk_cuid=17c5d879290c8-03443510ba6172-6373267-144000-17c5d879291c8; uuid=60ACEF00317A11ECAAC07D88ABE178B722CFA72214D742A2849B46660B8F79A8; _lxsdk=60ACEF00317A11ECAAC07D88ABE178B722CFA72214D742A2849B46660B8F79A8; _csrf=94b23e138a83e44c117736c59d0901983cb89b75a2c0de2587b8c273d115e639; Hm_lvt_703e94591e87be68cc8da0da7cbd0be2=1634716251,1634716252,1634719353,1634779997; Hm_lpvt_703e94591e87be68cc8da0da7cbd0be2=1634781129; _lxsdk_s=17ca07b2470-536-b73-84%7C%7C12'
    }
    # Send the request; combined with the headers disguise, this makes the server
    # think a person, not a crawler, is browsing.
    result = requests.get(url, headers=headers)
    # Because the crawler is fast, the server may reject us (e.g. with a 403),
    # so check the status code; 200 means success.
    if result.status_code == 200:
        # The request succeeded; return the page source as a string.
        return result.text
    return
The headers exist so that the site treats our program as a human visitor rather than as a crawler. To fill them in, open the Maoyan home page, press F12, go to Network -- All -- pick a request -- Headers, and copy the corresponding fields over. Strictly speaking, headers are not mandatory; they are only there to avoid Maoyan's anti-crawling measures, and usually a User-Agent alone is enough. I still got blocked at the time, though, and a classmate suggested adding more fields, which is why mine are so detailed.
With this in place, we can retrieve the full HTML of the page.
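As a quick sanity check before moving on, a short sketch like the one below (my own addition, not part of the original script; the variable names are just for illustration) fetches the board page once and prints the length of the returned HTML, so you can tell whether the request went through or was blocked:
test_url = 'https://www.maoyan.com/board/4'
page = get_html(test_url)
if page:
    # A successful fetch returns the page source as one long string.
    print(len(page))
else:
    # get_html returns None when the status code is not 200 (e.g. blocked with 403).
    print('Request failed or was blocked.')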
3. Maoyan crawler, step 3: parse the data
Once we have the HTML of the page, we need to parse it, locate the information we want, and extract it. The code is as follows:
def parsing_html(html):
    ti.sleep(1)
    #patter = re.compile('.*?board-index')
    bsSoup = BeautifulSoup(html, 'html.parser')
    #a = [x.find("i").text for x in bsSoup.find_all("dd")]
    # Each movie on the board page sits in its own <dd> tag.
    movies = bsSoup.find_all("dd")
    a = []
    for i in movies:
        ti.sleep(0.1)
        # Fields available directly on the list page.
        rating = i.find('i').text
        title = i.find("a").get("title")
        actors = re.findall("主演:(.*)", i.find("p", class_="star").text)[0]
        time = re.findall("上映时间:(.*)", i.find("p", class_="releasetime").text)[0]
        url1 = "https://maoyan.com" + i.find("p", class_="name").a.get("href")
        score = i.find("i", class_="integer").text + i.find("i", class_="fraction").text
        # The remaining fields live on the movie's detail page, so request that page too.
        movie = get_html(url1)
        bsMovie = BeautifulSoup(movie, 'html.parser')
        #print(bsMovie)
        director = bsMovie.find("a", class_="name").text.replace("\n", "").replace(" ", "")
        income = bsMovie.find_all("div", class_="mbox-name")
        income = income[-2].text if income else "暂无"
        location_and_duration = bsMovie.find("div", class_="movie-brief-container").find_all("li", class_="ellipsis")[1].text.split('/')
        duration = location_and_duration[1].strip()
        location = location_and_duration[0].strip()
        ti.sleep(0.5)
        m_type_list = [t.text.strip() for t in bsMovie.find("div", class_="movie-brief-container").find("li", class_="ellipsis").find_all("a", class_="text-link")]
        m_type = ','.join(m_type_list)
        ti.sleep(0.2)
        #print(m_type)
        # Collect everything for this movie into one record.
        c = {'Rating': rating,
             'Title': title,
             'Name of director': director,
             'Name of actors': actors,
             'Cumulative income': income,
             'Duration': duration,
             'Type': m_type,
             'Country or a Region': location,
             'Release time': time,
             'Web link': url1,
             'Score': score
             }
        a.append(c)
    return a
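To make the two re.findall calls above a little clearer, here is a tiny illustration (my own addition) using made-up sample strings in the same format as the list page; the capture group keeps everything after the label:
star_text = "主演:张国荣,张丰毅,巩俐"
release_text = "上映时间:1993-07-26"
print(re.findall("主演:(.*)", star_text)[0])        # -> 张国荣,张丰毅,巩俐
print(re.findall("上映时间:(.*)", release_text)[0])  # -> 1993-07-26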
4. Maoyan crawler, step 4: store the file
Now that we have located the information we need and can scrape it, we need to store it in a file:
def write_to_file(content):
    # newline='' prevents the csv writer from inserting blank rows on Windows
    with open('maoyan.csv', 'a', newline='', encoding='utf-8-sig') as csvfile:
        writer = csv.writer(csvfile)
        values = list(content.values())
        writer.writerow(values)
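The original script adds the header row afterwards with pandas (later in this step). If you prefer, an alternative is to write the header once before the crawl starts; a possible sketch (my own addition -- if you use it, skip the pandas rewrite at the end):
def write_header():
    # Run once, before the crawl, so the CSV is self-describing from the start.
    header = ['Rating', 'Title', 'Name of director', 'Name of actors',
              'Cumulative income', 'Duration', 'Type', 'Country or a Region',
              'Release time', 'Web link', 'Score']
    with open('maoyan.csv', 'w', newline='', encoding='utf-8-sig') as csvfile:
        csv.writer(csvfile).writerow(header)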
We want the Top 100, but each page only lists ten movies, so we need to loop ten times to cover all 100:
def next_page(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_html(url)
    for item in parsing_html(html):
        print(item)
        write_to_file(item)
Note that I also print each record as it is written to the file, which makes it easier to inspect the results and adjust the code.
Call the functions above in a loop of ten iterations:
for i in range(10):
    next_page(offset=10*i)
    ti.sleep(1)
The results are printed as the script runs:
Exactly one hundred records, which is just what we wanted!
Finally, we add a header row to the scraped file to make lookups easier:
df = pd.read_csv("maoyan.csv", header=None, index_col=None)
df.columns = ['Rating',
              'Title',
              'Name_of_director',
              'Name_of_actors',
              'Cumulative_income',
              'Duration',
              'Type',
              'Country_or_a_Region',
              'Release_time',
              'Web_link',
              'Score']
df.to_csv("maoyan.csv", index=False)
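To confirm the rewrite worked, a quick check like the following (my own addition) reloads the file and prints its shape and first rows; if the crawl completed, the shape should be (100, 11):
check = pd.read_csv("maoyan.csv")
print(check.shape)   # expected: (100, 11)
print(check.head())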
Open the file we just wrote in its folder; the result looks like this:
It looks great. That wraps up the crawler part!
Original: https://blog.csdn.net/qq_44665162/article/details/121103280
Author: 茱迪chen
Title: 【python】猫眼爬虫Top100电影信息
Related Reading
Title: RuntimeError: element 0 of tensors does not require grad and does not have a grad_
Today, while running my code, I was training a model and then, in the test phase, using PGD to generate adversarial images (adv_image) to evaluate it. When execution reached the test phase, the following problem came up.
The error message is:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
My code is as follows:
def validate_roubst(val_loader, model, criterion, epoch, args, log=None, tf_writer=None, flag='roubst_val'):
    batch_time = AverageMeter('Time', ':6.3f')
    losses = AverageMeter('Loss', ':.4e')
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    model.eval()
    all_preds = []
    all_targets = []
    with torch.no_grad():
        end = time.time()
        for i, (input, target) in enumerate(val_loader):
            if args.gpu is not None:
                print('............')
                input = input.cuda(args.gpu, non_blocking=True)
            target = target.cuda(args.gpu, non_blocking=True)
            attack_method = PGD(model, args.device)
            adv_example = attack_method.generate(input, target, epsilon=8/255, num_steps=20, step_size=0.01, clip_max=1.0, clip_min=0.0, print_process=False, bound='linf')
            output = model(adv_example)
            loss = criterion(output, target)
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            losses.update(loss.item(), input.size(0))
            top1.update(acc1[0], input.size(0))
            top5.update(acc5[0], input.size(0))
            batch_time.update(time.time() - end)
            end = time.time()
            _, pred = torch.max(output, 1)
            all_preds.extend(pred.cpu().numpy())
            all_targets.extend(target.cpu().numpy())
            if i % args.print_freq == 0:
                output = ('Test: [{0}/{1}]\t'
                          'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                          'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                          'Prec@1 {top1.val:.3f} ({top1.avg:.3f})\t'
                          'Prec@5 {top5.val:.3f} ({top5.avg:.3f})'.format(
                              i, len(val_loader), batch_time=batch_time, loss=losses,
                              top1=top1, top5=top5))
                print(output)
        cf = confusion_matrix(all_targets, all_preds).astype(float)
        cls_cnt = cf.sum(axis=1)
        cls_hit = np.diag(cf)
        cls_acc = cls_hit / cls_cnt
        output = ('{flag} Results: Prec@1 {top1.avg:.3f} Prec@5 {top5.avg:.3f} Loss {loss.avg:.5f}'
                  .format(flag=flag, top1=top1, top5=top5, loss=losses))
        out_cls_acc = '%s Class Accuracy: %s' % (flag, (np.array2string(cls_acc, separator=',', formatter={'float_kind': lambda x: "%.3f" % x})))
        print(output)
        print(out_cls_acc)
        if log is not None:
            log.write(output + '\n')
            log.write(out_cls_acc + '\n')
            log.flush()
        tf_writer.add_scalar('loss/test_' + flag, losses.avg, epoch)
        tf_writer.add_scalar('acc/test_' + flag + '_top1', top1.avg, epoch)
        tf_writer.add_scalar('acc/test_' + flag + '_top5', top5.avg, epoch)
        tf_writer.add_scalars('acc/test_' + flag + '_cls_acc', {str(i): x for i, x in enumerate(cls_acc)}, epoch)
    return top1.avg
Now that something has gone wrong, we naturally need a fix:
2.1 Solution 1
Most people say you just need to add this line:
loss.requires_grad_(True)  # just add this line
The concrete change is:
loss = criterion(output, target)
loss.requires_grad_(True)  # add this line at this spot
...
loss.backward()
But after trying it myself, it did not help at all: the error never shows up during the training phase, only during the test phase.
2.2 Solution 2
Going back to the root cause, or simply reading the error report for what it says, the hint roughly means that the element does not require a gradient.
Then I took a close look at my code and spotted something suspicious: with torch.no_grad()
Finally, I checked its usage rules carefully (reference 1):
with torch.no_grad() is mainly used to stop the autograd module from doing any work, in order to speed things up and save GPU memory. Concretely, it disables gradient computation, which saves GPU compute and memory, but it does not change the behaviour of dropout or batchnorm layers.
That is the key point: under with torch.no_grad(), gradients are no longer computed automatically. Since generating adv_image with PGD requires gradients, wrapping the code in with torch.no_grad() makes it impossible to compute them, which is exactly what produces the error above.
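To see this concretely, here is a minimal, self-contained sketch (my own illustration, not the original training code) showing that a forward pass wrapped in torch.no_grad() produces a loss with requires_grad=False, so calling backward() on it raises exactly this RuntimeError:
import torch

x = torch.randn(4, 3, requires_grad=True)   # stands in for an input PGD needs gradients for
w = torch.randn(3, 1, requires_grad=True)

with torch.no_grad():
    loss = (x @ w).sum()      # autograd is switched off here, so no grad_fn is recorded

print(loss.requires_grad)     # False

try:
    loss.backward()           # raises the RuntimeError quoted above
except RuntimeError as e:
    print(e)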
So the fix is:
Remove the with torch.no_grad() block.
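In the validation function above, that simply means deleting the with torch.no_grad(): line and dedenting its body. If you still want the memory savings of no_grad() for the plain evaluation forward pass, one possible restructuring is sketched below (my own sketch, assuming the PGD, model and criterion objects from the code above); the key point is that the adversarial-example generation must stay outside of no_grad():
for i, (input, target) in enumerate(val_loader):
    input = input.cuda(args.gpu, non_blocking=True)
    target = target.cuda(args.gpu, non_blocking=True)

    # PGD needs gradients w.r.t. the input, so this part runs with autograd enabled.
    attack_method = PGD(model, args.device)
    adv_example = attack_method.generate(input, target, epsilon=8/255, num_steps=20,
                                         step_size=0.01, clip_max=1.0, clip_min=0.0,
                                         print_process=False, bound='linf')

    # Only the final evaluation forward pass is wrapped in no_grad(),
    # which saves memory without breaking the attack.
    with torch.no_grad():
        output = model(adv_example)
        loss = criterion(output, target)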
Original: https://blog.csdn.net/wyf2017/article/details/123156380
Author: 流年若逝
Title: RuntimeError: element 0 of tensors does not require grad and does not have a grad_
