
International News · 2020-02-15

This crawl uses the following libraries:

  1. selenium
  2. pymysql
  3. pyquery

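Main text

  1. Analyze the target website
  2. Open the home page of the shopping site (the author does not name it), type "男装" (menswear) and click "Search". The browser jumps to the search results page for "男装".
  3. Right-click a blank area of the page and choose "Inspect" to examine the page elements, then click "Network".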

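1) Find the corresponding URL. Its parameters are exactly the ones listed under Query String Parameters, and the request method is GET.

2) The content we get back from requesting this URL is what appears under "Response", so click it to confirm.

3) Scrolling down, we see the text "男装", but further down there is no product information related to "男装".

4) Copy any piece of product information, right-click a blank area and choose "View page source", then search the source for that product. Its information is there.

5) Comparing the page source with the "Response" content, we find that the product information in the source code

..........

has been replaced, which means the site uses JS encryption.

6) If we request the URL above directly, we only get the encrypted information. In that case we can use the Selenium library to simulate a browser and obtain the product information.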
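To make step 1) concrete, here is a minimal sketch of how a GET URL relates to the Query String Parameters panel. The path `/search` and the parameter name `q` are assumptions chosen for illustration; the real site's URL and parameters are not given in this article.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(base_url, params):
    """Append URL-encoded query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


def query_parameters(url):
    """Return the query string of a URL as a plain dict (first value per key)."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


# Hypothetical endpoint and parameter name, for illustration only.
url = build_search_url("https://www.xxxxx.com/search", {"q": "男装"})
print(url)
print(query_parameters(url))
```

Decoding the query string this way gives the same key/value pairs that the browser's developer tools display for the request.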

Fetching a single product page

  1. Request the site
# -*- coding: utf-8 -*-
from selenium import webdriver  # import the browser driver from selenium

browser = webdriver.Chrome()  # create the driver object, i.e. a Chrome browser


def get_one_page():
    '''Fetch a single page'''
    browser.get("https://www.xxxxx.com")  # request the site
  2. Enter "男装". Before typing, check whether the input box exists: if it does, type "男装"; if not, wait until it has loaded.
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By  # element locator strategies
from selenium.webdriver.support.ui import WebDriverWait  # explicit waits
from selenium.webdriver.support import expected_conditions as EC  # wait conditions

browser = webdriver.Chrome()


def get_one_page():
    '''Fetch a single page'''
    browser.get("https://www.xxxxx.com")
    search_box = WebDriverWait(browser, 10).until(  # wait for the condition
        EC.presence_of_element_located((By.CSS_SELECTOR, "#q")))  # return the input box once it is present, otherwise keep waiting
    search_box.send_keys("男装")  # type the product name
  3. The next step is to click the "Search" button. The button has to be clickable, so add that as a wait condition.
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()


def get_one_page():
    '''Fetch a single page'''
    browser.get("https://www.xxxxx.com")
    search_box = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#q")))
    search_box.send_keys("男装")
    button = WebDriverWait(browser, 10).until(  # wait for the condition
        EC.element_to_be_clickable(
            (By.CSS_SELECTOR, "#J_TSearchForm > div.search-button > button")))  # return the button once it is clickable, otherwise keep waiting
    button.click()
  4. Get the total number of pages, again with an explicit wait.
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()


def get_one_page():
    '''Fetch a single page'''
    browser.get("https://www.xxxxx.com")
    search_box = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#q")))
    search_box.send_keys("男装")
    button = WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable(
            (By.CSS_SELECTOR, "#J_TSearchForm > div.search-button > button")))
    button.click()
    pages = WebDriverWait(browser, 10).until(  # wait for the condition
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.total")))  # return the page-count element once it has loaded, otherwise keep waiting
    return pages.text


def main():
    pages = get_one_page()
    print(pages)


if __name__ == '__main__':
    main()
  5. The printed text is not yet the result we want, so extract the number with a regular expression. Finally, use try...except to catch the timeout exception.
# -*- coding: utf-8 -*-
import re
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()


def get_one_page():
    '''Fetch a single page'''
    try:
        browser.get("https://www.xxxxx.com")
        search_box = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "#q")))
        search_box.send_keys("男装")
        button = WebDriverWait(browser, 10).until(
            EC.element_to_be_clickable(
                (By.CSS_SELECTOR, "#J_TSearchForm > div.search-button > button")))
        button.click()
        pages = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.total")))
        return pages.text
    except TimeoutException:
        return get_one_page()  # on timeout, try again


def main():
    pages = get_one_page()
    pages = int(re.compile(r"(\d+)").findall(pages)[0])  # extract the total page count from the text with a regular expression
    print(pages)


if __name__ == '__main__':
    main()
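
The page-count extraction in step 5 can be isolated and tried without a browser. The sketch below assumes the pager text looks something like "共 100 页," (an assumption, since the article does not show the printed text), and `parse_total_pages` is a hypothetical helper name. Unlike indexing `findall(...)[0]` directly, it raises a descriptive error when no number is found.

```python
import re


def parse_total_pages(text):
    """Return the first integer found in the pager text."""
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError(f"no page count found in {text!r}")
    return int(match.group())


# Assumed sample of the pager text; the real text may differ.
print(parse_total_pages("共 100 页,"))
```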

For more on Selenium waits, see the Selenium with Python documentation: https://selenium-python.readthedocs.io/waits.html
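
The explicit waits used above all follow the same idea: call a condition repeatedly until it returns something truthy or a timeout expires. The following is a simplified plain-Python sketch of that polling idea, not Selenium's actual implementation. `wait_until` and `element_is_ready` are hypothetical names, and the built-in `TimeoutError` stands in for Selenium's `TimeoutException`.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated element that only "appears" on the third poll.
attempts = []


def element_is_ready():
    attempts.append(1)
    return "element" if len(attempts) >= 3 else None


print(wait_until(element_is_ready, timeout=2.0, poll_interval=0.01))
```

As with `WebDriverWait(...).until(...)`, the caller gets back whatever the condition returned, which is why the snippets above can assign the result of a wait directly to a variable.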

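Fetching multiple product pages

To switch to the next page, locate the "到第 页" (go to page) input box, again using explicit waits.

Note that a final check is needed: whether the highlighted page number in the pager matches the requested page.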

def get_next_page(page):
    try:
        page_box = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.form > input")))  # return the input box once it has loaded, otherwise keep waiting
        page_box.clear()  # remove any page number already in the box
        page_box.send_keys(str(page))  # type the page number
        button = WebDriverWait(browser, 10).until(
            EC.element_to_be_clickable(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.form > span.btn.J_Submit")))  # return the button once it is clickable, otherwise keep waiting
        button.click()  # click the button
        WebDriverWait(browser, 10).until(
            EC.text_to_be_present_in_element(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > ul > li.item.active > span"),
                str(page)))  # check that the highlighted page number is the requested page
    except TimeoutException:  # on timeout, try again
        return get_next_page(page)


def main():
    pages = get_one_page()
    pages = int(re.compile(r"(\d+)").findall(pages)[0])
    for page in range(2, pages + 1):  # page 1 is already loaded by get_one_page
        get_next_page(page)


if __name__ == '__main__':
    main()
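
One caveat with retrying by recursion on every timeout: if the page never loads, the function keeps calling itself until Python's recursion limit is hit. A bounded alternative is sketched below. `retry` is a hypothetical helper, and the built-in `TimeoutError` stands in for Selenium's `TimeoutException` so that the sketch runs without a browser.

```python
def retry(action, retries=3, exceptions=(TimeoutError,)):
    """Run action() up to `retries` times, re-raising the last error if all attempts fail."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return action()
        except exceptions as error:
            last_error = error
            print(f"attempt {attempt} failed: {error}")
    raise last_error


# Simulated page load that times out twice before succeeding.
calls = {"count": 0}


def flaky_page_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("page did not load in time")
    return "page loaded"


print(retry(flaky_page_load))
```

With Selenium, the same helper could be called as `retry(lambda: get_next_page(page), exceptions=(TimeoutException,))`, provided the function lets the exception propagate instead of catching it itself.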

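Extracting product information

First check that the product information has loaded, then get the page source, parse it with pyquery, and extract the fields.

Note that get_info only works when it is called from "get_one_page" and "get_next_page", after the results page has been loaded.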

from pyquery import PyQuery as pq  # browser, By, WebDriverWait and EC come from the snippets above


def get_info():
    """Extract product details"""
    WebDriverWait(browser, 20).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "#mainsrp-itemlist .items .item")))  # wait until the product list has loaded
    text = browser.page_source  # get the page source
    html = pq(text)  # parse the page source with pyquery
    items = html('#mainsrp-itemlist .items .item').items()  # items() returns a generator
    for item in items:  # iterate over each product node
        image = item.find(".pic .img").attr("src")  # find() searches descendant nodes; attr() reads an attribute value
        price = item.find(".price").text().strip().replace("\n", "")  # text() gets the text; strip() trims leading and trailing whitespace by default
        deal = item.find(".deal-cnt").text()[:-2]  # drop the last two characters of the sales text
        title = item.find(".title").text().strip()
        shop = item.find(".shop").text().strip()
        location = item.find(".location").text()
        data = [shop, location, title, price, deal, image]
        print(data)  # one record per product
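
The field clean-up in get_info can be tested on its own with plain strings. The sketch below assumes the raw price text may contain a line break and that the sales text is a number followed by a suffix such as "人付款"; both sample strings are assumptions, not values taken from the site. Extracting the digits with a regular expression avoids depending on the exact length of the suffix, which a fixed slice does. `clean_price` and `parse_deal_count` are hypothetical helper names.

```python
import re


def clean_price(raw):
    """Trim surrounding whitespace and remove line breaks inside the price text."""
    return raw.strip().replace("\n", "")


def parse_deal_count(raw):
    """Return the first run of digits in the sales text as an int, or 0 if there is none."""
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else 0


# Assumed sample values, for illustration only.
print(clean_price("¥\n128.00 "))
print(parse_deal_count("1234人付款"))
```

Abbreviated counts (for example values with a decimal point and a unit) would need extra handling that this sketch does not attempt.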

Saving to a MySQL database

import pymysql


def save_to_mysql(data):
    """Store one record in the database"""
    # create the database connection
    db = pymysql.connect(host="localhost", user="root", password="password", port=3306, database="spiders", charset="utf8")
    # get a cursor
    cursor = db.cursor()
    # create the table if it does not exist yet
    cursor.execute("CREATE TABLE IF NOT EXISTS `{0}`(shop VARCHAR(20),location VARCHAR(10),title VARCHAR(255),price VARCHAR(20),deal VARCHAR(20), image VARCHAR(255))".format("男装"))
    # SQL statement
    sql = "INSERT INTO `{0}` values(%s,%s,%s,%s,%s,%s)".format("男装")
    try:
        # pass in the SQL statement and the data
        if cursor.execute(sql, data):
            # commit the insert
            db.commit()
            print("******** saved **********")
    except pymysql.MySQLError:
        print("######### save failed #########")
        # roll back, as if nothing had happened
        db.rollback()
    finally:
        # close the connection
        db.close()
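
The function above opens a new connection for every record. A common variation is to reuse one connection and insert a whole page of records in a single transaction. The sketch below shows that pattern with the standard library's `sqlite3` in place of pymysql, so it runs without a MySQL server. Note that sqlite3 uses `?` placeholders where pymysql uses `%s`. `save_rows` and the sample rows are hypothetical.

```python
import sqlite3

COLUMNS = ("shop", "location", "title", "price", "deal", "image")


def save_rows(conn, table, rows):
    """Insert all rows in one transaction and return the table's row count."""
    column_defs = ", ".join(f"{column} TEXT" for column in COLUMNS)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_defs})')
    placeholders = ", ".join("?" for _ in COLUMNS)
    try:
        # Values are passed as parameters, never formatted into the SQL string.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
        conn.commit()
    except sqlite3.Error:
        conn.rollback()  # undo the partial insert
        raise
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Made-up sample records, for illustration only.
rows = [
    ("shop A", "Hangzhou", "jacket", "128.00", "35", "//img.example/a.jpg"),
    ("shop B", "Guangzhou", "shirt", "59.00", "120", "//img.example/b.jpg"),
]
conn = sqlite3.connect(":memory:")
print(save_rows(conn, "男装", rows))
conn.close()
```

Only the table name is formatted into the statement, because identifiers cannot be bound as parameters; in real code it should come from a trusted value rather than user input.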

Complete code

import re
import pymysql
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from pyquery import PyQuery as pq

browser = webdriver.Chrome()


def get_one_page(name):
    '''Fetch the first results page and return the pager text'''
    print("---------- fetching page 1 ----------")
    try:
        browser.get("https://www.xxxxx.com")
        search_box = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "#q")))
        search_box.send_keys(name)
        button = WebDriverWait(browser, 10).until(
            EC.element_to_be_clickable(
                (By.CSS_SELECTOR, "#J_TSearchForm > div.search-button > button")))
        button.click()
        pages = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.total")))
        print("---- about to parse page 1 ----")
        get_info(name)
        print("---- finished parsing page 1 ----")
        return pages.text
    except TimeoutException:
        return get_one_page(name)  # on timeout, try again


def get_next_page(page, name):
    """Jump to the given results page and parse it"""
    print("---------- fetching page {0} ----------".format(page))
    try:
        page_box = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.form > input")))
        page_box.clear()  # remove any page number already in the box
        page_box.send_keys(str(page))
        button = WebDriverWait(browser, 10).until(
            EC.element_to_be_clickable(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.form > span.btn.J_Submit")))
        button.click()
        WebDriverWait(browser, 10).until(
            EC.text_to_be_present_in_element(
                (By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > ul > li.item.active > span"),
                str(page)))  # check that the highlighted page number is the requested page
        print("----- about to parse page {0} -----".format(page))
        get_info(name)
        print("----- finished parsing page {0} -----".format(page))
    except TimeoutException:
        return get_next_page(page, name)  # on timeout, try again


def get_info(name):
    """Extract product details from the current page and store them"""
    WebDriverWait(browser, 20).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "#mainsrp-itemlist .items .item")))
    text = browser.page_source
    html = pq(text)
    items = html('#mainsrp-itemlist .items .item').items()
    for item in items:
        image = item.find(".pic .img").attr("src")
        price = item.find(".price").text().strip().replace("\n", "")
        deal = item.find(".deal-cnt").text()[:-2]
        title = item.find(".title").text().strip()
        shop = item.find(".shop").text().strip()
        location = item.find(".location").text()
        save_to_mysql([shop, location, title, price, deal, image], name)  # one record per product


def save_to_mysql(data, name):
    """Store one record in the database"""
    db = pymysql.connect(host="localhost", user="root", password="password", port=3306, database="spiders", charset="utf8")
    cursor = db.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS `{0}`(shop VARCHAR(20),location VARCHAR(10),title VARCHAR(255),price VARCHAR(20),deal VARCHAR(20), image VARCHAR(255))".format(name))
    sql = "INSERT INTO `{0}` values(%s,%s,%s,%s,%s,%s)".format(name)
    try:
        if cursor.execute(sql, data):
            db.commit()
            print("******** saved **********")
    except pymysql.MySQLError:
        print("######### save failed #########")
        db.rollback()
    finally:
        db.close()


def main(name):
    pages = get_one_page(name)
    pages = int(re.compile(r"(\d+)").findall(pages)[0])
    for page in range(2, pages + 1):  # page 1 was already parsed by get_one_page
        get_next_page(page, name)


if __name__ == '__main__':
    name = "男装"
    main(name)
