Python Notes - Fetching Nanjing Python Job Data from Lagou

Published 2020-03-12 · 4503 views


The Fiddler capture is shown below:

The program output is shown below:

The source code is as follows:

    import re
    import requests

    class HandleLaGou(object):
        def __init__(self):
            self.laGou_session = requests.session()
            self.header = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
            }
            self.city_list = ""

        # Fetch the nationwide city list
        def handle_city(self):
            city_search = re.compile(r'zhaopin/">(.*?)</a>')
            city_url = "https://www.lagou.com/jobs/allCity.html"
            city_result = self.handle_request(method="GET", url=city_url)
            self.city_list = city_search.findall(city_result)
            self.laGou_session.cookies.clear()

        def handle_city_job(self, city):
            first_request_url = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput=" % city
            first_response = self.handle_request(method="GET", url=first_request_url)
            total_page_search = re.compile(r'class="span\stotalNum">(\d+)</span>')
            try:
                total_page = total_page_search.search(first_response).group(1)
            except AttributeError:
                # No page count found for this city, so skip it
                return
            else:
                for i in range(1, int(total_page) + 1):
                    data = {
                        "pn": i,
                        "kd": "python"
                    }
                    page_url = "https://www.lagou.com/jobs/positionAjax.json?city=%s&needAddtionalResult=false" % city
                    referer_url = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput=" % city
                    # The Referer header must be a str, not bytes
                    self.header['Referer'] = referer_url
                    response = self.handle_request(method="POST", url=page_url, data=data)
                    print(response)

        def handle_request(self, method, url, data=None, info=None):
            # Route traffic through Fiddler's local proxy and trust its root certificate
            proxies = {"http": "http://127.0.0.1:8888", "https": "http://127.0.0.1:8888"}
            if method == "GET":
                response = self.laGou_session.get(url=url, headers=self.header,
                                                  proxies=proxies, verify=r"D:/Fiddler/FiddlerRoot.pem")
            elif method == "POST":
                response = self.laGou_session.post(url=url, headers=self.header, data=data,
                                                   proxies=proxies, verify=r"D:/Fiddler/FiddlerRoot.pem")
            response.encoding = 'utf-8'
            return response.text

    if __name__ == '__main__':
        laGou = HandleLaGou()
        laGou.handle_city()
        for city in laGou.city_list:
            laGou.handle_city_job(city)
            break
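The city-extraction regex above relies on every city link ending in `zhaopin/">`. Its behavior can be checked offline against a hypothetical HTML fragment mimicking Lagou's allCity.html markup:

```python
import re

# Hypothetical fragment in the style of Lagou's allCity.html
html = ('<a href="https://www.lagou.com/nanjing-zhaopin/">南京</a>'
        '<a href="https://www.lagou.com/beijing-zhaopin/">北京</a>')

# Same pattern as handle_city(); non-greedy (.*?) stops at the first </a>
city_search = re.compile(r'zhaopin/">(.*?)</a>')
cities = city_search.findall(html)
print(cities)  # ['南京', '北京']
```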

A small tip here:

I used to build crawlers in C++, which was exhausting; Python is a real pleasure by comparison, since the libraries handle so much for you.

The session object matters because, before handing over data, the site may first serve a page that sets cookies; only a subsequent request carrying those cookies is allowed to fetch the job listings.
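This cookie carry-over can be seen without touching the network. The sketch below simulates the cookie a first page visit would set (the token name is made up for illustration) and shows that a later request prepared on the same `requests.Session` includes it:

```python
import requests

session = requests.Session()
# Simulate the cookie the listing page would set on a first visit
session.cookies.set("user_trace_token", "demo-token")

# Prepare (but do not send) a later request on the same session
req = requests.Request("POST", "https://www.lagou.com/jobs/positionAjax.json",
                       data={"pn": 1, "kd": "python"})
prepared = session.prepare_request(req)

# The Cookie header now carries the simulated cookie
print(prepared.headers.get("Cookie"))
```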
