The code is as follows:
    import re
    import requests


    class HandleLaGou(object):
        def __init__(self):
            # One shared session so cookies persist across requests
            self.laGou_session = requests.Session()
            self.header = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
            }
            self.city_list = []

        # Fetch the list of all cities nationwide
        def handle_city(self):
            city_search = re.compile(r'zhaopin/">(.*?)</a>')
            city_url = "https://www.lagou.com/jobs/allCity.html"
            city_result = self.handle_request(method="GET", url=city_url)
            self.city_list = city_search.findall(city_result)

        def handle_request(self, method, url, data=None, info=None):
            if method == "GET":
                response = self.laGou_session.get(url=url, headers=self.header)
                return response.text
            # Only GET is implemented so far; fail loudly instead of returning None
            raise ValueError("Unsupported method: " + method)


    if __name__ == '__main__':
        laGou = HandleLaGou()
        laGou.handle_city()
        print(laGou.city_list)
A screenshot of the run is shown below:
The following points can be learned from this example:
After obtaining the page data, you can first try out the regular expression match in Notepad++:
The regular expression used here is:
zhaopin/">(.*?)</a>
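(.*?) is a non-greedy capture group: it matches any characters (except newlines, by default) but as few as possible, so each match stops at the first </a> that follows. findall returns the captured text of every match in the page, not just one.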
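To see why the question mark matters, here is a minimal sketch comparing the non-greedy pattern with its greedy counterpart. The HTML fragment is a made-up sample that only imitates the structure of the city list page; the real page markup may differ.

```python
import re

# Hypothetical HTML fragment imitating the city list page (not real page data).
sample_html = (
    '<a href="https://www.lagou.com/beijing-zhaopin/">Beijing</a>'
    '<a href="https://www.lagou.com/shanghai-zhaopin/">Shanghai</a>'
)

lazy = re.compile(r'zhaopin/">(.*?)</a>')   # non-greedy: stop at the first </a>
greedy = re.compile(r'zhaopin/">(.*)</a>')  # greedy: run on to the last </a>

lazy_matches = lazy.findall(sample_html)
greedy_matches = greedy.findall(sample_html)

print(lazy_matches)    # ['Beijing', 'Shanghai']
print(greedy_matches)  # a single long string spanning both links
```

With the greedy version the two city names are swallowed into one match together with the markup between them, which is why the non-greedy form is used in handle_city.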
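The r prefix in re.compile(r'') marks a raw string literal, in which backslashes are kept as-is rather than treated as escape characters. It does not stand for "regular expression"; it is simply the usual way to write patterns so that sequences such as \d or \b reach the regex engine unchanged.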
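The effect of the r prefix can be checked with a short sketch. The pattern and text below are illustrative and are not taken from the crawler above.

```python
import re

raw = r'\bcity\b'   # backslash + 'b' reaches the regex engine: word boundary
plain = '\bcity\b'  # Python first turns '\b' into a backspace character (0x08)

text = 'city list'
raw_found = re.findall(raw, text)
plain_found = re.findall(plain, text)

print(len(r'\n'), len('\n'))  # 2 1
print(raw_found)              # ['city']
print(plain_found)            # []
```

The pattern in handle_city contains no backslashes, so it would behave the same without the prefix, but writing every pattern as a raw string avoids surprises once escapes are added.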