A web crawler is also known as a web spider.
Python has several libraries for this:
- BeautifulSoup: Beautiful Soup is a library for parsing HTML and XML documents. Requests (which handles HTTP sessions and makes HTTP requests) combined with BeautifulSoup (the parsing library) is the best toolset for small, quick web-scraping jobs. For simpler, static pages without much JavaScript, this is probably the tool you're looking for; a minimal sketch follows this list. If you want to know more about BeautifulSoup, refer to my previous guide, Extracting Data from HTML with BeautifulSoup. lxml is a high-performance, fast, feature-rich parsing library and another prominent alternative to BeautifulSoup.
- Scrapy: Scrapy is a web-crawling framework that provides a complete toolset for scraping. In Scrapy, we create Spiders, which are Python classes that define how a particular site (or sites) will be scraped. If you want to build a robust, concurrent, scalable, large-scale scraper, Scrapy is an excellent choice. It also ships with a bunch of middlewares for cookies, redirects, sessions, caching, etc., which help you deal with the different complexities you might come across. If you want to know more about Scrapy, refer to my previous guide, Crawling the Web with Python and Scrapy.
- Selenium: For heavily JS-rendered pages or very sophisticated websites, Selenium WebDriver is the best tool to choose. Selenium automates web browsers through a driver: it can open an automated Google Chrome or Mozilla Firefox window, visit a URL, and navigate links. However, it is not as efficient as the tools discussed so far. It is the tool to reach for when every other door to the data you want is closed. If you want to know more about Selenium, refer to Web Scraping with Selenium.
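As a minimal sketch of the Requests + BeautifulSoup combination described above (the URL and the tags extracted are placeholders of my own, not from any of the guides):

```python
import requests
from bs4 import BeautifulSoup

# Fetch a static page; https://example.com is a placeholder URL.
resp = requests.get("https://example.com")
resp.raise_for_status()

# Parse the HTML and print the text and target of every link on the page.
soup = BeautifulSoup(resp.text, "html.parser")
for a in soup.find_all("a"):
    print(a.get_text(strip=True), a.get("href"))
```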
A simple Scrapy example: https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3
The example above does not require a login.
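For orientation, a minimal spider of the kind such a tutorial builds might look like this (the start URL and CSS selectors are illustrative assumptions, not taken from the tutorial):

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # quotes.toscrape.com is a public practice site; assumed here,
    # not necessarily the site the tutorial scrapes.
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

It can be run without a project via `scrapy runspider quotes_spider.py -o quotes.json`.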
If a login is required, Scrapy's FormRequest is needed.
Take https://ktu3333.asuscomm.com:9085/enLogin.htm as an example.
Tested: the login succeeds.
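A sketch of what the FormRequest-based login could look like against that page (the field names `loginname`/`loginpass` are assumptions borrowed from the input ids used later with Selenium; the real form's name attributes may differ):

```python
import scrapy
from scrapy.http import FormRequest

class RouterLoginSpider(scrapy.Spider):
    name = "router_login"
    start_urls = ["https://ktu3333.asuscomm.com:9085/enLogin.htm"]

    def parse(self, response):
        # Fill in and submit the login form found on the page.
        return FormRequest.from_response(
            response,
            formdata={"loginname": "TheStringOfUsername",
                      "loginpass": "TheStringOfPassword"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Landed on the post-login page; continue scraping from here.
        self.logger.info("logged in, now at %s", response.url)
```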
Scrapy only fetches static content. The target page uses JS and AJAX, so Scrapy has to be combined with Selenium and a WebDriver.
For the reason, see: https://www.geeksforgeeks.org/scrape-content-from-dynamic-websites/
How to install ChromeDriver on a Mac:
https://www.swtestacademy.com/install-chrome-driver-on-mac/
2021-07-27: The login no longer uses Scrapy, because the session Scrapy establishes after logging in is not shared with Selenium; instead, the login is done directly with Selenium.
Elements are located with XPath; note the syntax when a parameter has to be spliced into the XPath expression.
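The code below splices the loop index into the XPath string by concatenation; an f-string is an equivalent, more readable form (a small illustration, not from the original code):

```python
row_index = 3  # hypothetical 1-based row number

# Concatenation, as used in the script below:
xpath_concat = '//*[@id="OverviewInfo"]//tr[' + str(row_index) + ']/td[2]'

# Equivalent f-string form:
xpath_fstring = f'//*[@id="OverviewInfo"]//tr[{row_index}]/td[2]'

assert xpath_concat == xpath_fstring
```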
Code that runs so far (the scheduled-run feature has not been added yet); Python version 3.8.6:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time

options = webdriver.ChromeOptions()
options.add_argument('ignore-certificate-errors')
driver = webdriver.Chrome(chrome_options=options)
# driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get("https://ktu3333.asuscomm.com:9085/enLogin.htm")

# try:
#     element = WebDriverWait(driver, 20).until(
#         EC.visibility_of_element_located((By.ID, "login_button"))
#     )
# finally:
#     driver.quit()

time.sleep(5)
print("login page finished loading")

# Find the username field and type the username into it.
driver.find_element_by_id("loginname").send_keys("TheStringOfUsername")
# Find the password field and type the password.
driver.find_element_by_id("loginpass").send_keys("TheStringOfPassword")
# Click the login button.
driver.find_element_by_id("login_button").click()

time.sleep(5)
print("status page finished loading")

driver.get("https://ktu3333.asuscomm.com:9085/enHBSim.htm")
time.sleep(20)
print("redirect success")

try:
    # Earlier attempts, kept for reference:
    # rows_in_table = driver.find_elements_by_class_name("TB")
    # for row in rows_in_table.find_elements_by_css_selector('tr'):
    #     for cell in row.find_elements_by_tag_name('td'):
    #         print(cell.text)
    #
    # table = driver.find_element_by_id("OvefrviewInfo")
    # rows = table.find_elements(By.TAG_NAME, "tr")  # all rows in the table
    # for row in rows:
    #     col1 = row.find_elements(By.TAG_NAME, "td")[1]  # index starts at 0, so 1 is column 2
    #     print(col1.text)
    #     col2 = row.find_elements(By.TAG_NAME, "td")[2]
    #     print(col2.text)
    #     col3 = row.find_elements(By.TAG_NAME, "td")[3]
    #     print(col3.text)
    #
    # col1 = driver.find_element_by_xpath('//*[@id="OverviewInfo"]//tr[1]/td[3]')
    # print(col1.text)
    # col2 = driver.find_element_by_xpath('//*[@id="OverviewInfo"]//tr[1]/td[4]')
    # print(col2.text)

    tbody = driver.find_element_by_xpath('//*[@id="OverviewInfo"]')
    trows = driver.find_elements_by_xpath('//*[@id="OverviewInfo"]/tr')
    print("the tbody exists")
    print("total rows:")
    print(len(trows))
    for i in range(1, len(trows) + 1):
        # XPath row indices are 1-based; the loop index is spliced into the expression.
        col2 = driver.find_element_by_xpath('//*[@id="OverviewInfo"]//tr[' + str(i) + ']/td[2]')
        col3 = driver.find_element_by_xpath('//*[@id="OverviewInfo"]//tr[' + str(i) + ']/td[3]')
        col4 = driver.find_element_by_xpath('//*[@id="OverviewInfo"]//tr[' + str(i) + ']/td[4]')
        col5 = driver.find_element_by_xpath('//*[@id="OverviewInfo"]//tr[' + str(i) + ']/td[5]')
        col6 = driver.find_element_by_xpath('//*[@id="OverviewInfo"]//tr[' + str(i) + ']/td[6]')
        print(col2.text, '\t', col3.text, '\t', col4.text, '\t', col5.text, '\t', col6.text)
except NoSuchElementException:
    print("Element does not exist")

driver.close()
```
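The fixed `time.sleep()` calls make the run slow and brittle. A variant of the waiting logic using the explicit-wait helpers already imported above might look like this (a sketch; it assumes the `driver` instance and element ids from the script above):

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# `driver` is the Chrome webdriver created in the script above.
wait = WebDriverWait(driver, 20)

# Instead of time.sleep(5): wait until the login button is visible.
wait.until(EC.visibility_of_element_located((By.ID, "login_button")))

# Instead of time.sleep(20) after opening enHBSim.htm:
# wait until the overview table is present in the DOM.
wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="OverviewInfo"]')))
```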
Improved version: collect the results into a JSON array.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time
import json

options = webdriver.ChromeOptions()
options.add_argument('ignore-certificate-errors')
driver = webdriver.Chrome(chrome_options=options)
# driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get("https://ktu3333.asuscomm.com:9085/enLogin.htm")

time.sleep(5)
print("login page finished loading")

# Fill in the credentials and log in.
driver.find_element_by_id("loginname").send_keys("StringOfUserName")
driver.find_element_by_id("loginpass").send_keys("StringOfPassword")
driver.find_element_by_id("login_button").click()

time.sleep(5)
print("status page finished loading")

driver.get("https://ktu3333.asuscomm.com:9085/enHBSim.htm")
time.sleep(20)
print("redirect success")

try:
    tbody = driver.find_element_by_xpath('//*[@id="OverviewInfo"]')
    trows = driver.find_elements_by_xpath('//*[@id="OverviewInfo"]/tr')
    print("the tbody exists")
    print("total rows:")
    print(len(trows))

    totalList = []
    for i in range(1, len(trows) + 1):
        col2 = driver.find_element_by_xpath('//*[@id="OverviewInfo"]//tr[' + str(i) + ']/td[2]')
        col3 = driver.find_element_by_xpath('//*[@id="OverviewInfo"]//tr[' + str(i) + ']/td[3]')
        col4 = driver.find_element_by_xpath('//*[@id="OverviewInfo"]//tr[' + str(i) + ']/td[4]')
        col5 = driver.find_element_by_xpath('//*[@id="OverviewInfo"]//tr[' + str(i) + ']/td[5]')
        col6 = driver.find_element_by_xpath('//*[@id="OverviewInfo"]//tr[' + str(i) + ']/td[6]')
        print(col2.text, '\t', col3.text, '\t', col4.text, '\t', col5.text, '\t', col6.text)

        # Collect one row as a dict.
        singleRecord = {
            'SIM': col2.text,
            'Port Status': col3.text,
            'Phone Number': col4.text,
            'Last matched Balance': col5.text,
            'Calculated Balance': col6.text,
        }
        # to_json = json.dumps(singleRecord)  # per-row dump, no longer used
        # print(to_json)
        totalList.append(singleRecord)

    # Dump the whole table as a JSON array.
    to_json = json.dumps(totalList)
    print(to_json)
except NoSuchElementException:
    print("Element does not exist")

driver.close()
```
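The scheduling feature mentioned above is still missing; one minimal way to add it is to wrap the whole scrape in a function and loop with a sleep (a sketch; the interval and the function name `scrape_once` are my assumptions):

```python
import time

SCRAPE_INTERVAL_SECONDS = 3600  # hypothetical interval: once per hour

def scrape_once():
    # Body elided: the script above (log in, open enHBSim.htm,
    # read the OverviewInfo table, dump the JSON array).
    ...

while True:
    try:
        scrape_once()
    except Exception as exc:
        # Keep the loop alive even if a single run fails.
        print("scrape failed:", exc)
    time.sleep(SCRAPE_INTERVAL_SECONDS)
```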