Table of contents
1 Introduction to dynamic crawling technology
2 Resolve the real address to crawl
3 Simulate browser crawling through Selenium
3.1 Basic introduction to Selenium
3.2 Selenium Practical Cases
3.3 Selenium gets all comments from the article
3.4 Advanced Operations of Selenium
3.4.1 Controlling CSS
3.4.2 Restrict image loading
3.4.3 Control the operation of JavaScript
References
1 Introduction to dynamic crawling technology
- Asynchronous update technology: AJAX
The value of AJAX (Asynchronous JavaScript And XML) lies in the fact that a web page can be updated asynchronously by exchanging a small amount of data with the server in the background. This means that part of a page can be updated without reloading the whole page. It reduces the amount of duplicate content that has to be downloaded and it saves bandwidth, so AJAX is widely used.
There are two ways to crawl the dynamically loaded content of web pages that use AJAX:
- Resolve the real web address through the browser's "Inspect" function
- Simulate a browser with Selenium
2 Resolve the real address to crawl
Taking the Hello World article on the blog of the author of Python Web Crawler: From Beginner to Practice (2nd Edition) as an example, the goal is to capture all comments under the article. The article's address is:
/2018/07/04/hello-world/
Step 01 Open the "Inspect" function. Open the Hello World article with Chrome, right-click anywhere on the page, and click the "Inspect" command in the pop-up menu.
Step 02 Find the real data address. Click the Network tab and refresh the page. Network now lists all the files the browser fetched from the web server; this process is commonly called "packet capture". To quickly locate the file that holds the comment data among all these files, search for part of a comment's text in the Network panel.
Step 03 Request the real comment data address. Since the real address has been found, you can request it directly with requests to obtain the data.
Step 04 Extract the comments from the JSON data. You can use the json library to parse the data and extract the desired comments.
import requests
import json

link = "/v1/comments/list?callback=jQuery112408626354130815113_1646567623361&limit=10&repSeq=4272904&requestPath=/v1/comments/list&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&code=&_=1646567623363"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'}
r = requests.get(link, headers=headers)
# Get the JSON string from the response body
json_string = r.text
# Keep from the first left brace onward and drop the last two characters
# (the closing bracket and semicolon of the JSONP callback wrapper)
json_string = json_string[json_string.find('{'):-2]
json_data = json.loads(json_string)
comment_list = json_data['results']['parents']
for each in comment_list:
    message = each['content']
    print(message)
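To see why this slicing works: the server returns JSONP, that is, the JSON wrapped in a callback function call. A minimal sketch with a made-up callback name:
# JSONP response with a made-up callback name
raw = 'jQuery123_456({"results": {"parents": []}});'
clean = raw[raw.find('{'):-2]  # keep from the first '{', drop the trailing ');'
print(clean)  # {"results": {"parents": []}}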
Next, you can use a for loop to crawl multiple pages of comment data. Compare the real addresses of different pages and you will find that they differ only in the offset parameter; changing that parameter's value turns the page.
import requests
import json

def single_page_comment(link):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'}
    r = requests.get(link, headers=headers)
    # Get the JSON string from the response body
    json_string = r.text
    # Strip the JSONP wrapper: keep from the first left brace, drop the trailing ');'
    json_string = json_string[json_string.find('{'):-2]
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    for each in comment_list:
        message = each['content']
        print(message)

max_comments = 50  # Maximum number of comments to crawl
for page in range(1, max_comments // 10 + 1):
    link1 = "/v1/comments/list?callback=jQuery112407919796508302595_1646571637355&limit=10&offset="
    link2 = "&repSeq=4272904&requestPath=/v1/comments/list&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&code=&_=1646571637361"
    page_str = str(page)
    link = link1 + page_str + link2
    print(link)
    single_page_comment(link)
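The fixed slicing above assumes the response always ends with ");". As a sketch of a slightly more robust alternative (assuming the response is always a single JSONP call), the wrapper can be stripped with a regular expression instead:
import re
import json

def strip_jsonp(text):
    # Capture everything between the callback's "(" and the final ")"
    match = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.S)
    return json.loads(match.group(1)) if match else json.loads(text)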
3 Simulate browser crawling through Selenium
Some websites are very complex, and the "Inspect" function cannot easily find the address the page calls. In addition, the real URLs of some data are long and convoluted, and some websites even encrypt their URL parameters to deter crawling, which makes the variables hard to work out. Therefore, another method is introduced here: using the browser rendering engine. A real browser parses the HTML, applies the CSS styles, and executes the JavaScript while displaying the page. In layman's terms, browser rendering turns crawling a dynamic page into crawling a static page.
Here, Python's Selenium library is used to simulate a browser for crawling. Selenium is a tool for web application testing.
3.1 Basic introduction to Selenium
Browser driver download addresses:
Chrome: ChromeDriver - chromedriver.chromium.org
Edge: Microsoft Edge Driver - Microsoft Edge Developer
Firefox: Releases · mozilla/geckodriver · GitHub
Safari: WebDriver Support in Safari 10 | WebKit
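After downloading a driver, it either has to be on the system PATH or be pointed to explicitly. In Selenium 4 this is done with a Service object; a minimal sketch (the driver path below is a placeholder for wherever you saved the executable):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service("/path/to/chromedriver")  # placeholder path; adjust to your system
driver = webdriver.Chrome(service=service)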
Use Selenium to open a browser and load a web page; the code is as follows:
from selenium import webdriver

# Chrome is used here; any of the browsers above works with its matching driver
driver = webdriver.Chrome()
driver.get("/2018/07/04/hello-world/")
3.2 Selenium Practical Cases
Step 01 Find the HTML element for the comments. Open the article page with Chrome, right-click a comment on the page, and click the "Inspect" command in the pop-up shortcut menu.
Step 02 Try to get one comment. Run the following code against the opened page to get the first comment's data.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("/2018/07/04/hello-world/")
# Switch into the comment iframe
driver.switch_to.frame(driver.find_element(by='css selector', value='iframe[title="livere-comment"]'))
# Get the element whose CSS class is reply-content
comment = driver.find_element(by='css selector', value='div.reply-content')
content = comment.find_element(by='tag name', value='p')
print(content.text)
3.3 Selenium gets all comments from the article
If you want to get all comments, the script needs to click "+10 to see more" automatically so that all comments can be displayed. So we need to find the element address of the "+10 to see more" button and then let Selenium simulate clicking it to load more comments. The specific code is as follows:
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # Implicit wait: wait up to 10 seconds for elements to appear
driver.get("/2018/07/04/hello-world/")
time.sleep(5)
# Scroll down to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
print("wait for 3 seconds")
time.sleep(3)
for i in range(1, 19):
    # Switch into the iframe, then find the "see more" button and click it
    driver.switch_to.frame(driver.find_element(by="css selector", value="iframe[title='livere-comment']"))
    # load_more = driver.find_element(by="css selector", value="button[data-page='%d']" % i)
    if i > 11:
        x_path = './/div[@class="more-wrapper"]/button[' + str(i - 9) + ']'
    else:
        x_path = './/div[@class="more-wrapper"]/button[' + str(i) + ']'
    load_more = driver.find_element(by="xpath", value=x_path)
    # Press Enter first to restore focus; otherwise the click can be lost after the page jumps
    load_more.send_keys("Enter")
    load_more.click()
    # Switch back out of the iframe
    driver.switch_to.default_content()
    print("Click and waiting loading --- please wait for 5s")
    time.sleep(5)

# After all pages are loaded, switch into the iframe again and read every comment
driver.switch_to.frame(driver.find_element(by="css selector", value="iframe[title='livere-comment']"))
comments = driver.find_elements(by="css selector", value="div.reply-content")
for each_comment in comments:
    content = each_comment.find_element(by="tag name", value="p")
    print(content.text)
driver.switch_to.default_content()
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)
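The fixed time.sleep() calls above are simple but fragile. A more robust pattern is Selenium's explicit wait, which polls until a condition holds or a timeout expires. A sketch, reusing the driver and x_path variables from the code above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the "see more" button to become clickable, then click it
load_more = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, x_path)))
load_more.click()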
Common Selenium methods for operating on elements are as follows:
- clear: clear the content of the element
- send_keys: simulate keyboard input
- click: click the element
- submit: submit the form
For example, the following code fills in and submits a login form:
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # Implicit wait: wait up to 10 seconds
driver.get("/")
time.sleep(5)
user = driver.find_element(by="name", value="username")  # Find the username input box
user.send_keys("123456")  # Simulate keyboard input of the username
pwd = driver.find_element(by="name", value="password")  # Find the password input box
pwd.clear()  # Clear the contents of the password input box
pwd.send_keys("******")  # Enter the password
driver.find_element(by="id", value="loginBtn").click()  # Click the login button
In addition to simple mouse operations, Selenium can also perform complex ones such as double-clicking and drag-and-drop. It can also obtain the size of each element on the page and even simulate the keyboard, as the sketch below shows.
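Double-clicks and drag-and-drop are available through the ActionChains class; the element IDs below are hypothetical:
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

source = driver.find_element(By.ID, "source")  # hypothetical element
target = driver.find_element(By.ID, "target")  # hypothetical element
actions = ActionChains(driver)
actions.double_click(source).perform()           # double-click
actions.drag_and_drop(source, target).perform()  # drag and drop
print(source.size)                               # element size, e.g. {'height': 30, 'width': 120}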
3.4 Advanced Operations of Selenium
Commonly used methods to speed up the crawling speed of Selenium are:
- Control the loading of CSS
- Control the display of picture files
- Control the operation of JavaScript
3.4.1 Controlling CSS
During the crawling process, only the content of the page matters. CSS style files merely control the appearance of the page and the placement of elements and have no impact on the content, so we can block the loading of CSS to reduce crawling time. The code is as follows:
# Control CSS loading
from selenium import webdriver

# Use Firefox preferences to control CSS loading (set_preference is Firefox-specific)
fp = webdriver.FirefoxOptions()
fp.set_preference("permissions.default.stylesheet", 2)  # 2 means block stylesheets
driver = webdriver.Firefox(options=fp)
driver.get("/2018/07/04/hello-world/")
3.4.2 Restrict image loading
If you do not need to crawl the pictures on a web page, it is best to prohibit them from loading. Restricting image loading can greatly improve the efficiency of a web crawler.
# Restrict image loading
from selenium import webdriver

fp = webdriver.FirefoxOptions()
fp.set_preference("permissions.default.image", 2)  # 2 means block images
driver = webdriver.Firefox(options=fp)
driver.get("/2018/07/04/hello-world/")
3.4.3 Control the operation of JavaScript
If the content to be crawled is not loaded dynamically through JavaScript, we can improve crawling efficiency by prohibiting JavaScript execution, because many web pages use JavaScript to asynchronously load large amounts of content that we do not need and whose loading wastes time.
# Restrict JavaScript execution
from selenium import webdriver

fp = webdriver.FirefoxOptions()
fp.set_preference("javascript.enabled", False)  # Disable JavaScript
driver = webdriver.Firefox(options=fp)
driver.get("/2018/07/04/hello-world/")
References
[1] Tang Song. Python Web Crawler: From Beginner to Practice (2nd Edition) [M]. Beijing: China Machine Press.