
Dynamic web crawling – study notes


Table of contents

1 Introduction to dynamic crawling technology

2 Resolve the real address to crawl

3 Simulate browser crawling through Selenium

3.1 Basic introduction to Selenium

3.2 Selenium Practical Cases

3.3 Selenium gets all comments from the article

3.4 Advanced Operations of Selenium

3.4.1 Controlling CSS

3.4.2 Restrict image loading

3.4.3 Control the operation of JavaScript

References

1 Introduction to dynamic crawling technology

  • Asynchronous update technology: AJAX

The value of AJAX (Asynchronous JavaScript And XML) lies in the fact that a web page can be updated asynchronously by exchanging a small amount of data with the server in the background. This means that part of a page can be updated without reloading the entire page. On the one hand this avoids re-downloading duplicate content; on the other hand it saves bandwidth, so AJAX is widely used.
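
As a minimal illustration of the idea (the URLs below are hypothetical, for illustration only): the static HTML returned for the article does not contain the AJAX-loaded data; the browser fetches that data later with a separate request.

import requests

# Hypothetical URLs, for illustration only
article_url = "https://example.com/2018/07/04/hello-world/"
comments_api = "https://example.com/v1/comments/list?limit=10"

# The static HTML of the article can be fetched directly...
html = requests.get(article_url).text

# ...but the comments are not in it; the browser loads them later
# with a separate AJAX request that returns JSON (or JSONP).
comments = requests.get(comments_api).text
print(comments)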

There are two ways to crawl content that is dynamically loaded with AJAX:

  • Parse out the real web address with the browser's "Inspect" (review elements) tool
  • Simulate a browser with Selenium

2 Resolve the real address to crawl

Taking the "Hello World" article on the blog of the author of Python Web Crawler: From Beginner to Practice (2nd Edition) as an example, the goal is to crawl all comments under the article. The article's URL is:

/2018/07/04/hello-world/

Step 01 Open the "Inspect" tool. Open the Hello World article in Chrome, right-click anywhere on the page, and click the "Inspect" command in the pop-up menu.

Step 02 Find the real data address. Click the Network tab in the panel and refresh the page. Network now lists all the files the browser has fetched from the web server; this process is commonly called "packet capture".

To quickly find the file that contains the comment data, search for a snippet of the comment text in the Network panel; this immediately locates the response that carries the comments.

Step 03 Request the real comment data address. Now that the real address has been found, you can request it directly with requests to obtain the data.

Step 04 Extract the comments from the JSON data. The json library can parse the string and pull out the desired fields.

import requests
import json

link = "/v1/comments/list?callback=jQuery112408626354130815113_1646567623361&limit=10&repSeq=4272904&requestPath=/v1/comments/list&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&code=&_=1646567623363"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'}
r = requests.get(link, headers=headers)
# Get the JSON string from the JSONP response
json_string = r.text
json_string = json_string[json_string.find('{'):-2]  # Take from the first left brace; drop the last two characters (closing bracket and semicolon)
json_data = json.loads(json_string)
comment_list = json_data['results']['parents']
for each in comment_list:
    message = each['content']
    print(message)
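
The string slicing above assumes the JSONP response always ends with ");". As a sketch of a slightly more defensive alternative, a regular expression can strip the callback wrapper regardless of the callback name:

import re
import json

def strip_jsonp(text):
    # JSONP responses look like: callbackName({...});
    # capture everything between the outermost parentheses
    match = re.search(r'\((\{.*\})\)', text, re.S)
    return match.group(1) if match else text

# json_data = json.loads(strip_jsonp(r.text))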

Next, a for loop can be used to crawl multiple pages of comments. Compare the real addresses of different pages, find the parameter that differs (the offset), and change that value to move from page to page.

import requests
import json

def single_page_comment(link):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'}
    r = requests.get(link, headers=headers)
    # Get the JSON string from the JSONP response
    json_string = r.text
    json_string = json_string[json_string.find('{'):-2]  # Take from the first left brace; drop the closing bracket and semicolon
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    for each in comment_list:
        message = each['content']
        print(message)


max_comments = 50  # Set the maximum number of comments to crawl
for page in range(1, max_comments // 10 + 1):
    link1 = "/v1/comments/list?callback=jQuery112407919796508302595_1646571637355&limit=10&offset="
    link2 = "&repSeq=4272904&requestPath=/v1/comments/list&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&code=&_=1646571637361"
    page_str = str(page)
    link = link1 + page_str + link2
    print(link)
    single_page_comment(link)

3 Simulate browser crawling through Selenium

Some websites are very complex, and it is hard to find the address of the underlying call with the "Inspect" tool. In addition, the real URLs of some data are very long and complicated; to deter crawlers, some sites even encrypt the address so that some of its parameters look like gibberish. Therefore another method is introduced here: using the browser's rendering engine, that is, letting a real browser parse the HTML, apply the CSS styles, and execute the JavaScript when displaying the page. In plain terms, browser rendering turns crawling a dynamic page into crawling a static page.

Here, Python's Selenium library is used to simulate the browser and perform the crawl. Selenium is a tool for testing web applications.

3.1 Basic introduction to Selenium

Browser driver download address:

Chrome: //driver/

Edge: Microsoft Edge Driver - Microsoft Edge Developer

Firefox: Releases · mozilla/geckodriver · GitHub

Safari: WebDriver Support in Safari 10 | WebKit
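
If the downloaded driver executable is not on the system PATH, Selenium 4 can be pointed at it explicitly through a Service object; a minimal sketch (the driver path below is a placeholder):

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Placeholder path to the downloaded geckodriver executable
service = Service(executable_path="/path/to/geckodriver")
driver = webdriver.Firefox(service=service)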

Use Selenium to open a browser and load a web page; the code is as follows:

from selenium import webdriver

driver = webdriver.Firefox()  # launch Firefox (its driver must be available on PATH)
driver.get("/2018/07/04/hello-world/")

3.2 Selenium Practical Cases

Step 01 Find the HTML tag of the comment. Open the article page in Chrome, right-click the page, and click the "Inspect" command in the pop-up shortcut menu.

Step 02 Try to get one comment. Run the following code against the opened page to fetch the first comment.

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("/2018/07/04/hello-world/")
# Switch into the comment iframe
driver.switch_to.frame(driver.find_element(by='css selector', value='iframe[title="livere-comment"]'))
# Get the element whose CSS class is reply-content
comment = driver.find_element(by='css selector', value='div.reply-content')
content = comment.find_element(by='tag name', value='p')

print(content.text)

3.3 Selenium gets all comments from the article

To get all the comments, the script needs to click "+10 to see more" automatically so that every comment is displayed. So we first find the element for the "+10 to see more" button and then let Selenium simulate clicking it to load more comments. The code is as follows:

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.implicitly_wait(10)  # Implicit wait: wait up to 10 seconds for elements to appear
driver.get("/2018/07/04/hello-world/")
time.sleep(5)
# Scroll down to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
print("wait for 3 seconds")
time.sleep(3)
for i in range(1, 19):
    # Switch into the comment iframe, then find the "see more" button and click it
    driver.switch_to.frame(driver.find_element(by="css selector", value="iframe[title='livere-comment']"))
    # load_more = driver.find_element(by="css selector", value="button[data-page='%d']" % i)
    if i > 11:
        x_path = './/div[@class="more-wrapper"]/button[' + str(i - 9) + ']'
    else:
        x_path = './/div[@class="more-wrapper"]/button[' + str(i) + ']'
    load_more = driver.find_element(by="xpath", value=x_path)
    load_more.send_keys("Enter")  # Press Enter before clicking, to avoid a failed click when the button loses focus
    load_more.click()
    # Switch back out of the iframe
    driver.switch_to.default_content()
    print("Click and waiting loading --- please wait for 5s")
    time.sleep(5)
    driver.switch_to.frame(driver.find_element(by="css selector", value="iframe[title='livere-comment']"))
    comments = driver.find_elements(by="css selector", value="div.reply-content")
    for each_comment in comments:
        content = each_comment.find_element(by="tag name", value="p")
        print(content.text)
    driver.switch_to.default_content()
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

Common methods of operating elements of Selenium are as follows:

  • clear: clear the content of the element
  • send_keys: simulate key input
  • click: click the element
  • submit: submit the form

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.implicitly_wait(10)  # Implicit wait: wait up to 10 seconds
driver.get("/")
time.sleep(5)
user = driver.find_element(by="name", value="username")  # Find the username input box
user.send_keys("123456")  # Type the user name
pwd = driver.find_element(by="name", value="password")  # Find the password input box
pwd.clear()  # Clear the contents of the password input box
pwd.send_keys("******")  # Enter the password
driver.find_element(by="id", value="loginBtn").click()  # Click to log in

In addition to simple mouse operations, Selenium can also perform complex actions such as double-clicking and drag-and-drop. It can also read the size of each element on the page and even simulate the keyboard.
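
A minimal sketch of these capabilities using Selenium's ActionChains (the elements located here are hypothetical and used only to illustrate the API):

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Firefox()
driver.get("/2018/07/04/hello-world/")

# Hypothetical element, used only to demonstrate the calls
title = driver.find_element(by="css selector", value="h1")
print(title.size)  # size of the element, e.g. {'height': 40, 'width': 600}

actions = ActionChains(driver)
actions.double_click(title).perform()             # double-click an element
# actions.drag_and_drop(source, target).perform() # drag one element onto another
actions.send_keys("some text").perform()          # simulate keyboard input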

3.4 Advanced Operations of Selenium

Commonly used methods to speed up Selenium crawling are:

  • Control the loading of CSS
  • Control the display of picture files
  • Control the operation of JavaScript

3.4.1 Controlling CSS

When crawling, only the content of the page matters. CSS style files control the page's appearance and the placement of elements and have no effect on the content, so we can block the loading of CSS to reduce crawl time. The code is as follows:

# Control CSS loading
from selenium import webdriver

# Use Firefox options (fp) to control CSS loading
fp = webdriver.FirefoxOptions()
fp.set_preference("permissions.default.stylesheet", 2)  # 2 means block stylesheets
driver = webdriver.Firefox(options=fp)
driver.get("/2018/07/04/hello-world/")

3.4.2 Restrict image loading

If the images on a page are not needed, it is best to block image loading. Blocking images can greatly improve the efficiency of a web crawler.

# Restrict image loading
from selenium import webdriver

fp = webdriver.FirefoxOptions()
fp.set_preference("permissions.default.image", 2)  # 2 means do not load images
driver = webdriver.Firefox(options=fp)
driver.get("/2018/07/04/hello-world/")

3.4.3 Control the operation of JavaScript

If the content to be crawled is not loaded dynamically by JavaScript, we can improve crawl efficiency by disabling JavaScript execution. Many pages use JavaScript to load large amounts of content asynchronously; when that content is not what we need, loading it only wastes time.

# Restrict JavaScript execution
from selenium import webdriver

fp = webdriver.FirefoxOptions()
fp.set_preference("javascript.enabled", False)  # Disable JavaScript
driver = webdriver.Firefox(options=fp)
driver.get("/2018/07/04/hello-world/")

References

[1] Tang Song. Python Web Crawler: From Beginner to Practice (2nd Edition) [M]. Beijing: Machinery Industry Press.