Description:
Extract data from web pages.
Syntax:
web_crawl(jsonStr)
Note:
This is an external library function (see External Library Guide). It extracts data from web pages.
Parameter:
jsonStr | The string defining the rules for traversing URLs, downloading pages, and extracting and saving the desired data. Details worth noting because they are prone to parsing errors: under a node represented by braces {}, brackets [] supply a list, while nested braces {} represent a structure of mapping keys. The rule string consists of the following nodes (see the sketch below): web_info: information about the website to be downloaded, including the domain name, the local storage location, user-agent information and other user-defined information; init_url: the initial URL, i.e. the entry point from which URL traversal starts; help_url: a rule for pages that are used only to collect URLs, without extracting data from them; target_url: a rule for pages to be downloaded, from which both URLs are collected and data is extracted; page_url: the rule for extracting data from the pages that target_url downloads. |
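For illustration only, the sketch below shows how these node types can be combined in a single rule string. The web_info keys save_path and save_post and the page_url keys extractby and class are taken from the example further down this page; the remaining key names (domain, user_agent, filter, reg_url) and all URL patterns are assumed placeholders, used solely to show how [] lists and {} maps nest inside the rule:

[
 {web_info:{domain:'www.example.com', save_path:'d:/tmp/data', save_post:'false', user_agent:'Mozilla/5.0'}},
 {init_url:['http://www.example.com/stocks/list.html']},
 {help_url:['stocks/list_\d+\.html']},
 {target_url:{filter:'stocks/list', reg_url:'history\.html\?s=\d{6}'}},
 {page_url:{extractby:"//div[@id='content']", class:'default'}}
]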
Return value:
Boolean value
Example:
  | A |
1 | [{web_info:{save_path:'d:/tmp/data', save_post:'false'}},{init_url:['http://www.aigaogao.com/tools/history.html?s=600000']},{page_url:{extractby: "//div[@id='ctl16_contentdiv']/",class:'default'}}] | The JSON string defining the data extraction rule.
2 | =web_crawl(A1) | Extract data from the web pages according to the rule in A1.
3 | =file("D:/tmp/data/600000.txt").import@cqt() | Import the extracted data from the local file where web_crawl saved it.
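Since web_crawl returns a Boolean, the crawl result can guard the subsequent read of the saved file. The lines below are a minimal sketch that builds on the grid above (A1 and A2 refer to the cells of that example); they assume the standard if(x,a,b) function and are not part of the original example:

2 | =web_crawl(A1) | Returns true when the crawl completes successfully.
3 | =if(A2, file("D:/tmp/data/600000.txt").import@cqt(), null) | Import the saved file only when the crawl reported success; otherwise A3 is null.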