Users can access the WebcrawlCli external library through the interface embedded in esProc designer to extract specific data from certain websites. To deploy the external library:
1. The files of this external library are located in installation directory\esProc\extlib\WebcrawlCli. The Raqsoft core jar for this external library is scu-webcrawl-cli-2.10.jar. The directory also contains the following jars:
accessors-smart-1.2.jar
asm-5.0.4.jar
assertj-core-1.5.0.jar
commons-codec-1.9.jar
commons-collections-3.2.2.jar
commons-io-1.3.2.jar
commons-lang3-3.1.jar
commons-logging-1.2.jar
commons-pool2-2.4.2.jar
fastjson-1.2.28.jar
hamcrest-core-1.3.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
jedis-2.9.0.jar
json-path-2.4.0.jar
json-smart-2.3.jar
jsoup-1.10.3.jar
junit-4.11.jar
log4j-1.2.17.jar
slf4j-api-1.7.6.jar
slf4j-log4j12-1.7.6.jar
webmagic-core-0.7.3.jar
webmagic-extension-0.7.3.jar
webStock-2.10.jar
xsoup-0.3.1.jar
Note: The third-party jars are bundled in the compressed package, and users can choose the ones appropriate for their specific scenario.
2. JRE 1.7 or above is required. If the esProc built-in JRE does not meet this requirement, install a higher version and then configure java_home in config.txt under installation directory\esProc\bin. Skip this step if the JRE version is adequate.
3. esProc provides the function web_crawl() to extract data from websites. Look it up in【Help】-【Function reference】for details on its usage.
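The prerequisites in steps 1 and 2 can be checked with a short script before starting esProc. The following is a minimal Python sketch, not part of esProc: the helper names (java_major_version, jre_is_adequate, missing_jars) are hypothetical, and the version string is assumed to come from the output of `java -version`.

```python
import re
from pathlib import Path

# Minimum JRE stated in step 2: version 1.7, i.e. Java 7.
MIN_JAVA_MAJOR = 7


def java_major_version(version):
    """Return the major Java version from a version string.

    Handles the legacy scheme ("1.7.0_80" -> 7) and the
    newer scheme ("11.0.2" -> 11).
    """
    parts = re.findall(r"\d+", version)
    if not parts:
        raise ValueError("unrecognized Java version: %r" % version)
    first = int(parts[0])
    # Legacy strings start with "1." and carry the major version second.
    if first == 1 and len(parts) > 1:
        return int(parts[1])
    return first


def jre_is_adequate(version):
    """True if the given version string satisfies the JRE 1.7 requirement."""
    return java_major_version(version) >= MIN_JAVA_MAJOR


def missing_jars(extlib_dir, required):
    """Return the names in `required` that are not present in extlib_dir."""
    present = {p.name for p in Path(extlib_dir).glob("*.jar")}
    return [name for name in required if name not in present]
```

For example, jre_is_adequate("1.6.0_45") reports that a higher JRE must be installed, and missing_jars called with the WebcrawlCli directory and the jar names needed for a given scenario lists any that still have to be copied in.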