Create Crawler

API Description

This API creates a new crawler. The request address is:

"https://www.shenjian.io/rest/crawler/create"

HTTP method: POST

HTTP request URL: https://www.shenjian.io/rest/crawler/create?user_key=&lt;user key&gt;&timestamp=&lt;timestamp in seconds&gt;&sign=&lt;signature&gt;&crawler_name=&lt;crawler name&gt;

Request Parameters

Note: the request parameters consist of the common request parameters plus the parameter in the table below.

Parameter | Required | Description
crawler_name | Yes | Crawler name; must be UTF-8 urlencoded

The "Body" of the POST request must contain the complete code of the crawler, not wrapped in "key-value" form, and it must be UTF-8 urlencoded.
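Among the common request parameters, the signature deserves a closer look: as the full Java example later in this page shows, sign is the hex-encoded MD5 of user_key + timestamp + user_secret. A minimal sketch (the class and method names here are illustrative, not part of the API):

```java
import java.math.BigInteger;
import java.security.MessageDigest;

public class SignDemo {
    // Illustrative helper: sign = MD5(user_key + timestamp + user_secret),
    // returned as a 32-character lowercase hex string.
    static String sign(String userKey, long timestamp, String userSecret) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        md.update((userKey + timestamp + userSecret).getBytes("UTF-8"));
        String hex = new BigInteger(1, md.digest()).toString(16);
        // BigInteger.toString(16) drops leading zeros; pad back to 32 digits
        while (hex.length() < 32) {
            hex = "0" + hex;
        }
        return hex;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sign("your userKey", 1490166100L, "your userSecret"));
    }
}
```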

Complete POST request example

POST /rest/crawler/create?user_key=OTM0Y2NiNj-934ccb671d
&timestamp=1490166100&sign=057DD7968772B1519AD256D2B59E2185
&crawler_name=<test> HTTP/1.1
Host: www.shenjian.io
Content-Type: application/text

var configs = {
    domains: ["www.qiushibaike.com"],
    scanUrls: ["http://www.qiushibaike.com/"],
    contentUrlRegexes: [
        /http:[\/\w\.]+qiushibaike\.com\/article\/\d+/
    ],
    helperUrlRegexs: [
        /http:[\/\w\.]+qiushibaike\.com\/(8hr\/page\/\d+\?s=\d+)?/
    ],
    interval: 3000,
    fields: [
        {
            name: "title",
            selector: "//title",
            required: true
        }
    ]
};

var crawler = new Crawler(configs);
crawler.start();

Java example: creating a crawler

Sample code that calls the RESTful API from Java to create a crawler:

/**
 * This example uses the "commons-io" library,
 * so add the following dependency to pom.xml:
 * <dependency>
 *     <groupId>org.apache.commons</groupId>
 *     <artifactId>commons-io</artifactId>
 *     <version>1.3.2</version>
 * </dependency>
 */
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.math.BigInteger;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Date;
import org.apache.commons.io.IOUtils;

OutputStreamWriter out = null;

try {
    // userKey and userSecret can be found in the "User Center"
    String userKey = "your userKey";
    String userSecret = "your userSecret";

    long timestamp = new Date().getTime() / 1000;
    MessageDigest md = MessageDigest.getInstance("MD5");
    String sign = userKey + timestamp + userSecret;
    md.update(sign.getBytes("UTF-8"));
    sign = new BigInteger(1, md.digest()).toString(16);
    // BigInteger.toString(16) drops leading zeros; pad back to 32 hex digits
    while (sign.length() < 32) {
        sign = "0" + sign;
    }

    String crawlerName = "Simple article crawler demo - Leiphone articles";
    // Configure and open a connection to the site you will send the request to
    String realUrl = "https://www.shenjian.io/rest/crawler/create?"
            + "user_key=" + userKey + "&timestamp=" + timestamp
            + "&sign=" + sign + "&crawler_name="
            + URLEncoder.encode(crawlerName, "UTF-8");
    URL url = new URL(realUrl);
    URLConnection urlConnection = url.openConnection();
    // setDoOutput(true) means we will write data through this urlConnection
    urlConnection.setDoOutput(true);
    urlConnection.setDoInput(true);
    // Declare the content type of the request body; here application/text
    urlConnection.setRequestProperty("content-type", "application/text");
    // Get the request's output stream, writing as UTF-8
    out = new OutputStreamWriter(urlConnection.getOutputStream(), "UTF-8");
    // Write the crawler code into the request body.
    // IDEA can paste multi-line strings directly;
    // Eclipse needs a plugin to support multi-line strings.
    String str = "var configs = {\n"
            + "    domains: [\"www.qiushibaike.com\"],\n"
            + "    scanUrls: [\"http://www.qiushibaike.com/\"],\n"
            + "    contentUrlRegexes: [\n"
            + "        /http:[\\/\\w\\.]+qiushibaike\\.com\\/article\\/\\d+/\n"
            + "    ],\n"
            + "    helperUrlRegexs: [\n"
            + "        /http:[\\/\\w\\.]+qiushibaike\\.com\\/(8hr\\/page\\/\\d+\\?s=\\d+)?/\n"
            + "    ],\n"
            + "    interval: 3000,\n"
            + "    fields: [\n"
            + "        {\n"
            + "            name: \"title\",\n"
            + "            selector: \"//title\",\n"
            + "            required: true\n"
            + "        }\n"
            + "    ]\n"
            + "};\n"
            + "\n"
            + "var crawler = new Crawler(configs);\n"
            + "crawler.start();";
    out.write(str);
    out.flush();
    out.close();

    // Read the response from the server
    InputStream inputStream = urlConnection.getInputStream();
    String encoding = urlConnection.getContentEncoding();
    // getContentEncoding() may return null; fall back to UTF-8
    String body = IOUtils.toString(inputStream, encoding != null ? encoding : "UTF-8");
    System.out.println(body);
} catch (IOException | NoSuchAlgorithmException e) {
    e.printStackTrace();
} finally {
    try {
        if (out != null) {
            out.close();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
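On Java 11 and later, the same request can be sketched with the built-in java.net.http.HttpClient, avoiding the commons-io dependency. The class and method names below are illustrative, and the credentials and crawler code are placeholders; the signature is assumed to be precomputed as described above (MD5 of user_key + timestamp + user_secret):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class CreateCrawlerHttpClient {
    // Build the create-crawler request; sign must already be computed.
    static HttpRequest buildRequest(String userKey, long timestamp, String sign,
                                    String crawlerName, String crawlerCode) throws Exception {
        String url = "https://www.shenjian.io/rest/crawler/create?"
                + "user_key=" + userKey + "&timestamp=" + timestamp
                + "&sign=" + sign + "&crawler_name="
                + URLEncoder.encode(crawlerName, "UTF-8");
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Content-Type", "application/text")
                // The full crawler code goes in the request body, not as key-value pairs
                .POST(HttpRequest.BodyPublishers.ofString(crawlerCode, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = buildRequest("your userKey", 1490166100L, "your sign",
                "demo crawler", "var configs = {};");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```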

Click here to view the response parameter descriptions, the return code table, a successful call example, and a failed call example.