@@ -26,7 +26,7 @@ curl --location 'https://api.semanticscholar.org/graph/v1/paper/61822dc4ea365e14
--header 'x-api-key: B4YUQrO6w07Zyx9LN8V3p5Lg0WrrGDK520fWJfYD'
```
-~~Related papers `related-papers`:~~ (the web endpoint is gated by human verification)
+~~Related papers `related-papers`:~~ (this endpoint is gated by human verification; use the paper recommendations endpoint instead)
```bash
curl --location 'https://www.semanticscholar.org/api/1/paper/61822dc4ea365e1499fbdae7958aa317ad78f39f/related-papers?limit=15&recommenderType=relatedPapers' \
--header 'Cookie: tid=rBIABmR91dK7TwAJJIRdAg=='
@@ -39,6 +39,7 @@ https://api.semanticscholar.org/recommendations/v1/papers/forpaper/61822dc4ea365
### Data dictionary
+For reference only; the actual data takes precedence.
```json
{
"_id": {
@@ -72,12 +73,25 @@ pymongo
## How to run
-add_paper
+First use the `add_paper` command to import the list of papers to crawl; entries that have already been crawled are identified via the `consumed` field.
```bash
python3 spider.py add_paper --path paper2000.json
```
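The actual document schema is defined by the spider itself; as a minimal sketch of the `consumed` bookkeeping described above, assuming each paper record is a dict carrying a boolean `consumed` flag (only the field name comes from this README):

```python
def pending_papers(papers):
    """Return the papers whose `consumed` flag is unset, i.e. not yet crawled."""
    return [p for p in papers if not p.get("consumed")]

def mark_consumed(paper):
    """Flag a paper as crawled so that later runs skip it."""
    paper["consumed"] = True
    return paper
```

With the MongoDB backend, the equivalent pymongo filter would be roughly `collection.find({"consumed": {"$ne": True}})`.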
-crawl_data
+Then run `crawl_data` to start crawling:
```bash
python3 spider.py crawl_data
+```
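`crawl_data` pulls recommendations from the endpoint quoted earlier in this README. A hedged sketch of one such request using only the standard library; the endpoint URL is taken from the README, while the function names and parameters here are illustrative:

```python
import json
import urllib.request

S2_RECOMMENDATIONS = "https://api.semanticscholar.org/recommendations/v1/papers/forpaper/"

def recommendations_url(paper_id: str, limit: int = 15) -> str:
    # Endpoint taken from the curl example earlier in this README.
    return f"{S2_RECOMMENDATIONS}{paper_id}?limit={limit}"

def fetch_recommendations(paper_id: str, api_key: str = "") -> dict:
    """Fetch recommended papers for one paper, sending the API key header if given."""
    request = urllib.request.Request(recommendations_url(paper_id))
    if api_key:
        request.add_header("x-api-key", api_key)
    with urllib.request.urlopen(request) as response:  # network call
        return json.loads(response.read().decode("utf-8"))
```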
+
+## Configuration
+
+```json
+{
+    "db_url": "mongodb://localhost:27017/",
+    "db_name": "paper_spider",
+    "db_collection": "papers",
+    "s2_api_key": "your_api_key",
+    "num_threads": 10,    // number of worker threads
+    "task_queue_len": 10  // task queue length
+}
```
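Note that the `//` comments in the block above are not valid JSON and will break a plain `json.load`. How the spider actually reads its config is not shown here; if the real file keeps such comments, a loader along these lines (a sketch, with the comment-stripping logic being an assumption, not the project's code) would be needed:

```python
import json

def strip_json_comments(text: str) -> str:
    """Remove // line comments, leaving // inside string literals untouched."""
    out = []
    in_string = False
    escape = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text[i : i + 2] == "//":
            while i < len(text) and text[i] != "\n":  # skip to end of line
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

raw = '{"db_url": "mongodb://localhost:27017/", "num_threads": 10 // worker threads\n}'
config = json.loads(strip_json_comments(raw))
```

The in-string check matters: a naive `//`-split would truncate `mongodb://localhost:27017/`.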