
fix: avoid checking every record inside the loop

Ben 1 year ago
parent
commit
a1279945fb
2 changed files with 23 additions and 13 deletions
  1. README.md (+17 -3)
  2. spider.py (+6 -10)

README.md (+17 -3)

@@ -26,7 +26,7 @@ curl --location 'https://api.semanticscholar.org/graph/v1/paper/61822dc4ea365e14
 --header 'x-api-key: B4YUQrO6w07Zyx9LN8V3p5Lg0WrrGDK520fWJfYD'
 ```
 
-~~Related papers `related-papers`:~~ (the web endpoint requires human verification)
+~~Related papers `related-papers`:~~ (this endpoint requires human verification; use the recommended-papers endpoint instead)
 ```bash
 curl --location 'https://www.semanticscholar.org/api/1/paper/61822dc4ea365e1499fbdae7958aa317ad78f39f/related-papers?limit=15&recommenderType=relatedPapers' \
 --header 'Cookie: tid=rBIABmR91dK7TwAJJIRdAg=='
@@ -39,6 +39,7 @@ https://api.semanticscholar.org/recommendations/v1/papers/forpaper/61822dc4ea365
 
 ### Data dictionary
 
+The schema below is an example; the actual data takes precedence
 ```json
 {
     "_id": {
@@ -72,12 +73,25 @@ pymongo
 
 ## How to run
 
-add_paper
+First use the `add_paper` command to import the list of papers to crawl; papers that have already been crawled are identified via the `consumed` field
 ``` python
 python3 spider.py add_paper --path paper2000.json
 ```
 
-crawl_data
+Then run `crawl_data` to start crawling
 ``` python
 python spider.py crawl_data
+```
+
+## Configuration file
+
+```json
+{
+    "db_url": "mongodb://localhost:27017/",
+    "db_name": "paper_spider",
+    "db_collection": "papers",
+    "s2_api_key": "your_api_key",
+    "num_threads": 10, // 线程数
+    "task_queue_len": 10 // 任务队列长度
+}
 ```
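
The new "Configuration file" section lists the settings the spider reads. Below is a rough sketch of how these keys might be consumed; the loading code itself is not part of this diff, `load_config` is a hypothetical helper, and since standard JSON rejects `//` comments, the real file would omit the inline annotations shown above:

```python
# Minimal sketch of consuming config.json (hypothetical helper; the actual
# loading code in spider.py is not shown in this diff).
import json

from pymongo import MongoClient


def load_config(path="config.json"):
    """Read spider settings from the JSON config file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    cfg = load_config()
    client = MongoClient(cfg["db_url"])                    # mongodb://localhost:27017/
    papers = client[cfg["db_name"]][cfg["db_collection"]]  # paper_spider.papers
    print("papers in collection:", papers.count_documents({}))
```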

spider.py (+6 -10)

@@ -78,18 +78,14 @@ def crawl_data():
         threads.append(t)
 
 # Read URLs from the database and add them to the task queue
-    for data in papers.find():
-        if quit_flag is True:
+    for data in papers.find({'$or': [{'consumed': {'$exists': False}}, {'consumed': False}]}):
+        if quit_flag:
             break
-        url = data["url"]
-        corpusid = data["corpusid"]
-        if 'consumed' in data.keys() and data['consumed'] is True:
-            print(corpusid, "already inserted")
+        if 'consumed' in data and data['consumed']:
+            print(data['corpusid'], "already inserted")
             continue
-        # print(data['corpusid'])
-        # print(data['url'])
-        print('add {} to the task queue'.format(corpusid))
-        q.put((url, corpusid))
+        print('add {} to the task queue'.format(data['corpusid']))
+        q.put((data['url'], data['corpusid']))
 
     #
 print("Waiting for the task queue to complete...")
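
The substance of the spider.py change is pushing the `consumed` check into the MongoDB query, so already-crawled documents never reach the Python loop. A standalone sketch contrasting the two approaches, with connection details taken from the README config (illustrative only):

```python
# Sketch contrasting the old and new ways of selecting uncrawled papers.
from pymongo import MongoClient

papers = MongoClient("mongodb://localhost:27017/")["paper_spider"]["papers"]

# Old approach: fetch every document, then skip consumed ones in Python.
pending_old = (d for d in papers.find()
               if not d.get("consumed", False))

# New approach: let MongoDB filter, returning only documents where
# 'consumed' is missing or explicitly False.
pending_new = papers.find({
    "$or": [
        {"consumed": {"$exists": False}},
        {"consumed": False},
    ]
})

for data in pending_new:
    print("add {} to the task queue".format(data["corpusid"]))
```

Note that the in-loop `consumed` check survives in the new code as a safeguard, even though the query already excludes those documents.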