# sem-spider

## Implementation steps

1. Read the database parameters from the configuration file.
2. Connect to the database and create the `papers` collection.
3. Implement the `/add_paper` endpoint, which adds sample data.
4. Implement the `/crawl_data` endpoint, which crawls data.
5. Implement the `worker` function, which processes crawl tasks.

### APIs

```bash
/graph/v1/paper/{paper_id}
/graph/v1/paper/{paper_id}/citations
/graph/v1/paper/{paper_id}/references
# https://api.semanticscholar.org/api-docs/#tag/Paper-Data/operation/post_graph_get_papers
```

Example:

```bash
curl --location 'https://api.semanticscholar.org/graph/v1/paper/61822dc4ea365e1499fbdae7958aa317ad78f39f?fields=title%2CpaperId%2CreferenceCount&limit=100&offset=' \
--header 'x-api-key: your_api_key'
```

~~Related papers `related-papers`:~~ (this endpoint is protected by a CAPTCHA; use the recommended-papers endpoint instead)

```bash
curl --location 'https://www.semanticscholar.org/api/1/paper/61822dc4ea365e1499fbdae7958aa317ad78f39f/related-papers?limit=15&recommenderType=relatedPapers' \
--header 'Cookie: tid=rBIABmR91dK7TwAJJIRdAg=='
```

Recommended papers:

```bash
https://api.semanticscholar.org/recommendations/v1/papers/forpaper/61822dc4ea365e1499fbdae7958aa317ad78f39f?fields=url,abstract,authors
```

### Data dictionary

The fields below are indicative; the data actually stored takes precedence.

```json
{
  "_id": {
    "$oid": "64796d23c0a763eba149940a"
  },
  "paper_id": 3658586,
  "year": "publication year",
  "title": "paper title",
  "slug": "slug",
  "tldr": "summary",
  "numCiting": "number of papers this paper cites",
  "numCitedBy": "number of papers citing this paper",
  "bages": "",
  "authors": "authors",
  "corpusid": 96048797,
  "url": "...",
  "citations": [],
  "references": [],
  "relaterPapers": []
}
```

## Dependencies

requirements.txt

```text
requests
pymongo
```

## How to run

First use the `add_paper` command to import the list of papers to crawl. The `consumed` field marks papers that have already been crawled.

```bash
python3 spider.py add_paper --path paper2000.json
```

Then run `crawl_data` to crawl the data:

```bash
python3 spider.py crawl_data
```

## Configuration file

```json
{
  "db_url": "mongodb://localhost:27017/",
  "db_name": "paper_spider",
  "db_collection": "papers",
  "s2_api_key": "your_api_key",
  "num_threads": 10,
  "task_queue_len": 10
}
```

`num_threads` is the number of worker threads and `task_queue_len` is the length of the task queue.
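
The way `num_threads` and `task_queue_len` interact with the `worker` function can be sketched as below. This is a minimal illustration, not the project's actual `spider.py`: the helper names `load_config` and `run_workers` and the `handler` callback are hypothetical, and the MongoDB and Semantic Scholar calls are left out so the snippet runs on its own. It also assumes the configuration file is plain JSON, since Python's standard `json` module does not accept `//` comments.

```python
import json
import queue
import threading


def load_config(path):
    """Read the JSON configuration file into a dict (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def run_workers(paper_ids, handler, num_threads=10, task_queue_len=10):
    """Process paper_ids with num_threads workers fed by a bounded queue.

    handler is a stand-in for whatever fetches and stores one paper.
    Returns a list of (paper_id, outcome) pairs in completion order.
    """
    # A bounded queue: put() blocks once task_queue_len items are waiting.
    tasks = queue.Queue(maxsize=task_queue_len)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            paper_id = tasks.get()
            if paper_id is None:  # sentinel: no more work for this thread
                break
            try:
                outcome = handler(paper_id)
            except Exception as exc:  # record the failure, keep the worker alive
                outcome = exc
            with lock:
                results.append((paper_id, outcome))

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for paper_id in paper_ids:
        tasks.put(paper_id)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

With a config loaded through `load_config("config.json")`, a call such as `run_workers(ids, fetch_paper, cfg["num_threads"], cfg["task_queue_len"])` would fan the IDs out across the threads, where `fetch_paper` is a placeholder for the real crawl step. Keeping the queue bounded means the producer cannot run far ahead of the workers, which limits memory use when the paper list is large.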