# sem-spider

## Implementation steps

1. Read the database parameters from the config file;
2. Connect to the database and create the `papers` collection;
3. Implement the `/add_paper` endpoint for loading sample data;
4. Implement the `/crawl_data` endpoint for crawling data;
5. Implement the `worker` function to process crawl tasks.

### APIs

```bash
/graph/v1/paper/{paper_id}
/graph/v1/paper/{paper_id}/citations
/graph/v1/paper/{paper_id}/references

# https://api.semanticscholar.org/api-docs/#tag/Paper-Data/operation/post_graph_get_papers
```

Example:

```bash
curl --location 'https://api.semanticscholar.org/graph/v1/paper/61822dc4ea365e1499fbdae7958aa317ad78f39f?fields=title%2CpaperId%2CreferenceCount&limit=100&offset=' \
--header 'x-api-key: B4YUQrO6w07Zyx9LN8V3p5Lg0WrrGDK520fWJfYD'
```

Related papers (`related-papers`):

```bash
curl --location 'https://www.semanticscholar.org/api/1/paper/61822dc4ea365e1499fbdae7958aa317ad78f39f/related-papers?limit=15&recommenderType=relatedPapers' \
--header 'Cookie: tid=rBIABmR91dK7TwAJJIRdAg=='
```

### Data dictionary

```json
{
    "_id": { "$oid": "64796d23c0a763eba149940a" },
    "paper_id": 3658586,
    "year": "publication year",
    "title": "paper title",
    "slug": "slug",
    "tldr": "summary",
    "numCiting": "number of references",
    "numCitedBy": "number of citations",
    "badges": "",
    "authors": "authors",
    "corpusid": 96048797,
    "url": "...",
    "citations": [],
    "references": [],
    "relatedPapers": []
}
```

## Dependencies

requirements.txt:

```sh
requests
pymongo
```

## Usage

add_paper:

```bash
python3 spider.py add_paper --path paper2000.json
```

crawl_data:

```bash
python3 spider.py crawl_data
```
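## Sketches

Steps 1–2 of the implementation (read the DB parameters from a config file, connect, and get the `papers` collection) could look roughly like the sketch below. The `config.ini` layout, section name `[mongodb]`, and the defaults are assumptions for illustration — the real repo's config format may differ:

```python
# Minimal sketch of steps 1-2: load DB settings, then connect and
# obtain the `papers` collection. The [mongodb] section layout is a
# hypothetical example, not taken from the original project:
#
#   [mongodb]
#   uri = mongodb://localhost:27017
#   database = sem_spider

import configparser


def load_db_config(text: str) -> dict:
    """Parse the [mongodb] section out of an INI-format config string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["mongodb"]
    return {
        "uri": section.get("uri", "mongodb://localhost:27017"),
        "database": section.get("database", "sem_spider"),
    }


def papers_collection(cfg: dict):
    """Connect and return the `papers` collection (needs pymongo and a live server)."""
    from pymongo import MongoClient  # deferred so the sketch parses without a DB
    client = MongoClient(cfg["uri"])
    return client[cfg["database"]]["papers"]


if __name__ == "__main__":
    sample = "[mongodb]\nuri = mongodb://localhost:27017\ndatabase = sem_spider\n"
    print(load_db_config(sample))
```

MongoDB creates the `papers` collection lazily on first insert, so no explicit create call is needed in the common case.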
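The `worker` function would issue requests like the curl example above against the Graph API endpoints listed in the APIs section. A minimal sketch, assuming the same base URL and `x-api-key` header as the example; the `fields` list and timeout are illustrative choices, not values from the original code:

```python
# Sketch of the worker's request building, mirroring the curl example:
# GET /graph/v1/paper/{paper_id}[/citations|/references] with an x-api-key header.

BASE = "https://api.semanticscholar.org/graph/v1"


def paper_url(paper_id: str, endpoint: str = "") -> str:
    """Build the Graph API URL for a paper, or its citations/references list."""
    suffix = f"/{endpoint}" if endpoint else ""
    return f"{BASE}/paper/{paper_id}{suffix}"


def fetch_paper(paper_id: str, api_key: str,
                fields: str = "title,paperId,referenceCount") -> dict:
    """GET one paper record as JSON; raises on HTTP errors (needs network access)."""
    import requests  # deferred so the URL helper runs without requests installed
    resp = requests.get(
        paper_url(paper_id),
        params={"fields": fields, "limit": 100},
        headers={"x-api-key": api_key},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```

The same `fetch_paper` shape works for the `citations` and `references` endpoints by passing the endpoint name to `paper_url`.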