BA_PY Wiki

BA_PY: Optimize Your Workflow with Python!

Brought to you by: bhmfly

paper

API Documentation

Generated by ChatGPT

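Module Overview

This module provides functions for working with scientific papers: parsing RIS files, searching for papers on Baidu Xueshu (Baidu Scholar) and PubMed, downloading papers from Sci-Hub, and extracting PDF bookmarks.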

Module Imports

    import os, re, requests
    # Any and Union are used by the signatures documented below
    from typing import Any, Dict, List, Union

    from lxml import etree
    import rispy
    import PyPDF2

Functions and Classes

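`parse_ris(ris_path:str, fill_none_doi:str = None) -> List[Dict[str, str]]`

Parses a RIS file and returns its contents as a list of dictionaries.

Parameters:
- ris_path (str): Path to the RIS file.
- fill_none_doi (str, optional): Value used to fill in the DOI of entries that lack one. Defaults to None.

Returns:
- List[Dict[str, str]]: A list of dictionaries holding the parsed contents of the RIS file.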
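
As a rough illustration of what parsing RIS records and filling missing DOIs involves, the sketch below reads RIS text using only the standard library. It is not the module's implementation (the module imports `rispy`); the function name `parse_ris_text`, the tag-to-key mapping, and the sample records (including the DOI) are assumptions made for this example.

```python
import re
from typing import Dict, List, Optional

# Hypothetical mapping from a few RIS tags to dictionary keys (illustrative only).
TAG_TO_KEY = {"TY": "type_of_reference", "TI": "title", "T1": "title",
              "DO": "doi", "AB": "abstract"}
# A RIS line is a two-character tag, two spaces, a hyphen, then the value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris_text(text: str, fill_none_doi: Optional[str] = None) -> List[Dict[str, str]]:
    """Parse RIS-formatted text into one dictionary per record."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        match = RIS_LINE.match(line.rstrip())
        if not match:
            continue  # skip blank or malformed lines
        tag, value = match.groups()
        if tag == "ER":  # end of record
            if fill_none_doi is not None and "doi" not in current:
                current["doi"] = fill_none_doi
            records.append(current)
            current = {}
        elif tag in TAG_TO_KEY:
            current[TAG_TO_KEY[tag]] = value.strip()
    return records


# Made-up sample data: the second record has no DO (DOI) line.
SAMPLE = "\n".join([
    "TY  - JOUR", "TI  - First paper", "DO  - 10.1000/example", "ER  - ",
    "TY  - JOUR", "TI  - Second paper", "ER  - ",
])
records = parse_ris_text(SAMPLE, fill_none_doi="")
print(records)
```

Passing an empty string as `fill_none_doi` guarantees that every record has a `doi` key, so later code can index `record["doi"]` without checking for its presence.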

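`search_by_baidu(query:str, limit:int = 1, proxies = None) -> List[Dict[str, str]]`

Searches Baidu Xueshu for articles matching the given query.

Parameters:
- query (str): The search query.
- limit (int, optional): Maximum number of articles to fetch. Defaults to 1.
- proxies (dict, optional): Proxy dictionary used for the requests. Defaults to None.

Returns:
- List[Dict[str, str]]: A list of dictionaries describing the articles found. Each dictionary has the following keys:
  - title (str): The title of the article.
  - abstract (str): The abstract of the article.
  - keyword (List[str]): A list of keywords associated with the article.
  - doi (str): The DOI (Digital Object Identifier) of the article.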

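`search_by_pubmed(query:str, email:str = None, limit:int = 1) -> List[Dict[str, str]]`

Searches PubMed for articles matching the given query.

Parameters:
- query (str): The search query.
- email (str, optional): Email address to use for the query. Defaults to None.
- limit (int, optional): Maximum number of articles to fetch. Defaults to 1.

Returns:
- List[Dict[str, str]]: A list of dictionaries describing the articles found. Each dictionary has the following keys:
  - title (str): The title of the article.
  - abstract (str): The abstract of the article.
  - doi (str): The DOI (Digital Object Identifier) of the article.
  - journal (str): The journal in which the article was published.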
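
For context, PubMed can be queried over HTTP through the NCBI E-utilities `esearch` endpoint, which accepts `db`, `term`, `retmax`, `retmode`, and `email` parameters. The sketch below only builds such a request URL and sends nothing over the network. Whether `search_by_pubmed` uses this endpoint is an assumption, and the helper name `build_pubmed_search_url` and the example address are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(query: str, email: Optional[str] = None, limit: int = 1) -> str:
    """Build an E-utilities esearch URL for a PubMed query (no request is sent)."""
    params = {"db": "pubmed", "term": query, "retmax": str(limit), "retmode": "json"}
    if email:
        # Include the contact address only when one is supplied.
        params["email"] = email
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_pubmed_search_url("linaclotide", email="user@example.edu", limit=11)
query_params = parse_qs(urlsplit(url).query)  # decode the query string for inspection
print(url)
print(query_params)
```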

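`search(query:str, limit:int = 1, search_engine:str = 'baidu xueshu', email:str = None) -> List[Dict[str, str]]`

Searches for the given query with the specified search engine and returns the results.

Parameters:
- query (str): The query string to search for.
- limit (int, optional): Maximum number of results to return. Defaults to 1.
- search_engine (str, optional): The search engine to use. Defaults to 'baidu xueshu'.
  - Allowed values: 'baidu xueshu', 'science direct', 'publons'. If the value is not recognized, None is returned.
- email (str, optional): Email address. Defaults to None.

Returns:
- List[Dict[str, str]]: The search results as a list of dictionaries containing 'title', 'abstract', 'keyword', and 'doi'.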
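
The behaviour described above (pick a backend by name, return None for an unrecognized name) can be sketched with a dictionary of callables. This is an assumed design, not the module's actual code; `dispatch_search`, `SEARCH_ENGINES`, and the stub backend `_stub_baidu_search` are hypothetical names, and the stub returns placeholder records instead of performing a real search.

```python
from typing import Callable, Dict, List, Optional

SearchBackend = Callable[[str, int], List[Dict[str, str]]]


def _stub_baidu_search(query: str, limit: int) -> List[Dict[str, str]]:
    # Stand-in for a real backend: returns `limit` placeholder records.
    return [{"title": f"{query} #{i}", "doi": ""} for i in range(limit)]


# Registry mapping a search-engine name to the function that serves it.
SEARCH_ENGINES: Dict[str, SearchBackend] = {"baidu xueshu": _stub_baidu_search}


def dispatch_search(query: str, limit: int = 1,
                    search_engine: str = "baidu xueshu") -> Optional[List[Dict[str, str]]]:
    """Route the query to the named backend, or return None if the name is unknown."""
    backend = SEARCH_ENGINES.get(search_engine)
    if backend is None:
        return None
    return backend(query, limit)


print(dispatch_search("linaclotide", limit=2))
print(dispatch_search("linaclotide", search_engine="unknown engine"))
```

A registry like this keeps the dispatch logic unchanged when a new engine is added: only a new dictionary entry is needed.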

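`download_from_scihub_by_doi(doi:str, proxies = None) -> Dict[str, Any]`

Downloads a file from the Sci-Hub database using its DOI.

Parameters:
- doi (str): DOI of the file to download.
- proxies (dict, optional): Proxy dictionary used for the requests. Defaults to None.

Returns:
- Dict[str, Any]: A dictionary containing the title, DOI, and response object of the download request.
  - Returns None if an error occurs.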

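`download_from_scihub_by_title(title:str, proxies = None) -> Dict[str, Any]`

Downloads a document from Sci-Hub using its title.

Parameters:
- title (str): Title of the document to download.
- proxies (dict, optional): Proxy dictionary used for the HTTP requests. Defaults to None.

Returns:
- Dict[str, Any]: A dictionary containing the title, DOI, and response object of the download request.
  - Returns None if an error occurs.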

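`download_by_scihub(dir: str, doi: str = None, title:str = None, file_full_name:str = None, use_title_as_name: bool = True, valid_path_chr:str = '_') -> Dict[str, Any]`

Downloads a paper from Sci-Hub by DOI or by title and saves it to a directory.

Parameters:
- dir (str): Directory in which the downloaded file is saved.
- doi (str, optional): DOI (Digital Object Identifier) of the paper. Defaults to None.
- title (str, optional): Title of the paper. Defaults to None.
- file_full_name (str, optional): Name of the downloaded file, including the file extension (.pdf). Defaults to None.
- use_title_as_name (bool, optional): Whether to use the paper's title as the file name. Defaults to True.
- valid_path_chr (str, optional): Character that replaces invalid characters in the file name. Defaults to '_'.

Returns:
- Dict[str, Any]: On success, a dictionary with information about the downloaded paper. On failure, None.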
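
Paper titles often contain characters that are not allowed in file names, which is what `valid_path_chr` is for. The sketch below shows one way to build a safe target path from a title. The helper name `build_pdf_path` and the exact set of characters treated as invalid are assumptions for illustration, not the module's own rules.

```python
import os
import re

# Characters commonly rejected in file names (an assumed set, not the module's).
INVALID_PATH_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')


def build_pdf_path(directory: str, title: str, valid_path_chr: str = "_") -> str:
    """Replace invalid characters in `title` and return the .pdf path inside `directory`."""
    safe_title = INVALID_PATH_CHARS.sub(valid_path_chr, title).strip()
    return os.path.join(directory, f"{safe_title}.pdf")


path = build_pdf_path("./data_tmp", 'Pain: a "new" model?')
print(os.path.basename(path))  # Pain_ a _new_ model_.pdf
```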

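`has_sci_bookmarks(pdf_path:str = None, pdf_obj = None, section_names:List[str]=[]) -> Union[List[str], bool]`

Checks whether a PDF document has bookmarks for scientific sections.

Parameters:
- pdf_path (str): Path to the PDF document. Defaults to None.
- pdf_obj: An already opened PDF object. Defaults to None.
- section_names (List[str]): Names of the scientific sections whose bookmarks should be checked. Defaults to an empty list.

Returns:
- Union[List[str], bool]: If the PDF has bookmarks, a list of the scientific section names; otherwise False.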

`get_sci_bookmarks_from_pdf(pdf_path:str = None, pdf_obj = None, section_names:List[str]=[]) -> List[str]`

Returns a list of the section names found in a scientific PDF.

Parameters:
- pdf_path (str): Path to the PDF file. Defaults to None.
- pdf_obj: A PDF object. Defaults to None.
- section_names (List[str]): Section names to search for. If None, all sections are searched, including 'Abstract', 'Introduction', 'Materials', 'Methods', 'Results', 'Discussion', 'References'.

Returns:
- List[str]: A list of the section names found in the PDF.
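
The matching step can be pictured as checking each section name against the bookmark titles. The sketch below works on plain strings rather than an open PDF and uses case-insensitive substring matching. Both the matching rule and the names `match_sci_sections` and `DEFAULT_SECTIONS` are assumptions, so the module may behave differently.

```python
from typing import List, Optional

DEFAULT_SECTIONS = ["Abstract", "Introduction", "Materials", "Methods",
                    "Results", "Discussion", "References"]


def match_sci_sections(bookmarks: List[str],
                       section_names: Optional[List[str]] = None) -> List[str]:
    """Return the section names that occur in at least one bookmark title."""
    names = section_names or DEFAULT_SECTIONS  # fall back to the default list
    found = []
    for name in names:
        if any(name.lower() in title.lower() for title in bookmarks):
            found.append(name)
    return found


bookmarks = ["1. Introduction", "2. Materials and methods", "3. Results", "Acknowledgements"]
print(match_sci_sections(bookmarks))
# ['Introduction', 'Materials', 'Methods', 'Results']
```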

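`get_section_bookmarks(pdf_path:str = None, pdf_obj = None) -> Union[List[str], None]`

Returns a list of the titles of the bookmarked sections in a PDF.

Parameters:
- pdf_path (str): Path to the PDF file. Defaults to None.
- pdf_obj: An already opened PDF object. Defaults to None.

Returns:
- Union[List[str], None]: A list of the titles of the bookmarked sections in the PDF.
  - Returns None if there are no bookmarked sections or the PDF file does not exist.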

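`get_english_part_of_bookmarks(bookmarks:List[str]) -> List[str]`

Extracts the English part of each bookmark in the given list.

Parameters:
- bookmarks (List[str]): The list of bookmarks.

Returns:
- List[str]: A list containing only the English part of the bookmarks.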
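
One plausible reading of "the English part" is the run of Latin letters in each title, with numbering and non-Latin text dropped. The sketch below implements that reading with a regular expression. The rule itself, the decision to drop titles with no English text, and the name `english_part` are all assumptions, so the module's output may differ.

```python
import re
from typing import List

# A run that starts with a Latin letter and continues with letters, spaces, '&' or '-'.
ENGLISH_RUN = re.compile(r"[A-Za-z][A-Za-z &\-]*")


def english_part(bookmarks: List[str]) -> List[str]:
    """Keep only the Latin-letter runs of each bookmark title."""
    parts = []
    for title in bookmarks:
        runs = [run.strip() for run in ENGLISH_RUN.findall(title)]
        text = " ".join(run for run in runs if run)
        if text:  # titles without any English text are dropped
            parts.append(text)
    return parts


print(english_part(["1. Introduction", "2 材料与方法 Materials & Methods", "结果"]))
# ['Introduction', 'Materials & Methods']
```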

`get_section_from_paper(paper:str, key:str, keys:List[str] = ['Title', 'Authors', 'Abstract', 'Keywords', 'Introduction', 'Materials & Methods', 'Results', 'Discussion', 'References']) -> Union[str, None]`

Extracts a section from a scientific paper by keyword.

Parameters:
- paper (str): The scientific paper, as text.
- key (str): The section to extract; one of 'Title', 'Authors', 'Abstract', 'Keywords', 'Introduction', 'Materials & Methods', 'Results', 'Discussion', 'References'.
- keys (List[str], optional): The list of section keywords. Defaults to ['Title', 'Authors', 'Abstract', 'Keywords', 'Introduction', 'Materials & Methods', 'Results', 'Discussion', 'References'].

Returns:
- Union[str, None]: The section extracted from the paper.
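
A simple way to picture this extraction: find the heading `key`, then cut the text at the nearest following heading from `keys`. The sketch below does this with plain substring search. It is an assumed approach, not the module's code, and `extract_section` is a hypothetical name.

```python
from typing import List, Optional


def extract_section(paper: str, key: str, keys: List[str]) -> Optional[str]:
    """Return the text between heading `key` and the next heading listed in `keys`."""
    start = paper.find(key)
    if start == -1:
        return None  # the requested heading does not occur in the text
    body_start = start + len(key)
    # Positions of every other heading that appears after the requested one.
    ends = [pos for other in keys if other != key
            for pos in [paper.find(other, body_start)] if pos != -1]
    end = min(ends) if ends else len(paper)
    return paper[body_start:end].strip()


paper = "Abstract\nShort summary.\nIntroduction\nBackground text.\nResults\nKey findings."
keys = ["Abstract", "Introduction", "Results"]
print(extract_section(paper, "Introduction", keys))  # Background text.
```

Plain substring matching is fragile: a heading word that also occurs in running text (for example "results" inside a sentence, if matching were case-insensitive) would cut the section short, so a real implementation would likely need stricter heading detection.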

`format_paper_from_txt(content:str, struct:List[str] = ['Title', 'Authors', 'Abstract', 'Keywords', 'Introduction', 'Materials & Methods', 'Results', 'Discussion', 'References']) -> Dict[str, str]`

Formats a paper from plain text.

Parameters:
- content (str): The text content.
- struct (List[str], optional): The list of sections to extract. Defaults to ['Title', 'Authors', 'Abstract', 'Keywords', 'Introduction', 'Materials & Methods', 'Results', 'Discussion', 'References'].

Returns:
- Dict[str, str]: A dictionary containing the formatted paper information.
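
To show what a dictionary of formatted sections might look like, the sketch below splits text in a single pass wherever a heading from `struct` stands alone on a line. The assumption that headings occupy their own line, and the name `split_paper`, are illustrative only and may not match how the module locates sections.

```python
import re
from typing import Dict, List


def split_paper(content: str, struct: List[str]) -> Dict[str, str]:
    """Map each heading from `struct` found on its own line to the text that follows it."""
    alternatives = "|".join(re.escape(name) for name in struct)
    pattern = re.compile(r"^(%s)[ \t]*$" % alternatives, re.MULTILINE)
    # With a capturing group, split() yields:
    # [text before first heading, heading, body, heading, body, ...]
    pieces = pattern.split(content)
    return {heading: body.strip() for heading, body in zip(pieces[1::2], pieces[2::2])}


content = "Title\nA study\nAbstract\nShort summary.\nResults\nKey findings.\n"
sections = split_paper(content, ["Title", "Abstract", "Results"])
print(sections)
# {'Title': 'A study', 'Abstract': 'Short summary.', 'Results': 'Key findings.'}
```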

Global Variables

`session`

A requests.Session object used to send HTTP requests.

`available_scihub_urls`

A list of available Sci-Hub URLs.

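Internal Functions

`_get_a_search_page(query:str, page:int = 0) -> List[str]`

Gets the links on a Baidu Xueshu search results page.

Parameters:
- query (str): The search query.
- page (int, optional): The page index. Defaults to 0.

Returns:
- List[str]: A list of search result links.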

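`_parse_links(links:list) -> List[Dict[str, str]]`

Parses search result links.

Parameters:
- links (list): The list of search result links.

Returns:
- List[Dict[str, str]]: A list of dictionaries containing the search result information.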

`_update_available_scihub_urls() -> List[str]`

Updates the list of available Sci-Hub URLs.

Returns:
- List[str]: The list of available Sci-Hub URLs.

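`_download_from_scihub_webpage(webpage:requests.Response, proxies = None) -> Dict[str, Any]`

Downloads a file from a Sci-Hub web page.

Parameters:
- webpage (requests.Response): The response object of the Sci-Hub web page.
- proxies (dict, optional): Proxy dictionary used for the requests. Defaults to None.

Returns:
- Dict[str, Any]: A dictionary containing the title, DOI, and response object of the download request.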

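`_flatten_pdf_bookmarks(*bookmarks) -> List[Any]`

Parses a list of bookmarks and returns it as a flat list.

Parameters:
- *bookmarks (List[Any]): The bookmarks, possibly nested.

Returns:
- List[Any]: The flattened list of bookmarks.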
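
PDF outlines are typically nested, with sub-sections stored as lists inside the top-level list. Flattening them is a short recursion, sketched below on plain strings. The name `flatten_bookmarks` and the sample data are hypothetical, and real bookmark entries would be library objects rather than strings.

```python
from typing import Any, List


def flatten_bookmarks(*bookmarks: Any) -> List[Any]:
    """Flatten arbitrarily nested lists of bookmarks, preserving document order."""
    flat: List[Any] = []
    for item in bookmarks:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten_bookmarks(*item))  # recurse into the sub-level
        else:
            flat.append(item)
    return flat


nested = ["Abstract", ["Introduction", ["Background", "Aims"]], "Results"]
print(flatten_bookmarks(*nested))
# ['Abstract', 'Introduction', 'Background', 'Aims', 'Results']
```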

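Main Program

`main` code

This part contains test code that can be run in development mode.

Exceptions

Exception: Raised if the DOI does not exist or an error occurs while fetching the file from Sci-Hub.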

Module Usage Example

    from mbapy.base import rand_choose
    from mbapy.file import convert_pdf_to_txt, read_json

    # RIS parse: fill missing DOIs with '' and pick one record
    ris = parse_ris('./data_tmp/savedrecs.ris', '')
    ris = rand_choose(ris)
    print(f'title: {ris["title"]}\ndoi: {ris["doi"]}')

    # search: PubMed with an email read from a local JSON file, then the default engine
    # search_result = search_by_baidu('linaclotide', 11)
    search_result = search_by_pubmed('linaclotide', read_json('./data_tmp/id.json')['edu_email'], 11)
    search_result2 = search(ris["title"])

    # download: first by title, then by DOI with an explicit file name
    dl_result = download_by_scihub('./data_tmp/', title = search_result[0]['title'])
    download_by_scihub('./data_tmp/', '10.1097/j.pain.0000000000001905', ris["title"], file_full_name = f'{ris["title"]:s}.pdf')

    # extract section
    # note: replace_invalid_path_chr is not imported in this snippet and must already be in scope
    pdf_path = replace_invalid_path_chr("./data_tmp/{:s}.pdf".format(ris["title"]))
    sections = get_english_part_of_bookmarks(get_section_bookmarks(pdf_path))
    paper, section = convert_pdf_to_txt(pdf_path), rand_choose(sections, 0)
    print(sections, section, get_section_from_paper(paper, section, keys=sections))
