PubChem's Official API: A Tutorial on Using PUG REST (continuously updated)
The URL-based API is made up of four parts:
- input specification: the input
- operation specification: the operation
- output specification: the output
- ?<operation_options>: operation options (appended after ? as URL parameters)
The general format is:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/<input specification>/<operation specification>/[<output specification>][?<operation_options>]
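The four parts of the template can be joined programmatically. The following is a minimal Python sketch under stated assumptions: the helper name build_pug_url and its parameters are made up for this illustration and are not part of PUG REST itself.

```python
from urllib.parse import urlencode

# URL prefix taken from the template above.
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pug_url(input_spec, operation=None, output=None, options=None):
    """Join input, operation, output and options into one URL (hypothetical helper)."""
    parts = [BASE, input_spec.strip("/")]
    if operation:
        parts.append(operation.strip("/"))
    if output:
        parts.append(output)
    url = "/".join(parts)
    if options:
        # safe="," keeps comma-separated lists readable, e.g. sid=1,2,3
        url += "?" + urlencode(options, safe=",")
    return url


print(build_pug_url("compound/cid/2244", "property/MolecularFormula", "JSON"))
```

Leaving operation as None produces the shorter form used for full records, such as compound/cid/2244/SDF.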
Input
The input is in turn made up of three parameters:
<input specification> = <domain>/<namespace>/<identifiers>
<domain> = substance | compound | assay | <other inputs>
compound domain <namespace> = cid | name | smiles | inchi | sdf | inchikey | formula | <structure search> | <xref> | listkey | <fast search>
<structure search> = {substructure | superstructure | similarity | identity}/{smiles | inchi | sdf | cid}
<fast search> = {fastidentity | fastsimilarity_2d | fastsimilarity_3d | fastsubstructure | fastsuperstructure}/{smiles | smarts | inchi | sdf | cid} | fastformula
<xref> = xref / {RegistryID | RN | PubMedID | MMDBID | ProteinGI | NucleotideGI | TaxonomyID | MIMID | GeneID | ProbeID | PatentID}
substance domain <namespace> = sid | sourceid/<source id> | sourceall/<source name> | name | <xref> | listkey
<source name> = any valid PubChem depositor name
assay domain <namespace> = aid | listkey | type/<assay type> | sourceall/<source name> | target/<assay target> | activity/<activity column name>
<assay type> = all | confirmatory | doseresponse | onhold | panel | rnai | screening | summary | cellbased | biochemical | invivo | invitro | activeconcentrationspecified
<assay target> = gi | proteinname | geneid | genesymbol | accession
<identifiers> = comma-separated list of positive integers (e.g. cid, sid, aid) or identifier strings (source, inchikey, formula); in some cases only a single identifier string (name, smiles, xref; inchi, sdf by POST only)
<other inputs> = sources / [substance, assay] |sourcetable | conformers | annotations/[sourcename/<source name> | heading/<heading>]
Example (the input for the compound with CID 2244):
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/<operation specification>/[<output specification>]
Besides classifying inputs by parameter as above, they can also be classified by input method:
By Identifier
Search by a specific ID:
assay->aid
compound->cid
substance->sid
A comma-separated list of IDs can also be used, as in substance/sid/1,2,3,4,5.
By Name
Search by name. A partial name also works, but in that case the search type must be narrowed to single-word matching with name_type=word, for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/myxalamid/cids/XML?name_type=word
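Names often contain spaces or other characters that are unsafe in a URL path, so it is worth percent-encoding them before building the request. A small sketch follows; the function name name_to_cids_url and its partial flag are assumptions made for this example.

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cids_url(name, partial=False, output="TXT"):
    """Build a name -> CIDs lookup URL (hypothetical helper)."""
    # safe="" also encodes "/" so a name can never add extra path segments.
    url = f"{BASE}/compound/name/{quote(name, safe='')}/cids/{output}"
    if partial:
        # name_type=word asks for single-word (partial name) matching.
        url += "?name_type=word"
    return url


print(name_to_cids_url("myxalamid", partial=True, output="XML"))
```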
By Structure Identity
Search using a molecular structure descriptor such as SDF, SMILES, or InChI as the input, for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CCCC/cids/TXT
By Structure Search
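Run a structure search; the available search types are substructure, superstructure, similarity, and identity. Because the query is matched against the millions of molecules in the whole PubChem database, it takes a long time, so the response returns a "ListKey" rather than the results themselves, for example: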
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/substructure/smiles/C1CCCCCC1/XML
This "ListKey" can then be used in follow-up requests to fetch the results, for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/12345678910/cids/TXT
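The two-step flow (start the search, then poll the ListKey) can be sketched as below. This is an illustration under assumptions, not a definitive client: it assumes the JSON reply for a pending search has the shape {"Waiting": {"ListKey": ...}} and that a finished CID list arrives as {"IdentifierList": {"CID": [...]}}. The function names, the polling delay, and the poll limit are hypothetical.

```python
import json
import time
from urllib.parse import quote
from urllib.request import urlopen

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def extract_listkey(payload):
    """Return the ListKey of a pending search, or None if no 'Waiting' block is present."""
    return payload.get("Waiting", {}).get("ListKey")


def substructure_cids(smiles, delay=2.0, max_polls=30):
    """Start a substructure search and poll until CIDs arrive (performs network calls)."""
    url = f"{BASE}/compound/substructure/smiles/{quote(smiles, safe='')}/JSON"
    for _ in range(max_polls):
        with urlopen(url) as resp:
            payload = json.load(resp)
        key = extract_listkey(payload)
        if key is None:
            # Assumed shape of a finished result.
            return payload["IdentifierList"]["CID"]
        # Still running: wait, then ask for the CIDs stored behind the ListKey.
        time.sleep(delay)
        url = f"{BASE}/compound/listkey/{key}/cids/JSON"
    raise TimeoutError("structure search did not finish in time")
```

Only extract_listkey is pure; substructure_cids contacts the server and should be called sparingly.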
By Fast (Synchronous) Structure Search
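Because the operation above is so slow, faster structure search methods were developed. These searches are synchronous: they do not produce a "ListKey" but return the results immediately in a single call. The request has the same shape as an ordinary structure search, with one of the fast namespaces (fastidentity, fastsimilarity_2d, fastsimilarity_3d, fastsubstructure, fastsuperstructure) in place of the asynchronous one.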
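A fast search URL can be put together from the <fast search> grammar given in the Input section. The sketch below is illustrative only: fast_search_url is a made-up helper, and it accepts just the input types that fit in a URL path, since the identifier notes above say inchi and sdf are sent by POST.

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Namespaces copied from the <fast search> grammar.
FAST_SEARCHES = {
    "fastidentity", "fastsimilarity_2d", "fastsimilarity_3d",
    "fastsubstructure", "fastsuperstructure",
}
# inchi and sdf are left out here because they are sent by POST.
PATH_INPUTS = {"smiles", "smarts", "cid"}


def fast_search_url(search, input_type, query, output="TXT"):
    """Build a synchronous structure-search URL that returns CIDs (hypothetical helper)."""
    if search not in FAST_SEARCHES:
        raise ValueError(f"unknown fast search: {search}")
    if input_type not in PATH_INPUTS:
        raise ValueError(f"unsupported input type for a URL path: {input_type}")
    encoded = quote(str(query), safe="")
    return f"{BASE}/compound/{search}/{input_type}/{encoded}/cids/{output}"


print(fast_search_url("fastsimilarity_2d", "cid", 2244, "XML"))
```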
By Cross-Reference (XRef)
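Here the input is not a PubChem identifier but an identifier from another database. The accepted cross-reference types are listed in the table below: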
Cross-reference | Meaning |
---|---|
RegistryID | external registry identifier |
RN | registry number |
PubMedID | NCBI PubMed identifier |
MMDBID | NCBI MMDB identifier |
DBURL | external database home page URL |
SBURL | external database substance URL |
ProteinGI | NCBI protein GI |
NucleotideGI | NCBI nucleotide GI |
TaxonomyID | NCBI taxonomy identifier |
MIMID | NCBI MIM identifier |
GeneID | NCBI gene identifier |
ProbeID | NCBI probe identifier |
PatentID | patent identifier |
SourceName | external depositor name |
SourceCategory | depositor category(ies) |
It is used as follows:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/xref/PatentID/US20050159403A1/sids/JSON
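As a rough sketch, a cross-reference lookup can be validated against the input <xref> grammar before the request is sent. The helper xref_url is hypothetical; the set of accepted types is copied from the <xref> line in the Input section.

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Types accepted as input, copied from the <xref> grammar.
XREF_INPUT_TYPES = {
    "RegistryID", "RN", "PubMedID", "MMDBID", "ProteinGI", "NucleotideGI",
    "TaxonomyID", "MIMID", "GeneID", "ProbeID", "PatentID",
}


def xref_url(domain, xref_type, value, operation, output):
    """Build a lookup by external identifier (hypothetical helper)."""
    if xref_type not in XREF_INPUT_TYPES:
        raise ValueError(f"not an input xref type: {xref_type}")
    return f"{BASE}/{domain}/xref/{xref_type}/{quote(str(value), safe='')}/{operation}/{output}"


print(xref_url("substance", "PatentID", "US20050159403A1", "sids", "JSON"))
```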
Operation
(In other words, the operation selects which data to retrieve for the given input.)
compound domain <operation specification> = record | <compound property> | synonyms | sids | cids | aids | assaysummary | classification | <xrefs> | description | conformers
<compound property> = property / [comma-separated list of property tags]
substance domain <operation specification> = record | synonyms | sids | cids | aids | assaysummary | classification | <xrefs> | description
<xrefs> = xrefs / [comma-separated list of xrefs tags]
assay domain <operation specification> = record | concise | aids | sids | cids | description | targets/<target type> | <doseresponse> | summary | classification
target_type = {ProteinGI, ProteinName, GeneID, GeneSymbol}
<doseresponse> = doseresponse/sid
The data that can be retrieved are as follows:
Available Data
Full Records
If no operation is specified, the full record for the input is returned by default. The applicable output formats are ASN.1 (NCBI's native format), XML, SDF, and even JSON, for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/SDF
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/record/XML
Records for several identifiers can also be fetched at once (as a comma-separated list), for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sid/1,2,3,4,5/SDF
Images
To get an image of a molecule, omit the operation and request PNG output. Only one image can be returned per request, for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/lipitor/PNG
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CCCCC=O/PNG
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/RZJQGNCSTQAWON-UHFFFAOYSA-N/PNG
Compound Properties
All compound properties are listed in the table below:
Property | Notes |
---|---|
MolecularFormula | |
MolecularWeight | The molecular weight is the sum of all atomic weights of the constituent atoms in a compound, measured in g/mol. In the absence of explicit isotope labelling, averaged natural abundance is assumed. If an atom bears an explicit isotope label, 100% isotopic purity is assumed at this location. |
CanonicalSMILES | Canonical SMILES (Simplified Molecular Input Line Entry System) string. It is a unique SMILES string of a compound, generated by a “canonicalization” algorithm. |
IsomericSMILES | Isomeric SMILES string. It is a SMILES string with stereochemical and isotopic specifications. |
InChI | Standard IUPAC International Chemical Identifier (InChI). It does not allow for user selectable options in dealing with the stereochemistry and tautomer layers of the InChI string. |
InChIKey | Hashed version of the full standard InChI, consisting of 27 characters. |
IUPACName | Chemical name systematically determined according to the IUPAC nomenclatures. |
XLogP | Computationally generated octanol-water partition coefficient or distribution coefficient. XLogP is used as a measure of hydrophilicity or hydrophobicity of a molecule. |
ExactMass | The mass of the most likely isotopic composition for a single molecule, corresponding to the most intense ion/molecule peak in a mass spectrum. |
MonoisotopicMass | The mass of a molecule, calculated using the mass of the most abundant isotope of each element. |
TPSA | Topological polar surface area, computed by the algorithm described in the paper by Ertl et al. |
Complexity | The molecular complexity rating of a compound, computed using the Bertz/Hendrickson/Ihlenfeldt formula. |
Charge | The total (or net) charge of a molecule. |
HBondDonorCount | Number of hydrogen-bond donors in the structure. |
HBondAcceptorCount | Number of hydrogen-bond acceptors in the structure. |
RotatableBondCount | Number of rotatable bonds. |
HeavyAtomCount | Number of non-hydrogen atoms. |
IsotopeAtomCount | Number of atoms with enriched isotope(s) |
AtomStereoCount | Total number of atoms with tetrahedral (sp3) stereo [e.g., (R)- or (S)-configuration] |
DefinedAtomStereoCount | Number of atoms with defined tetrahedral (sp3) stereo. |
UndefinedAtomStereoCount | Number of atoms with undefined tetrahedral (sp3) stereo. |
BondStereoCount | Total number of bonds with planar (sp2) stereo [e.g., (E)- or (Z)-configuration]. |
DefinedBondStereoCount | Number of bonds with defined planar (sp2) stereo. |
UndefinedBondStereoCount | Number of bonds with undefined planar (sp2) stereo. |
CovalentUnitCount | Number of covalently bound units. |
Volume3D | Analytic volume of the first diverse conformer (default conformer) for a compound. |
XStericQuadrupole3D | The x component of the quadrupole moment (Qx) of the first diverse conformer (default conformer) for a compound. |
YStericQuadrupole3D | The y component of the quadrupole moment (Qy) of the first diverse conformer (default conformer) for a compound. |
ZStericQuadrupole3D | The z component of the quadrupole moment (Qz) of the first diverse conformer (default conformer) for a compound. |
FeatureCount3D | Total number of 3D features (the sum of FeatureAcceptorCount3D, FeatureDonorCount3D, FeatureAnionCount3D, FeatureCationCount3D, FeatureRingCount3D and FeatureHydrophobeCount3D) |
FeatureAcceptorCount3D | Number of hydrogen-bond acceptors of a conformer. |
FeatureDonorCount3D | Number of hydrogen-bond donors of a conformer. |
FeatureAnionCount3D | Number of anionic centers (at pH 7) of a conformer. |
FeatureCationCount3D | Number of cationic centers (at pH 7) of a conformer. |
FeatureRingCount3D | Number of rings of a conformer. |
FeatureHydrophobeCount3D | Number of hydrophobes of a conformer. |
ConformerModelRMSD3D | Conformer sampling RMSD in Å. |
EffectiveRotorCount3D | Effective number of rotors (rotatable bonds, with an allowance for ring flexibility), used as a measure of conformational flexibility. |
ConformerCount3D | The number of conformers in the conformer model for a compound. |
Fingerprint2D | Base64-encoded PubChem Substructure Fingerprint of a molecule. |
Usage examples (full records and images for substances, compounds, and assays):
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sourceid/DTP.NCI/747285/SDF
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sourceid/DTP.NCI/747285/PNG
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/SDF
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/PNG
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/SDF?record_type=3d
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/PNG?record_type=3d&image_size=small
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/SDF
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/BPGDAMSIGCZZLK-UHFFFAOYSA-N/SDF
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/XML
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/CSV?sid=26736081,26736082,26736083
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/concise/CSV
Likewise, several properties can be retrieved in a single request by giving a comma-separated list of property tags.
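A request for several properties of several compounds can be sketched as follows. The helper names are invented for this example, and the JSON layout {"PropertyTable": {"Properties": [...]}} is an assumption about the response shape that should be checked against a real reply.

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cids, properties, output="JSON"):
    """Build a property request for one or more CIDs (hypothetical helper)."""
    cid_list = ",".join(str(cid) for cid in cids)
    tag_list = ",".join(properties)
    return f"{BASE}/compound/cid/{cid_list}/property/{tag_list}/{output}"


def properties_by_cid(payload):
    """Index the rows of an assumed PropertyTable JSON payload by CID."""
    rows = payload["PropertyTable"]["Properties"]
    return {row["CID"]: row for row in rows}


print(property_url([2244, 1000], ["MolecularFormula", "MolecularWeight", "InChIKey"]))
```

Property tags must match the names in the table above exactly, including capitalization.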
Synonyms
Get all the names of a compound or substance:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/vioxx/synonyms/XML
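When the same request is made with JSON output, the synonyms can be collected per compound. The sketch below assumes the reply has the shape {"InformationList": {"Information": [{"CID": ..., "Synonym": [...]}]}}. Both that layout and the function name are assumptions for illustration, and substance replies would presumably be keyed by SID instead.

```python
def synonyms_by_cid(payload):
    """Map each CID in an assumed InformationList payload to its list of synonyms."""
    result = {}
    for info in payload["InformationList"]["Information"]:
        # Records without names are given an empty list rather than skipped.
        result[info["CID"]] = info.get("Synonym", [])
    return result
```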
Cross-References (XRefs)
Retrieve identifiers from other databases; the available cross-reference types were listed above:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/xrefs/MMDBID/XML
BioAssays
In PubChem, an assay has two parts: the Assay Description and the Assay Data.
The former includes authorship, general description, protocol, and definitions of the data readout columns.
The latter contains the actual data from the experiment.
Assay Description
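To retrieve only the descriptive content of an assay, use the description operation.
Valid output formats are XML, JSON(P), and ASNT/B, for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/description/XML
There is also a summary form, which carries a less detailed description than the one above but includes the associated targets and counts of active and inactive SIDs and CIDs, for example: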
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/summary/JSON
Assay Data
The data that can be retrieved for an assay are as follows:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/CSV
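An assay can involve many SIDs (substances). To get results only for particular substances, specify them in the request.
First, look up which SIDs an assay involves: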
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/sids/TXT
Then specify the SIDs of interest:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/XML?sid=104169547,109967232
Assay Targets
Get information about the targets an assay acts on.
Valid output formats are XML, JSON(P), ASNT/B, and TXT. The valid target types are listed in the table below:
Target Type | Notes |
---|---|
ProteinGI | NCBI GI of a protein sequence |
ProteinName | protein name |
GeneID | NCBI Gene database identifier |
GeneSymbol | gene symbol |
Note that not all assays have a defined protein or gene target.
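Following the targets/<target type> form in the assay operation grammar, a targets request might be built like this. It is a hedged sketch: assay_targets_url is a made-up helper, and joining several target types with commas is an assumption rather than something the grammar line states explicitly.

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Target types from the table above.
TARGET_TYPES = ("ProteinGI", "ProteinName", "GeneID", "GeneSymbol")


def assay_targets_url(aids, target_types=TARGET_TYPES, output="XML"):
    """Build a request for the targets of one or more assays (hypothetical helper)."""
    unknown = set(target_types) - set(TARGET_TYPES)
    if unknown:
        raise ValueError(f"unknown target types: {sorted(unknown)}")
    aid_list = ",".join(str(aid) for aid in aids)
    # Assumption: several target types can be requested as a comma-separated list.
    return f"{BASE}/assay/aid/{aid_list}/targets/{','.join(target_types)}/{output}"


print(assay_targets_url([490, 1000]))
```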
Conversely, assays can be looked up by a specific target. For example, to find the assays that have been run against the gene target USP2:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/target/genesymbol/USP2/aids/TXT
Activity Name
Search by the name of an activity value measured in an assay. For example, to find all assays that report an EC50:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/activity/EC50/aids/JSON
Get the assay summary records for a compound or substance:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1000,1001/assaysummary/CSV
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sid/104234342/assaysummary/XML
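Get the dose-response results for a substance or compound:
A single AID (assay) returns dose-response data for at most 1000 SIDs. Valid output formats are XML, JSON(P), ASNT/B, and CSV. The available options are: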
Option | Allowed Values | Meaning |
---|---|---|
sid | listkey, or comma-separated integers | SID rows to retrieve for an assay |
listkey | valid SID listkey | listkey containing SIDs, if using sid=listkey |
Examples:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/doseresponse/XML
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/doseresponse/CSV?sid=104169547,109967232
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/doseresponse/XML (with "aid=504526&sid=104169547,109967232" in the POST body)
A listkey holding SIDs, obtained from an earlier request made with list_return=listkey, can also be supplied:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/602332/doseresponse/CSV?sid=listkey&listkey=xxxxxx&listkey_count=100 (where 'xxxxxx' is the listkey returned by that earlier request)
Output
Choose the output format:
<output specification> = XML | ASNT | ASNB | JSON | JSONP [ ?callback=<callback name> ] | SDF | CSV | PNG | TXT
operation_options
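In fact, the "ListKey" is not limited to structure searches. If the output is a large list, operation_options can be used to keep it on the server instead of returning it, as follows:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/sids/XML?list_return=listkey
The returned "ListKey" here was 1757602293094779987 (AID 640 is a very large assay covering many substances, i.e. SIDs).
The returned "ListKey" can then be read back, with listkey_start and listkey_count limiting how much is returned so that the request does not take too long. Using the "ListKey" above: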
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/CSV?sid=listkey&listkey=1757602293094779987&listkey_start=0&listkey_count=1000
The result contains 1000 rows of data.
As long as listkey_start and listkey_count are set sensibly, a "ListKey" can even serve as the input itself. Using the "ListKey" above as an example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/listkey/1757602293094779987/synonyms/XML?listkey_start=0&listkey_count=10
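Reading a large stored list in slices amounts to stepping listkey_start forward by listkey_count each time. A minimal sketch follows; the helper names and the default page size are assumptions made for illustration.

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def page_windows(total, page_size=1000):
    """Yield (listkey_start, listkey_count) pairs that together cover `total` records."""
    for start in range(0, total, page_size):
        yield start, min(page_size, total - start)


def synonyms_page_url(listkey, start, count, output="XML"):
    """Build the URL for one slice of a substance ListKey (hypothetical helper)."""
    return (
        f"{BASE}/substance/listkey/{listkey}/synonyms/{output}"
        f"?listkey_start={start}&listkey_count={count}"
    )


for start, count in page_windows(25, page_size=10):
    print(synonyms_page_url("1757602293094779987", start, count))
```

The total number of records has to be known beforehand, for instance from the length of the SID list behind the ListKey.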
In summary, these choices can be combined in many ways to retrieve specific data from PubChem, for example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/property/MolecularFormula/JSON
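All of the above are examples of retrieving information through the URL-based API.
Not every request returns a result, however. In some cases an error occurs and the system returns an error message, which we need to be able to read in order to find out what went wrong:
invalid input: the input is not valid
nothing was found for the given query: nothing matches the given input
the request was too broad and took too long to complete: the data set is too large and the search cannot finish in time (the maximum time for a PUG REST web service request is 30 s)
A specific status code is also returned. The codes and their meanings are listed in the table below: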
HTTP Status | Error Code | General Error Category |
---|---|---|
200 | (none) | Success |
202 | (none) | Accepted (asynchronous operation pending) |
400 | PUGREST.BadRequest | Request is improperly formed (syntax error in the URL, POST body, etc.) |
404 | PUGREST.NotFound | The input record was not found (e.g. invalid CID) |
405 | PUGREST.NotAllowed | Request not allowed (such as invalid MIME type in the HTTP Accept header) |
504 | PUGREST.Timeout | The request timed out, from server overload or too broad a request |
503 | PUGREST.ServerBusy | Too many requests or server is busy, retry later |
501 | PUGREST.Unimplemented | The requested operation has not (yet) been implemented by the server |
500 | PUGREST.ServerError | Some problem on the server side (such as a database server down, etc.) |
500 | PUGREST.Unknown | An unknown error occurred |
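The table can be turned into a small lookup so that a client reports the PUG REST error category and retries only where that may help. This is a sketch under assumptions: the function names are invented, and treating 503 and 504 as retryable is a design choice for illustration, since a 504 caused by too broad a query will not succeed on retry.

```python
import time
from urllib.error import HTTPError
from urllib.request import urlopen

# Error codes copied from the table above (500 can also mean PUGREST.Unknown).
ERROR_CODES = {
    400: "PUGREST.BadRequest",
    404: "PUGREST.NotFound",
    405: "PUGREST.NotAllowed",
    500: "PUGREST.ServerError",
    501: "PUGREST.Unimplemented",
    503: "PUGREST.ServerBusy",
    504: "PUGREST.Timeout",
}
RETRYABLE = {503, 504}  # assumption: only load-related failures are retried


def error_code_for(status):
    """Return the PUG REST error code for an HTTP status, or None on success."""
    if status in (200, 202):
        return None
    return ERROR_CODES.get(status, "PUGREST.Unknown")


def fetch(url, retries=3, delay=1.0):
    """GET a URL, backing off on retryable errors (performs network calls)."""
    for attempt in range(retries + 1):
        try:
            with urlopen(url) as resp:
                return resp.read()
        except HTTPError as err:
            if err.code not in RETRYABLE or attempt == retries:
                raise RuntimeError(f"{err.code} {error_code_for(err.code)}") from err
            time.sleep(delay * 2 ** attempt)  # exponential backoff
```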