A tutorial on PUG REST, PubChem's official API (continuously updated)


贺俊宏


贺俊宏 · Published 2019-08-20 09:50:40

A request to this URL-based API is made up of four parts:

input specification: what to look up
operation specification: what to do with it
output specification: the format of the result
?<operation_options>: operation options (appended after ? as URL parameters)

The general format is:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/<input specification>/<operation specification>/[<output specification>][?<operation_options>]

Input

The input specification is itself made up of three parts:

<input specification> = <domain>/<namespace>/<identifiers>

<domain> = substance | compound | assay | <other inputs>

<structure search> = {substructure | superstructure | similarity | identity}/{smiles | inchi | sdf | cid}

<fast search> = {fastidentity | fastsimilarity_2d | fastsimilarity_3d | fastsubstructure | fastsuperstructure}/{smiles | smarts | inchi | sdf | cid} | fastformula

<xref> = xref / {RegistryID | RN | PubMedID | MMDBID | ProteinGI | NucleotideGI | TaxonomyID | MIMID | GeneID | ProbeID | PatentID}

<source name> = any valid PubChem depositor name

<assay target> = gi | proteinname | geneid | genesymbol | accession

<identifiers> = comma-separated list of positive integers (e.g. cid, sid, aid) or identifier strings (source, inchikey, formula); in some cases only a single identifier string (name, smiles, xref; inchi, sdf by POST only)

<other inputs> = sources / [substance, assay] |sourcetable | conformers | annotations/[sourcename/<source name> | heading/<heading>]

Example (the input for the compound with CID 2244):

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/<operation specification>/[<output specification>]
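The way the parts fit together can be sketched in a few lines of Python. This is only an illustration: `build_url` is a hypothetical helper written for this post, not part of any PubChem library, and it does no validation of the domain, namespace, or operation it is given.

```python
from urllib.parse import quote, urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_url(domain, namespace, identifiers, operation="", output="JSON", options=None):
    """Assemble <input>/<operation>/<output>[?<options>] into one URL.

    Hypothetical helper for illustration only.
    """
    # A list of identifiers becomes a comma-separated string such as "1,2,3".
    if isinstance(identifiers, (list, tuple)):
        identifiers = ",".join(str(i) for i in identifiers)
    parts = [BASE, domain, namespace, quote(str(identifiers), safe=",")]
    if operation:  # an empty operation is simply left out of the path
        parts.append(operation)
    parts.append(output)
    url = "/".join(parts)
    if options:  # operation options are appended after "?"
        url += "?" + urlencode(options)
    return url


print(build_url("compound", "cid", 2244, "property/MolecularFormula"))
print(build_url("compound", "cid", [1, 2, 3, 4, 5],
                "property/MolecularFormula,MolecularWeight,CanonicalSMILES", "CSV"))
```

The two calls print a single-CID property URL and a multi-CID CSV URL, both built from the same three-part input.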

Besides the breakdown by parameter above, inputs can also be grouped by the way the input is given:

By Identifier

Look up a record by its ID:

assay->aid

compound->cid

substance->sid

A comma-separated list of IDs is also accepted, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/property/MolecularFormula,MolecularWeight,CanonicalSMILES/CSV

By Name

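Look up a record by name. A partial name also works, but in that case the name type has to be narrowed to single-word matching (name_type=word), for example: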

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/myxalamid/cids/XML?name_type=word
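Names often contain spaces or other characters that are not safe in a URL path, so a small sketch of percent-encoding may help here. `name_to_cids_url` is a hypothetical helper, and "acetylsalicylic acid" is just a convenient multi-word name chosen for the illustration.

```python
from urllib.parse import quote, urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cids_url(name, name_type=None, output="XML"):
    """Build a name -> CIDs lookup URL (hypothetical helper)."""
    # Percent-encode the name so that, e.g., a space becomes %20.
    url = f"{BASE}/compound/name/{quote(name, safe='')}/cids/{output}"
    if name_type:  # e.g. "word" for partial-name matching
        url += "?" + urlencode({"name_type": name_type})
    return url


print(name_to_cids_url("myxalamid", name_type="word"))
print(name_to_cids_url("acetylsalicylic acid", output="TXT"))
```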

By Structure Identity

Look up a record by a structure descriptor such as SDF, SMILES, or InChI, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CCCC/cids/TXT

By Structure Search

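Structure search comes in four forms (substructure, superstructure, similarity, identity). Because the query is matched against the millions of molecules in the whole PubChem database, it takes a long time, so the response contains a "ListKey", for example: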

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/substructure/smiles/C1CCCCCC1/XML

This "ListKey" can then be used in a follow-up request to fetch the results, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/12345678910/cids/TXT
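The submit-then-poll pattern might look roughly like the sketch below. It rests on several assumptions: that the search is requested with JSON output, that a still-running search answers with a `Waiting` object carrying the `ListKey`, and that a finished one answers with `IdentifierList` / `CID`. The helper names, the delay, and the polling budget are all made up for this example.

```python
import json
import time
from urllib.request import urlopen

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def extract_listkey(payload):
    """Return the ListKey from a 'Waiting' response, or None if there is none."""
    return payload.get("Waiting", {}).get("ListKey")


def listkey_url(listkey, operation="cids", output="JSON"):
    """URL that asks for the results stored under a ListKey."""
    return f"{BASE}/compound/listkey/{listkey}/{operation}/{output}"


def wait_for_cids(search_url, delay=2.0, max_polls=30):
    """Submit an asynchronous search and poll until the CIDs are available."""
    with urlopen(search_url) as resp:
        payload = json.load(resp)
    for _ in range(max_polls):
        key = extract_listkey(payload)
        if key is None:  # no longer waiting, so the results should be in
            return payload.get("IdentifierList", {}).get("CID", [])
        time.sleep(delay)  # pause between polls
        with urlopen(listkey_url(key)) as resp:
            payload = json.load(resp)
    raise TimeoutError("search did not finish within the polling budget")
```

A call such as `wait_for_cids(BASE + "/compound/substructure/smiles/C1CCCCCC1/JSON")` would then return the matching CIDs once the search completes.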

By Fast (Synchronous) Structure Search

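Because the search above is slow, a fast structure search was developed. It runs synchronously: instead of producing a "ListKey", a single call returns the results directly. It is used as follows: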

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastidentity/cid/5793/cids/TXT?identity_type=same_connectivity

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsubstructure/cid/2244/cids/XML?StripHydrogen=true

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/cid/2244/property/MolecularWeight,MolecularFormula,RotatableBondCount/XML?Threshold=99


By Cross-Reference (XRef)

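Here the input is not a PubChem identifier but an identifier from another database. The accepted cross-reference types are listed in the table below: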

Cross-reference	Meaning
RegistryID	external registry identifier
RN	registry number
PubMedID	NCBI PubMed identifier
MMDBID	NCBI MMDB identifier
DBURL	external database home page URL
SBURL	external database substance URL
ProteinGI	NCBI protein GI
NucleotideGI	NCBI nucleotide GI
TaxonomyID	NCBI taxonomy identifier
MIMID	NCBI MIM identifier
GeneID	NCBI gene identifier
ProbeID	NCBI probe identifier
PatentID	patent identifier
SourceName	external depositor name
SourceCategory	depositor category(ies)

It is used as follows:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/xref/PatentID/US20050159403A1/sids/JSON

Operation

(In effect, the operation selects which data to retrieve for the given input.)

<compound property> = property / [comma-separated list of property tags]

<xrefs> = xrefs / [comma-separated list of xrefs tags]

target_type = {ProteinGI, ProteinName, GeneID, GeneSymbol}

<doseresponse> = doseresponse/sid

The data that can be retrieved are as follows:

Available Data

Full Records

When no operation is specified, the full record for the given input is returned by default. The supported output formats are ASN.1 (NCBI's native format), XML, SDF, and also JSON, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/SDF

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/record/XML

Records for several identifiers can be fetched at once (as a comma-separated list), for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sid/1,2,3,4,5/SDF

Images

To get an image of a molecule, leave out the operation and request PNG output. Only one image can be returned per request. For example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/lipitor/PNG

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CCCCC=O/PNG

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/RZJQGNCSTQAWON-UHFFFAOYSA-N/PNG


Compound Properties

All compound properties are listed in the table below:

Property	Notes
MolecularFormula	Molecular formula.
MolecularWeight	The molecular weight is the sum of all atomic weights of the constituent atoms in a compound, measured in g/mol. In the absence of explicit isotope labelling, averaged natural abundance is assumed. If an atom bears an explicit isotope label, 100% isotopic purity is assumed at this location.
CanonicalSMILES	Canonical SMILES (Simplified Molecular Input Line Entry System) string. It is a unique SMILES string of a compound, generated by a “canonicalization” algorithm.
IsomericSMILES	Isomeric SMILES string. It is a SMILES string with stereochemical and isotopic specifications.
InChI	Standard IUPAC International Chemical Identifier (InChI). It does not allow for user selectable options in dealing with the stereochemistry and tautomer layers of the InChI string.
InChIKey	Hashed version of the full standard InChI, consisting of 27 characters.
IUPACName	Chemical name systematically determined according to the IUPAC nomenclatures.
XLogP	Computationally generated octanol-water partition coefficient or distribution coefficient. XLogP is used as a measure of hydrophilicity or hydrophobicity of a molecule.
ExactMass	The mass of the most likely isotopic composition for a single molecule, corresponding to the most intense ion/molecule peak in a mass spectrum.
MonoisotopicMass	The mass of a molecule, calculated using the mass of the most abundant isotope of each element.
TPSA	Topological polar surface area, computed by the algorithm described in the paper by Ertl et al.
Complexity	The molecular complexity rating of a compound, computed using the Bertz/Hendrickson/Ihlenfeldt formula.
Charge	The total (or net) charge of a molecule.
HBondDonorCount	Number of hydrogen-bond donors in the structure.
HBondAcceptorCount	Number of hydrogen-bond acceptors in the structure.
RotatableBondCount	Number of rotatable bonds.
HeavyAtomCount	Number of non-hydrogen atoms.
IsotopeAtomCount	Number of atoms with enriched isotope(s)
AtomStereoCount	Total number of atoms with tetrahedral (sp3) stereo [e.g., (R)- or (S)-configuration]
DefinedAtomStereoCount	Number of atoms with defined tetrahedral (sp3) stereo.
UndefinedAtomStereoCount	Number of atoms with undefined tetrahedral (sp3) stereo.
BondStereoCount	Total number of bonds with planar (sp2) stereo [e.g., (E)- or (Z)-configuration].
DefinedBondStereoCount	Number of bonds with defined planar (sp2) stereo.
UndefinedBondStereoCount	Number of bonds with undefined planar (sp2) stereo.
CovalentUnitCount	Number of covalently bound units.
Volume3D	Analytic volume of the first diverse conformer (default conformer) for a compound.
XStericQuadrupole3D	The x component of the quadrupole moment (Qx) of the first diverse conformer (default conformer) for a compound.
YStericQuadrupole3D	The y component of the quadrupole moment (Qy) of the first diverse conformer (default conformer) for a compound.
ZStericQuadrupole3D	The z component of the quadrupole moment (Qz) of the first diverse conformer (default conformer) for a compound.
FeatureCount3D	Total number of 3D features (the sum of FeatureAcceptorCount3D, FeatureDonorCount3D, FeatureAnionCount3D, FeatureCationCount3D, FeatureRingCount3D and FeatureHydrophobeCount3D)
FeatureAcceptorCount3D	Number of hydrogen-bond acceptors of a conformer.
FeatureDonorCount3D	Number of hydrogen-bond donors of a conformer.
FeatureAnionCount3D	Number of anionic centers (at pH 7) of a conformer.
FeatureCationCount3D	Number of cationic centers (at pH 7) of a conformer.
FeatureRingCount3D	Number of rings of a conformer.
FeatureHydrophobeCount3D	Number of hydrophobes of a conformer.
ConformerModelRMSD3D	Conformer sampling RMSD in Å.
EffectiveRotorCount3D	Effective number of rotors, a measure of the conformational flexibility of a compound.
ConformerCount3D	The number of conformers in the conformer model for a compound.
Fingerprint2D	Base64-encoded PubChem Substructure Fingerprint of a molecule.

Usage examples:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sourceid/IBM/5F1CA2B314D35F28C7F94168627B29E3/ASNT

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sourceid/DTP.NCI/747285/SDF

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sourceid/DTP.NCI/747285/PNG

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/SDF

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/PNG

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/SDF?record_type=3d

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/PNG?record_type=3d&image_size=small

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/SDF

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/BPGDAMSIGCZZLK-UHFFFAOYSA-N/SDF

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/XML

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/CSV?sid=26736081,26736082,26736083

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/concise/CSV

Likewise, several properties can be fetched in a single request, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/property/MolecularWeight,MolecularFormula,HBondDonorCount,HBondAcceptorCount,InChIKey,InChI/CSV


Synonyms

Get all the names of a compound or substance:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/vioxx/synonyms/XML

Cross-References (XRefs)

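Retrieve the identifiers a record has in other databases. The available cross-reference types are the ones listed in the table above: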

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/xrefs/MMDBID/XML

BioAssays

In PubChem an assay has two parts: the Assay Description and the Assay Data.

The former includes authorship, general description, protocol, and definitions of the data readout columns.

The latter contains the actual data measured in the experiment.

Assay Description

To fetch only the descriptive part of an assay, use the description operation.

The valid output formats are XML, JSON(P), and ASNT/B, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/description/XML

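There is also a summary form. It is less detailed than the description above, but it includes the related targets and counts of active and inactive SIDs and CIDs, for example: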

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/summary/JSON


Assay Data

The data of an assay can be retrieved as follows:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/CSV


An assay can involve many SIDs (substances). To get the results for particular substances only, name them explicitly.

First, look up which SIDs an assay involves:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/sids/TXT

Then pass the SIDs of interest:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/XML?sid=104169547,109967232

Assay Targets

Get information about the targets an assay acts on.

The valid output formats are XML, JSON(P), ASNT/B, and TXT. The valid target types are listed in the table below:

Target Type	Notes
ProteinGI	NCBI GI of a protein sequence
ProteinName	protein name
GeneID	NCBI Gene database identifier
GeneSymbol	gene symbol

Example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/490,1000/targets/ProteinGI,ProteinName,GeneID,GeneSymbol/XML


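Not every assay has a defined protein or gene target.

The lookup also works in the other direction, starting from a target. For example, to list the assays run against the gene target USP2: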

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/target/genesymbol/USP2/aids/TXT


Activity Name

Search by the name of an activity measured in the assay. For example, to find all assays that report an EC50:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/activity/EC50/aids/JSON

Get the assay summary for a compound or substance:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1000,1001/assaysummary/CSV

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sid/104234342/assaysummary/XML


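Get the dose-response results for a substance or compound:

A single AID (assay) returns dose-response data for at most 1000 SIDs. The valid output formats are XML, JSON(P), ASNT/B, and CSV: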

Option	Allowed Values	Meaning
sid	listkey, or comma-separated integers	SID rows to retrieve for an assay
listkey	valid SID listkey	listkey containing SIDs, if using sid=listkey

Examples:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/doseresponse/XML

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/doseresponse/CSV?sid=104169547,109967232

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/doseresponse/XML (with “aid=504526&sid=104169547,109967232” in the POST body)

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/602332/sids/XML?sids_type=doseresponse&list_return=listkey

followed by

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/602332/doseresponse/CSV?sid=listkey&listkey=xxxxxx&listkey_count=100 (where ‘xxxxxx’ is the listkey returned by the previous URL)
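The POST variant in the examples above could be prepared as in this minimal sketch. It only builds the request object and does not send it. `doseresponse_post` is a hypothetical helper, and the form-encoded content type is my assumption about what the server expects for a body of this shape.

```python
from urllib.parse import urlencode
from urllib.request import Request


def doseresponse_post(aid, sids, output="XML"):
    """Build (but do not send) a POST request for dose-response data."""
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/doseresponse/{output}"
    # safe="," keeps the SID list as "a,b" instead of percent-encoding the comma.
    body = urlencode({"aid": aid, "sid": ",".join(map(str, sids))}, safe=",")
    return Request(
        url,
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = doseresponse_post(504526, [104169547, 109967232])
print(req.get_method(), req.full_url)
print(req.data.decode())
```

Passing `req` to `urllib.request.urlopen` would actually submit it.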

Output

Choose the output format:

<output specification> = XML | ASNT | ASNB | JSON | JSONP [ ?callback=<callback name> ] | SDF | CSV | PNG | TXT

operation_options

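"ListKey" is not limited to structure searches. When the requested output is a large list, an operation option can keep it on the server instead of returning it, as follows:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/sids/XML?list_return=listkey

The "ListKey" returned here was 1757602293094779987 (AID 640 is a very large assay involving many substances (SIDs)).

The stored list can then be read back through the "ListKey", with listkey_start and listkey_count limiting how much is returned so that the request does not run too long. Using the "ListKey" above: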

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/CSV?sid=listkey&listkey=1757602293094779987&listkey_start=0&listkey_count=1000

The response contains 1000 rows of data.
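To walk through a long list page by page, the two paging options can be stepped in a loop. The sketch below only generates the URLs. `listkey_pages` is a hypothetical helper, and the total of 2500 SIDs is an invented number used to show the last, shorter page; the real count would come from the list itself.

```python
from urllib.parse import urlencode


def listkey_pages(aid, listkey, total, page_size=1000, output="CSV"):
    """Yield one URL per page of the SIDs stored under a ListKey."""
    base = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/{aid}/{output}"
    for start in range(0, total, page_size):
        options = {
            "sid": "listkey",
            "listkey": listkey,
            "listkey_start": start,
            # the last page may hold fewer than page_size entries
            "listkey_count": min(page_size, total - start),
        }
        yield base + "?" + urlencode(options)


# total=2500 is a made-up figure for illustration only.
urls = list(listkey_pages(640, "1757602293094779987", total=2500))
for url in urls:
    print(url)
```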

With listkey_start and listkey_count set sensibly, a "ListKey" can even serve as the input itself. Using the "ListKey" above:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/listkey/1757602293094779987/synonyms/XML?listkey_start=0&listkey_count=10


To sum up, these options can be combined in many ways to pull specific data out of PubChem, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/property/MolecularFormula/JSON
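A JSON answer to a property request like this one can be unpacked as in the sketch below. The `PropertyTable` / `Properties` layout is how I understand the response to be structured, and the sample text is a hand-written stand-in for CID 2244 (aspirin, C9H8O4), not a captured response. `properties_by_cid` is a hypothetical helper.

```python
import json


def properties_by_cid(payload):
    """Map each CID to a dict of its returned properties."""
    rows = payload["PropertyTable"]["Properties"]
    return {
        row["CID"]: {key: value for key, value in row.items() if key != "CID"}
        for row in rows
    }


# Hand-written sample in the assumed response layout.
sample = json.loads(
    '{"PropertyTable": {"Properties": [{"CID": 2244, "MolecularFormula": "C9H8O4"}]}}'
)
result = properties_by_cid(sample)
print(result)
```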


All of the above are examples of retrieving information through the URL-based API.

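Not every request returns a result, though. In some cases an error occurs and the system returns an error message, which we need to be able to read to find out what went wrong:

invalid input: the input is not valid

nothing was found for the given query: nothing matches the given input

the request was too broad and took too long to complete: the data set is too large, so the search takes too long to finish (the maximum time for a PUG REST request is set to 30 s)

A specific code is returned. The codes and their meanings are listed in the table below: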

HTTP Status	Error Code	General Error Category
200	(none)	Success
202	(none)	Accepted (asynchronous operation pending)
400	PUGREST.BadRequest	Request is improperly formed (syntax error in the URL, POST body, etc.)
404	PUGREST.NotFound	The input record was not found (e.g. invalid CID)
405	PUGREST.NotAllowed	Request not allowed (such as invalid MIME type in the HTTP Accept header)
504	PUGREST.Timeout	The request timed out, from server overload or too broad a request
503	PUGREST.ServerBusy	Too many requests or server is busy, retry later
501	PUGREST.Unimplemented	The requested operation has not (yet) been implemented by the server
500	PUGREST.ServerError	Some problem on the server side (such as a database server down, etc.)
500	PUGREST.Unknown	An unknown error occurred
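In a script, the table above can be turned into a small lookup that decides what to do with a response. This is a sketch under my own assumptions: treating 503 and 504 as worth retrying is a policy I chose for the example, and `classify` and `fetch` are hypothetical helpers.

```python
import time
from urllib.error import HTTPError
from urllib.request import urlopen

# Error codes copied from the table above, keyed by HTTP status.
CATEGORIES = {
    200: "Success",
    202: "Accepted (asynchronous operation pending)",
    400: "PUGREST.BadRequest",
    404: "PUGREST.NotFound",
    405: "PUGREST.NotAllowed",
    500: "PUGREST.ServerError or PUGREST.Unknown",
    501: "PUGREST.Unimplemented",
    503: "PUGREST.ServerBusy",
    504: "PUGREST.Timeout",
}
RETRYABLE = {503, 504}  # assumption: busy and timeout may succeed on a later try


def classify(status):
    """Return (label, action), where action is 'ok', 'retry', or 'fail'."""
    label = CATEGORIES.get(status, "unlisted status")
    if status in (200, 202):
        return label, "ok"
    if status in RETRYABLE:
        return label, "retry"
    return label, "fail"


def fetch(url, attempts=3, delay=1.0):
    """Fetch a URL, retrying with a growing pause on retryable errors."""
    for attempt in range(attempts):
        try:
            with urlopen(url) as resp:
                return resp.read()
        except HTTPError as err:
            label, action = classify(err.code)
            if action != "retry" or attempt == attempts - 1:
                raise RuntimeError(f"{err.code} {label}") from err
            time.sleep(delay * 2 ** attempt)
```

With this in place, a 404 surfaces immediately as `RuntimeError("404 PUGREST.NotFound")`, while a 503 is retried a couple of times before giving up.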