A URL-based PUG REST request has four parts:

  • input specification: the input
  • operation specification: the operation
  • output specification: the output format
  • ?<operation_options>: operation options (appended to the URL as parameters after ?)

The general format is:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/<input specification>/<operation specification>/[<output specification>][?<operation_options>]
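The format above can be assembled mechanically. The following is a minimal sketch, not an official client: `build_pug_url` is a hypothetical helper name, and it assumes that commas should stay literal in the path and query while other special characters are percent-encoded.

```python
from urllib.parse import quote, urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pug_url(input_spec, operation=None, output=None, options=None):
    """Join the four parts of a PUG REST request into one URL.

    input_spec is a (domain, namespace, identifiers) tuple; operation,
    output and options are optional, as in the general format.
    """
    # Keep commas literal so ID lists such as "1,2,3" stay readable
    segments = [quote(str(part), safe=",") for part in input_spec]
    if operation:
        # An operation may itself contain a slash, e.g. "property/MolecularWeight"
        segments.append(quote(operation, safe="/,"))
    if output:
        segments.append(output)
    url = BASE + "/" + "/".join(segments)
    if options:
        url += "?" + urlencode(options, safe=",")
    return url


print(build_pug_url(("compound", "cid", "2244"), "property/MolecularFormula", "JSON"))
```

Passing `None` for the operation leaves that segment out, which corresponds to requesting the full record.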

 

Input

The input specification consists of three parts:

<input specification> = <domain>/<namespace>/<identifiers>

<domain> = substance | compound | assay | <other inputs>

compound domain <namespace> = cid | name | smiles | inchi | sdf | inchikey | formula | <structure search> | <xref> | listkey | <fast search>

<structure search> = {substructure | superstructure | similarity | identity}/{smiles | inchi | sdf | cid}

<fast search> = {fastidentity | fastsimilarity_2d | fastsimilarity_3d | fastsubstructure | fastsuperstructure}/{smiles | smarts | inchi | sdf | cid} | fastformula

<xref> = xref / {RegistryID | RN | PubMedID | MMDBID | ProteinGI | NucleotideGI | TaxonomyID | MIMID | GeneID | ProbeID | PatentID}

substance domain <namespace> = sid | sourceid/<source id> | sourceall/<source name> | name | <xref> | listkey

<source name> = any valid PubChem depositor name

assay domain <namespace> = aid | listkey | type/<assay type> | sourceall/<source name> | target/<assay target> | activity/<activity column name>

<assay type> = all | confirmatory | doseresponse | onhold | panel | rnai | screening | summary | cellbased | biochemical | invivo | invitro | activeconcentrationspecified

<assay target> = gi | proteinname | geneid | genesymbol | accession

<identifiers> = comma-separated list of positive integers (e.g. cid, sid, aid) or identifier strings (source, inchikey, formula); in some cases only a single identifier string (name, smiles, xref; inchi, sdf by POST only)

<other inputs> = sources / [substance, assay] | sourcetable | conformers | annotations/[sourcename/<source name> | heading/<heading>]

 

Example (the input for the compound with CID 2244):

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/<operation specification>/[<output specification>]

Besides classifying inputs by parameter as above, inputs can also be grouped by input method:

 

By Identifier

Search by a specific ID:

assay->aid

compound->cid

substance->sid

A comma-separated list of IDs can also be used, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/property/MolecularFormula,MolecularWeight,CanonicalSMILES/CSV

 

By Name

 

Search by name. A partial name also works, but in that case the search type must be narrowed to single-word matching, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/myxalamid/cids/XML?name_type=word
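Names may contain spaces or other characters that are not valid in a URL path, so it is safer to percent-encode them before building the request. A small sketch of this, where `name_to_cids_url` is a hypothetical helper and the only option used is the `name_type=word` parameter from the example above:

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cids_url(name, partial=False, output="XML"):
    """Build a name -> CIDs lookup URL, optionally as a single-word match."""
    # Encode every reserved character so a space or slash cannot break the path
    url = f"{BASE}/compound/name/{quote(name, safe='')}/cids/{output}"
    if partial:
        # Taken from the example above: narrows the match to single words
        url += "?name_type=word"
    return url


print(name_to_cids_url("myxalamid", partial=True))
print(name_to_cids_url("acetic acid", output="TXT"))
```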

 

By Structure Identity

Search using a structure descriptor such as SDF, SMILES, or InChI as the input, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CCCC/cids/TXT

 

By Structure Search

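Run a structure search. The available search types are substructure, superstructure, similarity, and identity. Because the query is matched against the millions of molecules in the whole PubChem database, it takes a long time, so the response returns a "ListKey" instead of the results, for example: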

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/substructure/smiles/C1CCCCCC1/XML

This "ListKey" can then be used in a follow-up request to retrieve the results, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/12345678910/cids/TXT
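The two-step flow can be sketched as follows: read the "ListKey" out of the first reply, then build the follow-up URL. The sample reply below is hand-written and its exact element layout is an assumption, so the lookup only relies on an element named `ListKey` existing somewhere in the XML. `extract_listkey` and `listkey_url` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Hand-written sample of a "still running" reply; the real layout may differ
SAMPLE_REPLY = """<Waiting xmlns="http://pubchem.ncbi.nlm.nih.gov/pug_rest">
  <ListKey>12345678910</ListKey>
  <Message>Your request is running</Message>
</Waiting>"""


def extract_listkey(xml_text):
    """Return the text of the first element named ListKey, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Drop any "{namespace}" prefix so the match works with or without one
        if element.tag.rsplit("}", 1)[-1] == "ListKey":
            return (element.text or "").strip()
    return None


def listkey_url(listkey, operation="cids", output="TXT"):
    """Build the follow-up request that reads results stored under a ListKey."""
    return f"{BASE}/compound/listkey/{listkey}/{operation}/{output}"


key = extract_listkey(SAMPLE_REPLY)
print(listkey_url(key))
```

In a real client the follow-up request would presumably be repeated after a short pause until the search has finished.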

 

By Fast (Synchronous) Structure Search

 

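Because the search above is slow, a fast structure search was developed later. It runs synchronously: instead of producing a "ListKey", it returns the results immediately from a single call. Usage is as follows: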

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastidentity/cid/5793/cids/TXT?identity_type=same_connectivity

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsubstructure/cid/2244/cids/XML?StripHydrogen=true

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/cid/2244/property/MolecularWeight,MolecularFormula,RotatableBondCount/XML?Threshold=99
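Since a fast search answers in a single call, a client only needs to build the URL and parse the reply. The sketch below assumes that `cids/TXT` output lists one CID per line; the sample text is hand-written, not a captured search result, and both function names are hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def fast_search_url(search, cid, output="TXT", **options):
    """Build a fast (synchronous) structure search that returns CIDs."""
    url = f"{BASE}/compound/{search}/cid/{cid}/cids/{output}"
    if options:
        # Options such as identity_type or Threshold go after the "?"
        url += "?" + urlencode(options)
    return url


def parse_cid_lines(text):
    """Parse TXT output, assumed to contain one integer CID per line."""
    return [int(token) for token in text.split()]


print(fast_search_url("fastidentity", 5793, identity_type="same_connectivity"))
print(parse_cid_lines("2244\n5793\n"))  # hand-written sample input
```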


 

By Cross-Reference (XRef)

 

Here the input is an identifier from another database rather than a PubChem identifier. The accepted cross-reference types are listed below:

 

Cross-reference: Meaning

  • RegistryID: external registry identifier
  • RN: registry number
  • PubMedID: NCBI PubMed identifier
  • MMDBID: NCBI MMDB identifier
  • DBURL: external database home page URL
  • SBURL: external database substance URL
  • ProteinGI: NCBI protein GI
  • NucleotideGI: NCBI nucleotide GI
  • TaxonomyID: NCBI taxonomy identifier
  • MIMID: NCBI MIM identifier
  • GeneID: NCBI gene identifier
  • ProbeID: NCBI probe identifier
  • PatentID: patent identifier
  • SourceName: external depositor name
  • SourceCategory: depositor category(ies)

Usage is as follows:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/xref/PatentID/US20050159403A1/sids/JSON

 

Operation

(In other words, the operation selects which data to retrieve for the given input.)

compound domain <operation specification> = record | <compound property> | synonyms | sids | cids | aids | assaysummary | classification | <xrefs> | description | conformers

<compound property> = property / [comma-separated list of property tags]

substance domain <operation specification> = record | synonyms | sids | cids | aids | assaysummary | classification | <xrefs> | description

<xrefs> = xrefs / [comma-separated list of xrefs tags]

assay domain <operation specification> = record | concise | aids | sids | cids | description | targets/<target type> | <doseresponse> | summary | classification

<target type> = {ProteinGI, ProteinName, GeneID, GeneSymbol}

<doseresponse> = doseresponse/sid

The data that can be retrieved are as follows:

Available Data

Full Records

 

When no operation is specified, the full record for the given input is returned by default. The applicable output formats are ASN.1 (NCBI's native format), XML, SDF, and even JSON. Examples:

 

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/SDF

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/record/XML

Records for several identifiers can be retrieved at once (as a comma-separated list), for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sid/1,2,3,4,5/SDF

 

Images

To get an image of a molecule, omit the operation and request PNG output. Only one image can be returned per request. For example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/lipitor/PNG

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/CCCCC=O/PNG

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/RZJQGNCSTQAWON-UHFFFAOYSA-N/PNG


 

Compound Properties

All compound properties are listed below:

Property: Notes

  • MolecularFormula: Molecular formula.
  • MolecularWeight: The molecular weight is the sum of all atomic weights of the constituent atoms in a compound, measured in g/mol. In the absence of explicit isotope labelling, averaged natural abundance is assumed. If an atom bears an explicit isotope label, 100% isotopic purity is assumed at this location.
  • CanonicalSMILES: Canonical SMILES (Simplified Molecular Input Line Entry System) string. It is a unique SMILES string of a compound, generated by a "canonicalization" algorithm.
  • IsomericSMILES: Isomeric SMILES string. It is a SMILES string with stereochemical and isotopic specifications.
  • InChI: Standard IUPAC International Chemical Identifier (InChI). It does not allow for user selectable options in dealing with the stereochemistry and tautomer layers of the InChI string.
  • InChIKey: Hashed version of the full standard InChI, consisting of 27 characters.
  • IUPACName: Chemical name systematically determined according to the IUPAC nomenclatures.
  • XLogP: Computationally generated octanol-water partition coefficient or distribution coefficient. XLogP is used as a measure of hydrophilicity or hydrophobicity of a molecule.
  • ExactMass: The mass of the most likely isotopic composition for a single molecule, corresponding to the most intense ion/molecule peak in a mass spectrum.
  • MonoisotopicMass: The mass of a molecule, calculated using the mass of the most abundant isotope of each element.
  • TPSA: Topological polar surface area, computed by the algorithm described in the paper by Ertl et al.
  • Complexity: The molecular complexity rating of a compound, computed using the Bertz/Hendrickson/Ihlenfeldt formula.
  • Charge: The total (or net) charge of a molecule.
  • HBondDonorCount: Number of hydrogen-bond donors in the structure.
  • HBondAcceptorCount: Number of hydrogen-bond acceptors in the structure.
  • RotatableBondCount: Number of rotatable bonds.
  • HeavyAtomCount: Number of non-hydrogen atoms.
  • IsotopeAtomCount: Number of atoms with enriched isotope(s).
  • AtomStereoCount: Total number of atoms with tetrahedral (sp3) stereo [e.g., (R)- or (S)-configuration].
  • DefinedAtomStereoCount: Number of atoms with defined tetrahedral (sp3) stereo.
  • UndefinedAtomStereoCount: Number of atoms with undefined tetrahedral (sp3) stereo.
  • BondStereoCount: Total number of bonds with planar (sp2) stereo [e.g., (E)- or (Z)-configuration].
  • DefinedBondStereoCount: Number of bonds with defined planar (sp2) stereo.
  • UndefinedBondStereoCount: Number of bonds with undefined planar (sp2) stereo.
  • CovalentUnitCount: Number of covalently bound units.
  • Volume3D: Analytic volume of the first diverse conformer (default conformer) for a compound.
  • XStericQuadrupole3D: The x component of the quadrupole moment (Qx) of the first diverse conformer (default conformer) for a compound.
  • YStericQuadrupole3D: The y component of the quadrupole moment (Qy) of the first diverse conformer (default conformer) for a compound.
  • ZStericQuadrupole3D: The z component of the quadrupole moment (Qz) of the first diverse conformer (default conformer) for a compound.
  • FeatureCount3D: Total number of 3D features (the sum of FeatureAcceptorCount3D, FeatureDonorCount3D, FeatureAnionCount3D, FeatureCationCount3D, FeatureRingCount3D and FeatureHydrophobeCount3D).
  • FeatureAcceptorCount3D: Number of hydrogen-bond acceptors of a conformer.
  • FeatureDonorCount3D: Number of hydrogen-bond donors of a conformer.
  • FeatureAnionCount3D: Number of anionic centers (at pH 7) of a conformer.
  • FeatureCationCount3D: Number of cationic centers (at pH 7) of a conformer.
  • FeatureRingCount3D: Number of rings of a conformer.
  • FeatureHydrophobeCount3D: Number of hydrophobes of a conformer.
  • ConformerModelRMSD3D: Conformer sampling RMSD in Å.
  • EffectiveRotorCount3D: Effective rotor count of the compound, a measure of its conformational flexibility.
  • ConformerCount3D: The number of conformers in the conformer model for a compound.
  • Fingerprint2D: Base64-encoded PubChem Substructure Fingerprint of a molecule.

 

Usage examples:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sourceid/IBM/5F1CA2B314D35F28C7F94168627B29E3/ASNT 

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sourceid/DTP.NCI/747285/SDF

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sourceid/DTP.NCI/747285/PNG

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/SDF

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/PNG

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/SDF?record_type=3d

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/PNG?record_type=3d&image_size=small

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/SDF

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/BPGDAMSIGCZZLK-UHFFFAOYSA-N/SDF

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/XML

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/CSV?sid=26736081,26736082,26736083

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/concise/CSV

 

Likewise, several properties can be retrieved in a single request, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/property/MolecularWeight,MolecularFormula,HBondDonorCount,HBondAcceptorCount,InChIKey,InChI/CSV
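A CSV reply of this kind is easy to load into a lookup table with the standard library. The sketch below is illustrative only: the sample text is typed by hand for aspirin (CID 2244) rather than captured from the service, and it assumes the first column is named `CID`. `parse_property_csv` is a hypothetical helper name.

```python
import csv
import io

# Hand-typed sample in the shape of a property CSV reply (not captured output)
SAMPLE_CSV = '"CID","MolecularFormula","MolecularWeight"\n2244,"C9H8O4",180.16\n'


def parse_property_csv(text):
    """Turn a property CSV reply into a dict keyed by integer CID."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        # Assumption: the identifier column is called "CID"
        cid = int(row["CID"])
        table[cid] = {name: value for name, value in row.items() if name != "CID"}
    return table


properties = parse_property_csv(SAMPLE_CSV)
print(properties[2244]["MolecularFormula"])
```

All values stay strings here; converting numeric columns such as MolecularWeight is left to the caller.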


Synonyms

Get all names of a compound or substance:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/vioxx/synonyms/XML

 

Cross-References (XRefs)

Retrieve identifiers from other databases. The available cross-reference types are listed above:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/xrefs/MMDBID/XML

 

BioAssays

In PubChem, an assay has two parts: the Assay Description and the Assay Data.

The former includes authorship, general description, protocol, and definitions of the data readout columns.

The latter contains the data measured in the experiment.

 

Assay Description

To retrieve only the descriptive content of an assay, use the description operation.

Valid output formats are XML, JSON(P), and ASNT/B, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/description/XML

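There is also a second form, summary. It is less detailed than the description above, but it includes the related targets and the counts of active and inactive SIDs and CIDs, for example: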

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1000/summary/JSON


 

 

Assay Data

The data of an assay can be retrieved as follows:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/CSV


An assay can involve many SIDs (substances). To get results for specific substances only, specify them in the request.

First, list the SIDs tested in an assay:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/sids/TXT

Then specify the SIDs of interest:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/XML?sid=104169547,109967232

Assay Targets

Get information about the targets of an assay:

Valid output formats are XML, JSON(P), ASNT/B, and TXT. The valid target types are listed below:

Target Type: Notes

  • ProteinGI: NCBI GI of a protein sequence
  • ProteinName: protein name
  • GeneID: NCBI Gene database identifier
  • GeneSymbol: gene symbol

 

Example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/490,1000/targets/ProteinGI,ProteinName,GeneID,GeneSymbol/XML

Not all assays have a defined protein or gene target.

Conversely, assays can be looked up by a specific target, for example to find which assays were run against the gene target USP2:

 

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/target/genesymbol/USP2/aids/TXT


Activity Name

Search by the name of an activity column measured in an assay, for example to find all assays that report EC50:

 

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/activity/EC50/aids/JSON

 

Get the assay summary for a compound or substance:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1000,1001/assaysummary/CSV

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sid/104234342/assaysummary/XML


 


Get dose-response data for substances or compounds:

A single AID (assay) returns dose-response data for at most 1000 SIDs. Valid output formats are XML, JSON(P), ASNT/B, and CSV:

 

Option (Allowed Values): Meaning

  • sid (listkey, or comma-separated integers): SID rows to retrieve for an assay
  • listkey (valid SID listkey): listkey containing SIDs, if using sid=listkey

Examples:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/doseresponse/XML

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/504526/doseresponse/CSV?sid=104169547,109967232

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/doseresponse/XML (with “aid=504526&sid=104169547,109967232” in the POST body)

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/602332/sids/XML?sids_type=doseresponse&list_return=listkey

followed by

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/602332/doseresponse/CSV?sid=listkey&listkey=xxxxxx&listkey_count=100 (where ‘xxxxxx’ is the listkey returned by the previous URL)
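The POST variant among the examples above can be prepared with the standard library without sending anything. This is a sketch under one assumption that is not stated in the text: that the body is sent form-encoded with the `application/x-www-form-urlencoded` content type. `doseresponse_post` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/doseresponse/XML"


def doseresponse_post(aid, sids):
    """Prepare (but do not send) a POST request carrying aid and sid in the body."""
    # Keep commas literal so the body matches the "aid=...&sid=a,b" form
    body = urlencode({"aid": aid, "sid": ",".join(str(s) for s in sids)}, safe=",")
    return Request(
        URL,
        data=body.encode("ascii"),
        # Assumption: form encoding is the expected content type
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = doseresponse_post(504526, [104169547, 109967232])
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

The prepared request can then be passed to `urllib.request.urlopen`, ideally with a timeout.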

 

Output

Choose the output format:

<output specification> = XML | ASNT | ASNB | JSON | JSONP [ ?callback=<callback name> ] | SDF | CSV | PNG | TXT

operation_options

The "ListKey" is not limited to structure searches. When the result is a large list, an operation option can store it on the server instead of returning it directly, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/sids/XML?list_return=listkey

The "ListKey" returned here was 1757602293094779987 (AID 640 is a very large assay involving many substances (SIDs)).

Next, read from the returned "ListKey", using listkey_start= and listkey_count= to limit the amount of output so that the request does not take too long.

Using the "ListKey" above:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/640/CSV?sid=listkey&listkey=1757602293094779987&listkey_start=0&listkey_count=1000
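To walk through a whole stored list, the same request can be repeated with an advancing listkey_start. The generator below is a minimal sketch: `listkey_pages` is a hypothetical name, and `total` (the number of SIDs stored under the key) is something the caller has to know; the value 2500 in the usage line is made up for illustration.

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def listkey_pages(aid, listkey, total, page_size=1000, output="CSV"):
    """Yield one URL per page of the SIDs stored under a ListKey."""
    for start in range(0, total, page_size):
        # The last page may be shorter than page_size
        count = min(page_size, total - start)
        yield (
            f"{BASE}/assay/aid/{aid}/{output}?sid=listkey&listkey={listkey}"
            f"&listkey_start={start}&listkey_count={count}"
        )


# total=2500 is a hypothetical figure, not the real size of AID 640
for url in listkey_pages(640, "1757602293094779987", total=2500):
    print(url)
```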

The result contains 1000 rows of data.

 

With listkey_start= and listkey_count= set appropriately, the "ListKey" can even be used as the input itself. Taking the "ListKey" above as an example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/listkey/1757602293094779987/synonyms/XML?listkey_start=0&listkey_count=10


In summary, these options can be combined to retrieve specific data from PubChem, for example:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/property/MolecularFormula/JSON


  

 

 

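All of the above are examples of retrieving information through the URL-based API.

Not every request returns results, however. In some cases an error occurs and the system returns an error message, which we need to be able to read in order to find out what went wrong: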

 

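invalid input: the input is not valid

nothing was found for the given query: nothing matches the given input

the request was too broad and took too long to complete: the result set is too large, so the search takes too long to finish (PUG REST limits each web service request to a maximum of 30 s)

A specific status code is returned as well. The codes and their meanings are listed below: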

HTTP Status, Error Code: General Error Category

  • 200, (none): Success
  • 202, (none): Accepted (asynchronous operation pending)
  • 400, PUGREST.BadRequest: Request is improperly formed (syntax error in the URL, POST body, etc.)
  • 404, PUGREST.NotFound: The input record was not found (e.g. invalid CID)
  • 405, PUGREST.NotAllowed: Request not allowed (such as invalid MIME type in the HTTP Accept header)
  • 504, PUGREST.Timeout: The request timed out, from server overload or too broad a request
  • 503, PUGREST.ServerBusy: Too many requests or server is busy, retry later
  • 501, PUGREST.Unimplemented: The requested operation has not (yet) been implemented by the server
  • 500, PUGREST.ServerError: Some problem on the server side (such as a database server down, etc.)
  • 500, PUGREST.Unknown: An unknown error occurred
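In a script, the status codes above can drive a simple decision about what to do next. The following sketch copies the codes from the table; the policy of retrying only on 503 and 504 is an assumption made for illustration, and `classify_status` is a hypothetical helper name.

```python
# Status codes and error codes copied from the table above
ERROR_CODES = {
    400: "PUGREST.BadRequest",
    404: "PUGREST.NotFound",
    405: "PUGREST.NotAllowed",
    500: "PUGREST.ServerError or PUGREST.Unknown",
    501: "PUGREST.Unimplemented",
    503: "PUGREST.ServerBusy",
    504: "PUGREST.Timeout",
}


def classify_status(status):
    """Return (outcome, error_code) for an HTTP status returned by PUG REST."""
    if status == 200:
        return ("success", None)
    if status == 202:
        # Asynchronous operation still pending: check again later
        return ("pending", None)
    code = ERROR_CODES.get(status, "unlisted")
    # Assumption: only busy or timed-out requests are worth retrying unchanged
    outcome = "retry" if status in (503, 504) else "fail"
    return (outcome, code)


for status in (200, 202, 404, 503):
    print(status, classify_status(status))
```

A request that fails with 400 or 404 will presumably fail again unchanged, so those are reported rather than retried.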

 

 
