WURCS

Web3 Unique Representation of Carbohydrate Structures

This document is being migrated to the Wiki pages

Introduction

Theoretical considerations

Figures

WURCS: Introduction

The WURCS format is as follows:

WURCS=<Version>/<Unit Count>/<UniqueRES List>/<RES Sequence>/<LIN List>

Definitions of each section and component

<Version>: 2.0 (current version)

<Unit Count>: <UniqueRES Count>,<RES Count>,<LIN Count>

<Uncertain LIN Count>: <LIN Count>+

<UniqueRES List>: <UniqueRES #1><UniqueRES #2>...<UniqueRES #n>

<UniqueRES>: [<ResidueCode>]

<ResidueCode>: <BackboneCode>_<MOD #1>_<MOD #2>_..._<MOD #n>

<BackboneCode>: <SkeletonCode>-<Anomeric Information>

<MOD>: <LIP #1>-<LIP #2>-...-<LIP #n><MAP>

<LIP>: <Position><Direction><Star Index>

<RES Sequence>: <RES #1>-<RES #2>-...-<RES #n>

<LIN List>: <LIN #1>_<LIN #2>_..._<LIN #n>

<LIN>: <GLIP #1>-<GLIP #2>-...-<GLIP #n><MAP>

<GLIP>: <RES Index><Position><Direction><Star Index>

<Statistic LIP>: %<Probability Range>%<LIP> or <LIP>%<Probability Range>%

<Probability Range>: <Upper Probability>-<Lower Probability> or <Probability>

<Alternative GLIP>: <GLIP #1>|<GLIP #2>|...|<GLIP #n>

<Statistic GLIP>: %<Probability Range>%<GLIP> or <GLIP>%<Probability Range>%

<RES Alternative GLIP>: {<Alternative GLIP> or <Alternative GLIP>}

<Repeated LIN>: <LIN>~<Repeat Count Range>

<Repeat Count Range>: <Max Repeat Count>:<Min Repeat Count> or <Repeat Count>
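
The top-level layout above can be taken apart with a short parser. The following is a minimal sketch, not a reference implementation: the names `Wurcs` and `parse_wurcs` are hypothetical, and the two input strings in the usage note are illustrative strings assembled to follow the grammar above. It only splits the five sections and does not interpret statistic, alternative, or repeated elements. The bracketed UniqueRES entries are matched as whole units, because a MAP inside a ResidueCode may itself contain "/" (for example `*NCC/3=O`).

```python
import re
from typing import List, NamedTuple


class Wurcs(NamedTuple):
    version: str
    unit_count: List[str]    # [<UniqueRES Count>, <RES Count>, <LIN Count>], kept as text
    unique_res: List[str]    # ResidueCodes without the surrounding brackets
    res_sequence: List[str]  # entries of <RES Sequence>
    lin_list: List[str]      # raw <LIN> strings


# A plain split on "/" would cut through MAP strings, so each bracketed
# UniqueRES entry is matched as a unit instead.
_WURCS_RE = re.compile(
    r"WURCS=(?P<version>[^/]+)"
    r"/(?P<counts>[^/]+)"
    r"/(?P<unique>(?:\[[^\]]*\])+)"
    r"/(?P<seq>[^/]*)"
    r"/(?P<lin>.*)"
)


def parse_wurcs(text: str) -> Wurcs:
    """Split a WURCS string into its five top-level sections."""
    m = _WURCS_RE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not a WURCS string: {text!r}")
    unit_count = m["counts"].split(",")
    unique_res = re.findall(r"\[([^\]]*)\]", m["unique"])
    res_sequence = m["seq"].split("-") if m["seq"] else []
    lin_list = m["lin"].split("_") if m["lin"] else []
    return Wurcs(m["version"], unit_count, unique_res, res_sequence, lin_list)
```

For instance, `parse_wurcs("WURCS=2.0/2,2,1/[a2122h-1x_1-5][a2112h-1b_1-5]/1-2/a4-b1")` yields two ResidueCodes, the RES sequence `["1", "2"]`, and the single LIN `"a4-b1"`. The LIN count is kept as text so that an uncertain count such as `1+` survives unchanged.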

Omission rules

Definitions of WURCS components

Glycan

| Definition | Description |
|---|---|
| Backbone | The carbon backbone of a monosaccharide (residue) in a glycan, containing only carbons. The backbone has at least three carbon atoms. |
| Modification | Components of monosaccharides other than the backbone carbons. |
| Aglycone | The non-sugar component of a glycan. |
| Connection | The connection information between a backbone and a modification or an aglycone. |

| Definition | Description |
|---|---|
| Monosaccharide | A combination of a Backbone, the Modification(s) on the Backbone, and their Connection(s). |
| Linkage | A combination of a Modification bridging two or more Backbones and their Connections. |

Backbone

Definition:

Modification

Definition:

Aglycone

WURCS=2.0/1,1,0/[a2122h-1x_1-5_1@]/1/

Connection

WURCS Format Components

SkeletonCode

CarbonDescriptor

Table 3.2-1. CarbonDescriptors for the generic functional groups of monosaccharides. In the Fischer projection, X and Y are atoms other than hydrogen and backbone carbons. A is a heteroatom in an ether ring.

| Fischer projection | Typical functional group | CarbonDescriptor | Position in the backbone |
|---|---|---|---|

Anomeric Information

| Anomeric Symbol | Description |
|---|---|
| a | alpha |
| b | beta |
| u | up (absolute representation) |
| d | down (absolute representation) |
| x | unknown |
| o | no anomeric center (used only when the absence of an anomeric center must be stated explicitly because the anomer is not represented in the SkeletonCode) |

Table 3.3-1. Comparison of the Anomeric Symbols “a”, “b”, “u” and “d” with the corresponding monosaccharide structures in Haworth representation and Fischer projection, and the configurational relationship between the anomeric center and the anomeric reference atom in the Fischer projection.

| Symbol | Haworth representation*1 | Fischer projection*1 | Configurational relationship*2 |
|---|---|---|---|
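
As an illustration of how a BackboneCode breaks down into a SkeletonCode and its Anomeric Information, here is a minimal sketch. It assumes, as in the example `a2122h-1x` used elsewhere in this document, that the Anomeric Information is an anomeric position (digits) followed by exactly one Anomeric Symbol. The function name `split_backbone_code` is hypothetical, and uncertain or statistic notations are not handled.

```python
import re
from typing import Optional, Tuple

# Anomeric Symbols and their meanings, as listed in the table above.
ANOMERIC_SYMBOLS = {
    "a": "alpha",
    "b": "beta",
    "u": "up (absolute representation)",
    "d": "down (absolute representation)",
    "x": "unknown",
    "o": "no anomeric center",
}


def split_backbone_code(code: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (SkeletonCode, anomeric position, symbol meaning).

    Position and meaning are None when the BackboneCode carries no
    Anomeric Information.
    """
    skeleton, sep, anomeric = code.partition("-")
    if not sep:
        return skeleton, None, None
    # Assumption: <Anomeric Information> = <position digits><one symbol>.
    m = re.fullmatch(r"(\d+)([abudxo])", anomeric)
    if m is None:
        raise ValueError(f"unsupported Anomeric Information: {anomeric!r}")
    return skeleton, m.group(1), ANOMERIC_SYMBOLS[m.group(2)]
```

With this sketch, `split_backbone_code("a2122h-1x")` returns `("a2122h", "1", "unknown")`.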

MAP (Modification Atom Path)

Table 3.4-1

| Abbreviation | Substituent group | MAP |
|---|---|---|
| -H | Hydrogen | * (H is omitted) |
| -OH | Hydroxyl | *O |
| -O- | Ether | *O* |
| -N | Primary Amine | *N |
| -Me | C-linked Methyl | *C |
| -OMe | O-linked Methyl | *OC |
| -Ac | C-linked Acetyl | *CC/2=O |
| -OAc | O-linked Acetyl | *OCC/3=O |
| -NAc | N-Acetyl | *NCC/3=O |
| -NGc | N-Glycolyl | *NCCO/3=O |
| -P | Phosphate | *OPO/3O/3=O |
| -S | Sulfate | *OSO/3=O/3=O |
| -Pyr | Pyruvate | *OCCC/4=O/3=O |
| -PC | Phosphocholine | *OPOCCNC/7C/7C/3O/3=O |
| -PEtn | Phosphoethanolamine | *OPOCCN/3O/3=O |
| -PPEtn | Diphosphoethanolamine | *OPOPOCCN/5O/5=O/3O/3=O |

Figure 3.4-1 2-aminoethanol (ethanolamine) having two linkages to monosaccharide backbones.
Figure 3.4-2 phytosphingosine (2S,3R,4R-2-amino-octadecane-1,3,4-triol) having four linkages to monosaccharide(s) on the oxygen and nitrogen atoms and its MAP string.

Definition of symbols in MAP

Table 3.4.1-1. Description of the MAP symbols.

| Symbol | Description |
|---|---|
| H, C, O, … | Atom symbol. An atom cannot be written as an abbreviation such as “Me” or “Et”. |
| *[n] | Backbone carbon in the MAP. If the MAP contains multiple backbone carbons that are not equivalent, a StarIndex is added after “*”. |
| /n | Branch from an atom that has already appeared in the MAP. n is the index number of the atom at the branching point among the MAP atoms. |
| $n | Ring closure (cyclic). n is the index number of the atom at the cyclic point among the MAP atoms. |
| =, # | Double and triple bond, respectively. |
| ^R, ^S, ^X | Chirality of the previous atom. R and S follow the CIP rules. X indicates unknown chirality. |
| ^E, ^Z, ^X | Geometrical isomerism of the previous bond. E and Z follow the CIP rules. X indicates unknown E/Z. |
| (, ) | Start and end of an aromatic atom group. Within this group, the double bond “=” is omitted. |

Figure 3.4.1-1. Example of numbering for index numbers of atoms in large MAP.
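
To make the symbol table concrete, the sketch below reads a MAP string into a list of atoms and a list of bonds. It is a simplified reading under stated assumptions, not the official parser: the name `parse_map` is hypothetical, the backbone carbon `*` is counted as atom 1 (which is consistent with the `/n` indices in the MAP strings of Table 3.4-1), a stereo label is simply attached to the index of the most recent atom, and the `$n` and parenthesis symbols are rejected rather than interpreted.

```python
import re
from typing import Dict, List, Tuple

# One alternative per supported MAP symbol.
_TOKEN_RE = re.compile(
    r"(?P<star>\*\d*)"         # backbone carbon, optional StarIndex
    r"|(?P<branch>/\d+)"       # continue from an earlier atom
    r"|(?P<bond>[=#])"         # double or triple bond to the next atom
    r"|(?P<stereo>\^[RSEZX])"  # stereo label
    r"|(?P<atom>[A-Z][a-z]?)"  # element symbol such as C, O, Br, Cl
)


def parse_map(map_string: str) -> Tuple[List[str], List[Tuple[int, int, int]], Dict[int, str]]:
    """Return (atoms, bonds, stereo) with 1-based atom indices.

    Each bond is (from_index, to_index, bond_order).
    """
    atoms: List[str] = []
    bonds: List[Tuple[int, int, int]] = []
    stereo: Dict[int, str] = {}
    attach_to = None  # index of the atom the next atom bonds to
    order = 1         # bond order for the next bond
    pos = 0
    while pos < len(map_string):
        m = _TOKEN_RE.match(map_string, pos)
        if m is None:
            raise ValueError(f"unsupported MAP symbol at offset {pos}: {map_string!r}")
        kind, tok = m.lastgroup, m.group()
        if kind in ("star", "atom"):
            atoms.append(tok)
            index = len(atoms)
            if attach_to is not None:
                bonds.append((attach_to, index, order))
            attach_to, order = index, 1
        elif kind == "branch":
            n = int(tok[1:])
            if not 1 <= n <= len(atoms):
                raise ValueError(f"branch index {n} is out of range")
            attach_to = n
        elif kind == "bond":
            order = 2 if tok == "=" else 3
        else:  # stereo: recorded against the most recent atom (simplification)
            stereo[len(atoms)] = tok[1:]
        pos = m.end()
    return atoms, bonds, stereo
```

For the O-linked acetyl MAP `*OCC/3=O`, this gives the atoms `*`, `O`, `C`, `C`, `O`, where `/3` returns to the third atom (the carbonyl carbon) before the double-bonded oxygen is attached.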

Ambiguous MAP string

Generation of MAP string

Generation of MAP

MAP vs substituent name

Table 6.2-1. Monovalent substituents

| Abbreviation | Substituent group | Substituent name in MonosaccharideDB*1 | MAP string |
|---|---|---|---|
| -Ac | C-linked acetyl | (nd:1)acetyl | *CC/2=O |
| -OAc | O-acetyl | (no:1)acetyl | *OCC/3=O |
| -NAc | N-acetyl | (nd:1)n-acetyl | *NCC/3=O |
| -Br | C-linked bromo | (nd:1)bromo | *Br |
| -Cl | C-linked chloro | (nd:1)chloro | *Cl |
| -Et | C-linked ethyl | (nd:1)ethyl | *CC |
| -OEt | O-linked ethyl | (no:1)ethyl | *OCC |
| -F | Fluoro | (nd:1)fluoro | *F |
| -CHO | C-linked formyl | (nd:1)formyl | *C=O |
| -OCHO | O-formyl | (no:1)formyl | *OC=O |
| -NCHO | N-formyl | (nd:1)n-formyl | *NC=O |
| -Gc | C-linked glycolyl | (nd:1)glycolyl | *CCO/2=O |
| -OGc | O-glycolyl | (no:1)glycolyl | *OCCO/3=O |
| -NGc | N-glycolyl | (nd:1)n-glycolyl | *NCCO/3=O |
| -CO | Hydroxymethyl | (nd:1)hydroxymethyl | *CO |
| -I | Iodo | (nd:1)iodo | *I |
| -Me | C-linked methyl | (nd:1)methyl | *C |
| -OMe | O-methyl | (no:1)methyl | *OC |
| -NMe | N-methyl | (nd:1)n-methyl | *NC |
| -NAla | N-alanine | (nd:1)n-alanine | *NCC^XC/4N/3=O |
| 0 | N-dimethyl | (nd:1)n-dimethyl | *NC/2C |
| -OC(=O)–(CH2)2–COOH | O-succinate | (no:1)succinate | *OCCCCO/6=O/3=O |
| -NHC(=O)–(CH2)2–COOH | N-succinate | (nd:1)n-succinate | *NCCCCO/6=O/3=O |
| -NC(=O)CF3 | N-trifluoroacetyl | (nd:1)n-trifluoroacetyl | *NCCF/4F/4F/3=O |
| -C(=O)=O | Nitrate | (nd:1)nitrate | *C=O/2=O |
| | C-linked (R)-lactate | (nd:1)(r)-lactate | *CC^RC/3O/2=O |
| | C-linked (S)-lactate | (nd:1)(s)-lactate | *CC^SC/3O/2=O |
| | O-linked (R)-lactate | (no:1)(r)-lactate | *OCC^RC/4O/3=O |
| | O-linked (S)-lactate | (no:1)(s)-lactate | *OCC^SC/4O/3=O |
| -SH | Thio | (no:1)thio | *S |
| -C(=NH)NH2 | C-linked amidino | amidino | *CN/2=N |
| -NHC(=NH)N | N-amidino | n-amidino | *NCN/3=N |
| -CCOOH | C-linked carboxymethyl | carboxymethyl | *CCO/3=O |
| | | (r)-carboxyethyl | *C^RCO/3=O/2C |
| | | (s)-carboxyethyl | *C^SCO/3=O/2C |
| | | (no:1)phospho-choline | *OPOCCNC/7C/7C/3O/3=O |
| | | phosphate | *OPO/3O/3=O |

*1 n is the number of the linkage position.
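
The correspondence between MAP strings and substituent groups can be used as a plain lookup. The sketch below is illustrative only: `MAP_TO_GROUP` holds a handful of rows copied from Table 6.2-1, and the hypothetical helper `describe_mod` assumes that a MOD consists of "-"-separated LIPs followed by a MAP that starts at the first "*". Statistic LIPs and other extended notations are not handled.

```python
from typing import List, Optional, Tuple

# A few rows copied from Table 6.2-1 (MAP string -> substituent group).
MAP_TO_GROUP = {
    "*CC/2=O": "C-linked acetyl",
    "*OCC/3=O": "O-acetyl",
    "*NCC/3=O": "N-acetyl",
    "*NCCO/3=O": "N-glycolyl",
    "*OC": "O-methyl",
}


def describe_mod(mod: str) -> Tuple[List[str], str, Optional[str]]:
    """Split a MOD into (LIPs, MAP string, substituent group or None)."""
    star = mod.find("*")
    if star == -1:
        # No MAP is written in this MOD.
        lips, map_string = mod, ""
    else:
        lips, map_string = mod[:star], mod[star:]
    return lips.split("-"), map_string, MAP_TO_GROUP.get(map_string)
```

For example, `describe_mod("2*NCC/3=O")` returns `(["2"], "*NCC/3=O", "N-acetyl")`, while a MOD without a written MAP such as `"1-5"` returns `(["1", "5"], "", None)`.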