
Tokenizer


Regular Expression Patterns

Tokenization is done mainly through the following regular expression patterns.

Bond = r'-|=|#|\$'

Atom = 'S|P|O|N|I|F|Cl|C|Br|B'

Aromatic = 'o|c|n|p|s|b'

AtomExtend = r'(?:\[)(?P<isotope>[\d]{1,3})?(?P<symbol>o|c|n|p|s|b|Zr|Zn|Yb|Y|Xe|W|V|U|Ts|Tm|Tl|Ti|Th|Te|Tc|Tb|Ta|Sr|Sn|Sm|Si|Sg|Se|Sc|Sb|S|Ru|Rn|Rh|Rg|Rf|Re|Rb|Ra|Pu|Pt|Pr|Po|Pm|Pd|Pb|Pa|P|Os|Og|O|Np|No|Ni|Nh|Ne|Nd|Nb|Na|N|Mt|Mo|Mn|Mg|Md|Mc|Lv|Lu|Lr|Li|La|Kr|K|Ir|In|I|Hs|Ho|Hg|Hf|He|H|Ge|Gd|Ga|Fr|Fm|Fl|Fe|F|Eu|Es|Er|Dy|Ds|Db|Cu|Cs|Cr|Co|Cn|Cm|Cl|Cf|Ce|Cd|Ca|C|Br|Bk|Bi|Bh|Be|Ba|B|Au|At|As|Ar|Am|Al|Ag|Ac)(?P<stereo>@{1,2})?(?P<hydrogens>H[\d]?)?(?P<charge>[-|\+]{1,3}[\d]?)?(?P<class_>:\d{1,3})?(?:\])'

BranchStart = r'\('

BranchEnd = r'\)'

Ring = r'[\d]{1}'

Ring2 = r'%[\d]{2}'

BondEZ = r'/|\\'

Disconnected = r"\."

BondDescriptorLadder = r"\[[$<>][\d]\[[$<>][\d]?\][\d]?\]"

BondDescriptor = r"\[[$<>][\d]?[\d]?\]"

StochasticSeperator = r",|;"

StochasticStart = r'\{'

StochasticEnd = r'\}'

ImplictEndGroup = r'\[\]'

Rxn = r'>>|>'
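
A minimal sketch of how patterns like these can be combined into a single master regex with named groups; this is illustrative only and not the package's actual implementation (only a subset of the patterns is shown, and AtomExtend is omitted because it contains named groups of its own).

import re

# Illustrative only: combine a subset of the patterns above into one master
# regex. Each alternative is wrapped in a named group so match.lastgroup
# reports which pattern matched. Order matters: more specific patterns
# (e.g. bonding descriptors) must come before single-character ones.
PATTERNS = {
    "BondDescriptor": r"\[[$<>][\d]?[\d]?\]",
    "Atom": r"S|P|O|N|I|F|Cl|C|Br|B",
    "Bond": r"-|=|#|\$",
    "BranchStart": r"\(",
    "BranchEnd": r"\)",
    "Ring": r"[\d]{1}",
    "StochasticStart": r"\{",
    "StochasticEnd": r"\}",
}
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS.items()))

for match in MASTER.finditer("CC{[>][<]CC(C)[>][<]}CC"):
    print(match.lastgroup, match.group())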

Atom Tokenizer

bigsmiles.constructors.tokenizer.tokenize_atom_symbol(symbol)

Converts an atom symbol into a dictionary with the following keys:

  • symbol
  • isotope (default: None)
  • stereo (default: None)
  • hydrogens (default: None)
  • charge (default: 0)
  • class_ (default: None)
PARAMETER DESCRIPTION
symbol

atom string

TYPE: str

RETURNS DESCRIPTION
result

{"symbol": str, "isotope": int, "stereo": str, "hydrogens": int, "charge": int, "class_": int}

TYPE: dict[str, Any]

Examples:

>>> tokenize_atom_symbol("[13C@H+:1]")
{"symbol": "C", "isotope": 13, "stereo": "@", "hydrogens": 1, "charge": 1, "class_": 1}

Bonding Descriptor Tokenizer

bigsmiles.constructors.tokenizer.tokenize_bonding_descriptor(symbol)

Converts a bonding descriptor symbol into the following values:

  • symbol
  • index
PARAMETER DESCRIPTION
symbol

bonding_descriptor string

TYPE: str

RETURNS DESCRIPTION
result

(symbol, index)

TYPE: tuple[str, int]

Examples:

>>> tokenize_atom_symbol("[$1]")
["$", 1]
Notes
  • default bonding descriptor index = 1
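
A hypothetical sketch of this conversion, assuming the BondDescriptor pattern above: a symbol ($, <, or >) plus an optional index that defaults to 1 when omitted. The regex and the helper name are illustrative, not the package's API.

import re

# Hypothetical sketch: pull the symbol and optional index out of a bonding
# descriptor, defaulting the index to 1 as documented.
BOND_DESCRIPTOR = re.compile(r"\[(?P<symbol>[$<>])(?P<index>\d{1,2})?\]")

def parse_bonding_descriptor(text: str) -> tuple[str, int]:
    match = BOND_DESCRIPTOR.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid bonding descriptor: {text!r}")
    index = int(match.group("index")) if match.group("index") else 1  # default index
    return match.group("symbol"), index

print(parse_bonding_descriptor("[$1]"))  # ('$', 1)
print(parse_bonding_descriptor("[>]"))   # ('>', 1)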

BigSMILES Tokenizer

The tokenizer leverages Python's regular expressions to identify and label valid BigSMILES tokens.

bigsmiles.constructors.tokenizer.tokenize_text(text)

Tokenizes a BigSMILES string into a list of strings.

PARAMETER DESCRIPTION
text

BigSMILES string

TYPE: str

RETURNS DESCRIPTION
result

A list of strings, one for each token

TYPE: list[str]

RAISES DESCRIPTION
TokenizeError

invalid symbol detected

Examples:

>>> tokenize_text("CC{[>][<]CC(C)[>][<]}CC(C)=C")
['C', 'C', '{', '[>]', '[<]', 'C', 'C', '(', 'C', ')', '[>]', '[<]', '}', 'C', 'C', '(', 'C', ')', '=', 'C']
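
A self-contained sketch (not the package's actual code) of this behavior: walk the string with a combined pattern and report any character no pattern claims, mirroring the documented invalid-symbol error. The pattern is simplified (ring-closure digits and % labels are taken one character at a time), and ValueError stands in for the package's TokenizeError.

import re

# Simplified token pattern: bracketed items first, then two-letter atoms,
# then everything else one character at a time.
TOKEN = re.compile(
    r"\[[^\]]*\]"                            # bracketed atoms / bonding descriptors
    r"|Cl|Br"                                # two-letter organic-subset atoms
    r"|[A-Za-z0-9()=#$%@+\-/\\.,;:>{}]"      # single-character tokens
)

def tokenize_text_sketch(text: str) -> list[str]:
    tokens, position = [], 0
    for match in TOKEN.finditer(text):
        if match.start() != position:    # a gap means a character no pattern claims
            raise ValueError(f"invalid symbol at position {position}: {text[position]!r}")
        tokens.append(match.group())
        position = match.end()
    if position != len(text):
        raise ValueError(f"invalid symbol at position {position}: {text[position]!r}")
    return tokens

print(tokenize_text_sketch("CC{[>][<]CC(C)[>][<]}CC(C)=C"))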

bigsmiles.constructors.tokenizer.tokenize(text)

Tokenizes a BigSMILES string into a list of Token objects.

PARAMETER DESCRIPTION
text

BigSMILES string

TYPE: str

RETURNS DESCRIPTION
result

BigSMILES as a token list

TYPE: list[Token]

RAISES DESCRIPTION
TokenizeError

invalid symbol detected

Examples:

>>> tokenize("C(C)C")
[Token(TokenKind.Atom, "C"), Token(TokenKind.BranchStart, "("), Token(TokenKind.Atom, "C"),
Token(TokenKind.BranchEnd, ")"), Token(TokenKind.Atom, "C")]
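
A short usage sketch: assuming Token exposes its kind and value as attributes (as the constructor signature below suggests), the token list can be filtered by TokenKind. The import path follows the module path shown above.

from bigsmiles.constructors.tokenizer import tokenize, TokenKind

# Filter the token stream by kind, e.g. to pull out the plain organic-subset atoms.
# Attribute access (.kind, .value) is assumed from the Token constructor signature.
tokens = tokenize("CC{[>][<]CC(C)[>][<]}CC(C)=C")
atoms = [token.value for token in tokens if token.kind is TokenKind.Atom]
print(atoms)  # expected: one 'C' per carbon in the string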

bigsmiles.constructors.tokenizer.Token(kind, value)

A token: the unit a BigSMILES string is broken up into.

PARAMETER DESCRIPTION
kind

token kind

TYPE: TokenKind

value

token text

TYPE: str

Examples:

>>> Token(TokenKind.AtomExtend, "[13CH2]")

bigsmiles.constructors.tokenizer.TokenKind

Enum

The kinds of tokens that can be extracted from a BigSMILES string.

ATTRIBUTE DESCRIPTION
Bond

Atom

Aromatic

AtomExtend

BranchStart

BranchEnd

Ring

Ring2

BondEZ

Disconnected

Rxn

BondDescriptor

StochasticSeperator

StochasticStart

StochasticEnd

ImplictEndGroup

BondDescriptorLadder
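
As a usage sketch, the enum makes it easy to dispatch on token kinds downstream, for example a quick balance check for branches and stochastic objects. The check_balance helper is illustrative, not part of the package; Token attribute access (.kind) is assumed from the constructor signature above.

from bigsmiles.constructors.tokenizer import tokenize, TokenKind

# Illustrative helper (not part of the package): walk the token stream and
# check that branch and stochastic-object delimiters are balanced.
def check_balance(text: str) -> bool:
    branch_depth = stochastic_depth = 0
    for token in tokenize(text):
        if token.kind is TokenKind.BranchStart:
            branch_depth += 1
        elif token.kind is TokenKind.BranchEnd:
            branch_depth -= 1
        elif token.kind is TokenKind.StochasticStart:
            stochastic_depth += 1
        elif token.kind is TokenKind.StochasticEnd:
            stochastic_depth -= 1
        if branch_depth < 0 or stochastic_depth < 0:
            return False
    return branch_depth == 0 and stochastic_depth == 0

print(check_balance("CC{[>][<]CC(C)[>][<]}CC(C)=C"))  # True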