Tokenizer
Regular Expression Patterns
Tokenization is mainly done through regular expression patterns.
Bond = r'-|=|#|\$'
Atom = 'S|P|O|N|I|F|Cl|C|Br|B'
Aromatic = 'o|c|n|p|s|b'
AtomExtend = r'(?:\[)(?P
BranchStart = r'\('
BranchEnd = r'\)'
Ring = r'[\d]{1}'
Ring2 = r'%[\d]{2}'
BondEZ = r'/|\\'
Disconnected = r"\."
BondDescriptorLadder = r"\[[$<>][\d]\[[$<>][\d]?\][\d]?\]"
BondDescriptor = r"\[[$<>][\d]?[\d]?\]"
StochasticSeperator = r",|;"
StochasticStart = r'\{'
StochasticEnd = r'\}'
ImplictEndGroup = r'\[\]'
Rxn = r'>>|>'
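Python's regex alternation takes the first alternative that matches, so the order inside a pattern such as Atom matters: `Cl` has to be tried before `C`, and `Br` before `B`. A minimal sketch using only the standard `re` module; the reversed pattern is a hypothetical counter-example and not part of the library:

```python
import re

# Atom pattern as listed above: two-letter symbols precede their one-letter prefixes.
ATOM = re.compile(r'S|P|O|N|I|F|Cl|C|Br|B')

# Hypothetical counter-example: one-letter symbols are tried first.
ATOM_WRONG_ORDER = re.compile(r'C|Cl|B|Br')

good = ATOM.findall("CCl")              # 'C', then 'Cl'
bad = ATOM_WRONG_ORDER.findall("CCl")   # 'C', 'C'; the trailing 'l' is never matched

print(good)
print(bad)
```

With the wrong order, chlorine is silently read as carbon followed by an unmatched character.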
Atom Tokenizer
bigsmiles.constructors.tokenizer.tokenize_atom_symbol(symbol)
Converts an atom symbol into a dictionary with the following keys:
- symbol
- isotope (default: None)
- stereo (default: None)
- hydrogens (default: None)
- charge (default: 0)
- class_ (default: None)
PARAMETER | DESCRIPTION |
---|---|
symbol | atom string |

RETURNS | DESCRIPTION |
---|---|
result | {"symbol": str, "isotope": int, "stereo": str, "hydrogens": int, "charge": int, "class_": int} |
Examples:
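The sketch below illustrates the kind of dictionary described above. It does not call the library: `parse_atom_symbol` and `BRACKET_ATOM` are hypothetical names, and the pattern is an assumption modeled on the usual SMILES bracket-atom layout (isotope, symbol, stereo, hydrogens, charge, class), not the library's AtomExtend regex.

```python
import re

# Hypothetical pattern following the SMILES bracket-atom layout:
# [isotope symbol stereo hydrogens charge :class]. NOT the library's AtomExtend regex.
BRACKET_ATOM = re.compile(
    r'\[(?P<isotope>\d+)?'
    r'(?P<symbol>[A-Z][a-z]?|[a-z])'
    r'(?P<stereo>@{1,2})?'
    r'(?P<hydrogens>H\d?)?'
    r'(?P<charge>[+-]\d?)?'
    r'(?::(?P<class_>\d+))?\]'
)


def parse_atom_symbol(symbol: str) -> dict:
    """Split an atom string into the six fields listed above."""
    match = BRACKET_ATOM.fullmatch(symbol)
    if match is None:
        # Bare atoms such as 'C' or 'Cl' carry only a symbol; defaults apply.
        return {"symbol": symbol, "isotope": None, "stereo": None,
                "hydrogens": None, "charge": 0, "class_": None}
    groups = match.groupdict()
    hydrogens = groups["hydrogens"]
    charge = groups["charge"]
    return {
        "symbol": groups["symbol"],
        "isotope": int(groups["isotope"]) if groups["isotope"] else None,
        "stereo": groups["stereo"],
        # 'H' alone means one hydrogen; 'H3' means three.
        "hydrogens": int(hydrogens[1:] or 1) if hydrogens else None,
        # '+' alone means +1; '-2' means -2.
        "charge": int(charge + "1" if len(charge) == 1 else charge) if charge else 0,
        "class_": int(groups["class_"]) if groups["class_"] else None,
    }


result = parse_atom_symbol("[13C@H+:1]")
print(result)
```

The real function may represent stereo or hydrogens differently; only the key names and defaults are taken from the description above.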
Bonding Descriptor Tokenizer
bigsmiles.constructors.tokenizer.tokenize_bonding_descriptor(symbol)
Converts a bonding descriptor symbol into the following values:
- symbol
- index
PARAMETER | DESCRIPTION |
---|---|
symbol | bonding_descriptor string |

RETURNS | DESCRIPTION |
---|---|
result | [symbol, index] |
Examples:
Notes
- The default bonding descriptor index is 1.
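A minimal sketch of the symbol/index split with the default index, written with the standard `re` module only. `parse_bonding_descriptor` and `BOND_DESCRIPTOR` are hypothetical names, and the pattern is an assumption modeled on the BondDescriptor pattern rather than the library's own code:

```python
import re

# Modeled on the BondDescriptor pattern: one of $, <, > with an optional
# numeric index, wrapped in square brackets. Hypothetical, not the library's regex.
BOND_DESCRIPTOR = re.compile(r'\[(?P<symbol>[$<>])(?P<index>\d+)?\]')


def parse_bonding_descriptor(text: str) -> list:
    """Return [symbol, index]; a missing index falls back to 1."""
    match = BOND_DESCRIPTOR.fullmatch(text)
    if match is None:
        raise ValueError(f"not a bonding descriptor: {text!r}")
    index_text = match.group("index")
    index = int(index_text) if index_text else 1
    return [match.group("symbol"), index]


print(parse_bonding_descriptor("[$]"))
print(parse_bonding_descriptor("[<2]"))
```

Under these assumptions `[$]` and `[$1]` produce the same result, which is what the default index implies.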
BigSMILES Tokenizer
The tokenizer uses Python's regular expressions to identify and label valid BigSMILES tokens.
bigsmiles.constructors.tokenizer.tokenize_text(text)
Tokenizes a BigSMILES string into a list of strings.
PARAMETER | DESCRIPTION |
---|---|
text | BigSMILES string |

RETURNS | DESCRIPTION |
---|---|
result | A list of strings, one for each token |

RAISES | DESCRIPTION |
---|---|
TokenizeError | invalid symbol detected |
Examples:
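The following sketch shows the general idea of splitting a string into token strings and failing on an unrecognized character. It is an illustration with the standard `re` module, not the library's implementation: `split_tokens` and `TOKEN_PATTERN` are hypothetical, the pattern covers only a subset of the documented token kinds, and `TokenizeError` is redefined locally as a stand-in.

```python
import re


class TokenizeError(Exception):
    """Local stand-in for the library's exception of the same name."""


# Subset of the documented patterns, joined into one alternation.
# Longer alternatives (bracket groups, 'Cl', 'Br', '>>') come before their prefixes.
TOKEN_PATTERN = re.compile(
    r'\[[^\[\]]*\]'                  # bracket atoms, bonding descriptors, []
    r'|Cl|Br|[SPONIFCB]|[ocnpsb]'    # organic-subset and aromatic atoms
    r'|%\d{2}|\d'                    # ring closures
    r'|>>|[-=#$/\\().,;{}>]'         # bonds, branches, separators, reactions
)


def split_tokens(text: str) -> list:
    """Split text into token strings; raise on the first unrecognized character."""
    tokens = []
    position = 0
    while position < len(text):
        match = TOKEN_PATTERN.match(text, position)
        if match is None:
            raise TokenizeError(f"invalid symbol {text[position]!r} at index {position}")
        tokens.append(match.group())
        position = match.end()
    return tokens


tokens = split_tokens("CC{[>][<]CC(Cl)[>][<]}O")
print(tokens)
```

Matching from an explicit position, rather than using `findall`, is what makes an invalid character detectable instead of silently skipped.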
bigsmiles.constructors.tokenizer.tokenize(text)
Tokenizes a BigSMILES string into a list of Token objects.
PARAMETER | DESCRIPTION |
---|---|
text | BigSMILES string |

RETURNS | DESCRIPTION |
---|---|
result | BigSMILES as a token list |

RAISES | DESCRIPTION |
---|---|
TokenizeError | invalid symbol detected |
Examples:
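One common way to label each token with its kind is to give every pattern a named group and read `Match.lastgroup`, as in the tokenizer example in Python's `re` documentation. The sketch below assumes that approach; whether the library works this way is not confirmed here. `Kind`, `Tok`, and `to_tokens` are hypothetical, shortened stand-ins for TokenKind, Token, and tokenize.

```python
import enum
import re
from dataclasses import dataclass


class Kind(enum.Enum):
    """Hypothetical, shortened stand-in for TokenKind."""
    Bond = 0
    Atom = 1
    BranchStart = 2
    BranchEnd = 3
    Ring = 4


@dataclass(frozen=True)
class Tok:
    """Hypothetical stand-in for Token: a kind plus the matched text."""
    kind: Kind
    value: str


# One named group per kind; the group name is the Enum member name.
PATTERNS = {
    Kind.Bond: r'-|=|#|\$',
    Kind.Atom: r'S|P|O|N|I|F|Cl|C|Br|B',
    Kind.BranchStart: r'\(',
    Kind.BranchEnd: r'\)',
    Kind.Ring: r'\d',
}
MASTER = re.compile(
    '|'.join(f'(?P<{kind.name}>{pattern})' for kind, pattern in PATTERNS.items())
)


def to_tokens(text: str) -> list:
    """Label each match with the kind whose named group matched."""
    tokens = []
    position = 0
    while position < len(text):
        match = MASTER.match(text, position)
        if match is None:
            raise ValueError(f"invalid symbol {text[position]!r} at index {position}")
        tokens.append(Tok(Kind[match.lastgroup], match.group()))
        position = match.end()
    return tokens


tokens = to_tokens("C=C(Cl)C1")
for token in tokens:
    print(token.kind.name, token.value)
```

Keeping the group names identical to the Enum member names lets `Kind[match.lastgroup]` recover the kind without a separate lookup table.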
bigsmiles.constructors.tokenizer.Token(kind, value)
A token: one of the units a BigSMILES string is broken up into.
PARAMETER | DESCRIPTION |
---|---|
kind | token kind |
value | token text |
Examples:
bigsmiles.constructors.tokenizer.TokenKind
Enum of the kinds of tokens that can be extracted from a BigSMILES string.
ATTRIBUTE | DESCRIPTION |
---|---|
Bond | |
Atom | |
Aromatic | |
AtomExtend | |
BranchStart | |
BranchEnd | |
Ring | |
Ring2 | |
BondEZ | |
Disconnected | |
Rxn | |
BondDescriptor | |
StochasticSeperator | |
StochasticStart | |
StochasticEnd | |
ImplictEndGroup | |
BondDescriptorLadder | |