Decoding Fingerprints to IUPAC/Natural Chemical Names

Sulstice
3 min read · Jul 16, 2022

Demo

Skip right to the code and documentation:

Code: https://colab.research.google.com/drive/1z0ilrakoRJ8maapMNHwtPf83pKK1THST?usp=sharing

Technical Documentation: https://sulstice.gitbook.io/globalchem-your-chemical-graph-network/cheminformatics/decoding-fingeringprints-and-smiles-to-iupac

Philosophy

I classify my data by a functional group or name that makes sense to me. Most of the time my philosophy is: if I can’t speak it, it’s not natural to me. That said, when I explore the chemical universe I often need a way to remember the functional groups I am going after, and if I want to navigate that chemical space I need to be armed with reference patterns. Should we look at food? Narcotics? War? Poison? Or medicine? You decide. It’s not my choice; I can only send a message. The idea is simple. Convert this:

00000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

to this

['benzene', 'ammonia']

Hyperparameters Encoding SMILES

I’ve talked about Morgan fingerprints before, and their implementation in RDKit has been discussed a lot. We need to pick standards for fingerprinting, and the best I have found in my experience as a cheminformatician is a 512-bit length with a radius of 2, which I have found big enough to capture the chemical environment. If we create a reference standard of fingerprints from common chemical lists, we can decode the information readily and explore chemical space with purpose.

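from rdkit import Chem
from rdkit.Chem import AllChem

molecule = Chem.MolFromSmiles('C1=CC=CC=C1')  # benzene, as in the demo
bit_string = AllChem.GetMorganFingerprintAsBitVect(
    molecule,
    2,          # radius
    nBits=512   # fingerprint length
).ToBitString()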
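The string returned by ToBitString() is plain text, so it can be inspected without RDKit. Here is a minimal sketch, assuming the fingerprint is a string of '0' and '1' characters. `on_bits` is a hypothetical helper name, not part of RDKit or GlobalChem, and the toy string below stands in for a real 512-bit fingerprint.

```python
def on_bits(bit_string):
    """Return the positions of the '1' characters in a fingerprint bit string."""
    if set(bit_string) - {"0", "1"}:
        raise ValueError("bit string may only contain '0' and '1'")
    return [index for index, bit in enumerate(bit_string) if bit == "1"]


# A toy 16-bit string standing in for a 512-bit Morgan fingerprint.
toy = "0000100000010001"
print(len(toy), on_bits(toy))  # prints: 16 [4, 11, 15]
```

Counting the on bits this way shows how sparse a small molecule's fingerprint usually is, like the string at the top of this post.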

This is the meat of the code, and you can reference the RDKit documentation for more. So that’s what I did: I created nearly 3,000 bit vectors and organized them into GlobalChem.

From here, classifying a fingerprint was pretty simple to implement with a common similarity function, Tanimoto similarity. I use a similarity checker, and any bit vector in the reference set that scores above 90% similarity is marked as a functional group that exists within that space.

You can select a node and then compute a similarity by rebuilding each bit vector from the bit strings stored in the reference standard.

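from rdkit import DataStructs

# fingerprint: the query bit string; bit_strings: reference bit strings keyed by SMILES
fingerprint = DataStructs.cDataStructs.CreateFromBitString(fingerprint)

for smiles, reference_fingerprint in bit_strings.items():
    reference_fingerprint = DataStructs.cDataStructs.CreateFromBitString(reference_fingerprint)
    # FingerprintSimilarity uses the Tanimoto metric by default
    score = DataStructs.FingerprintSimilarity(fingerprint, reference_fingerprint)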
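To make the scoring and the 90% cutoff concrete, here is a minimal pure-Python sketch of the same idea. It is not the GlobalChem implementation: `tanimoto` and `classify` are hypothetical names, the reference values are made-up 8-bit strings, and returning 0.0 for two all-zero fingerprints is a choice made for this sketch. Tanimoto similarity is the number of bits set in both fingerprints divided by the number of bits set in either one.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two equal-length '0'/'1' strings: |A and B| / |A or B|."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    a, b = int(bits_a, 2), int(bits_b, 2)
    union = bin(a | b).count("1")
    # Two all-zero fingerprints share nothing; report 0.0 rather than divide by zero.
    return bin(a & b).count("1") / union if union else 0.0


def classify(bit_string, references, threshold=0.9):
    """Return the reference names whose similarity to bit_string meets the threshold."""
    return [
        name
        for name, reference_bits in references.items()
        if tanimoto(bit_string, reference_bits) >= threshold
    ]


# Toy 8-bit reference set (hypothetical values, not real GlobalChem fingerprints).
references = {"benzene": "10010000", "ammonia": "00000110"}
print(classify("10010000", references))  # prints: ['benzene']
```

With a threshold this high, only near-identical fingerprints match, which is why a small reference molecule like benzene decodes cleanly while a larger molecule may return nothing.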

So let’s look at the example in the demo:

Install the Package:

!pip install -q global-chem[cheminformatics] --upgrade

Load the cheminformatics and decoder engine package:

from global_chem import GlobalChem
from global_chem_extensions import GlobalChemExtensions
gc = GlobalChem()
cheminformatics = GlobalChemExtensions().cheminformatics()
decoder_engine = cheminformatics.get_decoder_engine()

Load the benzene molecule written in SMILES into a Morgan fingerprint:

morgan_fingerprint = decoder_engine.generate_morgan_fingerprint('C1=CC=CC=C1')
print(decoder_engine.classify_fingerprint(
    morgan_fingerprint,
    node='organic_and_inorganic_bronsted_acids'
))

And then, using a node that’s not a bad reference standard for functional groups, we can retrieve the IUPAC name:

['benzene']

So an annotated reference helps us decode fingerprints and gives chemical information safe passage, as our predecessors did before us.

Decoding Bigger SMILES

Now let’s take a more complex example, where a SMILES string does not have a direct reference standard in GlobalChem. We can use the BRICS module implemented in RDKit to fragment the molecule, and perhaps the smaller fragments will have a reference in the fingerprint set. Then we can join all the functional groups together to get an idea of the chemical space inside a fingerprint, written in English. Let’s look at fentanyl:

print(decoder_engine.classify_smiles_using_bits(
    'CCC(=O)N(C1CCN(CC1)CCC2=CC=CC=C2)C3=CC=CC=C3',
    node='organic_and_inorganic_bronsted_acids'
))

Here we have something more complex, and a small set of functional groups such as the Brønsted acids may not be enough to match the whole compound. So we fragment the molecule:

fragments = list(BRICS.BRICSDecompose(Chem.MolFromSmiles(smiles)))

Then we convert each fragment to a fingerprint and loop through the references for similarity, and there it is.
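The fragment, fingerprint, classify, and join steps can be sketched as one small function. This is an illustration under stated assumptions, not the code behind classify_smiles_using_bits: `classify_by_fragments` and its three callable parameters are hypothetical, and the toy stand-ins below replace RDKit so the sketch runs on its own.

```python
def classify_by_fragments(smiles, fragment, fingerprint, classify):
    """Classify every fragment of a molecule and merge the names, keeping first-seen order."""
    names = []
    for fragment_smiles in fragment(smiles):
        for name in classify(fingerprint(fragment_smiles)):
            if name not in names:  # skip names already contributed by another fragment
                names.append(name)
    return names


# Stand-ins so the sketch runs without RDKit. In practice `fragment` could wrap
# BRICS.BRICSDecompose and `fingerprint` could wrap the Morgan call shown earlier.
toy_fragments = {"AB": ["A", "B", "A"]}
toy_names = {"A": ["benzene"], "B": ["ammonia", "benzene"]}

print(classify_by_fragments(
    "AB",
    fragment=lambda s: toy_fragments[s],
    fingerprint=lambda f: f,  # identity: the toy "fingerprint" is the fragment itself
    classify=lambda fp: toy_names[fp],
))  # prints: ['benzene', 'ammonia']
```

Passing the three steps in as callables keeps the merge logic separate from the chemistry, so the fragmenter or the reference node can be swapped without touching the loop.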

Hopefully this makes it easy for everyone to understand their fingerprints, and for my AI models to learn efficiently.
