Re: [Rdkit-discuss] remove redundant bits from bitvector fingerprints
From: Greg L. <gre...@gm...> - 2014-06-13 12:23:26
Hmm, this got lost in my mailbox. Sorry.
You can do what I think you want to do using the information theory
machinery that the RDKit has available. Here's a short snippet that finds
the bits that are not redundant in a data set (redundancy is measured here
using information entropy):
In [48]: ms = [Chem.MolFromSmiles(x.split()[1]) for x in open('./Target_no_107_58879.txt')]
In [49]: nbits = 2048
In [50]: fps = [rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(x,nbits) for x in ms]
In [58]: entropies = []
In [59]: for i in range(nbits):
   ....:     arr = numpy.array([x[i] for x in fps])
   ....:     e = InfoTheory.InfoEntropy(arr)
   ....:     entropies.append(e)
   ....:
In [60]: entropies = numpy.array(entropies)
In [61]: goodbits = numpy.array(range(nbits))[entropies>0.0]
In [62]: len(goodbits)
Out[62]: 891
Your case is pretty big, so this may take a while, but it shouldn't be too
slow.
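
For illustration, here is a minimal NumPy-only sketch of the same idea applied to the four example bit strings quoted below. It is not the RDKit code above: the helper name bit_entropies is hypothetical, and it assumes the entropy of each bit position is computed from the fraction of fingerprints in which that bit is set, so a position that is always 0 or always 1 gets an entropy of exactly 0.

```python
import numpy as np


def bit_entropies(bits):
    """Shannon entropy (in bits) of each column of a 0/1 matrix.

    Hypothetical helper for illustration; rows are fingerprints,
    columns are bit positions.
    """
    bits = np.asarray(bits, dtype=float)
    p = bits.mean(axis=0)            # fraction of rows with the bit set
    ent = np.zeros_like(p)           # constant columns keep entropy 0
    mixed = (p > 0) & (p < 1)        # columns containing both 0s and 1s
    q = p[mixed]
    ent[mixed] = -(q * np.log2(q) + (1 - q) * np.log2(1 - q))
    return ent


# The example bit strings from the original question.
strings = ["010100001", "010110100", "010101110", "010100010"]
matrix = np.array([[int(c) for c in s] for s in strings])

entropies = bit_entropies(matrix)
goodbits = np.flatnonzero(entropies > 0.0)   # indices of informative bits

# Keep only the informative columns and turn each row back into a string.
reduced = ["".join(str(b) for b in row) for row in matrix[:, goodbits]]

print(goodbits.tolist())
print(reduced)
```

The goodbits index array can be reused to reduce any further fingerprint generated with the same settings, so the reduced vectors stay comparable with each other.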
-greg
On Wed, Jun 4, 2014 at 5:08 AM, Stephen O'hagan <SO...@ma...>
wrote:
> Hi,
>
>
>
> I have a set of say 1000 generated fingerprints each of length 39972;
> across all 1000 fingerprints many bits are the same – they contain no
> information about the differences between the 1000 molecules.
>
>
>
> e.g. for list
>
> 010100001
> 010110100
> 010101110
> 010100010
>
> The first four bits are redundant, I could just record them as:
>
> 00001
> 10100
> 01110
> 00010
>
> In reality, the redundant bits are distributed through the bit string, so
> I need a method to determine which bits are redundant, and then remove them
> from each fingerprint.
>
>
>
> Cheers,
>
> Steve.
>
> *From:* Greg Landrum [mailto:gre...@gm...]
> *Sent:* 04 June 2014 04:40
> *To:* Stephen O'hagan
> *Cc:* rdk...@li...
> *Subject:* Re: [Rdkit-discuss] remove redundant bits from bitvector
> fingerprints
>
>
>
> Hi Steve,
>
>
>
> On Tue, Jun 3, 2014 at 2:08 PM, Stephen O'hagan <SO...@ma...>
> wrote:
>
> I have a fragment of code generating fingerprints for a long list of
> molecules (length ~ 1000)
>
>
>
> for index in range(0,len(smi)):
>     smiles=smi[index]
>     mol=Chem.MolFromSmiles(smiles)
>     AllChem.EmbedMolecule(mol)
>     AllChem.UFFOptimizeMolecule(mol)
>     dm = Chem.Get3DDistanceMatrix(mol)
>     fp = Generate.Gen2DFingerprint(mol,factory, dMat=dm)
>     fp = fp.ToBitString()
>     bs[index]=fp
>
>
>
> The length of each bitvectors generated is 39972, and the list has a lot
> of redundant ‘1’s and ‘0’s.
>
>
>
> Is there an easy method to filter out these redundant bits?
>
>
>
> What do you mean by redundant bits?
>
>
>
> The length of the bit vectors is determined by the parameters you provide
> for building the pharmacophore fingerprints (number of points, number of
> features, and number of distance bins). The length of the strings that you
> get from fp.ToBitString() should be equal to this number of bits.
>
>
>
> -greg
>
>
>