Re: [Rdkit-discuss] remove redundant bits from bitvector fingerprints


Hmm, this got lost in my mailbox. Sorry.

You can do what I think you want to do using the information theory
machinery that the RDKit has available. Here's a short snippet that finds
the bits that are not redundant in a data set (a bit is treated as redundant
here if its information entropy across the data set is zero):

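# Imports assumed by the session below (not shown in the original message)
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.ML import InfoTheory
import numpy

In [48]: ms = [Chem.MolFromSmiles(x.split()[1]) for x in open('./Target_no_107_58879.txt')]

In [49]: nbits = 2048

In [50]: fps = [rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(x,nbits) for x in ms]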

In [58]: entropies = []

In [59]: for i in range(nbits):
   ....:     arr = numpy.array([x[i] for x in fps])
   ....:     e = InfoTheory.InfoEntropy(arr)
   ....:     entropies.append(e)
   ....:

In [60]: entropies = numpy.array(entropies)

In [61]: goodbits = numpy.array(range(nbits))[entropies>0.0]

In [62]: len(goodbits)
Out[62]: 891
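
For readers who want to see what "entropy of a bit" means without the RDKit machinery, here is a minimal numpy-only sketch. It is not the RDKit's InfoEntropy implementation; it assumes the fingerprints are already available as a 0/1 matrix (one row per molecule), and the names bit_entropies and demo are made up for illustration. A column that is all 0 or all 1 gets entropy 0 and is therefore dropped.

```python
import numpy as np

def bit_entropies(bits):
    """Shannon entropy (in bits) of each column of a 0/1 matrix."""
    bits = np.asarray(bits, dtype=float)
    p = bits.mean(axis=0)            # fraction of rows with each bit set
    h = np.zeros_like(p)
    mask = (p > 0) & (p < 1)         # constant columns keep entropy 0
    q = p[mask]
    h[mask] = -(q * np.log2(q) + (1 - q) * np.log2(1 - q))
    return h

# Hypothetical toy data: 4 fingerprints, 4 bits each.
demo = np.array([[0, 1, 0, 1],
                 [0, 1, 1, 1],
                 [0, 1, 0, 0],
                 [0, 1, 1, 0]])
ent = bit_entropies(demo)
goodbits = np.flatnonzero(ent > 0.0)  # indices of informative bits
print(goodbits)
```

In the toy data the first two columns are constant, so only the last two bit positions survive the filter.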

Your case is pretty big, so this may take a while, but it shouldn't be too
slow.
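
Once the informative bit positions are known, the remaining step from the original question is to strip the other positions out of every fingerprint. The following is a rough sketch of that step, assuming the fingerprints are held as bit strings (as produced by fp.ToBitString()); the helper name drop_constant_bits is hypothetical, and it uses a simple "does this column vary at all" test rather than an entropy calculation.

```python
import numpy as np

def drop_constant_bits(bitstrings):
    """Return (kept_indices, reduced_bitstrings) for equal-length bit strings."""
    mat = np.array([[int(c) for c in s] for s in bitstrings], dtype=np.uint8)
    # A column is informative if any row differs from the first row there.
    varies = (mat != mat[0]).any(axis=0)
    kept = np.flatnonzero(varies)
    reduced = ["".join(str(b) for b in row.tolist()) for row in mat[:, kept]]
    return kept, reduced

# The four example fingerprints from the question quoted below.
fps = ["010100001", "010110100", "010101110", "010100010"]
kept, reduced = drop_constant_bits(fps)
print(kept.tolist())
print(reduced)
```

Keeping the kept index array around is useful: it lets you map a position in a reduced fingerprint back to the original bit, and apply the same reduction to new molecules later.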

-greg

On Wed, Jun 4, 2014 at 5:08 AM, Stephen O'hagan <SO...@ma...>
wrote:

>  Hi,
>
>
>
> I have a set of say 1000 generated fingerprints each of length 39972;
> across all 1000 fingerprints many bits are the same – they contain no
> information about the differences between the 1000 molecules.
>
>
>
> e.g. for the list
>
> 010100001
> 010110100
> 010101110
> 010100010
>
> the first four bits are redundant; I could just record the fingerprints as:
>
> 00001
> 10100
> 01110
> 00010
>
>
>
> In reality, the redundant bits are distributed through the bit string, so
> I need a method to determine which bits are redundant, and then remove them
> from each fingerprint.
>
>
>
> Cheers,
>
> Steve.
>
>
>
>
>
>
>
> *From:* Greg Landrum [mailto:gre...@gm...]
> *Sent:* 04 June 2014 04:40
> *To:* Stephen O'hagan
> *Cc:* rdk...@li...
> *Subject:* Re: [Rdkit-discuss] remove redundant bits from bitvector
> fingerprints
>
>
>
> Hi Steve,
>
>
>
> On Tue, Jun 3, 2014 at 2:08 PM, Stephen O'hagan <SO...@ma...>
> wrote:
>
> I have a fragment of code generating fingerprints for a long list of
> molecules (length ~ 1000)
>
>
>
> for index in range(0,len(smi)):
>     smiles=smi[index]
>     mol=Chem.MolFromSmiles(smiles)
>     AllChem.EmbedMolecule(mol)
>     AllChem.UFFOptimizeMolecule(mol)
>     dm = Chem.Get3DDistanceMatrix(mol)
>     fp = Generate.Gen2DFingerprint(mol,factory, dMat=dm)
>     fp = fp.ToBitString()
>     bs[index]=fp
>
>
>
> The length of each generated bit vector is 39972, and the list has a lot
> of redundant ‘1’s and ‘0’s.
>
>
>
> Is there an easy method to filter out these redundant bits?
>
>
>
> What do you mean by redundant bits?
>
>
>
> The length of the bit vectors is determined by the parameters you provide
> for building the pharmacophore fingerprints (number of points, number of
> features, and number of distance bins). The length of the strings that you
> get from fp.ToBitString() should be equal to this number of bits.
>
>
>
> -greg
>
>
>
