Package CHEM :: Package Kernel :: Module SpectrumKernel :: Class SpectrumKernel

Class SpectrumKernel



BaseKernel.BaseKernel --+
                        |
                       SpectrumKernel

Simple kernel to calculate the similarity between pairs of strings. Similarity is based on the number of "k-mers" (substrings of length k) that the two strings have in common.

Conceptually, a feature vector with one element per possible k-mer is created for each string, and each element holds the number of times the corresponding k-mer occurs in that string. The dot product of the two vectors is then taken as the similarity score.

Such a vector is very large: its length is n^k, where n is the number of letters in the "alphabet" of the string, that is, the number of distinct characters the string can contain. Because the vector is sparse (mostly zeros), actual arrays are not used to represent it. Instead, a "feature dictionary" containing only the k-mers actually found in the string, together with their counts, is created.
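
The equivalence between the conceptual dense vector and the feature dictionary can be checked on a small case. The following is a minimal sketch, not this module's code: the helper names (kmer_counts, dense_vector, sparse_dot) and the four-letter alphabet are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product


def kmer_counts(s, k):
    """Sparse form: only the k-mers that occur in s, with their counts."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def dense_vector(s, k, alphabet):
    """Dense form: one slot for each of the n**k possible k-mers."""
    counts = kmer_counts(s, k)
    # A Counter returns 0 for k-mers that never occur in s.
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def sparse_dot(d1, d2):
    """Dot product computed only over k-mers present in both dictionaries."""
    if len(d1) > len(d2):
        d1, d2 = d2, d1  # iterate over the smaller dictionary
    return sum(count * d2[kmer] for kmer, count in d1.items() if kmer in d2)


alphabet = "ACGT"  # n = 4 (illustrative alphabet)
k = 2              # so the dense vector has 4**2 = 16 slots
a, b = "ACGTAC", "TACGA"

dense_a = dense_vector(a, k, alphabet)
dense_b = dense_vector(b, k, alphabet)
dense_score = sum(x * y for x, y in zip(dense_a, dense_b))
sparse_score = sparse_dot(kmer_counts(a, k), kmer_counts(b, k))

print(len(dense_a), dense_score, sparse_score)  # 16 4 4
```

Both forms give the same score, but the dictionary stores at most len(s) - k + 1 entries per string, whereas the dense vector grows as n^k.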

Instance Methods
 
__init__(self, k)
Constructor.
 
similarity(self, obj1, obj2)
Primary abstract method that, given two objects, should return an appropriate non-negative similarity score between the two.
 
buildFeatureDictionary(self, aString)
Create a dictionary keyed by all the k-mers (k-length substrings) of aString, with values equal to the number of times that k-mer appears in aString.
 
outputFeatures(self, objIterFactory, outFile)
Iterate through the input objects and generate the kernel features for each.
 
objectDescription(self, object)
Given one of the data objects, return a string description.

Inherited from BaseKernel.BaseKernel: dictionaryDotProduct, dictionaryEuclideanDistanceSquared, ensureListCapacity, getFeatureDictionary, normalizeFeatureDictionary, outputMatrix, prepareFeatureDictionaryList

Class Variables
  k = -1

Inherited from BaseKernel.BaseKernel: featureDictList, objIndex1, objIndex2

Method Details

__init__(self, k)
(Constructor)

 
Constructor. Takes the value k as an argument to specify the length of the "k-mer" substrings to find in common.

similarity(self, obj1, obj2)

 
Primary abstract method that, given two objects, should return an appropriate non-negative similarity score between the two. It is up to the implementing class to define what this is.
Overrides: BaseKernel.BaseKernel.similarity
(inherited documentation)
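
The override relationship can be pictured with a pair of stand-in classes. This is a sketch under stated assumptions rather than the package's source: BaseKernelSketch and SpectrumKernelSketch are hypothetical names, and the sketch ignores any caching that the inherited featureDictList, objIndex1 and objIndex2 variables may support.

```python
class BaseKernelSketch(object):
    """Hypothetical stand-in for the base class; not the real BaseKernel."""

    def similarity(self, obj1, obj2):
        # Abstract: each kernel defines its own non-negative score.
        raise NotImplementedError("Subclasses must define similarity()")


class SpectrumKernelSketch(BaseKernelSketch):
    """Hypothetical stand-in showing one way to satisfy the contract."""

    def __init__(self, k):
        self.k = k  # length of the k-mer substrings to compare

    def buildFeatureDictionary(self, aString):
        # Count every overlapping substring of length k.
        features = {}
        for i in range(len(aString) - self.k + 1):
            kmer = aString[i:i + self.k]
            features[kmer] = features.get(kmer, 0) + 1
        return features

    def similarity(self, obj1, obj2):
        # Dot product of the two feature dictionaries. Counts are never
        # negative, so the score is never negative either.
        d1 = self.buildFeatureDictionary(obj1)
        d2 = self.buildFeatureDictionary(obj2)
        return sum(count * d2.get(kmer, 0) for kmer, count in d1.items())


kernel = SpectrumKernelSketch(2)
print(kernel.similarity("ACGTAC", "TACGA"))   # 4
print(kernel.similarity("ACGTAC", "ACGTAC"))  # 7
```

Because the score is a dot product, it is symmetric in its two arguments, and a string's similarity to itself is the sum of its squared k-mer counts.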

buildFeatureDictionary(self, aString)

 
Create a dictionary keyed by all the k-mers (k-length substrings) of aString, with values equal to the number of times that k-mer appears in aString.
Overrides: BaseKernel.BaseKernel.buildFeatureDictionary
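
As an illustration of the dictionary described above, a standalone function might look like the following. The function name, the explicit k parameter, and the handling of strings shorter than k (an empty dictionary) are assumptions; the actual method takes k from the instance and may treat edge cases differently.

```python
def build_feature_dictionary(a_string, k):
    """Map each k-length substring of a_string to its occurrence count."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    features = {}
    # Slide a window of width k across the string; windows overlap.
    for start in range(len(a_string) - k + 1):
        kmer = a_string[start:start + k]
        features[kmer] = features.get(kmer, 0) + 1
    return features


print(build_feature_dictionary("CCOCC", 2))  # {'CC': 2, 'CO': 1, 'OC': 1}
print(build_feature_dictionary("CC", 3))     # {} (string shorter than k)
```

A string of length m yields m - k + 1 windows, so the counts in the dictionary always sum to that number when m is at least k.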

outputFeatures(self, objIterFactory, outFile)

 
Iterate through the input objects and generate the kernel features for each. Then write all of them in feature dictionary format to the output file.

objectDescription(self, object)

 
Given one of the data objects, return a string description. Individual kernel classes should implement the appropriate behavior here. For example, if the input is a SMILES string, simply return the string itself; if the input is an OEMolBase object, generate a SMILES string for it, and so on.