Simple kernel that calculates the similarity between pairs of strings.
Similarity is based on the number of "k-mers" (substrings of length k)
that the two strings have in common.
Conceptually, a feature vector with one element per possible k-mer is
created for each string, and each element holds the number of times the
corresponding k-mer occurs in that string. The dot product of these two
vectors is then taken as the similarity score.
Such a vector is very large: its length is n^k, where n is the number of
letters in the "alphabet" of the string, that is, the number of possible
distinct characters the string can contain. The vector is also sparse
(mostly zeros), so explicit arrays are not used to represent it.
Instead, a "feature dictionary" containing only the k-mers actually
found, together with their counts, is created.
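As a rough illustration of why the dictionary form is enough, the sketch below takes the dot product of two hand-written feature dictionaries. The function name `sparse_dot`, the example strings, the choice k = 2 and the 26-letter alphabet are all assumptions made for the example; this is not the inherited dictionaryDotProduct method, whose implementation is not shown here.

```python
# Hypothetical feature dictionaries for k = 2: each maps a 2-mer to its count.
features_a = {"GA": 1, "AT": 1, "TT": 1, "TA": 1, "AC": 1, "CA": 1}  # from "GATTACA"
features_b = {"AT": 1, "TT": 1, "TA": 1, "AC": 1, "CK": 1}           # from "ATTACK"


def sparse_dot(dict1, dict2):
    """Dot product of two sparse vectors stored as {k-mer: count} dictionaries."""
    # Iterate over the smaller dictionary; k-mers missing from the other
    # dictionary contribute 0, exactly as they would in the dense vectors.
    if len(dict1) > len(dict2):
        dict1, dict2 = dict2, dict1
    return sum(count * dict2.get(kmer, 0) for kmer, count in dict1.items())


# Length of the equivalent dense vector for a 26-letter alphabet and k = 2.
dense_length = 26 ** 2

score = sparse_dot(features_a, features_b)
print(score, dense_length)
```

Only the four shared 2-mers (AT, TT, TA, AC) contribute to the score, while the dense representation would need n^k = 676 elements per string.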
__init__(self, weightFactor=1)
    Constructor.
similarity(self, obj1, obj2)
    Primary abstract method: given two objects, it should return an
    appropriate, non-negative similarity score between the two.
buildFeatureDictionary(self, aString)
    Create a dictionary keyed by all the k-mers (k-length substrings)
    of aString, with values equal to the number of times each k-mer
    appears in aString.
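A minimal sketch of how such a dictionary could be built is given below. It is a standalone function, not the class method itself: the name `build_feature_dictionary` and the explicit `k` parameter are assumptions for illustration, since the documented method takes only aString and the way k is configured is not stated here.

```python
def build_feature_dictionary(a_string, k):
    """Return {k-mer: count} for every overlapping k-length substring of a_string."""
    counts = {}
    # A string of length m has m - k + 1 overlapping k-mers
    # (none at all when the string is shorter than k).
    for start in range(len(a_string) - k + 1):
        kmer = a_string[start:start + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


features = build_feature_dictionary("BANANA", 3)
print(features)
```

Here "ANA" occurs twice because the k-mers are allowed to overlap, so its count is 2 while "BAN" and "NAN" each have a count of 1.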
weightCalc(self, stringLen)
    Determine the weight that a string of length stringLen (int)
    should be given.
Inherited from BaseKernel.BaseKernel:
    dictionaryDotProduct, dictionaryEuclideanDistanceSquared,
    ensureListCapacity, getFeatureDictionary,
    normalizeFeatureDictionary, outputMatrix,
    prepareFeatureDictionaryList