Simple class that extracts substrings as features of string objects.
Features are based on the counts of "k-mers" in a string, that is,
all substrings of length k.
Conceptually, a feature vector with one element per possible k-mer is
created for each string, and each element holds the number of times its
k-mer occurs in the string.
Such a vector is very large: its length is n^k, where n is the number of
letters in the "alphabet" of the string, that is, the number of possible
distinct characters the string can contain. The vector is also sparse
(mostly zeros), so it is never actually built as an array. Instead, a
"feature dictionary" containing only the k-mers that were found, together
with their counts, is created.
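
To make the size argument concrete, the following sketch compares the length of the conceptual dense vector (n^k) with the largest number of non-zero entries a single string could produce. The function names are hypothetical and are not part of the documented class; the alphabet size of 4 is just an illustrative DNA-like choice.

```python
def dense_vector_length(alphabet_size, k):
    """Number of possible k-mers over an alphabet of the given size (n^k)."""
    return alphabet_size ** k


def max_nonzero_entries(string_length, k):
    """Upper bound on the distinct k-mers one string of this length can contain."""
    return max(string_length - k + 1, 0)


# Illustrative values: a 4-letter alphabet, k = 8, and a 100-character string.
print(dense_vector_length(4, 8))    # 65536
print(max_nonzero_entries(100, 8))  # 93
```

With these example values, at most 93 of the 65536 conceptual elements can be non-zero, which is why a dictionary of found k-mers is the more economical representation.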

loadOptions(self, options)
Load relevant options derived from an optparse.OptionParser into
the state of this object.
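
As a rough illustration of this pattern, the sketch below parses a k-mer length with optparse and copies it into an object's state. The class name `KmerExtractorSketch`, the option `--kmer-length`, and the attribute `k` are assumptions made for the example; the options the real class reads are not listed on this page.

```python
from optparse import OptionParser


class KmerExtractorSketch(object):
    """Hypothetical stand-in for the documented class."""

    def __init__(self):
        self.k = None

    def loadOptions(self, options):
        # Copy the relevant parsed option into this object's state.
        self.k = options.kmer_length


parser = OptionParser()
# "--kmer-length" is an assumed option name, not taken from the real tool.
parser.add_option("-k", "--kmer-length", dest="kmer_length",
                  type="int", default=3)
options, args = parser.parse_args(["-k", "4"])

extractor = KmerExtractorSketch()
extractor.loadOptions(options)
print(extractor.k)  # 4
```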

__call__(self, obj)
Create a dictionary keyed by all the k-mers (k-length substrings)
of the input string object, with values equal to the number of times
each k-mer appears in the string.
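
A minimal sketch of such a callable is shown below. It assumes that overlapping occurrences are counted with a sliding window and that k is passed to the constructor; both are assumptions for illustration, and the class name `KmerCounterSketch` is hypothetical.

```python
class KmerCounterSketch(object):
    """Hypothetical callable that counts k-mers in a string."""

    def __init__(self, k):
        self.k = k  # assumed to be set here; the real class may load it from options

    def __call__(self, obj):
        counts = {}
        # Slide a window of width k across the string, one character at a time.
        for i in range(len(obj) - self.k + 1):
            kmer = obj[i:i + self.k]
            counts[kmer] = counts.get(kmer, 0) + 1
        return counts


extract = KmerCounterSketch(k=2)
features = extract("ABABC")
print(features)  # {'AB': 2, 'BA': 1, 'BC': 1}
```

A string shorter than k yields an empty dictionary in this sketch, since no window of width k fits.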

objectDescription(self, obj)
The input is itself a string, so just return the input object unchanged.

Inherited from BaseFeatureExtractor.BaseFeatureExtractor:
getNameID, loadArgs, main, outputFeatures