Class BaseFeatureExtractor
Abstract base class for all feature extractor classes.
Feature extractor classes are those that can map input data objects
into feature vectors (actually represented as feature dictionaries given
the typical sparseness of such vectors).
The most common end use is in kernel functions. A kernel function takes
any pair of objects and calculates a similarity score between them
(subject to positive-definiteness constraints) for use in a support
vector machine (SVM) style machine-learning application, without ever
requiring an explicit vector representation of the data objects.
For current practical purposes, most of our kernels are based on some
data -> feature dictionary mapping, on which we can then apply any of
a variety of vector-based similarity measures to complete the kernel
function. These include scalar dot product, Tanimoto, MinMax, Gaussian,
etc.
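As an illustration of the vector-based similarity measures named above,
here is a minimal sketch operating directly on feature dictionaries. The
function names and the `sigma` parameter are assumptions made for this
example, not the Similarity package's actual API. The formulas are the
standard textbook definitions, and MinMax assumes non-negative feature
counts.

```python
import math


def dot(a, b):
    """Scalar dot product of two sparse feature dictionaries."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dictionary
    return sum(value * b.get(key, 0) for key, value in a.items())


def tanimoto(a, b):
    """Tanimoto similarity: <a,b> / (<a,a> + <b,b> - <a,b>)."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    # Convention chosen for this sketch: two empty dictionaries score 0.0.
    return ab / denom if denom else 0.0


def minmax(a, b):
    """MinMax similarity: sum of per-feature minima over sum of maxima."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0


def gaussian(a, b, sigma=1.0):
    """Gaussian (RBF) similarity: exp(-||a - b||^2 / (2 * sigma^2))."""
    sq_dist = dot(a, a) + dot(b, b) - 2 * dot(a, b)
    return math.exp(-sq_dist / (2 * sigma ** 2))


x = {"t": 3, "e": 1, "s": 2}
y = {"t": 1, "s": 1, "g": 1}
print(dot(x, y), tanimoto(x, y), minmax(x, y), gaussian(x, y))
```

Because every measure only looks up keys, features absent from a
dictionary cost nothing, which is the practical benefit of the sparse
representation.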
The input objects need not be of any particular type as far as this
interface is concerned. They may be strings, molecules (OEMolBase),
vectors, etc. It is up to the implementing class to make those
distinctions.
Each extractor produces one feature dictionary per input object. A
feature dictionary is a simple dictionary object representing a sparse
feature vector. In the most common interpretation, the keys are string
representations of the features and the values are the number of times
each feature appears in the data object.
The general output of these extractors is a text file representation of
the feature dictionaries, in a format defined by the FeatureDictWriter
class. Modules in the Similarity package can then apply the assorted
similarity measures to these feature dictionaries to produce a Gram
matrix of similarity scores for input into an SVM or other learning
machine.
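The Gram matrix step can be sketched as follows. This is an illustrative
outline only: `gram_matrix` and `dot` are hypothetical helpers written
for this example, not functions from the Similarity package, and the
feature dictionaries are built in memory rather than read from a
FeatureDictWriter file.

```python
def dot(a, b):
    """Scalar dot product of two sparse feature dictionaries."""
    return sum(value * b.get(key, 0) for key, value in a.items())


def gram_matrix(feature_dicts, similarity):
    """Return the symmetric matrix of pairwise similarity scores."""
    n = len(feature_dicts)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # The similarity measure is symmetric, so fill both halves at once.
            score = similarity(feature_dicts[i], feature_dicts[j])
            gram[i][j] = gram[j][i] = score
    return gram


dicts = [{"a": 2, "b": 1}, {"a": 1, "c": 3}, {"b": 4}]
gram = gram_matrix(dicts, dot)
for row in gram:
    print(row)
```

Any other similarity function with the same two-dictionary signature can
be passed in place of `dot`.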

loadOptions(self, options)
Load relevant options derived from an optparse.OptionParser into
the state of this object.

loadArgs(self, args)
Similar to loadOptions, handle the arguments that come out of
optparse.OptionParser.

main(self, argv, closeOutfile=True)
Main method, callable from command line.

__call__(self, obj)
Primary abstract method.

outputFeatures(self, objIter, outFile)
Convenience method shared by all extractors to generate and output
features for all input objects.

parser = <CHEM.DB.rdb.search.NameRxnPatternMatchingModel.Searc...
inputIter = <CHEM.DB.rdb.search.NameRxnPatternMatchingModel.Se...
outFile = <CHEM.DB.rdb.search.NameRxnPatternMatchingModel.Sear...
inputFunction = <CHEM.DB.rdb.search.NameRxnPatternMatchingMode...

__init__(self)
(Constructor)
Default constructor. Sets up the expected command-line options.
Subclasses can add their own options on top of these, but should take
care not to reuse an option letter that is already taken.
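The option-letter caveat can be seen with optparse directly. In the
sketch below, the `-d` and `-k` options are hypothetical stand-ins
(the base class's real options are not listed in this documentation).
With optparse's default conflict handling, registering a second option
under a letter that is already in use raises
`optparse.OptionConflictError`.

```python
import optparse

parser = optparse.OptionParser(
    usage="usage: %prog [options] <inputFile> <outputFile>"
)

# Hypothetical option standing in for one registered by the base class.
parser.add_option("-d", "--delimiter", dest="delimiter", default="\t",
                  help="Field delimiter (illustrative only)")

# A subclass adds its own option under an unused letter: this is fine.
parser.add_option("-k", "--kmer", dest="k", type="int", default=1,
                  help="Substring length (illustrative only)")

# Reusing a letter that is already taken is rejected by optparse.
conflict = None
try:
    parser.add_option("-d", "--depth", dest="depth", type="int")
except optparse.OptionConflictError as err:
    conflict = str(err)

options, args = parser.parse_args(["-k", "3", "in.txt", "out.txt"])
print(options.k, args, conflict)
```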

loadOptions(self, options)
Load relevant options derived from an optparse.OptionParser into the
state of this object.
Subclasses should have this method handle any options they added to
the command-line parser via the constructor.

loadArgs(self, args)
Similar to loadOptions, handle the arguments that come out of
optparse.OptionParser.
The subclass is responsible for translating the command-line arguments
into an actual input iterator and output file object.
A default implementation is available here, which assumes the arguments
are a simple input file and a simple output file. If this is not the
case, the subclass should override this method or modify the
self.inputFunction member in its constructor, for example to use
something like an oemolistream or FeatureDictReader.

main(self, argv, closeOutfile=True)
Main method, callable from command line.
Sets up several common options that all of the subclasses share.

__call__(self, obj)
(Call operator)
Primary abstract method. Build a dictionary of an input object's
important features.
The result should make it easy to compare any two objects' feature
dictionaries.
This uses the "callable" interface, which means the object is a functor
that should be used like a function call. For example:
    >>> from SpectrumExtractor import SpectrumExtractor
    >>> featureExtractor = SpectrumExtractor()
    >>> featureExtractor.k = 1
    >>> featureDict = featureExtractor("teststring")  # Note that the extractor object looks like a function call
    >>> features = featureDict.keys()
    >>> features.sort()
    >>> for feature in features:
    ...     print feature, featureDict[feature]
    e 1
    g 1
    i 1
    n 1
    r 1
    s 2
    t 3
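For readers without the SpectrumExtractor module at hand, the callable
pattern can be sketched with a small standalone class. `KmerCounter` is
a hypothetical stand-in written for this example, not the real
SpectrumExtractor. It assumes that `k` is the substring length and that
features are counted over all overlapping substrings, which matches the
k = 1 output shown above.

```python
from collections import Counter


class KmerCounter:
    """Hypothetical stand-in for a spectrum-style feature extractor."""

    def __init__(self, k=1):
        self.k = k  # length of the substrings used as features

    def __call__(self, obj):
        """Map an input object to a {substring: count} feature dictionary."""
        text = str(obj)
        kmers = (text[i:i + self.k] for i in range(len(text) - self.k + 1))
        return dict(Counter(kmers))


extractor = KmerCounter(k=1)
feature_dict = extractor("teststring")  # the instance is called like a function
for feature in sorted(feature_dict):
    print(feature, feature_dict[feature])
```

Defining `__call__` lets an extractor instance be passed anywhere a
plain function of one argument is expected, while still carrying
configuration such as `k`.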

objectDescription(self, obj)
Abstract method. Return a string description of the input object.
Individual extractor classes should implement whatever is appropriate
for their input type. For example, if the input is a SMILES string,
just return the string itself; if the input is an OEMolBase object, the
method could generate a SMILES string for it.
|
Overridable method. Return a string name or ID label for the input
object.
For molecule objects, this will probably be mol.GetTitle(). Otherwise,
default to a sentinel value.

outputFeatures(self, objIter, outFile)
Convenience method shared by all extractors to generate and output
features for all input objects.
Generated features will be output in feature dictionary format to the
output file.
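The generate-and-write loop described for outputFeatures might look
roughly like the sketch below. Everything here is an assumption made
for illustration: the function `output_features`, the `describe`
callback, and especially the line format (description, a tab, then
sorted `feature:count` pairs). The real on-disk format is defined by
FeatureDictWriter and is not reproduced in this documentation.

```python
import io


def char_counts(text):
    """Toy extractor: count each character in a string."""
    counts = {}
    for char in text:
        counts[char] = counts.get(char, 0) + 1
    return counts


def output_features(extractor, describe, obj_iter, out_file):
    """Extract features for every object and write one line per object.

    Hypothetical line format: <description> TAB <feature:count pairs>.
    """
    for obj in obj_iter:
        feature_dict = extractor(obj)
        pairs = " ".join(
            "%s:%d" % (key, feature_dict[key]) for key in sorted(feature_dict)
        )
        out_file.write("%s\t%s\n" % (describe(obj), pairs))


buffer = io.StringIO()  # stands in for the output file object
output_features(char_counts, str, ["aab", "cb"], buffer)
print(buffer.getvalue(), end="")
```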