Package CHEM :: Package ML :: Package features :: Module BaseFeatureExtractor :: Class BaseFeatureExtractor

Class BaseFeatureExtractor

Abstract base class for all feature extractor classes.

Feature extractor classes are those that can map input data objects into feature vectors (actually represented as feature dictionaries given the typical sparseness of such vectors).

Most commonly, the extracted features ultimately feed kernel functions. A kernel function takes any pair of objects and calculates a similarity score between them (subject to positive-definiteness constraints) for use in a support vector machine (SVM) style machine-learning application, without ever requiring an explicit vector representation of the data objects.

For current practical purposes, most of our kernels are based on some data -> feature dictionary mapping, on which we can then apply any of a variety of vector-based similarity measures to complete the kernel function. These include scalar dot product, Tanimoto, MinMax, Gaussian, etc.
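The similarity measures named above can be sketched directly over feature dictionaries. The following is a minimal illustration using the textbook definitions of each measure; the function names and the `sigma` parameter are assumptions made for this example, and the Similarity package's own implementations may differ.

```python
import math

def dot(a, b):
    """Scalar dot product of two sparse feature dictionaries."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dictionary
    return sum(value * b.get(key, 0) for key, value in a.items())

def tanimoto(a, b):
    """Tanimoto similarity: <a,b> / (<a,a> + <b,b> - <a,b>)."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return float(ab) / denom if denom else 0.0

def minmax(a, b):
    """MinMax similarity: sum of per-feature minima over sum of maxima."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return float(num) / den if den else 0.0

def gaussian(a, b, sigma=1.0):
    """Gaussian (RBF) similarity using the squared Euclidean distance."""
    sqDist = dot(a, a) + dot(b, b) - 2 * dot(a, b)
    return math.exp(-sqDist / (2.0 * sigma ** 2))

x = {"a": 1, "b": 2}
y = {"b": 1, "c": 3}
scores = (dot(x, y), tanimoto(x, y), minmax(x, y), gaussian(x, x))
```

Because every measure is written in terms of dictionary lookups, features absent from one object simply contribute zero and no dense vector is ever built.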

The input objects need not be of any particular type as far as this interface is concerned. They may be strings, molecules (OEMolBase), vectors, etc. It is up to the implementing class to make those distinctions.

The results of these extractors will be feature dictionaries for each input object. These are simple dictionary objects representing sparse feature vectors with the most common interpretation having the dictionary item keys as string representations of the features, and the item values as the number of times that feature appears in the data object.

The general output of these extractors will be a text file representation of these feature dictionaries, whose format is defined by the FeatureDictWriter class. Modules in the Similarity package can then apply the assorted similarity measures to these feature dictionaries to produce a Gram matrix of similarity scores for input into an SVM or other learning machine.
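As a rough sketch of the last step, a Gram matrix can be built by applying one similarity measure to every pair of feature dictionaries. The helper names `gramMatrix` and `dotProduct` are hypothetical and stand in for whatever the Similarity package actually provides; the feature dictionaries are assumed to be already loaded in memory rather than read from a file.

```python
def dotProduct(a, b):
    """Scalar dot product of two sparse feature dictionaries."""
    return sum(value * b.get(key, 0) for key, value in a.items())

def gramMatrix(featureDicts, kernel):
    """Return the symmetric matrix of pairwise kernel scores."""
    n = len(featureDicts)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = kernel(featureDicts[i], featureDicts[j])
            gram[i][j] = score
            gram[j][i] = score  # kernels are symmetric, so fill both halves
    return gram

featureDicts = [{"a": 1}, {"a": 2, "b": 1}]
gram = gramMatrix(featureDicts, dotProduct)
```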

Instance Methods

__init__(self)
Default constructor.

loadOptions(self, options)
Load relevant options derived from an optparse.OptionParser into the state of this object.

loadArgs(self, args)
Similar to loadOptions, handle the arguments that come out of optparse.OptionParser.

main(self, argv, closeOutfile=True)
Main method, callable from command line.

__call__(self, obj)
Primary abstract method.

objectDescription(self, obj)
Abstract method.

getNameID(self, obj)
Overridable method.

outputFeatures(self, objIter, outFile)
Convenience method shared by all extractors to generate and output features for all input objects.

Class Variables

parser = <CHEM.DB.rdb.search.NameRxnPatternMatchingModel.Searc...

inputIter = <CHEM.DB.rdb.search.NameRxnPatternMatchingModel.Se...

outFile = <CHEM.DB.rdb.search.NameRxnPatternMatchingModel.Sear...

inputFunction = <CHEM.DB.rdb.search.NameRxnPatternMatchingMode...

Method Details

__init__(self)
(Constructor)

Default constructor. Sets up expected command-line options.

Sub-classes can add their own options on top of these, but should take care not to reuse an option letter that is already defined.

loadOptions(self, options)

Load relevant options derived from an optparse.OptionParser into the state of this object.

Sub-classes should override this method to handle any options they added to the command-line parser in their constructor.
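The constructor and loadOptions pairing might look like the sketch below. `StandInExtractor` is a hypothetical stand-in for the base class (its real options are not listed here), and the `-k`/`--kmer` option is invented for illustration. Only the optparse calls themselves are standard library behavior, including the OptionConflictError raised when an option letter is reused.

```python
from optparse import OptionParser, OptionConflictError

class StandInExtractor(object):
    """Hypothetical stand-in for the base class, not the real implementation."""
    def __init__(self):
        usage = "usage: %prog [options] <inputFile> <outputFile>"
        self.parser = OptionParser(usage=usage)
        # An assumed "common" option shared by all sub-classes.
        self.parser.add_option("-v", "--verbose", dest="verbose",
                               action="store_true", default=False)

class KmerExtractor(StandInExtractor):
    """Hypothetical sub-class that adds its own option."""
    def __init__(self):
        StandInExtractor.__init__(self)
        self.k = 1
        self.parser.add_option("-k", "--kmer", dest="k", type="int", default=1)

    def loadOptions(self, options):
        # Copy the parsed option values into this object's state.
        self.k = options.k

extractor = KmerExtractor()
options, args = extractor.parser.parse_args(["-k", "3", "in.txt", "out.txt"])
extractor.loadOptions(options)

# Reusing an option letter is rejected by optparse by default.
try:
    extractor.parser.add_option("-k", "--keep")
    clash = False
except OptionConflictError:
    clash = True
```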

loadArgs(self, args)

Similar to loadOptions, handle the arguments that come out of optparse.OptionParser.

The sub-class is responsible for translating the command-line arguments into an actual input iterator and output file object.

A default implementation is available here, which assumes the arguments should be interpreted as a simple input file and a simple output file. If this is not the case, the sub-class should override this method or modify the self.inputFunction member in its constructor, for example to use something like an oemolistream or FeatureDictReader.

main(self, argv, closeOutfile=True)

Main method, callable from command line.

Sets up several common options that all of the sub-classes will share.

__call__(self, obj)
(Call operator)

Primary abstract method. Build a dictionary of an input object's important features.

Should be such that it is easy to compare any two objects' feature dictionaries.

This uses the "callable" interface, which means the object is a functor that should be used like a function call. For example:

    >>> from SpectrumExtractor import SpectrumExtractor
    >>> featureExtractor = SpectrumExtractor()
    >>> featureExtractor.k = 1
    >>> featureDict = featureExtractor("teststring")  # Note that the extractor object looks like a function call
    >>> features = featureDict.keys()
    >>> features.sort()
    >>> for feature in features:
    ...     print feature, featureDict[feature]
    e 1
    g 1
    i 1
    n 1
    r 1
    s 2
    t 3
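To show what implementing the call operator involves, here is a minimal sketch of a functor that counts overlapping substrings of length k. `CharKmerExtractor` is a hypothetical class written for this illustration; it is not the library's SpectrumExtractor and does not inherit from the real base class.

```python
class CharKmerExtractor(object):
    """Hypothetical extractor: counts overlapping length-k substrings."""
    def __init__(self, k=1):
        self.k = k

    def __call__(self, obj):
        # Build the feature dictionary: substring -> number of occurrences.
        featureDict = {}
        for i in range(len(obj) - self.k + 1):
            kmer = obj[i:i + self.k]
            featureDict[kmer] = featureDict.get(kmer, 0) + 1
        return featureDict

extractor = CharKmerExtractor(k=2)
featureDict = extractor("banana")  # the instance is used like a function
```

Keeping parameters such as k on the instance lets the same object be passed anywhere a plain function of one argument is expected.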

objectDescription(self, obj)

Abstract method. Return a string description of the input object.

Individual extractor classes should implement the proper thing to do here. For example, if the input is a SMILES string, just return the string itself, but if the input object is an OEMolBase object, it could generate a SMILES string for it, etc.

getNameID(self, obj)

Overridable method. Return a string name or ID label for the input object.

For molecule objects, this will probably be mol.GetTitle(). Otherwise, default to a sentinel value.

outputFeatures(self, objIter, outFile)

Convenience method shared by all extractors to generate and output features for all input objects.

Generated features will be output in feature dictionary format to the output file.
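The overall loop can be sketched as follows. This is only an illustration of the control flow: the real output format is defined by FeatureDictWriter and is not reproduced here, so the tab-separated `key:count` layout, the free function form, and the `charCounts` extractor are all assumptions.

```python
import io

def outputFeatures(extractor, objIter, outFile):
    """Write one line per input object (hypothetical format, see above)."""
    for obj in objIter:
        featureDict = extractor(obj)  # any callable returning a feature dict
        pairs = " ".join("%s:%s" % (key, featureDict[key])
                         for key in sorted(featureDict))
        outFile.write("%s\t%s\n" % (obj, pairs))

def charCounts(text):
    """Hypothetical extractor: count each character of a string."""
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    return counts

outFile = io.StringIO()
outputFeatures(charCounts, ["aab", "abc"], outFile)
result = outFile.getvalue()
```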

Class Variable Details

parser

Value:

None

inputIter

Value:

None

outFile

Value:

None

inputFunction

Value:

None
