Package CHEM :: Package CombiCDB :: Module PatternMatchCounter

Module PatternMatchCounter



Given a set of SMARTS patterns and molecules, counts how many
times each SMARTS pattern (i.e. functional group) is found in
each molecule.

Also includes a script to generate the output in a format easily inserted
into the application database.  Starting from molecule and SMARTS files
that have NOT yet been inserted into the database, a complete run,
including inserting the results into the database, could be
accomplished with the following from the command line:

===========================================================================
python PatternMatchCounter.py molecule.smi example.smarts match.counter
python DBUtil.py -imolecule.smi     -tMOLECULE      -omolecule.smi.id    CAN_SMILES LABEL
python DBUtil.py -iexample.smarts   -tPATTERN       -oexample.smarts.id  SMARTS LABEL
python PatternMatchCounter.py -dmatch.txt -cmatch.counter molecule.smi.id example.smarts.id
python DBUtil.py -imatch.txt        -tPATTERN_MATCH -omatch.txt.id      MOLECULE_ID PATTERN_ID COUNT
===========================================================================

Alternatively, to use molecules and SMARTS already stored in the database, run something like this:

===========================================================================
python DBUtil.py "select CAN_SMILES, LABEL, MOLECULE_ID from MOLECULE"  molecule.smi
python DBUtil.py "select SMARTS, LABEL, PATTERN_ID from PATTERN"      example.smarts
python PatternMatchCounter.py molecule.smi example.smarts match.counter
python PatternMatchCounter.py -dmatch.txt -cmatch.counter molecule.smi example.smarts
python DBUtil.py -imatch.txt -tPATTERN_MATCH -omatch.txt.id MOLECULE_ID PATTERN_ID COUNT
===========================================================================

Input: 
- Molecule file
    Can be any format understandable by oemolistream, assuming a properly 
    named extension.  For example, "molecules.smi" for SMILES format
- SMARTS pattern file
    File containing one SMARTS pattern string per line that will 
    be used to search the molecules

Either of the above can take stdin as its source by specifying the 
filename "-" or ".smi" or something similar.  See the oemolistream 
documentation for more information.

Output:
- Match counter file
    One tab-delimited line of counts is written for each molecule read 
    from the molecule file.  Each line holds one count per SMARTS pattern, 
    in the same order as the SMARTS patterns were read, and each value 
    equals the number of times that SMARTS pattern was matched in the 
    respective molecule.
    Again, output can be redirected to stdout by specifying the filename "-".  



Functions
 
main(argv)
Command-line main method
 
countPatternMatchesByFilename(moleculeFilename, smartsFilename, counterFilename)
Opens files with respective names and delegates most work to "countPatternMatches"
 
countPatternMatches(moleculeOEIS, smartsFile, counterFile)
Primary method, reads the source files to count pattern matches for the output file.
 
readSMARTSFile(smartsFile)
Read the contents of the smartsFile as a list of SMARTS strings.
 
formatDBFileByFilename(counterFilename, moleculeIDFilename, smartsIDFilename, dbFilename, sparse=True)
Opens files with respective names and delegates most work to "formatDBFile"
 
formatDBFile(counterFile, moleculeIDFile, smartsIDFile, dbFile, sparse=True)
Given the database IDs of molecules, patterns (SMARTS) and a counter matrix relating the two, generate a simple text file that should be very easy to import into the database to persist that association information.
Function Details

countPatternMatches(moleculeOEIS, smartsFile, counterFile)

 

Primary method, reads the source files to count pattern matches for the output file. See module documentation for more information.

Note: This method takes actual File objects, not filenames, to allow the caller to pass "virtual Files" for the purpose of testing and interfacing. Use the "main" method to have the module take care of opening files from filenames.

One extra catch: the molecule source is not a file but an oemolistream, which is needed to take advantage of that class's high-level management of different molecule file formats.
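
The note about "virtual Files" can be illustrated with an in-memory stream. This is only a minimal sketch, not the module's implementation: `write_match_counts`, `count_matches` and `naive_count` are hypothetical names, and the substring counter is a stand-in for real SMARTS matching, which the module performs through OpenEye.

```python
import io

def write_match_counts(molecules, patterns, counter_file, count_matches):
    """Write one tab-delimited line of counts per molecule.

    Hypothetical helper: `count_matches(pattern, molecule)` is supplied by
    the caller; the real module delegates matching to OpenEye instead.
    """
    for molecule in molecules:
        counts = [count_matches(pattern, molecule) for pattern in patterns]
        counter_file.write("\t".join(str(c) for c in counts) + "\n")

def naive_count(pattern, molecule):
    # Stand-in only: counts literal substring occurrences, NOT SMARTS matches.
    return molecule.count(pattern)

# An in-memory "virtual file" takes the place of a file on disk.
out = io.StringIO()
write_match_counts(["CCO", "CC(=O)O"], ["O", "N"], out, naive_count)
counter_text = out.getvalue()
```

Because the function only calls `write` on `counter_file`, a test can inspect `counter_text` directly without touching the file system.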

readSMARTSFile(smartsFile)

 
Read the contents of the smartsFile as a list of SMARTS strings.
Comment lines prefixed with "#" will be ignored.  
Expects one SMARTS string per line of the file.  Each SMARTS string can be followed
    by any title / comment, etc. separated by whitespace.  These will be ignored.
Returns a list of OESubSearch objects, instantiated with the respective SMARTS string.
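
The parsing rules above can be sketched in plain Python. This is an illustration under assumptions, not the module's code: `read_smarts_strings` is a hypothetical name, it returns plain strings rather than OESubSearch objects, and it additionally assumes that blank lines should be skipped and that leading whitespace before "#" is allowed.

```python
import io

def read_smarts_strings(smarts_file):
    """Return the SMARTS string from each non-comment line of a file-like object."""
    smarts_list = []
    for line in smarts_file:
        stripped = line.strip()
        # Assumption: blank lines are skipped along with "#" comment lines.
        if not stripped or stripped.startswith("#"):
            continue
        # Keep only the first whitespace-delimited token; any trailing
        # title or comment on the line is ignored.
        smarts_list.append(stripped.split()[0])
    return smarts_list

sample = io.StringIO(
    "# functional groups\n"
    "[OX2H] hydroxyl\n"
    "[#6][CX3](=O)[OX2H] carboxylic acid\n"
    "\n"
    "C=C\n"
)
patterns = read_smarts_strings(sample)
```

Note that "#" inside a pattern, as in "[#6]", is untouched, since only lines that begin with "#" are treated as comments.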

formatDBFile(counterFile, moleculeIDFile, smartsIDFile, dbFile, sparse=True)

 

Given the database IDs of molecules, patterns (SMARTS) and a counter matrix relating the two, generate a simple text file that should be very easy to import into the database to persist that association information.

To trim the output, leave the sparse option set to True (the default) so that no rows are generated for molecule / pattern pairs with a count of 0 (no matches, which will be the most common case).

Each line produced should correspond to a row in the PATTERN_MATCH table, with values to insert for the MOLECULE_ID, PATTERN_ID and COUNT columns, respectively.
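
The effect of the sparse option can be sketched as follows. This is a hedged illustration rather than the module's implementation: `format_db_rows` is a hypothetical name, the IDs are passed as plain lists instead of being read from ID files (whose layout is not described here), and the tab-delimited output is an assumption.

```python
def format_db_rows(counter_lines, molecule_ids, pattern_ids, sparse=True):
    """Turn a counter matrix into MOLECULE_ID / PATTERN_ID / COUNT rows.

    Hypothetical helper: each item of `counter_lines` is one tab-delimited
    line of counts, one line per molecule and one count per pattern.
    """
    rows = []
    for molecule_id, line in zip(molecule_ids, counter_lines):
        counts = line.rstrip("\n").split("\t")
        for pattern_id, count in zip(pattern_ids, counts):
            if sparse and int(count) == 0:
                continue  # sparse mode: skip pairs with no matches
            # Assumption: output columns are tab-delimited.
            rows.append("%s\t%s\t%s" % (molecule_id, pattern_id, count))
    return rows

counter_lines = ["1\t0\n", "2\t0\n"]
sparse_rows = format_db_rows(counter_lines, [101, 102], [7, 8])
dense_rows = format_db_rows(counter_lines, [101, 102], [7, 8], sparse=False)
```

With these inputs the sparse output holds two rows, while the dense output holds all four molecule / pattern pairs, including the zero counts.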