Methodology
- MUpro is a set of machine learning programs to predict how single-site amino acid mutation affects protein stability. We developed two machine learning methods: Support Vector Machines and Neural Networks. Both of them were trained on a large mutation dataset and show accuracy above 84% via 20 fold cross validation, which is better than other methods in the literature. One advantage of our methods is that they do not require tertiary structures to predict protein stability changes. Our experimental results show that the prediction accuracy using sequence information alone is comparable to that of using tertiary structures. So even you do not have protein tertiary structures available, you still can use this server to get rather accurate prediction. Of course, if you provide tertiary structures, our methods will take advantage of them and you might get slightly better predictions.
Input Format
- Mutation name, optional.
- Mutation position.
- Original amino acid. The amino acid is a single standard letter. Don't use 3-letter abbreviation.
- Substitute amion acid.
- Sequence: a raw text of sequence, white space are ignored.
- Structure file in PDB format, optional.
Output Format
- Request name
- Protein sequence
- Prediction of the value of energy change (delta delta G) using Support Vector Machine (Recommended)
- Prediction of the sign of energy change using Support Vector Machines and neural networks: method used, effects of mutation on protein stability, and a confidence score between -1 and 1 to measure the confidence of the prediction.
A score less than 0 means the mutation decreases the protein stability. The smaller the score, the more confident the prediction is. Conversely, a score more than 0 means the mutation increases the protein stability. The bigger the score, the more confident the prediction is.
NOTICE: These methods use different methodologies and input information somewhat differently. For instance, prediction of delta delta G uses regression methods, while prediction of sign uses classification methods. So they may yield different predictions when the prediction is hard to make. In those cases, you can take the consensus as the predictions, or you can stick to one method. It is useful to know which method works better in practice and your feedback is very welcome.
Evaluation and Performance
Our methods are evaluated on a large dataset (compiled by Capriotti et al., Bioinformatics, vol. 20, pages 190-201, 2004) containing 1615 mutations using 20 fold cross validation procedure.
Under this procedure, the dataset is splited evenly into 20 folds. Any one fold is used as test dataset,
another remaining 19 folds are used as training dataset.
Thus there are 20 pairs of testing and training datasets.
For each pair, the SVM and neural network are trained on the training dataset and tested on the testing dataset.
The performance on all test datasets are combined and reported as the performance of tested methods. Here are the prediction accuracy (correct num / total num) by support vector machines using sequence information and tertiary structure information respectively.
-
The accuracy for Support Vector Machine using sequence information only is 84.2%.
-
The accuracy for Support Vector Machine using tertiary structure information is 84.5%.
References
- J. Cheng A. Randall, P. Baldi. Prediction of Protein Stability Changes for Single-Site Mutations Using Support Vector
Machines. Proteins, vol. 62, no. 4, pp. 1125-1132, 2005. [PDF]
- J. Cheng, A. Z. Randall, M. Sweredoski, and P. Baldi. SCRATCH: a Protein Structure and Structural Feature Prediction Server. Nucleic Acids Research, vol. 33 (server issue), w72-76, 2005.
Download MUpro 1.0