Assignment 4. Multiple Sequence Alignment
You can search for ClustalW or use the program at http://www.ebi.ac.uk/clustalw/. Do not change any of the default settings. ClustalW being unavailable when you wanted to use it is not an acceptable excuse, so do this assignment early.
- Using the same viral sequences as in the last assignment, run the ClustalW program to produce a multiple alignment. In your *.doc file, include the block of the alignment of the 10 viruses that ends at position 300. This is only part of the entire ClustalW output.
- Columns in the alignment are marked with a star, a colon, a period, or a blank. Select the column at the first occurrence of each of these four symbols and form the Probability Weight Matrix (PWM) from those four columns.
The full probability weight matrix for these 4 columns would have 20 rows, but you can simplify your answer to only include those amino acids that have varying probabilities plus an extra row for all the rest.
- What is the entropy of each column? Entropy was defined in ics171 as the sum over i of -pi*log(pi), where the log is taken base 2 and pi is the probability of entry i, in this case of amino acid i. Recall that we define 0*log(0) as 0, since the limit of e*log(e) as e goes to 0 is 0.
- What is the match score of the sequence NNNN against this PWM? Recall that the match score is the sum of the corresponding probabilities, that is, the probability of each residue of the sequence in its column.
- What sequence of four amino acids would yield the highest score and what is that score?
- Is the ClustalW program guaranteed to find the optimal multiple alignment?
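
The PWM, entropy, and match-score definitions used in the questions above can be checked with a short script. The following is only a minimal sketch in Python. The four columns are made-up placeholders, not ClustalW output, and the function names (column_probabilities, entropy, match_score, best_sequence) are hypothetical helpers introduced for illustration. It assumes each column is given as a string of 10 residues, one per virus, with no gap characters.

```python
from collections import Counter
from math import log2

def column_probabilities(column):
    """Return {amino_acid: probability} for one alignment column."""
    counts = Counter(column)
    total = len(column)
    return {aa: n / total for aa, n in counts.items()}

def entropy(probs):
    """Sum of -p*log2(p); terms with p == 0 are skipped (0*log(0) = 0)."""
    return sum(-p * log2(p) for p in probs.values() if p > 0)

def match_score(pwm, sequence):
    """Sum of the probability of each residue in its column.
    Residues absent from a column contribute 0."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pwm, sequence))

def best_sequence(pwm):
    """Pick the most probable residue in each column.
    Ties are broken arbitrarily (first residue seen)."""
    best = [max(col, key=col.get) for col in pwm]
    score = sum(col[aa] for col, aa in zip(pwm, best))
    return "".join(best), score

# Placeholder columns (assumption): replace with the four columns
# selected from your own alignment.
columns = [
    "NNNNNNNNNN",  # fully conserved
    "NNNNNDDDDD",  # two residues, evenly split
    "NNNNNNNNSS",  # mostly one residue
    "ACDEFGHIKL",  # all residues different
]

pwm = [column_probabilities(c) for c in columns]

for i, col in enumerate(pwm, start=1):
    print(f"column {i}: entropy = {entropy(col):.4f} bits")

print("match score of NNNN:", round(match_score(pwm, "NNNN"), 4))
print("best sequence and score:", best_sequence(pwm))
```

With these placeholder columns, the fully conserved column has entropy 0 and the column with ten different residues has entropy log2(10), which are the two extremes for 10 sequences. When several residues tie for the highest probability in a column, more than one sequence achieves the highest score.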