
The main goal of this class is to enable biologists and computer scientists to work together. This requires learning something of each other's concepts and methodologies.

Grading: Your grade is based on your homework (minor) and on your project presentation and write-up (major). If you believe your homework assignment has been misgraded (a distinct possibility), simply resubmit it with an explanation and I will look at it again, or see me during office hours. A project does not have to be successful; it need only be a reasonable effort. The project write-up should follow the standard paper format.

NOTE: The following is a tentative schedule. The general flow is correct, but some later topics may be replaced, and the time spent on each topic may vary with how long the material takes to understand. Questions in class or via email are welcome. Only next week's homework is guaranteed to be correct.

Default assignment: For each required paper or chapter, hand in a short typed observation, question, correction, or criticism about the work. These are due at the first class meeting of the week, usually Tuesdays. A paragraph is plenty. Default assignments may be overruled or supplanted by the homework listed for a given week.

The best type of comment takes the form of constructive criticism: how could the author make the paper more convincing or more significant? Pretend that a relative wrote the paper and you are helping them get their ideas published.

Example Projects: Projects do not need to be successful in the sense of discovering something new. WARNING: Use other people's data and other people's algorithms. If the data is not in hand, the project will not finish. If the algorithms are not already known, there is too little time to code them. A project may reconfirm what is already known.

1. Working with a biologist. Understand the data and the question that the biologist has, then either apply some machine learning algorithm or download existing software and apply it to the problem. This is a 2-person project and the final report should reflect this. The write-up will be about 8 pages, including a description of the data, the problem, and the algorithm.

2. Working with another CS student. Use existing biocomputational programs. This is similar to 1, but with much less chance of success. An example would be to apply some gene-finding program to some genome. You would explain the algorithm, the results, and alternative approaches.

3. Working with another CS student. Instead of using existing bioinformatics tools, use machine learning methods. Numerous algorithms are available over the web. The Weka site offers a unified collection of about 30 machine learning algorithms. The write-up would be similar.

4. Working with another CS student or alone. Any idea you have for analyzing genomic information (genomes, gene expression data, protein data, metabolic data, etc.). You may implement your own algorithm, but the aim of the project, which may not be realized, should be the discovery of new biological knowledge.

I will be suggesting many projects in the lectures. Unless noted, readings are from the text "Bioinformatics" by Mount. There are many sites with lecture notes for Computational Biology, but I think the notes from Martin Tompa's class are particularly useful. Here's the URL: http://www.cs.washington.edu/homes/tompa/. Tompa is a computer scientist and Mount is a biologist. The differences in the way they think should be apparent. Another good source of lecture notes is Princeton, at http://www.cs.princeton.edu/courses/archive/fall01/cs551/. For notes on mathematical aspects of this course, such as probability, entropy, and hidden Markov models, see: http://www-2.cs.cmu.edu/~awm/tutorials/.

Algorithms can be and should be understood at multiple levels. At the minimum you should clearly understand the inputs, outputs, and assumptions, i.e. you should know what the algorithm computes. A different level of understanding is how the computation is carried out. That level is necessary if you want to code or improve the algorithm.

Week 1: Introduction to Molecular Biology and Computation
- Read Chapter 1 and the first 11 pages of Chapter 10.
- Read Lecture 1 and Lecture 2 from Tompa's lecture notes.
- Optional (computer scientists): use a search engine on "Computational Molecular Biology at NIH", then Cold Spring Harbor, then Dolan DNA Learning Center; look at mini-lesson 19.
- Find and talk with your bio-computer mate
- Optional: Look at Jacques van Helden's site
- Homework: Next Monday, hand in a paragraph about Chapter 1 and the first part of Chapter 10.
Week 2: Pairwise Alignment
- Read Chapter 3
- Optional: Read Tompa's lecture notes 3 and 4.
- Dot-matrix
- Needleman-Wunsch
- Smith-Waterman
- Homework due next Monday: (replaces default assignment) Hand in:
  1. Compute (by hand) the Dot-Matrix for the strings actgact and gactatca.
  2. What do you notice?
  3. Compute (by hand) the global alignment of aact and gatc. Show the matrix.
  4. What are the minimum and maximum global similarity for two strings of length n and m?
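As a way to check hand computations like the ones in this homework, here is a minimal Python sketch of a dot matrix and of the Needleman-Wunsch score matrix. The function names and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; the text and the lectures may use a different scoring scheme, and the example strings are not the homework strings.

```python
def dot_matrix(s, t):
    """Return one string per character of s: '*' where s[i] == t[j], '.' elsewhere."""
    return ["".join("*" if a == b else "." for b in t) for a in s]


def global_alignment_matrix(s, t, match=1, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch score matrix for s (rows) against t (columns).

    The default scores are illustrative assumptions, not the course's scheme.
    The global similarity is the bottom-right entry of the returned matrix.
    """
    rows, cols = len(s) + 1, len(t) + 1
    F = [[0] * cols for _ in range(rows)]
    # First column and first row: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = F[i - 1][j] + gap      # s[i-1] aligned against a gap
            left = F[i][j - 1] + gap    # t[j-1] aligned against a gap
            F[i][j] = max(diag, up, left)
    return F


if __name__ == "__main__":
    for row in dot_matrix("gat", "gt"):
        print(row)
    for row in global_alignment_matrix("gat", "gt"):
        print(row)
```

Diagonal runs of `*` in the dot matrix mark shared substrings, and the bottom-right entry of the score matrix is the global similarity under the chosen scores.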
Week 3: Multiple Alignment
- A self-described gentle introduction is at http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/mulali.html
- Read Chapter 4 and Tompa's notes, lecture 6.
- Gusfield's star-alignment
- ClustalW (available on the web)
- Markov Models
- Homework:
  1. Run ClustalW on the 500 bp upstream region of the NIT family. The genes are listed in Van Helden's paper and you can retrieve the sequences from his site. There are many servers for ClustalW. One is at http://www.ch.embnet.org/software/ClustalW.html.
  2. Report the results. In particular, can you identify the regulatory elements noted in Van Helden's paper?
Weeks 4 & 5: Finding Regulatory Elements
- Van Helden (Extracting Regulatory Sites from the Upstream Region of Yeast Genes by Computational Analysis of Oligonucleotide Frequencies; JMB 1998, 827-842)
- Consensus (Stormo)
- EM Algorithm (Elkan paper)
- Homework: Hand in next Monday
  1. Get the upstream regions for the Hap family of Yeast genes (in his paper) from Van Helden's site.
  2. Run the Oligo-nucleotide analysis program on these genes using different models of what is expected, i.e. different background models.
  3. Report (hand in) your results with some explanation and a comparison with Van Helden's results.
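To make the idea of a "background model" concrete, the following Python sketch counts oligonucleotides in a set of sequences and compares each count with the count expected if nucleotides were independent. This is a simplified illustration, not Van Helden's program: the function names, the independence (Bernoulli) background, and the toy sequences are all assumptions made here, and the paper describes its own choices of expected frequencies and significance measures.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count overlapping k-mers over all sequences (case-insensitive)."""
    counts = Counter()
    for s in seqs:
        s = s.lower()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def expected_bernoulli(seqs, k):
    """Return a function giving the expected count of a k-mer under an
    independent-nucleotide background estimated from the sequences themselves.
    (An assumed, deliberately simple background model.)"""
    base = count_kmers(seqs, 1)
    total = sum(base.values())
    freq = {b: n / total for b, n in base.items()}
    positions = sum(max(len(s) - k + 1, 0) for s in seqs)

    def expected(kmer):
        p = 1.0
        for b in kmer.lower():
            p *= freq.get(b, 0.0)
        return p * positions

    return expected


if __name__ == "__main__":
    toy = ["acgacgtt", "ttacgaca"]  # made-up sequences, not real upstream regions
    observed = count_kmers(toy, 3)
    expected = expected_bernoulli(toy, 3)
    for kmer, n in observed.most_common(5):
        print(kmer, n, round(n / expected(kmer), 2))
```

Swapping in a different `expected` function (for example, one built from k-mer frequencies in all non-coding regions of the genome) changes which oligonucleotides look over-represented, which is the comparison the homework asks for.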
Week 6: Micro-Array Analysis
- Read
  - Cluster Analysis and Display of Genome-Wide Expression Patterns. Eisen, Spellman, Brown, Botstein. PNAS 1998
  - Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Golub, ... Lander. Science 1999
- Visit the Dolan DNA Learning Center site, lesson 36
- Clustering Approaches: SOM, K-Means, Hierarchical Clustering
- Homework: This homework is cancelled. Concentrate on your project.
  1. Hand in a critique of the Golub et al. paper.
  2. Consider the data {1, 2, 5, 6, 9, 10, 11}, i.e. these 7 points.
     a) Do a hierarchical clustering of this data (draw the tree).
     b) Show the steps of k-means when k = 2 and the two initial centroids are the points 1 and 2. Again, draw the pictures.
  3. Note: drawings should be done by hand.
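The k-means steps in the exercise above can be traced with a short Python sketch. The function name `kmeans_1d` and the tie-breaking rule (a point equidistant from two centroids goes to the lower-numbered one) are assumptions made for illustration; the exercise itself does not specify how ties are handled.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Run k-means on 1-D data and record every iteration.

    Returns (final centroids, final clusters, history), where history is a
    list of (centroids used, resulting clusters) pairs, one per iteration.
    """
    centroids = [float(c) for c in centroids]
    history = []
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        # (ties go to the lower index; an assumed convention).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            clusters[nearest].append(p)
        history.append((list(centroids), clusters))
        # Update step: each centroid moves to the mean of its cluster;
        # an empty cluster keeps its old centroid.
        new_centroids = [
            sum(c) / len(c) if c else centroids[k] for k, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters, history


if __name__ == "__main__":
    data = [1, 2, 5, 6, 9, 10, 11]
    final, groups, steps = kmeans_1d(data, [1, 2])
    for n, (cents, cl) in enumerate(steps, start=1):
        print("iteration", n, "centroids", cents, "clusters", cl)
```

Each printed iteration corresponds to one picture in a hand-drawn trace: the centroids used for assignment and the clusters they produce.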
Week 7: Phylogeny analysis
- No assignment: work on project
- Read Chapter 6
Week 8: Structure Prediction
- No assignment: work on project
- Read Chapter 8
- RNA structure prediction
- DNA structure prediction
- Protein Folding: contact maps, threading
Week 9: November 26
- Evaluating Weka algorithms for analyzing SNPs
  - Eric Wang, Scott Murphy, & Peter Hebben
- Identify candidate response elements to Androgen Receptor
  - Chris Wasserman, Chin-Yi Chu, & Greg Kodama
Happy Thanksgiving.
Week 10: December 3
- Analyze life-cycle gene expression data for Chlamydia (1000 ORFs) to determine which genes are responsible for the transformation from RB (reticulate body, non-infectious) to EB (elementary body, infectious). Next, analyze upstream regions for regulatory binding sites.
  - Johnny Akers, Jianlin Cheng, & Arlo Randall
- Study of protein-protein interactions in yeast
  - Kevin Lin, Yimeng Dou, & Haiying Deng
December 5
- Correlate gene expression data with protein-protein interaction data
  - Lin Wu & Yu-Chyuan Su