Assignment 1
- Part 1. Write a Java or Python program to find candidate genes in a DNA sequence of about 20,000 base pairs. The file is called MysteryDNA.txt and is in the Files directory of masterhit\instructional\ics-174\files. A candidate gene is a subsequence that begins with atg, ends with taa, tag, or tga, has no other stop codon interior to the sequence in the same open reading frame, and has a length of at least 60 codons. Note and warning: the beginning atg establishes the reading frame. You only check codons whose distance from the start codon is a multiple of 3, i.e. codons in the proper reading frame. There may be other atg's in the sequence. For each candidate gene, the program should output four numbers, one per line. This quadruple consists of:
- The start codon position, counted in nucleotides from 1, as most humans (and biologists) count. It would be senseless to count codons here, since no reading frame is established before the start codon.
- The length in codons, including the start and stop codons. Again, count from 1.
- The probability of seeing the candidate gene. This is the product of the probability of a start codon, the probability that every interior codon is a non-stop codon, and the probability of a stop codon. Here we assume (and it is not true) that each codon is equally likely.
- The expected number of such candidate genes in the entire given DNA sequence. This is the product of the number of possible places the candidate gene could occur and the probability of the candidate gene.
You need only deal with candidate genes in the given DNA sequence and need not worry about candidates on the other strand (the reverse complement).
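As a rough illustration of the definition above, the following Python sketch scans one strand for candidate genes and computes the two probability figures. It is one possible reading of the assignment, not a reference solution. The function names (`find_candidate_genes`, `gene_probability`, `expected_count`) are hypothetical. The probability assumes all 64 codons are equally likely, so a start codon has probability 1/64, an interior codon is a non-stop codon with probability 61/64, and a stop codon has probability 3/64. Counting the "possible places" as n - 3L + 1 for a sequence of n nucleotides and a gene of L codons is also an assumption.

```python
STOP_CODONS = {"taa", "tag", "tga"}
MIN_CODONS = 60  # minimum length from the assignment text


def find_candidate_genes(dna, min_codons=MIN_CODONS):
    """Return (start_position, length_in_codons) pairs; positions count from 1."""
    dna = dna.lower()
    genes = []
    for i in range(len(dna) - 2):
        if dna[i:i + 3] != "atg":
            continue
        # Walk codon by codon in the reading frame set by this atg.
        for j in range(i + 3, len(dna) - 2, 3):
            if dna[j:j + 3] in STOP_CODONS:
                length = (j + 3 - i) // 3  # includes start and stop codons
                if length >= min_codons:
                    genes.append((i + 1, length))
                break  # the first in-frame stop codon ends this frame
    return genes


def gene_probability(length):
    """Probability of one start, (length - 2) non-stop codons, and one stop.

    Assumes every one of the 64 codons is equally likely.
    """
    return (1 / 64) * (61 / 64) ** (length - 2) * (3 / 64)


def expected_count(n, length):
    """Expected number of such genes in n nucleotides.

    Assumption: a gene of `length` codons can start at n - 3*length + 1 places.
    """
    places = max(n - 3 * length + 1, 0)
    return places * gene_probability(length)
```

Reading MysteryDNA.txt into a single string (with whitespace removed) and printing the start position, length, `gene_probability(length)`, and `expected_count(len(dna), length)` for each pair would then produce the requested quadruples.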
- Part 2. In a separate document (text or Word), submit the following.
- A pseudo-code description of the algorithm.
- The output of your program.
- A worst-case time-complexity analysis, i.e. give the O notation for the time complexity.
- A description of a string of length n where the time complexity is O(n^2).
- A description of a string of length n where the time complexity is O(n).
- The reverse complement of the start and stop codons.
- A description, similar to the one in part 1, of what you would search for if you wanted to find candidate genes on the complementary DNA strand.
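For the last two items, a reverse complement can be computed mechanically: complement each base (a with t, c with g) and reverse the result. The short Python sketch below shows one way to do this; the name `reverse_complement` is hypothetical, and the function assumes the input contains only the letters a, c, g, and t.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.lower().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    for codon in ("atg", "taa", "tag", "tga"):
        print(codon, reverse_complement(codon))
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.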