Assignment #3:
  1. If the entire human genome were cleaved into a single set of small nonovelapping fragments, we could not determine the genome sequence by sequencing the fragments. Explain why this is the case. (5.1.2 question 7)
  2. Use the program fragmenter.pl in ~clement/cs418/ch5 to generate fragments from a long sequence of data that you download from genbank. Vary the minimum and maximum fragment size, the fold coverage, the mutation/misread rate for a set of fragments, and the mutation/misread rate for a single fragment.
  3. Use velvet to assemble the fragments

    Using Velvet
    While the best instructions on velvet are in the documentation, for basic use all you need is the following (Note: run this on psoda64.cs.byu.edu - it won't work on psoda4 or dna or the other machines):

    bash$  velveth outputDirectory 21 fragments.fa
    bash$  velvetg outputDirectory
    
    This will create outputDirectory and put the assembled contigs into outputDirectory/contigs.fa.
    The '21' has to do with the size of the hash (sec. 5.1 in the Velvet manual), and there are other options for optimizing the algorithm on short and long fragments with -short and -long. These two commands should get you along for this project, though please feel free to look at the documentation and play around with velvet's options if you want.
  4. Determine the quality of the assembly when compared to the original sequence. You might want to try clustalw, blast or your own method of looking at the assembly. Remember that some of the fragments were generated for the reverse strand, so you should include the reverse compliment in your comparison.
  5. Compare the quality of the assembly for each setting of the fragmentation parameters.
    Working Example
    The following example should get you going. Copy everything from ~clement/cs418/ch5 into your own directory so you can change things.
    1. Run fragmenter.pl with the following settings
      • Minimum fragment size: 10
      • Maximum fragment size: 15
      • Fold coverage: 1
      • Misread rate for set of fragments: 0
      • Misread rate for a single fragment: 0
    2. You will see coverage values at the bottom of the file. Delete those lines and write the remaining fasta entries to a file frag.fa. (Try grep: grep -v ^coverage fragout.txt > frag.fa )
    3. run "velveth velvetout 5 -short frags.fa"
    4. run velvetg velvetout
    5. In the velvetout directory, you will find a "contigs.fa" file. Edit this file and include the original sequence at the top.
      >original
      actaggacattagagacata
      gggtttatacaggattaaaa
      >NODE_1_length_10_cov_2.700000
      ACATAGGGTTTATA
      >NODE_2_length_8_cov_4.750000
      ACATTAGGACAT
      >NODE_3_length_8_cov_3.500000
      GATTACAGGATT
      
    6. Now run clustal to align the sequences so you can see the quality of the assembly "clustalw2 contigs.fa". You should find the output in contigs.aln
      CLUSTAL 2.0.12 multiple sequence alignment
      
      
      original                           ACTAGGACATTAGAGACATAGGGTTTATACAGGATTAAAA
      NODE_2_length_8_cov_4.750000       ------ACATTAG-GACAT---------------------
      NODE_1_length_10_cov_2.700000      ------ACATAGGGTTTATA--------------------
      NODE_3_length_8_cov_3.500000       -------GATTACAGGATT---------------------
                                                 **        *