Bio 465 Bioinformatics



The project provides an opportunity for you to explore a topic of your own choosing. You may choose to expand one of the labs we have worked on in class, or you can pursue something totally different.

2012 Projects

Group Number | Students | Project | Milestones
1 | Andrew Noyce | LC-MS peak simulator | 2/15 Implement Intensity Predictor.
2/19 Implement Intensity variance model.
3/15 Implement M/Z variance model.
3/30 Implement White Noise Model.
4/10 Write Paper.
2 | Ben Ainscough, Shane Dooley, David Patty, Krista Klinger, Mo Lee, Matt Bailey | Structural genes Alzheimer disease | (2/3) Web tool: fix write Batch SAS script, make a results page. Create 3+ test cases for extracting overlapping SNPs from NCBI and our data.
(2/10) Web tool: fix JavaScript functionality in Chrome. Finish script for merging data sets.
(2/17) Design analysis tool: Create a program that grabs top p-values in individual PLINK runs. Finish script for getting SNPs from a gene region.
(2/24) Test analysis tool. Have Gene Regions for all 190 Proteins Mapped.
(3/9) Begin running analysis on Cache, RBM, and ADNI data. Integrate structural gene analysis portion into web tool & start running analysis.
(3/30) Finish running analysis and start writing results.
(4/16) Have report finished (or earlier).
3 | Ijesh Giri, Michelle Farmer | Bacterial Genomics | 02/06 - Further testing of the SNP calling program is necessary.
02/15 - Assembly on the remaining bacterial data (Burkholderia mallei and Burkholderia pseudomallei).
02/30 - Once we have the SNP calling program fully functional, we need to expand its functionalities. We are going to assign each SNP mutation to a Gene Ontology term. In the end, we are going to have something like a pie chart showing what cellular functions are being affected.
03/30 - Then we can start integrating the pipelines together, including some added functionality for automatically running basic analyses: phylogeny (RAxML), network (SplitsTree, TCS), diversity (VariScan), recombination rates (LDhat), and dN/dS (PAML).
04/10 - We are going to consolidate everything with a pipeline and a text command file. This would be similar to programs like PAML in which you need to provide a text file with the options you want to apply to your analysis to automate the process from the raw data to final results.
Tentative - If we can finish these, then we can further check the reads that aren't used, probably by running BLAST on the unused reads.
4 | Stan Fujimoto, Justin Page | Cotton Genomics | 2/3 Gather pi values for all gene regions (exon, intron, UTR) by accession and genome.
2/10 Select pi values, excluding multi-copy genes and unknown coding vs non-coding regions.
2/17 Communicate with Emmanuel (Iowa collaborator) to coordinate further efforts on 3 papers.
3/2 Finalize analysis for genome-comparison (second paper).
3/16 Finalize analysis for domestication-comparison (third paper).
3/23 Draft Materials and Methods section for genome-comparison paper.
5 | Krijan Shrestha, Kason McEwen | iPad app | 1/15 - Purchase Apple developer account. Create development and distribution profiles for the app. (Kason)
1/28 - Create custom drag bar components that work better with the iPad and incorporate them in the simulation activity. Also use in future for scrollable content pages. (Kason)
2/5 - Find and develop good sound effects/music. (Krijan)
2/5 - Create buttons & design components (swipeable scroll content) (Kason)
2/19 - Design content (formatting, look & feel, text, images, etc.) (Krijan)
2/25 - Reprogram user interface & navigation - incorporate content. (Kason)
3/6 - Functionality for Natural selection game - (Menu Screen, Levels, Timer, Score, Allele Frequencies, etc.) (Kason)
3/15 - Natural Selection Game artwork/animation improvements. (Krijan/Kason)
3/15 (optional if extra time) - Add selection coefficients to population genetic simulation. (Kason)
3/20 - Submit to App Store for rejection or approval! (Kason)
4/10 - Have a polished report and presentation prepared. (Krijan/Kason)
6 | Steven Nevers, David Brandt, Matt Biggs | Improving Supervised Decision Support for Diagnosing ADAMTS13 Deficiency | Feb 3, 2012 Meet with ARUP/Include qualifications on web tool.
Feb 24, 2012 Update web decision tree to allow missing values.
Mar 2, 2012 Begin work with Neural Networks.
Mar 16, 2012 Complete Neural Networks work.
Mar 23, 2012 Intermediate work with classification trees.
Mar 30, 2012 Complete work with classification trees.
Apr 6, 2012 Finalize project.
7 | Kimberly Holcombe, Eden Jensen | Down Syndrome gene expression | Eden - Research Transcription Factors
Eden - Research best methods to find the start of genes
Kimberly - Code program to use scoring matrix to find likely transcription factor sites (3/12)
Kimberly - Improve code to use gene start spots (3/19)
Eden, Kimberly - Testing, debugging, and web interface (4/2)
Eden, Kimberly - Test with James and Adam's project, finish paper (4/8)
8 | James Jensen, Adam Rogers | Integrated Microarray and Pathway Analysis Computational Tool | 2/3 James: Get current start and target gene lists to agree with identifiers in microarray data; check literature. Adam:
2/10 James: Rework Dijkstra script to include edge objects. Adam: Figure out the best way to compute the MIM or ARACNE matrix (possibly Airnet, or R on supercomputer).
2/17 James: Polish up last semester's tool; check literature. Adam:
2/24 James: Find ways to speed up last semester’s tool; check literature. Adam: Figure out the best way to handle different gene identifiers (Ajax query of gene ids in data).
3/2 James: integrate last semester’s tool with Dijkstra script; check literature. Adam:
3/9 James: Write optimized C++ version of all scripts. Adam: ID matching with Pathgen.
3/16 James: Write optimized C++ version of all scripts. Adam: Figure out how to query Pathgen with edges.
3/23 James: issues and problems. Adam:
3/30 James: find ways to quantify and compare accuracy. Adam: Website and display.
4/6 James: any last adjustments. Adam:
4/16 (Project presentation). James: prepare for presentation. Adam:
9 | Ken DeCelle | Genome Assembly | 1/30: Proposal Due
2/4: 454 Codon Documentation Read
2/18: Parser for 454 completed
3/3: Rough Algorithm for Heterotids implemented
3/17: Fine-Tuned Algorithm implemented
3/31: Algorithm thoroughly tested with Raspberry data
4/7: Final Project Report created
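
Several milestones above involve small, self-contained algorithms. As an illustration of Group 7's milestone "use scoring matrix to find likely transcription factor sites", here is a minimal sketch of sliding a position-specific scoring matrix along a sequence. The matrix values, the threshold, and the names `EXAMPLE_MATRIX` and `scan_sequence` are made up for illustration and are not taken from the group's project; the sketch also assumes the sequence contains only A, C, G, and T.

```python
# Hypothetical scoring matrix: one dict per motif position, mapping each
# base to a score. These values are invented and favor the motif "TATA".
EXAMPLE_MATRIX = [
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": 2, "C": -1, "G": -1, "T": -1},
]


def scan_sequence(sequence, matrix, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    # Slide a window of motif width across the sequence.
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the per-position score of each base in the window.
        score = sum(matrix[pos][base] for pos, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


hits = scan_sequence("GGTATAAC", EXAMPLE_MATRIX, threshold=6)
print(hits)
```

A real tool would typically derive the matrix from known binding sites (for example as log-odds scores) and choose the threshold empirically.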

2011 Projects (This will give you an idea of how the projects will work)

Group Number | Students | Project | Milestones
1 | Chris Conley | Minimizing LC-MS peak detection error through the Kalman filter | 2/8 - Improve ghost scan
2/21 - Manually validated code
2/28 - Optimize code
3/18 - Extended Kalman Filter
3/23 - Define several nonlinear models of a peak
3/30 - Curated data set
3/30 - Clean up, speed up code
4/5 - Improve intensity estimation
2 | Chris Decker | Polyomavirus Phylogeny and Molecular Evolution | 3/16 - BEAST data set complete
3/20 - Run and obtain preliminary BEAST results
3/18 - Include BEAST results in paper
4/1 - Complete report
3 | Nate Jensen | Mutual Exclusivity between CRISPR Loci and Temperate Bacteriophages | 2/2 - Compile list of organisms
2/11 - Populate database
2/18 - Run PhageFinder
2/22 - Compare spacer database with prophage database
4 | Sukhbat Tumur-Ochir | Improving and benchmarking asynchronous inference of regulatory network algorithm against different gene network inference algorithms
5 | Alan Colver | Pancrustacean tree of life: origins and innovations | 2/12 - Modify an existing Perl program to scan GenBank
3/21 - Run alignments on found sequences
4/8 - Have several possible phylogenetic trees
4/8 - Wrap up remaining discrepancies in data
6 | YoungHoon Gim | Raspberry genome assembly | 3/11 - Use gene finding programs
3/18 - Write second draft
3/25 - Set up GBrowse and post annotated genes
4/3 - Finish writing program for connecting scaffolds using 5k paired end reads
4/10 - Finish running MAKER for at least 100 scaffolds and update GBrowse
4/13 - Finish writing
7a | Benjamin Haynes, Leilani Williams | Assembly and Analysis of the Stomatopod Vision Transcriptome | 2/28 First draft of manuscript
3/18 Second draft of manuscript
4/1 Assembly completed
4/4 Transcriptome annotation completed
7b | Sunni Swain | Candidate Genes for Vision in Invertebrates | 2/14 Transcriptome data on FTP site (Sunni)
2/25 Write first draft
3/2 Gather annotated gene list and run through CAESAR
3/7 Verify top ranked genes by research
3/11 BLAST verified genes against T. castaneum, D. pulex, and D. melanogaster
3/18 Write second draft
4/1 Finalize project
8 | Gian Molina, Andrew Holm, Dohyup Kim | Exploring Phylogenetic Relationships and Molecular Evolution of Superorder Peracarida | 2/1 Sequences mined from GenBank
3/25 Sequences cleaned and verified
4/1 Transcriptome annotation completed

Project deliverables

The following deliverables are associated with the project:
  1. The project proposal includes:
    • Title (Be descriptive here)
    • Abstract (A paragraph introducing why the work is important and what you hope the results will be. Include the hypothesis that you are going to test as part of your project. If this is a software project like putting a GUI on an existing piece of software your hypothesis may be something like "A GUI will make the software easier to use", but it still provides a focus for your work.)
    • Related Work (You should reference similar projects and should have spent some time looking for methodologies that have been tried in the past)
    • Outline (A sentence or two for each section describing what you expect to include there. Each section should point back to the hypothesis and you will want to show how it relates in your outline.)
    • Milestones (Give specific dates for when you are going to have things done. You should have an approximate completion date for each task included in the outline.)
    • Conclusions (A sentence or two describing what you hope your findings will be)
  2. The final project report includes:
    • Title
    • Abstract (One paragraph describing your hypothesis and contribution)
    • Introduction (Start with the big picture, "There is world hunger", move down to "Better DNA analysis will provide more bread", and show how your contribution will help in solving a bigger problem.)
    • Related work (Show how your work relates to what has been done. You may not be doing anything that is vastly different from existing work, but you should be aware of the current literature)
    • Materials and Methods (What did you do and how did you do it)
    • Results (Even if things didn't work out the way you expected, describe the results and how they support or contradict your hypothesis.)
    • Conclusions (Review how your work demonstrates your hypothesis)

Project Ideas

  1. SOLiD Next-Generation Assembly with Matt Dyer at Applied Biosystems. This project might give you a connection to Applied Biosystems if you are interested in working there in the future.

    Twelve samples from the host-pathogen system Glycine max – Phytophthora sojae were sequenced on two slides: one slide contained four coinfected samples and the other contained eight uninfected samples (four host and four pathogen). Samples were sequenced using SOLiD™ V3. G. max has a genome of about 940 Mb with about 75k predicted transcripts, and P. sojae has a genome of about 77 Mb with about 25k predicted transcripts.


    • Compare RNA mapping tools: TopHat/Cufflinks/Bowtie, BioScope
    • Use available tools to analyze the data: GeneSpring, Spotfire, Bioconductor, R
      • Detect novel transcripts (from non-annotated exons)
      • Identify differentially expressed genes in infected vs. uninfected samples
      • Detect pathways that are differentially expressed in infected vs. uninfected samples
      • Cluster genes based on expression patterns
      • Study the correlation of coverage bias and error rate with genomic features, and find the most relevant genomic features (GC%, local 5- to 8-mers?, etc.) with an HMM or other tools
    • If needed, write new tools to answer the above questions and make them available on public resources like SEQanswers. Ideas for new tools also include:
      • A consensus sequence generator that uses the anchored-alignments
      • Converting color-space to base-space using the local alignments
      • Detecting fusion transcripts
      • Counting module that converts mapped reads into gene counts, but uses uniquely mapped reads to weight non-uniquely mapped reads
    • Compare results with microarray data for the same samples *
      • Identify low-expressed genes that are not detected by microarrays
      • Show that you can perform the same types of analysis on sequencing data that you can with microarrays, and show HOW to do that - build tools to facilitate this if necessary
      • Identify additional information you get from sequencing that you could not get from microarrays
  2. Raspberry Assembly. We have about 30X coverage of the raspberry genome, and Dr. Udall would like to publish the genome in a journal in the next 6 months. This will be the first genome sequenced at BYU, and your name could be on the paper. Dr. Udall indicated that he would like to hire some of the students who work on this project to finish the assembly during the spring and summer. Dr. Udall will work with these groups and provide lots of help on what to do.
    • Assembly - Put together the best assembly possible with existing tools (Newbler, Velvet, CAP3, Celera). Also perform assisted assembly using the related genomes strawberry and peach.
    • Repeat Regions - Characterize repeat regions. BLAST against plant genome repeat region databases. BLAST against self to find common patterns. Look for strings that are overrepresented vs. related genomes.
    • Gene Prediction - Use FGENESH, BESTORF, TSSP_TCM, and Glimmer to find genes, and then produce statistics on the number of genes and the average length of genes. Produce a functional characterization of raspberry genes by running BLAST against related genomes. Look for phenolic synthesis genes.
    • Annotation of the plastid genomes (new reference)

      References include
      The Genome Assembly Archive: A New Public Resource
      Lineage-Specific Biology Revealed by a Finished Genome Assembly of the Mouse
      Widespread genome duplications throughout the history of flowering plants
      Analysis of recent segmental duplications in the bovine genome
      Recall that dog and human genomes are about 10x the size of raspberry ... and that plants are different.
      Sensitive and accurate detection of copy number variants using read depth of coverage
      Mapping DNA structural variation in dogs
      The genomic architecture of segmental duplications and associated copy number variants in dogs
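
The counting-module idea listed under project idea 1 (convert mapped reads into gene counts, using uniquely mapped reads to weight non-uniquely mapped reads) can be sketched as follows. This is only one simple proportional scheme, written under assumptions: the input format (a mapping from read ID to the list of distinct genes it aligns to) and the name `weighted_gene_counts` are hypothetical, and the even split used when no candidate gene has unique reads is a design choice, not the behavior of any particular published tool.

```python
from collections import defaultdict


def weighted_gene_counts(read_hits):
    """Return fractional gene counts from read -> list-of-genes alignments.

    read_hits: dict mapping a read ID to the list of distinct gene IDs it
    aligns to (hypothetical input format chosen for this sketch).
    """
    # Pass 1: count reads that map to exactly one gene.
    unique = defaultdict(float)
    for genes in read_hits.values():
        if len(genes) == 1:
            unique[genes[0]] += 1.0

    # Pass 2: split each multi-mapped read across its candidate genes in
    # proportion to their unique counts.
    counts = defaultdict(float, unique)
    for genes in read_hits.values():
        if len(genes) < 2:
            continue
        total = sum(unique.get(g, 0.0) for g in genes)
        for g in genes:
            if total > 0:
                counts[g] += unique.get(g, 0.0) / total
            else:
                # Assumption: with no unique evidence, split evenly.
                counts[g] += 1.0 / len(genes)
    return dict(counts)


example_hits = {
    "r1": ["geneA"], "r2": ["geneA"], "r3": ["geneA"],
    "r4": ["geneB"],
    "r5": ["geneA", "geneB"],   # split 3:1 between geneA and geneB
    "r6": ["geneC", "geneD"],   # no unique evidence, split evenly
}
example_counts = weighted_gene_counts(example_hits)
print(example_counts)
```

Each read contributes a total of exactly one count, so the summed gene counts equal the number of mapped reads. A fuller version might also normalize unique counts by gene length before computing the weights.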