Bioinformatics

Bio 465 Bioinformatics

BYU | Bioinformatics | CS Dept.
Bio 465
HMM Tutorial This tutorial is intended to give you a feel for working with Hidden Markov Models.

We will be using the ghmm package for this tutorial. It allows you to predict state transitions and train the model on real data. The documentation is sparse, but hopefully this tutorial will help you to be successsful. You can find the source distribution on psoda.cs.byu.edu in /usr/local/src/ghmm-0.7.0 and /usr/local/src/ghmmwrapper-0.7.

So, lets work through an example or two and then let you try some things on your own.

  1. Before we get started, you will need to set the LD_LIBRARY_PATH (export LD_LIBRARY_PATH=/usr/local/lib)
  2. The ghmm toolkit has python bindings and we will be using them for this tutorial. Dont worry about learning python. It is a lot like perl and we will guide you through the syntax. To start things up, ssh to psoda4.cs.byu.edu and type "python".
  3. You should import ghmm and to save time in typing ghmm. before any class or funcction, you can import all of the contents.
    import ghmm
    from ghmm import *
    help('ghmm')
    
  4. Take a look at the help, it is the most complete documentation
  5. Now tell the model that you are using dna data
    dna = ['a','c','t','g']
    sigma = Alphabet(dna)
    print sigma
    
  6. Now create the transition matrix A. This translates to the following:
    • State 0 has a 0.9 probability of staying in state 0, and a 0.1 chance of moving to state 1
    • State 1 has a 0.3 probability of moving to state 0, and a 0.7 probability of staying in state 1.
    A = [[0.9, 0.1], [0.3, 0.7]]
    
  7. Now set up the emmission probabilities. The first one is for 'a', the second for 'c', and so forth.
    normal = [.25,.15,.35,.25]
    island = [.25,.25,.25,.25]
    B=[normal,island]
    
  8. Now set up the initial probabilities of being in each state. We will assume that they are both 0.5 for now.
    pi = [0.5] * 2
    
  9. Create the HMM from the matrices
    m=HMMFromMatrices(sigma,DiscreteDistribution(sigma),A,B,pi)
    print m
    
  10. You can generate a sequence using this model
    obs_seq = m.sampleSingle(200)
    print obs_seq
    
  11. But what you really want is to train from real data. ghmmm has a funky xml format for their data. You can create a file called "dna.py" with the following in it.
    train_seq = EmissionSequence(sigma, [
    'g', 'c', 'c', 'g', 'g', 'c', 'g', 'c', 'g', 'c', 'g', 'c', 'c', 'g', 'c', 'g', 'c', 'g', 'c', 'g', 'c', 'c', 'g', 'c', 'g', 'c', 'c', 'c',
    't', 't', 't', 't', 't', 't', 'a', 't', 'a', 'a', 'a', 'a', 't', 't', 't', 'a', 't', 'a', 't', 'a', 'a', 'a', 't', 'a', 't', 't', 't', 't',
    'g', 'c', 'c', 'g', 'g', 'c', 'g', 'c', 'g', 'c', 'g', 'c', 'c', 'g', 'c', 'g', 'c', 'g', 'c', 'g', 'c', 'c', 'g', 'c', 'g', 'c', 'c', 'c',
    't', 't', 't', 't', 't', 't', 'a', 't', 'a', 'a', 'a', 'a', 't', 't', 't', 'a', 't', 'a', 't', 'a', 'a', 'a', 't', 'a', 't', 't', 't', 't'
    ])
    
    You can also use the "char for char" construct.
    seqList = [ char for char in 'atgtgtgactctagctacgta' ]
    
    I know it is different from normal fasta, but you guys are perl experts, so you should be able to convert fasta to this format and import it from a file.
    m.baumWelch(train_seq)
    
  12. You can generate a sequence using this model
    obs_seq = m.sampleSingle(200)
    print obs_seq
    
  13. And you can predict the states based on data
    m.viterbi(obs_seq)
    
  14. Lets generate something else to learn from. This has 20 c's, followed by 10 t's, followed by 40 t's
    seq = EmissionSequence(sigma, ['c'] * 20 + ['t'] * 10 + ['c'] * 40)
    
  15. Train the model based on this sequence
    m.baumWelch(seq)
    m.viterbi(seq)
    
  16. You can generate a sequence using this model
    obs_seq = m.sampleSingle(800)
    print obs_seq
    
  17. And you can predict the states based on data
    m.viterbi(obs_seq)