CSE182-L9: Modeling Protein Domains Using HMMs







Profiles Revisited

  • Note that profiles are a powerful way of capturing domain information

  • Pr(sequence x | profile P) = Pr(x_1|P_1) · Pr(x_2|P_2) · …

  • Pr(AGCTTGTA | P) = 0.9 · 0.2 · 0.7 · 0.4 · 0.3 · … (see the sketch after this list)

  • Why do we want a different formalism?
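
As a quick illustration of the product rule above, here is a minimal Python sketch; the 4-column DNA profile and all of its probabilities are hypothetical placeholders, not values from the slides:

```python
# Sketch: score a sequence against a profile by multiplying
# per-column probabilities: Pr(x|P) = Pr(x_1|P_1) * Pr(x_2|P_2) * ...
# The 4-column DNA profile below is hypothetical, for illustration only.

profile = [  # one dict per column: Pr(base | column)
    {"A": 0.9, "C": 0.05, "G": 0.025, "T": 0.025},
    {"A": 0.1, "C": 0.2, "G": 0.6, "T": 0.1},
    {"A": 0.05, "C": 0.7, "G": 0.05, "T": 0.2},
    {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
]

def profile_prob(x, profile):
    """Pr(x | P): product of the column-wise emission probabilities."""
    assert len(x) == len(profile)
    p = 1.0
    for base, column in zip(x, profile):
        p *= column[base]
    return p

print(profile_prob("AGCT", profile))  # 0.9 * 0.6 * 0.7 * 0.4 = 0.1512
```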



Profiles might need to be extended

  • A domain might only be a part of a sequence, and we need to identify and score that part

    • Pr[x|P] = max_i Pr[x_i … x_{i+l-1} | P] (a sliding-window sketch follows after this list)
  • A sequence containing a domain might also contain indels within the domain

    • Ex: AACTTCGGA
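
The windowed maximum from the first bullet is straightforward to implement when indels are not allowed; a minimal sketch, with a hypothetical 3-column DNA profile whose numbers are placeholders:

```python
# Sketch: locate a length-l domain inside a longer sequence by taking
# the best-scoring window over all start positions (no indels yet):
#   Pr[x|P] = max_i Pr[x_i ... x_{i+l-1} | P]
# The 3-column profile is hypothetical, for illustration only.

profile = [
    {"A": 0.8, "C": 0.1, "G": 0.05, "T": 0.05},
    {"A": 0.1, "C": 0.6, "G": 0.1, "T": 0.2},
    {"A": 0.05, "C": 0.05, "G": 0.1, "T": 0.8},
]

def window_prob(window, profile):
    p = 1.0
    for base, column in zip(window, profile):
        p *= column[base]
    return p

def best_match(x, profile):
    """Return (best probability, best start index) over all windows."""
    l = len(profile)
    return max((window_prob(x[i:i + l], profile), i)
               for i in range(len(x) - l + 1))

prob, start = best_match("AACTTCGGA", profile)
print(prob, start)  # the window "ACT" at index 1 scores 0.8*0.6*0.8 = 0.384
```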


Profiles & Indels

  • Indels can be allowed in matching sequences to a profile.

  • However, how do we construct the profile in the first place?

    • Without indels, multiple alignment of sequences is trivial
    • Multiple alignment in the presence of indels is much trickier; similarly, construction of profiles is harder.
    • The HMM formalism allows you to do both: alignment of a single sequence to the HMM, and construction of the HMM from unaligned sequences.


Profiles and Compositional signals

  • While most protein domains have position-dependent properties, other biological signals do not.

  • Ex:

    • CpG islands,
    • Signals for protein targeting, transmembrane regions, etc.


Protein targeting

  • Proteins act in specific compartments of the cell, often distinct from where they are ‘manufactured’.

  • How does the cellular machinery know where to send each protein?



Protein targeting

  • In 1970, Günter Blobel showed that proteins have an N-terminal signal sequence which directs them to the membrane.

  • Proteins have to be transported to other organelles: nucleus, mitochondria,…

  • Can we computationally identify the ‘signal’ which distinguishes the cellular compartment?



Predicting Transmembrane Regions

  • For transmembrane proteins, can we predict the transmembrane, outer, and inner regions?



The answer is HMMs

  • Certainly, it is not the only answer! Many other formalisms, such as neural networks, have been proposed, with varying degrees of success.



Revisiting automata

  • Recall that profiles were introduced as a generalization of automata

  • The OR operator is replaced by a distribution. However, the * operator (Kleene’s closure) has no direct analog.

  • Instead we introduce probabilities directly into automata.



Profile HMMs

  • Problem: Given a sequence x and a profile P of length l,

    • compute Pr[x|P] = max_i Pr[x_i … x_{i+l-1} | P]
  • Construct an automaton with a state (node) for every column

  • Extra states at the beginning and end.

  • In each step, the automaton emits a symbol and moves to a new state. Unlike finite automata, the emission and transition are governed by probabilistic rules.



Emission Probability for states

  • Each state emits a base (residue) with a probability defined by the profile

  • Thus Pr(A|s_1) = 0.9, Pr(T|s_4) = 0.4, …



Transition probabilities

  • The probability of emitting a base depends upon the state.

  • An initial state π_0 is defined. We move to subsequent states using probabilistic transitions.

  • Ex: Pr(s_0 → s_1) = Pr(s_0 → s_0) = 0.5, Pr(s_1 → s_2) = 1.0



Emission of a sequence

  • What is the probability that the automaton emitted the sequence x_1..x_n?

  • The problem becomes easier if we know the sequence of states π = π_0..π_n that the automaton was in.



Computing Pr(x|π)

  • Example: Consider the sequence GGGAACTTGTACCC and the state path below. π_0 = 0 is the initial state, so state π_i emits x_i for i ≥ 1 (the “-” is a placeholder under the non-emitting initial state):

    • x = -  G  G  G  A  A  C  T  T  G  T  A  C  C  C
    • π = 0  0  0  0  1  2  3  4  5  6  7  8  9  9  9
  • Pr(x_i|π_i) = .25 .25 .25 .9 .4 .7 .4 .3 1 .5 1 .25 .25 .25

  • Pr(π_i → π_i+1) = .5 .5 .5 .5 1 1 1 1 1 1 1 1 1 1
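
Multiplying all of the listed probabilities together gives the probability of this particular run of the automaton; a minimal sketch, with the per-step values copied straight from the bullets above:

```python
# Sketch: given the state path pi, the probability of this automaton run
# is the product of one transition and one emission probability per step.
# The per-step values below are copied from the example above.
from math import prod  # Python 3.8+

emit  = [.25, .25, .25, .9, .4, .7, .4, .3, 1, .5, 1, .25, .25, .25]
trans = [.5, .5, .5, .5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(prod(e * t for e, t in zip(emit, trans)))  # probability of the run
```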



HMM: Formal definition

  • M = (Σ, Q, A, E), where

  • Σ is the alphabet (Ex: {A,C,G,T})

  • Q: set of states, with an initial state q_0 and, perhaps, a final state (Ex: {s_0,…,s_9})

  • A: transition probabilities, A[i,j] = Pr(s_i → s_j) (Ex: A[0,1] = 0.5)

  • E: emission probabilities, e_i(b) = Pr(b|s_i) (Ex: e_3(T) = 0.7)
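
To make the four-tuple concrete, here is one possible Python container for M; the three-state toy model and all of its numbers are placeholder assumptions, not the profile from the earlier slides:

```python
# Sketch: the four-tuple M = (Sigma, Q, A, E) as a small Python container.
# A[i][j] = Pr(s_i -> s_j); E[i][b] = Pr(emit b | s_i).
from dataclasses import dataclass

@dataclass
class HMM:
    sigma: list      # alphabet, e.g. ["A", "C", "G", "T"]
    n_states: int    # |Q|; states are 0..n_states-1, with 0 the initial state
    A: list          # transition matrix, A[i][j] = Pr(s_i -> s_j)
    E: list          # emission tables, E[i] maps a symbol to its probability

# A hypothetical three-state toy model (all numbers are placeholders):
M = HMM(
    sigma=["A", "C", "G", "T"],
    n_states=3,
    A=[[0.5, 0.5, 0.0],
       [0.0, 0.0, 1.0],
       [0.0, 0.0, 1.0]],
    E=[{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
       {"A": 0.9, "C": 0.05, "G": 0.025, "T": 0.025},
       {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}],
)
```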



Pr(x|M)

  • Given the state sequence π, we know how to compute Pr(x|π).

  • We would like to compute

    • Pr(x|M) = max_π Pr(x|π)
  • Let |x| = n and |Q| = m, with s_m being the final state.

  • As is common in dynamic programming, we consider the probability of generating a prefix of x.

  • Define v(i,j) = probability of generating x_1..x_i and ending in state s_j.

  • Is it sufficient to compute v(i,j) for all i,j?

    • Yes, because Pr(x|M) = v(n,m)


Computing v(i,j)

  • If the previous state (at time i-1) was s_l,

    • then v(i,j) = v(i-1,l) · A[l,j] · e_j(x_i)
  • The previous state must be one of the m states.

  • Therefore v(i,j) = max_{l∈Q} { v(i-1,l) · A[l,j] } · e_j(x_i)



The Viterbi Algorithm

  • v(i,j) = max_{l∈Q} { v(i-1,l) · A[l,j] } · e_j(x_i)

  • We take logarithms (and, if needed, an approximation) to maintain precision, since a product of many small probabilities quickly underflows:

    • S(i,j) = log(e_j(x_i)) + max_{l∈Q} { S(i-1,l) + log(A[l,j]) }
  • The Viterbi Algorithm:

  • For i = 1..n

    • For j = 1..m
        • S(i,j) = log(e_j(x_i)) + max_{l∈Q} { S(i-1,l) + log(A[l,j]) }
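
A compact Python sketch of this double loop; the two-state model, its parameters, and the uniform initial distribution are assumptions for illustration, and a real implementation would also keep traceback pointers to recover the best state path:

```python
# Sketch: Viterbi in log space.
# S[j] after step i = log Pr of the best path emitting x_1..x_i, ending in j:
#   S(i,j) = log(e_j(x_i)) + max_l { S(i-1,l) + log(A[l][j]) }
from math import log, inf

def viterbi(x, A, E, init):
    """A[l][j] = Pr(s_l -> s_j); E[j][symbol] = Pr(symbol | s_j);
    init[j] = Pr(first state is s_j). Returns the final column of S."""
    m = len(A)
    def safe_log(p):
        return log(p) if p > 0 else -inf
    # i = 1: start from the initial distribution.
    S = [safe_log(init[j]) + safe_log(E[j].get(x[0], 0)) for j in range(m)]
    for xi in x[1:]:
        S = [safe_log(E[j].get(xi, 0)) +
             max(S[l] + safe_log(A[l][j]) for l in range(m))
             for j in range(m)]
    return S  # max(S) = log probability of the single best path for x

# Hypothetical 2-state toy model: state 0 is GC-rich, state 1 is AT-rich.
A = [[0.9, 0.1], [0.1, 0.9]]
E = [{"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}]
print(max(viterbi("GGGAACTT", A, E, init=[0.5, 0.5])))
```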


Probability of being in specific states

  • What is the probability that we were in state k at step i?

  • Pr[π_i = k | x] = Pr[all paths that pass through state k at step i and emit x] / Pr[all paths that emit x]
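
Both the numerator and the denominator are sums over exponentially many paths, but they can be computed with the standard forward and backward recursions; a minimal sketch in plain probability space (log space would be needed for long sequences), again on an assumed two-state toy model:

```python
# Sketch: posterior probability of being in state k at step i,
#   Pr(pi_i = k | x) = f(i,k) * b(i,k) / Pr(x),
# where f sums over paths emitting x_1..x_i that end in state k, and
# b sums over paths emitting x_{i+1}..x_n given state k at step i.

def forward(x, A, E, init):
    m = len(A)
    f = [[init[j] * E[j][x[0]] for j in range(m)]]
    for xi in x[1:]:
        prev = f[-1]
        f.append([E[j][xi] * sum(prev[l] * A[l][j] for l in range(m))
                  for j in range(m)])
    return f

def backward(x, A, E):
    m = len(A)
    b = [[1.0] * m]                      # base case at the last step
    for xi in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, [sum(A[k][j] * E[j][xi] * nxt[j] for j in range(m))
                     for k in range(m)])
    return b

def posterior(x, A, E, init):
    f, b = forward(x, A, E, init), backward(x, A, E)
    px = sum(f[-1])                      # Pr(x): sum over all paths
    return [[f[i][k] * b[i][k] / px for k in range(len(A))]
            for i in range(len(x))]

# Hypothetical 2-state toy model (same style as the Viterbi sketch).
A = [[0.9, 0.1], [0.1, 0.9]]
E = [{"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}]
for i, row in enumerate(posterior("GGGAACTT", A, E, [0.5, 0.5])):
    print(i + 1, [round(p, 3) for p in row])  # each row sums to 1
```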


