In our final year of engineering we are given the liberty to choose from a set of subjects as to be our “elective”. Now elective is a choice… It’s a choice you make based on what your interests are and you are “expected” to be interested in something which the college offers you. It is an altogether different issue that college always offers something boring and students are always never sure of what they are interested in and hence usually prefer to take a course wherein teachers don’t turn up.

Entering into my final semester of engineering I was sure of what my interests are but I was also sure college authorities wouldn’t have anything to offer that would interest me. But scrolling through the list put up at department notice board, I found a new offering. “Computational Biology”. The syllabus seemed interesting (it had algorithms). Most importantly it was going to be taught by people from the industry. I really appreciate people from industry who volunteer for teaching assignments. Taking time out of their busy schedule to interact with students is indeed commendable. Besides I have always being in favor of continuous interaction between academia and industry. Producers and consumers should always be in sync with each other. So I signed in for the course.

And so in my final semester of engineering I was admitted to a school.I experienced things which I last saw/felt in high school.

  • All the course instructors always come on time

  • Basically they never miss any of their allotted slots i.e. lectures take place as per schedule.

  • They are so enthusiastic about giving assignments and making us scratch our brains.

  • They actually read our answer sheets (they did. Believe me)

Of the many algorithms that we studied the ones which I found really interesting were pair-wise sequence alignment algorithms. One of the basic units in biology is DNA which is a sequence of 4 letters A, C, T, G. DNA defines us. It contains the code which makes us what we are. Now comparing DNA sequences from two different organisms is really helpful in deciding the relationship between organisms. It helps biologists conclude about any common ancestry amongst them to name just one.

Pair-wise sequence alignment can basically be of two types. The ones aiming at global alignment and ones aiming at local alignment. Calculating a global alignment is a form of global optimization that forces the alignment to span the entire length of all query sequences. An example of this is Needleman–Wunsch algorithm. By contrast to global alignments, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Let’s consider Smith- Waterman algorithm which aims to align two sequences with emphasis on local alignment. Here are the basic steps in the algorithm along with an example.

Consider two DNA sequences: 1: ACTGAAGG

2: AATGGACTT

1) Decide upon match, mismatch scores (similarity measure, s (a, b)) and gap penalty.

Let us assume the match score to be 1. What we mean by this is,

s (A, A) = s (C, C) = s (G, G) = s (T, T) = 1, where s is similarity measure.

Let mismatch score to be – (1/3). What we mean by this is s (A, C) = – (1/3).

And gap penalty = 1 + k/3, where k = extent of gap. So for gap length =1, penalty = 1.3, for gap length = 2, penalty = 1.7 and so on.

Here I have decided these scores randomly. But these scores are usually decided keeping biological aspects into consideration. What we are technically doing by deciding these scores is deciding what our distance matrix is. In this example my distance matrix is:

A

T

C

G

A

1

-0.3

-0.3

-0.3

T

-0.3

1

-0.3

-0.3

C

-0.3

-0.3

1

-0.3

G

-0.3

-0.3

-0.3

1

2) Construct matrix.

Now in sequence alignment algorithms we create a matrix using the distance matrix. We write one sequence along one axes and other sequence along other axes. The rule in SW algorithm to fill up the matrix is as follows:

Set Hi, 0 = H0, j = 0 for 1 ≤ i n and 1 ≤ j m, where n and m are lengths of the two sequences under consideration. Then,

Hi, j = max { { Hi-1, j-1 + s( ai , bj )} , max1 ≤ k ≤ i {Hi-k, j – Wk}, max1 ≤ k ≤ j {Hi, j-k – Wk}, 0}. Here W is the gap penalty and the k is the extent of gap. Initially the matrix looks like (i.e. putting 0 in row 0 and column 0):

A

C

T

G

A

A

G

G

0

0

0

0

0

0

0

0

0

A

0

?

A

0

T

0

G

0

G

0

A

0

C

0

T

0

T

0

Now let’s fill up the matrix one element at the time from top left hand corner to bottom right hand corner (always moving from left to right) using the equation given above. Consider the cell containing ‘?’. The corresponding letter from sequence 1 is ‘A’ and that from sequence 2 is ‘A’. Hence substituting values in our equation we have,

H = max {{0 + s (A, A)}, {0-1.3}, {0-1.3}, 0} = max {1, -1.3, -1.3, 0} = 1.

Similarly we can fill up other positions in matrix. Here’s one more example: Consider row 4 and column 6. The letter in sequence1 is ‘A’ and that in sequence2 is ‘G’. So we have,

H4, 6 = max { {H3, 5 + s(A, G)}, max { (H4, 5 – 1.3) , (H4,4 – 1.7), (H4,3 ­ – 2)}, max {(H3, 6 – 1.3), (H2,6 – 1.7), (H1,6) – 2}, 0 }

= max {-0.3, max {0.1, 1,-1.6}, max {-0.6, 0.3, -1}, 0} = max {-0.3, 1, 0.3} = 1

Thus our final matrix is:

A

C

T

G

A

A

G

G

0

0

0

0

0

0

0

0

0

A

0

1

0

0

0

1

1

0

0

A

0

1

0.7

0

0

1

2

0.7

0.3

T

0

0

0.7

1.7

0.4

0

0.7

1.7

0.4

G

0

0

0

0.4

2.7

1.4

1

1.7

2.7

G

0

0

0

0

1.4

2.4

1.1

2

2.7

A

0

1

0

0

1

2.4

3.4

2.1

1.7

C

0

0

2

0.7

0.3

1.1

2.1

3.1

1.8

T

0

0

0.7

3

1.7

0.4

1.7

1.8

2.8

T

0

0

0.3

1.7

2.7

1.4

0.4

1.4

1.5

3) Trace- back

The next step is to trace a path in this matrix which will give us the optimal local alignment. The steps here are:

a) Locate the maximum element and start the trace from that cell.

b) Now check the diagonal (i-1, j-1), immediate left (i , j-1) and immediate top(i-1, j). Move to the maximum of these 3 and choose diagonal element in case of a tie.

c) Stop the trace when you reach a stage where a diagonal element is 0.

The trace path in the matrix is shown with bold numbers.

A

C

T

G

A

A

G

G

0

0

0

0

0

0

0

0

0

A

0

1

0

0

0

1

1

0

0

A

0

1

0.7

0

0

1

2

0.7

0.3

T

0

0

0.7

1.7

0.4

0

0.7

1.7

0.4

G

0

0

0

0.4

2.7

1.4

1

1.7

2.7

G

0

0

0

0

1.4

2.4

1.1

2

2.7

A

0

1

0

0

1

2.4

3.4

2.1

1.7

C

0

0

2

0.7

0.3

1.1

2.1

3.1

1.8

T

0

0

0.7

3

1.7

0.4

1.7

1.8

2.8

T

0

0

0.3

1.7

2.7

1.4

0.4

1.4

1.5

4) Alignment

This is the last step. Using the trace- back we have our maximum similar segment as:

–ACTGAA–

–AATGGA–

Our alignment has 4 matches, 2 mismatches and no gaps.

So this was about Smith – Waterman Algorithm. Thanks to my faculty and Swati (we usually end up fighting trying to figure out how things work before happily coming to a consensus) I hope I have got things right. Also here’s the link to the original paper by Smith which is very well written (and I don’t say that about every paper I go through) : http://www-hto.usc.edu/papers/msw_papers/msw-040.pdf.

I referred this paper and it has NW algorithm explained as well.