pdf.io >> Retail >> A MULTITHREADED PARALLEL IMPLEMENTATION OF A DYNAMIC ....pdf

A MULTITHREADED PARALLEL IMPLEMENTATION OF A DYNAMIC ...

 FileName: martins.pdf
[previewonline]


 FileSize: 250 KB download
 Shared by: helixweb_stanford_edu 107 month ago
 Category: Retail
 Report us: delete it


Abstract: programming algorithm for biological sequence comparison on a general ... Parallel hardware for sequence comparison and align ment. Computer Applications in the Biosciences, 12:473 ...

Pacific Symposium on Biocomputing 6:311322 (2001)
A MULTITHREADED PARALLEL IMPLEMENTATION OF A
DYNAMIC PROGRAMMING ALGORITHM FOR SEQUENCE
COMPARISON
W.S. MARTINS, J.B. DEL CUVILLO, F.J. USECHE,
K.B. THEOBALD, G.R. GAO
Department of Electrical and Computer Engineering
University of Delaware, Newark, DE 19716, USA
Abstract
This paper discusses the issues involved in implementing a dynamic
programming algorithm for biological sequence comparison on a general
purpose parallel computing platform based on a negrain eventdriven
multithreaded program execution model. Finegrain multithreading per
mits e cient parallelism exploitation in this application both by taking
advantage of asynchronous pointtopoint synchronizations and commu
nication with low overheads and by e ectively tolerating latency through
the overlapping of computation and communication. We have imple
mented our scheme on EARTH, a negrain eventdriven multithreaded
execution and architecture model which has been ported to a number of
parallel machines with o theshelf processors. Our experimental results
show that the dynamic programming algorithm can be e ciently imple
mented on EARTH systems with high performance e.g., speedup of 90
on 120 nodes, good programmability and reasonable cost.
1 Introduction
Today, one of the most powerful methods for inferring the biological function
of a gene or the protein that it encodes is by sequence similarity searching on
protein and DNA sequence databases. With the development of rapid methods
for sequence comparison, discoveries based solely on sequence homology have
become routine. A good introduction can be found in a book by Waterman.1
Although sequence comparison algorithms based on the dynamic program
ming method  such as NeedlemanWunsch 2 and SmithWaterman 3  pro
vide optimal solutions, they are computationally expensive. Therefore, most
current sequence comparison methods used in practice, e.g., BLAST 4 and
FASTA,5 are based on heuristics which are much faster, but do not produce
optimal results. Speed is important given the sizes of the sequence databases
currently available, but it comes at the price of getting incomplete results.
In this paper, we are interested in studying how to apply the computational
power of parallel computers to speed up the process of comparing sequences,
Pacific Symposium on Biocomputing 6:311322 (2001)
but without having to compromise by missing some optimal results. We look
at dynamic programming algorithms for sequence comparison. The rst al
gorithm introduced for nding the optimal alignment between sequences, the
NeedlemanWunsch algorithm,2 used this technique. This algorithm had a
great impact on later sequence alignment algorithms, such as the wellknown
SmithWaterman method 3 and others.6 7 Therefore, speeding up dynamic pro
;
gramming algorithms for nding optimal solutions to sequence comparisons is
an important problem in computational biology and bioinformatics.
The dynamic programming algorithm works by computing the socalled
similarity matrix. As will be discussed in detail in Section 2, the computation
at each element in this matrix depends on the results of three other elements:
its nearest west, northwest and north neighbors in the matrix. Such negrain
data dependences present serious challenges for e cient parallel execution on
current parallel computers. To meet such challenges, we exploit the power of
a multithreaded execution and architecture model, such as the EARTH E 
cient Architecture for Running THreads model,8 where negrain parallelism
can be e ciently exploited on top of a parallel machine based on o theshelf
microprocessors. Under the EARTH model, the computation of an element
or a block of elements of the similarity matrix can be assigned to one thread.
The thread scheduling under EARTH is eventdriven; a thread will become
enabled if and only if the events on which it depends have arrived. There
fore, we can map the negrain data dependences into such events, and the
enabling and execution of the threads in di erent points are performed in an
asynchronous fashion. With similarity matrices above a reasonable threshold,
this mapping provides ample thread parallelism to keep the processors usefully
busy. Maintaining multiple enabled threads in the same processor also pro
vides the ability to tolerate interprocessor communication and communication
latencies, and to sustain high and smooth scalability.
2 Sequence Comparison Using Dynamic Programming
The rst algorithm for comparing biological sequences using the dynamic pro
gramming technique was proposed by Needleman and Wunsch in 1970.2 The
algorithm consists of two parts: the calculation of the total score indicating
the similarity between the two given sequences, and the identi cation of the
alignments that lead to the score. In this paper we will concentrate on the
calculation of the score, since this is the most computationally expensive part.
The idea behind using dynamic programming is to build up the solution
by using previous solutions for smaller subsequences. The comparison of the
two sequences X and Y, using the dynamic programming algorithm, is illus
Pacific Symposium on Biocomputing 6:311322 (2001)
trated in Figure 1. This algorithm nds global alignments by comparing entire
sequences. The sequences are placed along the left margin X and on the top
Y. A similarity matrix is initialized with decreasing values 0; ,1; ,2; ,3; : : :
along the rst row and rst column to penalize for consecutive gaps insertions
or deletions.
Sequence
A T G C A G T Y
0 1 2 3 4 5 6 7
A 1 1 0 1 2 3 4 5
T 2 0 2 1 0 1 2 3
sequence X A T  A A G T
Sequence A 3 1 1 2 1 1 0 1 sequence Y A T G C A G T
X score 1 1 1 0 1 1 1
4 2 0 1 2 2 1 0 TOTAL= 4
A
sequence X A T A  A G T
G 5 3 1 1 1 2 3 2 sequence Y A T G C A G T
score 1 1 0 1 1 1 1
T 6 4 2 0 1 1 2 4 TOTAL= 4
Figure 1: Similarity matrix and global alignments
The other elements of the matrix are calculated by nding the maximum
value among the following three values: the left element plus gap penalty, the
upperleft element plus the score of substituting the horizontal symbol for the
vertical symbol, and the upper element plus the gap penalty. For the general
case where X = x1 ; : : : ; x and Y = y1 ; : : : ; y , for i = 1; : : : ; n and j = 1; : : : ; m,
i j
the similarity matrix SM n; m is built by applying the following recurrence
equation, where gp is the gap penalty and ss is the substitution score:
8 SM i; j , 1 + gp
SM i; j = max : SM i , 1; j , 1 + ss
SM i , 1; j + gp
In our example, gp is 1, and ss is 1 if the elements match and 0 otherwise.
However, other general values can be used instead.
Following this recurrence equation, the matrix is lled from top left to
bottom right with entry i; j requiring the entries i; j , 1 , i , 1; j , 1 , and
i , 1; j . Notice that SM i; j corresponds to the best score of the subsequences
x1 ; : : : ; x and y1 ; : : : ; y . Since global alignment takes into account the entire
i j
sequences, the nal score will always be found in the bottom right hand corner
Pacific Symposium on Biocomputing 6:311322 (2001)
of the matrix. In our example, the nal score 4 gives us a measure of how
a
similar the two sequences are. Figure 1 shows the similarity matrix and the
two possible alignments arrows going up and left.
3 Parallel Computation  Challenges and Problem Formulation
A parallel version of the sequence comparison algorithm using dynamic pro
gramming must handle the data dependences presented by this method, yet
it should perform as many operations as possible independently. This may
present a serious challenge for e cient parallel execution on current general
purpose parallel computers, i.e., MIMD Multiple Instruction stream, Multiple
Data stream computers.
Given the data dependences presented by the algorithm, the similarity
matrix can be lled row by row, column by column, or antidiagonal by anti
diagonal i.e., all elements i; j for which i + j is a xed value. The problem
with the rst two approaches is that most of the elements in a row or col
umn depend on other elements in the same row column. This means the
row or column cannot be computed in parallel. On the other hand, the ele
ments in an antidiagonal depend only on previously calculated antidiagonals.
This means that parallel computation can proceed as a wave front across the
similarity matrix, i.e., by computing successive antidiagonals of the matrix
simultaneously, during successive time steps.
Although it exposes parallelism, the antidiagonal approach faces a few
challenges when it comes to an e cient parallel implementation. First, the
sizes of the antidiagonals vary during the computation, which leads to unbal
anced work among processors. For example, assume six processors, one per
symbol row of sequence X, are available to compute the matrix in Figure 1.
The computation would start with processor 1 calculating the element 1,1
assuming rows and columns are numbered 0,1,2 . . . . Then, in the next time
step, processors 1 and 2 would calculate the elements 1,2 and 2,1 respec
tively, and so on. This way, processor 6, for instance, would have to wait 6
time steps before starting to work, and by the time we get to time step 8,
processor 1 would already be idle. In the worst case, where X and Y are the
same length, each processor would only be used half of the time on average.
Another challenge has to do with the number of elements to be computed
by each processor in each time step. In the previous example we assumed that
each processor would calculate one element of the matrix at a time. However,
a When this information is stored during the calculation of the matrix, the second part
of the algorithm, identi cation of the alignment, can be easily computed by following the
pointers from the lowest right corner to the upper left corner.
Pacific Symposium on Biocomputing 6:311322 (2001)
this negrain computation would require a very large number of processors
if real biological data is to be considered. An additional problem is the high
communication overheads for such an implementation, which would require
data exchange among all active processors at every time step.
In this paper, we are interested in the following challenging question: Can
the dynamic programming algorithm be e ciently implemented on general
purpose parallel computers with high performance and e ciency, good pro
grammability, and reasonable cost? Here we are particularly interested in
parallel machines made mainly of commodity o theshelf microprocessors
and stock hardware. The requirement of high performance" implies good
speedup and scalability, and good programmability" and reasonable cost" fa
vor generalpurpose parallel computer solutions as opposed to specialpurpose
hardware or exotic processor technology.
4 Parallel Implementation of the Dynamic Programming Algorithm
The previous section mentioned that computing the antidiagonal element by
element would lead to expensive communication overheads. For each element,
the program would compute a single maximum, yet would have to send the
result to three processors though one of these may be the same processor.
One solution to this problem is to divide the similarity matrix into rect
angular blocks, as shown in Figure 2a. In this example, the program would
compute block 1 rst, followed by 2 and 5, etc. If each block has q rows and r
columns, then the computation of a given block requires only the row segment
immediately above the block, the column segment to its immediate left, and
the element above and to the left  a total of q + r +1 elements. For instance,
if each block has 4 rows and 4 columns, then each block has to compute 16
maxima after receiving 9 input values. The communicationtocomputation
ratio drops from 3:1 to 9:16  an 81 reduction!
Note that this blocking will decrease the maximum achievable parallelism
somewhat, by introducing some sequential dependences in the code. However,
given the sizes of the current problems and the parallel machines currently
used, this potential loss will not be a limiting factor.
The loadbalancing problem can be addressed by putting several rows of
blocks or strips" on the same processor. Figure 2b illustrates this ap
proach when four processors are used. The rst and fth strips are assigned
to processor 1, the second and sixth strips are assigned to processor 2 and so
on. This helps to keep all processors busy through most of the computation.
For example, processor 1 initially works with the rst strip, then simultane
ously with the rst and fth strip, then nally only with the fth strip. The
Pacific Symposium on Biocomputing 6:311322 (2001)
P1 1 2 3 4
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
P1
111111111111111111111
000000000000000000000
111111111111111111111111111111111111111111
000000000000000000000000000000000000000000
P2 5 6 7 8 P2 111111111111111111111111111111111111111111
000000000000000000000000000000000000000000
111111111111111111111111111111111111111111
000000000000000000000000000000000000000000
111111111111111111111
000000000000000000000
P3 9 10 11 12 000000000000000000000
111111111111111111111
P3
000000000000000000000
111111111111111111111
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
P4 13 14 15 16 111111111111111111111
000000000000000000000
P4
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
P1 17 18 19 20 111111111111111111111
000000000000000000000
P1
111111111111111111111
000000000000000000000
111111111111111111111111111111111111111111
000000000000000000000000000000000000000000
P2 111111111111111111111111111111111111111111
P2 21 22 23 24 000000000000000000000000000000000000000000
111111111111111111111111111111111111111111
000000000000000000000000000000000000000000
111111111111111111111
000000000000000000000
P3 25 26 27 28 111111111111111111111
000000000000000000000
P3
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
P4 29 30 31 32 111111111111111111111
000000000000000000000
P4
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
111111111111111111111
000000000000000000000
a Block Division b Processor Distribution
Figure 2: Partition of the similarity matrix
processor utilization rises to 75.
4.1 The EARTH Multithreaded Architecture  Our Platform
EARTH 8 9 supports a multithreaded program execution model in which a
;
program is viewed as a collection of threads whose execution ordering is deter
mined by data and control dependences explicitly identi ed in the program.
Threads, in turn, are further divided into bers which are nonpreemptive and
scheduled according to data owlike ring rules, i.e., all needed data must be
available before it becomes ready for execution. Programs structured using
this twolevel hierarchy can take advantage of both local synchronization and
communication between bers within the same thread, exploiting data local
ity. In addition, an e ective overlapping of communication and computation is
made possible by providing a pool of readytorun bers from which the pro
cessor can fetch new work as soon as the current ber ends and the necessary
communication is initiated.
The EARTH model de nes a common set of primitive operations required
for the management, synchronization and data communication of threads.
Each node in an EARTH system consists of an execution unit EU, a syn
chronization unit SU, queues linking the EU and SU, local memory, and an
interface to interconnection network. While the EU merely executes bers, i.e.,
does the computation, the SU is responsible for scheduling and synchronizing
threads, handling remote accesses and performing dynamic load balancing.
Although designed to deal with multiple threads per node, the EARTH
model does not require any support for rapid context switching since bers
are nonpreemptive and is wellsuited to running on o theshelf processors.
EARTH systems have been implemented on a number of platforms: MANNA
and PowerMANNA, IBM SP2, Sun SMP cluster and Beowulf. EARTH pro
Pacific Symposium on Biocomputing 6:311322 (2001)
Inactive fiber P1
111
000
111
000 Active fiber
111
000
111
000 Thread A Thread B
data E Fibers 1111
0000 E
11111
00000
Fibers
sync 1111
0000 11111
00000
P2 1111
0000
1111
0000 11111
00000
11111
00000
acknowledgement O O
P3
* 4fibers per processor
O 111111111
000000000
* Eventdriven 111111111
000000000
P1 111111111
000000000 E
* Asynchronous 11111
00000 111111111
000000000
E 11111
00000
* Nonpreemptive fibers P2 11111
00000
11111
00000 O
11111
00000
11111
00000
* Local communication/ O 11111
00000
11111
00000
11111
00000
11111
00000
synchronization between P3 11111
00000
11111
00000
11111
00000
11111
00000 E
fibers
E 11111
00000
P4 11111
00000
11111
00000 O
111111111
000000000 11111
00000
O 111111111
000000000
P1 111111111
000000000
111111111
000000000 E
1111
0000 111111111
000000000
E 1111
0000
P2 1111
0000
1111
0000 O
1111
0000
O 1111
0000
1111
0000
1111
0000
1111
0000
P3 1111
0000
1111
0000
1111
0000
1111
0000 E
1111
0000
P4 1111
0000
1111
0000 O
1111
0000
Figure 3: Computation of the similarity matrix on EARTH
grams are written using the programming language ThreadedC.8 9 This is ;
an extension of the ANSIC programming language which, by incorporating
EARTH operations, allows the user to indicate parallelism explicitly.
4.2 Our Multithreaded Implementation
Our multithreaded implementation follows the description given at the begin
ning of this section. Generally speaking, it assigns the computation of each
strip to a thread, having 2 independent threads per node. However, in order
to better overlap computation and communication, blocks on a strip are ac
tually calculated by two bers within a thread. These bers are repeatedly
instantiated to compute one block at a time, and only one of the two bers of
each thread can be active at a particular time.
The decision of having two alternating bers within each thread was based
on the following reasoning. It would be a waste of resources if we had one sep
arate ber for each block, in each strip, since only one block can be calculated
at a time. Having just one ber for all blocks is also not a good idea because
Pacific Symposium on Biocomputing 6:311322 (2001)
this ber would get delayed due to the synchronization signal coming from
the ber immediately below. This signal acknowledges the receipt of data 
without it the ber, reinstantiated, would be allowed to overwrite the previous
data. Thus, with just one ber, computation would not be allowed to proceed
until this acknowledgment signal is received. With the addition of an extra
ber we can further overlap computation and communication since one of the
bers can wait for the acknowledgment while the other starts working on the
following block. This doublebu ering and acknowledgment scheme is used
with other parallel applications on EARTH.8 10
;
A snapshot of the computation of the similarity matrix using our multi
threaded implementation is illustrated in Figure 3. A thread is assigned to each
horizontal strip and the actual computation is done by bers labeled Even
and Odd. The gure shows the computation of the main antidiagonal of
the matrix. The arrows indicate data and synchronization signals. For exam
ple, processor 2 sends data downward arrows to processor 3 and receives data
from processor 1  i.e., bers E of strips 2 and 6 send data to bers E of strips
3 and 7, and bers O of strips 1 and 5 send data to bers O of strips 2 and 6.
Fibers within a same thread, that is, associated with the same strip, send only
a synchronization signal horizontal arrows since they share data local to the
thread to which they belong. Finally, dotted upward arrows acknowledge the
receipt of data so that the ber receiving this signal can be reinstantiated to
calculate another block of the same strip.
During the initialization phase, each thread grabs a piece of the input
sequence X. This piece is all a thread needs from sequence X so the whole
sequence need not be stored. Moreover, after computing a block, each ber
sends to the ber beneath a piece of the sequence Y being compared. By
doing so, we minimize the initialization delay that occurs when the nodes are
reading sequence X from the server. Furthermore, since subsequent pieces of
sequence Y can be stored in the same memory area, the demands for space are
considerably reduced.
5 Results
The experiments in this study are based on the EARTH implementation for
the MANNA parallel machine. Our experiments were run using both a 20node
MANNA and SEMi, an accurate simulator of the MANNA.8 9 The di erence
;
in the clock cycle counts between the simulator and the real MANNA have
been measured and were less than 3 for the same ThreadedC code. The
simulator allows one to test di erent con gurations for the EARTH system.
In this paper we consider only the most conservative of them, one in which each
Pacific Symposium on Biocomputing 6:311322 (2001)
node contains only a single o theshelf microprocessor, and the operations of
the EU and SU must be performed therefore on the same processor.
The proposed multithreaded dynamicprogramming algorithm for sequence
comparison has been implemented under the EARTH experimental platform.
The results are reported for sequences ranging from 512 to 10K elements. Fig
ure 4 shows absolute speedups for various problem sizes sizes list the length
of the X and Y sequences. Absolute speedup compares the running time of
the parallel code to the best sequential program, so that the results account
for multithreading overheads.
The main experimental results include: b
Speedup: The multithreaded implementation has achieved impressive
speedup: up to 90 on a 120node platform.
Scalability: A good scalability is obtained from 4 up to 120 nodes.
Processor E ciency: Very good e ciency has been achieved; on 8
nodes, processors are utilized 99 as much as on one node running se
quential code. Even on 120 nodes, processors achieve 75 utilization on
the largest problem size.
Programmability: The good performance is achieved using a straight
forward partitioning method, without requiring expensive compiler sup
port for optimal partitioning algorithms.
Reasonable cost: This is achieved on an EARTH platform based on
o theshelf generalpurpose microprocessor technology.
6 Related Work
Di erent parallel implementations of pairwise sequence comparison algorithms
using dynamic programming techniques range from the exploitation of instruc
tion level parallelism in uniprocessor machines 11 12 to SIMD 13 14 Single In
; ;
struction stream, Multiple Data stream and MIMD implementations.15 16 Re ;
lated problems such as sequence alignment and database search have also made
extensive use of parallel hardware, from specialpurpose VLSI to recon gurable
hardware and programmable coprocessor designs.14 17 Some work has been re
;
ported in the eld of hardware accelerators 18 and some basic software platforms
b The authors are not aware of any published work reporting results, for a parallel MIMD
implementation of the dynamic programming algorithm, for the range of input data and
number of processors reported in this paper.
Pacific Symposium on Biocomputing 6:311322 (2001)
96
512x1k
1kx2k
2kx4k
4kx4k
 Related pdf books
 A MULTITHREADED PARALLEL IMPLEMENTATION OF A DYNAMIC ...
 Integrated Tools for Structural and Sequence Alignment and ...
 GENOM EWIDE PATHWAY ANALYSIS AND VISUALIZATION USING GENE ...

 The Clinical Bioinformatics Ontology: A Curated Semantic Network Utilizing RefSeq
 GenomeWide Detection of Alternative Splicing in Expressed Sequences Using Partial Order
 SIMULATING THE GROWTH OF VIRUSES LINGCHONG YOU AND JOHN YIN*
 PREmRNA SECONDARY STRUCTURE PREDICTION AIDS SPLICE SITE ...
 Partof Relations in Anatomy Ontologies: A Proposal for RDFS ...
 A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text
 DIGIT: A Novel Gene Finding Program by Combining GeneFinders ...
 wsprocs9x6
 Pedigree Generation for Analysis of Genetic Linkage and Association
 Modeling Gene Expression from Microarray Expression Data with StateSpace Equations
 Random Forest Similarity for ProteinProtein Interaction Prediction from Multiple
 Popular epubs
 IMPLEMENTATION OF DUALCORE MULTITHREADED PROCESSOR ON XILINX ...
 CS184c: Computer Architecture [Parallel and Multithreaded]
 DYNAMIC INSTABILITIES OF TWO PARALLEL INTEGRAL SUPERHEATERS TO ...
 Optimal Architectures for Massively Parallel Implementation of ...
 Project Report: Parallel AES Implementation
 A Parallel Implementation of the Spectral Bundle Method for ...
 Implementation of Parallel and Distributed Query Processing in the ...
 Dynamic Scheduling of Parallel Applications on SharedMemory ...
 An FPGA Implementation of Parallel ICA for Dimensionality ...
 TPCCUVa  An OpenSource TPCC Implementation for Parallel and Distributed Systems
 Characterization of pairwise and multiple sequence alignment ...
 Sequence Alignment
Download the ebook