IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 02 Issue: 08 | Aug-2013, Available @ http://www.ijret.org 312
COMPARATIVE ANALYSIS OF DYNAMIC PROGRAMMING
ALGORITHMS TO FIND SIMILARITY IN GENE SEQUENCES
Shankar Biradar1, Vinod Desai2, Basavaraj Madagouda3, Manjunath Patil4
1, 2, 3, 4
Assistant Professor, Department of Computer Science & Engineering, Angadi Institute of Technology and
Management, Belgaum, Karnataka, India.
shankar_pda@yahoo.com, vinod.cd0891@gmail.com, basavarajmadagoda@gmail.com, manjunath.patil03@gmail.com
Abstract
Many computational methods exist for finding similarity in gene sequences, but finding a method that gives optimal similarity
is a difficult task. The objective of this project is to find an appropriate method to compute similarity in gene/protein sequences, both
within families and across families. Dynamic programming algorithms such as Levenshtein edit distance, Longest Common
Subsequence and Smith-Waterman have been used to find similarities between two sequences. However, none of
the methods mentioned above has been evaluated on real benchmark data sets; they have only been applied to synthetic
data. We propose a new method to compute similarity. The performance of the proposed algorithm is evaluated using a number of data
sets from various families, and the similarity value is calculated both within a family and across families. A comparative analysis
and the time complexity of the proposed method reveal that the Smith-Waterman approach is the appropriate method when the gene/protein sequences
belong to the same family, and Longest Common Subsequence is best suited when the sequences belong to two different families.
Keywords - Bioinformatics, Gene, Gene Sequencing, Edit distance, String Similarity.
-----------------------------------------------------------------------***-----------------------------------------------------------------------
1. INTRODUCTION
Bioinformatics is the application of computer technology to
the management of biological information. The field of
bioinformatics has gained widespread popularity largely due
to efforts such as the genome projects, which have produced
a large amount of biological sequence data for analysis. This has led to the
development and improvement of many computational
techniques for making inferences in biology and medicine. A
gene is a molecular unit of heredity of a living organism. It is
the name given to stretches of DNA and RNA that code
for a polypeptide or for an RNA chain that has a function in
the organism. Genes hold the information to build and
maintain an organism's cells and pass genetic traits to
offspring. Gene sequencing can be used to gain important
information on genes, genetic variation and gene function for
biological and medical studies [13]. Edit distance is a method
of measuring similarity between gene/protein sequences by
quantifying the dissimilarity between two sequences [5]. The edit distance
between a source and a target string is the minimum number of
fundamental operations required to transform the source string
into the target; these fundamental operations are insertion,
deletion and substitution. The smaller the edit distance, the
more similar the two strings are. String similarity is a
quantitative term that shows the degree of commonality or
difference between two compared sequences [10]. Finding
gene similarity has wide use in the field of
bioinformatics.
2. MATERIALS AND METHODS
In this section we describe the materials and methods
that are used with our algorithms.
2.1 Dataset Used
For the experiments we took data sets from 5 different
families, which are listed below; the source of the data
is [16] [17].
Family: kruppel c2h2-type zinc finger protein.
Family: cation-diffusion facilitator (CDF) transporter family.
Family: E3 ubiquitin-protein ligase.
Family: Semaphorin-7A.
Family: SPAG11 family.
2.2 Dataset Format
In this research work we used various data sets from different
families for the implementation of the different algorithms; all of these
data sets are in FASTA format. In bioinformatics, FASTA
format is a text-based format for representing nucleotide or peptide
sequences, in which nucleotides or amino acids are
represented using single-letter codes. The format also allows a
sequence name to precede each sequence. A sequence in
FASTA format begins with a single-line description, followed
by lines of sequence data. The description line is distinguished
from the sequence data by a greater-than (">") symbol in the
first column. The word following the ">" symbol is the
identifier of the sequence, and the rest of the line is the
description (both are optional). There should be no space
between the ">" and the first letter of the identifier. It is
recommended that all lines of text be shorter than 80
characters. The sequence ends if another line starting with a
">" appears; this indicates the start of another sequence.
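The format rules above can be illustrated with a small reader. This is only a minimal sketch in Python, not the program used in the experiments; the function name parse_fasta and the two short example records are hypothetical and were made up for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # The first word after ">" is the identifier,
            # the rest of the line is the free-text description.
            header = line[1:].split(maxsplit=1)
            # Identifier is optional: fall back to a generated name (assumption).
            current_id = header[0] if header else f"seq{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may span several lines; join them.
            records[current_id] += line
    return records


# Hypothetical two-record example, not taken from the data sets of the paper.
example = """>P1 hypothetical protein one
MKTAYIAK
QRQISFVK
>P2 hypothetical protein two
MKTAYLAK
""".splitlines()

sequences = parse_fasta(example)
print(sequences)
```

Each new ">" line closes the previous record, so multi-line sequences are concatenated into a single string before they are passed to an alignment algorithm.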
2.3 Gap Penalty
In order to get the best possible alignment between two
DNA sequences, it is important to insert gaps in the
alignment and to use gap penalties. While aligning DNA
sequences, a positive score is assigned for matches and a negative
score is assigned for mismatches. To determine scores for matches
and mismatches in alignments of proteins, it is necessary to
know how often one amino acid is substituted for another in
related proteins. In addition, a method is needed to account for
insertions and deletions that sometimes appear in related DNA
or protein sequences. To accommodate such sequence
variations, gaps that appear in sequence alignments are given a
negative penalty score, reflecting the fact that they are not
expected to occur very often. It is very difficult to get the
best-possible alignment, either global or local, unless gaps are
included in the alignment.
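The effect of allowing gaps can be seen by scoring the same pair of sequences with and without them. The sketch below is illustrative only: the function name score_alignment and the values match = 2, mismatch = -1 and gap = -2 are assumptions chosen for the example, not parameters reported in this paper.

```python
def score_alignment(a: str, b: str, match: int = 2,
                    mismatch: int = -1, gap: int = -2) -> int:
    """Score two already-aligned strings of equal length; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap        # negative penalty for an insertion/deletion
        elif x == y:
            total += match      # positive score for a match
        else:
            total += mismatch   # negative score for a mismatch
    return total


# Same two sequences, aligned without and with gaps.
ungapped = score_alignment("ACGTTA", "AGTTAC")    # 0
gapped = score_alignment("ACGTTA-", "A-GTTAC")    # 6
print(ungapped, gapped)
```

Although each gap costs 2, the gapped alignment brings five matching positions into register and therefore scores higher than the ungapped one.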
2.4 Blosum Matrix
A substitution matrix, such as a Blosum matrix for proteins, is
necessary for pairwise sequence
alignment. The four DNA bases are of two types, purines (A
and G) and pyrimidines (T and C). The purines are chemically
similar to each other and the pyrimidines are chemically
similar to each other. Therefore, we will penalize substitutions
between a purine and a purine or between a pyrimidine and a
pyrimidine (transitions) less heavily than substitutions
between purines and pyrimidines (transversions). We will use
the following scores for substitutions and matches. The
score is 2 for a match, 1 for a purine with a purine or a
pyrimidine with a pyrimidine, and -1 for a purine with a
pyrimidine.
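The scoring rule described above can be written out as a 4 x 4 table. The following is a minimal sketch; the names substitution_score and MATRIX are hypothetical, and only the three score values (2, 1, -1) come from the text.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = "ACGT"


def substitution_score(x: str, y: str) -> int:
    """Score one pair of DNA bases using the rule stated in the text."""
    if x == y:
        return 2                      # match
    pair = {x, y}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return 1                      # transition (A<->G or C<->T)
    return -1                         # transversion (purine <-> pyrimidine)


# Full lookup table keyed by (base, base).
MATRIX = {(x, y): substitution_score(x, y) for x in BASES for y in BASES}

# Print the table with one row per base.
print("   " + "  ".join(f"{b:>2}" for b in BASES))
for x in BASES:
    print(f"{x}  " + "  ".join(f"{MATRIX[(x, y)]:>2}" for y in BASES))
```

Storing the scores in a dictionary keyed by base pairs keeps the lookup in the alignment loop to a single expression such as MATRIX[(a, b)].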
3. ALGORITHMS
Dynamic programming algorithms for finding gene sequence
similarity are discussed in detail in this section along with
their pseudo code. We used three algorithms for the
analysis. All of them use
dynamic programming, in which the solution of each subproblem is
computed from the solutions of previously solved subproblems. The three algorithms
are:
a. Levenshtein edit distance algorithm
b. Longest common subsequence algorithm
c. Smith-Waterman algorithm
3.1 Levenshtein Edit Distance Algorithm
It is one of the most popular algorithms for finding the dissimilarity
between two nucleotide sequences. It is an approximate string
matching algorithm mainly used for forensic data sets; the basic
principle of this algorithm is to measure the similarity between
two strings [4]. This is done by counting the
basic operations (insertion, deletion and substitution) mentioned in the introduction. The algorithm
for Levenshtein edit distance is as follows.
Int LevenshteinDistance (char S[1..N], char T[1..M])
{
    Declare int D[0..N, 0..M]
    For i from 0 to N
        D[i, 0] := i // distance from a prefix of S to an empty second string
    For j from 0 to M
        D[0, j] := j // distance from an empty first string to a prefix of T
    For j from 1 to M
    {
        For i from 1 to N
        {
            If S[i] = T[j] then
                D[i, j] := D[i-1, j-1]
            Else
                D[i, j] := min {
                    D[i-1, j] + 1 // deletion
                    D[i, j-1] + 1 // insertion
                    D[i-1, j-1] + 1 // substitution
                }
        }
    }
    Return D[N, M]
}
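The pseudo code can be turned into a short runnable sketch. The version below is written in Python for illustration and is not the implementation used in the experiments; the function name levenshtein is an assumption. The table has one extra row and column (index 0) for the empty prefix, and the result is read from the cell for the full lengths of both strings.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    n, m = len(s), len(t)
    # d[i][j] = edit distance between the first i chars of s and first j chars of t.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i              # delete all i characters
    for j in range(m + 1):
        d[0][j] = j              # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[n][m]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("ACGT", "AGT"))        # 1
```

The running time and memory are both proportional to N x M, which is the usual cost of filling the full dynamic programming table.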
3.2 Longest Common Subsequence Algorithm
Finding the LCS [3] [8] is one way of computing how similar two
sequences are: the longer the LCS, the more similar they are. The
LCS problem is a special case of the edit distance problem.
The LCS algorithm is similar to the Levenshtein edit distance algorithm except
for a few steps, and it also involves a traceback process in order to
recover the common subsequence.
The algorithm for the longest common subsequence is as follows.
Int LCS (char S[1..N], char T[1..M])
{
    Declare int D[0..N, 0..M]
    For i from 0 to N
        D[i, 0] := 0
    For j from 0 to M
        D[0, j] := 0
    For j from 1 to M
    {
        For i from 1 to N
        {
            D[i, j] := max {
                V1;
                V2;
                V3+1 if S[i] = T[j] else V3
            }
        }
    }
    Return D[N, M]
}
Where V1 = the value in the cell to the left of the current cell, D[i, j-1].
V2 = the value in the cell above the current cell, D[i-1, j].
V3 = the value in the cell above and to the left of the current cell, D[i-1, j-1].
S and T are the source string and target string respectively.
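A runnable sketch of the table fill together with the traceback step may make the procedure clearer. It is written in Python under the assumption that one longest common subsequence is enough (several may exist); the function name lcs is hypothetical.

```python
def lcs(s: str, t: str) -> str:
    """Return one longest common subsequence of s and t."""
    n, m = len(s), len(t)
    # d[i][j] = length of the LCS of the first i chars of s and first j chars of t.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1            # extend the diagonal
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])  # best of above / left

    # Traceback from the bottom-right cell to recover the subsequence itself.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])
            i -= 1
            j -= 1
        elif d[i - 1][j] >= d[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("AGGTAB", "GXTXAYB"))  # GTAB
```

The length of the returned string equals the value in the last cell of the table, so the same function gives both the similarity value and the matching characters.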
3.3 Smith-Waterman Algorithm
The Smith–Waterman algorithm is a well-known algorithm for
performing local sequence alignment; that is, for determining
similar regions between two nucleotide or protein sequences.
Instead of looking at the total sequence, the Smith–Waterman
algorithm compares segments of all possible lengths and
optimizes the similarity measure.
The Smith-Waterman algorithm differs from global alignment
algorithms such as Needleman-Wunsch in the following respects:
a. A negative score/weight must be given to mismatches.
b. Zero must be the minimum score recorded in the
matrix.
c. The beginning and end of an optimal path may be found
anywhere in the matrix, not just in the last row or column.
Pseudo code for the core Smith-Waterman algorithm is as follows.
Pseudo code for initialization and filling of the matrix
For i=0 to length(A)
    F(i,0) ← 0
For j=0 to length(B)
    F(0,j) ← 0
For i=1 to length(A)
    For j=1 to length(B)
    {
        Diag ← F(i-1,j-1) + S(Ai, Bj)
        Up ← F(i-1, j) - d
        Left ← F(i, j-1) - d
        F(i,j) ← max(0, Diag, Up, Left)
    }
Pseudo code for SW alignment
For (int i=1;i<=n;i++)
{
    For (int j=1;j<=m;j++)
    {
        int s=score[seq1.charAt(i-1)][seq2.charAt(j-1)];
        int val=max(0,F[i-1][j-1]+s,F[i-1][j]-d,F[i][j-1]-d);
        F[i][j]=val;
        If (val==0)
            B[i][j]=null;
        Else if(val==F[i-1][j-1]+s)
            B[i][j]=new Traceback2(i-1,j-1);
        Else if(val==F[i-1][j]-d)
            B[i][j]=new Traceback2(i-1,j);
        Else if(val==F[i][j-1]-d)
            B[i][j]=new Traceback2(i,j-1);
    }
}
Where i and j index positions in the first and second sequence
respectively (the rows and columns of F), S(Ai, Bj) is the
value of the substitution matrix for the two aligned characters and d is the gap penalty. The
substitution matrix is a matrix which describes the rate at
which one character in a sequence changes to other character
states over time.
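The scoring and traceback steps can be combined into one runnable sketch. The Python version below is illustrative only: the function name smith_waterman and the values match = 2, mismatch = -1 and gap penalty = 1 are assumptions, and a simple match/mismatch rule stands in for a full substitution matrix.

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = 1) -> Tuple[int, str, str]:
    """Return (best local score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    # First row and column stay 0, so an alignment may start anywhere.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] - gap,     # gap in b
                          F[i][j - 1] - gap)     # gap in a
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("XXACGTXX", "YYACGTYY"))  # (8, 'ACGT', 'ACGT')
```

Because the traceback starts at the maximum cell and stops at the first zero, only the best-matching region is reported and the unrelated flanking characters are ignored.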
3.4 Results within the Same Families
Table 1. Family: cation-diffusion facilitator
Figure 1. Similarity graph for the cation-diffusion facilitator
family
Table 2. Family: semaphorin
Figure 2. Similarity graph for the semaphorin family
In Figure 2, the blue line indicates similarity computed by the
Smith-Waterman algorithm, the red line indicates similarity computed by the longest
common subsequence algorithm, and the green line
indicates similarity computed by the Levenshtein algorithm. As the
graph shows, the Smith-Waterman algorithm is more effective than the
other two algorithms at finding the similarity of gene
sequences that belong to the same family.
3.5 Results between Different Families
Figure 3. Similarity graph across families
Here the red line indicates the LCS algorithm, the green line indicates
the Levenshtein edit distance algorithm and the blue line indicates the
Smith-Waterman algorithm. From the above graph, we can conclude
that when comparing two gene sequences belonging to
different families, longest common subsequence is the better
algorithm because it gives the maximum similarity compared to the
other two algorithms.
Table 3. Similarities across the families
CONCLUSIONS
We considered the problem of finding gene sequence similarity using
dynamic programming for our project work. In dynamic
programming there exist many different approaches to finding
similarity among gene sequences; we took some of these
algorithms for our project and carried out a comparative analysis of
them using data sets from five different families.
We took different protein sequences from all these data sets as
input to our program and carried out rigorous experimentation on
them, both within the families and across the families.
The five data sets used for our experimental work are
kruppel c2h2-type zinc finger protein, cation-diffusion
facilitator (CDF) transporter, E3 ubiquitin-protein ligase,
semaphorin and SPAG11B, and we obtained the results
discussed in the previous section. From the results we can
conclude that the Smith-Waterman algorithm is best suited to finding
similarity for protein sequences that belong to the same
family, and the longest common subsequence algorithm is best
suited for protein sequences that belong to different
families.
REFERENCES
[1] D. S. Hirschberg; “Algorithms for the longest common
subsequence problem”; J. ACM, 24; 664-675; 1977
[2] V. I. Levenshtein; “Binary codes capable of correcting
deletions, insertions and reversals”; Soviet Physics
Doklady; vol 8; 1966
[3] E. S. Ristad, P. N. Yianilos; “Learning string edit
distance”; IEEE Transactions on Pattern Analysis and
Machine Intelligence; 1998
[4] L. Bergroth; “Survey of Longest Common Subsequence
Algorithms”; Department of Computer Science,
University of Turku, 20520 Turku, Finland; IEEE; 2000
[5] Heikki Hyyrö, Ayumi Shinohara; “A new bit-parallel-
distance algorithm”; Nikoltseas. LNCS 3772; 2005
[6] Adrian Horia Dediu et al.; “A fast longest common
subsequence algorithm for similar strings”; Language
and Automata Theory and Applications, International
Conference, LATA; 2010
[7] Patsaraporn Somboonsat, Mud-Armeen Munlin; “A new
edit distance method for finding similarity in DNA
sequence”; World Academy of Science, Engineering and
Technology 58; 2011
[8] Dekang Lin; “An Information-Theoretic Definition of
Similarity”; Department of Computer Science,
University of Manitoba, Winnipeg, Manitoba, Canada
R3T 2N2
[9] Xingqin Qi, Qin Wu, Yusen Zhang, Eddie Fuller and
Cun-Quan Zhang; “A Novel Model for DNA
Sequence Similarity Analysis Based on Graph Theory”;
Department of Mathematics, West Virginia University,
Morgantown, WV, USA, 26506; School of Mathematics
and Statistics, Shandong University at Weihai, Weihai,
China, 264209
[10] Gina M. Cannarozzi; “String Alignment using Dynamic
Programming”
[11] David R. Bentley; “Whole-genome re-sequencing”
[12] M. Madan Babu; “Biological Databases and Protein
Sequence Analysis”; Center for Biotechnology, Anna
University, Chennai
[13] Richard O. Duda, Peter E. Hart, David G. Stork;
“Pattern Classification”; 2nd edition
[14] David Sankoff; “Simultaneous solution of the RNA
folding, alignment and protosequence problems”; SIAM
J. Appl. Math.; vol 45; 1985
[15] http://www.ncbi.nlm.nih.gov/
[16] http://www.uniprot.org/uniprot/

Comparative analysis of dynamic programming algorithms to find similarity in gene sequences

  • 1. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 02 Issue: 08 | Aug-2013, Available @ http://www.ijret.org 312 COMPARATIVE ANALYSIS OF DYNAMIC PROGRAMMING ALGORITHMS TO FIND SIMILARITY IN GENE SEQUENCES Shankar Biradar1 , Vinod Desai2 , Basavaraj Madagouda3 , Manjunath Patil4 1, 2, 3, 4 Assistant Professor, Department of Computer Science & Engineering, Angadi Institute of Technology and Management, Belgaum, Karnataka, India. [email protected], [email protected], [email protected], [email protected] Abstract There exist many computational methods for finding similarity in gene sequence, finding suitable methods that gives optimal similarity is difficult task. Objective of this project is to find an appropriate method to compute similarity in gene/protein sequence, both within the families and across the families. Many dynamic programming algorithms like Levenshtein edit distance; Longest Common Subsequence and Smith-waterman have used dynamic programming approach to find similarities between two sequences. But none of the method mentioned above have used real benchmark data sets. They have only used dynamic programming algorithms for synthetic data. We proposed a new method to compute similarity. The performance of the proposed algorithm is evaluated using number of data sets from various families, and similarity value is calculated both within the family and across the families. A comparative analysis and time complexity of the proposed method reveal that Smith-waterman approach is appropriate method when gene/protein sequence belongs to same family and Longest Common Subsequence is best suited when sequence belong to two different families. Keywords - Bioinformatics, Gene, Gene Sequencing, Edit distance, String Similarity. 
-----------------------------------------------------------------------***----------------------------------------------------------------------- 1. INTRODUCTION Bioinformatics is the application of computer technology to the management of biological information. The field of bioinformatics has gained widespread popularity largely due to efforts such as the genome projects, which have produced lot of biological sequence data for analysis. This has led to the development and improvement of many computational techniques for making inference in biology and medicine. A gene is a molecular unit of heredity of a living organism. It is a name given to some stretches of DNA and RNA that code for a polypeptide or for an RNA chain that has a function in the organism. Genes hold the information to build and maintain an organism's cells and pass genetic characteristic to their child. Gene sequencing can be used to gain important information on genes, genetic variation and gene function for biological and medical studies [13]. Edit distance is a method of finding similarity between gene/protein sequences by finding dissimilarity between two sequences [5]. Edit distance between source and target string is represented by how many fundamental operation are required to transfer source string into target, these fundamental operations are insertion, deletion and subtraction. The similarity of two strings is the minimum number of edit distance. String Similarity is quantitative term that shows degree of commonality or difference between two comparative sequences [10], Finding the gene similarity has massive use in the field of bioinformatics. 2. MATERIALS AND METHODS In this section we describe the various materials and methods which are used in our algorithms 2.1 Dataset Used For the experiment purpose we took data sets from 5 different families which are listed below, and the source of information is [16] [17]. Family: kruppel c2h2-type zinc finger protein. 
Family: caution-diffusion facilitator (CDF) transporter family. Family: E3 ubquitin-protein ligase. Family: Semaphorin-7A. Family: SPAG11 family. 2.2 Dataset Format In this research work we used various data sets from different families for the implementation of different algorithms, all this data set is in FASTA format. In bioinformatics, FASTA format is a text-based format for representing nucleotide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also contain sequence name before the sequences start. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the
  • 2. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 02 Issue: 08 | Aug-2013, Available @ http://www.ijret.org 313 description (both are optional). There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence. 2.3 Gap Penalty In order to get best possible sequence alignment between two DNA sequences, it important to insert gaps in sequence alignment and use gap penalties. While aligning DNA sequences, a positive score is assigned for matches negative score is assigned for mismatch To find out score for matches and mismatches in alignments of proteins, it is necessary to know how often one sequence is substituted for another in related proteins. In addition, a method is needed to account for insertions and deletions that sometimes appear in related DNA or protein sequences. To accommodate such sequence variations, gaps that appear in sequence alignments are given a negative penalty score reflecting the fact that they are not expected to occur very often. It is very difficult to get the best- possible alignment, either global or local, unless gaps are included in the alignment. 2.4 Blosum Matrix A Blosum matrix is necessary for pair wise sequence alignment. The four DNA bases are of two types, purines (A and G) and pyrimidines (T and C). The purines are chemically similar to each other and the pyrimidines are chemically similar to each other. Therefore, we will penalize substitutions between a purine and a purine or between a pyrimidine and a pyrimidine (transitions) less heavily than substitutions between purines and pyrimidines (transfusions). 
We use the following matrix for matches and substitutions. The score is 2 for a match, 1 for a purine with a purine or a pyrimidine with a pyrimidine, and -1 for a purine with a pyrimidine.

         A    G    C    T
    A    2    1   -1   -1
    G    1    2   -1   -1
    C   -1   -1    2    1
    T   -1   -1    1    2

3. ALGORITHMS

The dynamic programming algorithms for finding gene sequence similarity are discussed in detail in this section, along with their pseudocode. We used three algorithms for the analysis. All of them use dynamic programming, in which the solution to each subproblem is built from the solutions of previously solved, smaller subproblems. The three algorithms are:

a. Levenshtein edit distance algorithm
b. Longest common subsequence algorithm
c. Smith-Waterman algorithm

3.1 Levenshtein Edit Distance Algorithms

This is one of the most popular algorithms for finding the dissimilarity between two nucleotide sequences. It is an approximate string matching algorithm mainly used on forensic data sets, and its basic principle is to measure the similarity between two strings [4]. This is done by counting the minimum number of basic operations (insertion, deletion and substitution, as mentioned in the introduction) needed to transform one string into the other. The algorithm for Levenshtein edit distance is as follows:

    int LevenshteinDistance(char S[1..N], char T[1..M])
    {
        declare int D[0..N, 0..M]
        for i from 0 to N
            D[i, 0] := i    // distance from a prefix of S to an empty second string
        for j from 0 to M
            D[0, j] := j    // distance from a prefix of T to an empty first string
        for j from 1 to M
        {
            for i from 1 to N
            {
                if S[i] = T[j] then
                    D[i, j] := D[i-1, j-1]
                else
                    D[i, j] := min {
                        D[i-1, j] + 1,      // deletion
                        D[i, j-1] + 1,      // insertion
                        D[i-1, j-1] + 1     // substitution
                    }
            }
        }
        return D[N, M]
    }

3.2 Longest Common Subsequence Algorithm

Finding the LCS [3] [8] is one way of computing how similar two sequences are: the longer the LCS, the more similar they are. The LCS problem is a special case of the edit distance problem. The LCS algorithm is similar to the Levenshtein edit distance algorithm except for a few steps, and it also involves a traceback process in order to recover the common subsequence.
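The two recurrences discussed so far can be sketched side by side in Python. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, strings are 0-indexed (so `s[i - 1]` plays the role of S[i]), and the LCS traceback shown is one common way to recover a subsequence from the filled table.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # delete all i characters of s
    for j in range(m + 1):
        d[0][j] = j          # insert all j characters of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j],      # deletion
                                  d[i][j - 1],      # insertion
                                  d[i - 1][j - 1])  # substitution
    return d[n][m]


def lcs(s: str, t: str) -> str:
    """Return one longest common subsequence of s and t."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 and column 0 stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    # Traceback from the bottom-right cell to recover the subsequence.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])
            i -= 1
            j -= 1
        elif d[i - 1][j] >= d[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(lcs("AGGTAB", "GXTXAYB"))
```

Both functions fill an (N+1) x (M+1) table, so each runs in O(N*M) time and space; the only structural differences are the boundary values and whether the recurrence minimizes a cost or maximizes a length.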
The algorithm for longest common subsequence is as follows:

    int LCS(char S[1..N], char T[1..M])
    {
        declare int D[0..N, 0..M]
        for i from 0 to N
            D[i, 0] := 0
        for j from 0 to M
            D[0, j] := 0
        for j from 1 to M
        {
            for i from 1 to N
            {
                D[i, j] := max {
                    V1,
                    V2,
                    V3 + 1 if S[i] = T[j], else V3
                }
            }
        }
        return D[N, M]
    }

Here V1 is the value in the cell to the left of the current cell, V2 is the value in the cell above the current cell, and V3 is the value in the cell above and to the left of the current cell. S and T are the source string and target string respectively.

3.3 Smith-Waterman Algorithm

The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. The Smith-Waterman algorithm differs from global alignment algorithms (such as Needleman-Wunsch) in the following respects:

a. A negative score/weight must be given to mismatches.
b. Zero must be the minimum score recorded in the matrix.
c. The beginning and end of an optimal path may be found anywhere in the matrix, not just in the last row or column.

Pseudocode for the core Smith-Waterman algorithm is as follows.
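The three conditions listed above (a to c) can also be illustrated with a short Python sketch, given alongside the paper's own pseudocode rather than as a replacement for it. The function name and the default scores (match 2, mismatch -1, a linear gap penalty of 1 subtracted per gap) are assumptions chosen for the example; a real run would plug in a substitution matrix instead of the simple match/mismatch rule.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = 1):
    """Local alignment of a and b with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b). All scoring defaults
    are illustrative assumptions, not values taken from the paper.
    """
    n, m = len(a), len(b)
    f = [[0] * (m + 1) for _ in range(n + 1)]   # first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch   # (a) negative mismatch
            f[i][j] = max(0,                                  # (b) floor at zero
                          f[i - 1][j - 1] + s,
                          f[i - 1][j] - gap,
                          f[i][j - 1] - gap)
            if f[i][j] > best:                                # (c) best cell may be anywhere
                best, best_pos = f[i][j], (i, j)

    # Traceback from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and f[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if f[i][j] == f[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif f[i][j] == f[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

In the usage example the two inputs share only the segment ACG, so the traceback starts at the highest-scoring cell in the interior of the matrix and stops as soon as the score drops to zero, which is what makes the alignment local.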
Pseudocode for initialization and filling of the matrix:

    for i = 0 to length(A)
        F(i, 0) ← 0
    for j = 0 to length(B)
        F(0, j) ← 0
    for i = 1 to length(A)
        for j = 1 to length(B)
        {
            Diag ← F(i-1, j-1) + S(Ai, Bj)
            Up   ← F(i-1, j) - d
            Left ← F(i, j-1) - d
            F(i, j) ← max(0, Diag, Up, Left)
        }

Pseudocode for Smith-Waterman alignment with traceback pointers:

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int s = score[seq1.charAt(i-1)][seq2.charAt(j-1)];
            int val = max(0, F[i-1][j-1] + s, F[i-1][j] - d, F[i][j-1] - d);
            F[i][j] = val;
            if (val == 0)
                B[i][j] = null;
            else if (val == F[i-1][j-1] + s)
                B[i][j] = new Traceback2(i-1, j-1);
            else if (val == F[i-1][j] - d)
                B[i][j] = new Traceback2(i-1, j);
            else if (val == F[i][j-1] - d)
                B[i][j] = new Traceback2(i, j-1);
        }
    }

Here i and j index positions in the first and second sequence respectively, S(Ai, Bj) is the value of the substitution matrix for the characters Ai and Bj, and d is the gap penalty, taken as a positive number that is subtracted for each gap. The substitution matrix is a matrix that describes the rate at which one character in a sequence changes to other character states over time.

3.4 Results within the Same Families

Table 1. Family: cation-diffusion facilitator

Figure 1. Similarity graph for the cation-diffusion facilitator family
Table 2. Family: semaphorin

Figure 2. Similarity graph for the semaphorin family

In Figure 2, the blue line indicates similarity under the Smith-Waterman algorithm, the red line indicates similarity under the longest common subsequence algorithm, and the green line indicates similarity under the Levenshtein algorithm. As the graph shows, the Smith-Waterman algorithm is more effective than the other two algorithms at finding the similarity of gene sequences that belong to the same family.

3.5 Results between Different Families

Figure 3. Similarity graph across families

In Figure 3, the red line indicates the LCS algorithm, the green line indicates the Levenshtein edit distance algorithm, and the blue line indicates the Smith-Waterman algorithm. From this graph we can conclude that, when comparing two gene sequences belonging to different families, longest common subsequence is the better algorithm because it gives the highest similarity compared with the other two algorithms.

Table 3. Similarities across the families

CONCLUSIONS

For our project work we considered the problem of finding gene sequence similarity using dynamic programming. Dynamic programming offers many different approaches to finding similarity among gene sequences; we took some of these algorithms and carried out a comparative analysis of them using data sets from five different families. We took different protein sequences from all these data sets as input to our program and carried out rigorous experimentation on them, both within the families and across the families.
The five data sets used in our experimental work are Kruppel C2H2-type zinc finger protein, cation-diffusion facilitator (CDF) transporter, E3 ubiquitin-protein ligase, semaphorin and SPAG11B; the results obtained on them are discussed in the previous section. From the results we can conclude that the Smith-Waterman algorithm is best suited to finding similarity between protein sequences that belong to the same family, and the longest common subsequence algorithm is best
suited for protein sequences that belong to different families.

REFERENCES

[1] D. S. Hirschberg; "Algorithms for the longest common subsequence problem"; J. ACM, 24; 664-675; 1977
[2] V. I. Levenshtein; "Binary codes capable of correcting deletions, insertions, and reversals"; Soviet Physics Doklady; vol. 10, no. 8; 1966
[3] E. S. Ristad, P. N. Yianilos; "Learning string edit distance"; IEEE Transactions on Pattern Analysis and Machine Intelligence; 1998
[4] L. Bergroth; "Survey of Longest Common Subsequence Algorithms"; Department of Computer Science, University of Turku, 20520 Turku, Finland; IEEE; 2000
[5] Heikki Hyyrö, Ayumi Shinohara; "A new bit-parallel-distance algorithm"; Nikoltseas. LNCS 3772; 2005
[6] Adrian Horia Dediu, et al.; "A fast longest common subsequence algorithm for similar strings"; Language and Automata Theory and Applications, International Conference, LATA; 2010
[7] Patsaraporn Somboonsat, Mud-Armeen Munlin; "A new edit distance method for finding similarity in DNA sequence"; World Academy of Science, Engineering and Technology 58; 2011
[8] Dekang Lin; "An Information-Theoretic Definition of Similarity"; Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada R3T 2N2
[9] Xingqin Qi, Qin Wu, Yusen Zhang, Eddie Fuller and Cun-Quan Zhang; "A Novel Model for DNA Sequence Similarity Analysis Based on Graph Theory"; Department of Mathematics, West Virginia University, Morgantown, WV, USA, 26506; School of Mathematics and Statistics, Shandong University at Weihai, Weihai, China, 264209
[10] Gina M. Cannarozzi; "String Alignment using Dynamic Programming"
[11] David R. Bentley; "Whole-genome re-sequencing"
[12] M. Madan Babu; "Biological Databases and Protein Sequence Analysis"; Center for Biotechnology, Anna University, Chennai
[13] Richard O. Duda, Peter E. Hart, David G. Stork; "Pattern Classification"; 2nd edition
[14] David Sankoff; "Simultaneous solution of the RNA folding, alignment and protosequence problems"; SIAM J. Appl. Math.; vol. 45; 1985
[15] http://www.ncbi.nlm.nih.gov/
[16] http://www.uniprot.org/uniprot/