An Empirical Study of Choosing Efficient Discriminative
Seeds for Oligonucleotide Design
Won-Hyong Chung and Seong-Bae Park
Dept. of Computer Engineering
Kyungpook National University, South Korea
Motivation
• Issues for designing oligonucleotides
– To minimize the cross-hybridizations
– To minimize the computing time
• Seeding (or indexing) has been widely used to
address these issues by prescreening unreliable sequence regions before
calculating cross-hybridizations.
• Although many types of seeding methods have
been proposed, no measure has yet been proposed
for evaluating how adequate and efficient seeds are
in oligonucleotide design.
Difference between alignment and
oligonucleotide design
• Alignment
– To find all possible alignments with sufficiently
high scores.
– Sensitivity is important, while specificity is usually
guaranteed by the seed’s own specificity.
• Oligonucleotide design
– To find optimal oligonucleotides to differentiate
target sequences from the others.
– Specificity should be considered as well as
sensitivity for checking cross-hybridization.
Objectives
• We propose novel measures for evaluating
seeds based on discriminability and
efficiency.
• We examine five seeding methods in
oligonucleotide design.
– continuous, spaced, transition-constrained, BLAT, and
Vector seed
• We provide a software package, SeedChooser,
which enables users to obtain adequate seeds
for their own experimental conditions.
What is a Seed?
• Seeding process
– Filtering step: short fixed-length common words
that are found in both the query and target
sequences are selected.
– Extension step: the selected words are extended
to the size of the oligonucleotide and checked for
cross-hybridization.
Seed = the filtering template of the fixed-length words
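The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a continuous seed of length k, treats the query as a single candidate oligo, and uses a plain mismatch count as a crude stand-in for a real cross-hybridization check. The function names are hypothetical.

```python
from collections import defaultdict


def build_index(target, k):
    """Map every k-length word in the target to its start positions."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    return index


def filter_hits(oligo, target, k):
    """Filtering step: (oligo_pos, target_pos) pairs that share a k-length word."""
    index = build_index(target, k)
    hits = []
    for q in range(len(oligo) - k + 1):
        for t in index.get(oligo[q:q + k], []):
            hits.append((q, t))
    return hits


def extend_hit(oligo, target, q, t):
    """Extension step: align the whole oligo around a hit and count mismatches.

    Returns (window_start, mismatches), or None if the oligo-sized window
    would run past either end of the target. The mismatch count is only a
    placeholder for a proper cross-hybridization calculation.
    """
    start = t - q
    if start < 0 or start + len(oligo) > len(target):
        return None
    window = target[start:start + len(oligo)]
    mismatches = sum(a != b for a, b in zip(oligo, window))
    return start, mismatches


oligo = "ACGTACGT"
target = "TTACGTACCTGG"
hits = filter_hits(oligo, target, k=4)
extended = [extend_hit(oligo, target, q, t) for q, t in hits]
```

Only windows that survive the filter are extended, which is why the choice of seed controls both what is found and how much work the extension step has to do.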
Seeding methods (1/2)
• Continuous seed: a seed that finds exact matches
of length k
– BLAST employs the 11-bp seed 11111111111.
• Spaced seed: allows don’t-care positions, labeled ‘0’, in
the seed
– PatternHunter uses the 18-bp seed 101101100111001011, which
contains 11 match positions.
• Transition-constrained seed: adds a transition
(A <-> G, C <-> T) position, labeled ‘@’, to the seed
– YASS uses seeds such as 1110@10010@1010111, which is 18 bp
long with 10 match positions and 2 transition positions.
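As an illustration of how these three templates behave, the sketch below checks whether two equal-length words form a hit under a template written with '1' (must match), '0' (don't care), and '@' (match or transition). The function names are hypothetical, and treating '@' as "match or transition" is an assumption about the seed's semantics.

```python
# Transition pairs: purine <-> purine (A, G) and pyrimidine <-> pyrimidine (C, T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_weight(template):
    """Number of positions that must match exactly."""
    return template.count("1")


def seed_hit(template, a, b):
    """True if words a and b form a hit under the template."""
    if not (len(a) == len(b) == len(template)):
        raise ValueError("words and template must have the same length")
    for symbol, x, y in zip(template, a, b):
        if symbol == "1" and x != y:
            return False
        if symbol == "@" and x != y and (x, y) not in TRANSITIONS:
            return False
        # symbol == "0": any pair of letters is accepted
    return True


continuous = "11111111111"
spaced = "101101100111001011"
transition = "1110@10010@1010111"
```

For example, `seed_hit("101", "ACG", "ATG")` accepts the pair because the middle position is a don't-care, while `seed_hit("111", "ACG", "ATG")` rejects it.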
Seeding methods (2/2)
• BLAT seed: a continuous seed that allows one or two
mismatches at any position of the seed.
• Vector seed: a generalized seed that combines the
ideas of the BLAT seed and the spaced seed.
• BLAT and vector seeds allow some mismatches
at any position.
– They greatly increase sensitivity but spend much
more computing time than the previous seeds.
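A rough sketch of these two mismatch-tolerant seeds follows. It makes a simplifying assumption: a vector seed is modeled as a list of per-position weights plus a threshold, with each matching position contributing its weight and each mismatch contributing zero. The slides do not give a formal definition, so this should be read as one plausible formulation rather than the definition used in the study. The function names are hypothetical.

```python
def mismatch_hit(a, b, max_mismatches):
    """BLAT-style hit: equal-length words differing in at most max_mismatches positions."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b)) <= max_mismatches


def vector_hit(weights, threshold, a, b):
    """Vector-style hit: the summed weights of matching positions reach the threshold.

    With all weights 1 and threshold == len(weights) this behaves like a
    continuous seed; with 0/1 weights and threshold == sum(weights) like a
    spaced seed; with all weights 1 and a lower threshold like a
    mismatch-tolerant (BLAT-style) seed.
    """
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("words and weights must have the same length")
    score = sum(w for w, x, y in zip(weights, a, b) if x == y)
    return score >= threshold
```

Lowering the threshold admits more word pairs as hits, which is where the extra sensitivity and the extra computing time both come from.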
The Issues of seeds for oligo design
• An ideal seed should filter out, as fast as possible, all regions that
have no possibility of being chosen as an oligo.
– A seed should find as many oligos as possible.
– A seed should avoid finding non-oligo regions.
– A seed should minimize the cost of indexing to
find oligos.
Discriminability
The discriminability is a balance between
precision and recall to minimize both false
positives and false negatives.
P = (# of seed indices that hit oligos) / (# of seed indices)
R = (# of oligos containing seed hit(s)) / (# of oligos)
α: parameter of the discriminability measure
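The precision and recall ratios can be computed directly from the four counts named on this slide. How the slide combines them with α does not survive in this text, so the sketch below substitutes the standard weighted F-measure as a stand-in; that choice is an assumption, not the authors' formula, and the function names are hypothetical.

```python
def precision(indices_hitting_oligos, total_indices):
    """P: fraction of seed indices that hit an oligo."""
    return indices_hitting_oligos / total_indices if total_indices else 0.0


def recall(oligos_with_hit, total_oligos):
    """R: fraction of oligos that contain at least one seed hit."""
    return oligos_with_hit / total_oligos if total_oligos else 0.0


def weighted_f(p, r, alpha=1.0):
    """Standard weighted F-measure, used here only as an assumed stand-in.

    alpha > 1 favors recall, alpha < 1 favors precision, alpha == 1 is the
    harmonic mean of p and r.
    """
    denom = alpha ** 2 * p + r
    return (1 + alpha ** 2) * p * r / denom if denom else 0.0


p = precision(30, 120)   # 30 of 120 seed indices hit an oligo
r = recall(8, 10)        # 8 of 10 oligos contain a seed hit
score = weighted_f(p, r, alpha=1.0)
```

A seed with many indices but few oligo hits is penalized through P, and a seed that misses oligos is penalized through R, which matches the balance described above.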
Efficiency
The efficiency is the proportion of useful regions
filtered by a seed.
– the duplication ratio of generated indices
– the average number of indices in each oligo
D = (# of generated seed indices) / (# of unique seed indices)
A = (# of seed indices in oligos) / (# of oligos)
β, γ: parameters of the efficiency measure
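The two ratios on this slide are straightforward to compute from a list of generated seed indices. The sketch below covers only D and A; how they are combined with β and γ into a single efficiency value is not recoverable from this text, so it is left out. Representing each seed index as a string and the function names are illustrative assumptions.

```python
def duplication_ratio(seed_indices):
    """D: generated seed indices divided by unique seed indices."""
    if not seed_indices:
        return 0.0
    return len(seed_indices) / len(set(seed_indices))


def avg_indices_per_oligo(indices_in_oligos, n_oligos):
    """A: average number of seed indices per oligo."""
    return indices_in_oligos / n_oligos if n_oligos else 0.0


generated = ["ACG", "ACG", "CGT", "GTA"]   # one duplicated index
d = duplication_ratio(generated)
a = avg_indices_per_oligo(12, 4)
```

A value of D close to 1 means few duplicated indices, and a small A means each oligo is reached with few index lookups.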
Efficient discriminability
The efficient discriminative seed is the seed
that has the maximum efficient
discriminability value for the given
Experiments
• Empirically chosen seeds were evaluated by three
measures: discriminability, efficiency, and efficient
discriminability.
• We tested the seeds for designing the 50mer oligos.
– The parameters are set to 1 for evaluation.
• Simulated data set
– A set of random sequences generated by
OligoGenerator in SeedChooser.
• Biological data set
– Ecologically important genes involved in the nitrogen and
carbon cycles.
– nirS: nitrite reductase gene set
– pmoA: methane monooxygenase gene set
Discriminability of the five seeding methods
[Figure: discriminability (y-axis, 0.2 to 1.0) versus seed weight (x-axis, 5 to 30) for the Continuous, Spaced, Transition, BLAT, and Vector seeds.]
Efficiency of the five seeding methods
[Figure: efficiency (y-axis, 0.06 to 0.18) versus seed weight (x-axis, 5 to 30) for the Continuous, Spaced, Transition, BLAT, and Vector seeds.]
Efficient Discriminability of the five seeding methods
[Figure: efficient discriminability (y-axis, 0.02 to 0.12) versus seed weight (x-axis, 5 to 30) for the Continuous, Spaced, Transition, BLAT, and Vector seeds.]
Evaluation results of pmoA data set
Evaluation results of nirS data set
SeedChooser: Seed Evaluation and
Recommendation Tools
• SeedChooser: recommends the best seeds for the given evaluation
parameters. It uses a genetic algorithm to find the best seeds.
• SeedEvaluator: evaluates a set of seeds under the given
parameters.
• OligoGenerator: generates a set of oligos for the desired
experimental conditions.
• SeedChooser homepage
http://ml.knu.ac.kr/~whchung/seedchooser.html
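To give a feel for what a search over seed templates can look like, here is a deliberately simplified sketch. It is a mutation-only evolutionary search (no crossover), a reduced relative of a genetic algorithm, and it is not SeedChooser's actual algorithm. The fitness function is supplied by the caller; `toy_fitness` below is a made-up placeholder that a real run would replace with an evaluation such as efficient discriminability. All names and default parameters are hypothetical.

```python
import random


def random_template(length, weight, rng):
    """Random 0/1 template with a fixed weight; both end positions are '1'."""
    inner = ["1"] * (weight - 2) + ["0"] * (length - weight)
    rng.shuffle(inner)
    return "1" + "".join(inner) + "1"


def mutate(template, rng):
    """Swap one inner '1' with one inner '0', so the weight is preserved."""
    chars = list(template)
    ones = [i for i in range(1, len(chars) - 1) if chars[i] == "1"]
    zeros = [i for i in range(1, len(chars) - 1) if chars[i] == "0"]
    if ones and zeros:
        i, j = rng.choice(ones), rng.choice(zeros)
        chars[i], chars[j] = "0", "1"
    return "".join(chars)


def evolve(fitness, length, weight, pop_size=20, generations=30, rng=None):
    """Keep the better half each generation and refill with mutated copies."""
    rng = rng or random.Random(0)
    population = [random_template(length, weight, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)


def toy_fitness(template):
    """Placeholder score: number of adjacent positions that differ."""
    return sum(a != b for a, b in zip(template, template[1:]))


best = evolve(toy_fitness, length=18, weight=11)
```

Because the survivors are carried over unchanged, the best score in the population never decreases from one generation to the next.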
CONCLUSION
• We proposed novel measures for evaluating seeds in
oligo design, based on discriminability and
efficiency.
• The spaced seed was generally preferred to the other
seeding methods.
• Our study can be applied to the oligo design
programs in order to improve the performance by
suggesting the experiment-specific seeds.
• We expect that our study will also be helpful for other
genomic tasks.
Supplementary materials
[Figure: schematic with labeled elements P0 to P2, T0 to T3, and S0 to S3.]
• T1, T2, T3: the target sequences.
• P1 and P2 are the matched oligos for an oligo P0
• S1, S2 and S3 are the seed indices for S0 by a seed.
Relations of precision, recall and discriminability
[Figure: precision, recall, and discriminability (y-axis, 0.2 to 1.2) versus seed weight (x-axis, 6 to 26).]
Discriminability according to values of α
[Figure: discriminability (y-axis, 0.2 to 1.2) versus seed weight (x-axis, 6 to 26), with one curve for each value of α: 1/8, 1/4, 1/2, 1, 2, 4, 8.]
Efficiency according to values of β and γ
[Figure: surface plot of efficiency as a function of β and γ, each ranging from 0.0 to 1.0.]
Efficient Discriminability for 70mer Oligos
[Figure: efficient discriminability (y-axis, 0.00 to 0.10) versus seed weight (x-axis, 5 to 30) for the Continuous, Spaced, Transition, BLAT, and Vector seeds.]