Mathematical Cell

The promoter finder PlatProm can analyze a given nucleotide sequence as a circular DNA molecule, evaluating for each position in both strands of the entire genome (or a selected area) the probability that it is a transcription start point (TSP). To create continuity, the program closes the submitted sequence into a circle and automatically produces the calculated scores for both strands of the genome (or selected area). If a short fragment is to be scanned, this "cycling" is not biologically relevant. In that case an additional 255 bp are required upstream of the target sequence and 155 bp downstream of its end. Because PlatProm calculates the score by analyzing the sequence context around each position in the range -255 to +155, scanning in this case begins at position 256 and should terminate 155 bp upstream of the end.
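The windowing rule above can be sketched as follows. This is only an illustration, not PlatProm's code: the function names are hypothetical, positions are 1-based, and the context is assumed to be 255 bases before and 155 bases after the scored position.

```python
UPSTREAM = 255    # context needed upstream of each scored position
DOWNSTREAM = 155  # context needed downstream of each scored position


def scannable_positions(length, circular):
    """Return (first, last) 1-based positions that can be scored, or None."""
    if length <= 0:
        return None
    if circular:
        # On a closed circle every position has full context.
        return (1, length)
    first = UPSTREAM + 1
    last = length - DOWNSTREAM
    return (first, last) if first <= last else None


def context_window(seq, pos, circular):
    """Return the sequence context around 1-based position pos."""
    n = len(seq)
    start = (pos - 1) - UPSTREAM   # 0-based, inclusive
    end = (pos - 1) + DOWNSTREAM   # 0-based, inclusive
    if circular:
        # Wrap around the ends of the molecule.
        return "".join(seq[k % n] for k in range(start, end + 1))
    if start < 0 or end >= n:
        raise ValueError("not enough flanking sequence for this position")
    return seq[start:end + 1]
```

For a linear fragment of 1000 bp this gives positions 256 to 845; a fragment shorter than 411 bp has no scannable position under these assumptions.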

Three types of position weight matrices (PWM) [1, 2] are used to account for all contextual and structural parameters found in this region of known promoters. Two PWMs assess how well the sequence matches the conserved hexanucleotides (TTGACA and TATAAT near positions -35 and -10, respectively), which form specific contacts with the σ70 subunit of RNA polymerase. Two other matrices take into account the specific distribution of dinucleotides around the TSP (CA and TA dominate at positions -1/+1) and the dinucleotides flanking the 5'-end of the -10 module (the "extended -10" element, where TG dominates). The scores of these four matrices reflect the occurrence frequency of all nucleotides/dinucleotides at all positions of the conserved elements, taking into account both positive and negative contributions of the pairs present. Fifty-four "cascade" matrices assess the presence, around the analyzed position, of thermodynamically unstable, structuring, and other non-canonical sequence motifs whose occurrence frequency in promoter DNA exceeds the background value by at least 5 StD [3, 6]. The presence of these elements in a given area is scored by the cascade matrices on an alternative principle: only the motif with the highest score is taken into account. If none of the motifs typical for a given promoter region is present, a negative contribution is added to the total score; its size is estimated from the percentage of such promoters in the training set. The contribution of each cascade matrix is normalized by the relative information content of the analyzed promoter sub-region and of the least conserved sixth base pair of the -35 element. As a result, the additional elements contribute about 50% of the overall score estimated by PlatProm.
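As a rough illustration of the two scoring ideas in this paragraph, the sketch below builds a generic log-odds PWM from a few aligned sites and applies the "alternative principle" to a set of cascade motif scores. Everything here is an assumption made for illustration: the toy sites, the pseudocount, the uniform background and the function names are hypothetical, and the actual PlatProm matrices and normalization are described in [2, 6], not here.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per column) from aligned sites."""
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            b: math.log(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def pwm_score(pwm, word):
    """Sum the per-position log-odds; frequent bases add, rare ones subtract."""
    return sum(column[base] for column, base in zip(pwm, word))


def cascade_contribution(motif_scores, absence_penalty):
    """Alternative principle: keep only the best motif, or a penalty if none."""
    return max(motif_scores) if motif_scores else absence_penalty


# Toy -10-like sites, invented for this example only.
toy_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
toy_pwm = build_pwm(toy_sites)
```

With these toy sites, `pwm_score(toy_pwm, "TATAAT")` is positive and `pwm_score(toy_pwm, "GGGCCC")` is negative, which is the "positive and negative contributions" behaviour the text describes.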

Two matrices take into account the length of the spacer between the -35 and -10 elements (14-21 bp, optimal 17 bp) and the distance between the -10 element and a potential start point (2-11 bp, optimal 6 bp). A deviation from the optimal length gives a negative contribution whose value depends on the proportion of the corresponding promoters in the training set.
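One way to picture such a length-dependent penalty is a log-ratio against the optimal length, sketched below. The formula is an assumption (the text only says the penalty depends on the proportion of promoters with that length), and the proportions in `toy_spacer_proportions` are invented placeholders, not values from the PlatProm training set.

```python
import math


def length_penalty(length, proportions, optimal):
    """Return log(p(length) / p(optimal)), or None if the length is not allowed."""
    p = proportions.get(length, 0.0)
    if p <= 0.0:
        return None
    return math.log(p / proportions[optimal])


# Hypothetical share of promoters per spacer length (14-21 bp), for illustration.
toy_spacer_proportions = {
    14: 0.01, 15: 0.05, 16: 0.20, 17: 0.45,
    18: 0.20, 19: 0.05, 20: 0.03, 21: 0.01,
}
```

Under this sketch the optimal 17 bp spacer contributes 0, every other allowed length contributes a negative value, and lengths outside 14-21 bp are rejected.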

PlatProm treats promoter DNA as a common platform for interaction with RNA polymerase and regulatory proteins. Therefore, as an independent promoter signature, it uses the presence of direct and inverted repeats at least 5 bp in length (L), spaced by 5-6 bp. These repeats can serve as targets for interaction with dimers or tetramers of transcription factors. Their contribution is estimated as log(L).
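A minimal sketch of such a repeat search is shown below. It is not PlatProm's implementation: the function names are hypothetical, only exact matches are considered, and the natural logarithm is assumed because the text does not state the base of log(L).

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]


def find_repeats(seq, min_len=5, spacers=(5, 6)):
    """Return (kind, start, length, spacer) for exact repeats in seq.

    For each 0-based start, spacer and kind, only the longest matching
    repeat length is reported.
    """
    hits = []
    n = len(seq)
    for start in range(n):
        for gap in spacers:
            for kind in ("direct", "inverted"):
                best = 0
                length = min_len
                while start + 2 * length + gap <= n:
                    left = seq[start:start + length]
                    right = seq[start + length + gap:start + 2 * length + gap]
                    target = left if kind == "direct" else reverse_complement(left)
                    if right == target:
                        best = length
                    length += 1
                if best:
                    hits.append((kind, start, best, gap))
    return hits


def repeat_contribution(length):
    """Score contribution of a repeat of the given length (natural log assumed)."""
    return math.log(length)
```

For example, `find_repeats("TGTGACCCCCCTCACA")` reports one inverted repeat of length 5 with a 6 bp spacer, since TCACA is the reverse complement of TGTGA.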

The scoring system of PlatProm does not employ "external" factors, such as information on the location of open reading frames or the presence of Shine-Dalgarno sequences. Therefore, PlatProm can be used for direct genome scanning and can predict promoters for protein-coding genes as well as promoters for the synthesis of untranslated RNAs.

Currently PlatProm is able to scan circular genomes (each strand sequentially) or linear DNA fragments containing only the standard symbols A, T, G and C. If degenerate symbols are present in the given genome, they are replaced with standard ones in accordance with Table 1. In this case the output scores are marked.

Table 1. Vocabulary for replacement of undefined symbols

Symbol	Description	Bases represented				Base used
W	Weak	A			T	A
S	Strong		C	G		G
M	aMino	A	C			C
K	Keto			G	T	G
R	puRine	A		G		G
Y	pYrimidine		C		T	C
B	not A (B comes after A)		C	G	T	G
D	not C (D comes after C)	A		G	T	G
H	not G (H comes after G)	A	C		T	C
V	not T (V comes after T and U)	A	C	G		G
N or -	any Nucleotide (not a gap)	A	C	G	T	G
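The replacement in Table 1 amounts to a simple lookup, sketched below. The mapping is copied from the table; the function name, the upper-casing of the input and the returned list of replaced positions are assumptions made for illustration, since the text does not say how PlatProm marks the affected scores.

```python
# Replacement vocabulary copied from Table 1 (degenerate symbol -> base used).
REPLACEMENTS = {
    "W": "A", "S": "G", "M": "C", "K": "G", "R": "G", "Y": "C",
    "B": "G", "D": "G", "H": "C", "V": "G", "N": "G", "-": "G",
}
STANDARD = set("ATGC")


def replace_degenerate(seq):
    """Return (cleaned sequence, 0-based positions that were replaced)."""
    cleaned = []
    replaced = []
    for i, symbol in enumerate(seq.upper()):
        if symbol in STANDARD:
            cleaned.append(symbol)
        elif symbol in REPLACEMENTS:
            cleaned.append(REPLACEMENTS[symbol])
            replaced.append(i)  # remember where a substitution happened
        else:
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    return "".join(cleaned), replaced
```

Calling `replace_degenerate("ATNGW-")` returns `("ATGGAG", [2, 4, 5])`; the position list could then be used to flag scores whose context contained substituted bases.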

The average score for non-promoter fragments selected from the genome of E. coli and evaluated as a control set is -5.0, with a standard deviation of 3.11 [2]. Signals that exceed the background level by 4 StD (values above 7.44) and are arranged in clusters provide about 99% true positive signals. An automatic calculation of these parameters without special compilations, performed by the method proposed in [6], gives approximately the same values (-4.66 and 3.49, respectively). They are provided as output parameters of the program.
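The threshold arithmetic in this paragraph can be checked with a short sketch. The function names are hypothetical, and `background_parameters` uses the plain sample mean and standard deviation as a stand-in; the automatic estimation method of [6] is not reproduced here.

```python
import statistics


def threshold(mean, std, n_std=4.0):
    """Score cut-off lying n_std standard deviations above the background mean."""
    return mean + n_std * std


def background_parameters(scores):
    """Sample mean and standard deviation of a list of background scores."""
    return statistics.mean(scores), statistics.stdev(scores)


print(round(threshold(-5.0, 3.11), 2))   # 7.44, the value quoted in the text
print(round(threshold(-4.66, 3.49), 2))  # 9.3, same rule with the automatic estimates
```

A user who wants a different reliability level would change `n_std`, which corresponds to setting the level in terms of StD.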

The conservation of the bacterial transcriptional machinery allows PlatProm to be used for promoter prediction in the genomes of other bacteria, including those whose AT/GC content differs substantially from that of E. coli. What is critical in this case is an adequate assessment of the threshold level and its variability. These two parameters can be calculated automatically, and the desired level of reliability can be set by the user in terms of StD.

A more detailed description of the algorithm can be found in [2, 6].

Search for promoters in bacterial DNA sequences

DESCRIPTION OF THE MODEL
