15.1
Names of clones, ESTs, and contigs
Each clone, EST, contig, and putative transcript has an identifier
assigned to it. The identifiers have the following forms:
pphXXXXX
Clone from the non-treated library, or its 5' sequence.
MSXXX
Clone from the non-treated library, or its 5' sequence.
rpphXXXXX
3' sequence of clone pphXXXXX (from the non-treated library).
pphnXXXXX
Clone from the auxin (NAA)-treated library, or its 5' sequence.
rpphnXXXXX
3' sequence of clone pphnXXXXX (from the auxin (NAA)-treated library).
pphbXXXXX
Clone from the cytokinin (BA)-treated library, or its 5' sequence.
rpphbXXXXX
3' sequence of clone pphbXXXXX (from the cytokinin (BA)-treated library).
pphfXXXXX
Clone from the First protoplast cell division library, or its 5' sequence.
rpphfXXXXX
3' sequence of clone pphfXXXXX (from the First protoplast cell division library).
ppspXXXXX
Clone from the SPorophyte library, or its 5' sequence.
rppspXXXXX
3' sequence of clone ppspXXXXX (from the SPorophyte library).
ContigXXXXXX
Contig. This number is subject to change.
PXXXXXX
Putative transcript represented by a single contig or a pair of contigs. This
number will be conserved as far as possible.
gb:[accession no.]
Sequence from a public DNA database (DDBJ/GenBank/EMBL).
XXXXXX
A string beginning with a digit.
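The naming scheme above can be checked mechanically. The following is a minimal sketch, not part of PHYSCObase itself: the function name classify and the wording of the returned labels are illustrative assumptions, and the pattern assumes that the XXXXX part of a clone name begins with a digit.

```python
import re

# Library prefixes taken from the list above; the labels are paraphrases.
LIBRARIES = {
    "pph": "non-treated",
    "pphn": "auxin (NAA)-treated",
    "pphb": "cytokinin (BA)-treated",
    "pphf": "first protoplast cell division",
    "ppsp": "sporophyte",
}

# Optional leading "r" marks a 3' sequence. The remainder is assumed to
# start with a digit, so "pph" is never confused with "pphn" and so on.
CLONE_RE = re.compile(r"(r?)(pphn|pphb|pphf|ppsp|pph)(\d\w*)")


def classify(identifier):
    """Return a short description of an identifier, or None if unknown."""
    if identifier.startswith("gb:"):
        return "public database entry " + identifier[3:]
    if re.fullmatch(r"Contig\d+", identifier):
        return "contig"
    if re.fullmatch(r"P\d+", identifier):
        return "putative transcript"
    if re.fullmatch(r"MS\d+", identifier):
        return "non-treated library clone (5' sequence)"
    match = CLONE_RE.fullmatch(identifier)
    if match:
        end = "3'" if match.group(1) else "5'"
        return f"{LIBRARIES[match.group(2)]} library clone ({end} sequence)"
    return None


print(classify("pphn33o16"))
print(classify("rpphb14p21"))
print(classify("Contig3017"))
```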
15.2
Blast datasets and programs
The most common way to use PHYSCObase is to search ESTs, genomic sequences,
and full-length cDNA clones with the blast programs. We usually provide three
datasets for blast searches.
PHYSCObase contigs
This dataset contains all the assembled sequences. It has reduced redundancy
but should contain all the sequence information. Because longer matches can be
detected with contig sequences, this dataset should have higher sensitivity.
This is the default dataset.
PHYSCObase clones
This dataset contains sequence data from every clone and GenBank entry. It may
be used when you are looking for a clone with some minute difference that may
be hidden by the assembly process. More time is needed to search this dataset.
A third dataset contains sequence data from the whole-genome shotgun
sequencing project ongoing at the DOE Joint Genome Institute. It is
synchronized weekly with the data available at the FTP site of the NCBI Trace
Database (ftp://ftp.ncbi.nih.gov/pub/TraceDB/physcomitrella_patens/). Since
this dataset contains redundant data, you may find multiple sequences from one
locus. An interface for building local contigs will become available soon, so
that you can distinguish hits from different loci.
A further dataset is temporarily available during reassembly with new EST
data. We provide it when a large number of ESTs have become available but the
usual assembly results are not yet ready.
The datasets contain nucleotide sequences, so three variants of the blast
program can be used: blastn, tblastn, and tblastx. If you are searching for
protein-coding sequences, tblastn is recommended; in this case, you enter the
amino acid sequence as the query. You may use a blastn search when you want to
know whether a genomic or cDNA sequence has a corresponding EST. Blastn is
also useful when you have sequenced 3' RACE products and want to know whether
full-length clones are available.
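The choice among the three programs follows from the type of query, because the datasets themselves are always nucleotide. As a rough sketch (the function name and its arguments are illustrative assumptions, not a PHYSCObase interface): blastn compares a nucleotide query with nucleotide sequences, tblastn compares a protein query with translated nucleotide sequences, and tblastx translates both the nucleotide query and the database.

```python
def choose_blast_program(query_type, translate_query=False):
    """Pick a blast program for a nucleotide dataset.

    query_type      -- "protein" or "nucleotide"
    translate_query -- for a nucleotide query, compare at the protein
                       level (tblastx) instead of the nucleotide level
    """
    if query_type == "protein":
        # Protein query against a translated nucleotide dataset.
        return "tblastn"
    if query_type == "nucleotide":
        return "tblastx" if translate_query else "blastn"
    raise ValueError("query_type must be 'protein' or 'nucleotide'")


print(choose_blast_program("protein"))
print(choose_blast_program("nucleotide"))
```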
Interpretation of the blast results
After a blast search, you may find contigs, clones, and GenBank entries among
the hits. A contig has the form ContigNNNNNN, where N is a digit. An entry of
the form pphM[a-p]NN is the 5' sequence of a clone from the non-treated
library, where M is a number no greater than 50, [a-p] is one letter in the
range a through p, and NN is a number from 01 to 24. rpphM[a-p]NN is the 3'
sequence of a clone from the non-treated library. Likewise, pphnM[a-p]NN is the
5' sequence of a clone from the NAA-treated library, and pphbM[a-p]NN is the 5'
sequence of a clone from the BA-treated library. The 3' sequences of these
clones are labeled with a preceding r. MSNNN, where NNN is a number, is the 5'
sequence of a clone from the non-treated library. No 3' sequences are available
for these clones. MSNNN clones are not currently deposited at RIKEN BRC.
Sequences taken from GenBank are labeled gb:[accession no.].
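The clone entry names described here can be split into their parts. The sketch below is only an illustration under the stated form (library prefix, number M, one letter a through p, two-digit number 01 to 24); the CloneRead type and its field names are hypothetical, and the upper limit on M is deliberately not enforced.

```python
import re
from typing import NamedTuple, Optional


class CloneRead(NamedTuple):
    library: str  # "pph", "pphn", or "pphb"
    number: int   # the M part of the name
    letter: str   # one letter, a through p
    column: int   # the NN part, 1 to 24
    end: str      # "5'" or "3'"


# Optional "r" prefix (3' sequence), library, M, letter, two-digit NN.
ENTRY_RE = re.compile(r"(r?)(pphn|pphb|pph)(\d+)([a-p])(\d{2})")


def parse_clone_read(name) -> Optional[CloneRead]:
    """Split an entry name into its parts; return None if it does not fit."""
    match = ENTRY_RE.fullmatch(name)
    if match is None:
        return None
    prefix, library, number, letter, column = match.groups()
    if not 1 <= int(column) <= 24:
        return None
    end = "3'" if prefix else "5'"
    return CloneRead(library, int(number), letter, int(column), end)


print(parse_clone_read("pphn33o16"))
print(parse_clone_read("rpphb14p21"))
```

Names that do not follow this form, such as contig names or MSNNN entries, give None.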
An example tblastn result can be seen in cuc2-blast.html. The CUC2 amino acid
sequence was used as the query. The description list of the hits is as follows.
                                               Score       E
Sequences producing significant alignments:   (bits)   Value
gnl|contig|Contig3017 Contig3017                 162   1e-40
gnl|contig|pphn33o16 pphn33o16                   128   2e-30
gnl|contig|pphb14p21 pphb14p21                    57   5e-09
gnl|contig|Contig3604 Contig3604                  46   1e-05
gnl|contig|pphb2h16 pphb2h16                      30     1.1
gnl|contig|Contig3526 Contig3526                  28     2.4
gnl|contig|Contig12137 Contig12137                27     5.3
gnl|contig|Contig7880 Contig7880                  27     6.9
gnl|contig|pphn35f03 pphn35f03                    27     9.0
gnl|contig|Contig3272 Contig3272                  27     9.0
You see two contigs and two 5' end sequences hit with E values less than 1e-3,
and four contigs and two 5' end sequences hit with E values above 1.
Similarities with E values above 1 usually occur just by chance. If a strong
similarity in a short region is observed, however, the match may be meaningful
despite a large E value. Such information can be read from the alignments,
which are shown following the list of significant matches. You can jump to the
corresponding alignment by clicking the score, which is to the left of the E
value and is usually displayed blue and underlined. In this case, I consider
only the four sequences with E values less than 1e-3 to have significant
similarity.
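The same cutoff can be applied to a saved hit list. This is a minimal sketch that assumes the description lines have been copied as plain text with the name, bit score, and E value as the last three whitespace-separated fields; the function names and the default threshold argument are illustrative choices.

```python
# A few lines copied from the hit list above (name, bit score, E value).
HIT_LIST = """\
gnl|contig|Contig3017 Contig3017 162 1e-40
gnl|contig|pphn33o16 pphn33o16 128 2e-30
gnl|contig|pphb14p21 pphb14p21 57 5e-09
gnl|contig|Contig3604 Contig3604 46 1e-05
gnl|contig|pphb2h16 pphb2h16 30 1.1
gnl|contig|Contig3526 Contig3526 28 2.4
"""


def parse_hits(text):
    """Return (name, bit score, E value) tuples from description lines."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, score, evalue = fields[-3:]
        hits.append((name, float(score), float(evalue)))
    return hits


def significant(hits, threshold=1e-3):
    """Keep the names of hits whose E value is below the threshold."""
    return [name for name, _score, evalue in hits if evalue < threshold]


print(significant(parse_hits(HIT_LIST)))
```

A fixed cutoff does not replace reading the alignments: a short but strong match can still be meaningful at a larger E value.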
15.3
Finding the information associated with a sequence
Once you find contigs or clones, you will want to see which clones a contig
contains and the sequences of those clones from the other end. To do this,
enter the contig number or clone name in the search box on the top page
http://moss.nibb.ac.jp/.
This searches for matches in a table that contains the clone name, putative
transcript id, 5' contig number, and 3' contig number. If your query is a
clone name, you will get just one line containing the clone name, the putative
transcript id, the identification number of the contig to which the 5'
sequence of the clone belongs (preceded by contig1), and the identification
number of the contig to which the 3' sequence of the clone belongs (preceded
by contig2).
A putative transcript id has the form Pnnnnnn, where n represents a digit, and
identifies a pair of contigs or a single contig. When the 5' and 3' sequences
of a clone belong to different contigs, the two contigs are considered to
represent the 5' and 3' end sequences of a single transcript. When both end
sequences are contained in one contig, that contig by itself likely
constitutes a putative transcript. Inconsistencies among clones necessitated a
slightly more complex rule. When one clone ties contigA (5') to contigB (3')
but another clone ties contigC (5') to contigB (3'), we treat them as
different putative transcripts. When yet another clone has both sequences in
contigB, we cannot tell which transcript it belongs to, so we assign a new
putative transcript id.
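One way to read this rule is that each distinct (5' contig, 3' contig) pair receives its own putative transcript id. The sketch below illustrates that reading only; it is not the actual assignment procedure. The clone names, the sequential numbering, and the restriction to clones with both end sequences are assumptions made for the example (real ids are conserved between assemblies where possible).

```python
def assign_putative_transcripts(clones):
    """Map clone name -> putative transcript id.

    clones -- iterable of (clone_name, contig_5prime, contig_3prime)

    Every distinct (5' contig, 3' contig) pair gets its own id, so a
    clone with both ends in the same contig forms a pair of its own.
    """
    pair_ids = {}
    assignment = {}
    for name, contig5, contig3 in clones:
        pair = (contig5, contig3)
        if pair not in pair_ids:
            # Hypothetical numbering in order of first appearance.
            pair_ids[pair] = f"P{len(pair_ids) + 1:06d}"
        assignment[name] = pair_ids[pair]
    return assignment


# Hypothetical clones reproducing the three cases in the text.
example = [
    ("clone1", "contigA", "contigB"),
    ("clone2", "contigC", "contigB"),
    ("clone3", "contigB", "contigB"),
    ("clone4", "contigA", "contigB"),
]
for clone, transcript in assign_putative_transcripts(example).items():
    print(clone, transcript)
```

Here clone1 and clone4 share one id, while clone2 and clone3 each receive a separate id, matching the three cases described above.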
For example, enter Contig3017 to search for clones that have either end
sequence in Contig3017. This returns a list like the following. Only the first
two lines are shown here, but you can see the complete page in another window.
pphb13d01 P007036 contig1 003017
pphb14e10 P007036 contig1 003017 contig2 003018
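Lines in this form are easy to split into fields. The following sketch assumes the result has been saved as plain whitespace-separated text exactly as shown above (the actual page is displayed in a browser); the function name and dictionary keys are illustrative.

```python
def parse_lookup_row(line):
    """Split one result line into clone, transcript id, and contig numbers.

    A missing contig1 or contig2 entry is reported as None.
    """
    fields = line.split()
    row = {
        "clone": fields[0],
        "transcript": fields[1],
        "contig1": None,  # contig holding the 5' sequence
        "contig2": None,  # contig holding the 3' sequence
    }
    rest = fields[2:]
    # The remaining fields come in (label, number) pairs.
    for label, number in zip(rest[0::2], rest[1::2]):
        if label in ("contig1", "contig2"):
            row[label] = int(number)  # "003017" -> 3017
    return row


print(parse_lookup_row("pphb13d01 P007036 contig1 003017"))
print(parse_lookup_row("pphb14e10 P007036 contig1 003017 contig2 003018"))
```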
The putative transcript ID is linked to the putative transcript information
page. A putative transcript page begins with links to blastx results for the
conceptual putative transcript, the 5' contig, and the 3' contig, and with
links to the 5' and 3' contig information pages. The blastx result can tell
you whether your query is among the strong hits in the nr dataset. In the case
of P007036, the top hits are NAC proteins from Arabidopsis, which supports the
view that P007036 represents a member of the NAC family. The result looks just
like the original NCBI blast output, but the taxon from which each gene was
isolated is shown to the right of the E value. The taxon name is colored
according to its phylogenetic position. Since the blast results are
precalculated and stored, viewing them is much faster than performing an
actual blast search. On the other hand, if you want results against the latest
database, retrieve the sequences of both contigs and perform the blast search
elsewhere, for example at NCBI.
Then follows a list of the clones, produced in our EST project, that belong to
the putative transcript. The list contains each clone name, a link to its
sequences (Seq.), and a description of the best hit in a blastx search against
the nr dataset. Note: the complete blastx results for individual clone
sequences against the nr dataset are currently unavailable.
Finally, the Alignment section contains a brief overview of the relative
positions of the clones belonging to both contigs. In the Alignment section,
clones that have both end sequences in the contigs defining the putative
transcript are shown first, from the longest clone to the shortest. The
relative lengths are estimated from the start point in the 5' contig and the
end point in the 3' contig. Clones with only one end belonging to the contigs
follow.
The clone names are linked to a search program, so that you can find the
putative transcript to which a clone belongs when only one end of the clone is
in one of the contigs. 5' sequences are colored blue and 3' sequences are
colored red. Dead clones, which did not grow on replica plates, are black, and
poorly growing clones are gray. GenBank entries are shown in green and linked
to the entry.
------------------------------------------------------------------------