How to use PHYSCObase

Names of each clone, EST, and contig

pphXXXXX
clone from non-treated library or its 5' sequence.
MSXXX
clone from non-treated library or its 5' sequence.
rpphXXXXX
3' sequence of clone pphXXXXX (from non-treated library).
pphnXXXXX
clone from auxin-treated library or its 5' sequence.
rpphnXXXXX
3' sequence of clone pphnXXXXX (from auxin (NAA)-treated library).
pphbXXXXX
clone from cytokinin-treated library or its 5' sequence.
rpphbXXXXX
3' sequence of clone pphbXXXXX (from cytokinin (BA)-treated library).
pphfXXXXX
clone from first protoplast cell division library or its 5' sequence.
rpphfXXXXX
3' sequence of clone pphfXXXXX (from First protoplast cell division library).
ppspXXXXX
clone from young sporophyte library or its 5' sequence.
rppspXXXXX
3' sequence of clone ppspXXXXX (from young SPorophyte library).
contigXXXXXX
Contig. This number is subject to change.
PXXXXXX
Putative transcript represented by a single contig or a pair of contigs. This number will be conserved as far as possible.
gb:[accession no.]
Sequence from public DNA database (DDBJ/GenBank/EMBL).
XXXXXX
A string beginning with a digit.
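
The naming scheme above can be checked mechanically. The following is a minimal sketch, not part of PHYSCObase itself: the function name classify_name and the LIBRARIES table are hypothetical, and the pattern assumes that the XXXXX part of a clone name starts with a digit, as in the clone names that appear in the examples on this page (for instance pphn33o16).

```python
import re

# Library codes taken from the name list above; the table itself is illustrative.
LIBRARIES = {
    "pph": "non-treated",
    "pphn": "auxin (NAA)-treated",
    "pphb": "cytokinin (BA)-treated",
    "pphf": "first protoplast cell division",
    "ppsp": "young sporophyte",
}

# Longer prefixes come before "pph" so that "pphn..." is not read as "pph".
# Assumption: the identifier after the prefix starts with a digit.
CLONE_RE = re.compile(r"^(r?)(pphn|pphb|pphf|ppsp|pph)(\d\w*)$")


def classify_name(name):
    """Return a (kind, detail) tuple for a PHYSCObase-style name (hypothetical helper)."""
    m = CLONE_RE.match(name)
    if m:
        # A leading "r" marks the 3' sequence of the clone.
        kind = "3' sequence" if m.group(1) else "clone or 5' sequence"
        return (kind, LIBRARIES[m.group(2)])
    if name.startswith("gb:"):
        return ("public database entry", name[3:])
    if re.match(r"^contig\d+$", name, re.IGNORECASE):
        return ("contig", name[6:])
    if re.match(r"^P\d+$", name):
        return ("putative transcript", name[1:])
    if name.startswith("MS"):
        return ("clone or 5' sequence", "non-treated")
    return ("unknown", name)


print(classify_name("rpphb14p21"))
print(classify_name("Contig3017"))
```

Names that begin with a digit are returned as "unknown" here, because the list above does not say what they refer to.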

Blast datasets and programs

The most common way to use PHYSCObase is to search ESTs, genomic sequences, and full-length cDNA clones with blast programs. We usually provide three datasets for blast searches; a fourth is available temporarily during reassembly.

PHYSCObase contigs
This dataset contains all the assembled sequences. It has reduced redundancy but should still contain all the sequence information. Because longer matches can be detected with contig sequences, this dataset should give higher sensitivity. This is the default dataset.
PHYSCObase clones
This dataset contains sequence data from every clone and from GenBank entries. Use it when you are looking for a clone with some minute difference that may be hidden by the assembly process. Searching this dataset takes more time.
JGI_raw
This dataset contains sequence data from the whole-genome shotgun sequencing ongoing at the DOE Joint Genome Institute. The dataset is synchronized weekly with the data available at the FTP site of the NCBI Trace Database (ftp://ftp.ncbi.nih.gov/pub/TraceDB/physcomitrella_patens/).
PHYSCObase contigs new
This dataset is available temporarily during reassembly with new EST data. We make it when a large number of ESTs have become available but the usual assembly results are not yet ready.

Every dataset contains nucleotide sequences, and three variations of the blast program can be used: blastn, tblastn, and tblastx. If you are searching for protein-coding sequences, tblastn is recommended; in this case, enter the amino acid sequence as the query. Use blastn when you want to know whether some genomic or cDNA sequence has a corresponding EST. Blastn is also useful when you have sequenced 3' RACE products and want to know whether full-length clones are available.
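
The choice between blastn and tblastn depends only on whether the query is a nucleotide or an amino acid sequence. A rough sketch of that decision follows. The helper names and the 90% threshold are assumptions made for illustration, and the peptide used in the example is an arbitrary string, not a real protein.

```python
# Letters accepted as nucleotide codes in this sketch (N = unknown base).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def looks_like_nucleotide(seq, threshold=0.9):
    """Guess whether seq is a nucleotide sequence (heuristic, hypothetical helper)."""
    letters = [c for c in seq.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty query")
    fraction = sum(c in NUCLEOTIDE_LETTERS for c in letters) / len(letters)
    return fraction >= threshold


def suggest_program(seq):
    """Suggest blastn for nucleotide queries and tblastn for amino acid queries."""
    return "blastn" if looks_like_nucleotide(seq) else "tblastn"


print(suggest_program("ATGGCGTACGTTAGCNNACGT"))  # nucleotide query
print(suggest_program("MKVLSPADKTNWGHEF"))       # arbitrary peptide
```

The sketch never suggests tblastx, which also takes a nucleotide query but compares translations; choosing it over blastn is a judgment the user has to make.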

Interpretation of the blast results

An example tblastn result can be seen in cuc2-blast.html, where the CUC2 amino acid sequence was used as the query. The description list of the hits is as follows.

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

gnl|contig|Contig3017  Contig3017                                 162  1e-40
gnl|contig|pphn33o16  pphn33o16                                   128  2e-30
gnl|contig|pphb14p21  pphb14p21                                    57  5e-09
gnl|contig|Contig3604  Contig3604                                  46  1e-05
gnl|contig|pphb2h16  pphb2h16                                      30  1.1
gnl|contig|Contig3526  Contig3526                                  28  2.4
gnl|contig|Contig12137  Contig12137                                27  5.3
gnl|contig|Contig7880  Contig7880                                  27  6.9
gnl|contig|pphn35f03  pphn35f03                                    27  9.0
gnl|contig|Contig3272  Contig3272                                  27  9.0

You see two contigs and two 5' end sequences hit with E values less than 1e-3, and four contigs and two 5' end sequences hit with E values above 1. Similarities with E values above 1 usually arise just by chance. If strong similarity in a short region is observed, however, the match may be meaningful despite a large E value. Such information can be read from the alignments, which are shown following the list of significant matches. You can jump to the corresponding alignment by clicking the score, which is to the left of the E value and is usually displayed blue and underlined. In this case, I consider only the four sequences with E values less than 1e-3 to have significant similarity.
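
The same cutoff can be applied to the description list with a few lines of code. The sketch below assumes the plain-text layout shown above (identifier, name, bit score, E value separated by spaces). The function names are hypothetical, and 1e-3 is the cutoff used in this example, not a general rule.

```python
HIT_LIST = """\
gnl|contig|Contig3017  Contig3017                                 162  1e-40
gnl|contig|pphn33o16  pphn33o16                                   128  2e-30
gnl|contig|pphb14p21  pphb14p21                                    57  5e-09
gnl|contig|Contig3604  Contig3604                                  46  1e-05
gnl|contig|pphb2h16  pphb2h16                                      30  1.1
gnl|contig|Contig3526  Contig3526                                  28  2.4
gnl|contig|Contig12137  Contig12137                                27  5.3
gnl|contig|Contig7880  Contig7880                                  27  6.9
gnl|contig|pphn35f03  pphn35f03                                    27  9.0
gnl|contig|Contig3272  Contig3272                                  27  9.0
"""


def parse_hit_line(line):
    """Split one description line into (name, bit score, E value)."""
    fields = line.split()
    name, score, evalue = fields[1], float(fields[-2]), fields[-1]
    # Some blast versions print very small E values as "e-100";
    # float() needs a leading digit, so "1" is prepended.
    if evalue.startswith("e"):
        evalue = "1" + evalue
    return name, score, float(evalue)


def significant_hits(text, cutoff=1e-3):
    """Return the hits whose E value is below cutoff, in the original order."""
    hits = [parse_hit_line(l) for l in text.splitlines() if l.startswith("gnl|")]
    return [h for h in hits if h[2] < cutoff]


for name, score, evalue in significant_hits(HIT_LIST):
    print(name, score, evalue)
```

A filter like this only pre-sorts the list. Short but strong matches with a large E value still have to be judged from the alignment itself.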

Finding the information associated to the sequence

Once you have found contigs or clones, you will want to see which clones the contig contains and the sequences of those clones from the other end. To do this, enter the contig number or clone name in the search box on the top page, http://moss.nibb.ac.jp/.

This searches for matches in a table that contains the clone name, putative transcript ID, 5' contig number, and 3' contig number. If your query was a clone name, you will get just one line containing the clone name, the putative transcript ID, the identification number of the contig to which the 5' sequence of the clone belongs (preceded by contig1), and the identification number of the contig to which the 3' sequence of the clone belongs (preceded by contig2).

A putative transcript ID has the form Pnnnnnn, where n represents a digit, and identifies a contig or a pair of contigs. When the 5' and 3' sequences of a clone belong to different contigs, the two contigs are considered to represent the 5' and 3' end sequences of a single transcript. When both end sequences are contained in one contig, that contig likely constitutes a putative transcript by itself. Inconsistencies among clones necessitated a slightly more complex rule. When one clone ties contigA (5') to contigB (3') but another clone ties contigC (5') to contigB (3'), we treat them as different putative transcripts. When yet another clone has both sequences in contigB, we cannot tell which transcript it belongs to, so we assign a new putative transcript ID.
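
The rule for clones with both ends sequenced can be pictured as giving each distinct (5' contig, 3' contig) pair its own transcript. The sketch below illustrates only that reading of the rule and is not the program used to build PHYSCObase. The function name, the clone names c1 to c4, and the labels T1, T2, ... are hypothetical, and clones with only one end sequenced are not covered because the text above does not say how they are assigned.

```python
def assign_transcripts(clones):
    """Map each clone to a label shared by all clones with the same contig pair.

    clones: dict of clone name -> (5' contig, 3' contig).
    Labels are numbered in order of first appearance (illustrative only).
    """
    labels = {}
    result = {}
    for clone, pair in clones.items():
        if pair not in labels:
            labels[pair] = "T%d" % (len(labels) + 1)
        result[clone] = labels[pair]
    return result


example = {
    "c1": ("contigA", "contigB"),  # ties contigA (5') to contigB (3')
    "c2": ("contigC", "contigB"),  # same 3' contig, different 5' contig
    "c3": ("contigB", "contigB"),  # both ends in contigB
    "c4": ("contigA", "contigB"),  # same pair as c1
}
print(assign_transcripts(example))
```

In this example c1 and c4 share a label, while c2 and c3 each get their own, matching the three cases described above.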

For example, enter Contig3017 to search for clones that have either end sequence in Contig3017. This returns a list as follows. Only the first two lines are shown here, but you can see the complete page in another window.

pphb13d01	P007036	contig1 003017	
pphb14e10	P007036	contig1 003017	contig2 003018	
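
If you copy such lines into a text file, they can be split into fields as sketched below. This assumes the tab-separated layout shown above; the actual result page may be marked up differently. The function name parse_row and the dictionary keys are hypothetical.

```python
import re


def parse_row(row):
    """Parse one result line into a dict; a missing contig is stored as None."""
    fields = row.rstrip("\t\n").split("\t")
    record = {
        "clone": fields[0],
        "transcript": fields[1],
        "contig5": None,  # contig holding the 5' sequence ("contig1")
        "contig3": None,  # contig holding the 3' sequence ("contig2")
    }
    for field in fields[2:]:
        m = re.match(r"^(contig[12])\s+(\d+)$", field.strip())
        if m:
            key = "contig5" if m.group(1) == "contig1" else "contig3"
            record[key] = int(m.group(2))
    return record


rows = [
    "pphb13d01\tP007036\tcontig1 003017\t",
    "pphb14e10\tP007036\tcontig1 003017\tcontig2 003018\t",
]
for row in rows:
    print(parse_row(row))
```

In the first line no contig2 entry is listed, so the 3' contig is left as None.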

The putative transcript ID is linked to the putative transcript information page. A putative transcript page begins with links to the blastx results for the conceptual putative transcript, the 5' contig, and the 3' contig, followed by links to the 5' and 3' contig information pages. The blastx result tells you whether your query is among the strong hits in the nr dataset. In the case of P007036, the top hits are NAC proteins from Arabidopsis, which supports that P007036 represents a member of the NAC family. The result looks just like the original NCBI blast output, but the taxon from which each gene was isolated is shown to the right of the E value. The taxon name is colored according to its phylogenetic position. Since the blast results are precalculated and stored, viewing them is much faster than performing an actual blast search. On the other hand, if you want results against the latest database, retrieve the sequences of both contigs and perform the blast search elsewhere, for example at NCBI.

A list of the clones produced in our EST project that belong to the putative transcript follows. The list contains each clone name, a link to its sequences (Seq.), and the description of the best hit in a blastx search against the nr dataset.

Finally, the Alignment section gives a brief overview of the relative positions of the clones belonging to both contigs. Clones that have both end sequences in the contig or contigs defining the putative transcript are shown first, from the longest clone to the shortest. The relative lengths are estimated from the start point in the 5' contig and the end point in the 3' contig. Clones with only one end belonging to the contigs follow.

The clone names are linked to a search program, so that you can find the putative transcript to which a clone belongs when only one end of the clone is in one of the contigs. 5' sequences are colored blue and 3' sequences are colored red. Dead clones, which did not grow on replica plates, are black, and badly growing clones are gray. GenBank entries are shown in green and linked to the entry.



Tomoaki NISHIYAMA

Last modified: Wed Oct 8 10:27:34 JST 2003 $Id: usage.html,v 1.3 2004/03/22 02:08:14 tomoaki Exp $