13. PHYSCObase

Tomoaki Nishiyama

15.1 Names of each clone, EST, and contig

Each clone, EST, contig, and putative transcript has an identifier assigned to it. The identifiers have the following forms:

pphXXXXX

clone from non-treated library or its 5' sequence.

MSXXX

clone from non-treated library or its 5' sequence.

rpphXXXXX

3' sequence of clone pphXXXXX (from non-treated library).

pphnXXXXX

clone from auxin-treated library or its 5' sequence.

rpphnXXXXX

3' sequence of clone pphnXXXXX (from auxin (NAA)-treated library).

pphbXXXXX

clone from cytokinin-treated library or its 5' sequence.

rpphbXXXXX

3' sequence of clone pphbXXXXX (from cytokinin (BA)-treated library).

pphfXXXXX

clone from first protoplast cell division library or its 5' sequence.

rpphfXXXXX

3' sequence of clone pphfXXXXX (from First protoplast cell division library).

ppspXXXXX

clone from sporophyte library or its 5' sequence.

rppspXXXXX

3' sequence of clone ppspXXXXX (from SPorophyte library).

ContigXXXXXX

Contig. This number is subject to change.

PXXXXXX

Putative transcript represented by a single contig or a pair of contigs. This number will be conserved as far as possible.

gb:[accession no.]

Sequence from public DNA database (DDBJ/GenBank/EMBL).

XXXXXX

A string beginning with a digit.
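The naming scheme above can be sketched as a small classifier. This is only an illustration of the prefix rules listed in this section, not code used by PHYSCObase; the function name `classify` and the wording of the returned descriptions are assumptions.

```python
import re

# Prefix -> library, as listed above. Longer prefixes come first so that
# "pphn..." is not mistaken for a plain "pph..." clone.
LIBRARIES = [
    ("pphn", "auxin (NAA)-treated"),
    ("pphb", "cytokinin (BA)-treated"),
    ("pphf", "first protoplast cell division"),
    ("ppsp", "sporophyte"),
    ("pph", "non-treated"),
]

def classify(identifier):
    """Return a short description of an identifier (hypothetical helper)."""
    if identifier.startswith("gb:"):
        return "public database entry " + identifier[3:]
    if re.fullmatch(r"Contig\d+", identifier):
        return "contig"
    if re.fullmatch(r"P\d+", identifier):
        return "putative transcript"
    if re.fullmatch(r"MS\d+", identifier):
        return "non-treated library clone, 5' sequence"
    # A leading "r" marks the 3' sequence of the clone named by the rest.
    is_3prime = identifier.startswith("r")
    name = identifier[1:] if is_3prime else identifier
    end = "3'" if is_3prime else "5'"
    for prefix, library in LIBRARIES:
        if name.startswith(prefix):
            return f"{library} library clone, {end} sequence"
    return "unknown"

print(classify("rpphn33o16"))
print(classify("P007036"))
```

The order of the prefix table matters because every library prefix begins with "pph" or "pps"; checking the longest prefixes first keeps the lookup unambiguous.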

15.2 Blast datasets and programs

The most common way to use PHYSCObase is to search ESTs, genomic sequences, and full-length cDNA clones with the blast programs. We usually provide three datasets for blast searches; a fourth, PHYSCObase contigs_new, is offered only temporarily.

PHYSCObase contigs

This dataset contains all the assembled sequences. It has reduced redundancy but should contain all the sequence information. Because longer matches can be detected with contig sequences, this dataset should give higher sensitivity. This is the default dataset.

PHYSCObase clones

This dataset contains sequence data from every clone and GenBank entry. Use it when you are looking for a clone with some minute difference that may be hidden by the assembly process. Searching this dataset takes more time.

JGI_raw

This dataset contains sequence data from the whole-genome shotgun sequencing ongoing at the DOE Joint Genome Institute. The dataset is synchronized weekly with the data available at the FTP site of the NCBI Trace Database (ftp://ftp.ncbi.nih.gov/pub/TraceDB/physcomitrella_patens/). Since this dataset contains redundant data, you may find multiple sequences from one locus. An interface to make local contigs will become available soon, so that you can distinguish hits from different loci.

PHYSCObase contigs_new

This dataset is temporarily available during reassembly with new EST data. We provide it when a large number of ESTs have become available but the usual assembly results are not yet ready.

All datasets contain nucleotide sequences, and three variations of the blast program can be used: blastn, tblastn, and tblastx. If you are searching for protein-coding sequences, tblastn is recommended; in this case, enter the amino acid sequence as the query. Use blastn when you want to know whether some genomic or cDNA sequence has a corresponding EST. Blastn is also useful when you have sequenced 3' RACE products and want to know whether full-length clones are available.

Interpretation of the blast results

After a blast search, you may find contigs, clones, and GenBank entries among the hits. A contig has the form ContigNNNNNN, where N is a digit. An entry of the form pphM[a-p]NN is the 5' sequence of a clone from the non-treated library, where M is a number no greater than 50, [a-p] is one letter in the range a through p, and NN is a number from 01 to 24. rpphM[a-p]NN is the 3' sequence of a clone from the non-treated library. Likewise, pphnM[a-o]NN is the 5' sequence of a clone from the NAA-treated library, and pphbM[a-p]NN is the 5' sequence of a clone from the BA-treated library. The 3' sequences of these clones are labeled with a preceding r. MSNNN, where NNN is a number, is the 5' sequence of a clone from the non-treated library. No 3' sequences are available for these clones, and MSNNN clones are not currently deposited at RIKEN BRC. Sequences taken from GenBank are labeled gb:[accession no.].
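The structure of these names can be checked mechanically. The following is a minimal sketch, assuming that the same limits on M and NN apply to all three libraries and that the row letter may run from a through p in each of them; the text states the limits explicitly only for the non-treated library. The names `CLONE_RE` and `parse_clone` are illustrative.

```python
import re

# Optional "r" (3' read), library prefix, M, row letter, two-digit NN.
CLONE_RE = re.compile(r"^(r?)(pph[nb]?)(\d{1,2})([a-p])(\d{2})$")

def parse_clone(name):
    """Split a read name into its parts, or return None if it does not fit."""
    match = CLONE_RE.match(name)
    if match is None:
        return None
    r_flag, prefix, m_text, row, nn_text = match.groups()
    m_num, nn = int(m_text), int(nn_text)
    # Ranges stated in the text: M no greater than 50, NN from 01 to 24.
    if m_num > 50 or not 1 <= nn <= 24:
        return None
    return {
        "end": "3'" if r_flag else "5'",
        "prefix": prefix,
        "M": m_num,
        "row": row,
        "NN": nn,
    }

print(parse_clone("pphn33o16"))
print(parse_clone("Contig3017"))
```

Names that do not follow the pattern, such as contigs or MSNNN clones, simply return None, so the function can be used as a filter on a mixed hit list.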

An example tblastn result can be seen in cuc2-blast.html, for which the CUC2 amino acid sequence was used as the query. The description list of the hits is as follows.

                                                            Score     E
Sequences producing significant alignments:                 (bits)  Value

gnl|contig|Contig3017  Contig3017                              162   1e-40
gnl|contig|pphn33o16  pphn33o16                                128   2e-30
gnl|contig|pphb14p21  pphb14p21                                 57   5e-09
gnl|contig|Contig3604  Contig3604                               46   1e-05
gnl|contig|pphb2h16  pphb2h16                                   30   1.1
gnl|contig|Contig3526  Contig3526                               28   2.4
gnl|contig|Contig12137  Contig12137                             27   5.3
gnl|contig|Contig7880  Contig7880                               27   6.9
gnl|contig|pphn35f03  pphn35f03                                 27   9.0
gnl|contig|Contig3272  Contig3272                               27   9.0

The list shows two contigs and two 5' end sequences with E values below 1e-3, and four contigs and two 5' end sequences with E values above 1. Similarity with an E value above 1 usually arises just by chance. However, if strong similarity is observed in a short region, the match may be meaningful despite a large E value; this can be judged from the alignments, which are shown after the list of significant matches. You can jump to the corresponding alignment by clicking the score, which appears to the left of the E value and is usually displayed blue and underlined. In this case, I consider only the four sequences with E values below 1e-3 to have significant similarity.
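The same cutoff can be applied programmatically to a saved hit list. The sketch below assumes that each hit line has four whitespace-separated fields as in the example above; the function names and the default threshold of 1e-3 are illustrative choices that follow the judgment made in the text.

```python
REPORT = """\
gnl|contig|Contig3017   Contig3017   162  1e-40
gnl|contig|pphn33o16    pphn33o16    128  2e-30
gnl|contig|pphb14p21    pphb14p21     57  5e-09
gnl|contig|Contig3604   Contig3604    46  1e-05
gnl|contig|pphb2h16     pphb2h16      30  1.1
gnl|contig|Contig3526   Contig3526    28  2.4
gnl|contig|Contig12137  Contig12137   27  5.3
gnl|contig|Contig7880   Contig7880    27  6.9
gnl|contig|pphn35f03    pphn35f03     27  9.0
gnl|contig|Contig3272   Contig3272    27  9.0
"""

def parse_evalue(text):
    # Some blast output abbreviates very small values as "e-100"; reading
    # that as "1e-100" is an assumption of this sketch.
    return float("1" + text if text.startswith("e") else text)

def significant_hits(report, max_evalue=1e-3):
    """Return (name, score, E value) for hits with E value below max_evalue."""
    hits = []
    for line in report.splitlines():
        fields = line.split()
        if len(fields) != 4 or not fields[0].startswith("gnl|"):
            continue  # skip header and blank lines
        name = fields[1]
        score = float(fields[2])
        evalue = parse_evalue(fields[3])
        if evalue < max_evalue:
            hits.append((name, score, evalue))
    return hits

for name, score, evalue in significant_hits(REPORT):
    print(name, score, evalue)
```

A fixed threshold is only a first pass: as noted above, a short region of strong similarity can be meaningful despite a large E value, so the alignments themselves should still be inspected.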

15.3 Finding the information associated with the sequence

Once you find contigs or clones, you will want to see which clones a contig contains and the sequences of those clones from the other end. To do this, enter the contig number or clone name in the search box on the top page, http://moss.nibb.ac.jp/.

This searches for matches in a table that contains the clone name, putative transcript id, 5' contig number, and 3' contig number. If your query is a clone name, you will get just one line containing the clone name, the putative transcript id, the identification number of the contig to which the 5' sequence of the clone belongs (preceded by contig1), and the identification number of the contig to which the 3' sequence of the clone belongs (preceded by contig2).

A putative transcript id has the form Pnnnnnn, where n represents a digit, and identifies either a pair of contigs or a single contig. When the 5' and 3' sequences of a clone belong to different contigs, the two contigs are considered to represent the 5' and 3' ends of a single transcript. When both end sequences are contained in one contig, that contig likely constitutes a putative transcript by itself. Inconsistency among clones necessitated a slightly more complex rule. When one clone ties contigA (5') to contigB (3') but another clone ties contigC (5') to contigB (3'), we treat them as different putative transcripts. When yet another clone has both sequences in contigB, we cannot tell which transcript it belongs to, so we assign it a new putative transcript id.
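The rule amounts to treating each distinct (5' contig, 3' contig) pair as its own putative transcript. A minimal sketch follows, assuming every clone has both ends assigned to a contig; clones with only one sequenced end are not handled. The clone names, the function name, and the sequential numbering are hypothetical, and the real ids are managed so that they are conserved between assemblies.

```python
def assign_transcripts(clones):
    """Map each clone to a transcript id keyed by its contig pair.

    clones: iterable of (clone_name, contig_5prime, contig_3prime).
    """
    ids = {}         # (5' contig, 3' contig) -> transcript id
    assignment = {}  # clone name -> transcript id
    for name, contig5, contig3 in clones:
        pair = (contig5, contig3)
        if pair not in ids:
            # Sequential numbering is illustrative only.
            ids[pair] = "P%06d" % (len(ids) + 1)
        assignment[name] = ids[pair]
    return assignment

example = [
    ("clone1", "contigA", "contigB"),  # ties contigA (5') to contigB (3')
    ("clone2", "contigC", "contigB"),  # same 3' contig, different 5' contig
    ("clone3", "contigB", "contigB"),  # both ends in contigB
    ("clone4", "contigA", "contigB"),  # same pair as clone1
]
for clone, transcript in assign_transcripts(example).items():
    print(clone, transcript)
```

Keying on the ordered pair means that clone3, with both ends in contigB, receives its own id instead of being merged arbitrarily into one of the two transcripts that share contigB.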

For example, enter Contig3017 to search for clones that have either end sequence in Contig3017. This returns a list like the following. Only the first two lines are shown here, but you can see the complete page in another window.

pphb13d01    P007036    contig1 003017
pphb14e10    P007036    contig1 003017 contig2 003018
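If such a list is saved as text, each line can be turned into a record. This sketch assumes the whitespace-separated layout shown above and assumes that a zero-padded number such as 003017 refers to the contig named Contig3017, as the example search suggests; `parse_row` is a hypothetical helper.

```python
def parse_row(line):
    """Parse one result line into a dict; missing contigs are None."""
    fields = line.split()
    record = {
        "clone": fields[0],
        "transcript": fields[1],
        "contig1": None,  # contig holding the 5' sequence
        "contig2": None,  # contig holding the 3' sequence
    }
    rest = fields[2:]
    # The remaining fields come in (label, number) pairs.
    for label, number in zip(rest[0::2], rest[1::2]):
        if label in ("contig1", "contig2"):
            record[label] = "Contig" + str(int(number))
    return record

print(parse_row("pphb13d01 P007036 contig1 003017"))
print(parse_row("pphb14e10 P007036 contig1 003017 contig2 003018"))
```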

The putative transcript ID is linked to the putative transcript information page. This page begins with links to blastx results for the conceptual putative transcript, the 5' contig, and the 3' contig, followed by links to the 5' and 3' contig information pages. The blastx result can tell you whether your query is among the strong hits in the nr dataset. In the case of P007036, the top hits are NAC proteins from Arabidopsis, which supports the view that P007036 represents a member of the NAC family. The result looks just like the original NCBI blast output, except that the taxon from which each gene was isolated is shown to the right of the E value. The taxon name is colored according to its phylogenetic position. Since the blast results are precalculated and stored, viewing them is much faster than performing an actual blast search. On the other hand, if you want results against the latest database, retrieve the sequences of both contigs and perform the blast search elsewhere, for example at NCBI.

Next comes a list of the clones, produced in our EST project, that belong to the putative transcript. The list contains the clone name, a link to its sequence (Seq.), and a description of the best hit in a blastx search against the nr dataset. Note: the complete blastx results for individual clone sequences against the nr dataset are currently unavailable.

Finally, the Alignment section gives a brief overview of the relative positions of the clones belonging to both contigs. Clones that have both end sequences in the contigs defining the putative transcript are shown first, from the longest clone to the shortest. The relative lengths are estimated from the start point in the 5' contig and the end point in the 3' contig. Clones with only one end belonging to the contigs follow.

The clone names are linked to a search program, so that you can find the putative transcript to which a clone belongs when only one end of the clone is in one of the contigs. 5' sequences are colored blue and 3' sequences are colored red. Dead clones, which did not grow on replica plates, are black, and poorly growing clones are gray. GenBank entries are shown in green and linked to the entry.

------------------------------------------------------------------------