The draft genome sequence of Physcomitrella patens (Rensing et al. 2008) serves as a foundation
for molecular biological analysis in P. patens. The primary sequence is served at the JGI
genome portal and is also available from the DDBJ/EMBL/GenBank international
sequence databases.
The gene annotation, especially the exon-intron boundaries, is not very accurately
predicted. We have a large number of ESTs from full-length cDNA clones from various
stages of the life cycle. This information is available through PHYSCObase.
Public genome information sites
PHYSCObase http://moss.nibb.ac.jp
cosmoss http://www.cosmoss.org
JGI http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html
NCBI http://www.ncbi.nlm.nih.gov/
You need a query sequence, preferably an amino acid sequence.
Connect with a browser to http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html
Select the BLAST tab.
Change the program to tblastn.
Select Physcomitrella_patens.1_1 as the target dataset.
Paste your amino acid sequence.
Press the “Submit job” button.
Wait a few minutes until the search is complete.
Press the “Refresh Display” button.
You now get a graphical hit map.
Click one of the arrows if any hit was found.
Click View on browser.
The hit will be displayed in an upper track.
Zoom appropriately if you find any gene model or EST close to the hit region.
You can see gene models and ESTs mapped to the genome, if there are any.
By default, the ESTs are shown condensed as green bars, which can be expanded by clicking on the green bar.
Note that the gene model may not be correct
and that not all currently available ESTs are mapped.
You can retrieve the genomic sequence of
the region shown in the browser by clicking DNA in the upper right
box. The DNA sequence can be used to
search PHYSCObase. Since an EST sequence may not reach the conserved domain
of the full-length cDNA, you may find a good cDNA clone this way that you
would miss if you searched the EST dataset in PHYSCObase directly.
The most common way to
use PHYSCObase is to search ESTs and full-length cDNA clones with the
BLAST program. A full-length cDNA clone
is useful for making ectopic expression or overexpression constructs
and for producing recombinant proteins in bacterial or other expression
systems. The EST data are also useful
for understanding the gene structure, such as transcriptional start sites,
intron-exon junctions, and polyadenylation sites. For genes with sufficient expression levels, you may also find
alternative transcripts.
You can search either contigs or individual clones with a BLASTN, TBLASTN, or TBLASTX search.
There are two datasets for EST data:
The PHYSCObase contigs dataset contains all the assembled sequences. This dataset has
reduced redundancy but should contain all the sequence information. Since a longer
match can be detected with a contig sequence, this dataset should have higher
sensitivity. This is the default dataset.
The PHYSCObase clones dataset contains sequence data from every clone and GenBank
entry. This dataset may be used when you are looking for a clone with some
minute difference that may be hidden in the assembly process. More time is
needed to search this dataset.
Both datasets contain nucleotide sequences,
and three variations of the BLAST program can be used; namely, BLASTN, TBLASTN,
and TBLASTX. If you are searching for protein-coding sequences, TBLASTN is
recommended. In this case, you enter an amino acid sequence as the query. You may
use a BLASTN search when you want to know whether some genomic or cDNA sequence has
corresponding ESTs. BLASTN is also useful when you have sequenced 3'-RACE
products and want to know whether full-length clones are available.
13.2.1 Interpretation of the BLAST results
After the BLAST search, you may find some
contigs, clones, and GenBank entries. A contig has the form ContigNNNNNN, where N
is a digit. Sequences taken from GenBank are labeled gb:[accession no.]. Others
are individual EST sequences: a 3'-end
sequence has the prefix r, and the others are 5'-end sequences. The distinctions among the libraries are
explained later in 13.2.3 and 13.2.4.
An example TBLASTN result can be seen in cuc2-BLAST.html. The CUC2
amino acid sequence was used as the query. The description list of the
hits is as follows.
                                               Score     E
Sequences producing significant alignments:    (bits)  Value
gnl|contig|Contig3017 Contig3017                 162   1e-40
gnl|contig|pphn33o16 pphn33o16                   128   2e-30
gnl|contig|pphb14p21 pphb14p21                    57   5e-09
gnl|contig|Contig3604 Contig3604                  46   1e-05
gnl|contig|pphb2h16 pphb2h16                      30   1.1
gnl|contig|Contig3526 Contig3526                  28   2.4
gnl|contig|Contig12137 Contig12137                27   5.3
gnl|contig|Contig7880 Contig7880                  27   6.9
gnl|contig|pphn35f03 pphn35f03                    27   9.0
gnl|contig|Contig3272 Contig3272                  27   9.0
You see two contigs and two 5'-end
sequences hit with E values less than 1e-3, and four contigs and two 5'-end
sequences hit with E values above 1. Similarities with E values above 1 usually
arise just by chance. If a strong similarity in a short region is observed,
however, the match may be meaningful despite the large E value. Such information can be
read from the alignments, which are shown after the list of
significant matches. You can jump to the corresponding alignment by clicking
the score, which is to the left of the E value and usually displayed in blue and
underlined. In this case, I consider only the four sequences with E values less
than 1e-3 to have significant similarity.
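The same cutoff can be applied to a saved hit list with a few lines of Python. This is only a minimal sketch: the function names parse_hits and significant are illustrative, the 1e-3 threshold is the one used in the text above, and the sketch assumes each hit line has exactly four whitespace-separated fields (identifier, name, bit score, E value).

```python
# Hit list copied from the example TBLASTN result (bit score, then E value).
HITS = """\
gnl|contig|Contig3017 Contig3017 162 1e-40
gnl|contig|pphn33o16 pphn33o16 128 2e-30
gnl|contig|pphb14p21 pphb14p21 57 5e-09
gnl|contig|Contig3604 Contig3604 46 1e-05
gnl|contig|pphb2h16 pphb2h16 30 1.1
gnl|contig|Contig3526 Contig3526 28 2.4
gnl|contig|Contig12137 Contig12137 27 5.3
gnl|contig|Contig7880 Contig7880 27 6.9
gnl|contig|pphn35f03 pphn35f03 27 9.0
gnl|contig|Contig3272 Contig3272 27 9.0
"""


def parse_hits(text):
    """Return (name, bit_score, e_value) tuples from a BLAST description list."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip header lines and blank lines
        _, name, score, evalue = fields
        try:
            hits.append((name, float(score), float(evalue)))
        except ValueError:
            continue  # not a hit line after all
    return hits


def significant(hits, max_evalue=1e-3):
    """Keep only hits whose E value is below the cutoff."""
    return [hit for hit in hits if hit[2] < max_evalue]


sig = significant(parse_hits(HITS))
print([name for name, _, _ in sig])
```

A strict cutoff like this discards short but strong matches, so it is still worth reading the alignments of borderline hits by eye.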
13.2.2 Finding the information associated with the sequence
Once you find contigs or clones, you will
want to see which clones the contig contains and the sequences of those clones from the
other end. To do this, enter the contig number or clone name in the search
box on the top page, http://moss.nibb.ac.jp/.
This will search for matches in a table
that contains the clone name, putative transcript id, 5' contig number, and 3'
contig number. If your query was a clone name, you will get just one line
containing the clone name, the putative transcript id, the identification number of
the contig to which the 5' sequence of the clone belongs (preceded by contig1), and
the identification number of the contig to which the 3' sequence of the clone
belongs (preceded by contig2).
A putative transcript id has the form
Pnnnnnn, where n represents a digit, and identifies a pair of contigs or a single
contig. When the 5' and 3' sequences of a clone belong to different contigs, the
two contigs are considered to represent the 5' and 3' end sequences of a single
transcript. When both end sequences are contained in a single contig, this
contig itself likely constitutes a putatively complete transcript.
Inconsistency among clones necessitated the use of a somewhat more complex rule.
When one clone ties contigA (5') and contigB (3') but another clone ties
contigC (5') and contigB (3'), we treat them as different putative
transcripts. When yet another clone has both sequences in contigB, we cannot
specify to which transcript it belongs and so assign a new putative transcript
id.
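The rule can be pictured as giving every distinct (5' contig, 3' contig) pair its own label. The Python sketch below illustrates only that idea and is not the procedure actually used to build PHYSCObase: the function name, the clone names, and the labels T1, T2, ... are hypothetical and are not real Pnnnnnn ids.

```python
def assign_putative_transcripts(clones):
    """Label clones by their (5' contig, 3' contig) pair.

    clones maps a clone name to a (contig_5prime, contig_3prime) tuple.
    Each distinct pair receives its own illustrative label.
    """
    labels = {}   # (5' contig, 3' contig) -> label
    result = {}   # clone name -> label
    for clone, pair in clones.items():
        if pair not in labels:
            labels[pair] = "T%d" % (len(labels) + 1)
        result[clone] = labels[pair]
    return result


# Hypothetical clones reproducing the situation described in the text.
example = {
    "clone1": ("contigA", "contigB"),
    "clone2": ("contigC", "contigB"),
    "clone3": ("contigB", "contigB"),  # both ends in contigB
    "clone4": ("contigA", "contigB"),  # same pair as clone1
}
print(assign_putative_transcripts(example))
```

Here clone1 and clone4 share a label, while clone2 and clone3 each receive a separate one, matching the three cases described above.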
For example, enter Contig3017 to search
for clones that have either end sequence in Contig3017. This will return a
list like the following. Only the first two lines are shown here, but you can see the
complete page in another window.
pphb13d01 P007036 contig1 003017
pphb14e10 P007036 contig1 003017 contig2 003018
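If you save such a list as text, each line can be split into its fields as in the following minimal Python sketch. The function name parse_clone_line and the dictionary keys are illustrative and not part of PHYSCObase; the sketch assumes whitespace-separated fields in the layout shown above, with contig1 marking the 5' contig and contig2 the 3' contig.

```python
def parse_clone_line(line):
    """Split one result line into clone, transcript id, and contig numbers."""
    fields = line.split()
    record = {
        "clone": fields[0],
        "transcript": fields[1],
        "contig1": None,  # contig holding the 5' sequence, if any
        "contig2": None,  # contig holding the 3' sequence, if any
    }
    rest = fields[2:]
    # The remaining fields come in (label, number) pairs.
    for label, number in zip(rest[0::2], rest[1::2]):
        if label in ("contig1", "contig2"):
            record[label] = int(number)
    return record


for line in ("pphb13d01 P007036 contig1 003017",
             "pphb14e10 P007036 contig1 003017 contig2 003018"):
    print(parse_clone_line(line))
```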
The putative transcript ID is linked to the
putative transcript information page. A putative
transcript page begins with links to BLASTX results for the
conceptual putative transcript, the 5' contig, and the 3' contig, followed by links to the 5' and 3'
contig information pages. The BLASTX result can tell you whether your query is among
the strong hits in the nr dataset. In the case of
P007036, the top hits are NAC proteins from Arabidopsis, which supports the view
that P007036 represents a member of the NAC family. The result looks just like the
original NCBI BLAST output, but the taxon from which the gene was isolated is shown
to the right of the E value. The taxon name is colored according to its
phylogenetic position. Since the BLAST results are precalculated and stored, it
is much faster to view the result than to perform an actual BLAST search. On the
other hand, if you want the result against the latest database, retrieve the
sequences of both contigs and perform the BLAST search elsewhere; for example,
at NCBI.
Then a list of clones, produced in our EST
project, belonging to the putative transcript follows. The list of clones
contains the clone name, link to their sequences (Seq.), and description of the
best hit sequence in a BLASTX search against the nr dataset. Note: The complete
BLASTX results by individual clone sequences against the nr dataset are
currently unavailable.
Finally, the Alignment section contains a
brief overview of the relative positions of clones belonging to both contigs. In
the Alignment section, clones that have both end sequences in the contigs
defining the putative transcript are shown first, from the longest clone to the
shortest clone. The relative lengths are estimated from the start point and end
point in the 5' and 3' contigs, respectively. Clones with only one end
belonging to the contigs are shown next.
The clone names are linked to a search
program, so that you can find the putative transcript to which the clone
belongs when only one end of the clone was in one of the contigs. 5' sequences
are colored blue and 3' sequences are colored red. Dead clones, which did not
grow on replica plates, are black, and poorly growing clones are gray. GenBank
entries are shown in green and are linked to the entry.
13.2.3 Names of each clone, EST, and contig
Each clone, EST, contig, and
putative transcript has an identifier assigned to it.
Individual clones are usually named as pp +
library code (one or two characters) + plate number + row (a-p) + column number
(01-24).
The 5'-end sequence carries the name of the clone itself, and the 3'-end sequence has the prefix r before the clone name.
The identifiers have the following forms:
ContigXXXXXX        Contig. This number is subject to change.
PXXXXXX             Putative transcript represented by a single contig or a pair of contigs. This number will be conserved as far as possible.
gb:[accession no.]  Sequence from public DNA database (DDBJ/GenBank/EMBL).
MSXXX               clone from non-treated library or its 5' sequence.
pphXXXXX            clone from non-treated library or its 5' sequence.
rpphXXXXX           3' sequence of clone pphXXXXX.
pphnXXXXX           clone from auxin(NAA)-treated library or its 5' sequence.
rpphnXXXXX          3' sequence of clone pphnXXXXX.
pphbXXXXX           clone from cytokinin-treated library or its 5' sequence.
rpphbXXXXX          3' sequence of clone pphbXXXXX.
pphfXXXXX           clone from first protoplast division library (protoplasts during the first cell division) or its 5' sequence.
rpphfXXXXX          3' sequence of clone pphfXXXXX.
ppspXXXXX           clone from sporophyte library (sporophytes before meiosis with surrounding archegonia) or its 5' sequence.
rppspXXXXX          3' sequence of clone ppspXXXXX.
pplsXXXXX           clone from leafy shoot library (upper halves of gametophores) or its 5' sequence.
rpplsXXXXX          3' sequence of clone pplsXXXXX.
ppaaXXXXX           clone from antheridia and archegonia library (gametophore tips including the antheridia and archegonia) or its 5' sequence.
rppaaXXXXX          3' sequence of clone ppaaXXXXX.
ppgsXXXXX           clone from green sporophyte library (sporophytes during meiosis) or its 5' sequence.
rppgsXXXXX          3' sequence of clone ppgsXXXXX.
XXXXXX              A string beginning with a digit.
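When you handle many BLAST hits, the naming scheme above can be decoded automatically. The following Python sketch is one possible reading of the list; the function name classify, the regular expressions, and the returned labels are illustrative assumptions rather than anything defined by PHYSCObase, and identifiers that do not fit the listed forms are reported as unknown.

```python
import re

# Library prefixes taken from the identifier list above.
LIBRARIES = {
    "pph": "non-treated",
    "pphn": "auxin (NAA)-treated",
    "pphb": "cytokinin-treated",
    "pphf": "first protoplast division",
    "ppsp": "sporophyte",
    "ppls": "leafy shoot",
    "ppaa": "antheridia and archegonia",
    "ppgs": "green sporophyte",
}

# Optional r prefix, library prefix, then a string beginning with a digit.
CLONE_RE = re.compile(r"^(r?)(pph[nbf]?|ppsp|ppls|ppaa|ppgs)(\d\w*)$")


def classify(identifier):
    """Return (kind, detail) for a PHYSCObase-style identifier."""
    if identifier.startswith("gb:"):
        return ("public database entry", identifier[3:])
    if re.fullmatch(r"Contig\d+", identifier):
        return ("contig", identifier[len("Contig"):])
    if re.fullmatch(r"P\d+", identifier):
        return ("putative transcript", identifier[1:])
    if re.fullmatch(r"MS\d\w*", identifier):
        return ("clone or 5' sequence", "non-treated")
    match = CLONE_RE.match(identifier)
    if match:
        kind = "3' sequence" if match.group(1) else "clone or 5' sequence"
        return (kind, LIBRARIES[match.group(2)])
    return ("unknown", identifier)


for name in ("Contig3017", "P007036", "pphn33o16", "rpphb14p21"):
    print(name, classify(name))
```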
13.2.4 EST data
EST analyses of the following eight full-length cDNA libraries were
performed and all ESTs are deposited in public DNA databases.
(1) MS + pph: Untreated protonemata library (9,944 5' ESTs; 9,352 3' ESTs)
(2) pphn: Auxin-treated library (16,733 5' ESTs; 16,763 3' ESTs)
(3) pphb: Cytokinin-treated library (16,450 5' ESTs; 15,000 3' ESTs)
(4) pphf: Library for protoplasts during the first cell division (10,535 5' ESTs; 10,975 3' ESTs)
(5) ppsp: Library for sporophytes before meiosis with surrounding archegonia (8,514 5' ESTs; 8,241 3' ESTs)
(6) ppls: Upper halves of gametophores (52,838 5' ESTs; 4,500 3' ESTs)
(7) ppaa: Gametophore tips including the antheridia and archegonia (14,443 5' ESTs; 14,925 3' ESTs)
(8) ppgs: Sporophytes during meiosis (14,842 5' ESTs; 15,076 3' ESTs)
Together with 18,364 other mRNA sequences
deposited in GenBank as of Nov 9, 2006, a total of 258,332 ESTs are available in
public DNA databases. These ESTs were
assembled into 32,128 sequences (20,427 contigs and 11,701 singlets). The public EST data have since grown to 382,584,
including 125,160 Villersexel sequences.
All clones corresponding to the ESTs in the
eight full-length cDNA libraries are distributed from RIKEN Bio Resource Center
(http://www.brc.riken.jp/lab/epd/Eng/index.shtml). Available clones are searchable
in PHYSCObase (http://moss.nibb.ac.jp).
13.2.5 Details of the full-length cDNA libraries
The individual clones from the following
libraries are distributed from RIKEN Bio Resource Center (http://www.brc.riken.jp/inf/en/).
(1) Untreated protonemata library:
Physcomitrella patens (Hedw.) Bruch & Schimp subsp. patens collected in Gransden Wood, Huntingdonshire, UK, was used as the
wild-type strain. Protonemata were homogenized with a Polytron (Kinematica,
Littau, Switzerland), and inoculated in BCDATG medium at 25°C under continuous
light, and the tissues were harvested at the 13th and 14th days. The collected
tissue contained protonemata and young gametophores with two to five leaves. Full-length
cDNA was recovered by using the biotinylated CAP trapper method and the
single-strand linker ligation method was used in the construction of the cDNA
libraries. Clones originating from this library are designated as “pph” clones. >90% of clones should have a complete open reading
frame. Full-length cDNAs were cloned into the vector in Fig. 1, which has a
pBluescriptII backbone with Amp resistance.
(2) Auxin-treated library:
Protonemata were homogenised with a Polytron
and inoculated into BCD medium that contained 1.0 mM CaCl2 and 1.0
µM NAA (naphthalene acetic acid; Sigma-Aldrich, St. Louis, MO) at 25°C
under continuous light, and the tissues were harvested between the 8th and 11th days.
The collected NAA-treated tissue contained chloronemata, caulonemata, and
rhizoid-like protonemata. Full-length cDNA was recovered by using the
biotinylated CAP trapper method and the single-strand linker ligation method was
used in the construction of the cDNA libraries. One round of normalization was
performed. Clones originating from this library are designated as “pphn”
clones. >90% of clones should have a complete open reading frame. Full-length cDNAs were cloned into the
vector in Fig. 1, which has a pBluescriptII backbone with Amp resistance.
(3) Cytokinin-treated library:
Protonemata were homogenised with a Polytron
and inoculated into BCD medium that contained 1.0 mM CaCl2 and 0.50
µM BA (6-benzylaminopurine; Sigma-Aldrich) at
25°C under continuous light, and the tissues were harvested between the 8th and
13th days. The collected BA-treated tissue contained chloronemata, caulonemata,
and malformed buds. Full-length cDNA was recovered by using the biotinylated
CAP trapper method and the single-strand linker ligation method was used in the
construction of the cDNA libraries. One round of normalization was performed. Clones
originating from this library are designated as “pphb” clones. >90% of
clones should have a complete open reading frame. Full-length cDNAs were cloned into the vector in Fig. 1, which has
a pBluescriptII backbone with Amp resistance.
Fig. 1 A vector used for cap-trapper full-length cDNA libraries.
(4) Library for protoplasts at a stage of the first cell division:
Protonemata were subcultured into BCDATG
medium every ca. 5 days and protoplasts were prepared. Isolated protoplasts
were incubated at 25°C under continuous light for 2-3 days, by which time the number of
cells at the stage of the first cell division, which is asymmetric, or cells with
protrusions had increased. Full-length cDNA was recovered by the Vector-Capping
method (Kato et al. (2005) DNA Res. 12:53-62). Clones originating from this
library are designated as “pphf” clones. Full-length cDNAs were cloned into the
pGCAPzf3 vector (Fig. 2).
5'....GCCAGGGTTTTCCCAGTCACGACGTTGTAAAACGACGGCCAGTGAA
      (M13 Forward Primer)
AATTTGAATTGTAATACGACTCACTATAGGGCGAATTGGCGGCCAAATCGGCC
      (T7 Transcription Start; T7 promoter; SfiI)
GAATT( cDNA )GGCCATAAGGGCCAGCTTGAG
      (SfiI)
TATTCTATAGTGTCACCTAAATAGCTTGGCGTAATCATGGTCATAGCTGTTTC
      (SP6 Transcription Start; SP6 promoter; M13 Reverse Primer)
CTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATACGAGC....3'
Fig. 2 pGCAPzf3 vector promoter and cloning site sequence
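When checking a sequencing read of a pphf clone, the two SfiI sites (recognition sequence GGCCNNNNNGGCC) flanking the cDNA are convenient landmarks. The Python sketch below locates them with a regular expression. The flanking sequences are copied from Fig. 2; the insert sequence and the function name between_sfii_sites are hypothetical, and a real read may contain sequencing errors that this exact-match sketch does not tolerate.

```python
import re

# SfiI recognises GGCCNNNNNGGCC, where N is any base.
SFII = re.compile(r"GGCC[ACGT]{5}GGCC")

# Vector sequence on either side of the cDNA, as shown in Fig. 2.
UPSTREAM = "AATTTGAATTGTAATACGACTCACTATAGGGCGAATTGGCGGCCAAATCGGCCGAATT"
DOWNSTREAM = "GGCCATAAGGGCCAGCTTGAG"


def between_sfii_sites(read):
    """Return the part of the read lying between the first and last SfiI site."""
    sites = list(SFII.finditer(read.upper()))
    if len(sites) < 2:
        return None  # fewer than two sites found in this read
    return read[sites[0].end():sites[-1].start()]


hypothetical_cdna = "ATGGCTTCTTAA"  # made-up insert for illustration only
read = UPSTREAM + hypothetical_cdna + DOWNSTREAM
print(between_sfii_sites(read))
```

The returned string still begins with the GAATT that precedes the cDNA in the vector, so trim it if you need the insert alone.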
(5) Library for sporophytes before meiosis
with surrounding archegonia:
Mosses were grown on Jiffy 7 for 4-6 weeks
at 25°C (24L) followed by 3-4 weeks at 15°C (8L16D). Sporophytes before meiosis, together with surrounding archegonia,
were collected under a stereomicroscope.
Full-length cDNA was recovered using the oligo-capping method (Maruyama
and Sugano 1994. Gene 138: 171-174). Clones originating from this library are
designated as “ppsp” clones. Full-length cDNAs were
cloned into the DraIII sites of pME18S-FL3 vector (AB009864) as shown in Fig. 3.
Fig. 3 pME18S-FL3 vector
Note: this vector consists of a promoter, an intron donor, an
acceptor, and a polyadenylation signal from Simian virus 40 (SV40), and a
promoter from Human T-lymphotropic virus (HTLV), in addition to the pUC18
backbone. The stuffer sequence, which
should be absent from a good clone, is tetR from Escherichia coli.
(6) ppls: Upper halves of gametophores
Gametophores were cultivated on sterile peat pellets (Jiffy-7; Jiffy
Products International AS, Kristiansand, Norway) for 1 to 1.5 months at 25°C
under continuous light.
The cDNA library was prepared by the
oligo-capping method (Maruyama and Sugano 1994. Gene 138: 171-174). The cDNAs were cloned into the DraIII sites of pME18S-FL3 vector (AB009864) as shown in Fig. 3.
(7) ppaa: Gametophore tips including the antheridia and archegonia
(unpublished)
Gametophores cultivated under the same
conditions and for the same period as for the ppls library were moved to 15°C under 8-h light/16-h dark
conditions to induce gametangia and cultivated for 11 to 26 days. Those
archegonia that were brown in color, which indicates fertilization, were
discarded.
The cDNA library was prepared by the
oligo-capping method (Maruyama and Sugano 1994. Gene 138: 171-174). The cDNAs were cloned into the DraIII sites of pME18S-FL3 vector (AB009864) as shown in Fig. 3.
(8) ppgs: Sporophytes during meiosis (unpublished)
The gametophores harbouring gametangia were
further cultivated at 15°C under short-day conditions for 31 to 53 days to
collect sporophyte tissue during meiosis. Archegonial tissue was removed and
only sporophytic tissue was collected.
The cDNA library was prepared by the
oligo-capping method (Maruyama and Sugano 1994. Gene 138: 171-174). The cDNAs were cloned into the DraIII sites of pME18S-FL3 vector (AB009864) as shown in Fig. 3.
T. Nishiyama, T. Fujita, T. Shin-I, M. Seki, H. Nishide, I. Uchiyama, A. Kamiya, P. Carninci, Y. Hayashizaki, K. Shinozaki, Y. Kohara, M. Hasebe, Comparative genomics of the Physcomitrella gametophytic transcriptome and Arabidopsis genome: implication for the land plant evolution. PNAS 100, 8007-8012 (2003)
T. Fujita, T. Nishiyama, Y. Hiwatashi, M. Hasebe, Gene tagging, Gene- and Enhancer-trapping, and Full-length cDNA Overexpression in Physcomitrella patens. In New Frontiers in Bryology: Physiology, Molecular Biology, and Functional Genomics. A. J. Wood, M. J. Oliver, and D. J. Cove eds. Kluwer Academic Publishers, Dordrecht, Netherlands, pp. 111-132 (2004)
Introduction
To construct expression and
knockout moss lines, you must first obtain DNA fragments located in the 5' and
3' regions of the target coding sequence. The size of the DNA fragments is
usually between 1 and 2 kb.
There are two websites you can use to get
these sequences:
JGI:
http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html
PHYSCObase: http://moss.nibb.ac.jp/
http://moss.nibb.ac.jp/cgi-bin/blast-assemble
1.) Using JGI to retrieve genomic sequence:
Procedure:
1.) Click on the BLAST tab.
2.) Choose the alignment program blastn: blast nucleotide vs.
nucleotide.
3.) Leave the defaults for expect and word size; they are fine to begin with.
4.) Choose the database Physcomitrella_patens.1_1.
5.) Paste the genomic or cDNA sequence of your gene of interest into the
box for query sequence.
6.) You may enter your e-mail address if you would like the result
e-mailed to you, but it is not necessary.
7.) Click Submit job.
8.) Please wait while the server searches for sequences that show
similarity. This may take a few minutes
depending on queries ahead of yours.
9.) The output will show you scaffolds that have similarity to your
sequence in a graphical format.
Scaffolds that are red show a high similarity score of at least 200 bits
(corresponding to 100-bp complete match) to your sequence. The first scaffold
likely contains your gene. Make a note
of the scaffold number.
10.) Mouse over the graph and click on the red line for the scaffold.
11.) At the top of the screen you will see a table. This contains your
query length, the scaffold number, and where your query sequence lies in the
scaffold.
For example, a table showing a query length of 564, scaffold_41, and the positions 556327 and 556890 means that
your query length was 564 bp and that it is located in scaffold 41 between 556327
and 556890 bp. It is best to make a
note of the location of your cDNA sequence; it will be useful in the following
steps. You can click on others to view
other scaffolds that contain your gene.
Beneath the
table you can see a graph that shows the similarity between your sequence and
the scaffold.
Below this you can see the alignment of
your query and the scaffold. If you click on seq next to scaffold seq, you can
retrieve the scaffold sequence that corresponds to your query.
12.) Click on the scaffold name found in the first
table (shown in step 11). In this
example you would click on scaffold_41.
13.) In this table you can see all of the contigs
that were used to assemble the scaffold.
14.) You can click on get sequence to retrieve the entire genomic sequence of the scaffold, but it is
easier to retrieve the sequence of interest by entering appropriate
numbers in the Start and End boxes. If you
want 2-kb flanking regions on either side of the hit region (556327 - 556890), the Start is
calculated as 556327 - 2000 = 554327, and the End is calculated as 556890 +
2000 = 558890. Type 554327 in
Start and 558890 in End, and click get sequence. If the region of interest contains many Ns,
you may need to get the reassembly of the region based on the individual
sequence reads at PHYSCObase to make a good judgment of the genomic sequence.
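The Start and End arithmetic in step 14 can be wrapped in a small helper when you process several hits. This is a minimal Python sketch; the function name flanking_window and its parameters are illustrative, and the clamping to position 1 and to an optional scaffold length is an added assumption for hits near a scaffold end.

```python
def flanking_window(hit_start, hit_end, flank=2000, scaffold_length=None):
    """Return (start, end) covering the hit plus a flank on each side."""
    start = max(1, hit_start - flank)   # coordinates cannot go below 1
    end = hit_end + flank
    if scaffold_length is not None:
        end = min(end, scaffold_length)  # do not run past the scaffold end
    return start, end


# Hit region from the example in step 11.
print(flanking_window(556327, 556890))  # (554327, 558890)
```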
2.) Using PHYSCObase to retrieve genomic
sequence.
“PHYSCObase”
http://moss.nibb.ac.jp/
“Blast Assemble Data Submission” http://moss.nibb.ac.jp/cgi-bin/blast-assemble
If you cannot find a genomic DNA sequence
corresponding to your gene by BLAST searching the JGI database, you should
try to identify the sequence using “Blast Assemble Data Submission”.
Procedure:
1.) Click on the link, or use this address: http://moss.nibb.ac.jp/cgi-bin/blast-assemble. Alternatively, if you start at the homepage of
PHYSCObase, click “DNA database” and then click
“BLAST raw WGS sequence and assemble into contig”.
2.) You may enter the name of your gene in the box for sequence name if
you wish, but it is not necessary.
3.) Paste the genomic or cDNA sequence (fasta format) of your gene of
interest into the box for sequence; this is your query.
4.) Choose “nucleotide strict” and “Physcomitrella
patens”.
5.) Click Construct Contigs.
6.) Please wait while the server searches for sequences that show
similarity. This may take about 10
minutes depending on queries ahead of yours.
7.) After a few minutes, click get the highest score scaffold to retrieve a genomic DNA sequence corresponding to your gene.
8.) Usually you can see three sequences in the result. The first and
third sequences are shown in lowercase letters above and below the ‘nnnn…nnnn’ lines, respectively. These
represent possible, but not guaranteed, extreme 5’ and 3’ genomic sequences of
your gene. The second sequence is shown
in capital letters between the ‘nnnn…nnnn’ lines. It should overlap with the sequence that
you used as the query and is usually sufficient for designing primers to make
constructs.
9.) Copy and paste the second sequence into a new file and use it to
construct a contig with your original query sequence.
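If you save the result as text, the uppercase middle sequence described in step 8 can be separated from the lowercase flanks programmatically. The Python sketch below rests on assumptions: the separator is taken to be a run of at least ten lowercase n characters (the exact length on the result page is not stated here), and the function name split_result is illustrative. The toy sequences are made up.

```python
import re


def split_result(text, min_run=10):
    """Split a blast-assemble style result at long runs of 'n'.

    Returns (middle, flanks): uppercase parts and lowercase parts.
    """
    joined = "".join(text.split())  # drop line breaks and spaces
    parts = [p for p in re.split("n{%d,}" % min_run, joined) if p]
    middle = [p for p in parts if p.isupper()]
    flanks = [p for p in parts if not p.isupper()]
    return middle, flanks


separator = "n" * 12
toy_result = "\n".join(["acgtacgt", separator, "ACGTTGCA", separator, "ttgacc"])
middle, flanks = split_result(toy_result)
print(middle)  # ['ACGTTGCA']
print(flanks)  # ['acgtacgt', 'ttgacc']
```

A lowercase flank that itself contains a long run of n would be split further, so check the number of parts before relying on the output.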
The draft genome sequence consists of 2106
scaffolds that are at least 1 kb in length and contains many gaps filled
with Ns. A gap within a scaffold
usually arises because a very similar sequence exists elsewhere in the
genome. Such gaps can sometimes be
recovered by locally assembling the original sequence reads
(http://moss.nibb.ac.jp/cgi-bin/blast-assemble).
When
you find a small difference between your clone and the draft genome,
it may indicate that there is another copy of the gene rather than a PCR
error. To seek independent
evidence, you may search the JGI_raw dataset in PHYSCObase. The raw sequence data from the whole-genome shotgun sequencing done at the DOE Joint
Genome Institute were incorporated from the FTP site of the NCBI
Trace Database (ftp://ftp.ncbi.nih.gov/pub/TraceDB/physcomitrella_patens/)
and are available for blastn and tblastn searches in PHYSCObase. Since this dataset contains redundant data,
you may find multiple sequences from one locus.
For a single-copy gene, the expected number of complete or nearly complete
hits is about 8, reflecting the roughly 8-fold redundancy of the data. If you see more,
the sequence may be present in multiple copies in the genome.
An interface to make local contigs is
available so that you can distinguish hits from different loci.
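As a rough rule of thumb, the number of complete or nearly complete raw-read hits divided by 8 hints at the copy number. The Python sketch below shows only this arithmetic; the function name estimated_copies is illustrative, the coverage value of 8 is the figure given above, and read depth varies by chance along the genome, so treat the result as a hint to be confirmed by local assembly.

```python
def estimated_copies(n_hits, coverage=8.0):
    """Rough copy-number estimate from the number of raw-read hits."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return n_hits / coverage


print(estimated_copies(8))   # 1.0
print(estimated_copies(25))  # 3.125
```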