13. How to use genome information

Tomoaki Nishiyama, Kari Thompson, Tomomichi Fujita, Yuji Hiwatashi, and Mitsuyasu Hasebe

The draft genome sequence of Physcomitrella patens (Rensing et al. 2008) serves as a foundation for molecular biological analysis in P. patens. The primary sequence is served at the JGI genome portal and is also available from the DDBJ/EMBL/GenBank international sequence databases.

The gene annotation, especially the exon-intron boundaries, is not very accurately predicted. We have a large number of ESTs from full-length cDNA clones from various stages of the life cycle. This information is available through PHYSCObase.

Public genome information sites

    PHYSCObase    http://moss.nibb.ac.jp

    cosmoss       http://www.cosmoss.org

    JGI           http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html

    NCBI          http://www.ncbi.nlm.nih.gov/

13.1 Searching homologues of your protein coding gene

You need a query sequence, preferably an amino acid sequence.

Connect with a browser to http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html

Select blast tab

Change the program to tblastn

Select the target dataset as Physcomitrella_patens.1_1

Paste your amino acid sequence

Press “Submit job” button.

Wait a few minutes until the search is complete

Press “Refresh Display” button.

Now you get a graphical hit map

Click one of the arrows if any hit was found.

Choose View on browser.

The hit will be displayed in an upper track.

Zoom appropriately if you find any gene model or EST close to the hit region.

You can see gene models and ESTs mapped to the genome, if there are any.

The ESTs are shown condensed as green bars in the default display, and can be expanded by clicking on a green bar.

Note that the gene model may not be correct and that not all of the ESTs currently available are mapped.

You can retrieve the genomic sequence of the region shown in the browser by clicking DNA in the upper right box. The DNA sequence can be used to search PHYSCObase. Since an EST sequence may not reach the conserved domain of the full-length cDNA, you may find a good cDNA clone this way that you would miss if you searched the EST dataset in PHYSCObase directly.

13.2 Searching EST clones at PHYSCObase

The most common use of PHYSCObase is to search ESTs and full-length cDNA clones with the BLAST program. A full-length cDNA clone is useful for constructing ectopic expression or overexpression constructs and for producing recombinant proteins in bacterial or other expression systems. The EST data are also useful for understanding gene structure, such as transcriptional start sites, intron-exon junctions, and polyadenylation sites. For genes with sufficient expression levels, you may also find alternative transcripts.

You can search either contigs or individual clones with a BLASTN, TBLASTN, or TBLASTX search. There are two datasets for EST data:

The PHYSCObase contigs dataset contains all the assembled sequences. This dataset has reduced redundancy but should contain all the sequence information. Since a longer match can be detected with a contig sequence, this dataset should have higher sensitivity. This is the default dataset.

The PHYSCObase clones dataset contains sequence data from every clone and GenBank entry. This dataset may be used when you are looking for a clone with some minute difference that may be hidden by the assembly process. More time is needed to search this dataset.

Both datasets contain nucleotide sequences, and three variants of the BLAST program can be used: BLASTN, TBLASTN, and TBLASTX. If you are searching for protein coding sequences, TBLASTN is recommended; in this case, you enter an amino acid sequence as the query. Use BLASTN when you want to know whether a genomic or cDNA sequence has corresponding ESTs. BLASTN is also useful when you have sequenced 3'-RACE products and want to know whether full-length clones are available.
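As a rough illustration of the program choice described above, the following Python sketch guesses whether a query is an amino acid or a nucleotide sequence and suggests a program accordingly. The function names and the simple alphabet heuristic are assumptions made for this example; they are not part of PHYSCObase.

```python
# Letters that can plausibly appear in a nucleotide query (N = unknown base).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def looks_like_protein(seq: str) -> bool:
    """Heuristic: any letter outside A, C, G, T, U, N suggests amino acids."""
    letters = {c for c in seq.upper() if c.isalpha()}
    return bool(letters - NUCLEOTIDE_LETTERS)


def suggest_program(query: str) -> str:
    """Suggest a BLAST program for searching the nucleotide EST datasets."""
    if looks_like_protein(query):
        # Protein query against a nucleotide dataset.
        return "TBLASTN"
    # Nucleotide query; TBLASTX (translated query vs. translated dataset)
    # is the alternative when only protein-level similarity is expected.
    return "BLASTN"


print(suggest_program("MDIPYYHYDHGG"))     # amino acid query
print(suggest_program("ATGGATATTCCGTAT"))  # nucleotide query
```

A very short peptide made up only of A, C, G, and T residues would be misclassified by this heuristic, so the guess should be checked by eye.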

13.2.1 Interpretation of the BLAST results

After the BLAST search, you may find some contigs, clones, and GenBank entries. A contig has the form ContigNNNNNN, where N is a digit. Sequences taken from GenBank are labeled gb:[accession no.]. Others are individual EST sequences: a 3'-end sequence has a prefix of r, and the others are 5'-end sequences. The distinction between the libraries is explained later in 13.2.3 and 13.2.4.

An example TBLASTN result can be seen as cuc2-BLAST.html. The CUC2 amino acid sequence was used as the query. The description list of the hits is as follows.

                                                     Score     E
Sequences producing significant alignments:          (bits)  Value

gnl|contig|Contig3017   Contig3017                     162   1e-40
gnl|contig|pphn33o16    pphn33o16                      128   2e-30
gnl|contig|pphb14p21    pphb14p21                       57   5e-09
gnl|contig|Contig3604   Contig3604                      46   1e-05
gnl|contig|pphb2h16     pphb2h16                        30   1.1
gnl|contig|Contig3526   Contig3526                      28   2.4
gnl|contig|Contig12137  Contig12137                     27   5.3
gnl|contig|Contig7880   Contig7880                      27   6.9
gnl|contig|pphn35f03    pphn35f03                       27   9.0
gnl|contig|Contig3272   Contig3272                      27   9.0

You see two contigs and two 5'-end sequences hit with E values less than 1e-3, and four contigs and two 5'-end sequences hit with E values above 1. Similarities with E values above 1 usually arise just by chance. If a strong similarity in a short region is observed, the match may be meaningful despite the large E value. Such information can be read from the alignments, which are shown after the list of significant matches. You can jump to the corresponding alignment by clicking the score, which is to the left of the E value and is usually displayed in blue and underlined. In this case, we consider only the four sequences with E values less than 1e-3 to have significant similarity.
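The reading of the hit list above can be sketched in Python as follows. The hit lines are copied from the example (with spacing normalized); the function names and the parsing approach are assumptions made for illustration, and the 1e-3 cutoff is the one used in the text.

```python
# Description lines from the CUC2 example: id, name, bit score, E value.
HIT_LIST = """\
gnl|contig|Contig3017 Contig3017 162 1e-40
gnl|contig|pphn33o16 pphn33o16 128 2e-30
gnl|contig|pphb14p21 pphb14p21 57 5e-09
gnl|contig|Contig3604 Contig3604 46 1e-05
gnl|contig|pphb2h16 pphb2h16 30 1.1
gnl|contig|Contig3526 Contig3526 28 2.4
gnl|contig|Contig12137 Contig12137 27 5.3
gnl|contig|Contig7880 Contig7880 27 6.9
gnl|contig|pphn35f03 pphn35f03 27 9.0
gnl|contig|Contig3272 Contig3272 27 9.0
"""


def sequence_kind(name):
    """Classify a hit name using the naming rules given in the text."""
    if name.startswith("Contig"):
        return "contig"
    if name.startswith("gb:"):
        return "genbank"
    if name.startswith("r"):
        return "3'-end"
    return "5'-end"


def parse_hits(text):
    """Return (name, bit score, E value) tuples from the description lines."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        hits.append((fields[1], float(fields[-2]), float(fields[-1])))
    return hits


def significant_hits(hits, threshold=1e-3):
    """Keep hits whose E value is below the threshold."""
    return [hit for hit in hits if hit[2] < threshold]


for name, score, evalue in significant_hits(parse_hits(HIT_LIST)):
    print(name, sequence_kind(name), score, evalue)
```

An E-value filter like this is only a first pass; short but strong matches with large E values still need to be judged from the alignments themselves.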

13.2.2 Finding the information associated to the sequence

Once you find contigs or clones, you will want to see which clones a contig contains and the sequences of those clones from the other end. To do this, enter the contig number or clone name in the search box on the top page, http://moss.nibb.ac.jp/.

This will search for matches in a table that contains the clone name, putative transcript id, 5' contig number, and 3' contig number. If your query was a clone name, you will get just one line containing the clone name, the putative transcript id, the identification number of the contig to which the 5' sequence of the clone belongs (preceded by contig1), and the identification number of the contig to which the 3' sequence of the clone belongs (preceded by contig2).

A putative transcript id has the form Pnnnnnn, where n represents a digit, and identifies a pair of contigs or a single contig. When the 5' and 3' sequences of a clone belong to different contigs, the two contigs are considered to represent the 5' and 3' end sequences of a single transcript. When both end sequences are contained in a single contig, this contig itself likely constitutes a putatively complete transcript. Inconsistency among clones necessitated the use of a more complex rule. When one clone ties contigA (5') and contigB (3') but another clone ties contigC (5') and contigB (3'), we treat them as different putative transcripts. When yet another clone has both sequences in contigB, we cannot specify to which transcript it belongs and so assign a new putative transcript id.
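The rule above amounts to giving every distinct (5' contig, 3' contig) pair its own identifier. A minimal Python sketch of that idea follows; it covers only clones for which both end sequences are known, and the function name, the clone names, and the T001-style identifiers are hypothetical (real ids have the form Pnnnnnn and are assigned by PHYSCObase).

```python
def assign_putative_transcripts(clone_ends):
    """Map each clone to an id shared by clones with the same contig pair.

    clone_ends: dict of clone name -> (5' contig, 3' contig).
    """
    pair_to_id = {}
    assignment = {}
    for clone, pair in clone_ends.items():
        if pair not in pair_to_id:
            # A contig pair not seen before defines a new putative transcript.
            pair_to_id[pair] = f"T{len(pair_to_id) + 1:03d}"
        assignment[clone] = pair_to_id[pair]
    return assignment


example = {
    "clone1": ("contigA", "contigB"),  # ties contigA (5') and contigB (3')
    "clone2": ("contigC", "contigB"),  # different 5' contig, same 3' contig
    "clone3": ("contigB", "contigB"),  # both ends in contigB
    "clone4": ("contigA", "contigB"),  # same pair as clone1
}

for clone, transcript in assign_putative_transcripts(example).items():
    print(clone, transcript)
```

Here clone1 and clone4 share an id, while clone2 and clone3 each receive their own, mirroring the three cases described in the text.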

For example, enter Contig3017 to search for clones that have either end sequence in Contig3017. This will return a list like the following. Only the first two lines are shown here, but you can see the complete page in another window.

pphb13d01    P007036    contig1 003017

pphb14e10    P007036    contig1 003017    contig2 003018

The putative transcript ID is linked to the putative transcript information page. A putative transcript page begins with links to BLASTX results for the conceptual putative transcript, the 5' contig, and the 3' contig, and with links to the 5' and 3' contig information pages. The BLASTX result can tell you whether your query is among the strong hits in the nr dataset. In the case of P007036, the top hits are NAC proteins from Arabidopsis, which supports the view that P007036 represents a member of the NAC family. The result is just like the original NCBI BLAST output, but the taxon from which each gene was isolated is shown to the right of the E value. The taxon name is colored according to its phylogenetic position. Since the BLAST results are precalculated and stored, viewing them is much faster than performing an actual BLAST search. On the other hand, if you want results against the latest database, retrieve the sequences of both contigs and perform the BLAST search elsewhere, for example at NCBI.

Then a list of clones, produced in our EST project, belonging to the putative transcript follows. The list of clones contains the clone name, link to their sequences (Seq.), and description of the best hit sequence in a BLASTX search against the nr dataset. Note: The complete BLASTX results by individual clone sequences against the nr dataset are currently unavailable.

Finally, the Alignment section contains a brief overview of the relative positions of the clones belonging to both contigs. In the Alignment section, clones that have both end sequences in the contigs defining the putative transcript are shown first, from the longest clone to the shortest. The relative lengths are estimated from the start point and end point in the 5' and 3' contigs, respectively. Clones with only one end belonging to a contig follow.

The clone names are linked to a search program, so that you can find the putative transcript to which a clone belongs when only one end of the clone is in one of the contigs. 5' sequences are colored blue and 3' sequences are colored red. Dead clones, which did not grow on replica plates, are black, and badly growing clones are gray. GenBank entries are shown in green and are linked to the entry.

13.2.3 Names of each clone, EST, and contig

Each clone, EST, contig, and putative transcript has an identifier assigned to it.

Individual clones are usually named as pp + library code (one or two characters) + plate number + row (a-p) + column number (01-24).

The sequence from the 5' end carries the name of the clone itself, and the sequence from the 3' end has a prefix of r before the clone name.

The identifiers have the following forms:

ContigXXXXXX

Contig. This number is subject to change.

PXXXXXX

Putative transcript represented by a single contig or a pair of contigs. This number will be conserved as far as possible.

gb:[accession no.]

Sequence from public DNA database (DDBJ/GenBank/EMBL).

MSXXX

clone from non-treated library or its 5' sequence.

pphXXXXX

clone from non-treated library or its 5' sequence.

rpphXXXXX

3' sequence of clone pphXXXXX.

pphnXXXXX

clone from auxin(NAA)-treated library or its 5' sequence.

rpphnXXXXX

3' sequence of clone pphnXXXXX.

pphbXXXXX

clone from cytokinin-treated library or its 5' sequence.

rpphbXXXXX

3' sequence of clone pphbXXXXX.

pphfXXXXX

clone from first protoplast division library (protoplasts during the first cell division) or its 5' sequence.

rpphfXXXXX

3' sequence of clone pphfXXXXX.

ppspXXXXX

clone from sporophyte library (sporophytes before meiosis with surrounding archegonia) or its 5' sequence.

rppspXXXXX

3' sequence of clone ppspXXXXX.

pplsXXXXX

clone from leafy shoot library (upper halves of gametophores) or its 5' sequence.

rpplsXXXXX

3' sequence of clone pplsXXXXX.

ppaaXXXXX

clone from antheridia and archegonia library (gametophore tips including the antheridia and archegonia) or its 5' sequence.

rppaaXXXXX

3' sequence of clone ppaaXXXXX.

ppgsXXXXX

clone from green sporophyte library (sporophytes during meiosis) or its 5' sequence.

rppgsXXXXX

3' sequence of clone ppgsXXXXX.

XXXXXX

A string beginning with a digit.
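The naming scheme listed above can be checked mechanically. The following Python sketch decodes an identifier into its parts; the regular expression, the function name, and the returned dictionary layout are assumptions made for this example. Rows are matched as a-p because clone names such as pphb14p21 in 13.2.1 carry row p.

```python
import re

# Library codes that follow the "pp" prefix, as listed in the text.
LIBRARIES = {
    "h": "non-treated",
    "hn": "auxin (NAA)-treated",
    "hb": "cytokinin-treated",
    "hf": "first protoplast division",
    "sp": "sporophyte",
    "ls": "leafy shoot",
    "aa": "antheridia and archegonia",
    "gs": "green sporophyte",
}

# Optional r (3' sequence) + pp + library code + plate + row + column.
CLONE_RE = re.compile(r"^(r?)pp(hn|hb|hf|sp|ls|aa|gs|h)(\d+)([a-p])(\d{2})$")


def describe(identifier):
    """Return a dictionary describing a PHYSCObase-style identifier."""
    if identifier.startswith("gb:"):
        return {"kind": "genbank", "accession": identifier[3:]}
    if re.fullmatch(r"Contig\d+", identifier):
        return {"kind": "contig", "number": int(identifier[6:])}
    if re.fullmatch(r"P\d{6}", identifier):
        return {"kind": "putative transcript", "number": int(identifier[1:])}
    match = CLONE_RE.match(identifier)
    if match:
        prefix, library, plate, row, column = match.groups()
        return {
            "kind": "EST",
            "end": "3'" if prefix else "5'",
            "clone": identifier[len(prefix):],
            "library": LIBRARIES[library],
            "plate": int(plate),
            "row": row,
            "column": int(column),
        }
    return {"kind": "unknown"}


print(describe("rpphn33o16"))
print(describe("Contig3017"))
```

Identifiers of the MSXXX form and strings beginning with a digit are not decoded by this sketch and are reported as unknown.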

13.2.4 EST data

EST analyses of the following eight full-length cDNA libraries were performed and all ESTs are deposited in public DNA databases.

(1) MS + pph: Untreated protonemata library (9,944 5' ESTs; 9,352 3' ESTs)

(2) pphn: Auxin-treated library (16,733 5' ESTs; 16,763 3' ESTs)

(3) pphb: Cytokinin-treated library (16,450 5' ESTs; 15,000 3' ESTs)

(4) pphf: Library for protoplasts during the first cell division (10,535 5' ESTs; 10,975 3' ESTs)

(5) ppsp: Library for sporophytes before meiosis with surrounding archegonia (8,514 5' ESTs; 8,241 3' ESTs)

(6) ppls: Upper halves of gametophores (52,838 5' ESTs; 4,500 3' ESTs)

(7) ppaa: Gametophore tips including the antheridia and archegonia (14,443 5' ESTs; 14,925 3' ESTs)

(8) ppgs: Sporophytes during meiosis (14,842 5' ESTs; 15,076 3' ESTs)

Together with the other 18,364 mRNA sequences deposited in GenBank as of Nov 9, 2006, a total of 258,332 ESTs are available in public DNA databases. These ESTs are assembled into 32,128 sequences (20,427 contigs and 11,701 singlets). The public EST data have grown to 382,584 sequences, including 125,160 Villersexel sequences.

All clones corresponding to the ESTs in the eight full-length cDNA libraries are distributed by the RIKEN Bio Resource Center (http://www.brc.riken.jp/lab/epd/Eng/index.shtml). Available clones are searchable in PHYSCObase (http://moss.nibb.ac.jp).

13.2.5 Details of the full-length cDNA libraries

The individual clones from the following libraries are distributed by the RIKEN Bio Resource Center (http://www.brc.riken.jp/inf/en/).

(1) Untreated protonemata library:

Physcomitrella patens (Hedw.) Bruch & Schimp subsp. patens collected in Gransden Wood, Huntingdonshire, UK, was used as the wild-type strain. Protonemata were homogenized with a Polytron (Kinematica, Littau, Switzerland) and inoculated in BCDATG medium at 25°C under continuous light, and the tissues were harvested on the 13th and 14th days. The collected tissue contained protonemata and young gametophores with two to five leaves. Full-length cDNA was recovered using the biotinylated CAP trapper method, and the single-strand linker ligation method was used in the construction of the cDNA libraries. Clones originating from this library are designated as “pph” clones. More than 90% of clones should have a complete open reading frame. Full-length cDNAs were cloned into the vector in Fig. 1, which has a pBluescriptII backbone with Amp resistance.

(2) Auxin-treated library:

Protonemata were homogenized with a Polytron and inoculated into BCD medium that contained 1.0 mM CaCl₂ and 1.0 µM NAA (naphthalene acetic acid; Sigma-Aldrich, St. Louis, MO) at 25°C under continuous light, and the tissues were harvested between the 8th and 11th days. The collected NAA-treated tissue contained chloronemata, caulonemata, and rhizoid-like protonemata. Full-length cDNA was recovered using the biotinylated CAP trapper method, and the single-strand linker ligation method was used in the construction of the cDNA libraries. One round of normalization was performed. Clones originating from this library are designated as “pphn” clones. More than 90% of clones should have a complete open reading frame. Full-length cDNAs were cloned into the vector in Fig. 1, which has a pBluescriptII backbone with Amp resistance.

(3) Cytokinin-treated library:

Protonemata were homogenized with a Polytron and inoculated into BCD medium that contained 1.0 mM CaCl₂ and 0.50 µM BA (6-benzylaminopurine; Sigma-Aldrich) at 25°C under continuous light, and the tissues were harvested between the 8th and 13th days. The collected BA-treated tissue contained chloronemata, caulonemata, and malformed buds. Full-length cDNA was recovered using the biotinylated CAP trapper method, and the single-strand linker ligation method was used in the construction of the cDNA libraries. One round of normalization was performed. Clones originating from this library are designated as “pphb” clones. More than 90% of clones should have a complete open reading frame. Full-length cDNAs were cloned into the vector in Fig. 1, which has a pBluescriptII backbone with Amp resistance.

Fig. 1 A vector used for cap-trapper full-length cDNA libraries.

(4) Library for protoplasts at the stage of the first cell division:

Protonemata were subcultured into BCDATG medium every ca. 5 days and protoplasts were prepared. Isolated protoplasts were incubated at 25°C under continuous light for 2-3 days, by which time the number of cells at the stage of the first cell division, which is asymmetric, or of cells with protrusions had increased. Full-length cDNA was recovered by the Vector-Capping method (Kato et al. (2005) DNA Res. 12:53-62). Clones originating from this library are designated as “pphf” clones. Full-length cDNAs were cloned into the pGCAPzf3 vector (Fig. 2).

5'....GCCAGGGTTTTCCCAGTCACGACGTTGTAAAACGACGGCCAGTGAA
      (M13 Forward Primer)

AATTTGAATTGTAATACGACTCACTATAGGGCGAATTGGCGGCCAAATCGGCC
      (T7 promoter; T7 Transcription Start; SfiI)

GAATT( cDNA )GGCCATAAGGGCCAGCTTGAG
      (SfiI)

TATTCTATAGTGTCACCTAAATAGCTTGGCGTAATCATGGTCATAGCTGTTTC
      (SP6 Transcription Start; SP6 promoter; M13 Reverse Primer)

CTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATACGAGC....3'

Fig. 2 pGCAPzf3 vector promoter and cloning site sequence. The features labeled in the original figure are listed in parentheses below each sequence line, in 5' to 3' order.
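When checking a sequencing read from a pphf clone, it can help to locate the two SfiI sites that flank the cDNA insert. The Python sketch below searches the vector sequence given in Fig. 2 for the SfiI recognition sequence GGCCNNNNNGGCC. The variable and function names are assumptions made for this example, and the two strings are the vector segments immediately upstream and downstream of the cDNA as printed in the figure.

```python
import re

# SfiI recognizes GGCCNNNNNGGCC, where N is any base.
SFI_SITE = re.compile(r"GGCC[ACGT]{5}GGCC")

# Vector sequence flanking the cDNA insert, copied from Fig. 2.
UPSTREAM = "AATTTGAATTGTAATACGACTCACTATAGGGCGAATTGGCGGCCAAATCGGCCGAATT"
DOWNSTREAM = "GGCCATAAGGGCCAGCTTGAG"


def sfi_sites(seq):
    """Return every SfiI recognition sequence found in seq, in order."""
    return [match.group() for match in SFI_SITE.finditer(seq.upper())]


print(sfi_sites(UPSTREAM))    # site on the 5' side of the cDNA
print(sfi_sites(DOWNSTREAM))  # site on the 3' side of the cDNA
```

In a read that spans the cloning site, the cDNA insert lies between the two matches.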

(5) Library for sporophytes before meiosis with surrounding archegonia:

Mosses were grown on Jiffy 7 for 4-6 weeks at 25°C (24L) followed by 3-4 weeks at 15°C (8L16D). Sporophytes before meiosis, together with surrounding archegonia, were collected using stereomicroscopy. Full-length cDNA was recovered using the oligo-capping method (Maruyama and Sugano 1994. Gene 138: 171-174). Clones originating from this library are designated as “ppsp” clones. Full-length cDNAs were cloned into the DraIII sites of the pME18S-FL3 vector (AB009864) as shown in Fig. 3.

Fig. 3 pME18S-FL3 vector

Note: this vector consists of a promoter, an intron donor, an acceptor, and a polyadenylation signal from Simian virus 40 (SV40), and a promoter from Human T-lymphotropic virus (HTLV), in addition to the pUC18 backbone. The stuffer sequence, which should be absent from a good clone, is tetR from Escherichia coli.

(6) ppls: Upper halves of gametophores

Gametophores were cultivated on sterile peat pellets (Jiffy-7; Jiffy Products International AS, Kristiansand, Norway) for 1 to 1.5 months at 25°C under continuous light.

A cDNA library was prepared by the oligo-capping method (Maruyama and Sugano 1994. Gene 138: 171-174). The cDNAs were cloned into the DraIII sites of the pME18S-FL3 vector (AB009864) as shown in Fig. 3.

(7) ppaa: Gametophore tips including the antheridia and archegonia (unpublished)

Gametophores cultivated under the same conditions and for the same period as in (6) were moved to 15°C under 8-h light/16-h dark conditions to induce gametangia and cultivated for 11 to 26 days. Archegonia that were brown in color, which indicates fertilization, were discarded.

A cDNA library was prepared by the oligo-capping method (Maruyama and Sugano 1994. Gene 138: 171-174). The cDNAs were cloned into the DraIII sites of the pME18S-FL3 vector (AB009864) as shown in Fig. 3.

(8) ppgs: Sporophytes during meiosis (unpublished)

The gametophores harbouring gametangia were further cultivated at 15°C under the short-day conditions for 31 to 53 days to collect sporophyte tissue during meiosis. Archegonial tissue was removed and only sporophytic tissue was collected.

A cDNA library was prepared by the oligo-capping method (Maruyama and Sugano 1994. Gene 138: 171-174). The cDNAs were cloned into the DraIII sites of the pME18S-FL3 vector (AB009864) as shown in Fig. 3.

References

T. Nishiyama, T. Fujita, T. Shin-I, M. Seki, H. Nishide, I. Uchiyama, A. Kamiya, P. Carninci, Y. Hayashizaki, K. Shinozaki, Y. Kohara, M. Hasebe, Comparative genomics of the Physcomitrella gametophytic transcriptome and Arabidopsis genome: implication for the land plant evolution. PNAS 100, 8007-8012 (2003)

T. Fujita, T. Nishiyama, Y. Hiwatashi, and M. Hasebe, Gene tagging, Gene- and Enhancer-trapping, and Full-length cDNA Overexpression in Physcomitrella patens. In New Frontiers in Bryology: Physiology, Molecular Biology, and Functional Genomics. A. J. Wood, M. J. Oliver, and D. J. Cove eds. Kluwer Academic Publishers, Dordrecht, Netherlands, pp. 111-132 (2004)

13.3 Retrieving flanking Genomic Sequences for Gene Targeting

Kari Thompson, Tomoaki Nishiyama, and Yuji Hiwatashi

Introduction

To construct expression and knockout moss lines, you must first obtain DNA fragments located in the 5' and 3' regions of the target coding sequence. The size of each DNA fragment is usually between 1 and 2 kb.

There are two websites you can use to get these sequences:

JGI: http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html

PHYSCObase: http://moss.nibb.ac.jp/

            http://moss.nibb.ac.jp/cgi-bin/blast-assemble

1.) Using JGI to retrieve genomic sequence:

Procedure:

1.) Click on the BLAST tab.

2.) Choose the alignment program blastn: blast nucleotide vs. nucleotide.

3.) Leave the defaults for expect and word size; they are fine to begin with.

4.) Choose the database Physcomitrella_patens.1_1.

5.) Paste the genomic or cDNA sequence of your gene of interest into the box for query sequence.

6.) You may enter your e-mail address if you would like the result e-mailed to you, but it is not necessary.

7.) Click Submit job.

8.) Please wait while the server searches for sequences that show similarity. This may take a few minutes depending on the queries ahead of yours.

9.) The output will show you the scaffolds that have similarity to your sequence in a graphical format. Scaffolds shown in red have a high similarity score of at least 200 bits (corresponding to a 100-bp complete match) to your sequence. The first scaffold likely contains your gene. Make a note of the scaffold number.

10.) Mouse over the graph and click on the red line for the scaffold.

11.) At the top of the screen you will see a table. This contains your query length, the scaffold number, and where your query sequence lies in the scaffold.

For example, a table showing a query length of 564, scaffold_41, and the positions 556327 and 556890 means that your query was 564 bp long and is located in scaffold 41 between 556327 and 556890 bp. It is best to make a note of the location of your cDNA sequence; it will be useful in the following steps. You can click on others to view other scaffolds that contain your gene.

Beneath the table you can see a graph that shows the similarity between your sequence and the scaffold.

Below this you can see the alignment of your query and the scaffold. If you click on seq next to scaffold seq, you can retrieve the scaffold sequence that corresponds to your query.

12.) Click on the scaffold name found in the first table (shown in step 11). In this example you would click on scaffold_41.

13.) In this table you can see all of the contigs that were used to assemble the scaffold.

14.) You can click on get sequence to retrieve the entire genomic sequence for the scaffold, but it is easier to retrieve the sequence of interest by entering appropriate numbers in the Start and End boxes. If you want 2-kb flanking regions on both sides of the hit region (556327 - 556890), the Start is calculated as 556327 - 2000 = 554327, and the End is calculated as 556890 + 2000 = 558890. Type 554327 in the Start box and 558890 in the End box, and click get sequence. If the region of interest contains many Ns, you may need to get the reassembly of the region based on the individual sequence reads at PHYSCObase to make a good judgment of the genomic sequence.
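The Start and End arithmetic in step 14, and the check for Ns in the retrieved region, can be sketched in Python as follows. The function names are assumptions made for this example; clamping the Start to 1 and the optional scaffold length are additions that guard against hits close to a scaffold end.

```python
def flanking_window(hit_start, hit_end, flank=2000, scaffold_length=None):
    """Return (Start, End) covering the hit plus `flank` bp on each side."""
    start = max(1, hit_start - flank)  # positions are 1-based
    end = hit_end + flank
    if scaffold_length is not None:
        end = min(end, scaffold_length)  # do not run past the scaffold end
    return start, end


def n_fraction(sequence):
    """Fraction of the retrieved sequence that consists of N (gap) bases."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


# The example from step 14: hit at 556327 - 556890 with 2-kb flanks.
print(flanking_window(556327, 556890))  # (554327, 558890)
```

If n_fraction reports a large value for the retrieved region, the local reassembly at PHYSCObase described in the next part is likely to be worth the extra step.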

2.) Using PHYSCObase to retrieve genomic sequence.　

“PHYSCObase” http://moss.nibb.ac.jp/

“Blast Assemble Data Submission” http://moss.nibb.ac.jp/cgi-bin/blast-assemble

If you cannot find a genomic DNA sequence corresponding to your gene by a BLAST search of the JGI database, you should try to identify the sequence using “Blast Assemble Data Submission”.

Procedure:

1.) Click on the link, or use this address: http://moss.nibb.ac.jp/cgi-bin/blast-assemble. Alternatively, if you start at the PHYSCObase homepage, click “DNA database” and then click “BLAST raw WGS sequence and assemble into contig”.

2.) You may enter the name of your gene in the box for sequence name if you wish, but it is not necessary.

3.) Paste the genomic or cDNA sequence (FASTA format) of your gene of interest into the box for sequence; this is your query.

4.) Choose “nucleotide strict” and “Physcomitrella patens”.

5.) Click Construct Contigs.

6.) Please wait while the server searches for sequences that show similarity. This may take about 10 minutes depending on the queries ahead of yours.

7.) When the search has finished, click get the highest score scaffold to retrieve a genomic DNA sequence corresponding to your gene.

8.) Usually you can see three sequences in the result. The first and third sequences are shown in lowercase letters above and below the ‘nnnn…nnnn’ lines, respectively. These represent possible, but not guaranteed, extreme 5’ and 3’ genomic sequences of your gene. The second sequence is shown in capital letters between the ‘nnnn…nnnn’ lines. It should overlap with the sequence that you used as the query and is usually sufficient for designing primers to make constructs.

9.) Copy and paste the second sequence into a new file and use it to construct a contig with your original query sequence.
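The layout described in step 8 (lowercase sequence, a run of n, capital letters, a run of n, lowercase sequence) can be separated with a few lines of Python. This is a minimal sketch that assumes the result has been pasted as plain text in exactly that layout; the function name and the toy input are hypothetical.

```python
import re


def split_assembly(result_text):
    """Split a pasted result into (5' flank, central sequence, 3' flank)."""
    seq = "".join(result_text.split())  # remove line breaks and spaces
    match = re.search(r"[A-Z]+", seq)   # first block of capital letters
    if match is None:
        raise ValueError("no capital-letter sequence found")
    # Trim the runs of n that separate the three sequences.
    five_prime = seq[:match.start()].rstrip("n")
    three_prime = seq[match.end():].lstrip("n")
    return five_prime, match.group(), three_prime


# Toy input; a real result is much longer.
toy_result = """acgtacgt
nnnnnnnn
GATTACAGATTACA
nnnnnnnn
ttgacctg
"""
five, central, three = split_assembly(toy_result)
print(central)
```

The central sequence returned here is the one to assemble with the original query in step 9.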

13.4 Searching the raw shotgun sequence data

The draft genome sequence consists of 2106 scaffolds that are at least 1 kb in length and contains many gaps filled with Ns. A gap within a scaffold usually arises because there is a very similar sequence elsewhere in the genome. Such gaps can sometimes be recovered by locally assembling the original sequence reads (http://moss.nibb.ac.jp/cgi-bin/blast-assemble).

When you find a small difference between your clone and the draft genome, it may indicate that there is another copy of the gene rather than a PCR error. To look for independent evidence, you may search the JGI_raw dataset in PHYSCObase. The raw sequence data from the whole genome shotgun sequencing done at the DOE Joint Genome Institute were incorporated from the FTP site of the NCBI Trace Database (ftp://ftp.ncbi.nih.gov/pub/TraceDB/physcomitrella_patens/) and are available for blastn and tblastn searches in PHYSCObase. Since this dataset contains redundant data, you may find multiple sequences from one locus. For a single-copy gene, the expected number of complete or nearly complete hits is about 8, reflecting the roughly 8-fold redundancy of the data. If you see more, the sequence may be present in multiple copies in the genome.

An interface to make local contigs is available so that you can distinguish hits from different loci.
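The reasoning about hit counts can be turned into a rough estimate. The Python sketch below divides the number of complete or nearly complete hits by the expected redundancy of about 8 quoted in the text; the function name is an assumption, and the result is only an order-of-magnitude guide because read coverage varies from locus to locus.

```python
def estimated_copies(n_hits, redundancy=8.0):
    """Rough copy-number estimate from the number of raw-read hits."""
    if redundancy <= 0:
        raise ValueError("redundancy must be positive")
    return n_hits / redundancy


print(estimated_copies(8))   # about one copy
print(estimated_copies(24))  # suggests roughly three copies
```

Local contigs built from the hits remain the better evidence, since they show whether the reads actually fall into distinct loci.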
