13. How to use genome information

Tomoaki Nishiyama, Kari Thompson, Tomomichi Fujita, Yuji Hiwatashi,  and Mitsuyasu Hasebe

 

The draft genome sequence of Physcomitrella patens (Rensing et al. 2008) serves as a foundation for molecular biological analysis in P. patens.  The primary sequence is available at the JGI genome portal and from the DDBJ/EMBL/GenBank international sequence databases.

        The gene annotation, especially the exon-intron boundaries, is not predicted very accurately.  We have a large number of ESTs from full-length cDNA clones from various stages of the life cycle.  This information is available through PHYSCObase.

 

Public genome information sites

 

        PHYSCObase      http://moss.nibb.ac.jp

        cosmoss  http://www.cosmoss.org

        JGI      http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html

        NCBI    http://www.ncbi.nlm.nih.gov/

 

13.1 Searching homologues of your protein coding gene

 

You need a query sequence, preferably an amino acid sequence.

Connect with a browser to http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html

Select blast tab

Change the program to tblastn

Select the target dataset as Physcomitrella_patens.1_1

Paste your amino acid sequence

Press “Submit job” button.

Wait a few minutes until the search is complete

Press “Refresh Display” button.

Now you get a graphical hit map

Click one of the arrows if any hit was found

Text box: In this view the region hit by the query is shown as the yellow bar, which corresponds to the conserved domain of the gene. The conserved region is also apparent from the Vista Poptr1_1 peaks and the blastx bars (the blue bars at the bottom of the window).  Two gene models are predicted in this region: e_gw1.227.43.1 and estExt_fgenesh1_pg.C_2270053.  There are also three regions with EST matches, shown in green.  From this view the relationship between the ESTs and the gene models is unclear.

View on browser

The hit will be displayed at an upper track.

Zoom appropriately if you find any gene model or EST close to the hit region.

 

 

Text box: When you click the green bar of ESTs, the individual ESTs are shown as arrows. Though at this moment the connection between the left and right ESTs is unclear, the left ESTs suggest that e_gw1.227.43.1 does not match the ESTs well, and the right ESTs do not match estExt_fgenesh1_pg.C_2270053.  The similar numbers of left and right ESTs in opposite directions imply that they may be the two end sequences of the same clones.  That can be checked by a search in PHYSCObase.
You can see the gene models and ESTs mapped to the genome, if there are any.

The ESTs are shown condensed as green bars by default and can be expanded by clicking on the green bar.

 

Note that the gene model may not be correct and that not all the ESTs currently available are mapped.

You can retrieve the genomic sequence of the region shown in the browser by clicking DNA in the upper right box.  The DNA sequence can be used to search PHYSCObase. Since an EST sequence may not reach the conserved domain of the full-length cDNA, you may find in this way a good cDNA clone that you would miss if you searched the EST dataset in PHYSCObase directly with your query.

 

 

 

13.2 Searching EST clones at PHYSCObase

The most common way to use PHYSCObase is to search ESTs and full-length cDNA clones with the BLAST program.  A full-length cDNA clone is useful for building ectopic expression or overexpression constructs and for producing recombinant proteins in bacterial or other expression systems.  The EST data are also useful for understanding gene structure, such as transcriptional start sites, intron-exon junctions, and polyadenylation sites.  For genes with sufficient expression levels, you may also find alternative transcripts.

You can either search contigs or individual clones with a BLASTN, TBLASTN, or TBLASTX search. There are two datasets for EST data:

PHYSCObase contigs dataset contains all the assembled sequences. This dataset has reduced redundancy but should contain all the sequence information. Since a longer match can be detected with contig sequence, this dataset should have higher sensitivity. This is the default dataset.

PHYSCObase clones dataset contains sequence data from every clone and the GenBank entries. Use this dataset when you are looking for a clone with some minute difference that may be hidden by the assembly process. Searching this dataset takes more time.

 

Both datasets contain nucleotide sequences, and three variations of the BLAST program can be used, namely BLASTN, TBLASTN, and TBLASTX. If you are searching for protein coding sequences, TBLASTN is recommended; in this case, you enter an amino acid sequence as the query. Use a BLASTN search when you want to know whether some genomic or cDNA sequence has corresponding ESTs. BLASTN is also useful when you have sequenced 3'-RACE products and want to know whether full-length clones are available.
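The choice between programs described above can be sketched in Python. This is only an illustration: the function names and the 90% letter-composition heuristic for telling a nucleotide query from an amino acid query are assumptions made here, not part of PHYSCObase, and both example sequences are made up.

```python
# Letters that are expected in a nucleotide query (assumption: IUPAC
# ambiguity codes other than N are rare enough to ignore here).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def looks_like_nucleotide(sequence, threshold=0.9):
    """Guess whether a query is nucleotide using a simple letter count."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty query sequence")
    fraction = sum(c in NUCLEOTIDE_LETTERS for c in letters) / len(letters)
    return fraction >= threshold


def recommend_program(sequence):
    """Return the program suggested in the text for this kind of query."""
    if looks_like_nucleotide(sequence):
        return "BLASTN"   # nucleotide query against the nucleotide datasets
    return "TBLASTN"      # amino acid query against the nucleotide datasets


print(recommend_program("MDLPPGFRFHPTDEELIT"))     # made-up peptide
print(recommend_program("ATGGATCTTCCACCAGGTTTT"))  # made-up cDNA fragment
```

TBLASTX also takes a nucleotide query, so a nucleotide result here only means that an amino acid program such as TBLASTN is not applicable to that query as it stands.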

 

13.2.1 Interpretation of the BLAST results

 

After the BLAST search, you may find some contigs, clones, and GenBank entries. A contig has the form ContigNNNNNN, where N is a digit. Sequences taken from GenBank are labeled gb:[accession no.]. The others are individual EST sequences:  a 3’-end sequence has the prefix r, and the others are 5’-end sequences.  The distinctions among the libraries are explained later in 13.2.3 and 13.2.4.

 

An example TBLASTN result can be seen in cuc2-BLAST.html, for which the CUC2 amino acid sequence was used as the query. The description list of the hits is as follows.

 

 

                                                        Score     E
Sequences producing significant alignments:             (bits)   Value

gnl|contig|Contig3017   Contig3017                        162    1e-40
gnl|contig|pphn33o16    pphn33o16                         128    2e-30
gnl|contig|pphb14p21    pphb14p21                          57    5e-09
gnl|contig|Contig3604   Contig3604                         46    1e-05
gnl|contig|pphb2h16     pphb2h16                           30    1.1
gnl|contig|Contig3526   Contig3526                         28    2.4
gnl|contig|Contig12137  Contig12137                        27    5.3
gnl|contig|Contig7880   Contig7880                         27    6.9
gnl|contig|pphn35f03    pphn35f03                          27    9.0
gnl|contig|Contig3272   Contig3272                         27    9.0

 

 

Two contigs and two 5'-end sequences hit with E values less than 1e-3, and four contigs and two 5'-end sequences hit with E values above 1. Similarities with E values above 1 usually arise just by chance. If a strong similarity is observed in a short region, however, the match may be meaningful despite the large E value. Such information can be read from the alignments, which are shown after the list of significant matches. You can jump to the corresponding alignment by clicking the score, which is to the left of the E value and is usually displayed in blue and underlined. In this case, I consider only the four sequences with E values less than 1e-3 to have significant similarity.
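The E value cutoff used above can be applied to a saved hit list with a few lines of Python. The sketch below is a hedged illustration: it assumes the description lines have been copied into a string in the four-column layout of the example, and the function names and the default cutoff argument are chosen here for illustration only.

```python
# Description lines copied from the example TBLASTN result above.
HIT_LIST = """\
gnl|contig|Contig3017   Contig3017   162  1e-40
gnl|contig|pphn33o16    pphn33o16    128  2e-30
gnl|contig|pphb14p21    pphb14p21     57  5e-09
gnl|contig|Contig3604   Contig3604    46  1e-05
gnl|contig|pphb2h16     pphb2h16      30  1.1
gnl|contig|Contig3526   Contig3526    28  2.4
gnl|contig|Contig12137  Contig12137   27  5.3
gnl|contig|Contig7880   Contig7880    27  6.9
gnl|contig|pphn35f03    pphn35f03     27  9.0
gnl|contig|Contig3272   Contig3272    27  9.0
"""


def parse_hits(text):
    """Return (name, bit score, E value) for each four-column line."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip headers and blank lines
        _, name, score, evalue = fields
        hits.append((name, float(score), float(evalue)))
    return hits


def significant_hits(hits, max_evalue=1e-3):
    """Keep only hits whose E value is below the cutoff."""
    return [hit for hit in hits if hit[2] < max_evalue]


for name, score, evalue in significant_hits(parse_hits(HIT_LIST)):
    print(name, score, evalue)
```

A fixed cutoff cannot recognize a short but strong match with a large E value, so the alignments still need to be inspected by eye.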

 

13.2.2 Finding the information associated to the sequence

 

Once you find contigs or clones, you will want to see which clones the contig contains and the sequences of those clones from the other end. To do this, enter the contig number or clone name in the search box on the top page http://moss.nibb.ac.jp/.

 

This searches a table that contains the clone name, putative transcript id, 5' contig number, and 3' contig number. If your query was a clone name, you will get just one line containing the clone name, the putative transcript id, the identification number of the contig to which the 5' sequence of the clone belongs (preceded by contig1), and the identification number of the contig to which the 3' sequence of the clone belongs (preceded by contig2).

 

A putative transcript id has the form Pnnnnnn, where n represents a digit, and identifies a pair of contigs or a single contig. When the 5' and 3' sequences of a clone belong to different contigs, the two contigs are considered to represent the 5' and 3' end sequences of a single transcript. When both end sequences are contained in a single contig, this contig itself likely constitutes a putatively complete transcript. Inconsistency among clones necessitated a somewhat more complex rule. When one clone ties contigA (5') and contigB (3') but another clone ties contigC (5') and contigB (3'), we treated them as different putative transcripts. When yet another clone has both sequences in contigB, we cannot specify to which transcript it belongs and so assign a new putative transcript id.
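The grouping rule described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the procedure actually used to build PHYSCObase: it handles only clones for which both end contigs are known, and the labels T1, T2, ... and all clone and contig names are hypothetical stand-ins rather than real Pnnnnnn ids.

```python
def assign_putative_transcripts(clones):
    """Give each distinct (5' contig, 3' contig) pair its own label.

    clones: iterable of (clone_name, contig_5prime, contig_3prime).
    Returns a dict mapping clone name to an illustrative label.
    """
    labels = {}      # (5' contig, 3' contig) -> label
    assignment = {}  # clone name -> label
    for name, contig5, contig3 in clones:
        pair = (contig5, contig3)
        if pair not in labels:
            labels[pair] = "T%d" % (len(labels) + 1)
        assignment[name] = labels[pair]
    return assignment


example = [
    ("clone1", "contigA", "contigB"),  # ties contigA (5') and contigB (3')
    ("clone2", "contigC", "contigB"),  # different 5' contig, same 3' contig
    ("clone3", "contigB", "contigB"),  # both ends in contigB
    ("clone4", "contigA", "contigB"),  # same pair as clone1
]
result = assign_putative_transcripts(example)
for clone, label in sorted(result.items()):
    print(clone, label)
```

In this example clone1 and clone4 share a label, while clone2 and clone3 each receive their own, mirroring the three cases in the text.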

 

For example, enter Contig3017 to search for clones that have either end sequence in Contig3017. This returns a list like the following. Only the first two lines are shown here, but you can see the complete page in another window.

 

pphb13d01       P007036 contig1 003017   

pphb14e10       P007036 contig1 003017    contig2 003018   

 

 

The putative transcript ID is linked to the putative transcript information page. A putative transcript page begins with links to the BLASTX results for the conceptual putative transcript, the 5' contig, and the 3' contig, followed by links to the 5' and 3' contig information pages. The BLASTX result can tell you whether your query is among the strong hits in the nr dataset. In the case of P007036, the top hits are NAC proteins from Arabidopsis, which supports the idea that P007036 represents a member of the NAC family. The result looks just like the original NCBI BLAST output, but the taxon from which each gene was isolated is shown to the right of the E value. The taxon name is colored according to its phylogenetic position. Since the BLAST results are precalculated and stored, viewing them is much faster than performing an actual BLAST search. On the other hand, if you want results against the latest database, retrieve the sequences of both contigs and perform the BLAST search elsewhere, for example at NCBI.

 

Then a list of clones, produced in our EST project, belonging to the putative transcript follows. The list of clones contains the clone name, link to their sequences (Seq.), and description of the best hit sequence in a BLASTX search against the nr dataset. Note: The complete BLASTX results by individual clone sequences against the nr dataset are currently unavailable.

 

Finally, the Alignment section contains a brief overview of the relative positions of the clones belonging to both contigs. In the Alignment section, clones that have both end sequences in the contigs defining the putative transcript are shown first, from the longest clone to the shortest. The relative lengths are estimated from the start point in the 5' contig and the end point in the 3' contig. Clones with only one end belonging to the contigs are shown after them.

 

The clone names are linked to a search program, so that you can find the putative transcript to which a clone belongs when only one end of the clone is in one of the contigs. 5' sequences are colored blue and 3' sequences are colored red. Dead clones, which did not grow on replica plates, are black, and poorly growing clones are gray. GenBank entries are shown in green and are linked to the entry.

 

13.2.3 Names of each clone, EST, and contig

Each clone, EST, contig, and putative transcript has an identifier assigned to it. 

 

Individual clones are usually named as pp + library code (one or two characters) + plate number + row (a-p) + column number (01-24).

The sequence from the 5’ end carries the name of the clone itself, and the sequence from the 3’ end has the prefix r before the clone name.
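The naming rule can be expressed as a regular expression. The following Python sketch is an illustration only: the function name and the returned field names are hypothetical, the library codes are the ones listed in this section, rows a through p are accepted because clone names such as pphb14p21 occur in the example BLAST result, and MS clones are not handled because their layout is not described here.

```python
import re

# Library codes taken from the identifier list in this section.
LIBRARIES = {
    "h": "non-treated",
    "hn": "auxin (NAA)-treated",
    "hb": "cytokinin-treated",
    "hf": "first protoplast division",
    "sp": "sporophyte",
    "ls": "leafy shoot",
    "aa": "antheridia and archegonia",
    "gs": "green sporophyte",
}

# Optional r prefix, pp, library code, plate number, row, two-digit column.
# Two-letter codes come first so that "hn" is not read as "h".
CLONE_RE = re.compile(r"^(r?)pp(hn|hb|hf|sp|ls|aa|gs|h)(\d+)([a-p])(\d{2})$")


def parse_clone_name(name):
    """Split an EST or clone name into its parts; return None if no match."""
    match = CLONE_RE.match(name)
    if match is None:
        return None
    prefix, code, plate, row, column = match.groups()
    return {
        "clone": "pp" + code + plate + row + column,
        "end": "3'" if prefix else "5'",
        "library": LIBRARIES[code],
        "plate": int(plate),
        "row": row,
        "column": int(column),
    }


print(parse_clone_name("rpphn33o16"))
print(parse_clone_name("pphb2h16"))
print(parse_clone_name("Contig3017"))
```

Names that do not follow the pp pattern, such as contigs, putative transcripts, and gb: entries, return None and need to be handled separately.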

 

 

The identifiers have the following forms:

ContigXXXXXX        Contig. This number is subject to change.
PXXXXXX             Putative transcript represented by a single contig or a pair of contigs. This number will be conserved as far as possible.
gb:[accession no.]  Sequence from a public DNA database (DDBJ/GenBank/EMBL).

MSXXX               Clone from the non-treated library, or its 5' sequence.
pphXXXXX            Clone from the non-treated library, or its 5' sequence.
rpphXXXXX           3' sequence of clone pphXXXXX.
pphnXXXXX           Clone from the auxin (NAA)-treated library, or its 5' sequence.
rpphnXXXXX          3' sequence of clone pphnXXXXX.
pphbXXXXX           Clone from the cytokinin-treated library, or its 5' sequence.
rpphbXXXXX          3' sequence of clone pphbXXXXX.
pphfXXXXX           Clone from the first protoplast division library (protoplasts during the first cell division), or its 5' sequence.
rpphfXXXXX          3' sequence of clone pphfXXXXX.
ppspXXXXX           Clone from the sporophyte library (sporophytes before meiosis with surrounding archegonia), or its 5' sequence.
rppspXXXXX          3' sequence of clone ppspXXXXX.
pplsXXXXX           Clone from the leafy shoot library (upper halves of gametophores), or its 5' sequence.
rpplsXXXXX          3' sequence of clone pplsXXXXX.
ppaaXXXXX           Clone from the antheridia and archegonia library (gametophore tips including the antheridia and archegonia), or its 5' sequence.
rppaaXXXXX          3' sequence of clone ppaaXXXXX.
ppgsXXXXX           Clone from the green sporophyte library (sporophytes during meiosis), or its 5' sequence.
rppgsXXXXX          3' sequence of clone ppgsXXXXX.

XXXXXX              A string beginning with a digit.

 

13.2.4 EST data

EST analyses of the following eight full-length cDNA libraries were performed and all ESTs are deposited in public DNA databases.

(1) MS + pph: Untreated protonemata library (9,944 5' ESTs; 9,352 3' ESTs)

(2) pphn: Auxin-treated library (16,733 5' ESTs; 16,763 3' ESTs)

(3) pphb: Cytokinin-treated library (16,450 5' ESTs; 15,000 3' ESTs)

(4) pphf: Library for protoplasts during the first cell division (10,535 5' ESTs; 10,975 3' ESTs)

(5) ppsp: Library for sporophytes before meiosis with surrounding archegonia (8,514 5' ESTs; 8,241 3' ESTs)

(6) ppls: Upper halves of gametophores (52,838 5' ESTs; 4,500 3' ESTs)

(7) ppaa: Gametophore tips including the antheridia and archegonia (14,443 5' ESTs; 14,925 3' ESTs)

(8) ppgs: Sporophytes during meiosis (14,842 5' ESTs; 15,076 3' ESTs)

 

Together with 18,364 other mRNA sequences deposited in GenBank as of Nov 9, 2006, a total of 258,332 ESTs are available in public DNA databases.  These ESTs were assembled into 32,128 sequences (20,427 contigs and 11,701 singlets).  The public EST data have grown to 382,584 sequences, including 125,160 Villersexel sequences.

 

All clones corresponding to the ESTs in the eight full-length cDNA libraries are distributed from RIKEN Bio Resource Center (http://www.brc.riken.jp/lab/epd/Eng/index.shtml). Available clones are searchable in PHYSCObase (http://moss.nibb.ac.jp).

 

 

13.2.5 Detail of the Full-length cDNA libraries

 

The individual clones from the following libraries are distributed from RIKEN Bio Resource Center (http://www.brc.riken.jp/inf/en/).

 

(1) Untreated protonemata library:

Physcomitrella patens (Hedw.) Bruch & Schimp subsp. patens collected in Gransden Wood, Huntingdonshire, UK, was used as the wild-type strain. Protonemata were homogenized with a Polytron (Kinematica, Littau, Switzerland), and inoculated in BCDATG medium at 25°C under continuous light, and the tissues were harvested at the 13th and 14th days. The collected tissue contained protonemata and young gametophores with two to five leaves. Full-length cDNA was recovered by using the biotinylated CAP trapper method and the single-strand linker ligation method was used in the construction of the cDNA libraries. Clones originating from this library are designated as “pph” clones. >90% of clones should have a complete open reading frame. Full-length cDNAs were cloned into the vector in Fig. 1, which has a pBluescriptII backbone with Amp resistance.

 

(2) Auxin-treated library:

Protonemata were homogenized with a Polytron and inoculated into BCD medium that contained 1.0 mM CaCl2 and 1.0 µM NAA (naphthalene acetic acid; Sigma-Aldrich, St. Louis, MO) at 25°C under continuous light, and the tissues were harvested between the 8th and 11th days. The collected NAA-treated tissue contained chloronemata, caulonemata, and rhizoid-like protonemata. Full-length cDNA was recovered by using the biotinylated CAP trapper method and the single-strand linker ligation method was used in the construction of the cDNA libraries. One round of normalization was performed. Clones originating from this library are designated as “pphn” clones. >90% of clones should have a complete open reading frame.  Full-length cDNAs were cloned into the vector in Fig. 1, which has a pBluescriptII backbone with Amp resistance.

 

(3) Cytokinin-treated library:

Protonemata were homogenized with a Polytron and inoculated into BCD medium that contained 1.0 mM CaCl2 and 0.50 µM BA (6-benzylaminopurine; Sigma-Aldrich) at 25°C under continuous light, and the tissues were harvested between the 8th and 13th days. The collected BA-treated tissue contained chloronemata, caulonemata, and malformed buds. Full-length cDNA was recovered by using the biotinylated CAP trapper method and the single-strand linker ligation method was used in the construction of the cDNA libraries. One round of normalization was performed. Clones originating from this library are designated as “pphb” clones. >90% of clones should have a complete open reading frame.  Full-length cDNAs were cloned into the vector in Fig. 1, which has a pBluescriptII backbone with Amp resistance.

 

 

Fig. 1 A vector used for cap-trapper full-length cDNA libraries.

 

 (4) Library for protoplasts at a stage of the first cell division:

Protonemata were subcultured into BCDATG medium about every 5 days, and protoplasts were prepared. Isolated protoplasts were incubated at 25°C under continuous light for 2-3 days, by which time the number of cells undergoing the first cell division, which is asymmetric, or bearing protrusions had increased. Full-length cDNA was recovered by the Vector-Capping method (Kato et al. (2005) DNA Res. 12:53-62). Clones originating from this library are designated as “pphf” clones. Full-length cDNAs were cloned into the pGCAPzf3 vector (Fig. 2).

 

 

5’....GCCAGGGTTTTCCCAGTCACGACGTTGTAAAACGACGGCCAGTGAA

                          M13 Forward Primer

 

                                                T7 Transcription Start

                                    

AATTTGAATTGTAATACGACTCACTATAGGGCGAATTGGCGGCCAAATCGGCC

                                   T7 promoter                             SfiI

 

 

GAATT(         cDNA            )GGCCATAAGGGCCAGCTTGAG

                                                 SfiI

 

SP6 Transcription Start

 


TATTCTATAGTGTCACCTAAATAGCTTGGCGTAATCATGGTCATAGCTGTTTC

              SP6 promoter                                                                     M13 Reverse Primer

 

CTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATACGAGC....3’

 

Fig. 2  pGCAPzf3 Vector promoter and cloning site sequence

 

 

(5) Library for sporophytes before meiosis with surrounding archegonia:

Mosses were grown on Jiffy 7 for 4-6 weeks at 25°C (24L) followed by 3-4 weeks at 15°C (8L16D).  Sporophytes before meiosis, together with surrounding archegonia, were collected using stereomicroscopy.  Full-length cDNA was recovered using the oligo-capping method (Maruyama and Sugano 1994. Gene 138: 171-174). Clones originating from this library are designated as “ppsp” clones. Full-length cDNAs were cloned into the DraIII sites of pME18S-FL3 vector (AB009864) as shown in Fig. 3.

Fig. 3  pME18S-FL3 vector

Note: this vector consists of a promoter, an intron donor, an acceptor, and a polyadenylation signal from Simian virus 40 (SV40), and a promoter from Human T-lymphotropic virus (HTLV), in addition to the pUC18 backbone.  The stuffer sequence, which should be absent from a good clone, is tetR from Escherichia coli.

 

(6) ppls: Upper halves of gametophores

 

Gametophores were cultivated on sterile peat pellets (Jiffy-7; Jiffy Products International AS, Kristiansand, Norway) for 1 to 1.5 months at 25°C under continuous light.

The cDNA library was prepared by the oligo-capping method (Maruyama and Sugano 1994. Gene 138: 171-174). The cDNAs were cloned into the DraIII sites of pME18S-FL3 vector (AB009864) as shown in Fig. 3.

 

(7) ppaa: Gametophore tips including the antheridia and archegonia (unpublished)

 

Gametophores cultivated under the same conditions and for the same period as for the ppls library were moved to 15°C under 8-h light/16-h dark conditions to induce gametangia and cultivated for 11 to 26 days. Archegonia that were brown in color, which indicates fertilization, were discarded.

 

The cDNA library was prepared by the oligo-capping method (Maruyama and Sugano 1994. Gene 138: 171-174). The cDNAs were cloned into the DraIII sites of pME18S-FL3 vector (AB009864) as shown in Fig. 3.

 

 

(8) ppgs: Sporophytes during meiosis (unpublished)

 

The gametophores harbouring gametangia were further cultivated at 15°C under the short day conditions for 31 to 53 days to collect sporophyte tissue during meiosis. Archegonial tissue was removed and only sporophytic tissue was collected.

 

The cDNA library was prepared by the oligo-capping method (Maruyama and Sugano 1994. Gene 138: 171-174). The cDNAs were cloned into the DraIII sites of pME18S-FL3 vector (AB009864) as shown in Fig. 3.

 

 

Reference

T. Nishiyama, T. Fujita, T. Shin-I, M. Seki, H. Nishide, I. Uchiyama, A. Kamiya, P. Carninci, Y. Hayashizaki, K. Shinozaki, Y. Kohara, M. Hasebe, Comparative genomics of the Physcomitrella gametophytic transcriptome and Arabidopsis genome: implication for the land plant evolution. PNAS 100, 8007-8012 (2003)

 

T. Fujita, T. Nishiyama, Y. Hiwatashi, and M. Hasebe, Gene tagging, Gene- and Enhancer-trapping, and Full-length cDNA Overexpression in Physcomitrella patens.  In New Frontiers in Bryology: Physiology, Molecular Biology, and Functional Genomics. A. J. Wood, M. J. Oliver, and D. J. Cove eds. Kluwer Academic Publishers, Dordrecht, Netherlands, pp. 111-132 (2004)

 


 

13.3 Retrieving flanking Genomic Sequences for Gene Targeting

Kari Thompson, Tomoaki Nishiyama, and Yuji Hiwatashi

 

Introduction

 

To construct expression and knockout moss lines, you must first obtain DNA fragments located in the 5' and 3' regions of the targeted coding sequence. The DNA fragments are usually between 1 and 2 kb long.

 

There are two websites you can use to get these sequences:

JGI:  http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html

PHYSCObase:  http://moss.nibb.ac.jp/

                      http://moss.nibb.ac.jp/cgi-bin/blast-assemble

 

1.) Using JGI to retrieve genomic sequence:

 

Procedure:

 

1.)   Click on the BLAST tab.

2.)   Choose the alignment program blastn: blast nucleotide vs. nucleotide.

3.)   Leave the defaults for expect and word size; they are fine to begin with.

4.)   Choose the database Physcomitrella_patens.1_1. 

5.)   Paste the genomic or cDNA sequence of your gene of interest into the box for query sequence. 

6.)   You may enter your e-mail address if you would like the result e-mailed to you, but it is not necessary.

7.)   Click Submit job.

8.)   Please wait while the server searches for sequences that show similarity.  This may take a few minutes depending on queries ahead of yours.

9.)   The output will show you scaffolds that have similarity to your sequence in a graphical format.  Scaffolds that are red show a high similarity score of at least 200 bits (corresponding to 100-bp complete match) to your sequence. The first scaffold likely contains your gene.  Make a note of the scaffold number.

10.) Mouse over the graph and click on the red line for the scaffold. 

11.) At the top of the screen you will see a table.  This contains your query length, the scaffold number, and where your query sequence lies in the scaffold. 

 

For example:

This means that your query length was 564 bp and it is located in scaffold 41 between 556327 and 556890 bp.  It is best to make a note of the location of your cDNA sequence; it will be useful in the following steps.  You can click on others to view other scaffolds that contain your gene. 

 

Beneath the table you can see a graph that shows the similarity between your sequence and the scaffold. 

 

For example:

 

 

      Below this you can see the alignment of your query and the scaffold. If you click on seq next to scaffold seq, you can retrieve the scaffold sequence that corresponds to your query.

 

12.)  Click on the scaffold name found in the first table (shown in step 11).  In this example you would click on scaffold_41.

13.)  In this table you can see all of the contigs that were used to assemble the scaffold. 

14.)  You can click on get sequence to retrieve the entire genomic sequence of the scaffold, but it is easier to retrieve just the region of interest by entering appropriate numbers in the Start and End boxes.  If you want 2 kb flanking regions on each side of the hit region (556327 - 556890), the Start is calculated as 556327 - 2000 = 554327, and the End is calculated as 556890 + 2000 = 558890.  Type 554327 in Start and 558890 in End, and click get sequence.  If the region of interest contains many Ns, you may need to obtain a reassembly of the region based on the individual sequence reads at PHYSCObase to make a good judgment of the genomic sequence.
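The Start and End arithmetic in step 14 can be written as a small Python helper. This is a minimal sketch: the function name and the optional scaffold_length argument are assumptions added here so that the window is not extended past either end of the scaffold.

```python
def flanking_window(hit_start, hit_end, flank=2000, scaffold_length=None):
    """Return (Start, End) covering the hit plus `flank` bp on each side.

    Coordinates are 1-based and inclusive, as in the worked example.
    """
    start = max(1, hit_start - flank)  # do not go below position 1
    end = hit_end + flank
    if scaffold_length is not None:
        end = min(end, scaffold_length)  # do not run off the scaffold
    return start, end


# Hit region from the worked example on scaffold_41.
print(flanking_window(556327, 556890))  # (554327, 558890)
```

When the hit lies within 2 kb of a scaffold end, the returned window is shorter than requested on that side, and the missing flank has to be sought elsewhere, for example by local reassembly at PHYSCObase.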

 

2.) Using PHYSCObase to retrieve genomic sequence. 

 

PHYSCObase:  http://moss.nibb.ac.jp/

 

Blast Assemble Data Submission:  http://moss.nibb.ac.jp/cgi-bin/blast-assemble

 

If you cannot find a genomic DNA sequence corresponding to your gene by a BLAST search of the JGI database, try to identify the sequence using “Blast Assemble Data Submission”.

 

Procedure:

 

1.)   Click on the link, or use this address: http://moss.nibb.ac.jp/cgi-bin/blast-assemble.  Alternatively, if you start at the PHYSCObase homepage, click “DNA database” and then click “BLAST raw WGS sequence and assemble into contig”.

2.)   You may enter the name of your gene in the box for sequence name if you wish, but it is not necessary.

3.)   Paste the genomic or cDNA sequence (fasta format) of your gene of interest into the box for sequence; this is your query. 

4.)   Choose “nucleotide strict” and “Physcomitrella patens”.

5.)   Click Construct Contigs.

6.)   Please wait while the server searches for sequences that show similarity.  This may take about 10 minutes depending on queries ahead of yours.

7.)   After a few minutes, click get the highest score scaffold to retrieve a genomic DNA sequence corresponding to your gene.

8.)   Usually you can see three sequences in the result. The first and third sequences are shown in lowercase letters above and below the ‘nnnn…nnnn’ lines, respectively.  These represent possible, but not guaranteed, extreme 5’ and 3’ genomic sequences of your gene.  The second sequence is shown in capital letters between the ‘nnnn…nnnn’ lines. The second sequence should overlap the sequence that you used as the query and is usually sufficient for designing primers to make constructs.

9.)   Copy and paste the second sequence into a new file and use it to construct a contig with your original query sequence.
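Picking out the second sequence in steps 8 and 9 can also be done programmatically. The Python sketch below is hedged: it assumes only what the text states, namely that the second sequence is the part written in capital letters, so it returns the longest uninterrupted run of capitals after removing line breaks and any FASTA header lines. The function name and the toy result text are made up for illustration, and the real output layout may differ.

```python
import re


def extract_capital_sequence(result_text):
    """Return the longest run of capital letters in the pasted result."""
    # Drop FASTA header lines, which may themselves contain capitals.
    lines = [ln for ln in result_text.splitlines() if not ln.startswith(">")]
    compact = re.sub(r"\s+", "", "".join(lines))
    runs = re.findall(r"[A-Z]+", compact)
    return max(runs, key=len) if runs else ""


# Toy result: lowercase flank, n spacer, capital core, n spacer, flank.
toy_result = """\
acgtacgtac
nnnnnnnnnn
ACGTTGCAAC
GGCCTTAA
nnnnnnnnnn
ttgaccgtaa
"""
print(extract_capital_sequence(toy_result))
```

The extracted sequence should still be checked against the original query, as in step 9, before primers are designed.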

 


 

13.4 Searching the raw shotgun sequence data

The draft genome sequence consists of 2106 scaffolds that are at least 1 kb in length and contains many gaps filled with Ns.  A gap within a scaffold usually arises because a very similar sequence exists elsewhere in the genome.  Such gaps can sometimes be recovered by locally assembling the original sequence reads (http://moss.nibb.ac.jp/cgi-bin/blast-assemble).

        When you find a small difference between your clone and the draft genome, it may indicate that there is another copy of the gene rather than a PCR error.  To seek independent evidence, you may search the JGI_raw dataset in PHYSCObase.  The raw sequence data from the whole genome shotgun sequencing done at the DOE Joint Genome Institute were incorporated from the FTP site of the NCBI Trace Database (ftp://ftp.ncbi.nih.gov/pub/TraceDB/physcomitrella_patens/) and are available for blastn and tblastn searches in PHYSCObase.  Since this dataset contains redundant data, you may find multiple sequences from one locus.  For a single copy gene the expected number of complete or nearly complete hits is about 8, corresponding to the roughly 8-fold redundancy of the data.  If you see more, the sequence may be present in multiple copies in the genome.

An interface to make local contigs is available so that you can distinguish hits from different loci.
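The redundancy argument above amounts to dividing the number of complete or nearly complete raw-read hits by the expected number per locus. The Python sketch below is a rough illustration: the function name is hypothetical, the default of 8 hits per single-copy locus is taken from the text, and read sampling varies from locus to locus, so the result is an indication rather than a measurement.

```python
def estimate_copy_number(n_hits, hits_per_copy=8.0):
    """Rough copy-number estimate from the count of raw-read hits."""
    if hits_per_copy <= 0:
        raise ValueError("hits_per_copy must be positive")
    return n_hits / hits_per_copy


print(estimate_copy_number(8))   # about one copy
print(estimate_copy_number(24))  # suggests about three copies
```

Building local contigs from the hits remains the more direct way to tell whether they come from one locus or from several.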