Question

Forum: How to map a de novo assembly to a reference genome?

2

9.2 years ago

Farbod ★ 3.4k

Dear friends, Hi

I want to map my de novo transcriptome assembly to a reference genome using BLAT or GMAP, and then look at the distribution of intron lengths that can be inferred from those alignments.

The background is that Trinity needs a --genome_guided_max_intron parameter for its genome-guided mode, and its manual suggests: "use a maximum intron length that makes most sense given your targeted organism".

So I need your help with the script(s) for mapping a de novo assembly to a genome: must I index the genome? Must I install BLAT the same way as a local NCBI BLAST?

Thank you in advance

Assembly alignment RNA-Seq • 5.9k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 9.2 years ago by Farbod ★ 3.4k

3

Hi. As you said, use GMAP to map the transcriptome to your reference genome. You will obtain a GFF3 file that gives the exon locations of each transcript within the scaffold/contig. Use that information to compute the intron lengths.

The other option is to use a tool called "GAG": you provide the genome (FASTA file) and its GFF3 file (which you can obtain from GMAP), and it reports summary statistics of genome features, including minimum, maximum, and mean intron length.
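As a rough illustration of the first option, here is a minimal Python sketch of how intron lengths could be derived from exon features in a GFF3 file. It assumes the GFF3 has `exon` rows carrying a `Parent` attribute and that consecutive exons of the same parent are separated by introns; the function name and the example rows are made up for illustration, so check them against the GFF3 your GMAP run actually produces.

```python
from collections import defaultdict


def intron_lengths_from_gff3(lines):
    """Return intron lengths implied by the exon features in GFF3 lines.

    Assumption: each exon row has a Parent=... attribute naming its transcript.
    """
    exons = defaultdict(list)
    for line in lines:
        # Skip blank lines and GFF3 comments/directives.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            item.strip().split("=", 1)
            for item in fields[8].split(";")
            if "=" in item
        )
        parent = attrs.get("Parent")
        if parent is None:
            continue
        # Group exon coordinates by (sequence, transcript).
        exons[(fields[0], parent)].append((int(fields[3]), int(fields[4])))

    lengths = []
    for coords in exons.values():
        coords.sort()
        # An intron is the gap between one exon's end and the next exon's start.
        for (_, prev_end), (next_start, _) in zip(coords, coords[1:]):
            gap = next_start - prev_end - 1
            if gap > 0:
                lengths.append(gap)
    return lengths


# Hypothetical rows, only to show the expected input shape.
example = [
    "##gff-version 3",
    "scaffold1\tGMAP\texon\t100\t200\t.\t+\t.\tID=t1.exon1;Parent=t1.mrna1",
    "scaffold1\tGMAP\texon\t501\t600\t.\t+\t.\tID=t1.exon2;Parent=t1.mrna1",
    "scaffold1\tGMAP\texon\t1601\t1700\t.\t+\t.\tID=t1.exon3;Parent=t1.mrna1",
]
print(intron_lengths_from_gff3(example))  # [300, 1000]
```

For a real file, the same function can be given an open file handle instead of the example list.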

ADD REPLY • link 9.2 years ago by EVR ▴ 610

1

Dear Tom, Hi. Very nice answer, thank you.

ADD REPLY • link 9.2 years ago by Farbod ★ 3.4k

1

I don't think the number has to be exact. You can use a number that falls in the right ballpark (one from zebrafish may be fine in this case).

ADD REPLY • link 9.2 years ago by GenoMax 154k

2

Dear genomax2, Hi

I used "10000", as written on the Trinity website, and the result was only about 500 transcripts, but in the de novo assembly I have more than 500,000 transcripts!

So I think this number must be very critical, or zebrafish and my species are very distinct from each other.

Do you have any idea what this number (intron length) is for zebrafish?

ADD REPLY • link 9.2 years ago by Farbod ★ 3.4k

1

According to this paper that number may need to be ~1000 for zebrafish.

You have predictions for half a million transcripts. There is no independent evidence that they are real, as yet.

ADD REPLY • link 9.2 years ago by GenoMax 154k

1

Thank you for the paper you have provided, and the time you have spent.

I really appreciate that.

ADD REPLY • link 9.2 years ago by Farbod ★ 3.4k

1

Dear Genomax2,

In Table 1 of the paper you linked, the "maximum intron size" is about 378,145 for zebrafish, but you suggested ~1000. Have I misunderstood something?

ADD REPLY • link 9.2 years ago by Farbod ★ 3.4k

1

Mean intron length is ~3000 and median is ~1000 (378K is an outlier). You could try running with a couple of different values (1000 and 3000).
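To see why a single outlier makes the maximum (and even the mean) a poor guide, here is a small Python sketch that summarizes a list of intron lengths. The function name, the nearest-rank percentile choice, and the toy values are all illustrative assumptions, not figures from the paper.

```python
import statistics


def summarize_introns(lengths, percentile=99):
    """Summarize intron lengths; the percentile uses the nearest-rank method."""
    ordered = sorted(lengths)
    # Integer ceiling of percentile/100 * n, clamped to at least rank 1.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return {
        "median": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        "max": ordered[-1],
        "percentile": ordered[rank - 1],
    }


# Toy values: four ordinary introns and one very long outlier.
toy = [500, 800, 1000, 1200, 378145]
print(summarize_introns(toy, percentile=80))
```

In this toy case the median stays at 1000 while the mean is pulled far upward by the single long intron, which is one reason to try a few candidate values rather than relying on the maximum.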

ADD REPLY • link 9.2 years ago by GenoMax 154k

2

My Dear Friend, Genomax, Hi.

I ran the Trinity genome-guided approach with different "maximum intron size" values, and the number of genes (or, better to say, transcripts) in the resulting FASTA file was as follows:

Maximum intron size      No. of transcripts
378145                   568
3000                     567
1000                     566
10000                    567

De novo assembly: ~600,000 transcripts!

Do you have any idea about these results?

My fish is a sturgeon, and I mapped its reads to the zebrafish genome (using STAR), as no genome from a closely related species was available.

ADD REPLY • link 7.5 years ago by Farbod ★ 3.4k