Question

How To Interpret The Flagstat Output After Mapping Reads Back To Contigs Of A De Novo Transcriptome Assembly

0

Entering edit mode

11.0 years ago

lqdo2000 ▴ 20

Hi !

I have some paired-end reads (FR) data that I processed as follow:

trimming of the adaptors
estimation of optimal k-mer length (KmerGenie)
de novo assembly (Trinity) of the paired trimmed reads (default settings)
mapping of paired trimmed reads to Trinity contigs
statistics of mapping (samtools flagstat)

Questions:

At step 2) I obtained optimal k = 93 for forward reads and k = 17 for reverse. Is such a difference make sense?
At step 5) I obtained this output:

38918266 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates

38561384 + 0 mapped (99.08%:-nan%)

38918266 + 0 paired in sequencing

19459133 + 0 read1

19459133 + 0 read2

0 + 0 properly paired (0.00%:-nan%)

38561384 + 0 with itself and mate mapped

0 + 0 singletons (0.00%:-nan%)

2542804 + 0 with mate mapped to a different chr

406696 + 0 with mate mapped to a different chr (mapQ>=5)

Does 0 + 0 properly paired (0.00%:-nan%) mean nothing mapped to the contigs?

I would greatly appreciate any advice or explanation !

Thanks !

trinity samtools • 4.8k views

ADD COMMENT • link updated 11.0 years ago by Istvan Albert 101k • written 11.0 years ago by lqdo2000 ▴ 20

Istvan Albert · Answer 1 · 2013-11-16

2

Entering edit mode

11.0 years ago

Istvan Albert 101k

If you are assembling transcriptomes then KmerGenie is not the a right tool to estimate kmer size. KmerGenie should be used to estimate kmers for reads that represent randomly sheared DNA.
the "proper pair" term usually means that the read pairs are within a certain distance and a certain orientation (FR) relative to one another.

Since this is a transcriptome data I suspect that your data is strand specific and if it is then it may be in the RF orientation.

ADD COMMENT • link 11.0 years ago by Istvan Albert 101k

0

Entering edit mode

Thanks Istvan !

Ok, I didn't know for KmerGenie. I tried VevetK and VelvetOptimizer but they returned a value of 1 ! Is there anything else I could try to estimate the best k length?
Yes, data are strand specific (sequenced with MiSeq). I assumed the read R1 set referred to forward and the read R2 referred to reverse (and thus FR). How can we know which orientation are reads R1 and R2?

I also analysed the .SAM file generated by BWA with the Trinity script SAM_nameSorted_to_uniq_count_stats.pl. I got 87% proper_pairs, 6.5% left_only, 6.5% right_only. I assume it is not too bad, but there is a huge difference between results from samtools and trinity script. So which one are correct do you think? Is there a way to know? Thanks again !

ADD REPLY • link updated 11.0 years ago by Istvan Albert 101k • written 11.0 years ago by lqdo2000 ▴ 20

0

Entering edit mode

well if it is a strand specific RNA-seq then samtools will misinterpret your pairs. That's why the other script is provided.

There is no kmer size estimator for RNA-seq data that I know of.

ADD REPLY • link 11.0 years ago by Istvan Albert 101k

0

Entering edit mode

Alright, I see. Thank you very much for your help Istvan !

ADD REPLY • link 11.0 years ago by lqdo2000 ▴ 20