How To Interpret The Flagstat Output After Mapping Reads Back To Contigs Of A De Novo Transcriptome Assembly
1
0
Entering edit mode
11.0 years ago
lqdo2000 ▴ 20

Hi !

I have some paired-end reads (FR) data that I processed as follow:

  1. trimming of the adaptors
  2. estimation of optimal k-mer length (KmerGenie)
  3. de novo assembly (Trinity) of the paired trimmed reads (default settings)
  4. mapping of paired trimmed reads to Trinity contigs
  5. statistics of mapping (samtools flagstat)

Questions:

  • At step 2) I obtained optimal k = 93 for forward reads and k = 17 for reverse. Is such a difference make sense?

  • At step 5) I obtained this output:

    38918266 + 0 in total (QC-passed reads + QC-failed reads)

    0 + 0 duplicates

    38561384 + 0 mapped (99.08%:-nan%)

    38918266 + 0 paired in sequencing

    19459133 + 0 read1

    19459133 + 0 read2

    0 + 0 properly paired (0.00%:-nan%)

    38561384 + 0 with itself and mate mapped

    0 + 0 singletons (0.00%:-nan%)

    2542804 + 0 with mate mapped to a different chr

    406696 + 0 with mate mapped to a different chr (mapQ>=5)

Does 0 + 0 properly paired (0.00%:-nan%) mean nothing mapped to the contigs?

I would greatly appreciate any advice or explanation !

Thanks !

trinity samtools • 4.8k views
ADD COMMENT
2
Entering edit mode
11.0 years ago
  1. If you are assembling transcriptomes then KmerGenie is not the a right tool to estimate kmer size. KmerGenie should be used to estimate kmers for reads that represent randomly sheared DNA.

  2. the "proper pair" term usually means that the read pairs are within a certain distance and a certain orientation (FR) relative to one another.

Since this is a transcriptome data I suspect that your data is strand specific and if it is then it may be in the RF orientation.

ADD COMMENT
0
Entering edit mode

Thanks Istvan !

  1. Ok, I didn't know for KmerGenie. I tried VevetK and VelvetOptimizer but they returned a value of 1 ! Is there anything else I could try to estimate the best k length?
  2. Yes, data are strand specific (sequenced with MiSeq). I assumed the read R1 set referred to forward and the read R2 referred to reverse (and thus FR). How can we know which orientation are reads R1 and R2?

I also analysed the .SAM file generated by BWA with the Trinity script SAM_nameSorted_to_uniq_count_stats.pl. I got 87% proper_pairs, 6.5% left_only, 6.5% right_only. I assume it is not too bad, but there is a huge difference between results from samtools and trinity script. So which one are correct do you think? Is there a way to know? Thanks again !

ADD REPLY
0
Entering edit mode

well if it is a strand specific RNA-seq then samtools will misinterpret your pairs. That's why the other script is provided.

There is no kmer size estimator for RNA-seq data that I know of.

ADD REPLY
0
Entering edit mode

Alright, I see. Thank you very much for your help Istvan !

ADD REPLY

Login before adding your answer.

Traffic: 1658 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6