Hi !
I have some paired-end reads (FR) data that I processed as follows:
1) trimming of the adaptors
2) estimation of optimal k-mer length (KmerGenie)
3) de novo assembly (Trinity) of the paired trimmed reads (default settings)
4) mapping of paired trimmed reads to Trinity contigs
5) statistics of mapping (samtools flagstat)
Questions:
At step 2) I obtained an optimal k = 93 for the forward reads and k = 17 for the reverse reads. Does such a difference make sense?
At step 5) I obtained this output:
38918266 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
38561384 + 0 mapped (99.08%:-nan%)
38918266 + 0 paired in sequencing
19459133 + 0 read1
19459133 + 0 read2
0 + 0 properly paired (0.00%:-nan%)
38561384 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
2542804 + 0 with mate mapped to a different chr
406696 + 0 with mate mapped to a different chr (mapQ>=5)
Does "0 + 0 properly paired (0.00%:-nan%)" mean that nothing mapped to the contigs?
I would greatly appreciate any advice or explanation !
Thanks !
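For anyone reading the report above, here is a minimal Python sketch that parses the flagstat text into counts, so the "mapped" and "properly paired" lines can be compared directly. The function names `parse_flagstat` and `percent` are made up for illustration and are not part of samtools; the input is the report exactly as posted.

```python
import re

# flagstat report copied from the post above.
FLAGSTAT = """\
38918266 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
38561384 + 0 mapped (99.08%:-nan%)
38918266 + 0 paired in sequencing
19459133 + 0 read1
19459133 + 0 read2
0 + 0 properly paired (0.00%:-nan%)
38561384 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
2542804 + 0 with mate mapped to a different chr
406696 + 0 with mate mapped to a different chr (mapQ>=5)
"""

# "<QC-passed> + <QC-failed> <label>"
LINE = re.compile(r"(\d+) \+ (\d+) (.+)")
# Trailing "(...)" notes are dropped, except the "(mapQ>=5)" qualifier,
# which is needed to keep the last two labels distinct.
NOTE = re.compile(r"\s*\((?!mapQ).*\)$")


def parse_flagstat(text):
    """Return {label: (qc_passed, qc_failed)} for each flagstat line."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue
        passed, failed, label = match.groups()
        counts[NOTE.sub("", label)] = (int(passed), int(failed))
    return counts


def percent(part, whole):
    """Percentage rounded to two decimals; NaN when the denominator is 0."""
    return round(100.0 * part / whole, 2) if whole else float("nan")


counts = parse_flagstat(FLAGSTAT)
paired = counts["paired in sequencing"][0]
mapped_pct = percent(counts["mapped"][0], counts["in total"][0])
proper_pct = percent(counts["properly paired"][0], paired)
print(f"mapped: {mapped_pct}%  properly paired: {proper_pct}%")
```

The two numbers are independent lines of the report: a read can be counted as mapped without being counted as properly paired.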
Thanks Istvan !
I also analysed the .SAM file generated by BWA with the Trinity script SAM_nameSorted_to_uniq_count_stats.pl. I got 87% proper_pairs, 6.5% left_only, 6.5% right_only. I assume that is not too bad, but there is a huge difference between the results from samtools and the Trinity script. So which one do you think is correct? Is there a way to know? Thanks again !
Well, if it is strand-specific RNA-seq then samtools will misinterpret your pairs. That's why the other script is provided.
There is no k-mer size estimator for RNA-seq data that I know of.
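On the properly-paired question, it may help to look at the SAM FLAG field directly. In the SAM specification, bit 0x2 ("properly aligned according to the aligner") is written by the aligner, while bit 0x4 marks an unmapped read, so a zero count of proper pairs says that bit 0x2 was not set in the file, not that nothing mapped. Below is a rough sketch that tallies these bits from SAM text. The `tally` function and the example records are hypothetical, made up for illustration; it is a simplified count, not a reimplementation of flagstat or of the Trinity script.

```python
# Flag bits as defined in the SAM specification.
PAIRED = 0x1         # read is paired in sequencing
PROPER_PAIR = 0x2    # aligner considers the pair properly aligned
UNMAPPED = 0x4       # the read itself is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped


def tally(sam_lines):
    """Count mapped reads and pair states from SAM records (simplified)."""
    stats = {"total": 0, "mapped": 0, "proper_pair": 0, "both_mapped": 0}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the second column
        stats["total"] += 1
        if flag & UNMAPPED:
            continue
        stats["mapped"] += 1
        if flag & PAIRED:
            if flag & PROPER_PAIR:
                stats["proper_pair"] += 1
            if not flag & MATE_UNMAPPED:
                stats["both_mapped"] += 1
    return stats


# Hypothetical records (not from the poster's data):
# r1 is a proper pair, r2 maps at both ends without the proper-pair bit,
# r3 is unmapped at both ends.
EXAMPLE = [
    "@SQ\tSN:contig1\tLN:1000",
    "r1\t99\tcontig1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r1\t147\tcontig1\t300\t60\t50M\t=\t100\t-250\t*\t*",
    "r2\t97\tcontig1\t400\t60\t50M\t=\t900\t550\t*\t*",
    "r2\t145\tcontig1\t900\t60\t50M\t=\t400\t-550\t*\t*",
    "r3\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r3\t141\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]

stats = tally(EXAMPLE)
print(stats)
```

Running something like this on the first few thousand records of the real .SAM file would show whether any record carries bit 0x2 at all.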
Alright, I see. Thank you very much for your help Istvan !