I have aligned my paired-end reads to Hg19 build from Ensemble using tophat2. Tophat 'align_stats.txt' shows that 96% overall alignment. The command I used is
tophat2 -o OutDir -p 25 -G Homo_sapiens.GRCh37.70.gtf HG_19_GRCh37 CCGTCC_R1_001.fastq CCGTCC_R2_001 .fastq
A part of my GTF file looks like:
1 protein_coding CDS 35721 35736 . - 0 gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "1"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; protein_id "ENSP00000409362";
1 protein_coding start_codon 35734 35736 . - 0 gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "1"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001";
1 protein_coding exon 35277 35481 . - . gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "2"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; exon_id "ENSE00001669267";
1 protein_coding CDS 35277 35481 . - 2 gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "2"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; protein_id "ENSP00000409362";
1 protein_coding exon 34554 35174 . - . gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "3"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; exon_id "ENSE00001727627";
1 protein_coding CDS 35141 35174 . - 1 gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "3"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; protein_id "ENSP00000409362";
1 protein_coding stop_codon 35138 35140 . - 0 gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "3"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001";
The output of samtools flagstat accepted_hits.bam
file looks like:
62109999 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
62109999 + 0 mapped (100.00%:-nan%)
57657067 + 0 paired in sequencing
28804927 + 0 read1
28852140 + 0 read2
4916 + 0 properly paired (0.01%:-nan%)
55780192 + 0 with itself and mate mapped
1876875 + 0 singletons (3.26%:-nan%)
53443292 + 0 with mate mapped to a different chr
26218346 + 0 with mate mapped to a different chr (mapQ>=5)
I am afraid how to go for read counts. The flag stats output says 53443292 reads with mate mapped to different cur. Does the read counting tools like bedttols multicov
will take care of it ? or should I change any parameters in tophat2
command ?
The actual command I used is
tophat2 -o OutDir -p 25 -G Homo_sapiens.GRCh37.70.gtf HG_19_GRCh37 CCGTCC_R1_001.fastq CCGTCC_R2_001 .fastq
space seperated R1 and R2
you can sort your sam/bam file and use Picard's
CollectInsertSizeMetrics.jar
to estimate the insert size and see how it differs from what you specified (the tophat's default values in this case). i think that when you callflagstat
samtools automatically infers reads pairing by their names - which is a bit confusing.