Question

How To Handle 'Mate Mapped To Different Chr' In Rna-Seq Data Analysis ?

0

Entering edit mode

10.9 years ago

GouthamAtla 12k

I have aligned my paired-end reads to Hg19 build from Ensemble using tophat2. Tophat 'align_stats.txt' shows that 96% overall alignment. The command I used is

tophat2 -o OutDir -p 25 -G Homo_sapiens.GRCh37.70.gtf HG_19_GRCh37 CCGTCC_R1_001.fastq CCGTCC_R2_001 .fastq

A part of my GTF file looks like:

1       protein_coding  CDS     35721   35736   .       -       0        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "1"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; protein_id "ENSP00000409362";

1       protein_coding  start_codon     35734   35736   .       -       0        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "1"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001";

1       protein_coding  exon    35277   35481   .       -       .        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "2"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; exon_id "ENSE00001669267";

1       protein_coding  CDS     35277   35481   .       -       2        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "2"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; protein_id "ENSP00000409362";

1       protein_coding  exon    34554   35174   .       -       .        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "3"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; exon_id "ENSE00001727627";

1       protein_coding  CDS     35141   35174   .       -       1        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "3"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; protein_id "ENSP00000409362";

1       protein_coding  stop_codon      35138   35140   .       -       0        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "3"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001";

The output of samtools flagstat accepted_hits.bam file looks like:

62109999 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
62109999 + 0 mapped (100.00%:-nan%)
57657067 + 0 paired in sequencing
28804927 + 0 read1
28852140 + 0 read2
4916 + 0 properly paired (0.01%:-nan%)
55780192 + 0 with itself and mate mapped
1876875 + 0 singletons (3.26%:-nan%)
53443292 + 0 with mate mapped to a different chr
26218346 + 0 with mate mapped to a different chr (mapQ>=5)

I am afraid how to go for read counts. The flag stats output says 53443292 reads with mate mapped to different cur. Does the read counting tools like bedttols multicov will take care of it ? or should I change any parameters in tophat2 command ?

rna-seq tophat2 • 4.0k views

ADD COMMENT • link 10.9 years ago by GouthamAtla 12k

score 0 · Answer 1 · 2014-01-19

0

Entering edit mode

10.9 years ago

Pavel Senin ★ 1.9k

I think that you might run it not as they specify:

Usage: tophat [options]* <genome_index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2] 
When running TopHat with paired reads it is critical that the *_1 files an the *_2 files appear in separate comma-delimited lists, and that the order of the files in the two lists is the same.

i.e. it didn't recognize that your reads are paired, so the maximum insert size constraint was ignored, and so on.

ADD COMMENT • link 10.9 years ago by Pavel Senin ★ 1.9k

0

Entering edit mode

The actual command I used is

tophat2 -o OutDir -p 25 -G Homo_sapiens.GRCh37.70.gtf HG_19_GRCh37 CCGTCC_R1_001.fastq CCGTCC_R2_001 .fastq

space seperated R1 and R2

ADD REPLY • link 10.9 years ago by GouthamAtla 12k

0

Entering edit mode

you can sort your sam/bam file and use Picard's CollectInsertSizeMetrics.jar to estimate the insert size and see how it differs from what you specified (the tophat's default values in this case). i think that when you call flagstat samtools automatically infers reads pairing by their names - which is a bit confusing.

ADD REPLY • link 10.9 years ago by Pavel Senin ★ 1.9k