Different mapping results with and without GFT file
2
0
Entering edit mode
10.2 years ago

I am trying to use tophat and cufflinks to dealing with RNA-seq data, while something went wrong when running tophat. Different mapping results were generated after running Tophat with and without the GTF, ie, same read mapped to different site in two bam files.

a lot of reads have this problem, and the mapping results with GTF have many mismatches, maybe ~80 mismatch bases in one 90bp read

one is (with gtf)

FCC3378ACXX:4:1113:11999:94438# 99      scaffold_1      928636  50      90M     =       928763  217     CGCGCGTCCACGAGGCCGTTGTGGTCGGCGTCCGCGCGCTCGAAGGCGCCGGCGGCCACGGC       CGCGTTGGGGTTCGGGTTGGCCAGGGGC      bbbceecegggggiihiifihfhehffhhhiggeebccaaaXaccccccccT_a[aacca_accccVaccac_Xaaaa`aaccb^^a]aa      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i       :0  NM:i:0  MD:Z:90 YT:Z:UU XS:A:-  NH:i:1
  5083 FCC3378ACXX:4:1113:11999:94438# 147     scaffold_1      928763  50      90M     =       928636  -217    GCAGCGGCTCGGGCGGGGACTGCGCCAGCGCCAGCGCGACGAGCAGCGAGAGCGAGAGCGCC       GCTAGCTGCAGGAAGAGGCGCGGCATGG      BBBBBBBBBBaa[R`_b^^cc]V^_V[___a_YXaaa[W^b__Z^a`baaac`bfedaeUdf`agdfdfeahe_gffeggge`JcaZ[__      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i       :0  NM:i:0  MD:Z:90 YT:Z:UU XS:A:-  NH:i:1

Another one is (without gtf)

FCC3378ACXX:4:1113:11999:94438# 99      scaffold_1      910475  50      90M     =       910602  217     CGCGCGTCCACGAGGCCGTTGTGGTCGGCGTCCGCGCGCTCGAAGGCGCCGGCGGCCACGGC       CGCGTTGGGGTTCGGGTTGGCCAGGGGC      bbbceecegggggiihiifihfhehffhhhiggeebccaaaXaccccccccT_a[aacca_accccVaccac_Xaaaa`aaccb^^a]aa      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i       :0  NM:i:0  MD:Z:90 YT:Z:UU NH:i:1
  4818 FCC3378ACXX:4:1113:11999:94438# 147     scaffold_1      910602  50      90M     =       910475  -217    GCAGCGGCTCGGGCGGGGACTGCGCCAGCGCCAGCGCGACGAGCAGCGAGAGCGAGAGCGCC       GCTAGCTGCAGGAAGAGGCGCGGCATGG      BBBBBBBBBBaa[R`_b^^cc]V^_V[___a_YXaaa[W^b__Z^a`baaac`bfedaeUdf`agdfdfeahe_gffeggge`JcaZ[__      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i       :0  NM:i:0  MD:Z:90 YT:Z:UU NH:i:1

The command line is

tophat \
  -G FM_Physo1_1.gtf \
  --mate-inner-dist 20 \
  --mate-std-dev 34 \
  -p 11 \
  -o ./tophat_out/R0h \
  Psojae_genome \
  R0h_1.fq \
  R0h_2.fq

Could someone help me with that

topaht bowtie2 RNA-Seq • 3.1k views
ADD COMMENT
1
Entering edit mode
10.2 years ago

I think I may get the answer why this happens, I didn't notice that the conference fasta file was in dos format, when I change it into unix format and build the index, the tophat got the correct result

ADD COMMENT
0
Entering edit mode

FYI, I've moved this to an answer, since it sounds like this solved the problem.

That's pretty wild, but it's certainly good to know!

ADD REPLY
0
Entering edit mode
10.2 years ago
roy.granit ▴ 890

The inclusion of the GTF file changes the way TopHat does the mapping, it will first uses the GTF file and try to align the reads and only then if some reads do not map to the transcriptome tried to find matches in the genome.

From the TopHat guide:

reads that do not fully map to the transcriptome will then be mapped on the genome. The reads that did map on the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final tophat output.

So it could be that if the mapping is done first with the whole genome you might get different hits.

ADD COMMENT
1
Entering edit mode

That was my first though too, but note the CIGAR strings and MAPQ in each case. Neither case show signs of splicing, they both have the same insert size, and they both have MAPQ=50, which is tophat2-speak for "not a multimapper". The alignment to the genome should have a lower MAPQ due to being a multimapper. This shouldn't happen.

ADD REPLY
0
Entering edit mode

i have not described clearly,the read with GTF may mapped to the wrong place because many mismatches occurred, ~80 mismatch bases in one 90bp read, and this is not a coincidence, lots of reads have this problem.

ADD REPLY
0
Entering edit mode

The alignments don't give indication of any mismatches (anyway, bowtie/bowtie2 can't align reads with that many mismatches). Are you sure that's correct?

ADD REPLY
0
Entering edit mode

Yes, the sam output said there is no mismatch, but when I check the mapped sequence in genome, I found that they are totally different while the mapped sequence without gtf is match the read.

ADD REPLY
0
Entering edit mode

What species is this?

ADD REPLY
0
Entering edit mode

phytophthora sojae, belong to oomycete

ADD REPLY
0
Entering edit mode

Indeed, the first alignment shouldn't ever occur. Perhaps this is some obscure tophat bug.

ADD REPLY

Login before adding your answer.

Traffic: 2718 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6