CLC GW vs. tophat2
9.6 years ago
Assa Yeroslaviz ★ 1.9k

Hi,

We are having a discussion with our genomic centre about the mapping results of the samples they provided for us.

I have analysed the data with both tophat2 and STAR.

We have done a quality check using fastqc. The results we got back were not very promising. I have added one image below.

These are the commands I used to run the analysis:

tophat2 -p 10 -g 20 --read-edit-dist 5 --report-secondary-alignments -N 5 --transcriptome-index=transcriptome_index/genes -o $NEW_FILE.out genome $file

STAR --runMode alignReads --runThreadN 10 --genomeDir /home/yeroslaviz/genomes/Mus_musculus/STARIndex/ --readFilesCommand zcat --readFilesIn $file --sjdbGTFfile /home/yeroslaviz/genomes/Mus_musculus/Ensembl/NCBIM37/Annotation/Genes/genes.gtf --sjdbFileChrStartEnd ~/genomes/Mus_musculus/STARIndex/sjdbList.out.tab --sjdbInsertSave All --outFilterMultimapNmax 20 --outFileNamePrefix $NEW_FILE --outSAMprimaryFlag AllBestScore --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM --twopassMode Basic --limitGenomeGenerateRAM 50000000000 --alignSJDBoverhangMin 1

We got very low mapping rates (only around 30-60% of the reads were mapped).

When we asked the sequencing centre whether they could explain the problem(s), we were told that they couldn't reproduce it.

They sent us a list of their own mapping rates, which range from 75% to 95%.

It turns out they are using the CLC Genomics Workbench to map the reads, with the following parameters:

Mismatch cost: 2; Insertion cost: 3; Deletion cost: 3; Length fraction: 0.5; Similarity fraction: 0.8

I was wondering whether it even makes sense to map a data set with such parameters. The length fraction and the similarity fraction allow, IMHO, for a very high error rate: only 50% of the read has to align, and within that 50% only 80% similarity is required. For our 100-base reads this means that as few as 40 bases have to match, so up to 60 bases can be wrong or unaligned.
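The arithmetic behind that estimate can be sketched in a few lines of shell. This assumes the two fractions simply multiply, as described above; how CLC applies them internally is not confirmed here, and the function name is only illustrative.

```shell
# Worst-case estimate for length fraction x similarity fraction.
# Fractions are passed as integer percentages so shell arithmetic stays exact.
min_matching_bases() {
    # $1 = read length, $2 = length fraction (%), $3 = similarity fraction (%)
    aligned=$(( $1 * $2 / 100 ))      # bases that must be part of the alignment
    echo $(( aligned * $3 / 100 ))    # bases within that stretch that must match
}

read_len=100
matching=$(min_matching_bases "$read_len" 50 80)
echo "minimum matching bases: $matching"
echo "bases that may mismatch or stay unaligned: $(( read_len - matching ))"
```

For 100-base reads with 0.5 and 0.8 this prints 40 and 60, which is where the "60 bases not correct" figure comes from.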

I have tried to search for papers or more information from other users who have worked with the CLC GW before, but couldn't find much.

Do you think the CLC way of analysing the data is still good enough? Is the error rate not too high?

Thanks,
Assa

[Image: FastQC per_base_quality plot]

tophat CLC fastqc • 4.0k views
Also, the plot above shows data of very low quality. I would be highly suspicious of any tool (or settings) that produces high alignment rates on it.

9.6 years ago
michael.ante ★ 3.9k

I would start by trimming the low-quality tails and checking for adapter contamination (which is also part of the FastQC report). You can use, for instance, bbduk (from the BBMap suite) or fastq_quality_trimmer from the FASTX toolkit.

Subsequently, you might check for over-represented sequences in the trimmed data. Maybe you have some other contamination as well.

After these steps, you could still compare Tophat2 and CLC GW.
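A minimal sketch of what such a trimming step might look like. The file names, the Q20 threshold, the minimum length of 30 and the adapter file are placeholders, not values from this thread, and the `run` helper is only an illustrative dry-run wrapper that prints each command unless DRY_RUN=0 is set.

```shell
# Dry-run helper: print the command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Placeholder inputs (assumptions, adjust to your own data).
reads=reads.fastq
adapters=adapters.fa

# bbduk: adapter trimming (ktrim/k/mink/hdist) plus quality trimming
# of the 3' end (qtrim/trimq), dropping reads shorter than minlen.
run bbduk.sh in="$reads" out=trimmed_bbduk.fastq ref="$adapters" \
    ktrim=r k=23 mink=11 hdist=1 qtrim=r trimq=20 minlen=30

# FASTX alternative: quality-trim the read tails only (no adapter removal).
run fastq_quality_trimmer -Q33 -t 20 -l 30 -i "$reads" -o trimmed_fastx.fastq
```

Re-running FastQC on the trimmed files shows whether the per-base quality and adapter content actually improved.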

Hi,

I have already done this. I did all the trimming, cutting and filtering I can think of. It didn't really increase the mapping rates by much. My question here is not really about how to make my data set better, but about whether or not the results from CLC are trustworthy enough and, if so, why they differ so much from the tophat2 run.

9.6 years ago
Burnedthumb ▴ 90

Both Tophat2 and STAR are splice-aware aligners. If I recall correctly, the default alignment program of CLC bio is not. Maybe you can verify which of the aligners they used; they may have used the RNA-seq pipeline, which (should) work differently.

A couple of months back I did some tests with CLC bio vs Bowtie2 vs HiSAT followed by some SNP calling program. The results from that were that CLC bio gave more (false positive) SNPs than the other two. My guess is that this is due to weird liberal intron/exon boundary alignments of CLC (however, I need to do more testing for that).

Did you use the default parameters from the CLC run?

I still think that taking a length fraction of 0.5 and then a similarity of 0.8 on top of that allows for quite a high error rate. Any experience with that?

9.6 years ago

By default, bowtie2 is tuned for speed and will not be able to handle data with lots of errors. You can greatly increase its sensitivity; for an example, see this post: BWA vs Bowtie 2 (Poll)

I will try to run bowtie instead of tophat2 with the mentioned parameters. Maybe I will play with them a bit as well.

But my main question stays the same. Can I trust the CLC results?

tophat2 already runs bowtie2 as its aligner; it may just need a few extra parameters.

Is it possible to add the bowtie parameters from the link you added above to the tophat2 run?

I have looked into the tophat parameters and couldn't find any, besides the ones I listed above, that make the search less stringent.

I think these correspond to:

--very-sensitive

See the Bowtie2-specific settings in the Tophat2 manual. In this case the options need to be prefixed with --b2- so that Tophat2 knows to pass them down to Bowtie2. For example, -D becomes --b2-D and --very-sensitive becomes --b2-very-sensitive.
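As a rough sketch, the tophat2 call from the question with the Bowtie2 preset passed through might look like this. The values of `file` and `NEW_FILE` are placeholders, and the `run` helper is only an illustrative dry-run wrapper that prints the command unless DRY_RUN=0 is set.

```shell
# Dry-run helper: print the command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Placeholder values for the variables used in the original command.
file=sample.fastq.gz
NEW_FILE=sample

# Same options as in the question, plus the Bowtie2 preset via the --b2- prefix.
run tophat2 -p 10 -g 20 --read-edit-dist 5 --report-secondary-alignments -N 5 \
    --b2-very-sensitive \
    --transcriptome-index=transcriptome_index/genes \
    -o "$NEW_FILE.out" genome "$file"
```

In Bowtie2's end-to-end mode, --very-sensitive is shorthand for -D 20 -R 3 -N 0 -L 20 -i S,1,0.50, so the individual values could also be set one by one (--b2-D 20, --b2-R 3, and so on) if you want to tune them separately.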
