Rockhoppper - Suspiciously low percentage of perfectly aligned reads
0
0
Entering edit mode
13 months ago
langziv ▴ 70

Hi

I run Rockhopper and I'm wondering if there's something wrong since the percentage of perfectly aligned reads is not higher than 1%, which seems very low.

Here's an example of alignment of 2 pairs of sequencing files:

Aligning reads to assembled transcripts using files:  
    9C-13-12_S103_L002_R1_001.fq
    9C-13-12_S103_L002_R2_001.fq

    Total reads in files:           6829001
    Perfectly aligned reads:        61445   1%

Aligning reads to assembled transcripts using files:
    9C-13-10_S101_L002_R1_001.fq
    9C-13-10_S101_L002_R2_001.fq

    Total reads in files:           8658801
    Perfectly aligned reads:        32281   0%

What do you think?

UPDATE
@genomax's comment made me realize I didn't trim the reads before running Rockhopper, which is a huge mistake on my behalf.
I run Rockhopper after running Trim Galore on the reads, and got the following results:

Aligning reads to assembled transcripts using files:
9C-13-12_R1.fq
9C-13-12_R2.fq

Total reads in files:           6818641
Perfectly aligned reads:        2887029 42%

Aligning reads to assembled transcripts using files:
9C-13-10_R1.fq
9C-13-10_R2.fq

Total reads in files:           8642168
Perfectly aligned reads:        3548606 41%

Aligning reads to assembled transcripts using files:
9C-04-14_R1.fq
9C-04-14_R2.fq

Total reads in files:           6967804
Perfectly aligned reads:        2825965 41%

Aligning reads to assembled transcripts using files:
9S-04-13_R1.fq
9S-04-13_R2.fq

Total reads in files:           6839242
Perfectly aligned reads:        2602681 38%

Aligning reads to assembled transcripts using files:
9S-13-09_R1.fq
9S-13-09_R2.fq

Total reads in files:           7427471
Perfectly aligned reads:        2317790 31%

Aligning reads to assembled transcripts using files:
9S-13-11_R1.fq
9S-13-11_R2.fq

Total reads in files:           7326497
Perfectly aligned reads:        2899481 40%

   Total number of assembled transcripts:       1606
Average transcript length:              1955
Median transcript length:               1487
Total number of assembled bases:        3140019

These results are obviously better. I'm still not sure if they are good enough, though.

Rockhopper RNA-seq • 2.1k views
ADD COMMENT
0
Entering edit mode

What do you think?

Based on some recent threads you posted it is possible that your assembly is either not correct or is not of good quality. If you try to blast some sequences/transcripts from your assembled data do you get hits that approach 100% contiguous coverage for the query (which they should considering that there is a lot of K pneumoniae data out there).

How much data did you start with (in terms of gross coverage NNx based on genome size) that you used for this assembly? It may sound counterintuitive but having too much data is also not good for assemblies.

ADD REPLY
0
Entering edit mode

Thanks.
I just did blastn of some of the transcripts and got over 99.9% or 100% similarity.
By how much data you mean how many read are in each file? What does NNx mean?

ADD REPLY
0
Entering edit mode

I just did blastn of some of the transcripts and got over 99.9% or 100% similarity.

That is good. Was the alignment end to end with no separate bunch of local HSP's. If yes, then we know then that the 1% perfect alignment is strange. It should be rather high.

As for the amount of data. Say the genome is 4 Mb. How much read data did you start with in terms of total bases (number of reads * read length)?

(number of bases/genome size = Gross fold coverage). If you started with 100x coverage then that is probably too much.

I came across this paper which compares rockhopper2 and trinity: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6970227/

ADD REPLY
0
Entering edit mode

Total bases (number of reads * read length) should be calculated after removal of adapters, right?

ADD REPLY
0
Entering edit mode

Correct. Data that went into the assembly.

ADD REPLY
0
Entering edit mode

By the way, I run Rockhopper on the untrimmed files, and the full match percentage grew to 20%-30%, and there are 1552 transcripts in the untrimmed transcriptome and 105 in the trimmed one.

ADD REPLY
0
Entering edit mode

Here are the total bases in 4 fastq files (after adapter removal): 1,027,852,411, 1,041,895,083, 1,121,786,520, 1,299,721,929.
I just remembered that I got an assembly, not a genome, so I don't think the genome size is available.

ADD REPLY
0
Entering edit mode

so I don't think the genome size is available.

Perhaps not for your specific strain but plenty of other Kpn genomes are available and they are in ~ 6 Mb range. Not all 6 is going to code for RNA but close enough for rough calculations. So going by that size you had way too much data (~180 x for each fastq) going into your assemblies. If you actually merged all of the data above in one assembly then .. wow. That may be part of the issue.

there are 1552 transcripts in the untrimmed transcriptome and 105 in the trimmed one.

It looks like there are between 5500 to 6000 CDS's reported for K. pneumoniae genomes. So your assembly can use improvement. Look into normalizing/downsampling the data and then redoing the assembly. May want to start at around 25-30x bases and experiment.

ADD REPLY
0
Entering edit mode

Thanks.
I'll try.

ADD REPLY
0
Entering edit mode

I reduced the data to 50x with seqtk and got the same percentage.
What do you think about combining the assembly so that it will be a genome?

ADD REPLY
0
Entering edit mode

Since we can't see your data we may be reaching the end of the line here in terms of advice. At this point in time, bacterial RNAseq should be a routine experiment and getting usable results should not be this difficult (unless there is some flaw in your data or you are unable to share critical information for secrecy reason).

Have you ever aligned the data to a public Kpn genome (as I had suggested in a separate thread) and what sort of alignment % do you see there? I would expect that number to be in 90+% range. If not there is some basic problem with your data.

I reduced the data to 50x with seqtk

Instead of simply downsampling you may need to normalize the data using something like bbnorm.sh.

ADD REPLY
0
Entering edit mode

You mean blast the raw reads against the assembly?

ADD REPLY
0
Entering edit mode

Not blast. Align original/trimmed data using an NGS aligner against a complete Kpn genome (use the RefSeq) available at NCBI.

ADD REPLY
0
Entering edit mode

I used bowtie2. I aligned a pair of pair-ended reads files against an index that I created from the species' RefSeq file. It resulted in the following error messages:

Error, fewer reads in file specified with -1 than in file specified with -2
terminate called after throwing an instance of 'int'
(ERR): bowtie2-align died with signal 6 (ABRT) (core dumped)

Any idea? I never thought of the possibility of the data being the problem, but I'm starting to think it might be. I didn't extract the RNA myself and don't know what is its quality.

ADD REPLY
0
Entering edit mode

I didn't extract the RNA myself and don't know what is its quality.

While that could affect results it looks like you have an immediate informatics problem.

fewer reads in file specified with -1 than in file specified with -2

Looks like your R1/R2 files are out of sync. Did you trim them independently (which you should not have done)? Either re-trim the data using both files at the same time with bbduk.sh/fastp or use repair.sh tool from BBMap suite to bring the two files back in sync removing the singletons. You will then need to re-align with the synced files.

ADD REPLY
0
Entering edit mode

Meaning, I should have trimmed them only after merging them?

ADD REPLY
0
Entering edit mode

Are your read inserts so short that the reads are able to merge? But otherwise no. Reads should be scanned/trimmed before assembly. Especially you are doing de novo one.

ADD REPLY
0
Entering edit mode

Thanks.
I run Rockhopper again after trimming the adapters, and got better results, which I added to the original question.

ADD REPLY
0
Entering edit mode

I can provide an example of the data. What should I provide?

ADD REPLY
0
Entering edit mode

"having too much data is also not good for assemblies" implies that it's possible that only 1% or less of the reads were good?
If that's the case, shouldn't it be predictable from FastQC results?

ADD REPLY

Login before adding your answer.

Traffic: 2450 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6