Entering edit mode
13 months ago
langziv
▴
70
Hi
I run Rockhopper and I'm wondering if there's something wrong since the percentage of perfectly aligned reads is not higher than 1%, which seems very low.
Here's an example of alignment of 2 pairs of sequencing files:
Aligning reads to assembled transcripts using files:
9C-13-12_S103_L002_R1_001.fq
9C-13-12_S103_L002_R2_001.fq
Total reads in files: 6829001
Perfectly aligned reads: 61445 1%
Aligning reads to assembled transcripts using files:
9C-13-10_S101_L002_R1_001.fq
9C-13-10_S101_L002_R2_001.fq
Total reads in files: 8658801
Perfectly aligned reads: 32281 0%
What do you think?
UPDATE
@genomax's comment made me realize I didn't trim the reads before running Rockhopper, which is a huge mistake on my behalf.
I run Rockhopper after running Trim Galore on the reads, and got the following results:
Aligning reads to assembled transcripts using files:
9C-13-12_R1.fq
9C-13-12_R2.fq
Total reads in files: 6818641
Perfectly aligned reads: 2887029 42%
Aligning reads to assembled transcripts using files:
9C-13-10_R1.fq
9C-13-10_R2.fq
Total reads in files: 8642168
Perfectly aligned reads: 3548606 41%
Aligning reads to assembled transcripts using files:
9C-04-14_R1.fq
9C-04-14_R2.fq
Total reads in files: 6967804
Perfectly aligned reads: 2825965 41%
Aligning reads to assembled transcripts using files:
9S-04-13_R1.fq
9S-04-13_R2.fq
Total reads in files: 6839242
Perfectly aligned reads: 2602681 38%
Aligning reads to assembled transcripts using files:
9S-13-09_R1.fq
9S-13-09_R2.fq
Total reads in files: 7427471
Perfectly aligned reads: 2317790 31%
Aligning reads to assembled transcripts using files:
9S-13-11_R1.fq
9S-13-11_R2.fq
Total reads in files: 7326497
Perfectly aligned reads: 2899481 40%
Total number of assembled transcripts: 1606
Average transcript length: 1955
Median transcript length: 1487
Total number of assembled bases: 3140019
These results are obviously better. I'm still not sure if they are good enough, though.
Based on some recent threads you posted it is possible that your assembly is either not correct or is not of good quality. If you try to blast some sequences/transcripts from your assembled data do you get hits that approach 100% contiguous coverage for the query (which they should considering that there is a lot of K pneumoniae data out there).
How much data did you start with (in terms of gross coverage NNx based on genome size) that you used for this assembly? It may sound counterintuitive but having too much data is also not good for assemblies.
Thanks.
I just did blastn of some of the transcripts and got over 99.9% or 100% similarity.
By how much data you mean how many read are in each file? What does NNx mean?
That is good. Was the alignment end to end with no separate bunch of local HSP's. If yes, then we know then that the 1% perfect alignment is strange. It should be rather high.
As for the amount of data. Say the genome is 4 Mb. How much read data did you start with in terms of total bases (number of reads * read length)?
(number of bases/genome size = Gross fold coverage). If you started with 100x coverage then that is probably too much.
I came across this paper which compares rockhopper2 and trinity: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6970227/
Total bases (number of reads * read length) should be calculated after removal of adapters, right?
Correct. Data that went into the assembly.
By the way, I run Rockhopper on the untrimmed files, and the full match percentage grew to 20%-30%, and there are 1552 transcripts in the untrimmed transcriptome and 105 in the trimmed one.
Here are the total bases in 4 fastq files (after adapter removal): 1,027,852,411, 1,041,895,083, 1,121,786,520, 1,299,721,929.
I just remembered that I got an assembly, not a genome, so I don't think the genome size is available.
Perhaps not for your specific strain but plenty of other Kpn genomes are available and they are in ~ 6 Mb range. Not all 6 is going to code for RNA but close enough for rough calculations. So going by that size you had way too much data (~180 x for each fastq) going into your assemblies. If you actually merged all of the data above in one assembly then .. wow. That may be part of the issue.
It looks like there are between 5500 to 6000 CDS's reported for K. pneumoniae genomes. So your assembly can use improvement. Look into normalizing/downsampling the data and then redoing the assembly. May want to start at around 25-30x bases and experiment.
Thanks.
I'll try.
I reduced the data to 50x with seqtk and got the same percentage.
What do you think about combining the assembly so that it will be a genome?
Since we can't see your data we may be reaching the end of the line here in terms of advice. At this point in time, bacterial RNAseq should be a routine experiment and getting usable results should not be this difficult (unless there is some flaw in your data or you are unable to share critical information for secrecy reason).
Have you ever aligned the data to a public Kpn genome (as I had suggested in a separate thread) and what sort of alignment % do you see there? I would expect that number to be in 90+% range. If not there is some basic problem with your data.
Instead of simply downsampling you may need to normalize the data using something like
bbnorm.sh
.You mean blast the raw reads against the assembly?
Not blast. Align original/trimmed data using an NGS aligner against a complete Kpn genome (use the RefSeq) available at NCBI.
I used bowtie2. I aligned a pair of pair-ended reads files against an index that I created from the species' RefSeq file. It resulted in the following error messages:
Any idea? I never thought of the possibility of the data being the problem, but I'm starting to think it might be. I didn't extract the RNA myself and don't know what is its quality.
While that could affect results it looks like you have an immediate informatics problem.
Looks like your R1/R2 files are out of sync. Did you trim them independently (which you should not have done)? Either re-trim the data using both files at the same time with
bbduk.sh
/fastp
or userepair.sh
tool from BBMap suite to bring the two files back in sync removing the singletons. You will then need to re-align with the synced files.Meaning, I should have trimmed them only after merging them?
Are your read inserts so short that the reads are able to merge? But otherwise no. Reads should be scanned/trimmed before assembly. Especially you are doing
de novo
one.Thanks.
I run Rockhopper again after trimming the adapters, and got better results, which I added to the original question.
I can provide an example of the data. What should I provide?
"having too much data is also not good for assemblies" implies that it's possible that only 1% or less of the reads were good?
If that's the case, shouldn't it be predictable from FastQC results?