RNA seq aligner?
1
0
Entering edit mode
8 months ago
tesfaye • 0

Hi all, Our attempts to download the transcriptome of H37RV for indexing in the salmon tool have been unsuccessful due to unavailability on NCBI and other databases . Despite downloading the coding sequence and whole genome of H37Rv and utilizing both for indexing in salmon alignment and quantification of RNA reads, we have encountered significant challenges. We end up with low alignment rates or percentage despite having more than 12 million reads. Additionally, the kraken report indicates that reads align to mtb. Could you please offer guidance on obtaining or generating a comprehensive transcriptome of H37rv suitable for indexing in salmon? Alternatively, any suggestions on overcoming this issue would be greatly appreciated

Alignment • 1.3k views
ADD COMMENT
0
Entering edit mode

Is there any specific reason why you are using Salmon for alignment? You can also try hisat2 (splice aligner) for Mtb genome alignment.

ADD REPLY
0
Entering edit mode
8 months ago
Dave Carlson ★ 2.0k

Do you have both the genome assembly and the annotations in gff or gtf format? If so, you can use the gffread utility to generate the transcript fasta file:

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf
ADD COMMENT
0
Entering edit mode

Thanks, Dave Carlson. Can I use whole genome in salmon for -- decoy command?

after running this command using whole genome as decoy sequence:

salmon index -t /home/tesf/rna_analysis/transcripts.fa -i transcripts_index --decoys /home/tesf/rna_analysis/Mtb_ref/Mtb/NC_018143_TB.fna -k 31
I encountered the following error:
Version Info: This is the most recent version of salmon.
index ["transcripts_index"] did not previously exist  . . . creating it
[2024-03-06 12:55:24.734] [jLog] [info] building index
out : transcripts_index
[2024-03-06 12:55:24.734] [puff::index::jointLog] [info] Running fixFasta
[2024-03-06 12:55:24.773] [puff::index::jointLog] [critical] The decoy name TGGGCCCGCCGGTGGGCGGCCAGCGCATCCAAAAACGAATTGGCGGCCGCATAGTTGGCCTGGCCCGACG was encountered more than once --- please ensure all decoy names and sequences are unique.
[2024-03-06 12:55:24.775] [puff::index::jointLog] [error] The fixFasta phase failed with exit code 1

How do I fix it?

ADD REPLY
1
Entering edit mode

Did you follow the instructions for generating the decoy-aware index here?

https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/

In your salmon index command, it looks like you're providing a fasta file for the --decoys flag, but you should be providing a text file that includes the contig/chromosome fasta header names from your genome assembly, not the actual sequence itself.

Please see the above link for a tutorial on how to generate the decoy-aware index.

ADD REPLY
0
Entering edit mode

Thanks, Dave Carlson

ADD REPLY
0
Entering edit mode

Dear Dave Carlson

Would it be possible for you to generate transcripts.fa from genome and gtf files from this link: Index of /genomes/all/GCA/000/195/955/GCA_000195955.2_ASM19595v2?

I have tried to generate transcripts.fa from this command:

 gffread -w transcripts.fa -g GCA_000195955.2_ASM19595v2_genomic.fna.gz GCA_000195955.2_ASM19595v2_genomic.gtf.gz

and encountered the following error:

Error: sequence lines in a FASTA record must have the same length!

I tried to resolve the issue by installing seqkit and tried to reformat the fasta genome, using this command:

seqkit seq -w 0 GCA_000195955.2_ASM19595v2_genomic.fna.gz > fixed_genomic.fna

but the issue still persists. thanks in advance

ADD REPLY
0
Entering edit mode

I don't think the link worked successfully.

ADD REPLY
1
Entering edit mode

You could get the cds_from_genomic.fna or rna_from_genomic.fna file present at the link you posted as an alternative.

ADD REPLY
0
Entering edit mode

As GenoMax said, you can probably just combine the following two files to be your transcripts.fa:

GCA_000195955.2_ASM19595v2_cds_from_genomic.fna

GCA_000195955.2_ASM19595v2_rna_from_genomic.fna

ADD REPLY
0
Entering edit mode

Dear Dave Carlson

I aligned my Mtb reads to the entire genome using Hisat2 and attempted to use feature count for gene quantification. However, I encountered an issue where feature count only provided counts for 80 genes out of a total of 4100 genes. How can I obtain more accurate results? the command: featureCounts -a GCA_000195955.2_ASM19595v2_genomic.gtf -o counts.txt outputs/*_sorted.bam

the reference link: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/195/955/GCA_000195955.2_ASM19595v2/

ADD REPLY

Login before adding your answer.

Traffic: 2364 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6