Question

Can a CDS file be used as a reference to align my fastq files?

5.3 years ago

m986 ▴ 10

I'm new to RNA-seq, so first, I'm sorry if this question is too basic or if I confuse the definitions. My advisor gave me a FASTA file (for a non-model organism) and told me to use it as the reference for mapping (he suggested Bowtie2). Its "readme" file says [This file contains the organism "genes" annotated], but it is not a GTF/GFF file. My file looks like this:

>Id1
ATGGCTTCAAACAAGCGAGAAAGTCA...
>Id2
ATGGGCAGTCTTGGTCCTATTGAAAAT...
>Id3
ATGATACTTTCCGTTTTGTCGAGCCCT...

So, to me, this file is the CDS.fa (CoDing Sequence). I have researched this, but it is still not clear to me: is it possible to run an alignment using this file as the reference?

RNA-Seq rna-seq cds alignment • 2.8k views

ADD COMMENT • link updated 5.3 years ago by swbarnes2 14k • written 5.3 years ago by m986 ▴ 10

score 2 · Accepted Answer · 2019-08-09

5.3 years ago

lakhujanivijay 5.9k

Hi m986

Since you said that you are new to RNA-seq and that you have been asked to perform an alignment, I am assuming that you have been asked to perform what is known as a reference-based transcriptome assembly. I would suggest that you read this paper.

"My advisor gave me a fasta file"

For mapping, you must have fastq files and not fasta; that must be a typo. Anyway, fastq files are generally aligned to the corresponding genome file using a splice-aware aligner such as HISAT2 or STAR. In addition to the genome file, you would also need the corresponding GTF/GFF file.
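To tell the two file types apart quickly, one can look at the first character of the first record: FASTA headers start with ">" and FASTQ headers start with "@". The following is only a minimal sketch; the function name and the toy records are made up for illustration, and it does not validate the rest of the file.

```python
import io

def sniff_sequence_format(handle):
    """Guess 'fasta' or 'fastq' from the first non-empty line of a text handle."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"  # reference-style record: header, then sequence
        if line.startswith("@"):
            return "fastq"  # read-style record: header, sequence, '+', qualities
        return "unknown"
    return "unknown"  # empty input

# Toy records (hypothetical), held in memory instead of real files.
reference = io.StringIO(">Id1\nATGGCTTCAAACAAGCGAGAAAGTCA\n")
reads = io.StringIO("@read1\nATGGCTTCAA\n+\nIIIIIIIIII\n")

print(sniff_sequence_format(reference))
print(sniff_sequence_format(reads))
```

For real data, the same function could be given an open file handle (or a gzip.open(..., "rt") handle for compressed files).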

I would also suggest that you talk to your supervisor about the objective of the experiment first and then proceed.

ADD COMMENT • link 5.3 years ago by lakhujanivijay 5.9k

Thank you! Actually, I do have fastq files, and those are what I want to align. The fasta file that my advisor gave me is the CDS (CoDing Sequence); it is the "reference" that he suggests using for the alignment. I agree with you about using a reference genome, but my advisor insists on using that CDS file to align my fastq files. Is that correct?

ADD REPLY • link 5.3 years ago by m986 ▴ 10

While that is not incorrect, using a reduced representation of the genome (just the CDS part, when the data came from the full genome) raises an issue. Aligners will try their best to align every read to some location, so some reads may get aligned to positions they did not originate from.
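A toy illustration of that issue, assuming a deliberately naive "aligner" that places each read at the ungapped position with the fewest mismatches. Real aligners use seeds, gaps, and score thresholds, so this is not how Bowtie2 works; the gene names and sequences are invented for the example.

```python
def mismatches(a, b):
    """Count differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_hit(read, reference):
    """Return (name, offset, mismatches) of the lowest-mismatch ungapped placement."""
    best = None
    for name, seq in reference.items():
        for offset in range(len(seq) - len(read) + 1):
            mm = mismatches(read, seq[offset:offset + len(read)])
            if best is None or mm < best[2]:
                best = (name, offset, mm)
    return best

# Hypothetical "full" reference with two similar genes.
full = {
    "geneA": "ATGGCTTCAAACAAGCGAGA",
    "geneB": "ATGGCTTCTAACAAGCGTGA",
}
# Reduced reference: geneB is missing.
reduced = {"geneA": full["geneA"]}

# A read that truly originates from geneB.
read = full["geneB"][3:19]

print(best_hit(read, full))     # placed on geneB without mismatches
print(best_hit(read, reduced))  # still placed, but on geneA with mismatches
```

With the reduced reference the read is not discarded; it is placed on the closest remaining sequence, which is the kind of misassignment described above.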

Using a pseudomapper like kallisto (as noted below) or salmon would be the best option if you don't have the full genome sequence or don't want to use the full genome.

ADD REPLY • link 5.3 years ago by GenoMax 147k

Thanks genomax! That answers my main question. I only have one more: if I downloaded the transcriptome and the genome (from NCBI), would you recommend using some of these NCBI files or still using the CDS file I was given (unpublished)? I ask because my final result must be the DEGs between two phenotypes in plants, so I might get different results depending on whether I use the CDS or the NCBI files (transcriptome or genome). I don't know.

ADD REPLY • link 5.3 years ago by m986 ▴ 10

The link you provided has all kinds of data for this genome, including the transcriptome. You can align to the entire genome and then use the annotation with featureCounts to get gene counts. Or you could use the transcriptome sequences with salmon or kallisto. While the results of these two methods may not agree 100%, the top DE genes should be identified by either one.
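If both analyses are run, one simple way to check that the top DE genes agree is to compare the top-N lists from each. This is a rough sketch with made-up adjusted p-values; the function names are hypothetical, and it assumes both result sets use the same gene IDs.

```python
def top_genes(results, n):
    """results: dict of gene_id -> adjusted p-value. Return the n IDs with the smallest values."""
    ranked = sorted(results.items(), key=lambda item: item[1])
    return [gene for gene, _ in ranked[:n]]

def overlap_fraction(list_a, list_b):
    """Fraction of shared genes, relative to the longer of the two lists."""
    a, b = set(list_a), set(list_b)
    if not a and not b:
        return 1.0
    return len(a & b) / max(len(a), len(b))

# Invented values, for illustration only (not real results).
genome_based = {"g1": 1e-10, "g2": 1e-8, "g3": 0.001, "g4": 0.2}
transcript_based = {"g1": 1e-9, "g2": 1e-7, "g3": 0.04, "g4": 0.01}

for n in (2, 3):
    shared = overlap_fraction(top_genes(genome_based, n), top_genes(transcript_based, n))
    print(n, shared)
```

A high overlap among the strongest hits, with disagreement further down the list, would be consistent with the expectation stated above.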

What is special about your CDS file? Information available at NCBI should be essentially complete?

ADD REPLY • link 5.3 years ago by GenoMax 147k

My advisor told me that this CDS file (which is actually from another variety of chile, but he says it is almost the same as the published one) has "well identified and annotated genes". However, the gene IDs are only names like "Id1", "Id2", etc., so I ran a Blastx to give them a "name". I'm not sure I believe in the "good identification and annotation", because I don't know how to verify it. NCBI has it all, so I don't know whether I should use this CDS file or the NCBI files. I am totally confused.

ADD REPLY • link 5.3 years ago by m986 ▴ 10

Don't mix and match. Either stick with NCBI data or use your own.
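If you do go with your own file, a basic sanity check is whether its records at least look like coding sequences: starting with ATG, having a length that is a multiple of 3, ending in a stop codon, and having no internal stops. The sketch below assumes the standard genetic code and uses invented toy records; it cannot tell you whether the annotation is good, only whether the sequences are plausible CDS.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)

def parse_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)

def cds_checks(seq):
    """Return simple yes/no checks for whether seq looks like a complete CDS."""
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return {
        "starts_with_ATG": seq.startswith("ATG"),
        "length_multiple_of_3": len(seq) % 3 == 0,
        "ends_with_stop": bool(codons) and codons[-1] in STOP_CODONS,
        "no_internal_stop": not any(c in STOP_CODONS for c in codons[:-1]),
    }

# Toy records (hypothetical), not taken from the file discussed here.
toy = [">toy1", "ATGGCT", "TGA", ">toy2", "ATGTAAGCTTGA", ">toy3", "GGCTTCAA"]
for record_id, seq in parse_fasta(toy):
    print(record_id, cds_checks(seq))
```

For a real file, pass an open file handle to parse_fasta and count how many records fail each check.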

ADD REPLY • link 5.3 years ago by GenoMax 147k

score 2 · Accepted Answer · 2019-08-09

5.3 years ago

swbarnes2 14k

Aligning to a list of transcripts is a decent way to proceed, but instead of Bowtie2 I'd look into using a pseudomapper like Kallisto, which is explicitly designed to use transcripts as the reference.
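As a rough sketch of what that could look like, the snippet below only assembles the two kallisto command lines (build an index from the transcript FASTA, then quantify one sample) without running them. It assumes paired-end reads, and every file and directory name is a placeholder; single-end data needs additional options, so check the kallisto manual for your case.

```python
import shlex

def kallisto_commands(reference_fasta, index_path, out_dir, reads_1, reads_2):
    """Build (not run) the kallisto index and quant commands for one paired-end sample."""
    index_cmd = ["kallisto", "index", "-i", index_path, reference_fasta]
    quant_cmd = ["kallisto", "quant", "-i", index_path, "-o", out_dir, reads_1, reads_2]
    return index_cmd, quant_cmd

# All paths below are hypothetical placeholders.
index_cmd, quant_cmd = kallisto_commands(
    reference_fasta="CDS.fa",
    index_path="cds.idx",
    out_dir="sample1_kallisto",
    reads_1="sample1_R1.fastq.gz",
    reads_2="sample1_R2.fastq.gz",
)

print(shlex.join(index_cmd))
print(shlex.join(quant_cmd))
```

Once the printed commands look right, each list can be passed to subprocess.run(cmd, check=True) to execute it, provided kallisto is installed.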

ADD COMMENT • link 5.3 years ago by swbarnes2 14k