I'm new to RNA-seq, so first, I'm sorry if this question is too basic or if I mix up the definitions. My advisor gave me a FASTA file (for a non-model organism) and told me to use it as the reference for mapping (he suggested Bowtie2). The "readme" file for it says [This file contains the organism "genes" annotated], but it is not a GTF/GFF file. My file looks like this:
>Id1
ATGGCTTCAAACAAGCGAGAAAGTCA...
>Id2
ATGGGCAGTCTTGGTCCTATTGAAAAT...
>Id3
ATGATACTTTCCGTTTTGTCGAGCCCT...
So, to me, this file is the CDS.fa (CoDing Sequence). I have read around but it's still not clear to me: is it possible to run an alignment using this file as the reference?
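One rough way to check whether a FASTA file like this really looks like coding sequences is to test each record for the usual CDS hallmarks: it starts with ATG, its length is a multiple of 3, and it ends in a stop codon. The sketch below is only a heuristic under those assumptions. The function names and the two-record example are made up for illustration, and real CDS sets can legitimately contain partial sequences or sequences with the stop codon trimmed off.

```python
from typing import Dict, Iterable, Iterator, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def cds_checks(seq: str) -> Dict[str, bool]:
    """Heuristic hallmarks of a complete coding sequence."""
    return {
        "starts_with_atg": seq.startswith("ATG"),
        "length_multiple_of_3": len(seq) > 0 and len(seq) % 3 == 0,
        "ends_with_stop": seq[-3:] in STOP_CODONS,
    }


def summarize(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count how many records pass each check."""
    summary = {
        "total": 0,
        "starts_with_atg": 0,
        "length_multiple_of_3": 0,
        "ends_with_stop": 0,
    }
    for _, seq in records:
        summary["total"] += 1
        for check, passed in cds_checks(seq).items():
            if passed:
                summary[check] += 1
    return summary


# Hypothetical two-record example (not taken from the real file).
example = [">Id1", "ATGGCTTAA", ">Id2", "ATGGGCAGTC"]
example_summary = summarize(read_fasta(example))
print(example_summary)
```

For a real file, something like `summarize(read_fasta(open("CDS.fa")))` would give the counts. If nearly all records pass all three checks, the file is very likely a set of complete CDS; if many fail, it may hold transcripts with UTRs, partial genes, or something else.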
Thank you! Actually, the fastq files are what I want to align. The FASTA file my advisor gave me is the CDS (CoDing Sequence); it is the "reference" he suggests using for the alignment. I agree with you about using a reference genome, but my advisor insists on aligning my fastq files against that CDS file. Is that correct?
While that is not incorrect, using a reduced representation of the genome (just the CDS part, when the data came from the full genome) raises an issue. Aligners will try their best to place every read somewhere, so it is possible that some reads get aligned to positions they did not originate from.
Using a pseudomapper like kallisto (as noted below) or salmon would be the best option if you don't have the full genome sequence or don't want to use the full genome.
Thanks genomax!!! That answers my main question. I have only one more: if I downloaded the transcriptome and the genome (from NCBI), do you recommend using those NCBI files, or still using the CDS file I was given (unpublished)? I ask because my final result must be the DEGs between two phenotypes in plants, so I might get different results depending on whether I use the CDS or the NCBI files (transcriptome or genome). I don't know.
Link you provided has all kinds of data for this genome, including the transcriptome. You can use the entire genome and then use the annotation with featureCounts to get gene counts. Or you could use the transcriptome sequences with salmon or kallisto. While the results may not agree 100% if you do the analysis by these two methods, the top DE genes should be identified by either method. What is special about your CDS file? Information available at NCBI should be essentially complete?
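To make the two routes above a bit more concrete, here is a minimal sketch that only builds the command lines and does not run anything. The file and directory names are hypothetical placeholders, single-end reads are assumed, and the flags are the basic ones as I understand them from each tool's documentation, so please check them against `salmon --help` and `featureCounts` usage for your installed versions.

```python
import shlex
from typing import List


def salmon_index_cmd(transcripts_fa: str, index_dir: str) -> List[str]:
    """Build a salmon index over the transcriptome FASTA."""
    return ["salmon", "index", "-t", transcripts_fa, "-i", index_dir]


def salmon_quant_cmd(index_dir: str, reads_fq: str, out_dir: str) -> List[str]:
    """Quantify single-end reads; '-l A' asks salmon to infer the library type."""
    return ["salmon", "quant", "-i", index_dir, "-l", "A",
            "-r", reads_fq, "-o", out_dir]


def featurecounts_cmd(gtf: str, bam: str, out_txt: str, threads: int = 1) -> List[str]:
    """Count reads per gene from a genome alignment plus a GTF annotation."""
    return ["featureCounts", "-T", str(threads), "-a", gtf, "-o", out_txt, bam]


# Hypothetical file names, for illustration only.
commands = [
    salmon_index_cmd("transcriptome.fa", "salmon_index"),
    salmon_quant_cmd("salmon_index", "sample1.fastq.gz", "quant/sample1"),
    featurecounts_cmd("annotation.gtf", "sample1.bam", "counts.txt", threads=4),
]
for cmd in commands:
    print(shlex.join(cmd))
```

The printed lines can be pasted into a shell or passed to `subprocess.run(cmd, check=True)` once the paths are real. Paired-end data would need the corresponding paired-end options for each tool, which are left out here on purpose.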
My advisor told me that this CDS file (which is actually from another variety of chile, but he says it is almost the same as the published one) has "well identified and annotated genes". But the gene IDs are just names like "Id1", "Id2", etc., so I ran a BLASTX search to give them a "name". I'm not sure I believe in the "good identification and annotation", because I don't know how to verify it. NCBI has it all, so I don't know whether I should use this CDS file or the NCBI files. Total confusion in my mind.
Don't mix and match. Either stick with NCBI data or use your own.