Hi all,
I'm new to RNA-Seq analysis and tried to find an answer to this elsewhere. I have run salmon quantification on my paired-end fastq files, which generated quant.sf files for all my samples. I now need to import these into R with tximport, which requires a data frame of transcript IDs and the corresponding gene IDs. My transcriptome .fasta reference file is from NCBI and does not contain the gene IDs, only the NM_ transcript identifiers.
How do I find the gene IDs I need, and in the correct order? The reference transcriptome I have been working from is Thoroughbred racehorse (EquCab3.0).
Thanks in advance!
Gene names appear to be in the fasta headers, in () brackets.
Edit: Removing the code because of the problem noted by @ATPoint below. Parsing should still be feasible but will need a program.
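For what it's worth, here is a minimal Python sketch of what such a program might look like. It is not the code that was removed. It assumes the gene symbol is the last parenthesised group in each header, so that brackets earlier in the description are skipped, and that the transcript ID salmon reports is the header text up to the first whitespace. The function name `tx2gene_from_headers` and the two headers are made up for illustration, so check the output against your real file before trusting it.

```python
import re

# Matches any "(...)" group that has no nested parentheses inside it.
PAREN_GROUP = re.compile(r"\(([^()]+)\)")

def tx2gene_from_headers(lines):
    """Yield (transcript_id, gene_symbol) pairs from FASTA header lines.

    Assumption: the gene symbol is the LAST parenthesised group in the header.
    Headers without any parentheses yield None as the gene symbol.
    """
    for line in lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        header = line[1:].strip()
        if not header:
            continue  # skip empty headers
        tx_id = header.split(None, 1)[0]  # text up to the first whitespace
        groups = PAREN_GROUP.findall(header)
        yield tx_id, (groups[-1] if groups else None)

# Made-up headers for illustration only (not real RefSeq records).
example = [
    ">NM_000001.1 Equus caballus example protein (alpha subunit) (GENE1), mRNA",
    "ATGC",
    ">NM_000002.2 Equus caballus another example (GENE2), transcript variant 1, mRNA",
]

for tx_id, gene in tx2gene_from_headers(example):
    print(tx_id, gene, sep="\t")
```

For a real file you would pass an open file handle instead of `example`, write the pairs to a tab-separated file, and read that into R (e.g. with `read.delim`) as the tx2gene data frame. As far as I know, tximport matches rows by transcript ID, so the row order should not matter, only that the transcript column comes first and the gene column second.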
Careful with that. I have come across cases where RefSeq has brackets in those headers as part of the gene description rather than to delimit the gene name, e.g. (dummy example):
and there goes parsing these gene names...
Edit:
Here for example (real data):