I'm working with some existing RNA-seq data, for which there is a FASTA file with arbitrary IDs and the cDNA sequences, and a CSV file with the arbitrary IDs and the read counts in different tissues. Currently, I BLAST against the FASTA file to find the arbitrary ID and then look that ID up in the expression CSV. I would like to improve this by generating a CSV with the actual locus ID/gene ID and the expression data, but I've been struggling to do this. I'm thinking I could BLAST the cDNA sequences against the genome, find the corresponding annotation, and then generate a list of the arbitrary IDs and the gene IDs, which I can use to replace the IDs in the expression CSV. Does anyone have any advice on how to do this?
Are the existing identifiers genuinely arbitrary, or do they refer to, for example, identifiers in a little-known database? If they are arbitrary, I might use a title like "Find gene ID from a cDNA sequence", as currently your title makes it look from the front page like your question is going to be about, e.g., converting Ensembl IDs into gene names.
Hi, sorry for the ambiguity (I wasn't sure what to title the question!). They don't have any meaning outside of this dataset. I believe it was done before the genome was sequenced, so some sort of description was needed.
OK, so basically, you have a FASTA file of a few thousand sequences that have meaningless IDs, and you want actual IDs for them?
You can do what you're currently doing: BLAST to find the "real" IDs and replace those meaningless IDs.
It might take a while if you have that many sequences, so I'd suggest using an efficient alignment program (create an index for your organism's reference sequences and align your FASTA sequences to that index).
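Once you have hits, the ID replacement itself could look something like the rough Python sketch below. It assumes a few things that aren't in your question: that the hits are in BLAST tabular format (`-outfmt 6`, where column 1 is the query ID, column 2 the subject ID and column 12 the bit score), that the subject IDs are already gene/transcript IDs (e.g. because you searched against annotated transcripts rather than raw chromosomes), and that the first column of your expression CSV holds the arbitrary ID. The function names `best_hits` and `relabel_counts` are made up for illustration.

```python
import csv
import io


def best_hits(blast_lines):
    """Map each query ID to the subject ID of its highest-bitscore hit.

    Assumes BLAST tabular output (-outfmt 6): tab-separated, with the
    query ID in column 1, subject ID in column 2, bit score in column 12.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        bitscore = float(fields[11])
        if query_id not in best or bitscore > best[query_id][1]:
            best[query_id] = (subject_id, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def relabel_counts(in_handle, out_handle, id_map):
    """Copy the expression CSV, prepending a gene_id column.

    Assumes the first CSV column is the arbitrary ID. The original ID is
    kept alongside the new one; sequences without a hit get "NA".
    """
    reader = csv.reader(in_handle)
    writer = csv.writer(out_handle, lineterminator="\n")
    header = next(reader)
    writer.writerow(["gene_id"] + header)
    for row in reader:
        if not row:
            continue  # skip empty lines in the CSV
        writer.writerow([id_map.get(row[0], "NA")] + row)


if __name__ == "__main__":
    # Tiny in-memory example with invented IDs; swap in open("hits.tsv")
    # and open("counts.csv") (hypothetical file names) for real data.
    hits = [
        "contig1\tGENE_A\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180",
        "contig1\tGENE_B\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-60\t200",
    ]
    counts = io.StringIO("id,leaf,root\ncontig1,5,7\ncontig2,1,2\n")
    out = io.StringIO()
    relabel_counts(counts, out, best_hits(hits))
    print(out.getvalue())
```

Keeping the old ID in its own column rather than overwriting it makes it easier to spot sequences with no hit, or several arbitrary IDs that end up on the same gene, before you decide how to handle them.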