Hi all,
I am very interested in get fasta seq (mrna) from a given genes IDs. My first approach has been to download from here https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/ the refMrna.fa.gz file and then use bedtools getfasta -rna
to retrieve it:
e.g: bedtools getfasta -fi refMrna.fa -rna -bed test.bed -fo test.fa.out
(bed file is eg: chr8 9999999 9999998
)
However, after getting the error: 'WARNING. chromosome (chr8) was not found in the FASTA file. Skipping.'
I decided to explore the ref mrna file and I realised the gene IDs are in format refseq format:
e.g: grep -o -E "^>\w+" refMrna.fa | tr -d ">"
> NM_001383, NR_164501 ..
Now I am wondering if there is a smartest way to get the mrna fasta seq or if I should transform the gene ID using the UCSC tool (RefSeq all) which provides an output with lots of refseqs for the given gene and use that file to be passed as a bed file.
Any feedbacks on this issue?
Thanks!
If your intervals are for main chromosomes why are you using the mRNA file instead of the entire genome? If you have the gene ID's you could also use the table browser to get the sequence.
It is because I want to pull the mRNA fasta files and not DNA (which I guess is the case if I pull it from the entire genome)
mRNA file is not going to have the same co-ordinates as the main chromosomes. Can you clarify what coordinates are in your BED file?
So this is my approach: I want to interrogate gene PVT1, I know the coordinates for it is chr8:127794532-128101252 In UCSC table I selected the following options:
This prints out a big file, but I can select the coordinates of interest (my BED file <test.bed> ), which are:
Using that 'portion' of the file I can then do:
Is this approach not correct? What is your suggestion then?
If coordinates in your BED file refer to chromosomal locations then you need to use the whole genome file and get those sections by the method you noted above.
That said you should be able to get the sequence using table browser directly.
Could you please elaborate how to retrieve the sequences directly from the table browser? Thanks!
Can you select the output format as sequence instead of BED?
Of course ... But as 'expected' the range
range=chr8:127794533-128101253
in the output sequence file belongs to: LF385466, LF385467, MA621043 and MA621044, which exactly the same sequences as my approach using the mrna.fa file.Not sure if I am missing something but you want 4 copies of the same sequence with a different fasta header?
They are +ve and -ve strands, but identical two by two ... I can filter that out later on. This is probably a very naive question, but what the header (LF385466, LF385467 ...) means? I know the are the qnames, but why they can be identical sequences and different headers?
Looks like those appear to be "polycomb associated non-coding RNA". You probably know what they are.
https://www.ncbi.nlm.nih.gov/nuccore/LF385466.1/
https://www.ncbi.nlm.nih.gov/nuccore/MA621043.1/
https://www.ncbi.nlm.nih.gov/nuccore/MA621044/
Yes! I've just realised about it! Thank you so much for following this issue!