10.8 years ago
shl198
Hi all, I have a SAM file in which the reference names look like gi|359802265|ref|NC_016434.1|. I want to get the name of each reference and also its molecule type (DNA or RNA). I looked at Biopython, but it seems I have to search with either the GI number or the accession number, which means I have to split the identifier first. Does anyone know a command that accepts the format used in the SAM file? And how do I extract the type information? Basically, what I want is something like this:
result = get_info("gi|359802265|ref|NC_016434.1|")
print(result)
[[Spodoptera litura granulovirus, complete genome], [DNA]]
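Splitting the identifier first is a small step. Here is a minimal sketch, assuming every reference name follows the four-field gi|number|database|accession| layout shown above; the function name parse_sam_reference is made up for illustration and is not part of Biopython.

```python
def parse_sam_reference(name):
    """Split an NCBI-style 'gi|<number>|<db>|<accession>|' name into its parts.

    Hypothetical helper: it assumes exactly the four-field layout from the
    question and raises ValueError for anything else.
    """
    # Drop surrounding whitespace and the leading/trailing pipes, then split.
    fields = name.strip().strip("|").split("|")
    if len(fields) != 4 or fields[0] != "gi":
        raise ValueError(f"unexpected reference name: {name!r}")
    _, gi, db, accession = fields
    return {"gi": gi, "db": db, "accession": accession}


parts = parse_sam_reference("gi|359802265|ref|NC_016434.1|")
print(parts["gi"], parts["accession"])  # 359802265 NC_016434.1
```

Either the GI number or the accession can then be passed on to an Entrez query.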
The short answer is to use the NCBI E-utilities: first ELink to get the taxonomy ID from the Entrez taxonomy database using the nucleotide ID, then EFetch to get the species information using the taxonomy ID. Biopython's Bio.Entrez module wraps both.
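In the meantime, the two-step lookup might look roughly like this with Bio.Entrez. This is a sketch rather than a tested pipeline: it needs network access, the function names and the email argument are placeholders of mine, and it assumes the first taxonomy link returned by ELink is the one you want.

```python
def first_linked_id(linksets):
    """Return the first linked UID from a parsed ELink result, or None."""
    for linkset in linksets:
        for linkset_db in linkset.get("LinkSetDb", []):
            for link in linkset_db.get("Link", []):
                return str(link["Id"])
    return None


def species_for_nucleotide(gi, email):
    """Look up the scientific name for a nucleotide ID (hypothetical helper)."""
    # Imported here so the pure helper above works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address

    # Step 1: ELink from the nucleotide database to the taxonomy database.
    handle = Entrez.elink(dbfrom="nucleotide", db="taxonomy", id=gi)
    linksets = Entrez.read(handle)
    handle.close()
    taxid = first_linked_id(linksets)
    if taxid is None:
        return None

    # Step 2: EFetch the taxonomy record and read the species name from it.
    handle = Entrez.efetch(db="taxonomy", id=taxid, retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    return records[0]["ScientificName"]
```

Calling species_for_nucleotide("359802265", "you@example.org") would run both requests. Note that this returns the organism name only; the record description and the DNA/RNA type are not part of the taxonomy record.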
I'll add a longer answer when I have time, unless someone else gets there first.
You could just use EFetch to get the RefSeq entry and extract the sequence type and description from there using SeqIO. Alternatively, you could derive the sequence type from the RefSeq accession prefix (see RefSeq accession numbers and molecule types).
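Both options can be sketched in a few lines. The function names are mine, the EFetch part needs network access, and the "molecule_type" annotation key is what recent Biopython releases fill in from the GenBank LOCUS line, so treat that as an assumption to check against your installed version. The prefix table only covers the standard RefSeq prefixes, and "genomic" is not the same as DNA: RNA virus genomes also get NC_ accessions.

```python
# RefSeq accession prefixes and the molecule classes NCBI documents for them.
REFSEQ_PREFIX_TYPES = {
    "AC_": "genomic", "NC_": "genomic", "NG_": "genomic",
    "NT_": "genomic", "NW_": "genomic", "NZ_": "genomic",
    "NM_": "mRNA", "XM_": "mRNA",
    "NR_": "RNA", "XR_": "RNA",
    "AP_": "protein", "NP_": "protein", "YP_": "protein",
    "XP_": "protein", "WP_": "protein",
}


def molecule_type_from_accession(accession):
    """Guess the molecule class from a RefSeq prefix; None if not RefSeq."""
    return REFSEQ_PREFIX_TYPES.get(accession[:3].upper())


def fetch_description_and_type(accession, email):
    """Fetch the GenBank record and return (description, molecule type)."""
    # Imported here so the prefix lookup works without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    # .get() returns None if this Biopython version does not set the key.
    return record.description, record.annotations.get("molecule_type")


print(molecule_type_from_accession("NC_016434.1"))  # genomic
```

The EFetch route is the safer one when you need DNA versus RNA specifically, since it reads the molecule type stated in the record itself instead of inferring it from the accession.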
True, assuming all the references are RefSeq IDs.
What has the SAM/BAM format got to do with this?