All - suppose you have a list of rsIDs or chr:pos and a genomic build. There are a plethora of tools out there that can be used to pull genomic sequence. For instance, using eFetch
, something like this might work:
varPs="${dataDir}/uniq_vars.txt"
while read chr startp endp ref alt; do
ACCN_ID=$(awk -v chr=chr${chr} '($1==chr){print $2}' chromosomeRefIds.tsv)
efetch -db nuccore -id ${ACCN_ID} -format fasta -seq_start ${startp} -seq_stop ${endp} >> "${varPs}.out"
echo ${chr} ${startp} ${ref} ${alt} >> "${varPs}.out"
sleep 3
done < "${varPs}"
Or using UCSC, we can do something like this:
# please see help docs at: https://genome.ucsc.edu/goldenPath/help/bigBed.html
chr="10"
startp="123237844"
endp="123357972"
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg19/bbi/clinvar/clinvarMain.bb -chrom=chr${chr} -start=${startp} -end=${endp} stdout
This is all well and good. But, suppose that many different variants. Some are nicely accessioned SNPs, some are indels, some are larger insertion or deletion events, etc.
Starting with something fairly simple - e.g., VCF file format chr, pos, ref alt, etc. etc., what I would like to do is pull out the variant, plus, say, 100 basepairs on either side of the variant.
In the case of SNVs and insertions, this is fairly easy, one can simply download the sequences then make the change. However, for deletions I'd need, for example, to ascertain the number of NTs deleted, add that to the total length (200NTs + length of deletion, deal with any insertions, etc. etc.).
It is not that this is prohibitively difficult, it is just that I am trying to avoid re-inventing the wheel. For instance, I know that I could use one of the above, then use biopython to deal with strand issues etc., its just more script writing than I would like to do. To close, a sort of ideal solution might be able to take HGVS, or TranscriptID, or Chr:Pos - and return two sequences - one with the variant, and one without it.
Desired input ((Build GRCh38))
chr1 42928299 A TGC
or NT2448292.4 c.235_237delinsTAT (p.Lys79Tyr) versus c.[235A>T;237G>T] (p.[Lys79*;Lys79Asn])
this kind of info, whatever it may be, to NT sequence:
ATGCGCGATCGATGTACGTAGCTCGATCGTACTCGA
Asaf - thanks very much.
I appreciate you, and I wanted to let you know I have accepted your answer. However, I am creating a follow-up post (I was torn between that and an edit to this post; I decided to accept this but then link the two. I will post the edit with the link to my main post in just a second.
thank you!