Entering edit mode
23 months ago
Batel
•
0
Hi,
I'm trying to get the alleles of non-SNP locations from the hg19 reference genome in FASTA format.
1) What is the best way to work with FASTA - either python/R/bash 2) How do you navigate in FASTA files, specifically in hg19? 3) How do I find the list of non-SNP bases, that are indeed sequenced in hg19?
Thanks, B.
What is a non-SNP location? How it would be in FASTA format?
Could you make a mock-up of the result you want?
Yes, of course.
I want something that looks like this (positions and alleles are made up)
(2347645 is missing because it was a SNP)
A non-SNP is a location that has no variability. for example here, in the UCSC genome browser, I want to get all these bases (marked in yellow) excluding the ones that are SNPs (marked in red), but in a table format as above.
Oh I see, this is certainly an unusual use case. Here is a minimal example you can build upon: https://colab.research.google.com/drive/1-n7smJGy-rdkbD9aT0Lkk5BeID0ix2rv?usp=sharing
Because of the size of the dbSNP it is not a computationally speedy process. I didn't run this all the way through, so there might be errors. However, colab at least have good download speed and enough storage to handle it.
Barslmn, Thank you so much for your reply. I didn't realize it would take that much time to produce the table.
I can actually subset the desired positions to areas in the range of 40 bases around a SNP. so for example, I have these target SNPs:
I would like to get:
Then on to the next SNP. I'm okay to keep the SNPs in the table - I can exclude them afterward.
The first problem is I don't know if all these positions are actually sequenced in hg19.
The second is how to produce the output.
I would very much appreciate your help.
Thank you, B.
This is a better approach and easier one to solve. You can use samtools to retrieve your sequences.
After this you can check the colab notebook to format it into a table.
Thank you again, barslmn!
I've followed your answer and indeed got the sequence I wanted, but the table does not hold the position, and the order is mixed : for example, for SNP rs2843964:
Then:
If the alleles stay in the right order it will be easy to just concat the base position next to the alleles... Thank you again!
In previous example we need to
sort
in order thejoin
the two files. Here we don't need to sort anything. However, we need to get the correct position sincenl
starts from 1. We can add start position tonl
index in order to get the real position.This would output: