I am struggling to identify SNPs in population genomic data that I have. So far my pipeline follows:
- Download fastq files from EBI-ENA
- Trim reads to match quality threshold of 20
- Map reads to reference sequence (BWA aln/sampe)
- Convert SAM to pileup (view
-q 20
, sort, mpileup)
Ultimately, I want to be able to query the pileup file and retrieve SNP information at each base eg:
chr pos ref cov A T C G N
3L 12798928 C 34 0 0 29 5 0
As I'm very new to bioinformatics and processing this kind of data I don't know exactly where I've gone wrong or if there are better/more efficient ways to complete this.
Problems I have struck so far:
- All reference bases in my pileup file are N
- How can I convert pileup entries to something similar to above? Can this be done with the BAM files as they're much smaller?
Any advice is greatly appreciated, I am reading/referring to the Biostar handbook but this seems like quite a niche problem.
Regarding the huge size of pileup files, it would be better if you don't save all that to a file bu just connect the output to your script with a pipe (or use pysam).