Finding SNPs in population data: help with pipeline
1
1
Entering edit mode
7.6 years ago
Alex ▴ 10

I am struggling to identify SNPs in population genomic data that I have. So far my pipeline follows:

  1. Download fastq files from EBI-ENA
  2. Trim reads to match quality threshold of 20
  3. Map reads to reference sequence (BWA aln/sampe)
  4. Convert SAM to pileup (view -q 20, sort, mpileup)

Ultimately, I want to be able to query the pileup file and retrieve SNP information at each base eg:

chr pos        ref  cov A   T   C   G   N
3L  12798928    C   34  0   0   29  5   0

As I'm very new to bioinformatics and processing this kind of data I don't know exactly where I've gone wrong or if there are better/more efficient ways to complete this.

Problems I have struck so far:

  1. All reference bases in my pileup file are N
  2. How can I convert pileup entries to something similar to above? Can this be done with the BAM files as they're much smaller?

Any advice is greatly appreciated, I am reading/referring to the Biostar handbook but this seems like quite a niche problem.

SNP population pileup genome • 2.3k views
ADD COMMENT
0
Entering edit mode
  1. Did you specify the reference fasta genome (option -f in samtools mpileup)?
  2. You can parse the output of samtools mpileup, it shouldn't be too difficult: https://en.wikipedia.org/wiki/Pileup_format Alternatively, if you are using python you can use the pysam module, it is very nice and easy to use, and does all the parsing for you. More information here: http://pysam.readthedocs.io/en/latest/

Regarding the huge size of pileup files, it would be better if you don't save all that to a file bu just connect the output to your script with a pipe (or use pysam).

ADD REPLY
0
Entering edit mode
7.0 years ago

Ultimately, I want to be able to query the pileup file and retrieve SNP information at each base

I've writtern FindAllCoverageAtPosition http://lindenb.github.io/jvarkit/FindAllCoverageAtPosition.html (see also How to generate alignment report from bam files )

it will output the bases at each position:

$ find ./testdata/ -type f -name "*.bam" | \
 java -jar dist/findallcoverageatposition.jar -p rotavirus:100
#File              CHROM      POS  SAMPLE  DEPTH  M    I  D  N  S   H  P  EQ  X  Base(A)  Base(C)  Base(G)  Base(T)  Base(N)  Base(^)  Base(-)
./testdata/S4.bam  rotavirus  100  S4      126    126  0  0  0  29  0  0  0   0  5        0        0        121      0        0        0
./testdata/S1.bam  rotavirus  100  S1      317    317  1  0  0  50  0  0  0   0  27       0        1        289      0        1        0
./testdata/S2.bam  rotavirus  100  S2      311    311  0  1  0  60  0  0  0   0  29       1        0        281      0        0        1
./testdata/S3.bam  rotavirus  100  S3      446    446  1  0  0  86  0  0  0   0  39       0        1        406      0        1        0
ADD COMMENT

Login before adding your answer.

Traffic: 1489 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6