Hey guys,
I am trying to retrieve consensus fasta sequences of a number of regions for specific individuals from 1000G, and further to make a multiple sequence alignment.
The pipeline I'm using for this task is not working. I would appreciate your help to let me understand what's wrong with it - and preferably how to speed up things.
Say, I have the following reference sequence (reference.fasta) for which I would like to retrieve the corresponding sequences for a number of 1000G individuals.
>hg19_knownGene_uc011dlx.1_7 range=chr6:29695734-29695883 5'pad=0 3'pad=0 strand=+ repeatMasking=none GCCCAGCTCCTGAGTGTCTCTACCTCTCAAACAAGTATTCTCATCCAGGAG
Use samtools to get the sub-section of interest
./samtools view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam 6:29695734-29695883 > subsection.sam
Convert SAM to BAM (SAM header is present: 84 sequences)
./samtools view -bS subection.sam > subsection.bam
Sort and index the bam file
./samtools sort subsection.bam subsection_sorted.bam
./samtools index subsection_sorted.bam
Use the reference fasta sequence and the bam reads to create a consensus sequence
./samtools mpileup -uf reference.fasta subsection_sorted.bam | ./bcftools/bcftools view -cg - | ./bcftools/vcfutils.pl vcf2fq > subsection_consensus.fastq
Use seqtk to convert fastq to fasta (optional: while marking sites with <20x coverage)
./seqtk seq subsection_consensus.fastq (-q20) > subsection_consensus.fa
Final output: a huge file (~60 MB) containing a sequence string of "n".
What I would like to have instead after a number of iteration across individuals would hopefully look like this:
>reference GCCCAGCTCCTGAGTGTCTCTACCTCTCAAACAAGTATTCTCATCCAGGAG
>NA00001 ......................A...........................G..
>NA00002 ......................................T..............
Thanks in advance for any help / suggestion.
Hi Pierre, thank you for your reply. To see whether it is indeed a problem related to reference sequence, I used Grch37 reference and tried to retrieve the first 10 bases of the following transcript chr2:72356367-72356377: >ENSG00000003137|ENST00000001146|ENSP00000001146|2|72356367|72375167 ATGCTCTTTG However, I ended up with an even larger fasta file containing only non-bases. Any idea why this is not working?
I am very likely pointing to the wrong data file, however, cannot really figure out the mistake. Since I'm only interested in the exome, I thought I might try this instead: samtools view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/exome_alignment/HG00096.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam 2:72356367-72356377 This did not work.