Consensus Fasta Sequence Of A Given Genomic Region For Individuals From 1000G
2
2
Entering edit mode
11.1 years ago
Pierre ▴ 130

Hey guys,

I am trying to retrieve consensus fasta sequences of a number of regions for specific individuals from 1000G, and further to make a multiple sequence alignment.

The pipeline I'm using for this task is not working. I would appreciate your help to let me understand what's wrong with it - and preferably how to speed up things.

Say, I have the following reference sequence (reference.fasta) for which I would like to retrieve the corresponding sequences for a number of 1000G individuals.

>hg19_knownGene_uc011dlx.1_7 range=chr6:29695734-29695883 5'pad=0 3'pad=0 strand=+ repeatMasking=none                                     GCCCAGCTCCTGAGTGTCTCTACCTCTCAAACAAGTATTCTCATCCAGGAG

Use samtools to get the sub-section of interest

  ./samtools view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam 6:29695734-29695883  > subsection.sam

Convert SAM to BAM (SAM header is present: 84 sequences)

./samtools view -bS subection.sam > subsection.bam

Sort and index the bam file

./samtools sort subsection.bam subsection_sorted.bam
./samtools index subsection_sorted.bam

Use the reference fasta sequence and the bam reads to create a consensus sequence

./samtools mpileup -uf reference.fasta subsection_sorted.bam | ./bcftools/bcftools view -cg - | ./bcftools/vcfutils.pl vcf2fq > subsection_consensus.fastq

Use seqtk to convert fastq to fasta (optional: while marking sites with <20x coverage)

./seqtk seq subsection_consensus.fastq (-q20) > subsection_consensus.fa

Final output: a huge file (~60 MB) containing a sequence string of "n".

What I would like to have instead after a number of iteration across individuals would hopefully look like this:

>reference GCCCAGCTCCTGAGTGTCTCTACCTCTCAAACAAGTATTCTCATCCAGGAG
>NA00001 ......................A...........................G..
>NA00002 ......................................T..............

Thanks in advance for any help / suggestion.

1000genomes consensus samtools • 5.8k views
ADD COMMENT
2
Entering edit mode
11.1 years ago

I suppose you're mixing the chromosomes from hg19 (with a 'chr' prefix) and the 1KGenomes chromosomes without 'chr' prefix; That is why you're getting some 'NNNNNNNNNN': the tool cannot find the chromosomes. Use the reference sequence of 1000genomes instead of hg19. http://www.1000genomes.org/faq/which-reference-assembly-do-you-use

ADD COMMENT
0
Entering edit mode

Hi Pierre, thank you for your reply. To see whether it is indeed a problem related to reference sequence, I used Grch37 reference and tried to retrieve the first 10 bases of the following transcript chr2:72356367-72356377: >ENSG00000003137|ENST00000001146|ENSP00000001146|2|72356367|72375167 ATGCTCTTTG However, I ended up with an even larger fasta file containing only non-bases. Any idea why this is not working?

ADD REPLY
0
Entering edit mode

I am very likely pointing to the wrong data file, however, cannot really figure out the mistake. Since I'm only interested in the exome, I thought I might try this instead: samtools view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/exome_alignment/HG00096.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam 2:72356367-72356377 This did not work.

ADD REPLY
0
Entering edit mode
9.1 years ago
greener ▴ 10

I am trying to do the same thing. I would like to generate a fasta file on a predefined region chr7:128575994-128592088 from the 1000 genomes project. Ideally I'd like to have the output of that region from each individual. It would be helpful to know if the above approach was every successful (and how they got it to work) or if folks could provide me tips on how to complete this. Any help or suggestions are appreciated.

Thanks

ADD COMMENT

Login before adding your answer.

Traffic: 1978 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6