How to extract fasta sequence of a gene from BAM files generated using Miseq sequencing
1
1
Entering edit mode
8.8 years ago

I have 30 viral samples sequenced using MiSeq. I received forward and reverse fastq files and also bam files which are aligned with the reference virus genome. I have been asked to build phylogenetic tree for 30 sequenced viral genome based on a gene. I believe, I need to select a gene sequence from all the samples and do a multiple sequence alignment and build a phylogenetic tree. But I am struck in fetching the gene sequence from bam files. How do I select one gene sequence for all the 30 viral sequence? Any suggestions? Can samtools do it?

gene Assembly sequencing BAM alignment • 12k views
ADD COMMENT
0
Entering edit mode
8.8 years ago
GenoMax 147k

Extract Reads From A Bam File That Fall Within A Given Region

Check the manual page for samtools (regions): http://www.htslib.org/doc/samtools.html

You will obviously want to separate the reads into sample specific pools (not sure if your bam file has all 30 samples in the same file) before you do MSA.

ADD COMMENT
0
Entering edit mode

Thanks genomax for your help. I have 30 separate bam files. For example, this is a reference gene of interest for me NC_026434 which is NA Neuraminidase. For this gene, the CDS starts at1 and ends at 1410.

I am looking for 30 fasta sequence file to do multiple sequence alignment. How do I get the fasta sequence mapped to the reference gene for each viral bam file?

ADD REPLY
0
Entering edit mode

Are those bam files aligned or unaligned? If they are aligned what are they aligned against as a reference (hopefully the same genome as the gene of your interest). If they are not aligned or aligned against a non-interesting genome then additional work is going to be needed.

ADD REPLY
0
Entering edit mode

Dear genomax,

I have aligned against a viral genome which has only 8 gene segments. Currently no genome sequence is available, so I took the fasta sequence of all 8 genes and merged them to a single fasta file. That single merged fasta file has been used for aligning with the sequencing reads(fastq) to generate the bam file (aligned against viral referral genes).

ADD REPLY
1
Entering edit mode

Assuming that you used a multi-fasta file (with 8 sequences) to do the alignments you can easily split the bam for each sample into 8 gene specific bam's How To Split A Bam File By Chromosome. Then use something (BAM/SAM to FASTA conversion) to convert the bam files to fasta.

You will most likely need to sub-sample the reads (I assume you have a ton of coverage) to do the MSA.

Other option is you could try to generate a consensus for each gene from the individual bam's and then use the consensus to do the MSA. It depends on what you finally want to do.

ADD REPLY
0
Entering edit mode

I am currently working with Influenza virus. I don't have chromosome numbers in my case. This is the format of the reference fasta file, I used for alignment.

>gi|758967842|ref|NC_026438.1| segment 1 polymerase PB2 (PB2) gene, complete cds
Sequence.................
>gi|758899361|ref|NC_026435.1| segment 2 polymerase PB1 (PB1) gene, complete cds
Sequence.....
>gi|758967835|ref|NC_026437.1| segment 3 polymerase PA (PA) gene, complete cds
Sequence....
>gi|758899355|ref|NC_026433.1|segment 4 hemagglutinin (HA) gene, complete cds
Sequence....
>gi|758899363|ref|NC_026436.1|segment 5 nucleocapsid protein (NP) gene, complete cds
Sequence....
>gi|758899359|ref|NC_026434.1|segment 6 neuraminidase (NA) gene, complete cds
Sequence....
>gi|758899349|ref|NC_026431.1|segment 7 matrix protein 2 (M2) and matrix protein 1 (M1) genes, complete cds
Sequence....
>gi|758899352|ref|NC_026432.1|segment 8 nuclear export protein (NEP) and nonstructural protein 1 (NS1) genes, complete cds
Sequence....
ADD REPLY
0
Entering edit mode

Chromosome equates to a fasta identifier (it does not strictly have to be a real chromosome) for the splitting.

Though having all those extra characters (bars etc) in your headers may do some interesting thing. Tell us if the split works.

ADD REPLY
0
Entering edit mode

Thanks a lot for ur wonderful links and comments. It is really useful. I have two goals to achieve in my case.

1) Get the gene sequence for all 30 viral samples and build a phylogenetic tree for them and see how is the relationship between them

2) After fetching the gene sequence from sequencing reads, I am planning to perform codon usage analysis across all the 30 samples to investigate any codon bias in them.

ADD REPLY
0
Entering edit mode

Would it make sense to generate a consensus from the bam and then create the tree from the consensus for each sample?

Have you inspected the alignments and are there lots/small number of SNP or other changes?

ADD REPLY
0
Entering edit mode

After variant calling step, I found out 800 snps has been passed the filtering condition of GATK. Out of 800, 250 are non-synonymous and 550 are synonymous mutations.

ADD REPLY
0
Entering edit mode

Is that for one sample/gene? How many reads were there in total?

ADD REPLY
0
Entering edit mode

It is not for one sample or one gene. All samples have reads ranging from 57,000 to 430,000 reads.

sample01 - 57,000 reads
sample2 - 254,196
sample 3 - 77,716
sample 4 - 99,478
sample5 - 429,348
.....
sample 30 - 174,554

I supplied the above mentioned fasta sequence of reference virus and all 30 bam files together to the GATK pipeline to generate the sample VCF file.

Please find the sample VCF file for instance I listed only some vcf records from the total list of 800 vcf records.

#chrom                            Pos    ID    Ref    Alt    Qual        Filter    Info                    Sample01                            Sample02                        Sample03                            ..    ..    Sample30
gi|758967842|ref|NC_026438.1|     161    .     G      A      10917       Pass      AC=33;AF=1.00;AN=33     1:3,1499:1502:99:1:1.00:64814,0     1:0,172:172:99:1:1.00:7476,0    1:0,3013:3014:99:1:1.00:130399,0    ..    ..    1:0,921:921:99:1:1.00:39672,0
gi|758967842|ref|NC_026438.1|     174    .     A      G      2391        Pass      AC=1;AF=0.030;AN=33     0:1213,2:1218:99:0:0.00:0,52499     0:151,0:152:99:0:0.00:0,6523    0:2992,0:2998:99:0:0.00:0,129410    ..    ..    0:870,2:879:99:0:0.00:0,37491
gi|758967842|ref|NC_026438.1|     198    .     G      A      765371      Pass      AC=33;AF=1.00;AN=33     1:31,391:422:99:1:1.00:15535,0      1:1,35:36:99:1:1.00:1469,0      1:0,2989:2989:99:1:1.00:129052,0    ..    ..    1:3,657:660:99:1:1.00:28104,0
gi|758967842|ref|NC_026438.1|     201    .     T      C      1509        Pass      AC=1;AF=0.030;AN=33     0:1250,147:1296:99:0:0.00:0,61606   0:156,0:159:99:0:0.00:0,8034    0:3023,0:3045:99:0:0.00:0,156797    ..    ..    0:912,0:925:99:0:0.00:0,46734
gi|758899361|ref|NC_026435.1|     99     .     A      T      168057.91   Pass      AC=2;AF=0.061;AN=33     0:1459,0:1462:99:0:0.00:0,62645     0:36,0:36:99:0:0.00:0,1564      0:610,0:611:99:0:0.00:0,26080       ..    ..    0:299,0:301:99:0:0.00:0,12749
..                                                                 
gi|758967835|ref|NC_026437.1|     78     .     A      G      18535.58    Pass      AC=4;AF=0.125;AN=32     0:1059,7:1074:99:0:0.00:0,44473     0:1059,7:1074:99:0:0.00:0,44473 0:1629,4:1638:99:0:0.00:0,69144     ..    ..    0:950,2:955:99:0:0.00:0,40150
..
ADD REPLY
1
Entering edit mode

Get a consensus sequence for each gene for every sample (based on the alignment) and then go for MSA.

ADD REPLY
0
Entering edit mode

I am getting an error while using the following command to split bam file.

[sge_user@gw Sample_01]$ samtools view -b Sample_01.sorted.bam gi|758967842|ref|NC_026438.1| > NC_026438.bam
-bash: 758967842: command not found
-bash: ref: command not found
-bash: NC_026438.1: command not found
[main_samview] region "gi" specifies an unknown reference name. Continue anyway.
[sge_user@gw Sample_01]$ head -20 Sample_01.sorted.sam
@HD    VN:1.3    SO:coordinate
@SQ    SN:gi|758967842|ref|NC_026438.1|    LN:2280
@SQ    SN:gi|758899361|ref|NC_026435.1|    LN:2274
@SQ    SN:gi|758967835|ref|NC_026437.1|    LN:2151
@SQ    SN:gi|758899355|ref|NC_026433.1|    LN:1701
@SQ    SN:gi|758899363|ref|NC_026436.1|    LN:1497
@SQ    SN:gi|758899359|ref|NC_026434.1|    LN:1410
@SQ    SN:gi|758899349|ref|NC_026431.1|    LN:982
@SQ    SN:gi|758899352|ref|NC_026432.1|    LN:863
ADD REPLY
1
Entering edit mode

I was afraid that was going to happen because of the | in your headers. Try to see if escaping the | works:

$ samtools view Sample_01.sorted.bam gi\|758967842\|ref\|NC_026438.1\| > NC_026438.bam

Otherwise try one of the other options to split (bamtools).

ADD REPLY
0
Entering edit mode

Dear Genomax2,

Thanks for your suggestions and letting me know about bamtools.

I have two bam files. one is sample01.sorted.bam and sample01.sorted.picard.dedup.bam. Which bam file should I use to fetch the fasta sequence of the gene?

But for the below case, I used sorted.bam.

MacBook-Pro:bamtools Analysis$ bin/bamtools filter -region gi\|758967842\|ref\|NC_026438.1\|:1..2280 -in Sample01.sorted.bam -out Sample01_NC_026438.bam
MacBook-Pro:bamtools Analysis$ bin/bamtools convert -format fasta -in Sample01_NC_026438.bam -out Sample01_NC_026438.fasta
MacBook-Pro:bamtools Analysis$ less Sample01_NC_026438.fasta | grep '>' | wc -l
   13088

I could see 13,000 reads matching to this gene region NC_026438 (1-2280). Am I correct?

>HWI-M00640:23:000000000-AD82C:1:1101:18970:2397
AATATGGAGAGAATAAAAGAACTGAGAGATCTAATGTCGCAGTCCCGCAC
TCGCGAGATACTCACTAAGACCACTGTGGACCATATGGCCATAATCAAAA
AGTACACATCAGGAAGGCAAGAGAAGAACCCCGCACTCAGAATGAAGTGG
A
>HWI-M00640:23:000000000-AD82C:1:1101:16400:2426
AATATATTCAATATGGAGAGAATAAAAGAACTGAGAGATCTAATGTCGCA
GTCCCGCACTCGCGAGATACTCACTAAGACCACTGTGGACCATATCTGTC
TCTTATACACATCTCCGAGCCCACGAGACTAAGGCGAATCTCGTATGCCG
T
>HWI-M00640:23:000000000-AD82C:1:1101:13163:2471
GAGCAAAAGCAGGTCAAATATATTCAATATGGAGAGAATAAAAGAACTGA
GAGATCTAATGTCGCAGTCCCGCACTCGCGAGATACTCACTAAGACCACT
GTGGACCATATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAA
G
....

....

>HWI-M00640:23:000000000-AD82C:1:2114:18630:12156
GGTCGTTTTTAAACAATTCGACACTAATTGATGGCCATCCGA

What is the next step, how should I find the starting of the gene and ending of the gene from these reads?

ADD REPLY
1
Entering edit mode

What number do you get with the de-duped file?

Instead of getting the individual reads (which are 13000+ here and would be hard to handle with an MSA program) how about getting the consensus with samtools (look at the examples here: http://www.htslib.org/doc/samtools.html )

ADD REPLY

Login before adding your answer.

Traffic: 2178 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6