Entering edit mode
6.4 years ago
marongiu.luigi
▴
730
Hello,
I have a SAM file (BAM actually) that I would like to convert to a fasta. I found different solutions online for instance this or this, but the utput I get is in this form:
>file-1316
TAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCG
GCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAGGGTCAAATACGGTCCCTTT
TTCAAAGATTTTAGGGGAAT
>file-354
CGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGC
CCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGTGCTTATAG
GGTCAAATACGGTCCCTATT
>file-2000
TAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAAAGTCCAGGAA
AAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGGTAGAGGGGGT
GCTTATAGGGTCAAATACGG
>file-2930
GTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAGCTCACAAAAAACAGCAAGAA
AGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCAGTGCTGTGCTATATATGGGG
TAGAGGGGGTGCTTATAGGG
>file-1270
CCCCTCTAAATAGCGTCCATGTTAGTATATTAACAAGTCGTATAAATACCTGTTAGTTAG
CTCACAAAAAACAGCAAGAAAGTCCAGGAAAAAGCAGCCCAGACAGAAAGAGCTGCGGCA
GTGCTGTGCTATATATGGGG
Is there a way to obtain an aligned fasta? I think the term is consensus fasta, essentially I need all these fragments to be aligned in a single consecutive sequence.
Thank you
What you are asking for is not a simple conversion. Your sam file (using bam files would be better to save disk usage) contains an alignment: the reads are mapped to where they belong on the reference genome. What you are asking for is a genome assembly in which reads are stitched together to create a genome. You could create a consensus sequence using an alignment though, but that is not a trivial conversion.
Are you sure it wouldn't be easier to find the reference file used to make the bam, instead of trying to remake it yourself?
what's wrong with the 2nd solution ?
I mean could be more clear about "aligning these sequences to single seq" ?
OP wants consensus fasta sequence.
marongiu.luigi : Do you not search the web/biostars before posting a question: https://samtools.github.io/bcftools/howtos/consensus-sequence.html
I have seen this page, but I dont have a indexed VCF file
You need to first call variants using the alignment file and then do the consesus. Relevant instructions are linked on the page I included above (https://samtools.github.io/bcftools/howtos/variant-calling.html ).
Followinf your lead, I created a bcf with
and I index it with
which created a calls.bcf.csi file. I then used the instructions for the consensus sequence but I got:
But in
Failed to open -: unknown file type
, the reference is present in the working directory (ref.fa).You forgot the
-f
flag (which the error message says quite clearly). It's in your top line but not in the second part where you're actually running your commands.I see, now the command is:
This generates a file ref.fa.fai that contains:
Essentially the second command is a re-run of the first apart for the option -z.
-O
just controls output format, so there's almost zero difference between the processes.Can you explain to us why you want a FASTA out of a BAM? BAM documents how reads (sequences with base-quality scores) align to a reference genome. BAM does not document in any way the changes one would need to make to the reference genome to get a consensus sequence. VCF files can be used to extract that information. We need to understand why you are particular about BAM files, so we can evaluate your approach.
Because a BAM on its own give me little information. I need a sequence that I can design primers upon and remap with Clustal in case I need it. BAM is an alignment, I would like to see where they align that is all.
Well that's what bam tells you.
Do variant calling, mask the reference using the identified variants and use that for primers.
OK, so I get that the next step after alignments is variant calling. I'll continue from there then. Thanks.
that gives me a pile of fragments like before:
If I try
the file is empty, apart for the initial '>'.
I wanted a single fasta entry with all these fragments aligned together. So, instead of having:
I would like:
We don't know what exactly you are looking for. I can think of:
1) You want to get a fasta of all loci on the genome that has at least one read mapping?
2) You want to get a fasta of "consensus" reads. There is no consensus, since there might be SNPs/variants. You could potentially have two or more versions of the sequence at each polymorphism.
3) You want to perform variant calling/phasing and then output the possible variant fasta?