Question

If I need to repeatedly perform variant calling from BAM for single gene only, do I really need entire reference FASTA?

1

Entering edit mode

2.5 years ago

sbstevenlee ▴ 480

Hi all,

Let's imagine I'm developing a Python tool that can predict clinical phenotype for the gene YFG (Your Favorite Gene) which is located in chr1:1000-2000. It does so by first creating a VCF file from an aligned BAM file using bcftools (more specifically via the Python package pysam -- i.e. pysam.bcftools) and then somehow mapping variant records to a known phenotype.

For variant calling the tool currently requires two input files from a user: a BAM file and a reference FASTA file. However, I want to update the tool so that it already includes a "mini" FASTA file that only contains the sequence of YFG which is 1 kb long. This means users are now required to provide their BAM only. I can't just put an entire reference FASTA file to the tool because it is usually huge (~3 GB) so the tool won't be portable anymore (portability matters to me a lot). Including mini FASTA will also improve the reproducibility of said tool because currently different users have slightly different FASTA files (e.g. different contig names such as chr1 vs. 1).

That being said, my question is: Is this feasible? I keep thinking there got to be a simple way to achieve my goal, but I can't think of a good way to actually pull this off.

For instance, if I were to simply use mini FASTA as described above, bcftools will return most positions in chr1:1000-2000 as variants because every position has been left shifted by 1000. I know this because I have already tried it :(

I still think above approach could work if, for example, there is a way to tell bcftools that the sequence in mini FASTA starts at chr1:1000. But no luck thus far. Any advice and recommendation would be highly appreciated.

Another possible workaround might be to re-align sequence reads from BAM to mini FASTA and then perform variant calling with mini FASTA. I actually think this will work quite well, but the major problem is that there is no Python package that can perform efficient and accurate read alignment (e.g. bwa-mem). Of course, I can have users install a read aligner separately, but if I can I really want to package everything within the tool.

Thanks for reading this lengthy post!

variant-calling bam vcf bcftools fasta • 1.5k views

ADD COMMENT • link 2.5 years ago by sbstevenlee ▴ 480

0

Entering edit mode

I can't just put an entire reference FASTA file to the tool because it is usually huge (~3 GB) so the tool won't be portable anymore

most (all?) bioinfo tools require that the users have already download an indexed reference fasta sequence. you don't need to provide the REF with your tool.

you could of course shift all positions in your VCF using for example awk, but it's a bad idea IMHO

ADD REPLY • link 2.5 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks for the reply. I'm aware these days it's not difficult to obtain reference FASTA from the web, but many users of my program are not necessarily proficient in bioinformatics and programming, so I'm always looking for ways to cut down end users' burden.

As for editing VCF, I'm pretty comfortable working with the format (I've developed a VCF parser tool), so we will see how difficult it will be!

ADD REPLY • link 2.5 years ago by sbstevenlee ▴ 480

0

Entering edit mode

somewhat unrelated to this post but I checked the source you linked, and I saw that you have an entire package and command line suite called fuc as a shorthand to frequently used commands.

The tool looks useful but the naming is not appropriate. It strongly resembles the most common four-letter profanity.

I'd like to strongly and emphatically urge to change the name so that it does not resemble the profanity - it is a shame to lose out on a tool and all that effort just to be funny and edgy.

In my opinion it is not an acceptable bioinformatics name and will bring you unexpected trouble in the future. Call faq or fbc, run, ngs etc or any other, just not that.

ADD REPLY • link 2.5 years ago by Istvan Albert 102k

0

Entering edit mode

Thanks for the feedback! You have no idea how much regret I have with the package name, but given that it's already been downloaded >11k in bioconda alone, I think it's too late.

When I first developed fuc, it was meant to be a personal toolbox for working with genomic data, so I didn't put much thought into the package's name. I didn't expect it to grow like this.

To my (pathetic) defense, I put below in the very top of my README:

The main goal of the fuc package (pronounced "eff-you-see") is to wrap some of the most frequently used commands in the field of bioinformatics into one place.

But again, I appreciate your concern. I really do get your point :)

ADD REPLY • link 2.5 years ago by sbstevenlee ▴ 480

score 2 · Accepted Answer · 2022-05-26

2

Entering edit mode

2.5 years ago

Istvan Albert 102k

In normal operation, the coordinates of the BAM and FASTA file need to match.

One workaround could be to extract a subset of the BAM file, then post-process that BAM file to shift the POS column to match the mini-FASTA (subtract the offset from each POS field - other corrections may also be needed) then use the mini BAM file with the mini FASTA

ADD COMMENT • link 2.5 years ago by Istvan Albert 102k

0

Entering edit mode

Thanks Istvan for introducing me an alternative way (as always). I like your approach of editing positions in subsetted BAM to make it compatible with mini FASTA! I will wait some time before accepting your answer, just in case other people have a different approach.

ADD REPLY • link 2.5 years ago by sbstevenlee ▴ 480