Question

A method for extracting two variable regions from a FASTA file containing long reads (see image)

6.0 years ago

zack.saud ▴ 50

Hi all,

I'm looking for help or suggestions regarding a technique for extracting the variable regions in the image. Does anyone know of any tools (or a script) I could use to extract both variable regions from each read, using the constant regions that flank them? Ideally, the tool would either place the two variable regions from each read in Excel or create another FASTA file in which the two regions from each read are linked together (i.e. with an "x" in between).

Many thanks in advance

enter image description here

sequencing next-gen • 1.6k views

ADD COMMENT • link updated 6.0 years ago by ATpoint 89k • written 6.0 years ago by zack.saud ▴ 50

Can you clarify if these are amplicons or whole genome reads? What format were they in originally, fastq/fasta? What does the image represent? BAM or fasta multiple sequence alignment?

If this is a BAM alignment, it would be easy to extract the reads that span the intervals you are looking for with samtools view. Depending on what kind of reads these are, though, it would be tricky to extract the specific nucleotides that represent 350-375 bp and 450-475 bp in the image above.

ADD REPLY • link 6.0 years ago by GenoMax 153k

Hi GenoMax, I have files with both amplicon (450 bp) and whole-plasmid (4800 bp) reads from the same sample, and I have each file in both fastq and fasta format. The image is an alignment (made with minimap2) of my reads to the backbone sequence with the variable region deleted. It has no particular significance; I just hoped it might help explain the problem I am trying to solve. The alignment produced a BAM file. I'll give samtools view a try, thank you.

ADD REPLY • link 6.0 years ago by zack.saud ▴ 50

Have you looked up "regular expressions"?

ADD REPLY • link 6.0 years ago by swbarnes2 15k

Hi swbarnes2,

I'd never heard of regular expressions, but after reading up on them, they seem to be exactly what I need! Do you know of any tools that support them for FASTA/FASTQ files? Or do any of the Linux tools (grep, sed, awk) support regular expressions?

Many thanks

ADD REPLY • link 6.0 years ago by zack.saud ▴ 50

A fastq file is just a plain text file, usually gzipped. You don't need special software to handle it. You can use any programming language, such as Python, Perl or R, or you could string a bunch of Unix commands together, starting with grep.
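To make the regular-expression idea concrete, here is a minimal Python sketch, not a tested pipeline. The three flanking sequences (`LEFT_FLANK`, `MIDDLE`, `RIGHT_FLANK`) are made-up placeholders standing in for the real constant regions, and the function names are only illustrative. It assumes the flanks match exactly and that every read is in the same orientation; reads with sequencing errors in the flanks, or reverse-complemented reads, would simply be skipped.

```python
import re

# Hypothetical constant regions; replace these with the real flanking sequences.
LEFT_FLANK = "ACGTAC"
MIDDLE = "GGATCC"
RIGHT_FLANK = "TTGACA"

# Two capture groups, one per variable region, each bounded by constant sequence.
PATTERN = re.compile(
    re.escape(LEFT_FLANK) + r"([ACGTN]+?)"
    + re.escape(MIDDLE) + r"([ACGTN]+?)"
    + re.escape(RIGHT_FLANK)
)


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the read name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def extract_variable_regions(lines):
    """Yield (name, region1, region2) for every read that matches PATTERN."""
    for name, seq in read_fasta(lines):
        match = PATTERN.search(seq)
        if match:
            yield name, match.group(1), match.group(2)


def to_linked_fasta(records):
    """Format records as FASTA with the two regions joined by an 'x'."""
    return "".join(f">{name}\n{v1}x{v2}\n" for name, v1, v2 in records)


if __name__ == "__main__":
    demo = [">read1", "TTACGTACAAAAGGATCCCCCCTTGACAGG", ">read2", "NNNN"]
    print(to_linked_fasta(extract_variable_regions(demo)), end="")
    # prints:
    # >read1
    # AAAAxCCCC
```

For a real file, pass an open file handle instead of the `demo` list (for example `open("reads.fasta")`, or `gzip.open("reads.fasta.gz", "rt")` for a gzipped file). Writing the three fields out tab-separated instead would give something Excel can open.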

Though now that I see your image, your "constant" region is not very constant. I'd rather parse the SAM file, because the aligner will already have handled the fact that the flanking sequence isn't perfect, and you can eyeball the coordinates you want.
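A rough sketch of what parsing the SAM records could look like in plain Python, walking the CIGAR string to pull out the read bases that sit over a chosen reference interval. The function names and the demo record are invented for illustration, and the coordinates are whatever you read off the alignment. It assumes 1-based inclusive coordinates, as in the SAM POS column, and includes inserted bases that fall strictly inside the interval, which matters if the variable region was deleted from the reference and so shows up as an insertion in the reads.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def read_bases_over_interval(pos, cigar, seq, start, end):
    """Return the read bases aligned to reference positions start..end.

    pos is the 1-based leftmost mapping position (SAM column 4),
    cigar is column 6 and seq is column 10. start and end are 1-based
    and inclusive. Insertions lying between two reference positions
    that are both inside the interval are included.
    """
    if cigar == "*" or seq == "*":
        return ""  # unmapped read, or record without a stored sequence
    ref = pos  # next reference position to be consumed
    qi = 0     # next index into the read sequence
    out = []
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":  # consumes both reference and read
            for _ in range(n):
                if start <= ref <= end:
                    out.append(seq[qi])
                ref += 1
                qi += 1
        elif op == "I":  # inserted bases sit between ref - 1 and ref
            if start < ref <= end:
                out.append(seq[qi:qi + n])
            qi += n
        elif op == "S":  # soft clip: consumes read only
            qi += n
        elif op in "DN":  # deletion or skip: consumes reference only
            ref += n
        # H and P consume neither the read nor the reference
    return "".join(out)


def extract_from_sam_line(line, intervals):
    """Return (read name, [bases for each interval]) for one SAM record."""
    fields = line.rstrip("\n").split("\t")
    name, pos, cigar, seq = fields[0], int(fields[3]), fields[5], fields[9]
    return name, [
        read_bases_over_interval(pos, cigar, seq, s, e) for s, e in intervals
    ]


if __name__ == "__main__":
    # Made-up record: 2 soft-clipped bases, a 3 bp insertion, a 2 bp deletion.
    record = "read1\t0\tbackbone\t101\t60\t2S5M3I5M2D4M\t*\t0\t0\tNNAAAAAGGGCCCCCTTTT\t*"
    print(extract_from_sam_line(record, [(104, 107), (110, 113)]))
    # prints: ('read1', ['AAGGGCC', 'CT'])
```

To run this on a BAM file, the records printed by samtools view can be fed in one line at a time, skipping any header lines that start with "@". One convenience of working from the alignment is that SAM stores the sequence of reverse-strand reads already reverse-complemented, so all extracted regions come out in reference orientation.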