This is an old question, but I recently had this same problem, and here is what I came up with. I've uploaded the scripts I used here
I had:
.fastq.gz
files from which I wanted to extract a subset of reads
a .bed
file which contained coordinates for which I wanted reads from the .fastq.gz
files
.bam
alignments for each set of .fastq.gz
files
I also wanted to make sure I was extracting the exact entries from the original .fastq.gz
files, instead of converting back & forth between file formats. I decided to use BioPython's SeqIO, and pysam Python packages to make the scripting less complicated. Also using Python's gzip
library to read and write to/from the .gz
files directly.
First thing I did was create a .tsv
formatted samplesheet that listed the samples, and the corresponding .bam
and .fastq.gz
files, shown here
Here is an overview of the script I set up for this.
Load packages and import file paths from samplesheet and target regions
import os
import sys
import csv
import gzip
from Bio import SeqIO
import pysam
# file with sample IDs, paths to .fastq and .bam files
samples_file = "sample_files.tsv"
# file with genomic regions to use
target_file = "targets.bed"
# directory to save the output fastq files
output_dir= "output_fastq"
# read in the samples
samples = []
with open(samples_file) as f:
reader = csv.DictReader(f, delimiter = '\t')
for sample in reader:
samples.append(sample)
# split the fastq and bam entries
for sample in samples:
sample['fastq'] = sample['fastq'].split(',')
sample['bam'] = sample['bam'].split(',')
# add 'qname' for later
sample['qname'] = []
# make fastq output file paths
sample['output_fastq'] = {}
for fastq in sample['fastq']:
sample['output_fastq'][fastq] = os.path.join(output_dir, os.path.basename(fastq)) # .strip('.gz')
# read in the target regions
targets = []
with open(target_file) as f:
for line in f:
chrom, start, stop = line.split()
targets.append((chrom, int(start), int(stop)))
.bam
Read ID'sGet the ID's of the sequences from the alignment file that match the regions we want
# get the read IDs ('qname') for the targets from each bam per sample
for chrom, start, stop in targets:
for sample in samples:
for bam_file in sample['bam']:
# load the bam
bam = pysam.AlignmentFile(bam_file, "rb")
for read in bam.fetch(chrom, start, stop):
sample['qname'].append(read.qname)
bam.close()
# get only the unique qnames per sample
for sample in samples:
sample['qname'] = sorted(set(sample['qname']))
.fastq.gz
ReadsFind the sequences in the .fastq.gz
files that match the ID's we found in the .bam
files for the desired regions.
# get the fastq reads that match
for sample in samples:
for fastq in sample['fastq']:
output_fastq = sample['output_fastq'][fastq]
with gzip.open(fastq) as gz_in, gzip.open(output_fastq, 'wb') as gz_out:
input_seq_iterator = SeqIO.parse(gz_in, "fastq")
seq_iterator = (record for record in input_seq_iterator if record.id in sample['qname'])
SeqIO.write(seq_iterator, gz_out, "fastq")
The full script for this is located here. Since this runs each sample sequentially, it is very slow (took 10hrs to extract ~10-20k reads from the listed .fastq files), I also wrote a version here that only operates on one sample at a time, so you can run multiple samples in parrallel such as on an HPC cluster using a script like this one
Hope this helps!
You can convert read sequences to FASTA, align to your genome with BLAT, and take the resulting PSL file and convert it to BED to do set operations on.
The resulting BED file can be used as a lookup table for read IDs, for use in filtering FASTQ records.
Thanks Alex. I think this will be a good solution as I need to extract exome regions from genome which is already mapped by illumina (paired-end). I don't think illumina uses bwa which I guess is a better aligner. I wanted to realign using bwa and to do that I am converting the aligned file to fastq. I don't want to realign the whole genome as I need only exonic regions. And hence this question.
Your fastq has no inherent mapping information in it. FastqToSam will make you an unmapped bam; bam format, with no mapping positions.
You need to map first, to know what reads are in the desired region, pull out those reads, then convert those back to fastq. BEDTools can get you the intersect between a .bed file and a .bam. Pipe that output to Picard or samtools, which can convert a bam to a fastq on the fly.
I have a rather different approach in mind, but not fully tested.
Using your BED file, you can extract the genomic sequences of interest as a fasta file (e.g. looping over the bed file and using samtools faidx). With BBSplit, you can extract all those fastq entries mapping to these regions. This should perform a bit faster than the map-against-all approach.
Old thread but in fact, there is a use case here:
We use Barrnap to find nanopore reads that contain 16s sequences (sequenced from soils for example) To do so we use seqtk to convert fastq to fasta, run barrnap and get a fasta with the 16s sequence matches from the initial nanopore to profile further with qiime2
Barrnap uses bedtools under the hood.
https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html
It would be nice if there was a getfastq tool version as well to extract parts of long nanopore fastq seqs too. Should be simple to write, if only I had C coding skills
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thanks a lot. Will try that.