Extract Reads In Fastq Using A Bed File.
6
1
Entering edit mode
11.4 years ago
Kasthuri ▴ 300

I need to extract reads from a fastq file using genomic coordinates in a bed file. The output should again be a fastq file. Is there any easier way to do this other than converting the fastq to sam using Picard?

Thanks

fastq bed • 10k views
ADD COMMENT
4
Entering edit mode
11.4 years ago
nilshomer ▴ 70

A fastq file does not contain coordinate information. First, you must map/align the data to your reference genome (say with bwa), and then you can extract the sequences/reads using the coordinates in the BED file.

ADD COMMENT
0
Entering edit mode

Thanks a lot. Will try that.

ADD REPLY
2
Entering edit mode
7.0 years ago
steve ★ 3.5k

This is an old question, but I recently had this same problem, and here is what I came up with. I've uploaded the scripts I used here

I had:

  • .fastq.gz files from which I wanted to extract a subset of reads

  • a .bed file which contained coordinates for which I wanted reads from the .fastq.gz files

  • .bam alignments for each set of .fastq.gz files

I also wanted to make sure I was extracting the exact entries from the original .fastq.gz files, instead of converting back & forth between file formats. I decided to use BioPython's SeqIO, and pysam Python packages to make the scripting less complicated. Also using Python's gzip library to read and write to/from the .gz files directly.

Setup

First thing I did was create a .tsv formatted samplesheet that listed the samples, and the corresponding .bam and .fastq.gz files, shown here

Scripting

Here is an overview of the script I set up for this.

Import Paths & Regions

Load packages and import file paths from samplesheet and target regions

import os
import sys
import csv
import gzip
from Bio import SeqIO
import pysam


# file with sample IDs, paths to .fastq and .bam files
samples_file = "sample_files.tsv"
# file with genomic regions to use
target_file = "targets.bed"
# directory to save the output fastq files
output_dir= "output_fastq"


# read in the samples
samples = []
with open(samples_file) as f:
    reader = csv.DictReader(f, delimiter = '\t')
    for sample in reader:
        samples.append(sample)

# split the fastq and bam entries
for sample in samples:
    sample['fastq'] = sample['fastq'].split(',')
    sample['bam'] = sample['bam'].split(',')
    # add 'qname' for later
    sample['qname'] = []
    # make fastq output file paths
    sample['output_fastq'] = {}
    for fastq in sample['fastq']:
        sample['output_fastq'][fastq] = os.path.join(output_dir, os.path.basename(fastq)) # .strip('.gz')

# read in the target regions
targets = []
with open(target_file) as f:
    for line in f:
        chrom, start, stop = line.split()
        targets.append((chrom, int(start), int(stop)))

Find .bam Read ID's

Get the ID's of the sequences from the alignment file that match the regions we want

# get the read IDs ('qname') for the targets from each bam per sample
for chrom, start, stop in targets:
    for sample in samples:
        for bam_file in sample['bam']:
            # load the bam
            bam = pysam.AlignmentFile(bam_file, "rb")
            for read in bam.fetch(chrom, start, stop):
                sample['qname'].append(read.qname)
            bam.close()


# get only the unique qnames per sample
for sample in samples:
    sample['qname'] = sorted(set(sample['qname']))

Find Matching .fastq.gz Reads

Find the sequences in the .fastq.gz files that match the ID's we found in the .bam files for the desired regions.

# get the fastq reads that match
for sample in samples:
    for fastq in sample['fastq']:
        output_fastq = sample['output_fastq'][fastq]
        with gzip.open(fastq) as gz_in, gzip.open(output_fastq, 'wb') as gz_out:
            input_seq_iterator = SeqIO.parse(gz_in, "fastq")
            seq_iterator = (record for record in input_seq_iterator if record.id in sample['qname'])
            SeqIO.write(seq_iterator, gz_out, "fastq")

Conclusion

The full script for this is located here. Since this runs each sample sequentially, it is very slow (took 10hrs to extract ~10-20k reads from the listed .fastq files), I also wrote a version here that only operates on one sample at a time, so you can run multiple samples in parrallel such as on an HPC cluster using a script like this one

Hope this helps!

ADD COMMENT
0
Entering edit mode
11.4 years ago

You can convert read sequences to FASTA, align to your genome with BLAT, and take the resulting PSL file and convert it to BED to do set operations on.

The resulting BED file can be used as a lookup table for read IDs, for use in filtering FASTQ records.

ADD COMMENT
0
Entering edit mode

Thanks Alex. I think this will be a good solution as I need to extract exome regions from genome which is already mapped by illumina (paired-end). I don't think illumina uses bwa which I guess is a better aligner. I wanted to realign using bwa and to do that I am converting the aligned file to fastq. I don't want to realign the whole genome as I need only exonic regions. And hence this question.

ADD REPLY
0
Entering edit mode
7.0 years ago

Your fastq has no inherent mapping information in it. FastqToSam will make you an unmapped bam; bam format, with no mapping positions.

You need to map first, to know what reads are in the desired region, pull out those reads, then convert those back to fastq. BEDTools can get you the intersect between a .bed file and a .bam. Pipe that output to Picard or samtools, which can convert a bam to a fastq on the fly.

ADD COMMENT
0
Entering edit mode
7.0 years ago
michael.ante ★ 3.9k

I have a rather different approach in mind, but not fully tested.

Using your BED file, you can extract the genomic sequences of interest as a fasta file (e.g. looping over the bed file and using samtools faidx). With BBSplit, you can extract all those fastq entries mapping to these regions. This should perform a bit faster than the map-against-all approach.

ADD COMMENT
0
Entering edit mode
2.8 years ago

Old thread but in fact, there is a use case here:

We use Barrnap to find nanopore reads that contain 16s sequences (sequenced from soils for example) To do so we use seqtk to convert fastq to fasta, run barrnap and get a fasta with the 16s sequence matches from the initial nanopore to profile further with qiime2

Barrnap uses bedtools under the hood.

https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

It would be nice if there was a getfastq tool version as well to extract parts of long nanopore fastq seqs too. Should be simple to write, if only I had C coding skills

ADD COMMENT

Login before adding your answer.

Traffic: 2727 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6