deinterleave fastq file
4
2
Entering edit mode
9.6 years ago
anon ▴ 50

Hi All,

I would like to ask that is there an accurate tool/script for deinterleaving fastq files? I managed a 'bam to fastq' conversion according to http://gatkforums.broadinstitute.org/discussion/2908/howto-revert-a-bam-file-to-fastq-format and I would like to deinterleave the output fastq.

Many thanks!

fastq bamtofastq • 14k views
ADD COMMENT
8
Entering edit mode
9.6 years ago

In terms of performance, I have yet to see any program written in Perl, Python, C++, or C outperform this combination of shell tools. Other tools may be more convenient--and often that is more than a good enough reason to use them--but if you need blistering speed then there is no substitute for native built-in commands.

To summarize, the solution (using paste and process substitution) looks something like this.

paste - - - - - - - - < reads-int.fq \
    | tee >(cut -f 1-4 | tr "\t" "\n" > reads-1.fq) \
    | cut -f 5-8 | tr "\t" "\n" > reads-2.fq
ADD COMMENT
2
Entering edit mode

+1 For this one. I am a python devotee, but I have this method saved as a little shell script on my PATH and wouldn't consider doing it a different way.

ADD REPLY
1
Entering edit mode

You will have to do it another way unless your data is uncompressed 4-line fastq. That is the problem.

ADD REPLY
1
Entering edit mode

...and the file is completely uncorrupted.

I prefer solutions that detect and report misformatted or corrupted data to 1-liners that will pass anything, with the garbage-in, garbage-out philosophy. I've received many bug reports for reformat not working on this file or that file, and upon investigation, almost always it was a truncated or otherwise corrupted file that would not have been detected otherwise (and, I have to say, the user not reading the error message, which is generally pretty informative).

One-liners with no validation have their place, but it is not in situations where uncaught errors can cause huge problems downstream for years later, in my opinion.

ADD REPLY
0
Entering edit mode

Well zcat gets you around the compression anyway, so that's not really an issue.

In regards BB's comment, I agree - there's a time and a place for careful and for quick. My point was really that considering that the basic python and perl solutions people generally use for this sort of thing don't have a checking system in place anyway, there's simply no need to bring these languages to the table in the first place.

ADD REPLY
0
Entering edit mode

Comments are valid, but my original point stands: you can choose convenience (format checking and error handling) or performance.

ADD REPLY
0
Entering edit mode

I don't know, Reformat is pretty fast... I think you can have both.

ADD REPLY
0
Entering edit mode

...unless your data is uncompresed [sic] 4-line fastq.

Regarding compression, a very minor addition will suffice.

gunzip -c reads-int.fq | paste - - - - - - - - \
    | tee >(cut -f 1-4 | tr "\t" "\n" > reads-1.fq) \
    | cut -f 5-8 | tr "\t" "\n" > reads-2.fq

Regarding 4-line fastq, how often is this an issue? I work with NGS data daily and have yet to encounter a Fastq file that does not have 4-line records.

ADD REPLY
2
Entering edit mode

The point was that you hard coded that into the one-liner, and now you'll have to change it back for another data set. Every time you do that is time you'll never get back, and time you'll have to subtract from your perceived performance gains. I completely disagree with this approach and I think it is foolish to say you are saving time doing analysis this way (still, no validation is done).

In a broader context, I think this discussion highlights why people so often say that bioinformatics is "doing it all wrong" when it comes to manipulating data. Instead of packaging software, using version control/continuous integration, writing tests and documentation and making code that is reusable, we often see analysis pipelines put together with one-liners. This approach is not reusable, and it isn't documented or testable. For one-off tasks, this is the right approach. For common routines, this approach will only lead to wasted time/money, and I say that from experience.

ADD REPLY
0
Entering edit mode

Shell scripting != hard-coding. I wouldn't imagine that most people with scripts like this either a) hard code them or b) write them out every time. Surely they mostly resemble something like mine, as an example:

#! /bin/zsh

if (( $# == 2 )) ; then

    paste - - - - - - - - \
    | tee \
        >(cut -f 1-4 \
        | tr "\t" "\n" \
        > $1) \
    | cut -f 5-8 \
    | tr "\t" "\n" \
    > $2

else

    echo -e "\
${0##*/}
GY140920

Deinterleaves a FASTQ file of paired reads

Usage:

    cat <in.fastq> | deinterleave.sh <out1.fq> <out2.fq>

    zcat <in.fq.gz> | deinterleave.sh <out1.fq> <out2.fq>

    samtools bam2fq -ns singles.fq <in.bam> \\
    | deinterleave.sh <out1.fq> <out2.fq>"

fi

As I said above, I agree that there's a time and place for different methods, but I don't believe in time-wasting either, and I'm not about the re-invent the wheel every time I need to do something that I know can be easily achieved without even needing to leave the terminal. Would you also, for example, not use a shell function to simplify unarchiving (e.g. https://github.com/xvoland/Extract/blob/master/extract.sh ) - it's not documented and doesn't include a test-suite, after all?

ADD REPLY
0
Entering edit mode

You make good points and I agree with your philosophy on using core tools. The answer I provided above is in the same spirit as these comments. That is, it doesn't require you to leave the terminal or ever update any package.

The code you provided illustrates the idea I was trying to express in my previous comment. You added a lot of code and still you have to modify the command if it is compressed (cat vs zcat) and it only works with fastq (that is hard coded). Yes, we could go back and forth modifying the script because we all write shell scripts. The underlying issue is that you have to stop and think about that every time (I don't trust myself to do that) or write a lot more code to handle those cases. That is what I was referring to as inefficient since there are already solutions.

To be clear, I don't think there is a right and wrong approach, these are just points for discussion based on my experience. You have to do what works best in your work environment or workflow and that may mean using different solutions. As an aside, I started work on a branch of pairfq written in C to be a faster but it is not as convenient to use so I'm debating what is more important: speed or ease of use? I think the best solution would optimize speed while not sacrificing convenience too much (and not require an amount of development time that would offset any performance increases). It is tough to find the right balance of those factors for every use case and I think that is partly why we see different solutions.

ADD REPLY
3
Entering edit mode
9.6 years ago

Reformat can do that, if you're sure the resulting fastq file is interleaved. You can run with the "vint" and "ain" flags to verify that it is interleaved properly. If it's not, you can fix it by running repair.sh (also in the BBMap package).

ADD COMMENT
0
Entering edit mode

thanks, I'll try it!

ADD REPLY
2
Entering edit mode
9.6 years ago
SES 8.6k

Here is lightweight solution with Pairfq:

curl -sL git.io/pairfq_lite | perl - splitpairs -i interl.fq -f 1.fq -r 2.fq

Where the input is the interleaved fastq and 1.fq and 2.fq are the forward and reverse reads, respectively (you can choose better names though!). This assumes the pairs are labeled properly (e.g., "/1" or " 1").

ADD COMMENT
2
Entering edit mode
4.6 years ago
alienzj ▴ 30
seqtk seq interleaved.fq.gz -1 | pigz -c > 1.fq.gz  
seqtk seq interleaved.fq.gz -2 | pigz -c > 2.fq.gz

# or

seqtk seq interleaved.fq.gz | \
tee >(seqtk seq -1 - | pigz -c > 1.fq.gz) | \
seqtk seq -2 - | pigz -c > 2.fq.gz
ADD COMMENT

Login before adding your answer.

Traffic: 2572 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6