In terms of performance, I have yet to see any program written in Perl, Python, C++, or C outperform this combination of shell tools. Other tools may be more convenient (and often that is more than a good enough reason to use them), but if you need blistering speed, there is no substitute for native built-in commands.
To summarize, the solution (using paste and process substitution) looks something like this.
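(The command itself is not reproduced here, so what follows is only a sketch of the kind of paste/tee/process-substitution pipeline being described; it assumes an uncompressed, strictly 4-line-per-record interleaved file, and the file names are placeholders.)

    paste - - - - - - - - < interleaved.fastq \
        | tee >(cut -f 1-4 | tr '\t' '\n' > reads_1.fastq) \
        | cut -f 5-8 | tr '\t' '\n' > reads_2.fastq

Here paste collapses each read pair onto one tab-separated line, tee plus a process substitution splits the stream, and cut/tr rebuild the two 4-line records.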
+1 For this one. I am a python devotee, but I have this method saved as a little shell script on my PATH and wouldn't consider doing it a different way.
You will have to do it another way unless your data is uncompressed 4-line fastq. That is the problem.
...and the file is completely uncorrupted.
I prefer solutions that detect and report misformatted or corrupted data over one-liners that will pass anything through on a garbage-in, garbage-out philosophy. I've received many bug reports about Reformat not working on this file or that file, and upon investigation it was almost always a truncated or otherwise corrupted file that would not have been detected otherwise (and, I have to say, the user not reading the error message, which is generally pretty informative).
One-liners with no validation have their place, but in my opinion that place is not in situations where uncaught errors can cause huge problems downstream years later.
Well zcat gets you around the compression anyway, so that's not really an issue.
Regarding BB's comment, I agree: there's a time and a place for careful and for quick. My point was really that, since the basic Python and Perl solutions people generally use for this sort of thing don't have any checking in place either, there's simply no need to bring those languages to the table in the first place.
Regarding 4-line fastq, how often is this an issue? I work with NGS data daily and have yet to encounter a Fastq file that does not have 4-line records.
The point was that you hard coded that into the one-liner, and now you'll have to change it back for another data set. Every time you do that is time you'll never get back, and time you'll have to subtract from your perceived performance gains. I completely disagree with this approach and I think it is foolish to say you are saving time doing analysis this way (still, no validation is done).
In a broader context, I think this discussion highlights why people so often say that bioinformatics is "doing it all wrong" when it comes to manipulating data. Instead of packaging software, using version control/continuous integration, writing tests and documentation and making code that is reusable, we often see analysis pipelines put together with one-liners. This approach is not reusable, and it isn't documented or testable. For one-off tasks, this is the right approach. For common routines, this approach will only lead to wasted time/money, and I say that from experience.
Shell scripting != hard-coding. I wouldn't imagine that most people with scripts like this either a) hard code them or b) write them out every time. Surely they mostly resemble something like mine, as an example:
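(The original script is not included here; the following is purely a hypothetical sketch of such a wrapper, not the commenter's actual code. It assumes bash, and the function and file names are invented for illustration.)

    # Hypothetical sketch, not the script referenced above:
    # de-interleave a (possibly gzipped) interleaved fastq into two files.
    deinterleave_fastq () {
        local in=$1 out1=$2 out2=$3
        local reader=cat
        case "$in" in
            *.gz) reader=zcat ;;   # pick a reader based on the file extension
        esac
        "$reader" "$in" \
            | paste - - - - - - - - \
            | tee >(cut -f 1-4 | tr '\t' '\n' > "$out1") \
            | cut -f 5-8 | tr '\t' '\n' > "$out2"
    }
    # Usage: deinterleave_fastq interleaved.fastq.gz reads_1.fastq reads_2.fastq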
As I said above, I agree that there's a time and place for different methods, but I don't believe in time-wasting either, and I'm not about to re-invent the wheel every time I need to do something that I know can be easily achieved without even leaving the terminal. Would you also, for example, refuse to use a shell function to simplify unarchiving (e.g. https://github.com/xvoland/Extract/blob/master/extract.sh )? It's not documented and doesn't include a test suite, after all.
You make good points and I agree with your philosophy on using core tools. The answer I provided above is in the same spirit as these comments. That is, it doesn't require you to leave the terminal or ever update any package.
The code you provided illustrates the idea I was trying to express in my previous comment. You added a lot of code, and you still have to modify the command if the input is compressed (cat vs. zcat), and it only works with fastq (that is hard coded). Yes, we could go back and forth modifying the script, because we all write shell scripts. The underlying issue is that you have to stop and think about that every time (I don't trust myself to do that) or write a lot more code to handle those cases. That is what I was referring to as inefficient, since solutions for this already exist.
To be clear, I don't think there is a right and wrong approach; these are just points for discussion based on my experience. You have to do what works best in your work environment or workflow, and that may mean using different solutions. As an aside, I started work on a branch of pairfq written in C to be faster, but it is not as convenient to use, so I'm debating what is more important: speed or ease of use. I think the best solution would optimize speed while not sacrificing too much convenience (and not require an amount of development time that would offset any performance gains). It is tough to find the right balance of those factors for every use case, and I think that is partly why we see different solutions.
Reformat can do that, if you're sure the resulting fastq file is interleaved. You can run with the "vint" and "ain" flags to verify that it is interleaved properly. If it's not, you can fix it by running repair.sh (also in the BBMap package).
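(The exact command is not shown here; judging from the description below, it was presumably something along these lines. This is only a sketch using the standard BBMap in=/out= parameter style; vint and ain are the verification flags mentioned above, and the file names are placeholders.)

    reformat.sh in=interleaved.fastq out=1.fq out2=2.fq vint ain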
Where the input is the interleaved fastq and 1.fq and 2.fq are the forward and reverse reads, respectively (you can choose better names though!). This assumes the pairs are labeled properly (e.g., "/1" or " 1").
Comments are valid, but my original point stands: you can choose convenience (format checking and error handling) or performance.
I don't know, Reformat is pretty fast... I think you can have both.
Regarding compression, a very minor addition will suffice.