Hello! I have just started working with some RNA-seq data that was generated on a NextSeq. Each sample produced 8 files (1_L001*_R1*_001.fastq.gz, L001_R2, L002_R1, L002_R2, L003_R1, L003_R2, L004_R1 and L004_R2). I am trying to use cat to concatenate the files, but it keeps saying command not found. Can someone assist with how this could be done?
updated 8.4 years ago by chen (★ 2.5k) • written 8.4 years ago by snp87 (▴ 80)
You should be concatenating R1 and R2 files separately (and in the same order) to avoid issues with mis-ordered pairs. Processing the pairs in existing pieces can allow you to do things in parallel. Data can then be merged at the BAM level. Something to consider.
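A minimal sketch of concatenating the mates separately, assuming a hypothetical sample prefix `sampleA` and tiny stand-in files created in a temporary directory (none of these names come from the original post). The glob expands in sorted order, so R1 and R2 are joined in the same lane order:

```shell
set -eu

# Hypothetical demo data: tiny stand-ins for the eight per-lane files.
workdir=$(mktemp -d)
cd "$workdir"
for lane in L001 L002 L003 L004; do
  for mate in R1 R2; do
    printf '@read_%s/%s\nACGT\n+\nIIII\n' "$lane" "$mate" \
      | gzip > "sampleA_${lane}_${mate}_001.fastq.gz"
  done
done

# Concatenate each mate on its own. The glob expands in sorted order
# (L001, L002, L003, L004), so both outputs share the same lane order.
cat sampleA_L00?_R1_001.fastq.gz > sampleA_R1.fastq.gz
cat sampleA_L00?_R2_001.fastq.gz > sampleA_R2.fastq.gz

# Sanity check: both merged files should have the same number of lines.
r1_lines=$(gzip -dc sampleA_R1.fastq.gz | wc -l)
r2_lines=$(gzip -dc sampleA_R2.fastq.gz | wc -l)
echo "R1 lines: $r1_lines, R2 lines: $r2_lines"
```

With real data, only the two `cat` lines are needed, with the prefix adjusted to your own file names.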
Some applications (e.g. HISAT2, TopHat) expect the file names within each read set (R1 or R2) to be separated by commas, but a system command like cat needs its input files separated by spaces so that each one is treated as a separate file. The joined output is then redirected (>) to a new file.
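To illustrate the difference, here is a small sketch using two hypothetical text files (`a.txt` and `b.txt` are made up for the demo). With spaces, cat receives two arguments; with a comma, the shell hands it a single file name that does not exist:

```shell
set -eu

workdir=$(mktemp -d)
cd "$workdir"
printf 'lane1\n' > a.txt
printf 'lane2\n' > b.txt

# Space-separated: cat sees two arguments and joins both files.
cat a.txt b.txt > joined.txt

# Comma-separated: the shell passes the single argument "a.txt,b.txt",
# which cat treats as one (nonexistent) file name, so it fails.
if cat a.txt,b.txt > comma.txt 2>/dev/null; then
  comma_status="ok"
else
  comma_status="failed"
fi
echo "comma form: $comma_status"
```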
However you do not need to do that if you want to align or run quantification of transcripts. Most tools can accept the chunk files. Even for QC it should be fine. The aligned file can be created on the fly with all the chunked fastq.gz files.
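As a rough sketch of passing the chunks directly, the comma-separated lists that HISAT2 accepts for `-1` and `-2` can be built from a glob. The sample prefix `sampleA`, the index name `genome_index`, and the output name are hypothetical, and the command is only printed here rather than run:

```shell
set -eu

# Hypothetical placeholder files standing in for the per-lane chunks.
workdir=$(mktemp -d)
cd "$workdir"
for lane in L001 L002 L003 L004; do
  touch "sampleA_${lane}_R1_001.fastq.gz" "sampleA_${lane}_R2_001.fastq.gz"
done

# Join the sorted file names with commas, one list per mate.
r1_list=$(printf '%s\n' sampleA_L00?_R1_001.fastq.gz | paste -sd, -)
r2_list=$(printf '%s\n' sampleA_L00?_R2_001.fastq.gz | paste -sd, -)

# Print the aligner command instead of running it; "genome_index" and
# "sampleA.sam" are assumed names for illustration.
echo hisat2 -x genome_index -1 "$r1_list" -2 "$r2_list" -S sampleA.sam
```

Because both lists come from the same sorted glob, the R1 and R2 chunks line up lane by lane.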
edit: Yes, genomax is correct. Since the data are paired-end, the read mates should be concatenated separately according to their pairs.
Ah yes, true, I was in a hurry; I have edited it. But I am curious to know what the OP wants to do by creating one file. A memory-efficient way is what the approach should be.
If you don't want to deal with separate files/processes then sure. Either way would be fine. If you are going to trim the data, make sure you use a paired-end-aware trimming program and trim the files in pairs (R1/R2).
No, it is not required. Even for trimming you can pass the chunks and then pass them on to the aligners. Just give a proper pattern for your input parsing so the programs recognize your R1 and R2 chunks separately.
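One way such a pattern could look, as a sketch only: loop over the R1 chunks and derive each R2 name by substitution. The `sampleA` prefix is hypothetical, and the `echo` stands in for whatever paired-end trimming or alignment command you would actually run:

```shell
set -eu

# Hypothetical placeholder files standing in for the per-lane chunks.
workdir=$(mktemp -d)
cd "$workdir"
for lane in L001 L002 L003 L004; do
  touch "sampleA_${lane}_R1_001.fastq.gz" "sampleA_${lane}_R2_001.fastq.gz"
done

pairs=0
for r1 in sampleA_L00?_R1_001.fastq.gz; do
  # Derive the mate's file name from the R1 name.
  r2=$(printf '%s' "$r1" | sed 's/_R1_/_R2_/')
  # Stop early if a mate file is missing, rather than process a lone R1.
  [ -e "$r2" ] || { echo "missing mate for $r1" >&2; exit 1; }
  # Placeholder for the real per-pair command.
  echo "pair: $r1 $r2"
  pairs=$((pairs + 1))
done
echo "found $pairs pairs"
```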
This is incorrect. Concatenated gzipped files are, in fact, valid. There are a few specific programs which fail on concatenated gzipped files, but that is due to noncompliant gzip implementation, as far as I understand it. Mainstream gzip implementations handle it just fine.
But don't take my word for it - try it with gzip, and become a true believer!
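A quick way to try it, using two throwaway files with made-up names in a temporary directory:

```shell
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Compress two pieces separately, then join the compressed files.
printf 'first\n' | gzip > part1.gz
printf 'second\n' | gzip > part2.gz
cat part1.gz part2.gz > both.gz

# gzip -t checks integrity; gzip -dc decompresses every member in turn.
gzip -t both.gz
decompressed=$(gzip -dc both.gz)
echo "$decompressed"
```

The decompressed output contains both pieces in order, which is why `cat` on `.fastq.gz` chunks works without decompressing them first.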
can you show us the command you're using?
Can you also post the error?
show me the output of
and
Thank you to everyone for the quick replies. It says command not found.
Pierre, the outputs are as follows: /bin/cat and A
Based on this output you should not get a command not found error.
If the command you posted above is correct (after I formatted it) the problem may be that you don't have spaces between the file names. So try this
Sorry, I thought there wasn't supposed to be spaces. Thanks so much!