Question

concatenate using cat

0

8.6 years ago

snp87 ▴ 80

Hello! I have just started working with some RNA-seq data that was generated on a NextSeq. Each sample generated 8 files (1_L001_R1_001.fastq.gz, plus the corresponding L001_R2, L002_R1, L002_R2, L003_R1, L003_R2, L004_R1 and L004_R2 files). I am trying to use cat to concatenate the files, but it keeps saying "command not found". Can someone help with how this could be done?

Thanks so much!

RNA-Seq sequencing • 3.3k views

ADD COMMENT • link updated 8.6 years ago by chen ★ 2.5k • written 8.6 years ago by snp87 ▴ 80

1

You should be concatenating R1 and R2 files separately (and in the same order) to avoid issues with mis-ordered pairs. Processing the pairs in existing pieces can allow you to do things in parallel. Data can then be merged at the BAM level. Something to consider.
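
A minimal sketch of concatenating the reads separately, assuming the lane files are named `<sample>_L00N_R1_001.fastq.gz` and `<sample>_L00N_R2_001.fastq.gz` as in the question. The function name `merge_lanes` and the output names are made up for illustration.

```shell
# merge_lanes SAMPLE IN_DIR OUT_DIR
# Concatenates all lane files of one sample, R1 and R2 separately.
# The shell expands a glob in sorted order, so the lanes are joined
# in the same order (L001, L002, ...) for both reads.
merge_lanes() {
    sample=$1 in_dir=$2 out_dir=$3
    for read in R1 R2; do
        cat "$in_dir/${sample}"_L00?_"${read}"_001.fastq.gz \
            > "$out_dir/${sample}_${read}.fastq.gz"
    done
}

# Example call (hypothetical directories):
#   merge_lanes 1 raw_fastq merged_fastq
```

Because both globs differ only in the R1/R2 tag, read N of the merged R1 file should still line up with read N of the merged R2 file.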

ADD REPLY • link 8.6 years ago by GenoMax 148k

0

can you show us the command you're using?

ADD REPLY • link 8.6 years ago by st.ph.n ★ 2.7k

0

cat 1_L001_R1_001.fastq.gz1_L002_R1_001.fastq.gz1_L003_R1_001.fastq.gz1_S3_L004_R1_001.fastq.gz > 1_R1_001.fastq.gz

ADD REPLY • link updated 8.6 years ago by GenoMax 148k • written 8.6 years ago by snp87 ▴ 80

0

Can you also post the error?

ADD REPLY • link 8.6 years ago by GenoMax 148k

0

show me the output of

which cat

and

echo "A" | cat

ADD REPLY • link 8.6 years ago by Pierre Lindenbaum 165k

0

Thank you to everyone for the quick replies. It says command not found.

Pierre, the outputs are as follows: /bin/cat and A

ADD REPLY • link 8.6 years ago by snp87 ▴ 80

1

Based on this output you should not get a command not found error.

ADD REPLY • link 8.6 years ago by GenoMax 148k

1

If the command you posted above is correct (after I formatted it), the problem may be that you don't have spaces between the file names. Try this:

cat 1_L001_R1_001.fastq.gz 1_L002_R1_001.fastq.gz 1_L003_R1_001.fastq.gz 1_S3_L004_R1_001.fastq.gz > 1_R1_001.fastq.gz

ADD REPLY • link 8.6 years ago by GenoMax 148k

0

Sorry, I thought there weren't supposed to be spaces. Thanks so much!

ADD REPLY • link 8.6 years ago by snp87 ▴ 80

0

Some applications (e.g. HISAT2, TopHat) expect multiple input files for the same read (all R1 files, or all R2 files) to be given as one comma-separated list. For a system command like cat, you need to separate the input files with spaces. The shell passes each space-separated name to cat as a separate file, and the redirect (>) writes the joined output to a new file.
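
To make the difference concrete, here is a small sketch. The helper `join_with_commas` is made up for illustration, and `genome_index` and `sample1.sam` are placeholder names. The hisat2 command is only printed, not run.

```shell
# join_with_commas FILE...
# Prints its arguments as one comma-separated string.
join_with_commas() {
    (IFS=,; printf '%s\n' "$*")
}

# cat: space-separated arguments, each one is a separate input file.
#   cat 1_L001_R1_001.fastq.gz 1_L002_R1_001.fastq.gz > 1_R1_001.fastq.gz

# HISAT2: all R1 files form a single argument to -1, all R2 files a
# single argument to -2, with commas inside each list.
r1_list=$(join_with_commas 1_L001_R1_001.fastq.gz 1_L002_R1_001.fastq.gz)
r2_list=$(join_with_commas 1_L001_R2_001.fastq.gz 1_L002_R2_001.fastq.gz)
echo hisat2 -x genome_index -1 "$r1_list" -2 "$r2_list" -S sample1.sam
```

Without any separator, as in the original command, the shell sees one long word and hands cat a single file name that does not exist.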

ADD REPLY • link 8.6 years ago by GenoMax 148k

score 0 · Answer 1 · 2016-06-20

0

8.6 years ago

ivivek_ngs ★ 5.2k

cat input_dir/*R1*fastq.gz > path_to_output_dir/combined_R1.fastq.gz

cat input_dir/*R2*fastq.gz > path_to_output_dir/combined_R2.fastq.gz

However, you do not need to do that if you just want to align reads or quantify transcripts. Most tools accept the chunked files directly, and that should be fine for QC as well. The alignment file can be created on the fly from all the chunked fastq.gz files.

Edit: Yes, GenoMax is correct. Since the data are paired-end, read mates should be concatenated separately (R1 with R1, R2 with R2).
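
After concatenating, a quick sanity check is to confirm that the combined R1 and R2 files hold the same number of reads. This is only a sketch: the function names are made up, and it assumes plain four-line FASTQ records.

```shell
# count_reads FILE.fastq.gz
# Prints the number of reads, assuming four lines per FASTQ record.
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $((lines / 4))
}

# check_pairs R1.fastq.gz R2.fastq.gz
# Prints both counts and succeeds only if they are equal.
check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "R1: $n1 reads, R2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}

# Example call, using the output names from the commands above:
#   check_pairs path_to_output_dir/combined_R1.fastq.gz \
#               path_to_output_dir/combined_R2.fastq.gz
```

Equal counts do not prove the mates are in the same order, but a mismatch is a clear sign that a lane file was missed or added twice.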

ADD COMMENT • link 8.6 years ago by ivivek_ngs ★ 5.2k

2

don't use zcat, just cat. zcat would uncompress the fastq.gz files.

ADD REPLY • link 8.6 years ago by Pierre Lindenbaum 165k

0

Ah yes, true. I wrote that in a hurry and have edited it. But I am curious what the OP wants to achieve by creating one file; the approach should be whatever is most memory-efficient.

ADD REPLY • link 8.6 years ago by ivivek_ngs ★ 5.2k

0

I thought it might be better to concatenate first before aligning. Would you recommend not doing this?

ADD REPLY • link 8.6 years ago by snp87 ▴ 80

1

If you don't want to deal with separate files/processes then sure. Either way would be fine. If you are going to trim data make sure you use a paired-end aware trimming program and trim the files in pairs (R1/R2).

ADD REPLY • link 8.6 years ago by GenoMax 148k

0

No, it is not required. Even for trimming you can pass the chunks and then feed the results to the aligner. Just give the programs a proper input pattern so that they recognize your R1 and R2 chunks separately.
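
One way to sketch such a pattern, assuming the `_R1_001.fastq.gz` / `_R2_001.fastq.gz` naming from the question (the function names here are made up):

```shell
# r2_for R1_FILE
# Derives the mate file name by swapping the _R1_ tag for _R2_.
r2_for() {
    printf '%s\n' "$1" | sed 's/_R1_001\.fastq\.gz$/_R2_001.fastq.gz/'
}

# list_pairs DIR
# Prints one "R1<TAB>R2" line per lane chunk and skips any chunk
# for which one of the two files is not found.
list_pairs() {
    for r1 in "$1"/*_R1_001.fastq.gz; do
        r2=$(r2_for "$r1")
        if [ -e "$r1" ] && [ -e "$r2" ]; then
            printf '%s\t%s\n' "$r1" "$r2"
        else
            echo "skipping $r1: R1 or R2 file not found" >&2
        fi
    done
}
```

Each printed pair can then be handed to a paired-end aware trimmer or aligner, one lane at a time or in parallel.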

ADD REPLY • link 8.6 years ago by ivivek_ngs ★ 5.2k

0

Thanks so much for the suggestions.

ADD REPLY • link 8.6 years ago by snp87 ▴ 80

score 0 · Answer 2 · 2016-06-20

0

8.6 years ago

chen ★ 2.5k

~~You cannot cat two gzipped files, because it will break the gzip format.~~

~~gunzip them first and cat the unzipped files~~

ADD COMMENT • link 8.6 years ago by chen ★ 2.5k

5

This is incorrect. Concatenated gzipped files are, in fact, valid. There are a few specific programs which fail on concatenated gzipped files, but that is due to noncompliant gzip implementation, as far as I understand it. Mainstream gzip implementations handle it just fine.

But don't take my word for it - try it with gzip, and become a true believer!
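
A minimal way to try it, using throwaway files in a temporary directory (the file names are arbitrary):

```shell
# Build two small gzipped files, join them with cat, and decompress
# the joined file to see that both members come back intact.
workdir=$(mktemp -d)
printf 'first\n'  | gzip -c > "$workdir/a.gz"
printf 'second\n' | gzip -c > "$workdir/b.gz"

cat "$workdir/a.gz" "$workdir/b.gz" > "$workdir/ab.gz"

# gzip -t checks the integrity of the joined file.
gzip -t "$workdir/ab.gz" && echo "ab.gz passes the gzip integrity test"

# Decompress to stdout: the contents of a.gz followed by b.gz.
gzip -dc "$workdir/ab.gz"
```

With a standard gzip, the last command should print "first" and then "second", the same as decompressing the two files one after the other.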

ADD REPLY • link 8.6 years ago by Brian Bushnell 20k

1

You are correct. I just tried to cat gz files, and it did work.

Thanks for your correction, man!

ADD REPLY • link 8.6 years ago by chen ★ 2.5k