Question

How to concatenate 24 strand-specific RNA-seq FASTQ libraries in Linux?

2.9 years ago

slin023 • 0

Hi, I am looking for suggestions on how to concatenate 24 strand-specific RNA-seq libraries in Linux. I have tried a few tricks with cat, but nothing made sense. So far the only command that worked was ls *_1.fq.gz | sort | xargs cat > CG1-1_1.fq.gz. However, when I gunzip the concatenated .fastq.gz, it shows:

gzip: 80OF_01.fq.gz: invalid compressed data--crc error
gzip: 80OF_01.fq.gz: invalid compressed data--length error

which suggests that the concatenated .fastq.gz is corrupted. Also, since these are stranded libraries, the FASTQ files have to be concatenated in a consistent order,

e.g.

cat control_401_01.fastq.gz control_402_01.fastq.gz control_403_01.fastq.gz > control_01.fastq.gz
cat control_401_02.fastq.gz control_402_02.fastq.gz control_403_02.fastq.gz > control_02.fastq.gz
..
..

and so on for the remaining files.

Here are all the labels of the libraries from one sample:

[image not preserved]

If you have any suggestions, please let me know. Thank you for your time!

RNA-seq • 2.6k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 2.9 years ago by slin023 • 0

Why do you need to sort the files before merging them? If sorting is not needed, you can try this: cat *_1.fq.gz > CG1-1_1.fq.gz

ADD REPLY • link 2.9 years ago by cpad0112 21k

Listing the files first with ls or find is actually a good idea. There were posts (I cannot find them now) reporting that a plain cat * (...) can lead to unwanted behaviour, with the newly generated file being appended to itself. Hence, it is better to first list the files and then use the | xargs cat syntax.
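As a rough sketch of the list-then-concatenate idea, assuming GNU find, sort and xargs as found on most Linux systems: the helper name cat_sorted is made up for illustration, and the extra ! -name test is an added guard (not something taken from the posts mentioned) that skips any file carrying the output's own name, since the shell may create the output file before find lists the directory.

```shell
# cat_sorted PATTERN OUTFILE
# Concatenate the files in the current directory that match PATTERN,
# in byte-wise sorted order, into OUTFILE.
cat_sorted() {
    pattern=$1
    out=$2
    # -print0 / sort -z / xargs -0 pass names NUL-separated, so spaces are safe.
    # "! -name" leaves out any file that shares the output file's name.
    find . -maxdepth 1 -type f -name "$pattern" ! -name "${out##*/}" -print0 |
        LC_ALL=C sort -z |
        xargs -0 cat > "$out"
}
```

For example, cat_sorted '*_1.fq.gz' CG1-1_R1.fq.gz would build the forward-read file. The pattern is quoted so that find, not the shell, expands it.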

ADD REPLY • link 2.9 years ago by ATpoint 85k

Ah, here is the thread I was referring to: merge large amount of fastq files into a single one

ADD REPLY • link 2.9 years ago by ATpoint 85k

The user needs to make sure that the new file's name does not match the pattern used by cat. In this case it would be cat *_1.fq.gz > CG1-1_1.fastq.gz, or CG1-1_R1.fq.gz, or any other output name that does not end in _1.fq.gz.
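One way to rule out the pattern clash altogether is to write the merged files into a separate directory. The sketch below assumes the *_1.fq.gz / *_2.fq.gz naming from the question; merge_mates, the output directory and the sample name are placeholders, not anything prescribed in this thread.

```shell
# merge_mates OUTDIR SAMPLE
# For mates 1 and 2, concatenate ./*_N.fq.gz into OUTDIR/SAMPLE_N.fq.gz.
merge_mates() {
    outdir=$1
    sample=$2
    mkdir -p -- "$outdir" || return 1
    for mate in 1 2; do
        # The shell expands the glob in sorted order. The output sits in
        # another directory, so the glob cannot pick it up as an input.
        # Assumption: the file-name prefixes sort the same way for both mates.
        cat -- *_"$mate".fq.gz > "$outdir/${sample}_${mate}.fq.gz" || return 1
    done
}
```

Called as merge_mates merged CG1-1, this would write merged/CG1-1_1.fq.gz and merged/CG1-1_2.fq.gz. Running echo *_1.fq.gz and echo *_2.fq.gz beforehand shows the order that will be used for each mate, which is worth comparing by eye.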

ADD REPLY • link 2.9 years ago by cpad0112 21k

score 1 · Answer 1 · 2021-12-30

2.9 years ago

FrozenRainbow ▴ 10

You can try:
cat $(ls *_1.fq.gz | sort) > control_01.fq.gz

ADD COMMENT • link 2.9 years ago by FrozenRainbow ▴ 10

I tried your command; unfortunately, when I gunzip the .fq.gz, it still shows:

gzip: 80OF_01.fq.gz: invalid compressed data--crc error
gzip: 80OF_01.fq.gz: invalid compressed data--length error

ADD REPLY • link 2.9 years ago by slin023 • 0

Can you try to gunzip and re-gzip only the file 80OF_01.fq.gz, and then cat all the files? Take a backup of your file before you do this.
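A cautious way to do this could look like the sketch below, which assumes there is enough disk space for a backup plus a temporary uncompressed copy; recompress_gz is a hypothetical helper name. If the file is genuinely corrupt, decompression will fail again and re-compressing cannot repair it, so the function stops and leaves the original as it was.

```shell
# recompress_gz FILE
# Back up FILE as FILE.bak, then decompress and re-compress it in place.
recompress_gz() {
    f=$1
    # Keep an untouched copy before changing anything
    cp -- "$f" "$f.bak" || return 1
    # gzip exits non-zero on CRC, length or truncation errors
    if gzip -dc -- "$f" > "$f.tmp"; then
        # Replace the original only after decompression succeeded
        gzip -c < "$f.tmp" > "$f" && rm -f -- "$f.tmp"
    else
        echo "could not decompress $f cleanly; file left unchanged" >&2
        rm -f -- "$f.tmp"
        return 1
    fi
}
```

For the file named in the error message, the call would be recompress_gz 80OF_01.fq.gz. A non-zero exit status means the input itself is damaged and probably needs to be copied or downloaded again from its source.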

ADD REPLY • link 2.9 years ago by cpad0112 21k

So I did some tests. I can gunzip the .fastq.gz from another sample, but not from the "80OF" sample, and I can also map the non-corrupted .fastq files using STAR; apparently at least one of the files in the "80OF" sample is corrupted. A FastQC report should reveal which one, correct? Or is there a command that could show which one it is?

ADD REPLY • link 2.9 years ago by slin023 • 0

Try checking with these tools: https://github.com/hhg7/fastq_corrupt_check, https://github.com/statgen/fastQValidator, https://github.com/nunofonseca/fastq_utils
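Before reaching for a dedicated validator, gzip's own test mode can narrow down which input is damaged. The sketch below loops over the given files and reports those that fail gzip -t; check_gz is a made-up helper name. Note that this only verifies the gzip container (CRC and length), not whether the FASTQ records inside are well formed.

```shell
# check_gz FILE...
# Print the name of every file that fails gzip's integrity test.
check_gz() {
    for f in "$@"; do
        # gzip -t decompresses without writing output and checks CRC and length
        if ! gzip -t -- "$f" 2>/dev/null; then
            echo "CORRUPT: $f"
        fi
    done
}
```

With the naming from the question, check_gz *_1.fq.gz *_2.fq.gz would list any corrupt inputs; no output means every file passed the test.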

ADD REPLY • link 2.9 years ago by cpad0112 21k