Hi,
I have spent several hours trying to figure out the best approach to do this. It would have been quicker to manually do it, but I will need to do this in future.
I have 40 paired-end RNAseq samples that were read across 5 lanes. I therefore have 400 fastq.gz files that I would like to process in Kallisto. The file name structure is as follows:
string_laneID_sampleID_pairID.fastq.gz
'string' is the same for every file
I want to concatenate the 5 lane files for each of the 40 samples, rather than running Kallisto on 200 separate paired-end file sets (is this the correct approach?).
Can someone please advise on the best way to concatenate these files? I have some knowledge of Python and could do a bash script if someone could explain what each part means. Thank you
Thanks for your help - the trouble is I have no idea how to specify the right files in a for loop. I understand the principle; it's how you construct the code that is the issue. I guess I might have more luck with Python. I suppose I could try an os.walk across the directory, with a for loop that checks for the sampleID, and then somehow execute a shell script to concatenate all files with a given sampleID into a new file.
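The os.walk idea described here could look roughly like the sketch below. It only solves the "which files belong together" part: it groups files by sampleID and pairID and sorts each group by laneID. It assumes the four underscore-separated fields from the question (string_laneID_sampleID_pairID.fastq.gz) and that laneID, sampleID and pairID contain no underscores themselves; the function name group_fastq_files is made up for illustration.

```python
import os
from collections import defaultdict

SUFFIX = ".fastq.gz"


def group_fastq_files(directory):
    """Map (sampleID, pairID) -> list of lane files, sorted by laneID.

    Assumes names of the form string_laneID_sampleID_pairID.fastq.gz.
    """
    groups = defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if not name.endswith(SUFFIX):
                continue
            # Split from the right so underscores inside 'string' do no harm.
            parts = name[: -len(SUFFIX)].rsplit("_", 3)
            if len(parts) != 4:
                continue  # does not match the assumed naming pattern
            _string, lane, sample, pair = parts
            groups[(sample, pair)].append((lane, os.path.join(root, name)))
    # Sort each group by laneID so both mates get the same lane order.
    return {
        key: [path for _lane, path in sorted(items)]
        for key, items in groups.items()
    }
```

Each value in the returned dictionary is the ordered list of inputs for one output file. Keeping the lane order identical for both mates of a sample matters, because the reads in the two merged files have to stay paired line for line.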
Let us do a very simple two-step approach (@Pierre has a fancier one-liner).
Step 1: Grab the unique sample IDs in a file
ID file should have
Step 2: Walk through the ID file one record at a time to create the command line you need for each cat command. This can be done in more complex ways, but I am using a command line that should be easy to understand.
It should get you the output below (remove the word echo when everything looks good to actually execute the commands, and repeat for the R2 files).
Thank you so much to both of you! That makes total sense.
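For anyone who prefers Python, the same two steps might be sketched as follows. This is not the command line from the answer, just an equivalent under assumptions: files are named string_laneID_sampleID_pairID.fastq.gz, the pairID values are R1 and R2, and the merged files go to a separate output directory. The names unique_sample_ids and merge_lanes are hypothetical. The dry_run flag plays the role of echo: it prints the equivalent cat command without touching any file.

```python
import glob
import os
import shutil


def unique_sample_ids(directory, pair_id="R1"):
    """Step 1: collect the distinct sampleID fields from the file names."""
    ids = set()
    for path in glob.glob(os.path.join(directory, f"*_{pair_id}.fastq.gz")):
        # Fields from the right: ..._laneID_sampleID_pairID.fastq.gz
        ids.add(os.path.basename(path).rsplit("_", 2)[1])
    return sorted(ids)


def merge_lanes(directory, sample_id, pair_id, out_dir, dry_run=True):
    """Step 2: join all lane files of one sample and mate, in lane order."""
    pattern = os.path.join(directory, f"*_{sample_id}_{pair_id}.fastq.gz")
    inputs = sorted(glob.glob(pattern))  # name order == lane order here
    target = os.path.join(out_dir, f"{sample_id}_{pair_id}.fastq.gz")
    if dry_run:
        # Same role as `echo` in front of the shell command: show, don't run.
        print("cat", *inputs, ">", target)
        return target
    with open(target, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # raw byte append, like cat
    return target
```

A typical run would loop over unique_sample_ids(directory) and call merge_lanes once with "R1" and once with "R2" for each ID, first with dry_run=True to check the printed commands and then with dry_run=False. The output directory should differ from the input directory so that merged files are not picked up as inputs on a later run.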
Thank you so much for this explanation and the scripts. I used it and it worked perfectly for me! Special shoutout for the heads-up to "remove the word echo when everything looks good to actually execute the commands".
I've been using this method for a while. I've always been dubious of it, since cat-ing multiple .gz files like this seems sketchy, but it always seems to work.
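On the "sketchy" part: the gzip format (RFC 1952) defines a file as a series of compressed members, so several .gz files joined end to end are still a valid gzip file that decompresses to the concatenated contents. A small Python check of that property, using made-up two-read data in place of real FASTQ files:

```python
import gzip

# Two independently compressed chunks, standing in for two lane files.
lane1 = b"@read1\nACGT\n+\nIIII\n"
lane2 = b"@read2\nTTGA\n+\nIIII\n"
part_a = gzip.compress(lane1)
part_b = gzip.compress(lane2)

# Plain byte concatenation is what `cat a.gz b.gz > joined.gz` produces.
joined = part_a + part_b

# Decompressing the joined bytes walks through both gzip members in turn.
restored = gzip.decompress(joined)
print(restored == lane1 + lane2)
```

The comparison should hold because the decompressor reads one member after the other, which is why the cat approach keeps working in practice.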