I have downloaded some RNA-seq data from GEO as FASTQ files, which I plan to run through the nf-core pipeline. This is a small subset of the data so that I can try the pipeline out before scaling up the number of samples.
I am trying to build the input CSV file from the names of the FASTQ files I have downloaded, using Bash in the Unix environment on my Mac.
The structure I am aiming to create is:
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto
Each row represents a FASTQ file (single-end) or a pair of FASTQ files (paired-end). Rows with the same sample identifier are considered technical replicates and are merged automatically. The strandedness refers to the library preparation and will be inferred automatically if set to auto.
I have a directory of 6 FASTQ files, making up 3 paired-end samples, that looks as follows:
SRR6727624_GSM3004545_TALL_JS_1_polyA_RNA_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR6727624_GSM3004545_TALL_JS_1_polyA_RNA_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR6727625_GSM3004546_TALL_JS_2_polyA_RNA_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR6727625_GSM3004546_TALL_JS_2_polyA_RNA_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR6727626_GSM3004547_TALL_JS_3_polyA_RNA_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR6727626_GSM3004547_TALL_JS_3_polyA_RNA_Homo_sapiens_RNA-Seq_2.fastq.gz
I am hoping to create a Bash script that can extract the file names and write them into a CSV file, so that I can later adapt the script to larger numbers of FASTQ samples. Apologies if this is an obvious question, and thanks in advance for your assistance.
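One possible approach, as a minimal sketch rather than a definitive solution: loop over the `_1.fastq.gz` files, derive each mate's name by swapping the suffix, and print one CSV row per pair. The function name `make_samplesheet` is hypothetical, and two things are assumptions: that every pair is named `<prefix>_1.fastq.gz` / `<prefix>_2.fastq.gz`, and that the SRR accession (the text before the first underscore) is an acceptable sample ID.

```shell
#!/usr/bin/env bash
# Sketch: print an nf-core style samplesheet for paired-end FASTQ files.
# Assumption: mates are named <prefix>_1.fastq.gz and <prefix>_2.fastq.gz.

make_samplesheet() {   # hypothetical helper name
    local dir="$1"
    local r1 r2 sample

    echo "sample,fastq_1,fastq_2,strandedness"

    for r1 in "$dir"/*_1.fastq.gz; do
        # If nothing matched, the glob is left unexpanded; skip it.
        [ -e "$r1" ] || continue

        # Derive the mate's path by replacing the trailing _1 with _2.
        r2="${r1%_1.fastq.gz}_2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "warning: no mate found for $r1, skipping" >&2
            continue
        fi

        # Assumption: use the text before the first underscore
        # (the SRR accession) as the sample ID.
        sample="$(basename "$r1" _1.fastq.gz)"
        sample="${sample%%_*}"

        echo "${sample},${r1},${r2},auto"
    done
}
```

After defining (or sourcing) the function, `make_samplesheet /path/to/fastq_dir > samplesheet.csv` would write the sheet. Passing an absolute directory keeps the paths in the CSV absolute, which is usually the safer choice when the pipeline runs from a different working directory. To merge files as technical replicates, the sample ID line would need to be changed so that those files share the same ID.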
Thank you, this is super helpful!