Question

Creating a csv in a specific format from FASTQ file names

9 months ago

jamie3355 • 0

I have downloaded some RNAseq data from GEO in the form of FASTQ files which I plan to run through the nf-core pipeline. This is a small subset of data so that I can try it out before scaling up the number of samples.

I am trying to create an input csv file constructed from the file names of the FASTQ files I have downloaded using BASH in the UNIX environment on my Mac.

The structure I am aiming to create is:

sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto

Each row represents a fastq file (single-end) or a pair of fastq files (paired end). Rows with the same sample identifier are considered technical replicates and merged automatically. The strandedness refers to the library preparation and will be automatically inferred if set to auto.

I have a directory of 6 fastq files consisting of 3 paired end reads that looks as follows:

SRR6727624_GSM3004545_TALL_JS_1_polyA_RNA_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR6727624_GSM3004545_TALL_JS_1_polyA_RNA_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR6727625_GSM3004546_TALL_JS_2_polyA_RNA_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR6727625_GSM3004546_TALL_JS_2_polyA_RNA_Homo_sapiens_RNA-Seq_2.fastq.gz
SRR6727626_GSM3004547_TALL_JS_3_polyA_RNA_Homo_sapiens_RNA-Seq_1.fastq.gz
SRR6727626_GSM3004547_TALL_JS_3_polyA_RNA_Homo_sapiens_RNA-Seq_2.fastq.gz

I am hoping to create a bash script that extracts this information from the file names and writes it to a csv, so that I can reuse the script on larger numbers of fastq samples. Apologies if this is an obvious question, and thanks in advance for your assistance.

nf-core bash • 1.1k views

updated 9 months ago by Mahmoud.Bassyouni • written 9 months ago by jamie3355

score 3 · Answer 1 · 2024-02-01

9 months ago

Pierre Lindenbaum 164k

Something like:

find /path/to/dir -type f -name "*.fastq.gz" | LC_ALL=C sort | paste - - |\
awk -F '\t' 'BEGIN {printf("sample,fastq_1,fastq_2,strandedness\n");} {printf("CONTROL,%s,%s,auto\n",$1,$2);}'
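
Note that the one-liner labels every row CONTROL, so by the rule quoted in the question all three runs would be treated as technical replicates and merged. A minimal sketch of a variant that gives each pair its own sample name, assuming the pairs end in _1.fastq.gz / _2.fastq.gz and that the sample name is simply the file name without that suffix (the function name make_samplesheet is made up for illustration):

```shell
#!/usr/bin/env bash
# Sketch: one sample name per FASTQ pair, derived from the R1 file name.
# Assumes mates sort next to each other and end in _1.fastq.gz / _2.fastq.gz.
make_samplesheet() {
    find "$1" -type f -name "*.fastq.gz" | LC_ALL=C sort | paste - - |
    awk -F '\t' 'BEGIN { print "sample,fastq_1,fastq_2,strandedness" }
    {
        sample = $1
        sub(/.*\//, "", sample)            # drop the directory part
        sub(/_1\.fastq\.gz$/, "", sample)  # drop the read suffix
        printf("%s,%s,%s,auto\n", sample, $1, $2)
    }'
}
```

Call it as make_samplesheet /path/to/dir > samples.csv. Pairing by sort order only works if every file has its mate present, so it is worth eyeballing the result.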

score 3 · Answer 2 · 2024-02-01

9 months ago

Harshil ▴ 80

You can also try to use nf-core/fetchngs to download the data from GEO and it will automatically create a sample sheet compatible with the nf-core/rnaseq pipeline. Some docs here: https://nf-co.re/fetchngs/1.11.0/docs/usage#samplesheet-format

We are planning on getting another release out next week. In any case, please feel free to reach out to us on the nf-core Slack workspace in the #fetchngs or #rnaseq channels.
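
For the three runs in the question, a fetchngs launch might look roughly like the sketch below. The file name ids.csv, the output directory and the docker profile are assumptions for illustration, and the pinned version is just the one from the docs link above. The --nf_core_pipeline rnaseq option asks fetchngs to write a sample sheet with the columns nf-core/rnaseq expects. The command is printed rather than executed so it can be checked against the docs for your version first.

```shell
#!/usr/bin/env bash
# One run accession per line, taken from the file names in the question
printf '%s\n' SRR6727624 SRR6727625 SRR6727626 > ids.csv

# Assumed invocation: profile, output directory and version are illustrative.
# Printed for review only; paste the line into a terminal to launch it.
echo "nextflow run nf-core/fetchngs -r 1.11.0 -profile docker --input ids.csv --outdir fetchngs_results --nf_core_pipeline rnaseq"
```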

Thank you, this is super helpful!

Reply • 9 months ago by jamie3355

score 0 · Answer 3 · 2024-02-06

Save the script below as script.sh and make it executable:

chmod +x script.sh

Then run it with the strandedness of your data files (unstranded|forward|reverse|auto) as the first argument and, optionally, the output file name as the second (it defaults to samples.csv):

./script.sh <strandedness> [output_file.csv]

#!/bin/bash

# Check if strandedness argument is provided
if [ "$#" -lt 1 ]; then
    echo "Usage: $0 <strandedness> [output_file]"
    exit 1
fi

# Arguments
STRANDEDNESS=$1
OUTPUT_FILE=${2:-"samples.csv"} # Default output filename if not provided

# Validate strandedness input
if ! [[ "$STRANDEDNESS" =~ ^(unstranded|forward|reverse|auto)$ ]]; then
    echo "Strandedness must be one of: unstranded, forward, reverse, auto"
    exit 1
fi

# Write the CSV header
echo "sample,fastq_1,fastq_2,strandedness" > "$OUTPUT_FILE"

# Process files in the current directory
for R1_FILE in ./*_1.fastq.gz; do
    # Skip the unexpanded pattern when no files match
    [ -e "$R1_FILE" ] || continue
    # Determine the R2 filename by swapping the trailing _1.fastq.gz for _2.fastq.gz
    R2_FILE="${R1_FILE%_1.fastq.gz}_2.fastq.gz"

    # Check if the R2 file exists
    if [ -f "$R2_FILE" ]; then
        # Extract the sample name from the R1 file name
        SAMPLE_NAME=$(basename "$R1_FILE")
        SAMPLE_NAME=${SAMPLE_NAME%_*} # Remove the trailing _1.fastq.gz
        SAMPLE_NAME=${SAMPLE_NAME%_*} # Remove the last underscore-delimited field (e.g. _RNA-Seq)

        # Append the information to the output file
        echo "$SAMPLE_NAME,$(realpath "$R1_FILE"),$(realpath "$R2_FILE"),$STRANDEDNESS" >> "$OUTPUT_FILE"
    else
        echo "Matching R2 file not found for $R1_FILE" >&2
    fi
done

# Notify the user
echo "CSV file has been created: $OUTPUT_FILE"
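
To see what the two ${SAMPLE_NAME%_*} steps in the script do, here they are applied to one of the file names from the question. The last line is an alternative, not part of the script above: if only the run accession is wanted as the sample name, ${f%%_*} keeps everything before the first underscore. The variable names f and name are just for this illustration.

```shell
#!/usr/bin/env bash
f="SRR6727624_GSM3004545_TALL_JS_1_polyA_RNA_Homo_sapiens_RNA-Seq_1.fastq.gz"

name=${f%_*}      # shortest suffix starting with "_" is removed: _1.fastq.gz
echo "$name"      # SRR6727624_GSM3004545_TALL_JS_1_polyA_RNA_Homo_sapiens_RNA-Seq

name=${name%_*}   # same again, now dropping _RNA-Seq
echo "$name"      # SRR6727624_GSM3004545_TALL_JS_1_polyA_RNA_Homo_sapiens

echo "${f%%_*}"   # longest suffix removed instead: SRR6727624
```

Because the second step drops whatever the last underscore-delimited field happens to be, file names that follow a different pattern may need a different number of %_* steps.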