Question

awk column number is a variable

1

Entering edit mode

4.4 years ago

DBScan ▴ 450

I am trying to filter GTEx for liver tissue. I was able to extract the sample names from the "SamplesAtrributesDS.txt" file and find the corresponding column number in the "transcript_expected_count.gct" file for each sample. I then want to use awk to extract the counts from the column number, but I always get this error:

awk: fatal: cannot open file `{print $myvar}' for reading (No such file or directory)

I use the following code:

#!/bin/bash
Genes=$(awk -F "\t" '{if ($6=="Liver") print $1}' GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt)
for gene in $Genes
do
        colnr=$(head -n 3 GTEx_Analysis_2017-06-05_v8_RSEMv1.3.0_transcript_expected_count.gct | tail -n 1 | cut -f3- | tr '\t' '\n' | cat -n | grep $gene | cut -f 1)
        echo $colnr
        if [ -z "$colnr" ]
        then
                echo "\$colnr is empty"
        else
                echo $colnr
                awk -v myvar=$colnr '{print $myvar}' GTEx_Analysis_2017-06-05_v8_RSEMv1.3.0_transcript_expected_count.gct
        fi
done

I would be glad if someone can point out what I am doing wrong.

RNA-Seq awk bash • 3.0k views

ADD COMMENT • link updated 4.4 years ago by user_without_id ▴ 150 • written 4.4 years ago by DBScan ▴ 450

0

Entering edit mode

what is the output of

echo $colnr

before the error ? Furthermore, can you please use

echo "($colnr)"

to show if there is any space.

ADD REPLY • link 4.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

The output of echo "($colnr)" is

(   281)

Why is it that way?

ADD REPLY • link updated 4.4 years ago by Pierre Lindenbaum 164k • written 4.4 years ago by DBScan ▴ 450

0

Entering edit mode

because cat -n | insert some spaces before the line number.

So the command :

awk -v myvar=$colnr '{print $myvar}' file

becomes:

   awk -v myvar=  281  '{print $myvar}' file

where myvar is empty , 281 is the awk script and '{print $myvar}' and file are the filenames.

ADD REPLY • link 4.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Remove $, try:

awk -v myvar=$colnr '{print myvar}'

ADD REPLY • link 4.4 years ago by zx8754 12k

2

Entering edit mode

no I think OP wants to print the myvar-th column

ADD REPLY • link 4.4 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

4.4 years ago

Pierre Lindenbaum 164k

how about this ? (not verified)

$ wget -O - -q "https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RSEMv1.3.0_transcript_expected_count.gct.gz" | gunzip -c | tail -n+3 | awk '(NR==1) {split($0,header);next;} {for(i=3;i<=NR;i++) printf("%s\t%s\t%s\t%s\n",$1,$2,header[i],$i);}' | grep -F -f <(wget -O - -q "https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt" | awk -F '\t' '($6=="Liver") {print $1;}' )
ENST00000481305.1   ENSG00000002933.7   GTEX-11DXY-0526-SM-5EGGQ    86.44
ENST00000484928.5   ENSG00000002933.7   GTEX-11DXY-0526-SM-5EGGQ    2781.95
ENST00000494349.5   ENSG00000002933.7   GTEX-11DXY-0526-SM-5EGGQ    1195.63
ENST00000000412.7   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    379.42
ENST00000536844.5   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    0
ENST00000537621.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    22.62
ENST00000539143.5   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    94.39
ENST00000540837.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    2.81
ENST00000541507.5   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    457.08
ENST00000543159.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    23.87
ENST00000543258.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    2.26
ENST00000543704.5   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    0
ENST00000543834.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    0
ENST00000543845.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    0
ENST00000544193.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    0
ENST00000544245.1   ENSG00000003056.7   GTEX-11DXY-0526-SM-5EGGQ    0
ENST00000262820.7   ENSG00000003096.14  GTEX-11DXY-0526-SM-5EGGQ    8.07
ENST00000262820.7   ENSG00000003096.14  GTEX-11DXZ-0126-SM-5EGGY    4.05
ENST00000371876.5   ENSG00000003096.14  GTEX-11DXY-0526-SM-5EGGQ    2.24
ENST00000371876.5   ENSG00000003096.14  GTEX-11DXZ-0126-SM-5EGGY    0
(...)

ADD COMMENT • link 4.4 years ago by Pierre Lindenbaum 164k

score 2 · Accepted Answer · 2020-07-22

I would do something like this, in python, to filter for the columns of interest.

#!/usr/bin/env python3

path_ec = "~/Downloads/GTEx_Analysis_2017-06-05_v8_RSEMv1.3.0_transcript_expected_count.gct.gz"
path_sa = "~/Downloads/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt"

import pandas as pd
from itertools import count

SKIP_LINES_GTEx = 2
SEP = '\t'

df = pd.read_csv(path_sa, sep=SEP)
ids = df.SAMPID[df.SMTS == "Liver"]

all_cols = pd.read_csv(path_ec, sep=SEP, header=SKIP_LINES_GTEx, nrows=0).columns

# Not all SAMPIDs are in the transcript_expected_count file
ids = ids[ids.isin(all_cols)]

# Columns to keep
cols = ['transcript_id', 'gene_id'] + list(ids)

from tempfile import NamedTemporaryFile as TempFile
with TempFile(mode='w') as out_fd:

# OR:
# with open(output_filename, mode='w') as out_fd:

    print(F"Writing to temporary file {out_fd.name}")

    # Lines per chunk
    CHUNKSIZE = 1000

    # Read and write chunkwise
    for (df, n) in zip(pd.read_csv(path_ec, sep=SEP, header=SKIP_LINES_GTEx, chunksize=CHUNKSIZE, usecols=cols), count()):
        df.to_csv(out_fd, header=(n == 0), sep=SEP, mode='a', index=None)

    input("Done. Press ENTER")