How to combine read count txt files into a single file with file names as column headers
5.2 years ago
Bioinfonext ▴ 470

I have multiple txt files of RNA-seq read counts. Is it possible to generate a single txt file with the file names as column headers?

Each txt file holds read counts like this; the first column is the same in all files:

BGIOSGA000001   0
BGIOSGA000002   12
BGIOSGA000003   0
BGIOSGA000004   0
BGIOSGA000005   0
BGIOSGA000006   0
BGIOSGA000007   0
BGIOSGA000008   15

and the txt file names look like this:

Root_T3_S_R7_S56_L001.COUNT.txt
Leaf_T2_F_R5_S8_L001.COUNT.txt

so I want output like this:

                  Root_T3_S_R7_S56       Leaf_T2_F_R5_S8

BGIOSGA000001         0                           4
BGIOSGA000002        12                           0
BGIOSGA000003         0                           3
BGIOSGA000004         0                           2
BGIOSGA000005         0                           4

I will be thankful for your help.

Kind Regards, Bioinfonext

bash linux awk R • 5.8k views

You could have used featureCounts, which does this when you feed it multiple BAM files on the command line: featureCounts options BAM1 BAM2 BAM3. Provide them in the order you want them grouped so you don't need to rearrange columns afterwards.

Hi genomax,

I used HTSeq for the read counts and I have about 60 read count txt files.

Thanks Bioinfonext

Consider redoing the counts with featureCounts. You would be done with creating the count matrix in less time than it is going to take you to deal with 60 separate files :-)

echo -e '\tfile1\tfile2' && join -t $'\t' -1 1 -2 1 <(sort -t  $'\t' -k1,1 file1.txt) <(sort -t  $'\t' -k1,1 file2.txt)

Hi Pierre,

I have 60 read count txt files, so should I keep adding them all the way you have shown with two files?

Thanks Bioinfonext
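
For many files, the two-file join can be applied in a loop instead of being typed out 60 times. The following is a minimal sketch, assuming two-column tab-separated files that all contain the same set of gene IDs (plain join silently drops IDs that are missing from either input). The function name merge_counts is made up for illustration and is not part of any tool.

```shell
# merge_counts (hypothetical helper): join any number of two-column,
# tab-separated count files on column 1 and print a matrix with a header row.
# Assumes every file holds the same set of IDs; plain join drops unmatched IDs.
merge_counts() {
    tab=$(printf '\t')
    acc=$(mktemp) || return 1
    next=$(mktemp) || return 1
    header="ID"
    first=1
    for f in "$@"; do
        # Column name = file name with "_L001" and everything after it removed.
        header="${header}${tab}${f%_L001*}"
        if [ "$first" -eq 1 ]; then
            # First file: its sorted copy becomes the running table.
            LC_ALL=C sort -t "$tab" -k1,1 "$f" > "$acc"
            first=0
        else
            # Later files: join on the ID column and replace the running table.
            LC_ALL=C sort -t "$tab" -k1,1 "$f" |
                LC_ALL=C join -t "$tab" "$acc" - > "$next"
            mv "$next" "$acc"
        fi
    done
    printf '%s\n' "$header"
    cat "$acc"
    rm -f "$acc" "$next"
}
```

It could be called as merge_counts *.COUNT.txt > combined_counts.txt, where the output file name is only an example. Sorting and joining under the same LC_ALL=C locale keeps join from complaining about input order.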

Works great

5.2 years ago

something I wrote a while back (aka, there is likely a better/more efficient approach ;) )

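# Step 1: for every count file, write a sorted copy with a header line
n=0
for i in *.txt
do
    # Column name: file name with "_L001" and everything after it removed
    name=$(echo "$i" | sed 's/_L001.*//')
    printf 'ID\t%s\n' "$name" > "${i}_tmp"
    # head -n -1 (GNU head only) drops the last line of each file;
    # leave it out if your files have no trailing line to discard
    head -n -1 "$i" | cut -f 1,2 | sort -k1,1 >> "${i}_tmp"
    ((n++))
done

# Step 2: put all files side by side (columns: ID, count, ID, count, ...)
paste *_tmp > tmpOK
rm -f *_tmp

# Step 3: keep the first ID column plus every even-numbered (count) column
c="-f1"
for j in $(seq "$n")
do
    c="$c,$((2 * j))"
done

cut "$c" tmpOK > final_file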

thanks Lieven, your script works perfectly.

Thanks Again bioinfonext

After spending 4 hrs trying to combine the files with no luck, this finally worked. Thank you lieven.sterck.

I am getting "head: illegal line count -- -1" and the output contains only the column names (the file names). Putting a positive value after head -n gives only those n rows. Is there a workaround to get all the rows?

You could try tail (look up the syntax for it); tail -n+2, off the top of my head.

Alternatively, you can also get there using sed (sed '1d').
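
One caveat worth spelling out: tail -n +2 and sed '1d' remove the first line of a file, whereas the head -n-1 call in the script removes the last one. Negative line counts for head are a GNU extension, which would explain the "illegal line count" error on BSD or macOS systems. A small sketch of portable equivalents for both operations, using only POSIX sed; the helper names are invented for illustration:

```shell
# Hypothetical helpers built on POSIX sed, so they behave the same with
# GNU and BSD tools.

# Print FILE without its last line (same effect as GNU `head -n -1 FILE`).
drop_last_line() { sed '$d' "$1"; }

# Print FILE without its first line (same effect as `tail -n +2 FILE`).
drop_first_line() { sed '1d' "$1"; }
```

In the script, replacing the head call with sed '$d' "$i" should keep the original behaviour on systems without GNU head. If the count files have no trailing line to discard, the step can be dropped altogether.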

Thanks @lieven.sterck! But since my files have unequal numbers of rows, it all got messed up.

That should not happen, as it only makes sense to build a count matrix from mappings against the same reference (I can't think of any case where this could be otherwise).
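
Because paste glues lines together purely by position, it may be worth confirming that every file lists the same IDs before building the matrix. A small sketch, assuming tab-separated files with the ID in the first column; the function name check_same_ids is invented here:

```shell
# check_same_ids (hypothetical helper): compare the sorted ID column of every
# file with that of the first file; report mismatches on stderr and return
# non-zero if any file differs.
check_same_ids() {
    ref_ids=$(mktemp) || return 1
    cur_ids=$(mktemp) || return 1
    ref=$1
    cut -f1 "$ref" | LC_ALL=C sort > "$ref_ids"
    status=0
    for f in "$@"; do
        cut -f1 "$f" | LC_ALL=C sort > "$cur_ids"
        if ! cmp -s "$ref_ids" "$cur_ids"; then
            echo "ID column of $f differs from $ref" >&2
            status=1
        fi
    done
    rm -f "$ref_ids" "$cur_ids"
    return "$status"
}
```

Running check_same_ids *.txt before the paste step would flag any file with missing or extra IDs.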

Agreed, that should not happen. But the data I was looking at from GEO (Gene Expression Omnibus) has raw count files from the experiment, and surprisingly one of the replicates from their conditions has fewer rows (gene_id). Since it would be much easier not to redo the download/alignment, I was trying to assemble their raw counts into a combined matrix for analysis.

I was trying to assemble their raw counts into a combined matrix for analysis.

Exactly what you should do indeed :)

one of the replicates from their conditions has fewer rows (gene_id)

If it's only one file, are you not better off 'fixing' that one (adding a bogus gene_id line or such)?

Yes, it's only one file, and I agree with your solution. I was actually thinking of adding the missing gene_ids to the gene_id column and placing either 0 or a blank there. Which would you suggest is reasonable?

I think you can do either one ... zero might work better at first sight though.

However, I would personally not really trust that data :/ Is there any mention of why there are fewer lines in that file? Perhaps the file was truncated (when uploading or downloading it)?
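
If zero-filling is the chosen route, the missing IDs do not have to be added by hand: a single awk pass can collect the union of IDs across all files and print 0 wherever a file has no entry. A minimal sketch, assuming non-empty two-column tab-separated files; the name merge_counts_fill is made up, and whether 0 is an appropriate stand-in for a missing gene remains a judgment call for the analysis.

```shell
# merge_counts_fill (hypothetical helper): build a count matrix from any number
# of two-column, tab-separated files, writing 0 where a file lacks an ID.
# Rows appear in the order in which IDs are first seen.
merge_counts_fill() {
    awk -F '\t' -v OFS='\t' '
        FNR == 1 {                       # first line of each input file
            nfiles++
            name = FILENAME
            sub(/_L001.*$/, "", name)    # trim "_L001" and what follows
            header = header OFS name
        }
        {
            if (!($1 in seen)) { seen[$1] = 1; ids[++nids] = $1 }
            count[$1, nfiles] = $2       # count of this ID in the current file
        }
        END {
            print "ID" header
            for (i = 1; i <= nids; i++) {
                line = ids[i]
                for (j = 1; j <= nfiles; j++)
                    line = line OFS (((ids[i], j) in count) ? count[ids[i], j] : 0)
                print line
            }
        }
    ' "$@"
}
```

A call such as merge_counts_fill *.COUNT.txt > combined_counts.txt (output name chosen only as an example) writes the header plus one row per ID. Unlike the paste-based script, this does not depend on the files having the same number of rows.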

Agreed, the file might have got messed up during upload, or something else happened that is better known to them. I could not find any reason for the truncation or the missing rows in their write-up.
