Question

Match a column to another file (grep/awk)

2.6 years ago

Nathan ▴ 10

Hello, everyone. I would like help with a simple issue that I have not been able to solve.

I want to take the nucleotide sequences in the second column of a text file and match them against a fasta file to find the headers that correspond to these sequences. I would also like to modify each header according to the first column of the text file and generate a new fasta file, as demonstrated below.

Text file:

1        AACTGA
1        AACTGC
2        CCAGAT
3        GGATCA
3        GGATCC

Original fasta file:

>Sample 1
AACTGA
>Sample 2
CCAGAT
>Sample 3
AACTGA
>Sample 4
CCAGAT
>Sample 5
GGATCA
>Sample 6
GGATCC
>Sample 7
GGATCA
>Sample 8
GGATCC
>Sample 9
AACTGC
>Sample 10
AACTGC

Expected output:

>1|Sample 1
AACTGA
>1|Sample 3
AACTGA
>1|Sample 9
AACTGC
>1|Sample 10
AACTGC
>2|Sample 4
CCAGAT
>2|Sample 2
CCAGAT
>3|Sample 5
GGATCA
>3|Sample 7
GGATCA
>3|Sample 6
GGATCC
>3|Sample 8
GGATCC

I am still a beginner in bioinformatics and simple things are still a challenge for me. Thank you for the help!

grep awk • 619 views

updated 2.6 years ago by Matthias Zepper 5.0k • written 2.6 years ago by Nathan ▴ 10

score 0 · Answer 1 · 2022-04-26

I suppose this is a class exercise?

In that case, the first step is always to decide on a particular strategy and how to approach this task (derive the algorithm). All of this happens before you actually write the first line of code. Break down your task into single steps - this is what you need to practice, not writing code.

If you have done that and also show this effort in your question, the people on here will be happy to help you with whatever issue you might encounter while implementing it.

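Assuming that your files are named textfile.txt and fasta.fa, that the columns of the text file are separated by tabs, and that every sequence in the fasta file sits on a single line, this will work... but why it works is something you will still need to figure out!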

paste - - < fasta.fa > temp
awk -F "\t" 'FNR==NR{a[$2]=$1;next}{print ">"a[$2]"|"substr($1,2)"\n"$2}' textfile.txt temp > output.fa
rm temp

Can you tell me the algorithm that I used?
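
For anyone who wants to try this on the example data from the question, here is a minimal, self-contained sketch. It is a variant of the commands above, not the exact same thing: it rebuilds the two example files with printf (assuming tab-separated columns and single-line sequences), uses a pipe instead of a temp file, skips fasta records whose sequence is not listed in the text file, and adds a stable numeric sort so that the records come out grouped by the number from the first column. The array name `group` is just an illustrative choice, and `sort -s` is available in GNU and BSD sort but is not part of POSIX.

```shell
#!/bin/sh
# Recreate the example inputs from the question.
# Assumption: the two columns of the text file are separated by a tab.
printf '1\tAACTGA\n1\tAACTGC\n2\tCCAGAT\n3\tGGATCA\n3\tGGATCC\n' > textfile.txt
printf '>Sample %s\n%s\n' 1 AACTGA 2 CCAGAT 3 AACTGA 4 CCAGAT 5 GGATCA \
    6 GGATCC 7 GGATCA 8 GGATCC 9 AACTGC 10 AACTGC > fasta.fa

# 1. paste joins every header line with the sequence line that follows it,
#    giving one tab-separated record per line.
# 2. The first awk reads textfile.txt into a lookup table (sequence -> number),
#    then prints number, header (without ">") and sequence for every fasta
#    record whose sequence is in the table.
# 3. sort groups the records by number; -s keeps the fasta order within a group.
# 4. The second awk turns each record back into two fasta lines.
paste - - < fasta.fa |
awk -F '\t' '
    FNR == NR   { group[$2] = $1; next }
    $2 in group { print group[$2] "\t" substr($1, 2) "\t" $2 }
' textfile.txt - |
sort -s -n -k1,1 |
awk -F '\t' '{ print ">" $1 "|" $2 "\n" $3 }' > output.fa

cat output.fa
```

Within each group the records stay in the order of the fasta file, so the output will not match the expected output in the question line for line (for example, Sample 2 comes before Sample 4), but every header gets the same prefix.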