Extracting accession number from header using sed
7.3 years ago
ToastedGoat ▴ 10

Hello! I'm trying to figure out how to extract the accession numbers from the headers in my file (about 120 headers). I have to use sed and can't seem to figure it out. Here is a sample of what my file looks like:

>Ref.49_cpx.GM.03.N26677.HQ385479 

ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC

I need the part after the last period in the header, i.e. the "HQ385479" part. Thanks in advance for the help!

sed accession number • 3.9k views
cut -s -d '.' -f6 input.txt

"I need the part after the last period in the header."

Do you want to keep the rest of the alignments intact? I assume so but please clarify.

Edit: Based on a response below, it looks like you want to keep just the accessions.

If all header lines start with Ref, you could do:

grep "^>Ref" input.txt | sed 's/^.*\.//g' > accession
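In case it helps to see why this pattern leaves only the accession: `.*` is greedy, so `^.*\.` matches everything up to and including the last period on the line. Here is a minimal sketch using the header from the question; the variable names are only for illustration.

```shell
header='>Ref.49_cpx.GM.03.N26677.HQ385479'

# Greedy: ".*" runs to the last period on the line, so only the accession remains.
last_field=$(printf '%s\n' "$header" | sed 's/^.*\.//')

# For contrast: "[^.]*" cannot cross a period, so only the first field is removed.
after_first=$(printf '%s\n' "$header" | sed 's/^[^.]*\.//')

printf '%s\n' "$last_field"    # HQ385479
printf '%s\n' "$after_first"   # 49_cpx.GM.03.N26677.HQ385479
```

The second substitution is there only to show what a non-greedy-looking pattern would do; it is not what the question asks for.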


Ah yes completely forgot I could use grep first. Thanks.


It is not necessary to use grep. Input (the first sequence, copied and pasted as a second sequence with the ID at the end changed):

>Ref.49_cpx.GM.03.N26677.HQ385479 
ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC
>Ref.49_cpx.GM.03.N26677.HQ385478
ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC

output:

$ sed -e  '/>/!d; s/.*\.//g' test.fa 
HQ385479 
HQ385478
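Note that the first header in the sample ends with a space, and that space is carried into the output. A possible variant, assuming you want the trailing whitespace stripped as well, is sketched below. The function name `extract_accessions` is hypothetical and the sequences are shortened for the example.

```shell
# extract_accessions is a hypothetical helper name; it reads FASTA text on stdin.
extract_accessions() {
  # -n suppresses automatic printing; only header lines are edited and printed.
  sed -n '/^>/{
    s/[[:space:]]*$//
    s/.*\.//
    p
  }'
}

# The first header ends with a space, as in the sample above.
printf '%s\n' \
  '>Ref.49_cpx.GM.03.N26677.HQ385479 ' \
  'ATGAGAGTGATGG' \
  '>Ref.49_cpx.GM.03.N26677.HQ385478' \
  'ATGAGAGTGATGG' | extract_accessions
```

The whitespace is removed before the prefix so that the accession is the last thing left on the line.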

Thank you so much for this!

7.3 years ago
Joe 21k

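Since the OP has specifically requested sed:

$  sed -i 's/^.*\.//g' input.txt

With GNU sed, -i edits input.txt in place and prints nothing to the terminal. Afterwards the file contains: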

HQ385479

ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC
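
Because -i overwrites the original file, it may be worth keeping a backup while experimenting. Below is a cautious sketch that writes the suffix directly after -i, a form that both GNU and BSD sed accept. The temporary directory and the shortened sequence are assumptions made for the example.

```shell
# Work in a throwaway directory so no real data is touched.
workdir=$(mktemp -d)
printf '%s\n' '>Ref.49_cpx.GM.03.N26677.HQ385479' 'ATGAGAGTGATGG' > "$workdir/input.txt"

# Edit in place, but keep the original as input.txt.bak.
sed -i.bak 's/^.*\.//' "$workdir/input.txt"

cat "$workdir/input.txt"       # edited copy
cat "$workdir/input.txt.bak"   # untouched original
```

Dropping -i altogether and redirecting the output to a new file is another way to leave input.txt unchanged.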

Thanks! Is there a way within the sed command to remove all the nucleotide sequences as well so I'm just left with all the accession numbers? This is my first time doing any bioinformatics and I am still learning the whole programming/coding side of it all.


You can use this:

awk -F. 'NF>1{print $NF}' input.txt > output.txt
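
To unpack that one-liner: -F. splits each line on periods, NF>1 keeps only lines that contain at least one period, and $NF is the last field. If the sequence lines could themselves contain periods (an assumption; the sample above only uses dashes for gaps), restricting the match to header lines may be safer. A sketch, with a hypothetical function name:

```shell
# last_header_field is a hypothetical helper name; it reads FASTA text on stdin.
last_header_field() {
  # -F. splits each line on periods; /^>/ limits output to header lines;
  # $NF is the last field, i.e. the text after the final period.
  awk -F. '/^>/ && NF > 1 { print $NF }'
}

# The second line contains periods but is not a header, so it is skipped.
printf '%s\n' '>Ref.49_cpx.GM.03.N26677.HQ385479' 'ATG...GTG' | last_header_field
```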

For future reference: highlight the text you want to format as code and then click on the "101" button in the edit window to apply the formatting.


Is your file a FASTA-formatted file (header lines beginning with >)? Or is it exactly as you posted above?


It begins with >. It didn't copy in correctly.


I would just chain it to grep personally, but now the solution is getting a bit less elegant.

grep ">" input.txt | sed 's/^.*\.//g'

A very minor change to the code: use > as the replacement, so that the output is still in FASTA format.

$ sed -e 's/^.*\./>/g' test1.fa 
>HQ385479 

ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC

The OP said they don't want the sequence, just the accession itself (I'm assuming they're making a list for a table or similar), so there's no need to sub in the ">".


Okay, I didn't read the OP in full :)

7.3 years ago
bk11 ★ 3.0k
 awk -F. 'NF>1{print $NF}' input.txt