File Format - Fasta
0
0
Entering edit mode
8.7 years ago

Hi,

I have a file with contents:

>FC20423AAXX_7_1_82_883
GTTAGAGGTTCGAAG     
>FC20423AAXX_7_1_198_886
GGCTCAGTGGTCTAGTGGTATGATTCTCGCTT    
>FC20423AAXX_7_1_115_888
GGGGGTGTAGGGTGGGGTTGG           
>FC20423AAXX_7_1_99_894
GTTCGTATCCCACTTCTGACACCA

My desired output is:

>dme0_count=3
GTTAGAGGTTCGAAG
>dme1_count=8
GGCTCAGTGGTCTAGTGGTATGATTCTCGCTT
>dme2_count=3
GGGGGTGTAGGGTGGGGTTGG
>dme3_count=6
GTTCGTATCCCACTTCTGACACCA

I am looking for a one liner or script to:

  • replace the header as shown such that dmex_count=y (x=unique number, y=number of times the read appeared in the file).
  • Delete non-unique reads as well as reads with length <15 and >30 to be deleted.
  • Delete reads with count <=2

Before now, I've had to undertake one step at a time. I'd love to avoid that and use a one liner awk, sed or perl...

Thanks!

sequence • 1.9k views
ADD COMMENT
3
Entering edit mode

Honestly, there's nothing to be gained by having everything on one line. This isn't Twitter.

EDIT: More importantly, some of your questions can only be answered after reading the file through once first.

ADD REPLY
1
Entering edit mode

What tools/commands did you use for each of the individual steps? With a bit of tweaking, it should be straightforward to combine those steps into a script that does what you want.

People are more likely to help if you explain what you've tried and where you're stuck.

ADD REPLY
1
Entering edit mode

At the moment, here are the commands that i use:

 #Deletes odd number lines(header)    
perl -ne 'print if $. % 2 == 0' droso1.fa > file1.fa 

#Append count number to the end of each read            
cat file1.fa | uniq -c | awk '{print $2, $1}' > table.txt   

#insert new header and delete reads with count <=2       
perl -lane ' print ">dme_" . $n++ . "_count=$F[1]\n$F[0]" if ($F[1] > 2) ' < table.txt > final.fa

I am still looking for a simple way (awk, sed, ...) to delete reads of certain lengths. Thanks

ADD REPLY
2
Entering edit mode

This might bite you nastily in case your read is longer than the frame (usually 60 nucleotides) and gets wrapped arround. In that case it's no longer the odd line containing the fasta header.

Go for something like this, assuming you have Python and Biopython installed.

from Bio import SeqIO
import sys
import collections

Infasta = SeqIO.parse(sys.argv[1], "fasta")
countdict = collections.Counter()
for seq_record in Infasta:
    countdict[seq_record.seq] += 1

output = []
counter = 1
for seq_record in Infasta:
   if seq_record.seq = certainReadLength:
       seq_record.id = "dme" + counter + "_count=" countdict[seq_record.seq]
       counter += 1
      output.append(seq_record)
SeqIO.write(output, "ReformatedFile.fasta", "fasta")

Let me know if you don't get it working or need additional help.

ADD REPLY
1
Entering edit mode

This is great decosterwouter :) One thing that might speed things up a lot is using countdict = collections.defaultdict(int) instead of collections.Counter()

.Counter is this huge thing which does a lot more than just let you +=1 to something that doesnt exist, however, if thats all you want to do, defaultdict(int) is much much faster.

ADD REPLY
1
Entering edit mode

Thanks for the advice, you're completely right.

ADD REPLY
1
Entering edit mode

Would you only count exact matches of reads or also partial overlaps? What is the purpose of this name? This sounds like a quite easy thing to solve with a Python script.

I would start with reading the entire file record by record (BioPython SeqIO), count occurrences in a collections.Counter() dictionary with key = sequence and value += 1 for every occurrence. Then, loop over the dictionary keys: for seq in dictionary.keys() while looping, increment a integer for your unique identifier x and add the value from the dict as y.

Probably not a one liner. Need more help with this?

ADD REPLY
0
Entering edit mode

vsearch should be able to do what you want although it won't be one line of code. https://github.com/torognes/vsearch

You can download it from here https://github.com/torognes/vsearch/releases There is also a supplied manual there which is very good.

vsearch --derep_fulllength will merge strictly identical reads vsearch --sortbysize will remove reads with count <= 2

the --relabel [string] flag will relabel sequences with [string] followed with an increasing number (1,2,3 etc). If you use this in conjunction with --sizeout it will keep count information in the header.

The following is an example of a fasta header where I kept the old header information, but appended count data:

M00598:152:000000000-AN3KL:1:1102:13247:5391 1:N:0:164;barcode=AAAAAAAAAAAA;barcodelabel=10B_S164;;size=715;

ADD REPLY

Login before adding your answer.

Traffic: 2593 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6