Entering edit mode
8.7 years ago
Gabe Anderson
▴
10
Hi,
I have a file with contents:
>FC20423AAXX_7_1_82_883
GTTAGAGGTTCGAAG
>FC20423AAXX_7_1_198_886
GGCTCAGTGGTCTAGTGGTATGATTCTCGCTT
>FC20423AAXX_7_1_115_888
GGGGGTGTAGGGTGGGGTTGG
>FC20423AAXX_7_1_99_894
GTTCGTATCCCACTTCTGACACCA
My desired output is:
>dme0_count=3
GTTAGAGGTTCGAAG
>dme1_count=8
GGCTCAGTGGTCTAGTGGTATGATTCTCGCTT
>dme2_count=3
GGGGGTGTAGGGTGGGGTTGG
>dme3_count=6
GTTCGTATCCCACTTCTGACACCA
I am looking for a one liner or script to:
- replace the header as shown such that dmex_count=y (x=unique number, y=number of times the read appeared in the file).
- Delete non-unique reads as well as reads with length <15 and >30 to be deleted.
- Delete reads with count <=2
Before now, I've had to undertake one step at a time. I'd love to avoid that and use a one liner awk, sed or perl...
Thanks!
Honestly, there's nothing to be gained by having everything on one line. This isn't Twitter.
EDIT: More importantly, some of your questions can only be answered after reading the file through once first.
What tools/commands did you use for each of the individual steps? With a bit of tweaking, it should be straightforward to combine those steps into a script that does what you want.
People are more likely to help if you explain what you've tried and where you're stuck.
At the moment, here are the commands that i use:
I am still looking for a simple way (awk, sed, ...) to delete reads of certain lengths. Thanks
This might bite you nastily in case your read is longer than the frame (usually 60 nucleotides) and gets wrapped arround. In that case it's no longer the odd line containing the fasta header.
Go for something like this, assuming you have Python and Biopython installed.
Let me know if you don't get it working or need additional help.
This is great decosterwouter :) One thing that might speed things up a lot is using countdict = collections.defaultdict(int) instead of collections.Counter()
.Counter is this huge thing which does a lot more than just let you +=1 to something that doesnt exist, however, if thats all you want to do, defaultdict(int) is much much faster.
Thanks for the advice, you're completely right.
Would you only count exact matches of reads or also partial overlaps? What is the purpose of this name? This sounds like a quite easy thing to solve with a Python script.
I would start with reading the entire file record by record (BioPython SeqIO), count occurrences in a collections.Counter() dictionary with key = sequence and value += 1 for every occurrence. Then, loop over the dictionary keys: for seq in dictionary.keys() while looping, increment a integer for your unique identifier x and add the value from the dict as y.
Probably not a one liner. Need more help with this?
vsearch should be able to do what you want although it won't be one line of code. https://github.com/torognes/vsearch
You can download it from here https://github.com/torognes/vsearch/releases There is also a supplied manual there which is very good.
vsearch --derep_fulllength will merge strictly identical reads vsearch --sortbysize will remove reads with count <= 2
the --relabel [string] flag will relabel sequences with [string] followed with an increasing number (1,2,3 etc). If you use this in conjunction with --sizeout it will keep count information in the header.
The following is an example of a fasta header where I kept the old header information, but appended count data: