Selecting defined length fasta sequence and excluding them from a dataset
3
2
Entering edit mode
6.5 years ago
anis.geb.cu ▴ 30

I have a dataset of 5640 protein sequences in fasta format. I want to select and make a new dataset with the proteins having more than 100 amino acids. How can I do it using cmd command or Rstudio? What should be the script to do this? I badly need the answer!! Please help me.....

sequence R alignment • 7.3k views
ADD COMMENT
0
Entering edit mode

What format is your data in already? Have you created a dataframe?

Please provide more information - ideally a minimal representative dataset (the first 10 lines of a dataframe with head() would do for instance).

ADD REPLY
0
Entering edit mode

I haven't created a separate dataframe. The data are in fasta format in a fasta file. For example, below are the first 10 sequences of this dataset........

>gi|229269494|gb|ACQ51130.1| transition state regulatory protein, AbrB (plasmid) [Bacillus anthracis str. A0248] MRNCRKNPIEIFVEDEKIILQKSKSYDACTITADISEKIIPLANRQIVLSSDGIELLIKEIQQHLVK

>gi|229269492|gb|ACQ51128.1| hypothetical protein BAA_A0045 (plasmid) [Bacillus anthracis str. A0248] MPSKKCGEDKLDDEGPVIVRYIKDSILDAEEYRYMEFKEVKGDHSIRSILSVVDQYVVAYLNERKKQFPG TIIWGITDKERKVVGTKLIHKERDELRRSIVERLDQIQPPIPPSSYKIDIKEVYDSSLKHIEDTYIVEVI VDTVTGNDLFATAKNEVYIKTDGGKKKLNHIQIQQEILLRNQHKRFNLK

>gi|229269491|gb|ACQ51127.1| pXO1-135 (plasmid) [Bacillus anthracis str. A0248] MKAERIFEGTSADGIWNKAIFKVKEGSYYYISENSTIEITDNEYLNVEVLEEVHQGEDHNLFAVKGDERD MLQQLEMDGFNFEDVDWGDLNLLDNEIMSSTDAIEEWGIESSTLRKRIKDFPKGSIRKIGTTYAVTRFGM RCVFGSNRK

>gi|229269490|gb|ACQ51126.1| UDP-glucose 6-dehydrogenase (plasmid) [Bacillus anthracis str. A0248] MNISIVGTGYVGLVTGVCLSEVGHNVTCIDIDEEKVKKMKLGYSPIFEPCLEELMKNNIIKGRLHFTTNY VDGADGAEIFYIAVGTPQKEDGSADLSFIKQAAINIARTIKNDVIIVVKSTVPVGTNIYIKNLILKNLNY DVKVDIISNPEFLREGSAVNDTFYGDRIVIGSENKTSANVMEEVYKPFGTPIFKTDIQSAEMIKYASNTF LATKISFINGIANLCEMVGADVEKVAQGMGQDKRIGSQFLNAGIGYGGSCFPKDTHALVKVSESLQHKFH LLESVIKLNKNQQTVLIEKIKKRFGSIVGKKIALLGLSFKPNTDDLREAPSIPIARKLVEEGAQVIAYDP VAIKNAREVLPKEVHYVYSTMEALTEADIALVLTEWDEVVDSLLLKASQLMKEPVIFDGRNCFELNEAKN YDVEYHSIGRPSVLREVKGEITA

>gi|229269489|gb|ACQ51125.1| transposase X (plasmid) [Bacillus anthracis str. A0248] MIIIDFLEERNPILKTRELLTDIQRTFFYKIPSQMDERELIRYYTLSDEELQIVNQQRGDHNRLGFAIQI SYPRFPGRPLLAKEKIPQFLVNFLAKQIGVASWEVQNYARTRDTTRREHVNKIRNLYNLRTFTLREYREL ARWLLPLAMKTENGYLLVEALIQEMRKRKIILPAVYAIEHLAWSVRERARKRIFKYLTKGLSSYQYEQLD TLLYTHEEKKSSLLSWLRQPPGVIALKNFHEILDRLEFIQSLDLPLDNGKEIHQNRLIQMAREGSRYSIQ HLSRFNERKRYATIMAFLIHIYAFLTDQGLDMFAKLMGRMFNRGENKHNKLFQKDGKSINEKVRLYVEIG KALIEAKELETDPFEAVQSIISWEKFIHSVHEAESLARPIEFDYLDLLDNHYGHFRKMAPRLLNRYVFKA SSTSQTVLQALELLKEANQSGKRKIPDNVPTDFIKSKWMKYVYQDDKINRHYYEFSVFSELLNTLRSGDM WVKGSKQYKDFEEYLLPPHTWQHMKQTQQIPLEFTEDVEKYLNKRFEQLDIELSKVNCLIANQDLQGVTI YGDKIQVASLKKMVPEDVEEITRLVYDLLPRIKLTDLLVEVDTWTQFSKHFVHLHTQKAPKEKSILFAAI LSDGINLGLSKMADACPNISYSQLAWIADWYIRDESYTKALGDIANFHHKQPFSSYWGEGTTSSSDGQFF RAGGTSSPLAQVNGKYGHDPGLSFYTHISDQYHPFYVQVISSSEEAPHIIDGLLYHETDLQIEEHYTDAA GFVDHIFGMCHMLGFRFSPRIKTIDNHKIYTLSVPTIYDHINFMVGGTIQVKKIRENWDEILRLVSSVRQ GTVTASLILKKLSSYPRQNSLCIALREIGRIERSLYMLKWIQDPEHRRKVQVELNKGESKNSLARAVFFN QLGEIRDRSYEDQLHRASGLQLIISAIVTWNSIYISRAIETLRNNGIHIPEEYIQHISPLGWEHIALTGD YIWNLNQELNFENLRPLRKNRIKQK

>gi|229269487|gb|ACQ51123.1| hypothetical protein BAA_A0023 (plasmid) [Bacillus anthracis str. A0248] MKRNKICTVLFSVALMFSVVFGAFSFPTGASARVMDENSGTYNAPPKSSDTIKFSVTPGYGHLKVFIKNR GQSNLYVSLKHVASGKVYFENKEIRVGDPALEWKSNKEGFPQGVKAGDFELSFSSGGKAAYLDWAYKSAD IIWP

>gi|229269485|gb|ACQ51121.1| conserved hypothetical protein (plasmid) [Bacillus anthracis str. A0248] MDENKRNMLLSFIISILFIFTSLLPFSNNEYVYVISKIGAAAGVINIVMMSVFLYFQSKKTGHLLN

>gi|229269484|gb|ACQ51120.1| spore germination protein GerXA (plasmid) [Bacillus anthracis str. A0248] MKRTVEVNESILRVWFEGCKDVKIMNRKWCADTTTTTILLVYCQHVIDHTKLKQAIAPEMCNDLLQSSFK DSNLLASNSQFSVTTLELENSNENVSRMLFEGKLLIIFQEYKRGYTIDIAKLPTRSIEQSNTEMTIRGSR DGFVEELSTNIGLIRKRLKTSSLSYDEFIIGERTQTKVGLLYLKDVASQETISQVQFKLKEINIDGVVSS AQIEEFITGDQFSLFPLIEYTGRPDYAVNCLLHGRFILLVDGSPTATIAPVSFPFFVNTAEDQNYFYLFG SFVRLLSLFGIAISIFLPGFWVALVTYHPDQIPYTLLATLSLSREGIPFPAPLEGMIMITLFELLRQAGL RIPAAFGQTLSVVGGLIIGQAAISSGFVSPSMVVMIAISVVSTFTLVNQSFTGTLSILRYGVFLMSSFLG IVGFICSILLIVIHVANLRSFGLPFLAPYSPPVFSSMLPSTFRIPFTRMKKRPKELHTYDNTRQRTNNDE NK

>gi|229269482|gb|ACQ51118.1| DNA topoisomerase 1 (plasmid) [Bacillus anthracis str. A0248] MGKTLFIAEKPKVANEIMKSPRFRHSQKYIGSKPYYGYYENDHYIVSWCRGHLLELKNPEEMDPKYKLFQ LEHLPLIFQPSYKVIQENAEQLQILVKLLQRPDVDHAVNICDADREGELIYREVYEYAGVNKKQSRVYKS SFEAAELEAALNRLESASKYDGLAYSAKARQYLDYLLGMNITRGCTTKLAQNKFLLSSGRVQMCLLHEIR QRELAIENYREQSYYHLQLITDLGLKPVMKTEDQVLNPSPLKSLGENLKDQYLTVEDFKEGTRKQNPKLL YNLTDLYKDAHAQLQINAETAKKHIQNLYEEGFITYPRSSSRHLPTEQVDRVKGVMQALAKSRYSLLVQS VDIDAIDIKHKTFDDDLVSSHFAIIPTTKQYQEEGRPEIEKQLYSLVVKRFVGNFMRPAVYLVRDVSLID AMGNTYQIKESVLREKGFLEVFQEEVKEESVETFKVPILQKGQELQIYDFELQESKTKKPALHTESSILT FMETAGRKIDDEHLKELMKGKRIGTVATEAAFIPVLHEKNFIDIEKGKIITTPIGRAFIEQFPVQQIKDP LYTAEMEGMIHRIEKNEMSYENFIAQTNAFVQKITQEIIRIPDTVSYNLIETWKKQIEVCQCPCGNGIIL KGEFTAAIRLKTDLSICFEFPTTDDRTVGKCPLCQSRVIIGKTNYLCEQYKRGCDFIVSGMILEKRITAS QIKKLLEKNMTDTVKGFVSKKTGKSFDAKLTYDSTQKRVTFIYEKKK

>gi|229269481|gb|ACQ51117.1| pXO1-43 (plasmid) [Bacillus anthracis str. A0248] MKLRQATHAVFTVGVACLFLYGERWKTIMNLQFASTTNGGVLSAREAFTNPELVLKHMTEEGATFSVSLQ DIQIPGKLNYYWYDVIIETATNKKIEVNIAVLSQEEYNQLCERDSIDLYLKQTEDAAKFFLEKSKGLFQP EFNNRFTFHHKAQLYL

I just want to use a command to select the protein sequences having 100 or less than hundred amino acid and exclude them from the dataset and get a new dataset containing remaining proteins having >100 amino acid.

ADD REPLY
0
Entering edit mode

For an R solution, see https://stackoverflow.com/questions/8640377/remove-all-rows-where-length-of-string-is-more-than-n for a hint.

Your data is in a horrible, horrible format at the moment however. You've no easy way of demarcating the headers from the sequences as they're separated by a single whitespace and both your headers and sequences also have spaces in.

ADD REPLY
5
Entering edit mode
6.5 years ago

For fitlering sequences with minimum 100 aa/nt length i.e > 100,

$ seqkit seq -m 100 test.fa

For filtering sequences with maximum of 1000 aa/nt length<=1000

 $ seqkit seq -M 1000 test.fa

For filtering sequences between maxiumum size 200 and minimum size of 100:

 $ seqkit seq -m 100 -M 200 test.fa

Download seqkit from here

ADD COMMENT
0
Entering edit mode

I couldn't open seqkit in cmd, it couldn't read $ sign, how can i run seqkit in cmd? please show sequentially.....

ADD REPLY
1
Entering edit mode

Don't include the $ character. That character is there just to remind you that this is a command-line task.

ADD REPLY
0
Entering edit mode

I have named my file as test.fa and placed it in C\Users\User directory and yet I find this......

C:\Users\User>seqkit seq -m 100 test.fa [ERRO] fastx: open test.fa: The system cannot find the file specified. C:\Users\User>

What can I do now?

ADD REPLY
0
Entering edit mode

Install and use Cygwin or VirtualBox to install Linux on your Windows host, and use Linux to do your filtering.

ADD REPLY
0
Entering edit mode

what is OS version you are on and where is seqkit located (folder)? If you are on windows 10, you can use linux along with windows 10. (https://docs.microsoft.com/en-us/windows/wsl/install-win10). It is a long procedure. However it makes life easier. Once you install linux, try installing conda in linux. Conda/bioconda has most of the bioinformatics software and easy to install them.

ADD REPLY
1
Entering edit mode

For windows 7 and 10:

  1. seqkit installations are described here: https://bioinf.shenwei.me/seqkit/download/.
  2. Windows versions are named .tar.gz. Use a tool like IZarc to extract seqkit.exe (If you are on 64 bit OS, download 64 bit executable)
  3. Move seqkit.exe to C:\WINDOWS\system32 folder (as mentioned in the installation manual)
  4. Now go to the folder wherever the fasta file is present. Let us say Desktop.
  5. Hold shift button and right click on the same time in open area (white space or on freespace on desktop). Not on any file or directory.
  6. A right click menu appears with option: " Open powershell window here" (for windows 10) and "Open Command Window Here" for windows 7.. Click on it.
  7. A powershell/Command window appears and type seqkit.exe version. It should show seqkit version some thing like seqkit v0.8.0..
  8. If it doesn't show, then installation didn't go well. Try to read the installation manual for windows.
  9. Once it shows seqkit version, type dir in the terminal. It should show all the files. To list .fa files, type dir *.fa. It should show only .fa only. This should list your fasta file here.
  10. Now type in seqkit.exe seq -m 100 test.fa -o output.fa (replace test.fa with your file namem output.fa with output file name you want).
  11. Type again "dir". It should list new file.

For ubuntu:

  1. seqkit installations are described here: https://bioinf.shenwei.me/seqkit/download/.
    1. After installing seqkit, type seqkit version in the terminal.
  2. This would print the version of seqkit present on your system. If there are any errors, seqkit is either not in user path or there is installation problem. Refer to point 1.
  3. Once you have seqkit installed, navigate to the directory where your fasta file is located in the terminal. For eg. it is located on desktop and you are on linux/*nix/OS X, type "cd ~/Desktop".
  4. In ther terminal, type ls *.fa or ls.fasta or ls.fna depending on the the extension of fasta file. Seqkit can handle gzipped fasta file as well.
  5. You should see fasta file of interest here. Otherwise, you are in wrong directory. Refer to point 3.
  6. Once you see file of interest in your folder and you want to get all the sequences that are more than 100 amino acid length, type seqkit seq -m 100 test.fa -o output.fa (replace test.fa with your file namem output.fa with output file name you want).
  7. Repeat step 5. You should see output file in your current folder.
ADD REPLY
0
Entering edit mode

You will also need to install it...

ADD REPLY
0
Entering edit mode

how to direct the output to a file instead of showing millions of reads on the screen?

ADD REPLY
1
Entering edit mode

Use output redirect to send data to a new file like below.

$  seqkit seq -m 100 test.fa > new_file.fa
ADD REPLY
1
Entering edit mode
6.5 years ago

Here's an awk-based approach, if your FASTA records are single-lined (header on one line, sequence on the next line, and so on):

$ awk '!/^>/ { next } { getline seq } length(seq) > 100 { print $0 "\n" seq }' input.fa > output.fa

If multi-lined (header on one line, sequence split across two or more lines), then the input.fa would need preprocessing to make it single-lined. There are other answers on Biostars that explain how to do this.

Via: http://itrylinux.com/use-awk-to-filter-fasta-file-by-minimum-sequence-length/

ADD COMMENT
0
Entering edit mode

Explanation in that page is incorrect . Code can be shortened further, some thing like this:

$ awk '/^>/ { getline se } length(se) >100 { print $0 "\n" se }' test.fa

Example fasta: (copy/pasted from the page in the link):

 $ cat test.fa 
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

Ouptut from the code in linked page:

 $ awk '!/^>/ { next } { getline seq } length(seq) <37 { print $0 "\n" seq }' test.fa
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga

output from the in this post:

$ awk '/^>/ { getline se } length(se) < 37 { print $0 "\n" se }' test.fa
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga

Length of the sequences:

$ awk '!/^>/ { print length($NF) }' test.fa
36
41
ADD REPLY
0
Entering edit mode

Is it incorrect?

$ awk '!/^>/ { next } { getline seq } length(seq) > 35 { print $0 "\n" seq "\n" length(seq) }' /tmp/in.fa
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
36
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
41

Seems like the filter works:

$ awk '!/^>/ { next } { getline seq } length(seq) > 36 { print $0 "\n" seq "\n" length(seq) }' /tmp/in.fa
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
41

Am I missing something?

ADD REPLY
0
Entering edit mode

Code is correct and uses double negation. Explanation in the page is incorrect.

!/^>/ { next } { getline seq }
  • !/^>/ - get lines that do not start with > (i.e sequences in fasta).
  • { next } - get next lines of lines that do not start with > (i.e sequence headers)
  • getline - get next lines next to the lines that do not start with > (i.e sequences in fasta- back to step1)

Instead it may have been :

/^>/ { getline seq }
  • /^>/ - get lines with > (i.e headers)
  • get line - get lines next to lines start with > (i.e sequences)

Explanation in the page for !/^>/ {next} is that: (copy/pasted)

If a line (i.e. record) begins with a “>”, go to the next line (record).
ADD REPLY
0
Entering edit mode
6.5 years ago
Hugo ▴ 380

You can use SEDA (http://www.sing-group.org/seda/), which has an option to filter sequences by length (http://www.sing-group.org/seda/manual/operations.html#filtering) as well as other types of filtering.

Regards.

ADD COMMENT
1
Entering edit mode

Thank you very much hlfzeus!! I'm trying this....

ADD REPLY

Login before adding your answer.

Traffic: 1967 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6