Problem running CheckM
5.4 years ago
pablo ▴ 310

Hello, I have several files that contain protein FASTA sequences. Each file corresponds to a cluster of genes.

UniRef90_100.fasta
UniRef90_101.fasta
UniRef90_102.fasta
UniRef90_103.fasta
UniRef90_104.fasta
UniRef90_105.fasta
UniRef90_10.fasta
UniRef90_11.fasta
UniRef90_12.fasta
UniRef90_13.fasta

I want to determine the contamination of each cluster. For that, I want to run CheckM. I used checkm lineage_wf bins checkm, but it does not work. I get this error message: checkm: error: unrecognized arguments: followed by all my bin files.

My question is: are these files bins? Each file has the following structure:

>UniRef90_A0A1B2YXP8 - Cluster: Uncharacterized protein
MRILRNFLGLFLLTAFIFSCVDENESNADFVDTISEPTNISALVSISQDNTGLVTIIPTG
EGVVTFNVDYGDGSDISGSINPGNSTEHFYSEGTYEATIIGTALDGSTAQATVTVVVSFI
APENLVVDILTSSGSYNILVSASADYATSFEVLFGDEAGGDATPMQIGEQLSHSYELAGT
YNVTITALSGGAATTQYSEEITITDPPVFDGFSTFEDFEGEVPGNFSFGGVGNVQVVANP
DNSGINTSTSVMQCTKDQGAEVWGGMGFAVNGHINFNGNNVLRLKSYAPEVGKVVKVKLE
TSAGNVAGLTYEFDMVTTVANQWEILTYDFSGAPDLDYITAIVFYDFGNQNAGVYHFDDV
EVGIGEYIQGIENFEGDVPESFTFGGVGGVEVIPNPDPSGENITGNVLQFVKDEGAEVWG
GMGFAVDVIDFNGASQIHLKSYAPEAGKVVKVKLETSAGNVAGLTHEVDVTTTVANEWET
LIYDFTGAPDLEYVSFIVFYDFGNTVGATYRVDEIQLID
>UniRef90_A0A1B2YXU0 - Cluster: Uncharacterized protein
MKYKILFLSILILFSCNHDNEKLDAIIKEYQNHEGYNYEDYPLGNFSEEYFKAEKEFAES
LLLKLDDIDITKLDENDNISYELLSFVLNDIIAYYDFERFLNPLLSDSGFHSSLVYNVRP
MYNYEQVKNYLNKLNAIPQYVDQYLPLLRKGLEKGVSQPLVIFKGYESTYNDHITKDFES
NYFYSPFNKLPNDISEIQRDSIFVAAKNAIEKSVVPQFIRIKDFFEKEYYKKTRTTIGVS
QTPNGSEFYQNRINYYTTSESYTADEIHQIGLKEVARIKKEMIKIIDELKFKGSFEEFFK
FLRTDEQFYAKTPKELLMYARDISKRADEQLPRFFKTLPRKPYGVAPVPDAIAPKYTGGR
YVGTSKNSTDPGYYWVNTYDLKSRTLYTIPALTVHEAVPGHHLQSALNNELGDSIPRFRR
NLYLSAYGEGWGLYTEFLADEMGIYTTPYEKFGKFTYEMWRACRLVVDTGLHTKGWSKEK
AIDYMSKNTALSLHEVNTEIDRYISWPGQALSYKIGELKIRELRNKAKDQLNDKFDIREF
HEKILEYGTVTLPTLERRINNYIEKKNE
checkm contamination fasta • 4.1k views

It would help to provide a link to the package this program belongs to. Have you checked the in-line help to see if that offers any assistance on what the minimal usage needs to look like?


I checked the documentation for this package and I think I did it correctly. My files are in the right format and the command line I used looks right.

5.4 years ago
Asaf 10k

If the input is protein you should use -g

Wait, I suspect you wrote bins/*, is that right? Can you post the full command line?
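For illustration, here is a minimal sketch of why passing bins/* instead of bins could produce that error, assuming that is what happened. The shell expands the glob before the program starts, so the program receives one argument per file rather than a single directory. The count_args helper and the temporary directory are hypothetical and exist only for the demonstration.

```shell
# Hypothetical helper: print how many arguments it received.
count_args() {
    echo "$#"
}

# Build a throwaway directory with three empty files named like the clusters.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/bins"
touch "$demo_dir/bins/UniRef90_10.fasta" \
      "$demo_dir/bins/UniRef90_11.fasta" \
      "$demo_dir/bins/UniRef90_12.fasta"

# Unquoted glob: the shell expands it to one argument per file.
glob_count=$(count_args "$demo_dir"/bins/*)

# Plain directory name: a single argument, which is what lineage_wf
# expects for its bin folder.
dir_count=$(count_args "$demo_dir/bins")

echo "bins/* -> $glob_count arguments; bins -> $dir_count argument"
rm -rf "$demo_dir"
```

With the three demo files this reports 3 arguments for the glob and 1 for the directory name, so every extra file would show up as an unrecognized argument.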


The -g option worked. Thanks.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id                  Marker lineage           # genomes   # markers   # marker sets    0     1    2    3    4    5+   Completeness   Contamination   Strain heterogeneity  
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  UniRef90_1           k__Bacteria (UID203)           5449        104            58         4     17   19   13   7    44      95.34           283.45              6.85          
  UniRef90_19          k__Bacteria (UID203)           5449        104            58         7     20   20   12   12   33      91.22           255.84             11.38          
  UniRef90_14         k__Bacteria (UID2570)           433         267           178         61   128   56   13   8    1       79.42           42.78              17.65          
  UniRef90_12          k__Bacteria (UID203)           5449         99            53         24    24   16   7    4    24      69.27           141.02             21.72          
  UniRef90_24          k__Bacteria (UID203)           5449        104            58         23    20   29   10   8    14      67.63           103.59             21.25          
  UniRef90_22          k__Bacteria (UID203)           5449         99            53         35    24   9    6    2    23      62.49

I got this kind of output. It looks bad, doesn't it?


Pretty bad, yeah. Each bin looks like a mix of a few bacteria.


I only showed you the head of the output. I also got some other lineages:

  UniRef90_65      p__Proteobacteria (UID3880)        1495        261           164        188    60   13   0    0    0       27.85            5.12              38.46
  UniRef90_71        p__Euryarchaeota (UID3)          148         188           125        132    46   10   0    0    0       26.89            4.25              20.00          
  UniRef90_73     f__Rhodobacteraceae (UID3356)        67         615           329        451   164   0    0    0    0       26.65            0.00               0.00          
  UniRef90_7         p__Euryarchaeota (UID3)          148         188           125        133    37   15   2    1    0       24.44            9.74              14.81          
  UniRef90_16          k__Bacteria (UID203)           5449        104            58         69    12   6    5    0    12      24.39           29.06              35.29          
  UniRef90_67      p__Proteobacteria (UID3880)        1495        261           164        195    61   5    0    0    0       24.39

Those are about a quarter of a genome each. Usually people require completeness > 70-80% and contamination < 20%. You might get good bins with 0% completeness, so watch for those too.
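As a rough sketch of how those cut-offs could be applied to the table, the snippet below filters rows with awk. The filter_bins helper is hypothetical, and it assumes that complete data rows have exactly 15 whitespace-separated fields, with completeness and contamination as the third-to-last and second-to-last columns. The three rows are copied from the output posted above.

```shell
# Hypothetical helper: read a CheckM-style table on stdin and print the IDs of
# bins with completeness > $1 and contamination < $2.
# Assumes complete data rows have exactly 15 whitespace-separated fields
# (the lineage, e.g. "k__Bacteria (UID203)", counts as two), so header,
# separator and truncated rows are skipped.
filter_bins() {
    awk -v min_comp="$1" -v max_cont="$2" '
        NF == 15 && $(NF-2) + 0 > min_comp && $(NF-1) + 0 < max_cont { print $1 }
    '
}

# Three rows copied from the output posted in this thread.
rows='  UniRef90_1           k__Bacteria (UID203)           5449        104            58         4     17   19   13   7    44      95.34           283.45              6.85
  UniRef90_14         k__Bacteria (UID2570)           433         267           178         61   128   56   13   8    1       79.42           42.78              17.65
  UniRef90_65      p__Proteobacteria (UID3880)        1495        261           164        188    60   13   0    0    0       27.85            5.12              38.46'

# Usual cut-offs: completeness > 70, contamination < 20.
passing=$(printf '%s\n' "$rows" | filter_bins 70 20)
echo "passing bins: ${passing:-none}"
```

None of these three rows passes the 70/20 cut-offs. Loosening the contamination limit to 50 would let UniRef90_14 through, which shows how sensitive the selection is to the thresholds.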


You meant "0% contamination", no?


No, 0% completeness: CheckM can't find the proteins it's looking for, but other than that the assembly looks good in terms of size and N50.


OK, I got it. Is there a file where the output is stored? I can't find it.


Hello, do you know how to interpret the # genomes, # markers and # marker sets columns?


I think you should open a new question if you're still struggling, with some more background


Hi! Can you post the command that you ran in the bash terminal? :)

5.4 years ago
vin.darb ▴ 300

According to the README:

By default, CheckM assumes genomes consist of contigs/scaffolds in nucleotide space and that the files to process end with the extension fna.

Example Usage

Assume you have putative genomes in the directory /home/donovan/bins with fa as the file extension and want to store the CheckM results in /home/donovan/checkm. To process these genomes with 8 threads, simply run:

checkm lineage_wf -t 8 -x fa /home/donovan/bins /home/donovan/checkm

Did you try specifying -x fasta?


Yes, I tried, but I always get the same result. I don't understand why.


I only needed to add the -g option to work with protein sequences.
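Putting the thread together, here is a minimal sketch of an invocation that matches this setup: protein FASTA files with the .fasta extension in a directory named bins, and results stored in a directory named checkm. The -g and -x flags and the two positional directories come from the answers in this thread. The thread count and the output file name are assumptions, and the -f and --tab_table options for writing the summary table to a file are CheckM v1 options to my knowledge, so check checkm lineage_wf -h before relying on them. The script only prints the command so it can be reviewed first.

```shell
# Hypothetical names: adjust to your own layout.
bin_dir=bins            # directory holding the UniRef90_*.fasta files
out_dir=checkm          # directory where CheckM stores its results
table=checkm_table.tsv  # assumed file name for the summary table

# Print the command instead of running it, so it can be reviewed first.
#   -g               input files are proteins, not nucleotide contigs
#   -x fasta         file extension of the bins (the default is fna)
#   -t 8             threads, as in the README example
#   --tab_table -f   write the summary as a tab-separated file (assumed options)
build_checkm_cmd() {
    echo "checkm lineage_wf -g -x fasta -t 8 --tab_table -f $table $bin_dir $out_dir"
}

cmd=$(build_checkm_cmd)
echo "$cmd"
# To run it, paste the printed line into the terminal.
```

Note that the bin directory is passed as a plain name, not as bins/*, so CheckM receives a single directory argument.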
