Question

How do I find out the adaptor sequences for SRA data?

1

5.5 years ago

MAPK ★ 2.1k

I would like to analyze some small RNA data from NCBI (e.g. https://www.ncbi.nlm.nih.gov/sra/SRR5593145), but I am not sure where to find the adaptor sequences for trimming. Can anyone please suggest where to look?

adaptor sra • 4.9k views

ADD COMMENT • link updated 5.2 years ago by hermidalc ▴ 60 • written 5.5 years ago by MAPK ★ 2.1k

2

See these related threads:
How Can I Tell What Is The Adapter Used In A Sequence Read Archive (Sra) Sample?
Identify adapter sequences for trimming from Illumina paired end fastq files

ADD REPLY • link 5.5 years ago by GenoMax 147k

0

Thank you! Not sure if I can use BBMap if the data are single-end, though.

ADD REPLY • link 5.5 years ago by MAPK ★ 2.1k

1

The TruSeq Small RNA kit sequences (the kit named on the SRA page you linked) should be in Illumina's adapter sequences document.
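
Once you have a candidate adapter from the kit documentation, it is worth checking that it really occurs in the reads before trimming. Below is a minimal Python sketch of such a check. The sequence in `CANDIDATE_ADAPTER` is an assumption (the 3' adapter commonly listed for the TruSeq Small RNA kit, so verify it against the vendor document), and the function names, the `seed_len` parameter and the toy records are hypothetical, made up for illustration.

```python
from typing import Iterable, Iterator

# Assumption: 3' adapter commonly listed for the TruSeq Small RNA kit.
# Confirm it against the vendor's adapter sequences document before use.
CANDIDATE_ADAPTER = "TGGAATTCTCGGGTGCCAAGG"


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (2nd of every 4 lines) from FASTQ text."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def adapter_fraction(seqs: Iterable[str],
                     adapter: str = CANDIDATE_ADAPTER,
                     seed_len: int = 12) -> float:
    """Fraction of reads containing the first `seed_len` bases of `adapter`.

    Exact matching only, so this is a rough sanity check, not a detector.
    """
    seed = adapter[:seed_len]
    total = hits = 0
    for seq in seqs:
        total += 1
        if seed in seq:
            hits += 1
    return hits / total if total else 0.0


# Toy FASTQ records, made up for illustration (not taken from SRR5593145).
toy_fastq = [
    "@read1",
    "TAGCTTATCAGACTGATGTTGA" + CANDIDATE_ADAPTER + "AACTC",
    "+",
    "I" * 48,
    "@read2",
    "ACGTACGTACGTACGTACGT",
    "+",
    "I" * 20,
]

print(adapter_fraction(fastq_sequences(toy_fastq)))
```

For a real file you could pass an open file handle to `fastq_sequences` instead of the toy list. In a small RNA library with short inserts, a correct adapter should show up in a large share of the reads; a fraction near zero suggests the wrong adapter or already-trimmed data.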

ADD REPLY • link 5.5 years ago by GenoMax 147k

0

Hi, apologies if I missed the answer somewhere on Biostars. So I take it that the --clip option in fastq-dump isn't trimming the adapters? Or that it cannot be completely trusted?

ADD REPLY • link 5.2 years ago by hermidalc ▴ 60

0

I have never heard of that option, nor of anyone using it for adapter trimming. By default NCBI does not store any information on the adapter sequence, so the tool would not know what to look for, and I would not put any trust in this option. If you do not know the sequence, run FastQC to check for adapters and then remove them with specialized software such as Trimmomatic, cutadapt or bbduk.sh. Which adapter was used depends on the library prep kit.
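
To make clear what "removing the adapter" means for a single-end small RNA read, here is a minimal Python sketch of 3' adapter trimming. It is only an illustration under stated assumptions: it uses exact matching with no mismatch tolerance, so it is not a substitute for the dedicated trimming tools. The `ADAPTER` value is an assumption (the 3' adapter commonly listed for the TruSeq Small RNA kit), and `trim_3prime_adapter` and `min_overlap` are hypothetical names.

```python
from typing import Tuple

# Assumption: 3' adapter commonly listed for the TruSeq Small RNA kit.
# Replace it with the adapter that matches the actual library prep.
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"


def trim_3prime_adapter(seq: str, qual: str,
                        adapter: str = ADAPTER,
                        min_overlap: int = 5) -> Tuple[str, str]:
    """Cut a read (and its quality string) at the start of the 3' adapter.

    First looks for the full adapter anywhere in the read. If it is absent,
    looks for a partial adapter (a prefix of at least `min_overlap` bases)
    sitting at the very end of the read. Exact matches only.
    """
    pos = seq.find(adapter)
    if pos == -1:
        # Try the longest possible partial adapter first.
        longest = min(len(adapter) - 1, len(seq))
        for k in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                pos = len(seq) - k
                break
    if pos == -1:
        return seq, qual  # no adapter found, read left untouched
    return seq[:pos], qual[:pos]


# Toy read, made up for illustration: 22-nt insert + adapter + extra bases.
insert = "TAGCTTATCAGACTGATGTTGA"
read = insert + ADAPTER + "AACTC"
trimmed_seq, trimmed_qual = trim_3prime_adapter(read, "I" * len(read))
print(trimmed_seq, len(trimmed_qual))
```

The dedicated tools additionally tolerate sequencing errors in the adapter, apply minimum-length filters and handle quality trimming, which is why they are the better choice for real data.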

ADD REPLY • link 5.2 years ago by ATpoint 85k

0

Thank you, ATpoint, for the recommendations. Submitters to SRA do generally give information on the library construction protocol, such as the RNA prep kit they used. The SRA Toolkit is honestly quite confusing, and I also wonder whether ENA removes these adapters. See --clip in the list of options below:

$ fastq-dump -h

Usage:
  fastq-dump [options] <path> [<path>...]
  fastq-dump [options] <accession>

INPUT
  -A|--accession <accession>       Replaces accession derived from <path> in 
                                   filename(s) and deflines (only for single 
                                   table dump) 
  --table <table-name>             Table name within cSRA object, default is 
                                   "SEQUENCE" 

PROCESSING

Read Splitting                     Sequence data may be used in raw form or
                                     split into individual reads
  --split-spot                     Split spots into individual reads 

Full Spot Filters                  Applied to the full spot independently
                                     of --split-spot
  -N|--minSpotId <rowid>           Minimum spot id 
  -X|--maxSpotId <rowid>           Maximum spot id 
  --spot-groups <[list]>           Filter by SPOT_GROUP (member): name[,...] 
  -W|--clip                        Remove adapter sequences from reads

Common Filters                     Applied to spots when --split-spot is not
                                     set, otherwise - to individual reads
  -M|--minReadLen <len>            Filter by sequence length >= <len> 
  -R|--read-filter <[filter]>      Split into files by READ_FILTER value 
                                   optionally filter by value: 
                                   pass|reject|criteria|redacted 
  -E|--qual-filter                 Filter used in early 1000 Genomes data: no 
                                   sequences starting or ending with >= 10N 
  --qual-filter-1                  Filter used in current 1000 Genomes data 

Filters based on alignments        Filters are active when alignment
                                     data are present
  --aligned                        Dump only aligned sequences 
  --unaligned                      Dump only unaligned sequences 
  --aligned-region <name[:from-to]>  Filter by position on genome. Name can 
                                   either be accession.version (ex: 
                                   NC_000001.10) or file specific name (ex: 
                                   "chr1" or "1"). "from" and "to" are 1-based 
                                   coordinates 
  --matepair-distance <from-to|unknown>  Filter by distance between matepairs. 
                                   Use "unknown" to find matepairs split 
                                   between the references. Use from-to to limit 
                                   matepair distance on the same reference 

Filters for individual reads       Applied only with --split-spot set
  --skip-technical                 Dump only biological reads 

OUTPUT
  -O|--outdir <path>               Output directory, default is working 
                                   directory '.' ) 
  -Z|--stdout                      Output to stdout, all split data become 
                                   joined into single stream 
  --gzip                           Compress output using gzip: deprecated, not 
                                   recommended 
  --bzip2                          Compress output using bzip2: deprecated, 
                                   not recommended 

... more options sections ...

ADD REPLY • link 5.2 years ago by hermidalc ▴ 60

1

ENA mirrors NCBI; they don't change the data. You will always have to trim adapters yourself, using any of the tools I suggested (or others). It is true that the methods text may contain information on the library prep, but this is just free text; there is no field for entering an adapter sequence. NCBI will always store (at least I have never seen anything else) the raw sequencing data as they came from the sequencer (at least this is what submitters should upload), because everyone should be free to use whatever adapter-removal strategy (or general data manipulation pipeline) they want.

ADD REPLY • link 5.2 years ago by ATpoint 85k