SRA file format conversion

Original poster   Posted: 2018-05-23 00:00   Replies: 4   Followers: 140

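Because of storage constraints, NCBI recently converted its data to the *.sra format and no longer provides *.fastq.gz, so a dedicated conversion tool is needed to convert downloaded *.sra data files.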

Download link:

http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software

The page provides the programs for different operating systems as well as the source code.

Conversion command

$ fastq-dump -A <SRR_accession> -D <Path_to_SRR_Directory> -O <Output_Path>
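
For scripting, the command above could be assembled as an argument list before it is handed to a process runner. What follows is a minimal Python sketch, not part of the original post: the helper name `build_dump_command` and the example paths are hypothetical, and only the `-A`, `-D`, and `-O` flags are taken from the command shown above.

```python
import shlex


def build_dump_command(accession, table_path, outdir, tool="fastq-dump"):
    """Return the argument list for the conversion command shown above.

    Hypothetical helper: only the -A, -D and -O flags come from the post.
    """
    if not accession:
        raise ValueError("an SRR accession is required")
    return [tool, "-A", accession, "-D", table_path, "-O", outdir]


# Example with made-up paths; nothing is executed here.
cmd = build_dump_command("SRR000001", "/data/sra/SRR000001", "/tmp/fastq")
print(shlex.join(cmd))
```

To actually run the conversion, the list could be passed to `subprocess.run(cmd, check=True)`, assuming the toolkit is installed and on the PATH. Passing `tool="sff-dump"` builds the analogous SFF command, which this post shows with the same three flags.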

Basic command options

Command: Description
‘-A’ or ‘--accession’: Enables modification of the output name used for the fastq files. For example:
    fastq-dump -A foo SRR000001
will produce files named ‘foo.fastq’, ‘foo_1.fastq’, and ‘foo_2.fastq’.
‘-D’ or ‘--table-path’: Specifies the archive path explicitly, which prevents ambiguity when more than one option is specified. These two commands produce the same files:
    fastq-dump ~/SRR000001
    fastq-dump -D ~/SRR000001
However, the first command below will fail while the second will succeed:
    fastq-dump -C ~/SRR000001
    fastq-dump -C -D ~/SRR000001
(The ‘-C’ option is explained further below.)
‘-N’ or ‘--minSpotId’: Minimum spot number at which to start the dump process.
‘-X’ or ‘--maxSpotId’: Maximum spot number at which to stop the dump process.
For example:
    fastq-dump -N 5 -X 10 SRR000001
This command dumps six spots, starting at spot ‘SRR000001.5’ and ending at spot ‘SRR000001.10’. Because spots can be filtered out, the output may contain fewer than (maxSpotId - minSpotId + 1) spots.
‘-G’ or ‘--spot-group’: Boolean option that results in fastq files divided into spot groups as defined in the Experiment (or eventually Run) xml. This command:
    fastq-dump -G SRR051894
produces these five fragment files:
    SRR051894.fastq
    SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CB028-01WG.fastq
    SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CB036-01WG.fastq
    SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CD021-01WG.fastq
    SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CD036-01WG.fastq
‘-T’ or ‘--group-in-dirs’: Boolean option directing the utility to produce fastq files in sub-directories rather than producing files within the same directory.
‘-O’ or ‘--outdir’: Indicates the directory where the fastq result should be placed.
For example:
    fastq-dump -O /tmp -T SRR000001
will create a directory, SRR000001, in /tmp with this tree structure:
    > tree /tmp/SRR000001
    /tmp/SRR000001
    |-- 1
    |   `-- fastq
    |-- 2
    |   `-- fastq
    `-- fastq
‘-K’ or ‘--keep-empty-files’: Has no effect. At one time this option kept all three possible files even if one or two were empty.
‘-M’ or ‘--minReadLen’: Allows specification of the desired minimum read length to output (default is 25). The command ‘fastq-dump -M 0 SRR000001’ prevents any filtering based on read length.
‘-W’ or ‘--noclip’: Prevents clipping of a spot sequence based on the right clip information. Toggling ‘show-clipped’ in the ‘customize’ area for reads in the SRA Run Browser shows the effect of this option (e.g. see SRR000001).
‘-F’ or ‘--origfmt’: Results in fastq containing only the original identifier on the defline (i.e. no length or SRR identifier is present).
‘-C’ or ‘--dumpcs’: Forces color space sequence to be dumped instead of base space. If the optional ‘cskey’ is provided (i.e. A, C, T, or G), then all fastq files produced will use that key at the start of each color space sequence.
‘-B’ or ‘--dumpbase’: Forces base space sequence to be dumped instead of color space.
‘-Q’ or ‘--offset’: Sets the offset that is added to each quality value to produce its character in the fastq output. For example, an offset of 64 makes ‘@’ (ASCII 64) the offset character.
‘-I’ or ‘--readids’: Appends a read index to the run identifier, starting with ‘1’ as the first index. Note that this differs from the spot descriptor in the Experiment xml, where the read indices start with ‘0’. In the case of SRR000001, the first spot in each file would have the identifiers ‘SRR000001.5.4’, ‘SRR000001.1.2’, and ‘SRR000001.1.4’. Note that the first spot sequence in SRR000001.fastq, the fragment file, comes from the second biological/application read, which has an index of ‘4’.
‘-E’ or ‘--no_qual_filter’: Turns off quality filtering based on leading/trailing low quality values. As reads have become longer, this option has become a more viable alternative.
‘-SF’ or ‘--complete’: Outputs the separated reads into a single file. For example, the command:
    fastq-dump -SF SRR029338
results in the first eight lines of the file, SRR029338.fastq, containing:
    @SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36
    GGTTGAGTAAAGTGTCTAAAGGCATAGCCTGATTAT
    +SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36
    IIIIIIIIIIIIIIIIIIIAIIA<I8I+7I9+II2I
    @SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36
    AAAGTCAAATTTGAATTGTTGTCAGCTTGTCAAAAT
    +SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36
    IIIIIIIIDIIIIIIIIIIIII.1F2II=8*2+//I
In the case of 454 pair submissions, the second technical read (i.e. linker) is included in this single output file.
‘-DB’ or ‘--defline-seq’: Allows specification of the sequence defline format. For example:
    -DB "@$ac.$si $sn length=$rl"
This specification produces the same output as the default output. See Appendix D for a more in-depth explanation. Note that submission of a ‘fastq-dump’ command to a compute farm (e.g. Sun Grid Engine) can require preceding a number of the characters with backslash characters when using this option. The above example might require this version:
    -DB "@\\\$ac.\\\$si \\\$sn length=\\\$rl"
‘-DQ’ or ‘--defline-qual’: Allows specification of the quality defline format. For example:
    -DQ "+$ac.$si $sn length=$rl"
‘-alt [n]’: Provides alternative output formats without having to indicate the individual options. Alternate ‘1’, the only option, results in this format for SRR029338_1.fastq:
    @SRR029338.1 080115_EAS112_0034:8:1:615:780/1
    GGTTGAGTAAAGTGTCTAAAGGCATAGCCTGATTAT
    +
    IIIIIIIIIIIIIIIIIIIAIIA<I8I+7I9+II2I
And this format for SRR029338_2.fastq:
    @SRR029338.1 080115_EAS112_0034:8:1:615:780/2
    AAAGTCAAATTTGAATTGTTGTCAGCTTGTCAAAAT
    +
    IIIIIIIIDIIIIIIIIIIIII.1F2II=8*2+//I
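
The ‘-Q’ (‘--offset’) option in the table above is easier to follow with the encoding written out: each quality character is the quality value plus the offset, converted to an ASCII character. The sketch below is illustrative Python, not taken from the original documentation. The function names are hypothetical, and it assumes that the quality strings in the examples above use an offset of 33, which the post does not state.

```python
def encode_quality(scores, offset=33):
    """Turn a list of quality values into a FASTQ quality string."""
    return "".join(chr(score + offset) for score in scores)


def decode_quality(quality, offset=33):
    """Turn a FASTQ quality string back into a list of quality values."""
    scores = [ord(char) - offset for char in quality]
    if any(score < 0 for score in scores):
        raise ValueError("a character lies below the offset; wrong offset?")
    return scores


# With an offset of 64, a quality value of 0 is written as '@'.
print(encode_quality([0], offset=64))  # prints @

# Excerpt of a quality line from the examples above, ASSUMING offset 33.
print(decode_quality("A<I8I+7"))  # prints [32, 27, 40, 23, 40, 10, 22]
```

Decoding the same excerpt with an offset of 64 raises the `ValueError`, because ‘+’ (ASCII 43) lies below 64. That is a quick way to notice that a file was read with the wrong offset.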

Converting *.sra files to SFF format

$ sff-dump -A <SRR_accession> -D <Path_to_SRR_Directory> -O <Output_Path>

Options:

Command: Description
-O: Allows user to specify an output directory. If not used, output will default to the current directory.
-N: Minimum spot ID to output. The first spot in the output will be the number given for this option.
-X: Maximum spot ID to output. The last spot in the output will be the number given. Min and Max spot options can be combined to output subsections of an SRR.
-G: spotgroup-file. Split into files by SPOT_GROUP.
-T: spotgroup-dir. Split into subdirectories (of -O) by SPOT_GROUP.
-L: Log level: 0-13 or fatal|sys|int|err|warn|info|debug[1-10] (default: info). Set to ‘4’ to mimic the unix standard of no messages for a successful operation.
-H: Prints this help message and version information.
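
Since -N and -X are inclusive bounds (the fastq-dump example with -N 5 -X 10 yields six spots), a run can be split into consecutive subsections by computing the bound pairs up front. This is a small Python sketch under stated assumptions: the function names are hypothetical, spot IDs are assumed to start at 1, and the total number of spots has to be known from elsewhere.

```python
def spot_count(min_spot, max_spot):
    """Number of spots selected by an inclusive -N/-X pair, before filtering."""
    if min_spot < 1 or max_spot < min_spot:
        raise ValueError("expected 1 <= min_spot <= max_spot")
    return max_spot - min_spot + 1


def spot_ranges(total_spots, chunk_size):
    """Yield inclusive (min_spot, max_spot) pairs covering 1..total_spots."""
    if total_spots < 0 or chunk_size < 1:
        raise ValueError("expected total_spots >= 0 and chunk_size >= 1")
    for start in range(1, total_spots + 1, chunk_size):
        yield start, min(start + chunk_size - 1, total_spots)


print(spot_count(5, 10))            # prints 6
print(list(spot_ranges(10, 4)))     # prints [(1, 4), (5, 8), (9, 10)]
```

Each pair can then be passed as the -N and -X values of a separate dump command.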

Converting *.sra files to Illumina native format

$ illumina-dump [options] -path <directory_containing_the_accession> <acces

Command: Description
-D, --table-path: Path to accession data.
-O, --outdir: Output directory. Default: '.'
-N, --minSpotId: Minimum spot id to output.
-X, --maxSpotId: Maximum spot id to output.
-G, --spot-group: Split into files by SPOT_GROUP (member).
-T, --group-in-dirs: Split into subdirectories instead of files.
-K, --keep-empty-files: Do not delete empty files.
-L, --log-level: Logging level: 0-13 or fatal|sys|int|err|warn|info|debug[1-10]. Default: info
-H, --help: Prints this message.

Format options:

Command: Description
-r, --read: Output READ: "seq". Default: on
-q, --qual1: Output QUALITY, into single (1) or multiple (2) files: "qcal". Default: 1
-p, --qual4: Output full QUALITY: "prb". Default: off
-i, --intensity: Output INTENSITY, if present: "int". Default: off
-n, --noise: Output NOISE, if present: "nse". Default: off
-s, --signal: Output SIGNAL, if present: "sig2". Default: off
-qseq: Output QSEQ format: "qseq". Default: off

  • ***2lotus 2012-06-05 22:50
    #1

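    My guess is that SRA is more compatible with data from the various high-throughput sequencing platforms. For example, raw data coming off a 454 machine is not in fastq format.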

  • ***2IMP1990 2014-06-02 12:00
    #2

    Could someone help me convert an SRA file? I'm a newbie and have already, sadly, spent all of Children's Day on this.

  • ***1怪羊基德 2014-06-02 01:40
    #3

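    What format do you want to convert the SRA file to? If you want fastq, just run fastq-dump SRA_ID. Newer versions of fastq-dump do not require you to download the SRA file first.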

  • ***1Fluone 2014-11-25 13:24
    #4

    How do you do the conversion on Windows?