SAM/BAM格式中利用FLAG值部分进行统计
前几天在生物之窗提了一个关于利用SamTools统计reads问题,链接地址:http://home.plob.org/question/63
结合PLOB上几位网友的建议和自己收集的资料,在这里与大家分享一下解决方法。
The output from short read aligners like Bowtie and BWA is commonly stored in SAM/BAM format. When presented with one of these files a common first task is to calculate the total number of alignments (reads) captured in the file. In this post I show some examples for finding the total number of reads using samtools and directly from Java code. For the examples below, I use the HG00173.chrom11 BAM file from the 1000 genomes project which can be downloaded here.
First, we look at using the samtools command directly. One way to get the total number of alignments is to simply dump the entire SAM file and tell samtools to count instead of print (-c option):
##这里只是统计有多少比对记录,如果一个reads出现了多个比对结果,也没有进行去重复
$ samtools view -c HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
5218322
If we’re only interested in counting the total number of mapped reads we can add the -F 4 flag. Alternativley, we can count only the unmapped reads with -f 4:
# Mapped reads only
$ samtools view -c -F 4 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
5068340
# Unmapped reads only
$ samtools view -c -f 4 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
149982
To understand how this works we first need to inspect the SAM format. The SAM format includes a bitwise FLAG field described here. The -f/-F options to the samtools command allow us to query based on the presense/absence of bits in the FLAG field. So -f 4 only output alignments that are unmapped (flag 0×0004 is set) and -F 4 only output alignments that are not unmapped (i.e. flag 0×0004 is not set), hence these would only include mapped alignments.
An example for paired end reads you could do the following. To count the number of reads having both itself and it’s mate mapped:
$ samtools view -c -f 1 -F 12 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
4906035
The -f 1 switch only includes reads that are paired in sequencing and -F 12 only includes reads that are not unmapped (flag 0×0004 is not set) and where the mate is not unmapped (flag 0×0008 is not set). Here we add 0x0004 + 0x0008 = 12 and use the -F (bits not set), meaning you want to include all reads where neither flag 0×0004 or 0×0008 is set. For help understanding the values for the SAM FLAG field there’s a handy web tool here.
There’s also a nice command included in samtools called flagstat which computes various summary statistics. However, I wasn’t able to find much documentation describing the output and it’s not mentioned anywhere in the man page. This post examines the C code for the flagstat command which provides some insight into the output.
$ samtools flagstat HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
5218322 + 0 in total ##等于samtools view -c *.bam (QC-passed reads + QC-failed reads)
273531 + 0 duplicates
5068340 + 0 mapped ##等于samtools view -c -F 4 *.bam (97.13%:-nan%)
5205999 + 0 paired in sequencing
2603248 + 0 read1
2602751 + 0 read2
4881994 + 0 properly paired (93.78%:-nan%)
4906035 + 0 with itself and mate mapped ##等于samtools view -c -f 1 -F 12 *.bam
149982 + 0 singletons ##等于samtools view -c -f 4 *.bam(2.88%:-nan%)
19869 + 0 with mate mapped to a different chr
15271 + 0 with mate mapped to a different chr (mapQ>=5)
资料来源:
相关推荐:
- NCBI在线BLAST使用方法与结果详解 2938
- 神经网络术语:Epoch、Batch Size和迭代 527
- Consed的安装与使用教程 465
- 陈连福的NGS生物信息学培训教材V2.1 277
- WGCNA分析使用教程 272
最新创建圈子
-
原料药研发及国内外注册申报
2019-01-25 10:41圈主:caolianhui 帖子:33 -
制药工程交流
2019-01-25 10:40圈主:polysciences 帖子:30 -
健康管理
2019-01-25 10:40圈主:neuromics 帖子:20 -
发酵技术
2019-01-25 10:39圈主:fitzgerald 帖子:17 -
医学肿瘤学临床试验
2019-01-25 10:39圈主:bma 帖子:58