This RNA-seq aligner is fast: a little over ten minutes per sample

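Earlier we evaluated the resources needed to build the STAR genome index. Now we look at how much compute and time the read alignment itself requires.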

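Downloading the raw sequencing data

First download the raw sequencing data for sample SRR1039517: about 34 million paired-end reads of 63 bases each, roughly 4.3 Gb of bases in total. For details see "NGS基础:测序原始数据批量下载" (NGS Basics: Batch Download of Raw Sequencing Data).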

fastq-dump -v --split-3 --gzip SRR1039517
seqkit stats SRR1039517_1.

file           num_seqs    sum_len        min_len  avg_len  max_len
SRR1039517_1.  34,298,260  2,160,790,380  63       63       63
SRR1039517_1.  34,298,260  2,160,790,380  63       63       63

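Splitting the reads into small files to simulate different sequencing depths

Split all reads into small raw-data files of at most 200,000 reads each, then rename the files to strip the zero padding from the part numbers. This mainly makes the batch processing that follows easier.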

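seqkit split2 -s 200000 -1 SRR1039517_1. -2 SRR1039517_2.

#!/usr/bin/awk -f
# Merge the two-line read/base statistics (first input) with the
# resource figures parsed from a GNU `time -v` style log (second input).
# ARGIND is a gawk extension, so this script needs gawk.

# Convert "h:mm:ss" or "m:ss" to minutes.
function to_minutes(time_str) {
  a = split(time_str, array1, ":");
  minutes = 0;
  count = 1;
  for (i = a; i >= 1; --i) {
    minutes += array1[count] * 60 ^ (i - 2);
    count += 1;
  }
  return minutes;
}

BEGIN { OFS = "\t"; FS = ": "; }

# First input: header line, then the data line (nReads, nBases).
ARGIND == 1 { if (FNR == 1) header = $0; else datasize = $0; }

# Second input: the timing log.
ARGIND == 2 {
  if (FNR == 1 && outputHeader == 1)
    print "Time_cost\tMemory_cost\tnCPU", header;
  if ($1 ~ /Elapsed/) {
    time_cost = to_minutes($2);
  } else if ($1 ~ /Maximum resident set size/) {
    memory_cost = $2 / 10^6;   # kbytes -> GB
  } else if ($1 ~ /CPU/) {
    cpu = ($2 + 0) / 100;      # percent of CPU -> number of CPUs
  }
}

END { print time_cost, memory_cost, cpu, datasize }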
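The parsing step can be tried on its own. What follows is a minimal sketch rather than the script used for the benchmark: it assumes the log was written by GNU `time -v`, the helper name `parse_time_log` and the sample log values are invented for illustration, and it sticks to portable awk instead of gawk's `ARGIND`.

```shell
#!/bin/sh
# Sketch: pull elapsed minutes, peak memory (GB) and CPUs used out of a
# GNU `time -v` style log. parse_time_log is a hypothetical helper name.
parse_time_log() {
  awk -F': ' '
    /Elapsed/ {
      # "h:mm:ss" or "m:ss": the last piece is seconds, the one before minutes.
      n = split($2, t, ":")
      minutes = 0
      for (i = 1; i <= n; i++) minutes += t[i] * 60 ^ (n - i - 1)
    }
    /Maximum resident set size/ { mem_gb = $2 / 1000000 }   # kbytes -> GB
    /Percent of CPU/            { ncpu = ($2 + 0) / 100 }   # percent -> CPUs
    END { printf "%s\t%s\t%s\n", minutes, mem_gb, ncpu }
  ' "$1"
}

# Hypothetical sample log (values invented for the demo).
log=$(mktemp)
printf '\tPercent of CPU this job got: 372%%\n' > "$log"
printf '\tElapsed (wall clock) time (h:mm:ss or m:ss): 1:30.00\n' >> "$log"
printf '\tMaximum resident set size (kbytes): 9907200\n' >> "$log"

parse_time_log "$log"
```

The three printed fields correspond to the Time_cost, Memory_cost and nCPU columns of the results table.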

The full integrated code

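i=SRR1039517
/bin/rm -f ${i}_1.part_1. ${i}_2.part_2. _39517_star_reads_

for part in `seq 1 172`; do
  # Append the next 200,000-read chunk to the cumulative read files.
  cat ${i}_1./${i}_1.part_${part}. >> ${i}_1.part_1.
  cat ${i}_1./${i}_2.part_${part}. >> ${i}_2.part_2.

  # [...] the timed STAR run that writes ${i}.${part}.log did not survive in the source text

  # Total reads (million) and bases (Gb) in the cumulative files, merged with the timing log.
  ~/soft/seqkit stats -j 8 ${i}_1.part_1. ${i}_2.part_2. | sed 's/,//g' | \
    awk 'BEGIN{OFS="\t"} {reads+=$4; bases+=$5} END{print "nReads\tnBases"; print reads/10^6, bases/10^9}' | \
    awk -v outputHeader=${part} -f ./ [...] ${i}.${part}.log \
    [...] GRCh38_39517_star_reads_
done

/bin/rm -f ${i}_1.part_1. ${i}_2.part_2.

# Size of each STAR output directory, as reported by du -s, divided by 10^6.
(for part in `seq 1 172`; do
  du -s ${i}.${part} | \
    awk -v i=${part} '{if(i==1) print "outputSize"; print $1/10^6}'
done) \
  [...] GRCh38_39517_star_reads_ [...] _39517_star_reads_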
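The accumulation logic of the loop can be illustrated without seqkit or STAR. This is a minimal sketch on toy data: the file names, the three two-read parts and the `counts.txt` file are all hypothetical, and the alignment step is replaced by a comment.

```shell
#!/bin/sh
# Sketch: build cumulative read sets by appending one part at a time.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three hypothetical parts with 2 reads each (4 lines per FASTQ record).
for part in 1 2 3; do
  for r in 1 2; do
    printf '@read_%s_%s\nACGT\n+\nIIII\n' "$part" "$r"
  done > "part_${part}.fq"
done

rm -f cumulative.fq counts.txt
for part in 1 2 3; do
  # Append the next part, so the file grows by a fixed number of reads.
  cat "part_${part}.fq" >> cumulative.fq
  # The real workflow would run the aligner on cumulative.fq here, under a timer.
  n_reads=$(( $(wc -l < cumulative.fq) / 4 ))
  echo "$part $n_reads" >> counts.txt
done

cat counts.txt
```

Each pass through the loop therefore measures a data set one part larger than the previous one. Gzip-compressed parts can be appended with `cat` in the same way, since a sequence of gzip members is itself a valid gzip stream.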

The results are as follows:

outputSize(G)	Time_cost(minutes)	Memory_cost(Gb)	nCPU	nReads(million)	nBases(Gb)
0.034332	0.28	9.9072	1.9	0.4	0.0252
0.067064	0.228	13.2356	2.28	0.8	0.0504
0.099504	0.340167	15.5853	1.98	1.2	0.0756
0.131672	0.326	17.3644	2.55	1.6	0.1008
0.164156	0.413	18.7912	2.43	2	0.126
0.196804	0.463167	19.969	2.52	2.4	0.1512
0.229112	0.4615	20.939	2.88	2.8	0.1764
0.261368	0.516667	21.7589	2.83	3.2	0.2016
0.294484	0.568	22.5243	2.89	3.6	0.2268
0.32744	0.586667	23.182	3.06	4	0.252
0.359588	0.601	23.7126	3.26	4.4	0.2772
0.391772	0.683333	24.1722	3.09	4.8	0.3024
0.424232	0.826167	24.595	2.98	5.2	0.3276
0.456388	0.854333	24.9512	3.28	5.6	0.3528
0.488488	0.917167	25.2751	3.29	6	0.378
0.520716	0.937	25.5713	3.33	6.4	0.4032
0.552924	0.999167	25.8344	3.32	6.8	0.4284
0.584712	1.0485	26.0817	3.37	7.2	0.4536
0.616944	1.0965	26.2919	3.39	7.6	0.4788
0.64944	1.135	26.512	3.45	8	0.504
0.681984	1.17517	26.6706	3.49	8.4	0.5292
0.714948	1.14683	26.8639	3.52	8.8	0.5544
0.748312	1.10883	27.0044	3.47	9.2	0.5796
0.780928	1.14	27.1747	3.55	9.6	0.6048
0.8138	1.26317	27.2956	3.5	10	0.63
0.846996	1.3	27.4216	3.54	10.4	0.6552
0.880352	1.486	27.5397	3.59	10.8	0.6804
0.91364	1.49983	27.6463	3.6	11.2	0.7056
0.946512	1.55083	27.7506	3.53	11.6	0.7308
0.979624	1.61967	27.8361	3.49	12	0.756
1.01286	1.62517	27.943	3.57	12.4	0.7812
1.04676	1.66767	28.0145	3.59	12.8	0.8064
1.08059	1.6755	28.1329	3.63	13.2	0.8316
1.11306	1.7145	28.1842	3.63	13.6	0.8568
1.14475	1.7735	28.2665	3.65	14	0.882
1.17665	1.89567	28.308	3.62	14.4	0.9072
1.20814	1.96067	28.3619	3.55	14.8	0.9324
1.23969	1.948	28.4374	3.67	15.2	0.9576
1.27152	2.00583	28.4658	3.69	15.6	0.9828
1.30336	2.05633	28.5352	3.65	16	1.008
1.33483	2.04983	28.5566	3.68	16.4	1.0332
1.36708	2.1245	28.6243	3.65	16.8	1.0584
1.39974	2.25417	28.6217	3.67	17.2	1.0836
1.43171	2.27817	28.7085	3.72	17.6	1.1088
1.46321	2.322	28.7688	3.71	18	1.134
1.49494	2.282	28.8025	3.71	18.4	1.1592
1.5268	2.3305	28.8338	3.72	18.8	1.1844
1.55854	2.36067	28.8627	3.73	19.2	1.2096
1.59012	2.412	28.89	3.68	19.6	1.2348
1.62192	2.43117	28.917	3.69	20	1.26
1.65361	2.49833	28.9436	3.68	20.4	1.2852
1.68481	2.56833	28.9686	3.7	20.8	1.3104
1.71662	2.58367	28.9926	3.74	21.2	1.3356
1.74859	2.64333	29.0157	3.75	21.6	1.3608
1.78063	2.72033	29.0376	3.71	22	1.386
1.81293	2.7445	29.0589	3.74	22.4	1.4112
1.84533	2.79183	29.0792	3.75	22.8	1.4364
1.87753	2.7765	29.0982	3.74	23.2	1.4616
1.91008	2.86767	29.1176	3.72	23.6	1.4868
1.94256	3.00517	29.1356	3.75	24	1.512
1.97522	3.06933	29.1535	3.72	24.4	1.5372
2.00784	3.06633	29.1702	3.72	24.8	1.5624
2.04029	3.083	29.1852	3.64	25.2	1.5876
2.07288	3.0285	29.2007	3.77	25.6	1.6128
2.1055	3.049	29.2164	3.78	26	1.638
2.13861	3.08583	29.2336	3.74	26.4	1.6632
2.1727	3.18417	29.2544	3.75	26.8	1.6884
2.20504	3.22867	29.2682	3.78	27.2	1.7136
2.23734	3.35617	29.2812	3.72	27.6	1.7388
2.26982	3.3635	29.2938	3.78	28	1.764
2.30174	3.38567	29.3051	3.8	28.4	1.7892
2.33385	3.45417	29.316	3.79	28.8	1.8144
2.36633	3.49933	29.3272	3.78	29.2	1.8396
2.3987	3.44333	29.3379	3.78	29.6	1.8648
2.43073	3.545	29.3478	3.81	30	1.89
2.46358	3.74117	29.3585	3.79	30.4	1.9152
2.49704	3.79167	29.3694	3.81	30.8	1.9404
2.52971	3.68567	29.3786	3.82	31.2	1.9656
2.56169	3.7315	29.3871	3.78	31.6	1.9908
2.59398	3.78933	29.3955	3.77	32	2.016
2.62646	3.748	29.4036	3.78	32.4	2.0412
2.65862	3.76967	29.4113	3.81	32.8	2.0664
2.69066	3.92767	29.4191	3.8	33.2	2.0916
2.72278	3.95983	29.4265	3.81	33.6	2.1168
2.75499	4.024	29.4338	3.8	34	2.142
2.78692	4.05083	29.4403	3.82	34.4	2.1672
2.81906	4.1095	29.447	3.81	34.8	2.1924
2.8514	4.18167	29.4532	3.78	35.2	2.2176
2.88368	4.09467	29.4598	3.81	35.6	2.2428
2.91615	4.25317	29.4661	3.78	36	2.268
2.94896	4.43683	29.4726	3.81	36.4	2.2932
2.98138	4.49333	29.4781	3.79	36.8	2.3184
3.01411	4.38067	29.4837	3.77	37.2	2.3436
3.04697	4.34867	29.4898	3.82	37.6	2.3688
3.0799	4.382	29.4955	3.83	38	2.394
3.11275	4.35467	29.5012	3.81	38.4	2.4192
3.14541	4.9245	29.5065	3.42	38.8	2.4444
3.17823	4.62317	29.5116	3.77	39.2	2.4696
3.21122	4.63733	29.5172	3.82	39.6	2.4948
3.24487	4.68467	29.5231	3.83	40	2.52
3.27845	4.711	29.5285	3.85	40.4	2.5452
3.31161	4.8395	29.5348	3.8	40.8	2.5704
3.3444	4.7865	29.5413	3.82	41.2	2.5956
3.37759	4.883	29.5497	3.81	41.6	2.6208
3.41041	5.14933	29.5569	3.83	42	2.646
3.44278	5.204	29.5618	3.85	42.4	2.6712
3.47552	5.124	29.5659	3.8	42.8	2.6964
3.5085	5.10783	29.5709	3.81	43.2	2.7216
3.54084	5.109	29.5752	3.84	43.6	2.7468
3.57301	5.12683	29.5787	3.78	44	2.772
3.60639	5.23267	29.5826	3.82	44.4	2.7972
3.63972	5.61417	29.5867	3.64	44.8	2.8224
3.67239	6.52033	29.5927	3.48	45.2	2.8476
3.70516	7.05183	29.5007	3.42	45.6	2.8728
3.73857	6.949	29.6024	3.48	46	2.898
3.77162	6.888	29.6069	3.45	46.4	2.9232
3.80419	5.52133	29.6101	3.87	46.8	2.9484
3.83614	5.8575	29.6128	3.84	47.2	2.9736
3.86866	5.91483	29.6156	3.85	47.6	2.9988
3.90138	5.78217	29.6197	3.8	48	3.024
3.93331	5.74883	29.6226	3.84	48.4	3.0492
3.96561	5.77317	29.6252	3.83	48.8	3.0744
3.99835	5.76267	29.6295	3.83	49.2	3.0996
4.03138	5.97367	29.6337	3.82	49.6	3.1248
4.06518	8.01017	29.3961	3.38	50	3.15
4.09969	8.138	29.6426	3.38	50.4	3.1752
4.13252	9.755	29.4349	3.26	50.8	3.2004
4.1656	6.49233	29.5448	3.63	51.2	3.2256
4.19947	8.876	29.4903	3.37	51.6	3.2508
4.23331	6.5035	29.6538	3.64	52	3.276
4.26617	6.48483	29.6557	3.84	52.4	3.3012
4.29916	6.68383	29.6583	3.84	52.8	3.3264
4.33211	6.51417	29.6607	3.84	53.2	3.3516
4.36514	6.37817	29.6629	3.86	53.6	3.3768
4.39864	6.481	29.6653	3.81	54	3.402
4.4327	6.43167	29.6689	3.84	54.4	3.4272
4.46627	8.351	29.6731	3.01	54.8	3.4524
4.49779	7.4865	29.6494	3.6	55.2	3.4776
4.52981	10.1942	29.4533	3.19	55.6	3.5028
4.56185	10.7782	29.2792	3.19	56	3.528
4.59322	9.7155	29.3352	3.38	56.4	3.5532
4.62515	9.30333	29.5208	3.41	56.8	3.5784
4.65749	7.293	29.6829	3.56	57.2	3.6036
4.68936	6.48533	29.6845	3.85	57.6	3.6288
4.72118	6.53733	29.686	3.84	58	3.654
4.75413	6.69733	29.6876	3.84	58.4	3.6792
4.78725	6.46033	29.6895	3.84	58.8	3.7044
4.81937	6.389	29.6908	3.81	59.2	3.7296
4.85105	6.33033	29.6923	3.84	59.6	3.7548
4.88335	9.42183	29.6938	2.62	60	3.78
4.91568	7.21483	29.4664	3.54	60.4	3.8052
4.94768	6.4205	29.6966	3.78	60.8	3.8304
4.97946	6.337	29.6978	3.85	61.2	3.8556
5.01139	7.06983	29.699	3.86	61.6	3.8808
5.04355	7.15017	29.7001	3.86	62	3.906
5.07542	7.215	29.7013	3.85	62.4	3.9312
5.10747	7.25917	29.7027	3.85	62.8	3.9564
5.13966	7.27783	29.704	3.86	63.2	3.9816
5.17208	7.3745	29.7054	3.83	63.6	4.0068
5.20458	7.33233	29.707	3.86	64	4.032
5.23759	7.41067	29.7087	3.85	64.4	4.0572
5.27022	7.46467	29.71	3.87	64.8	4.0824
5.30309	7.49667	29.7112	3.86	65.2	4.1076
5.33602	7.554	29.7125	3.86	65.6	4.1328
5.36921	7.5765	29.7138	3.88	66	4.158
5.4022	7.25533	29.7152	3.86	66.4	4.1832
5.43503	6.9645	29.7164	3.85	66.8	4.2084
5.468	6.99817	29.7176	3.87	67.2	4.2336
5.50081	7.01467	29.719	3.87	67.6	4.2588
5.53405	7.02883	29.7202	3.86	68	4.284
5.56759	7.152	29.7214	3.85	68.4	4.3092
5.58427	7.089	29.7221	3.85	68.5965	4.32158

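How STAR alignment time changes with data volume

Below about 50 million reads (roughly 3 Gb), alignment time is almost perfectly positively correlated with data volume.

With more data than that, alignment time fluctuates widely and the trend is no longer clear.

Overall the differences in time are modest: a single sample finishes in ten-odd minutes at most (the slowest run in the table took about 10.8 minutes).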
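One way to put a number on the near-linear region is a least-squares slope. The sketch below is an illustration rather than part of the original analysis: it uses four (nReads, Time_cost) pairs copied from the results table (10, 20, 30 and 40 million reads), and the helper name `fit_slope` is hypothetical.

```shell
#!/bin/sh
# Sketch: least-squares slope of Time_cost (minutes) against nReads (million).
# fit_slope is a hypothetical helper; input is "x y" pairs on stdin.
fit_slope() {
  awk '{ n++; sx += $1; sy += $2; sxx += $1 * $1; sxy += $1 * $2 }
       END { printf "%.3f\n", (n * sxy - sx * sy) / (n * sxx - sx * sx) }'
}

# Four rows taken from the results table: nReads, Time_cost.
fit_slope <<EOF
10 1.26317
20 2.43117
30 3.545
40 4.68467
EOF
```

On these four points the slope comes out at about 0.114 minutes, roughly 7 seconds, per additional million reads with 4 threads.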


library(ImageGP)
library(ggplot2)
library(patchwork)

p1 <- sp_scatterplot("GRCh38_39517_star_reads_", melted = T,
                     xvariable = "nReads", yvariable = "Time_cost",
                     smooth_method = "auto",
                     x_label = "Sequencing reads (Million)",
                     y_label = "Running time (minutes)") +
  scale_x_continuous(breaks = seq(0, 70, by = 5)) +
  scale_y_continuous(breaks = seq(1, 12, length = 12))

p2 <- sp_scatterplot("GRCh38_39517_star_reads_", melted = T,
                     xvariable = "nBases", yvariable = "Time_cost",
                     smooth_method = "auto",
                     x_label = "Sequencing reads (Gb)",
                     y_label = "Running time (minutes)") +
  scale_x_continuous(breaks = seq(0.5, 5, by = 0.5)) +
  scale_y_continuous(breaks = seq(1, 12, length = 12))

p1 + p2

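How the memory STAR needs changes with data volume

Below 20 million reads (about 1.2 Gb), the memory STAR needs rises quickly with sequencing depth, from 9.9 GB to about 28 GB.

Beyond that, memory use changes little as the data grows and stays under 30 GB.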


p1 <- sp_scatterplot("GRCh38_39517_star_reads_", melted = T,
                     xvariable = "nReads", yvariable = "Memory_cost",
                     smooth_method = "auto",
                     x_label = "Sequencing reads (Million)",
                     y_label = "Maximum physical memory required (Gb)") +
  scale_x_continuous(breaks = seq(0, 70, by = 5)) +
  scale_y_continuous(breaks = seq(9, 30, length = 22))

p2 <- sp_scatterplot("GRCh38_39517_star_reads_", melted = T,
                     xvariable = "nBases", yvariable = "Memory_cost",
                     smooth_method = "auto",
                     x_label = "Sequencing reads (Gb)",
                     y_label = "Maximum physical memory required (Gb)") +
  scale_x_continuous(breaks = seq(0.5, 5, by = 0.5)) +
  scale_y_continuous(breaks = seq(9, 30, length = 22))

p1 + p2

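CPU utilization during STAR alignment

Four threads were provided, but not every stage of the run can keep all of them busy.

The larger the data volume, the higher the CPU utilization.

The effect of different thread counts on alignment will be tested later.

The data sizes at which CPU utilization drops appear to coincide with those at which the running time is abnormal.

The likely explanation is that disk I/O was busy at those points, which lengthened the run and lowered CPU utilization.


The figure below shows that the samples with abnormal alignment times also have correspondingly low CPU utilization, although there is no very obvious regularity. These runs were probably affected by other programs reading from and writing to the disk at the same time.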
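The correspondence between slow runs and low CPU utilization can also be checked directly from the table. This is a minimal sketch, assuming a cutoff of 3.6 CPUs that was chosen only for illustration; `flag_low_cpu` is a hypothetical helper, and the eight rows are copied from the results table.

```shell
#!/bin/sh
# Sketch: print minutes per million reads for each run and mark runs whose
# CPU usage falls below a cutoff. Input columns: nReads Time_cost nCPU.
flag_low_cpu() {
  awk -v cutoff="${1:-3.6}" '{
    rate = $2 / $1                           # minutes per million reads
    mark = ($3 < cutoff) ? "LOW_CPU" : "ok"
    printf "%s\t%.3f\t%s\t%s\n", $1, rate, $3, mark
  }'
}

# Rows taken from the results table: nReads, Time_cost, nCPU.
flag_low_cpu 3.6 <<EOF
44.4 5.23267 3.82
45.2 6.52033 3.48
45.6 7.05183 3.42
46.8 5.52133 3.87
50 8.01017 3.38
52.4 6.48483 3.84
55.6 10.1942 3.19
57.6 6.48533 3.85
EOF
```

In this subset the four flagged runs cost 0.14 to 0.18 minutes per million reads, against 0.11 to 0.12 for the others, which is consistent with the slowdown and the CPU drop occurring together.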


p1 <- sp_scatterplot("GRCh38_39517_star_reads_", melted = T,
                     xvariable = "nReads", yvariable = "nCPU",
                     smooth_method = "auto",
                     x_label = "Sequencing reads (Million)",
                     y_label = "Number of CPUs used") +
  scale_x_continuous(breaks = seq(0, 70, by = 5)) +
  scale_y_continuous(breaks = seq(1, 4.5, by = 0.5), limits = c(1.5, 4))

p2 <- sp_scatterplot("GRCh38_39517_star_reads_", melted = T,
                     xvariable = "nBases", yvariable = "nCPU",
                     smooth_method = "auto",
                     x_label = "Sequencing reads (Gb)",
                     y_label = "Number of CPUs used") +
  scale_x_continuous(breaks = seq(0.5, 5, by = 0.5)) +
  scale_y_continuous(breaks = seq(1, 4.5, by = 0.5), limits = c(1.5, 4))

p1 + p2

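How the size of STAR's output files changes with data volume

The two are perfectly positively correlated: the more data, the larger the output files.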


p1 <- sp_scatterplot("GRCh38_39517_star_reads_", melted = T,
                     xvariable = "nReads", yvariable = "outputSize",
                     smooth_method = "auto",
                     x_label = "Sequencing reads (Million)",
                     y_label = "Disk space usage (Gb)") +
  scale_x_continuous(breaks = seq(0, 70, by = 5)) +
  scale_y_continuous(breaks = seq(0, 6, by = 0.5), limits = c(0, 6))

p2 <- sp_scatterplot("GRCh38_39517_star_reads_", melted = T,
                     xvariable = "nBases", yvariable = "outputSize",
                     smooth_method = "auto",
                     x_label = "Sequencing reads (Gb)",
                     y_label = "Disk space usage (Gb)") +
  scale_x_continuous(breaks = seq(0.5, 5, by = 0.5)) +
  scale_y_continuous(breaks = seq(0, 6, by = 0.5), limits = c(0, 6))

p1 + p2

To be continued.
