A Big Data Expert Explains Hadoop MapReduce over Massive Numbers of Small Files: Compressed Files

Preface

Storing large numbers of small files on HDFS is very costly in NameNode memory: every file gets its own metadata entry, and the NameNode must load the metadata for all files at startup, so the more files there are, the greater the overhead on the NameNode. One remedy is to compress the small files into a single file before uploading to HDFS; then only one file's metadata is needed, which greatly reduces the NameNode's memory overhead. For MapReduce computation, Hadoop ships with the following built-in compression formats:

DEFLATE

gzip

bzip2

LZO
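
As an aside (an addition, not part of the original article): at read time, Hadoop's TextInputFormat resolves these formats from the file extension via the CompressionCodecFactory API. A minimal sketch, with a hypothetical file name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Look up the codec Hadoop would use for a file, based on its extension.
public class CodecProbe {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // ".gz" resolves to GzipCodec, ".bz2" to BZip2Codec, and so on;
        // null means the file is treated as an uncompressed plain file
        CompressionCodec codec = factory.getCodec(new Path("data_50000_1.gz"));
        System.out.println(codec == null ? "no codec (plain file)"
                                         : codec.getClass().getName());
    }
}

Of the formats above, only bzip2 (in later Hadoop releases) and LZO (with an external index) can be split; gzip and DEFLATE cannot, which matters for the locality discussion below.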


The cost of running MapReduce over compressed files is the time spent decompressing them, and in specific applications that is worth weighing. For the massive-small-files scenario, however, compressing the small files buys us the Locality property: gzip is not a splittable format, so if hundreds or thousands of small files compress down to a single block, that block necessarily lives on one DataNode, the job receives a single InputSplit for it, and the computation runs locally with no data transferred across the network. (A sketch of the packing step itself follows.)
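
The article does not show the packing step; as a minimal sketch (not from the original), line-oriented records can simply be concatenated into one .gz with plain java.util.zip. The class name and paths here are illustrative:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class SmallFilePacker {
    // Concatenate every small file in a directory into a single gzip archive.
    // Line-oriented records survive plain concatenation, so a MapReduce job
    // can later read the archive as one big text file.
    public static void main(String[] args) throws IOException {
        File[] files = new File(args[0]).listFiles();
        if (files == null) {
            throw new IOException(args[0] + " is not a readable directory");
        }
        try (GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(args[1]))) {
            byte[] buf = new byte[8192];
            for (File f : files) {
                if (!f.isFile()) {
                    continue; // skip subdirectories
                }
                try (FileInputStream in = new FileInputStream(f)) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                }
                // guard newline so the last record of one file cannot fuse
                // with the first record of the next
                out.write('\n');
            }
        }
    }
}

The resulting archive is then uploaded with bin/hadoop fs -copyFromLocal, exactly as in the transcript below.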

If, on the other hand, you upload the small files to HDFS directly, hundreds or thousands of small blocks end up scattered across different DataNodes, and the job may have to "move the data" before it can compute. With only a few files, you may notice nothing beyond the NameNode memory overhead, but once the small files reach a certain scale the network transfer cost becomes very apparent. Below, we compress the small files with gzip, upload the result to HDFS, and run a MapReduce job over it. A single class implements both the map and reduce tasks, as shown here:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class GzipFilesMaxCostComputation {

    public static class GzipFilesMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        private final static LongWritable costValue = new LongWritable(0);
        private Text code = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // a line carries four whitespace-separated fields; field 0 is the
            // country code (e.g. 'SG') and field 3 is the cost
            String line = value.toString();
            String[] array = line.split("\\s");
            if (array.length == 4) {
                String countryCode = array[0];
                String strCost = array[3];
                long cost = 0L;
                try {
                    cost = Long.parseLong(strCost);
                } catch (NumberFormatException e) {
                    cost = 0L;
                }
                if (cost != 0) {
                    code.set(countryCode);
                    costValue.set(cost);
                    context.write(code, costValue);
                }
            }
        }
    }

    public static class GzipFilesReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            // keep the maximum cost seen for this country code
            long max = 0L;
            Iterator<LongWritable> iter = values.iterator();
            while (iter.hasNext()) {
                LongWritable current = iter.next();
                if (current.get() > max) {
                    max = current.get();
                }
            }
            context.write(key, new LongWritable(max));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: gzipmaxcost <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "gzip maxcost");

        // compress the final job output with gzip (Hadoop 1.x property names)
        job.getConfiguration().setBoolean("mapred.output.compress", true);
        job.getConfiguration().setClass("mapred.output.compression.codec",
                GzipCodec.class, CompressionCodec.class);

        job.setJarByClass(GzipFilesMaxCostComputation.class);
        job.setMapperClass(GzipFilesMapper.class);
        job.setCombinerClass(GzipFilesReducer.class);
        job.setReducerClass(GzipFilesReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        int exitFlag = job.waitForCompletion(true) ? 0 : 1;
        System.exit(exitFlag);
    }
}
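
Assuming the class is packaged into a job jar (the jar name here is hypothetical), it would be launched roughly as bin/hadoop jar smallfiles-gzip.jar GzipFilesMaxCostComputation <in> <out>; the actual run appears in the transcript below.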

The program above solves a simple maximum-value problem over gzip-compressed input. Since taking a maximum is associative and commutative, the reducer is reused as the combiner, which is why combine counters appear in the job log below. Additionally, if a large amount of map output has to be copied to the reducers, consider also enabling intermediate compression when configuring the job; the final-output compression settings are shown in the code above, and a sketch of map-output compression follows.
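
A minimal sketch of enabling map-output compression, assuming the Hadoop 1.x property names of this era (newer releases use the mapreduce.map.output.compress* keys instead); this helper class is not part of the original program:

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;

public class MapOutputCompression {
    // Compress intermediate map output before it is shuffled to reducers.
    // This shrinks the data copied across the network at the cost of extra
    // CPU for compressing on the map side and decompressing on the reduce side.
    public static void configure(Job job) {
        job.getConfiguration().setBoolean("mapred.compress.map.output", true);
        job.getConfiguration().setClass("mapred.map.output.compression.codec",
                GzipCodec.class, CompressionCodec.class);
    }
}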


Now let's walk through running the program.

Prepare the data

xiaoxiang@ubuntu3:/opt/stone/cloud/$ du -sh ../dataset/gzipfiles/*
147M	../dataset/gzipfiles/data_10
16M	../dataset/gzipfiles/data_50000_1.gz
16M	../dataset/gzipfiles/data_50000_2.gz
xiaoxiang@ubuntu3:/opt/stone/cloud/$ bin/hadoop fs -mkdir /user/xiaoxiang/datasets/gzipfiles
xiaoxiang@ubuntu3:/opt/stone/cloud/$ bin/hadoop fs -copyFromLocal ../dataset/gzipfiles/* /user/xiaoxiang/datasets/gzipfiles
xiaoxiang@ubuntu3:/opt/stone/cloud/$ bin/hadoop fs -ls /user/xiaoxiang/datasets/gzipfiles
Found 3 items
-rw-r--r--   3 xiaoxiang supergroup   2013-03-24 12:56 /user/xiaoxiang/datasets/gzipfiles/data_10
-rw-r--r--   3 xiaoxiang supergroup   2013-03-24 12:56 /user/xiaoxiang/datasets/gzipfiles/data_50000_1.gz
-rw-r--r--   3 xiaoxiang supergroup   2013-03-24 12:56 /user/xiaoxiang/datasets/gzipfiles/data_50000_2.gz

Run the program

xiaoxiang@ubuntu3:/opt/stone/cloud/$ bin/hadoop jar <job jar> /user/xiaoxiang/datasets/gzipfiles /user/xiaoxiang/output/smallfiles/gzip
13/03/24 13:06:28: Total input paths to process : 3
13/03/24 13:06:28: Loaded the native-hadoop library
13/03/24 13:06:28: Snappy native library not loaded
13/03/24 13:06:28: Running job: job_201303111631_0039
13/03/24 13:06:29:  map 0% reduce 0%
13/03/24 13:06:55:  map 33% reduce 0%
13/03/24 13:07:04:  map 66% reduce 11%
13/03/24 13:07:13:  map 66% reduce 22%
13/03/24 13:07:25:  map 100% reduce 22%
13/03/24 13:07:31:  map 100% reduce 100%
13/03/24 13:07:36: Job complete: job_201303111631_0039
13/03/24 13:07:36: Counters: 29
13/03/24 13:07:36:   Job Counters
13/03/24 13:07:36:     Launched reduce tasks=1
13/03/24 13:07:36:     SLOTS_MILLIS_MAPS=78231
13/03/24 13:07:36:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/24 13:07:36:     Total time spent by all maps waiting after reserving slots (ms)=0
13/03/24 13:07:36:     Launched map tasks=3
13/03/24 13:07:36:     Data-local map tasks=3
13/03/24 13:07:36:     SLOTS_MILLIS_REDUCES=34413
13/03/24 13:07:36:   File Output Format Counters
13/03/24 13:07:36:     Bytes Written=1337
13/03/24 13:07:36:   FileSystemCounters
13/03/24 13:07:36:     FILE_BYTES_READ=288127
13/03/24 13:07:36:     HDFS_BYTES_READ=214131026
13/03/24 13:07:36:     FILE_BYTES_WRITTEN=385721
13/03/24 13:07:36:     HDFS_BYTES_WRITTEN=1337
13/03/24 13:07:36:   File Input Format Counters
13/03/24 13:07:36:     Bytes Read=214130628
13/03/24 13:07:36:   Map-Reduce Framework
13/03/24 13:07:36:     Map output materialized bytes=9105
13/03/24 13:07:36:     Map input records=14080003
13/03/24 13:07:36:     Reduce shuffle bytes=6070
13/03/24 13:07:36:     Spilled Records=22834
13/03/24 13:07:36:     Map output bytes=
13/03/24 13:07:36:     CPU time spent (ms)=90200
13/03/24 13:07:36:     Total committed heap usage (bytes)=688193536
13/03/24 13:07:36:     Combine input records=14092911
13/03/24 13:07:36:     SPLIT_RAW_BYTES=398
13/03/24 13:07:36:     Reduce input records=699
13/03/24 13:07:36:     Reduce input groups=233
13/03/24 13:07:36:     Combine output records=13747
13/03/24 13:07:36:     Physical memory (bytes) snapshot=765448192
13/03/24 13:07:36:     Reduce output records=233
13/03/24 13:07:36:     Virtual memory (bytes) snapshot=2211237888
13/03/24 13:07:36:     Map output records=14079863

The counters bear out the locality argument: the three input files produced exactly three map tasks, all of them data-local (Data-local map tasks=3), so no input block crossed the network during the map phase. The combiner collapsed roughly 14 million map output records down to 699 reduce input records, so the shuffle moved only about 6 KB (Reduce shuffle bytes=6070).

Results

xiaoxiang@ubuntu3:/opt/stone/cloud/$ bin/hadoop fs -ls /user/xiaoxiang/output/smallfiles/gzip
Found 3 items
-rw-r--r--   3 xiaoxiang supergroup     0 2013-03-24 13:07 /user/xiaoxiang/output/smallfiles/gzip/_SUCCESS
drwxr-xr-x   - xiaoxiang supergroup     0 2013-03-24 13:06 /user/xiaoxiang/output/smallfiles/gzip/_logs
-rw-r--r--   3 xiaoxiang supergroup  1337 2013-03-24 13:07 /user/xiaoxiang/output/smallfiles/gzip/part-r-00000.gz
xiaoxiang@ubuntu3:/opt/stone/cloud/$ bin/hadoop fs -copyToLocal /user/xiaoxiang/output/smallfiles/gzip/part-r-00000.gz ./
xiaoxiang@ubuntu3:/opt/stone/cloud/$ gunzip -c ./part-r-00000.gz
AM	999978568
AO	999989628
AQ	999995031
AR	999999563
AS	999935982
AT	999999909
AU	999937089
AW	999965784
AZ	999996557
BA	999994828
BB	999992177
BD	999992272
BE	999925057
BF	999999220
BG	999971528
BH	999994900
BI	999982573
BJ	999977886
BM	999991925
BN	999986630
BO	999995482
BR	999989947
BS	999983475
BT	999992685
BW	999984222
BY	999998496
BZ	999997173
CA	999991096
CC	999969761
CD	999978139
CF	999995342
CG	999957938
CH	999997524
CI	999998864
CK	999968719
CL	999967083
CM	999998369
CN	999975367
CO	999999167
CR	999980097
CU	999976352
CV	999990543
CW	999996327
CX	999987579
CY	999982925
CZ	999993908
DE	999985416
DJ	999997438
DK	999963312
DM	999941706
DO	999992176
DZ	999973610
EC	999971018
EE	999960984
EG	999980522
ER	999980425
ES	999949155
ET	999987033
FI	999989788
FJ	999990686
FK	999977799
FM	999994183
FO	999988472
FR	999988342
GA	999982099
GB	999970658
GD	999996318
GE	999991970
GF	999982024
GH	999941039
GI	999995295
GL	999948726
GM	999984872
GN	999992209
GP	999996090
GQ	999988635
GR	999999672
GT	999981025
GU	999975956
GW	999962551
GY	999999881
HK	999970084
HN	999972628
HR	999986688
HT	999970913
HU	999997568
ID	999994762
IE	999996686
IL	999982184
IM	999987831
IN	999973935
IO	999984611
IQ	999990126
IR	999986780
IS	999973585
IT	999997239
JM	999986629
JO	999982595
JP	999985598
KE	999996012
KG	999991556
KH	999975644
KI	999994328
KM	999989895
KN	999991068
KP	999967939
KR	999992162
KW	999924295
KY	999985907
KZ	999992835
LA	999989151
LB	999989233
LC	999994793
LI	999986863
LK	999989876
LR	999984906
LS	999957706
LT	999999688
LU	999999823
LV	999981633
LY	999992365
MA	999993880
MC	999978886
MD	999997483
MG	999996602
MH	999989668
MK	999983468
ML	999990079
MM	999989010
MN	999969051
MO	999978283
MP	999995848
MQ	999913110
MR	999982303
MS	999997548
MT	999982604
MU	999988632
MV	999975914
MW	999991903
MX	999978066
MY	999995010
MZ	999981189
NA	999976735
NC	999961053
NE	999990091
NF	999989399
NG	999985037
NI	999965733
NL	999988890
NO	999993122
NP	999972410
NR	999956464
NU	999987046
NZ	999998214
OM	999967428
PA	999944775
PE	999998598
PF	999959978
PG	999987347
PH	999981534
PK	999954268
PL	999996619
PM	999998975
PR	999978127
PT	999993404
PW	999991278
PY	999993590
QA	999995061
RE	999998518
RO	999994148
RS	999999923
RU	999995809
RW	999980184
SA	999973822
SB	999972832
SC	999991021
SD	999963744
SE	999972256
SG	999977637
SH	999999068
SI	999980580
SK	999998152
SL	999999269
SM	999941188
SN	999990278
SO	999978960
SR	999997483
ST	999980447
SV	999999945
SX	999938671
SY	999990666
SZ	999992537
TC	999969904
TD	999999303
TG	999977640
TH	999979255
TJ	999983666
TK	999971131
TM	999958998
TN	999979170
TO	999959971
TP	999986796
TR	999996679
TT	999984435
TV	999974536
TW	999975092
TZ	999992734
UA	999972948
UG	999980070
UM	999998377
US	999918442
UY	999989662
UZ	999982762
VA	999987372
VC	999991495
VE	999997971
VG	999954576
VI	999990063
VN	999974393
VU	999976113
WF	999961299
WS	999970242
YE	999984650
YT	999994707
ZA	999998692
ZM	999993331
ZW	999943540

