Translation: How to Benchmark a Hadoop Cluster

2013-10-31

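This is a fairly old article, and parts of my translation may not read smoothly. The word "benchmark" in particular is awkward to render; I take it here to mean performance testing. The original article is at: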
http://answers.oreilly.com/topic/460-how-to-benchmark-a-hadoop-cluster/

% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar

% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO
TestFDSIO.0.0.4
Usage: TestFDSIO -read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile resultFileName] [-bufferSize Bytes]

Benchmarking HDFS with TestDFSIO

% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000

% cat TestDFSIO_results.log
----- TestDFSIO ----- : write
           Date & time: Sun Apr 12 07:14:09 EDT 2009
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 7.796340865378244
Average IO rate mb/sec: 7.8862199783325195
 IO rate std deviation: 0.9101254683525547
    Test exec time sec: 163.387

% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

----- TestDFSIO ----- : read
           Date & time: Sun Apr 12 07:24:28 EDT 2009
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 80.25553361904304
Average IO rate mb/sec: 98.6801528930664
 IO rate std deviation: 36.63507598174921
    Test exec time sec: 47.624
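The two result blocks above can also be summarized as one rough cluster-wide rate per run, by dividing "Total MBytes processed" by "Test exec time sec". This is a derived figure of my own, not one that TestDFSIO prints, and it includes job startup time. A minimal sketch, assuming the log has the layout shown above; the function name dfsio_aggregate_rate is hypothetical:

```shell
# dfsio_aggregate_rate [FILE]
# For every run recorded in a TestDFSIO results log, print the mode
# (write or read) and total MB divided by wall-clock seconds.
dfsio_aggregate_rate() {
  awk -F': ' '
    # Field 1 is the right-aligned label; trim its padding.
    { key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key) }
    key == "----- TestDFSIO -----"  { mode = $2 }
    key == "Total MBytes processed" { mb = $2 }
    # The exec time line closes a run, so report the rate here.
    key == "Test exec time sec" && $2 > 0 {
      printf "%s %.2f MB/s\n", mode, mb / $2
    }
  ' "${1:-TestDFSIO_results.log}"
}
```

Calling `dfsio_aggregate_rate TestDFSIO_results.log` prints one line per run found in the log, so the write and read runs can be compared side by side.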

% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean
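The write, read, and clean steps above can be chained so that every run uses the same parameters. A minimal sketch, assuming HADOOP_INSTALL is set as in the commands above; the function name run_dfsio and its two positional arguments are hypothetical and not part of Hadoop:

```shell
# run_dfsio [NR_FILES] [FILE_SIZE_MB]
# Run the TestDFSIO write test, then the read test, then clean up.
# Stops at the first step that fails.
run_dfsio() {
  nr_files="${1:-10}"     # number of files, as in -nrFiles
  file_size="${2:-1000}"  # size of each file in MB, as in -fileSize
  hadoop jar "$HADOOP_INSTALL"/hadoop-*-test.jar TestDFSIO \
    -write -nrFiles "$nr_files" -fileSize "$file_size" &&
  hadoop jar "$HADOOP_INSTALL"/hadoop-*-test.jar TestDFSIO \
    -read -nrFiles "$nr_files" -fileSize "$file_size" &&
  hadoop jar "$HADOOP_INSTALL"/hadoop-*-test.jar TestDFSIO -clean
}
```

Running `run_dfsio` with no arguments repeats the 10 files of 1000 MB used above; `run_dfsio 4 100` would be a quicker smoke test.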

% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter random-data

Next, we can run the sort program:

% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort random-data sorted-data

% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar testmapredsort -sortInput random-data \
  -sortOutput sorted-data

SUCCESS! Validated the MapReduce framework's 'sort' successfully.
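The three sort steps above (generate random data, sort it, validate the result) can be wrapped so that the sort itself is timed. A minimal sketch, assuming HADOOP_INSTALL is set as above; the function name run_sort_benchmark is hypothetical, and the timing uses whole seconds from `date`, which is coarse but enough for jobs that run for minutes:

```shell
# run_sort_benchmark [INPUT_DIR] [OUTPUT_DIR]
# Generate random data, time the sort job, then validate the output.
run_sort_benchmark() {
  input="${1:-random-data}"
  output="${2:-sorted-data}"
  hadoop jar "$HADOOP_INSTALL"/hadoop-*-examples.jar randomwriter "$input" || return 1
  start=$(date +%s)
  hadoop jar "$HADOOP_INSTALL"/hadoop-*-examples.jar sort "$input" "$output" || return 1
  end=$(date +%s)
  # Only the sort step is measured, not data generation or validation.
  echo "sort wall-clock seconds: $((end - start))"
  hadoop jar "$HADOOP_INSTALL"/hadoop-*-test.jar testmapredsort \
    -sortInput "$input" -sortOutput "$output"
}
```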

Other benchmarks

There are many more Hadoop benchmarks, but the following are widely used:

MRBench (invoked with mrbench) runs a small job a number of times. It acts as a good counterpoint to sort, as it checks whether small job runs are responsive.

NNBench (invoked with nnbench) is useful for load testing namenode hardware.

Gridmix is a suite of benchmarks designed to model a realistic cluster workload, by mimicking a variety of data-access patterns seen in practice. See src/benchmarks/gridmix2 in the distribution for further details.
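As an illustration of how MRBench is invoked from the same test JAR, here is a minimal sketch. The `-numRuns` option is my assumption about the mrbench command line in this Hadoop version, so check the usage message printed by your own build before relying on it; the function name run_mrbench is hypothetical:

```shell
# run_mrbench [NUM_RUNS]
# Run the small MRBench job NUM_RUNS times (default 50).
# Assumes HADOOP_INSTALL is set and that mrbench accepts -numRuns.
run_mrbench() {
  hadoop jar "$HADOOP_INSTALL"/hadoop-*-test.jar mrbench -numRuns "${1:-50}"
}
```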

User Jobs

For tuning, it is best to include a few jobs that are representative of the jobs that your users run, so your cluster is tuned for these and not just for the standard benchmarks. If this is your first Hadoop cluster and you don't have any user jobs yet, then Gridmix is a good substitute.

When running your own jobs as benchmarks, you should select a dataset for your user jobs that you use each time you run the benchmarks, to allow comparisons between runs. When you set up a new cluster, or upgrade a cluster, you will be able to use the same dataset to compare the performance with previous runs.
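Comparing runs on the same dataset is easier if each result is written down in one place. A minimal sketch of that bookkeeping, assuming elapsed seconds are measured separately for each job; the file name benchmark_history.csv and the function names record_run and compare_runs are hypothetical:

```shell
# record_run LABEL SECONDS [FILE]
# Append one result as "timestamp,label,seconds".
record_run() {
  printf '%s,%s,%s\n' "$(date +%Y-%m-%dT%H:%M:%S)" "$1" "$2" \
    >> "${3:-benchmark_history.csv}"
}

# compare_runs LABEL [FILE]
# Print the change in seconds between the two most recent runs of LABEL.
compare_runs() {
  awk -F',' -v label="$1" '
    $2 == label { prev = last; last = $3; n++ }
    END {
      if (n < 2) { print "need at least two runs of " label; exit 1 }
      printf "%s: %s s -> %s s (%.1f%%)\n", label, prev, last, (last - prev) / prev * 100
    }
  ' "${2:-benchmark_history.csv}"
}
```

For example, after `record_run sort 200` before an upgrade and `record_run sort 150` after it, `compare_runs sort` reports the two times and a change of -25.0%.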

In a similar vein, PigMix is a set of benchmarks for Pig available from http://wiki.apache.org/pig/PigMix.
