Using SemanticAnalyzerHook to Filter Hive Queries That Lack Partition Conditions
Nearly 80% of the jobs on our Hadoop cluster are submitted through Hive. Because Hive is simple and convenient to write, and we also provide a Hive Web Client, it is used very widely: BAs, PMs, POs, and sales all run ad-hoc queries with it. But while Hive lowers the barrier to entry, it also makes it easy for users to write unreasonable, expensive statements that spawn many MapReduce jobs and occupy a large number of slots. The most typical case is querying a partitioned table without specifying a partition condition: Hive cannot apply its partition pruning optimization, so it reads the entire table's data and consumes a great deal of IO and compute resources.

To mitigate this as much as possible, we use Hive's hook mechanism and implement checks in a hook that pre-screen each statement. In the first phase we do not block offending statements outright; we only record them and publish them as warnings.

Concretely, we implement the HiveSemanticAnalyzerHook interface, whose preAnalyze and postAnalyze methods run before and after the compile function, respectively. We only need preAnalyze: walk the ASTNode abstract syntax tree passed in, take the FROM table name from the left subtree and the WHERE condition keys from the right subtree. If the FROM table is partitioned, fetch all of its partition key names through the metastore client. As long as any partition key appears in the user's WHERE condition, the statement passes the check; otherwise the hook prints a warning to standard error and records the username and statement in the backend log. Every so often we publish these bad cases to the hive-user mailing list, in the hope that this serves as a mutual warning and a learning aid.
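To make the traversal concrete, this is roughly the (simplified) AST Hive builds for a statement like select * from t where dt = '2013-01-01'. Child 0 of TOK_QUERY is the TOK_FROM subtree that the hook walks, while the WHERE condition hangs under TOK_INSERT:

    TOK_QUERY
    ├── TOK_FROM
    │   └── TOK_TABREF
    │       └── TOK_TABNAME
    │           └── t
    └── TOK_INSERT
        ├── TOK_DESTINATION
        ├── TOK_SELECT
        └── TOK_WHERE
            └── = (TOK_TABLE_OR_COL dt, '2013-01-01')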
Inside the compile function, Hive reflectively instantiates the hook class named by the hive.semantic.analyzer.hook property in HiveConf; in our case that is DPSemanticAnalyzerHook, which extends AbstractSemanticAnalyzerHook.
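For example, the hook can be registered in hive-site.xml like this (a minimal sketch; the fully qualified class name assumes the package declared in the code below, and the jar containing the class must be on Hive's classpath):

    <property>
      <name>hive.semantic.analyzer.hook</name>
      <value>org.apache.hadoop.hive.ql.parse.DPSemanticAnalyzerHook</value>
    </property>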
package org.apache.hadoop.hive.ql.parse;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.ql.exec.Task;
import org.apache.hadoop.hive.ql.metadata.Hive;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.metadata.Table;
import org.apache.hadoop.hive.ql.session.SessionState;
import org.apache.hadoop.hive.ql.session.SessionState.LogHelper;
import org.apache.hadoop.hive.shims.ShimLoader;

public class DPSemanticAnalyzerHook extends AbstractSemanticAnalyzerHook {
  private final static String NO_PARTITION_WARNING =
      "WARNING: HQL is not efficient, Please specify partition condition! HQL:%s ;USERNAME:%s";

  private final SessionState ss = SessionState.get();
  private final LogHelper console = SessionState.getConsole();
  private Hive hive = null;
  private String username;
  private String currentDatabase = "default";
  private String hql;
  private String whereHql;
  private String tableAlias;
  private String tableName;
  private String tableDatabaseName;
  private Boolean needCheckPartition = false;

  @Override
  public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
      throws SemanticException {
    try {
      hql = ss.getCmd().toLowerCase();
      hql = StringUtils.replaceChars(hql, '\n', ' ');
      // Naive substring scan: everything from the first "where" onward is
      // treated as the WHERE clause text.
      if (hql.contains("where")) {
        whereHql = hql.substring(hql.indexOf("where"));
      }
      username = ShimLoader.getHadoopShims().getUserName(context.getConf());

      if (ast.getToken().getType() == HiveParser.TOK_QUERY) {
        try {
          hive = context.getHive();
          currentDatabase = hive.getCurrentDatabase();
        } catch (HiveException e) {
          throw new SemanticException(e);
        }
        // Child 0 of TOK_QUERY is the TOK_FROM subtree.
        extractFromClause((ASTNode) ast.getChild(0));

        if (needCheckPartition && !StringUtils.isBlank(tableName)) {
          String dbname = StringUtils.isEmpty(tableDatabaseName) ? currentDatabase
              : tableDatabaseName;
          String tbname = tableName;
          // split() takes a regex, so the dot must be escaped; an unescaped
          // "." would match every character and yield an empty array.
          String[] parts = tableName.split("\\.");
          if (parts.length == 2) {
            dbname = parts[0];
            tbname = parts[1];
          }
          Table t = hive.getTable(dbname, tbname);
          if (t.isPartitioned()) {
            if (StringUtils.isBlank(whereHql)) {
              console.printError(String.format(NO_PARTITION_WARNING, hql, username));
            } else {
              List<FieldSchema> partitionKeys = t.getPartitionKeys();
              List<String> partitionNames = new ArrayList<String>();
              for (int i = 0; i < partitionKeys.size(); i++) {
                partitionNames.add(partitionKeys.get(i).getName().toLowerCase());
              }
              // The statement passes if any partition key appears in the
              // WHERE text; otherwise warn.
              if (!containsPartCond(partitionNames, whereHql, tableAlias)) {
                console.printError(String.format(NO_PARTITION_WARNING, hql, username));
              }
            }
          }
        }
      }
    } catch (Exception ex) {
      ex.printStackTrace();
    }
    return ast;
  }

  private boolean containsPartCond(List<String> partitionKeys, String sql, String alias) {
    for (String pk : partitionKeys) {
      if (sql.contains(pk)) {
        return true;
      }
      if (!StringUtils.isEmpty(alias) && sql.contains(alias + "." + pk)) {
        return true;
      }
    }
    return false;
  }

  private void extractFromClause(ASTNode ast) {
    if (HiveParser.TOK_FROM == ast.getToken().getType()) {
      ASTNode refNode = (ASTNode) ast.getChild(0);
      // Only handle the single-table case (no joins or subqueries).
      if (refNode.getToken().getType() == HiveParser.TOK_TABREF && ast.getChildCount() == 1) {
        ASTNode tabNameNode = (ASTNode) (refNode.getChild(0));
        int refNodeChildCount = refNode.getChildCount();
        if (tabNameNode.getToken().getType() == HiveParser.TOK_TABNAME) {
          if (tabNameNode.getChildCount() == 2) {
            // Two children means the table was written as db.table.
            tableDatabaseName = tabNameNode.getChild(0).getText().toLowerCase();
            tableName = BaseSemanticAnalyzer.getUnescapedName((ASTNode) tabNameNode.getChild(1))
                .toLowerCase();
          } else if (tabNameNode.getChildCount() == 1) {
            tableName = BaseSemanticAnalyzer.getUnescapedName((ASTNode) tabNameNode.getChild(0))
                .toLowerCase();
          } else {
            return;
          }
          if (refNodeChildCount == 2) {
            // The second child of TOK_TABREF, if present, is the table alias.
            tableAlias = BaseSemanticAnalyzer.unescapeIdentifier(refNode.getChild(1).getText())
                .toLowerCase();
          }
          needCheckPartition = true;
        }
      }
    }
  }

  @Override
  public void postAnalyze(HiveSemanticAnalyzerHookContext context,
      List<Task<? extends Serializable>> rootTasks) throws SemanticException {
    // Left empty for now; the read entities could be inspected here, e.g.:
    // Set<ReadEntity> readEntitys = context.getInputs();
    // console.printInfo("Total Read Entity Size:" + readEntitys.size());
    // for (ReadEntity readEntity : readEntitys) {
    //   Partition p = readEntity.getPartition();
    //   Table t = readEntity.getTable();
    // }
  }
}
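As a hypothetical example (the table access_log, its partition key dt, and the user someuser are all made up for illustration), a query on a partitioned table that omits the partition key would still compile and run, but the hook would print a warning in the NO_PARTITION_WARNING format above:

    hive> select count(*) from access_log;
    WARNING: HQL is not efficient, Please specify partition condition! HQL:select count(*) from access_log ;USERNAME:someuser

Adding a condition such as where dt = '2013-01-01' makes the statement pass the check silently, since dt is one of the table's partition keys.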