Converting HTML tag content to plain text

2012-09-14

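When building an interface, I found that some XML templates do not wrap their tag content in CDATA, so a character such as & in the content causes an error and has to be replaced. The simple approach is string replacement or regex matching that turns & into &amp;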
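The replacement described above can be sketched as a small helper. This is only an illustration, not code from the original project: the class name XmlEscapeSketch and the method escapeXml are made up for this example, and the sketch also escapes the other four predefined XML characters (<, >, ", ') in addition to &. It walks the string once, so the & inside an entity it has just written is never escaped a second time.

```java
final class XmlEscapeSketch {
    private XmlEscapeSketch() {}

    /** Escapes the five XML special characters; a null input yields "". */
    static String escapeXml(String text) {
        if (text == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: Tom &amp; Jerry &lt;live&gt;
        System.out.println(escapeXml("Tom & Jerry <live>"));
    }
}
```

Note that input which is already escaped gets escaped again (&amp; becomes &amp;amp;), so this helper should be applied to raw text only.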

In particular, some album descriptions contain HTML, and that HTML has to be converted to plain text before it is written to the XML file.

Below are two utility methods for converting HTML content to plain text, kept here for future reference.

Method 1: regex replacement

import java.util.regex.Pattern;

/**
 * Converts HTML to plain text by removing script blocks, style blocks, and tags.
 * @param inputString string that may contain HTML tags
 * @return the text with markup removed, or "" if the input is null
 */
public static String html2Text(String inputString) {
    if (inputString == null) {
        return "";
    }
    // Regex for a whole script element, i.e. <script ...> ... </script>
    String regExScript = "<[\\s]*?script[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?script[\\s]*?>";
    // Regex for a whole style element, i.e. <style ...> ... </style>
    String regExStyle = "<[\\s]*?style[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?style[\\s]*?>";
    // Regex for any remaining HTML tag
    String regExHtml = "<[^>]+>";

    String htmlStr = inputString; // string containing HTML tags
    // Strip script elements
    htmlStr = Pattern.compile(regExScript, Pattern.CASE_INSENSITIVE).matcher(htmlStr).replaceAll("");
    // Strip style elements
    htmlStr = Pattern.compile(regExStyle, Pattern.CASE_INSENSITIVE).matcher(htmlStr).replaceAll("");
    // Strip all other HTML tags
    htmlStr = Pattern.compile(regExHtml, Pattern.CASE_INSENSITIVE).matcher(htmlStr).replaceAll("");

    return htmlStr;
}
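If the method is called often, one possible variation is to compile the three patterns once as constants instead of on every call. The sketch below is an assumption about how that could look, not the original code: the class name HtmlTextSketch and the constant names are made up, and the patterns are slightly simplified forms of the ones above.

```java
import java.util.regex.Pattern;

final class HtmlTextSketch {
    // Compiled once and reused; Pattern instances are safe to share across threads.
    private static final Pattern SCRIPT = Pattern.compile(
            "<\\s*script[^>]*>[\\s\\S]*?<\\s*/\\s*script\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE = Pattern.compile(
            "<\\s*style[^>]*>[\\s\\S]*?<\\s*/\\s*style\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    private HtmlTextSketch() {}

    /** Removes script elements, style elements, then all remaining tags. */
    static String html2Text(String html) {
        if (html == null) {
            return "";
        }
        String text = SCRIPT.matcher(html).replaceAll("");
        text = STYLE.matcher(text).replaceAll("");
        return TAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p><script>var a = 1 < 2;</script>";
        // prints: Hello world
        System.out.println(html2Text(html));
    }
}
```

The regex approach only removes markup. Entities in the text, such as &amp; or &nbsp;, pass through unchanged, and &nbsp; is not one of XML's predefined entities, so it may still need separate handling before the text goes into an XML file.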

Method 2: use a library API (org.htmlparser). First import the jars htmllexer.jar, htmlparser.jar, sitecapturer.jar, thumbelina.jar, and filterbuilder.jar.

import org.htmlparser.Parser;
import org.htmlparser.visitors.TextExtractingVisitor;

public static String getHtmlText(String htmlContent) throws Exception {
    if (htmlContent == null) {
        htmlContent = "";
    }
    // Prepend a <br />: in testing, when the body is plain text,
    // org.htmlparser treats the argument as a file.
    StringBuilder sb = new StringBuilder();
    sb.append("<br />").append(htmlContent);
    Parser parser = new Parser(sb.toString());
    TextExtractingVisitor visitor = new TextExtractingVisitor();
    parser.visitAllNodesWith(visitor);
    String text = visitor.getExtractedText();
    text = text.replace(" ", ""); // remove spaces so the characters can be counted
    return text;
}
