Counting the illegal XML characters in an XML file

2012-10-16


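The characters that must be filtered out of XML fall into two classes. The first class consists of characters that are not allowed in XML at all, because they lie outside the range that XML defines. The second class consists of characters that XML itself uses; if they occur in content, they must be replaced with something else.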

The first class of characters

For the first class, the W3C XML specification tells us which characters may not appear in an XML document. The character range XML allows is "#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]", so characters outside this range can be filtered out. Among the control characters, the ranges to filter are:

\x00-\x08, \x0b-\x0c, \x0e-\x1f

In .NET, the Replace method of Regex can replace the characters of a string that fall in these three ranges, for example:

string content = "as fas fasfadfasdfasdf<234234546456";
content = Regex.Replace(content, "[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "*");
Response.Write(content);

In PB8 (PowerBuilder 8), characters in this range can be filtered as follows. Note that the list also contains the double quote and the backtick in addition to the control characters:

string content = "as fas fasfadfasdfasdf<234234546456";
int i_count_eliminate = 30
char i_spechar_eliminate[] = { "~001", "~002", &
    "~003", "~004", "~005", "~006", "~007", &
    "~008", "~011", "~012", "~014", "~015", &
    "~016", "~017", "~018", "~019", "~020", &
    "~021", "~022", "~023", "~024", "~025", &
    "~026", "~027", "~028", "~029", "~030", &
    "~031", '"', "`" } // characters to eliminate; each is replaced with the empty string
for vi = 1 to i_count_eliminate
    vpos = 1
    vlen = lenw(i_spechar_eliminate[vi])
    do while true
        vpos = posw(content, i_spechar_eliminate[vi], vpos)
        if vpos < 1 then exit
        content = replacew(content, vpos, vlen, "")
    loop
next
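The same control-character filter can be sketched in Java. This is only a minimal illustration, not part of the original tool: the class name ControlCharFilter and the method replaceIllegalControls are made up for this example, and it covers only the three control-character ranges listed above, not surrogates, FFFE, or FFFF.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
final class ControlCharFilter {
    // Control characters outside the XML character range:
    // 0x00-0x08, 0x0B-0x0C, 0x0E-0x1F (tab, LF and CR stay legal).
    private static final Pattern ILLEGAL_CONTROLS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    // Replace every illegal control character with the given replacement text.
    static String replaceIllegalControls(String text, String replacement) {
        return ILLEGAL_CONTROLS.matcher(text)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String content = "as fas\u0001fasfadfasdfasdf<234234546456";
        System.out.println(replaceIllegalControls(content, "*"));
    }
}
```

Passing an empty string as the replacement removes the characters instead of marking them, which matches what the PB8 loop does.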

The second class of characters

There are five characters in the second class:

Character               HTML entity    Character code
Ampersand (and)   &     &amp;          &#38;
Single quote      '     &apos;         &#39;
Double quote      "     &quot;         &#34;
Greater than      >     &gt;           &#62;
Less than         <     &lt;           &#60;

We only need to replace each of these five characters with its corresponding entity.
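Replacing the five characters can be sketched in Java as below. The class name XmlEscaper and the method escape are assumptions made for this example. The text is walked one character at a time so that the "&" of an entity that has just been written is never escaped a second time.

```java
// Hypothetical helper, named for illustration only.
final class XmlEscaper {
    // Replace the five characters XML reserves for itself with their entities.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("as fas fasfadfasdfasdf<234234546456"));
    }
}
```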


The character range XML supports

Character Range
[2]    Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

In other words, XML supports any Unicode character except the surrogate blocks, FFFE, and FFFF.

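The ranges 0xD800 to 0xDBFF (high surrogates) and 0xDC00 to 0xDFFF (low surrogates) are called the surrogate blocks.

Surrogate blocks exist in order to represent supplementary characters. Supplementary characters are the characters in the range [#x10000-#x10FFFF].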

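Supplementary characters are characters that the original 16-bit Unicode cannot represent. Unicode was first designed as a fixed-width 16-bit character encoding, but the 65,536 values of a 16-bit encoding cannot cover every character that is, or once was, in use around the world. The Unicode standard was therefore extended to hold up to 1,112,064 characters, and the characters added by this extension are the supplementary characters.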
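To make the surrogate arithmetic concrete, here is a small Java sketch. The class name SurrogateDemo and the method combine are made up for illustration; Character.toChars, Character.toCodePoint and Character.isSupplementaryCodePoint are standard JDK methods. It splits the supplementary code point 0x1F600 into its two code units and puts them back together, both by hand and through the JDK.

```java
// Hypothetical demo class, named for illustration only.
final class SurrogateDemo {
    // Combine a high and a low surrogate into one supplementary code point.
    // The caller is assumed to pass a valid pair (0xD800-0xDBFF, 0xDC00-0xDFFF).
    static int combine(char high, char low) {
        return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // a character beyond the 16-bit range
        char[] units = Character.toChars(codePoint); // encoded as two code units

        System.out.println(Character.isSupplementaryCodePoint(codePoint));
        System.out.printf("high=%04X low=%04X%n", (int) units[0], (int) units[1]);

        // The manual formula and the JDK helper should agree.
        int manual = combine(units[0], units[1]);
        int viaJdk = Character.toCodePoint(units[0], units[1]);
        System.out.printf("manual=%X jdk=%X%n", manual, viaJdk);
    }
}
```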

XMLCheck counts the illegal XML characters contained in an XML file.

Usage: XMLCheck filename

import java.io.*;

public class XMLCheck {

    /**
     * @author lxn
     */
    public static void main(String[] args) throws IOException {

        if (args.length == 0) {
            System.out.println("Usage: XMLCheck filename");
            return;
        }

        File xmlFile = new File(args[0]);
        if (!xmlFile.exists()) {
            System.out.println("File does not exist");
            return;
        }

        // Read the XML file into a String.
        // Note: FileReader decodes the file with the platform default charset.
        StringBuilder xmlSb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new FileReader(xmlFile))) {
            String s;
            while ((s = in.readLine()) != null) {
                xmlSb.append(s).append('\n');
            }
        }
        String xmlString = xmlSb.toString();

        // Example without illegal characters:
        //int i = checkCharacterData("<?xml version=\"1.0\" encoding=\"gbk\"?><CC>卡号</CC>");
        // Example with an illegal character:
        //int i = checkCharacterData("<?xml version=\"1.0\" encoding=\"gbk\"?><CC>\u001E卡号</CC>");

        int errorChar = checkCharacterData(xmlString);
        System.out.println("This XML file contains " + errorChar + " illegal character(s).");
    }

    // Count the characters in the string that are illegal in XML
    public static int checkCharacterData(String text) {
        int errorChar = 0;
        if (text == null) {
            return errorChar;
        }
        char[] data = text.toCharArray();
        for (int i = 0, len = data.length; i < len; i++) {
            char c = data[i];
            int result = c;
            // First check whether the character is in the surrogate blocks.
            // A supplementary character is encoded as two code units:
            // the first comes from the high-surrogate range (0xD800 to 0xDBFF),
            // the second from the low-surrogate range (0xDC00 to 0xDFFF).
            if (result >= 0xD800 && result <= 0xDBFF && i + 1 < len) {
                int high = c;
                int low = data[i + 1];
                if (low >= 0xDC00 && low <= 0xDFFF) {
                    // Decode the surrogate pair with the algorithm defined by Unicode.
                    // The result lies between 0x10000 and 0x10FFFF,
                    // which isXMLCharacter accepts.
                    result = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
                    i++;
                }
            }
            // An unpaired surrogate keeps its own value and is counted as illegal.
            if (!isXMLCharacter(result)) {
                errorChar++;
            }
        }
        return errorChar;
    }
    private static boolean isXMLCharacter(int c) {
        // Detect characters XML does not support, following the Character Range in the XML specification
        if (c <= 0xD7FF) {
            if (c >= 0x20) return true;
            else {
                if (c == '\n') return true;
                if (c == '\r') return true;
                if (c == '\t') return true;
                return false;
            }
        }
        if (c < 0xE000) return false;
        if (c <= 0xFFFD) return true;
        if (c < 0x10000) return false;
        if (c <= 0x10FFFF) return true;
        return false;
    }

}
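As an alternative, the same count can be sketched with String.codePoints(), assuming Java 8 or later. This is not the author's implementation: the class name XmlCharCounter and its method names are made up for this example. codePoints() joins valid surrogate pairs into one code point and passes an unpaired surrogate through with its own value, so a lone surrogate fails the range test and is counted as illegal.

```java
// Hypothetical alternative, named for illustration only; assumes Java 8+.
final class XmlCharCounter {
    // The Char production of the XML specification, written as one expression.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Count the code points of the text that XML does not allow.
    static long countIllegal(String text) {
        if (text == null) {
            return 0;
        }
        return text.codePoints().filter(cp -> !isXmlChar(cp)).count();
    }

    public static void main(String[] args) {
        String sample = "<CC>\u001Eabc</CC>";
        System.out.println("Illegal characters: " + countIllegal(sample));
    }
}
```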
