Detecting the character set of a text file

2009-02-17

Java tips

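Lately I keep having to handle files that may be in different character sets, and it has been a nuisance. I did find a library, apparently a Java port of a C library from Mozilla, but it seems rather heavyweight for ordinary cases. So I wrote a small routine that takes GBK as the default encoding and mainly tries to recognize UTF-8. On Windows, if a TXT file saved with Notepad carries a BOM, detection is of course easy; without a BOM, the only option is to judge from the byte patterns of the characters in the file. After a few simple experiments it appears to be usable. The Java source follows; corrections are welcome: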
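import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

    // Guess the charset of a text file: check for a BOM first, then scan for
    // UTF-8 byte patterns. GBK is the default when nothing else matches.
    static String get_charset(File file) {
        String charset = "GBK";
        byte[] first3Bytes = new byte[3];
        // try-with-resources closes the stream on every path, including the early return
        try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file))) {
            boolean checked = false;
            bis.mark(3); // read limit must cover the 3 bytes read before reset()
            int read = bis.read(first3Bytes, 0, 3);
            if (read == -1) return charset; // empty file
            if (first3Bytes[0] == (byte) 0xFF && first3Bytes[1] == (byte) 0xFE) {
                charset = "UTF-16LE";
                checked = true;
            } else if (first3Bytes[0] == (byte) 0xFE && first3Bytes[1] == (byte) 0xFF) {
                charset = "UTF-16BE";
                checked = true;
            } else if (first3Bytes[0] == (byte) 0xEF && first3Bytes[1] == (byte) 0xBB
                    && first3Bytes[2] == (byte) 0xBF) {
                charset = "UTF-8";
                checked = true;
            }
            bis.reset();
            if (!checked) {
                // No BOM: scan from the start of the file for UTF-8 byte patterns
                int loc = 0;
                while ((read = bis.read()) != -1) {
                    loc++;
                    if (read >= 0xF0)
                        break;
                    if (0x80 <= read && read <= 0xBF) // a lone byte in 0x80-0xBF is also treated as GBK
                        break;
                    if (0xC0 <= read && read <= 0xDF) {
                        read = bis.read();
                        if (0x80 <= read && read <= 0xBF) // two-byte pattern (0xC0-0xDF)(0x80-0xBF); may also be GB-encoded, keep scanning
                            continue;
                        else
                            break;
                    } else if (0xE0 <= read && read <= 0xEF) { // can still be wrong, but the chance is small
                        read = bis.read();
                        if (0x80 <= read && read <= 0xBF) {
                            read = bis.read();
                            if (0x80 <= read && read <= 0xBF) {
                                charset = "UTF-8"; // first complete three-byte sequence decides
                                break;
                            } else
                                break;
                        } else
                            break;
                    }
                }
                // Debug output: bytes examined and the last byte read (in hex)
                System.out.println(loc + " " + Integer.toHexString(read));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return charset;
    }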
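
The routine above settles on UTF-8 as soon as it sees one well-formed three-byte sequence, which, as its own comment admits, can occasionally be wrong. A stricter variant for the no-BOM case, sketched here as an alternative and not as part of the original routine, is to let the JDK's `CharsetDecoder` validate the whole content and fall back to GBK on the first malformed byte. The names `StrictUtf8Check`, `isValidUtf8` and `guessCharset` are made up for this illustration, and the sketch assumes the file is small enough to read into memory.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, for illustration only.
final class StrictUtf8Check {

    // Returns true only if every byte in data is part of well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data)); // throws on the first bad sequence
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Assumption: same GBK default as get_charset; BOM handling is not repeated here.
    static String guessCharset(byte[] data) {
        return isValidUtf8(data) ? "UTF-8" : "GBK";
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            byte[] data = Files.readAllBytes(Paths.get(args[0]));
            System.out.println(guessCharset(data));
        }
    }
}
```

One difference from `get_charset` is worth noting: a pure-ASCII file is valid UTF-8, so this sketch reports "UTF-8" where the original reports "GBK". Both charsets decode ASCII bytes identically, so the result is the same text either way. Files starting with a UTF-16 BOM still need the BOM check to run first.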
