How to Detect and Read Text Files with Different Encodings in Java
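As most people know, a txt file is commonly saved in one of four encodings: "GBK", "UTF-8", "Unicode" (UTF-16 little-endian), or "UTF-16BE". The Unicode-based encodings can be told apart by the byte order mark (BOM) written at the very start of the file, whereas a GBK file carries no such marker. To avoid garbled text, read the file header before reading the content and pick the decoding charset accordingly. Note that a UTF-8 file saved without a BOM cannot be recognized this way and is treated as GBK. The method is given below.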
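import java.io.BufferedInputStream;
import java.io.FileInputStream;

/**
 * Determines a file's encoding from its first two bytes.
 * @param fileName path of the file to inspect
 * @return name of the detected encoding
 * @throws Exception if the file cannot be opened or read
 */
public static String codeString(String fileName) throws Exception {
    int p;
    // try-with-resources closes the stream even if read() throws
    try (BufferedInputStream bin = new BufferedInputStream(
            new FileInputStream(fileName))) {
        // Combine the first two bytes into a single 16-bit value
        p = (bin.read() << 8) + bin.read();
    }
    String code;
    switch (p) {
        case 0xefbb: // first two bytes of the UTF-8 BOM (EF BB BF)
            code = "UTF-8";
            break;
        case 0xfffe: // UTF-16 little-endian BOM
            code = "Unicode";
            break;
        case 0xfeff: // UTF-16 big-endian BOM
            code = "UTF-16BE";
            break;
        default: // no BOM recognized: assume GBK
            code = "GBK";
    }
    return code;
}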
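The two-byte check above accepts any file starting with EF BB as UTF-8, even when the third BOM byte is missing. A slightly stricter variant can be sketched as follows. The class name `BomSniffer`, the method `detect`, and the caller-supplied `fallback` parameter are hypothetical names introduced for illustration and are not part of the original method; this is one possible approach under those assumptions, not a complete encoding detector.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    /**
     * Guesses a charset from the leading bytes of a file.
     * Returns the caller-supplied fallback (for example GBK) when no BOM matches.
     */
    public static Charset detect(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        return fallback;
    }

    // Compares the first bytes of data, read as unsigned values, with the prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A caller would read the first three bytes of the file and pass them in together with `Charset.forName("GBK")` as the fallback, which mirrors the default branch of the method above.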
Then read the text as a character stream:
// code is the encoding name returned by the method above
StringBuilder sBuffer = new StringBuilder();
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), code))) {
    String strTmp;
    // Read line by line
    while ((strTmp = in.readLine()) != null) {
        sBuffer.append(strTmp).append("\n");
    }
}
return sBuffer.toString();
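One detail worth keeping in mind: Java's UTF-8, UTF-16LE and UTF-16BE decoders do not drop the BOM, so the decoded text can begin with the character U+FEFF. The sketch below shows one way to read the whole file and discard that leading character. The class `TextFileReader` and the method `readAll` are hypothetical names chosen for illustration, and the code assumes the charset has already been determined.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class TextFileReader {
    /** Reads the file line by line and removes a leading BOM character if present. */
    public static String readAll(Path file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            String line;
            boolean first = true;
            while ((line = in.readLine()) != null) {
                if (first) {
                    // A decoded BOM shows up as U+FEFF at the start of the first line
                    if (line.startsWith("\uFEFF")) {
                        line = line.substring(1);
                    }
                    first = false;
                }
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Like the loop above, this version ends every line with "\n", including the last one, regardless of the line terminators used in the file.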