Tomcat's default encoding settings and why garbled text (mojibake) occurs
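Note: whether text comes out garbled depends on the concrete request implementation class. What has been established so far is that before RequestDispatcher.forward is called the request is an org.apache.catalina.connector.RequestFacade, while after RequestDispatcher.forward it is an org.apache.catalina.core.ApplicationHttpRequest. When these two classes parse parameters, they use different logic to pick the default charset for decoding, and the factors that affect garbling also differ with the protocol in use.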
For details, see: Tomcat source analysis: the internals of ServletRequest.getParameterValues, the request character set, and QueryStringEncoding
How garbled text arises
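Take the Chinese character “中”. Encoded in UTF-8 it becomes three bytes, written in percent-encoded form as %E4%B8%AD, and these three bytes are submitted to the Tomcat container by GET or POST. If you do not tell Tomcat that the parameters are UTF-8 encoded, Tomcat assumes they are ISO-8859-1. ISO-8859-1 is a single-byte encoding that is compatible with ASCII (and therefore with US-ASCII, the standard character set of URIs) and assigns a character to every value a single byte can take. Tomcat therefore treats the input as three characters encoded in ISO-8859-1 and decodes them as such. The result is a three-character string: ä, ¸ and the invisible soft hyphen U+00AD, which is often displayed as something like ??-.
After decoding, this string exists in the JVM as Unicode characters, whereas what HTTP transmits and what a database stores are bytes. Depending on what each endpoint needs, you can either encode the garbled string in UTF-8 and store the resulting bytes in the database (still three characters, but now six bytes), or take the three ISO-8859-1 bytes that correspond to those three characters and decode them as UTF-8 to recover the Unicode character “中”. The second option works because of a property of ISO-8859-1: a byte stream in any other encoding can be treated as ISO-8859-1 without losing any bytes. The recovered string can then be sent to the client through the response; the bytes actually sent depend on the charset in the Content-Type you set.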
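The recovery step described above can be sketched as follows. This is a minimal illustration, not Tomcat code: the class name MojibakeRepair and its two helper methods are made up for this example, and it assumes the text was UTF-8 on the wire and was decoded as ISO-8859-1 exactly once.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, introduced only for illustration.
final class MojibakeRepair {

    /** Decodes raw bytes the way a container defaulting to ISO-8859-1 would. */
    static String decodeAsLatin1(byte[] wire) {
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    /**
     * Recovers the intended text: map each character back to its single
     * ISO-8859-1 byte, then decode those bytes as UTF-8.
     */
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of U+4E2D, the character used in the text above.
        byte[] wire = { (byte) 0xE4, (byte) 0xB8, (byte) 0xAD };

        String garbled = decodeAsLatin1(wire);
        // Three characters: U+00E4, U+00B8, U+00AD.
        System.out.println(garbled.length());
        // Storing the garbled string as UTF-8 costs two bytes per character.
        System.out.println(garbled.getBytes(StandardCharsets.UTF_8).length);
        // The repaired string is the original character again.
        System.out.println(repair(garbled).equals("\u4e2d"));
    }
}
```

The repair only works while the garbled string still holds all three characters unchanged; once it has been passed through a lossy step (for example encoded into a charset that lacks those characters), the original bytes can no longer be reconstructed.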
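Summary:
1. HTTP GET and POST transmit bytes, and a database also stores bytes (500 MB of space is 500 M bytes).
2. Garbled text arises when the charset used for decoding differs from the one used for encoding. The same bytes can map to different characters under different encodings, and some bytes or characters do not exist at all under a given encoding (this is where the ? in garbled text comes from).
3. After decoding, a string exists in the JVM as Unicode.
4. If the Unicode characters in the JVM are the ones you expected (the encoding and decoding charsets are identical or compatible), there is no problem. If they are not, as in the example above where the JVM holds three Unicode characters, you can take the three ISO-8859-1 bytes that correspond to those characters and decode them as UTF-8 to produce the intended Unicode character: “中”.
5. ISO-8859-1 is a single-byte encoding that is compatible with ASCII and assigns a character to every single-byte value, so a system that supports ISO-8859-1 can transmit and store a byte stream in any other encoding without dropping any bytes. In other words, a byte stream in any other encoding can safely be treated as ISO-8859-1.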
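Points 2 and 5 can be checked directly. The sketch below is an illustration only; the class RoundTripCheck and its methods are hypothetical names introduced here. It decodes all 256 possible byte values with a charset, encodes the result again, and compares the bytes, and it also shows the lossy case where a character missing from ISO-8859-1 is replaced by ? (0x3F).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, introduced only for illustration.
final class RoundTripCheck {

    /** True if decoding the bytes with cs and encoding again returns the same bytes. */
    static boolean survivesRoundTrip(byte[] data, Charset cs) {
        byte[] back = new String(data, cs).getBytes(cs);
        return Arrays.equals(data, back);
    }

    /** Every value a single byte can take, 0x00 through 0xFF. */
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();

        // ISO-8859-1 maps every byte to a character, so nothing is lost.
        System.out.println(survivesRoundTrip(all, StandardCharsets.ISO_8859_1)); // true

        // Many of these bytes are not valid UTF-8 on their own; they decode to
        // U+FFFD and the original values cannot be recovered.
        System.out.println(survivesRoundTrip(all, StandardCharsets.UTF_8)); // false

        // Encoding a character that ISO-8859-1 lacks yields the byte for '?'.
        byte[] lossy = "\u4e2d".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.toString(lossy)); // [63]
    }
}
```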
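The code below shows that encoding with different charsets gives different results, and that the text comes out garbled whenever the encoder and decoder charsets disagree or the character does not exist in ISO-8859-1.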
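import java.io.UnsupportedEncodingException;
import java.math.BigInteger;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        try {
            // URL-encoding “中” with UTF-8 gives %E4%B8%AD. Read as ISO-8859-1,
            // those three bytes are ä, ¸ and U+00AD (soft hyphen, usually invisible).
            byte[] utf8Bytes = new byte[] { (byte) 0xe4, (byte) 0xb8, (byte) 0xad };

            // Decoded as UTF-8: 中
            String item = new String(utf8Bytes, "UTF-8");
            System.out.println(item);

            // Decoded as ISO-8859-1: ä¸ followed by a soft hyphen
            item = new String(utf8Bytes, "ISO-8859-1");
            System.out.println(item);

            // 253 (0xFD) as two's-complement bytes: [0, -3]; in binary: 11111101
            System.out.println(Arrays.toString(new BigInteger("253").toByteArray()));
            System.out.println(Integer.toBinaryString(253));

            // Recover 中: take the ISO-8859-1 bytes and decode them as UTF-8
            item = new String(item.getBytes("ISO_8859_1"), "UTF-8");
            System.out.println(item);

            // Garble it again: take the UTF-8 bytes and decode them as ISO-8859-1
            item = new String(item.getBytes("UTF-8"), "ISO_8859_1");
            System.out.println(item);

            // 中 encoded in UTF-8 is %E4%B8%AD (3 bytes)
            System.out.println(URLEncoder.encode("中", "UTF-8"));
            // 中 encoded in ISO-8859-1 is %3F (1 byte): the character does not exist
            // in ISO-8859-1, so the encoding of ? is returned instead
            System.out.println(URLEncoder.encode("中", "ISO-8859-1"));
            // 中 encoded in GB2312 is %D6%D0 (2 bytes)
            System.out.println(URLEncoder.encode("中", "GB2312"));

            // Decoding the UTF-8 form %E4%B8%AD as UTF-8 gives the correct character 中
            System.out.println(URLDecoder.decode("%E4%B8%AD", "UTF-8"));
            // Decoding the ISO-8859-1 form %3F as ISO-8859-1 gives ?
            System.out.println(URLDecoder.decode("%3F", "ISO-8859-1"));
            // Decoding the GB2312 form %D6%D0 as GB2312 gives the correct character 中
            System.out.println(URLDecoder.decode("%D6%D0", "GB2312"));

            // Decoding the UTF-8 form %E4%B8%AD as ISO-8859-1 gives the garbled
            // string: each of the three bytes becomes its own ISO-8859-1 character,
            // because ISO-8859-1 assigns a character to every single-byte value
            System.out.println(URLDecoder.decode("%E4%B8%AD", "ISO-8859-1"));

            // Decoding the UTF-8 form %E4%B8%AD as GB2312 gives 涓 plus a
            // replacement character (often shown as ?): the first two bytes
            // %E4%B8 are 涓 in GB2312, and the lone third byte %AD is not a
            // valid GB2312 sequence
            System.out.println(URLDecoder.decode("%E4%B8%AD", "GB2312"));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}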
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package filters;

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

/**
 * <p>Example filter that sets the character encoding to be used in parsing the
 * incoming request, either unconditionally or only if the client did not
 * specify a character encoding. Configuration of this filter is based on
 * the following initialization parameters:</p>
 * <ul>
 * <li><strong>encoding</strong> - The character encoding to be configured
 *     for this request, either conditionally or unconditionally based on
 *     the <code>ignore</code> initialization parameter. This parameter
 *     is required, so there is no default.</li>
 * <li><strong>ignore</strong> - If set to "true", any character encoding
 *     specified by the client is ignored, and the value returned by the
 *     <code>selectEncoding()</code> method is set. If set to "false",
 *     <code>selectEncoding()</code> is called <strong>only</strong> if the
 *     client has not already specified an encoding. By default, this
 *     parameter is set to "true".</li>
 * </ul>
 *
 * <p>Although this filter can be used unchanged, it is also easy to
 * subclass it and make the <code>selectEncoding()</code> method more
 * intelligent about what encoding to choose, based on characteristics of
 * the incoming request (such as the values of the <code>Accept-Language</code>
 * and <code>User-Agent</code> headers, or a value stashed in the current
 * user's session).</p>
 *
 * @author Craig McClanahan
 * @version $Id: SetCharacterEncodingFilter.java 939521 2010-04-30 00:16:33Z kkolinko $
 */
public class SetCharacterEncodingFilter implements Filter {

    // ----------------------------------------------------- Instance Variables

    /**
     * The default character encoding to set for requests that pass through
     * this filter.
     */
    protected String encoding = null;

    /**
     * The filter configuration object we are associated with. If this value
     * is null, this filter instance is not currently configured.
     */
    protected FilterConfig filterConfig = null;

    /**
     * Should a character encoding specified by the client be ignored?
     */
    protected boolean ignore = true;

    // --------------------------------------------------------- Public Methods

    /**
     * Take this filter out of service.
     */
    public void destroy() {
        this.encoding = null;
        this.filterConfig = null;
    }

    /**
     * Select and set (if specified) the character encoding to be used to
     * interpret request parameters for this request.
     *
     * @param request The servlet request we are processing
     * @param response The servlet response we are creating
     * @param chain The filter chain we are processing
     *
     * @exception IOException if an input/output error occurs
     * @exception ServletException if a servlet error occurs
     */
    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain)
            throws IOException, ServletException {

        // Conditionally select and set the character encoding to be used
        if (ignore || (request.getCharacterEncoding() == null)) {
            String encoding = selectEncoding(request);
            if (encoding != null)
                request.setCharacterEncoding(encoding);
        }

        // Pass control on to the next filter
        chain.doFilter(request, response);
    }

    /**
     * Place this filter into service.
     *
     * @param filterConfig The filter configuration object
     */
    public void init(FilterConfig filterConfig) throws ServletException {
        this.filterConfig = filterConfig;
        this.encoding = filterConfig.getInitParameter("encoding");
        String value = filterConfig.getInitParameter("ignore");
        if (value == null)
            this.ignore = true;
        else if (value.equalsIgnoreCase("true"))
            this.ignore = true;
        else if (value.equalsIgnoreCase("yes"))
            this.ignore = true;
        else
            this.ignore = false;
    }

    // ------------------------------------------------------ Protected Methods

    /**
     * Select an appropriate character encoding to be used, based on the
     * characteristics of the current request and/or filter initialization
     * parameters. If no character encoding should be set, return
     * <code>null</code>.
     * <p>
     * The default implementation unconditionally returns the value configured
     * by the <strong>encoding</strong> initialization parameter for this
     * filter.
     *
     * @param request The servlet request we are processing
     */
    protected String selectEncoding(ServletRequest request) {
        return (this.encoding);
    }
}
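A filter like the one above only takes effect once it is mapped in the web application's deployment descriptor. The fragment below is a sketch of such a mapping in web.xml, assuming the class is available to the webapp under the package filters as declared in its source. The filter name and the /* URL pattern are illustrative choices, not taken from the original text; the encoding and ignore parameters are the ones documented in the class's Javadoc.

```xml
<!-- Illustrative web.xml fragment; the filter-name and url-pattern are assumptions. -->
<filter>
    <filter-name>setCharacterEncodingFilter</filter-name>
    <filter-class>filters.SetCharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF-8</param-value>
    </init-param>
    <init-param>
        <param-name>ignore</param-name>
        <param-value>true</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>setCharacterEncodingFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
```

ServletRequest.setCharacterEncoding governs how the request body is decoded and has to be called before the parameters are first read, which is why the filter should run ahead of anything that touches them. How the query string of a GET request is decoded is configured separately on the Tomcat connector, through its URIEncoding attribute.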