How to parse very large XML files
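Recently at work I had to parse a very large XML file (over 1 GB), and during processing I also ran into an exception where the file could not be parsed. The parsing code, using dom4j's event-based SAXReader so that the whole document never has to sit in memory, looked like this:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.InputStreamReader;
    import org.dom4j.Document;
    import org.dom4j.DocumentException;
    import org.dom4j.Element;
    import org.dom4j.ElementHandler;
    import org.dom4j.ElementPath;
    import org.dom4j.io.SAXReader;
    import org.xml.sax.ErrorHandler;
    import org.xml.sax.SAXParseException;

    try {
        SAXReader saxReader = new SAXReader();
        saxReader.addHandler("/list/XXXX", new ElementHandler() {
            public void onStart(ElementPath path) {
                // do nothing here...
            }

            public void onEnd(ElementPath path) {
                // process a ROW element
                Element row = path.getCurrent();
                Document document = row.getDocument();
                System.out.println(document.asXML());
                // detach the processed element so memory use stays bounded
                row.detach();
            }
        });

        final File file = new File(getFileName(language, isProduct));
        saxReader.setErrorHandler(new ErrorHandler() {
            public void error(SAXParseException e) {
                System.out.println("file:" + file.getName() + " ERROR: " + e);
            }

            public void fatalError(SAXParseException e) {
                System.out.println("file:" + file.getName() + " FATAL: " + e);
            }

            public void warning(SAXParseException e) {
                System.out.println("file:" + file.getName() + " WARNING: " + e);
            }
        });

        // Note: no charset is given, so the platform default encoding is used.
        InputStreamReader source = new InputStreamReader(new FileInputStream(file));
        saxReader.read(source);
    } catch (DocumentException e) {
        logger.error("error", e);
        return;
    } catch (FileNotFoundException e) {
        logger.error(" error", e);
        return;
    }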
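If the XML file contains invisible, invalid characters, dom4j throws an exception while parsing it (An invalid XML character (Unicode: 0x19), etc.). Ideally an XML tool guarantees that the file is clean when it is produced; if such characters do show up in the file, they can be filtered out with the method below.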
    public static String stripNonValidXMLCharacters(String in) {
        if (in == null || in.isEmpty()) {
            return ""; // vacancy test.
        }
        StringBuilder out = new StringBuilder(in.length()); // Used to hold the output.
        // Walk by code point rather than by char: a char never exceeds 0xFFFF,
        // so in a char-based loop the 0x10000-0x10FFFF test can never match and
        // valid supplementary characters (stored as surrogate pairs) are dropped.
        for (int i = 0; i < in.length(); ) {
            int current = in.codePointAt(i); // Used to reference the current character.
            i += Character.charCount(current);
            if (current == 0x9 || current == 0xA || current == 0xD
                    || (current >= 0x20 && current <= 0xD7FF)
                    || (current >= 0xE000 && current <= 0xFFFD)
                    || (current >= 0x10000 && current <= 0x10FFFF)) {
                out.appendCodePoint(current);
            }
        }
        return out.toString();
    }
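Because the file in question is over 1 GB, reading it into a String just to call the method above may not be practical. One possible alternative, sketched here, applies the same character ranges while the file is being streamed. `XmlCharFilterReader` is a hypothetical name that does not appear in the original code; it relies only on `java.io.FilterReader`. Unlike the method above, this sketch passes every surrogate code unit through without checking that surrogates are correctly paired.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical helper: drops characters XML 1.0 does not allow, while streaming. */
class XmlCharFilterReader extends FilterReader {

    XmlCharFilterReader(Reader in) {
        super(in);
    }

    // True for UTF-16 code units that may appear in an XML 1.0 document.
    // Surrogates are let through so supplementary characters survive;
    // unpaired surrogates are NOT detected by this sketch.
    private static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || Character.isSurrogate(c)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip invalid characters until a valid one or end of stream is found.
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int kept = 0;
        // Read again if a whole chunk was filtered out, so 0 is never
        // returned for a non-empty request before end of stream.
        while (kept == 0) {
            int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the valid characters to the front of the chunk in place.
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[off + kept++] = buf[i];
                }
            }
        }
        return kept;
    }
}
```

Assuming the dom4j setup shown earlier, the reader could be wrapped before it is handed to the parser, for example `saxReader.read(new XmlCharFilterReader(source));`, so that invalid characters are removed before the SAX parser sees them.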