Reading a PDF in R and Mining Its Text

2012-10-14



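The following example downloads a PDF, converts it to text with pdftotext, reads the text into R, and builds a simple word cloud from it: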

# here is a pdf for mining
url <- "http://www.noisyroom.net/blog/RomneySpeech072912.pdf"
dest <- tempfile(fileext = ".pdf")
download.file(url, dest, mode = "wb")

# set path to pdftotext.exe and convert pdf to text
exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe"
# wait for the conversion to finish so that the text file exists before it is opened
system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = TRUE)

# get txt-file name and open it (shell.exec is Windows-only)
filetxt <- sub("\\.pdf$", ".txt", dest)
shell.exec(filetxt)

# do something with it, e.g. a simple word cloud
library(tm)
library(wordcloud)
library(Rstem)

txt <- readLines(filetxt) # don't mind warning..
txt <- tolower(txt)
txt <- removeWords(txt, c("\\f", stopwords()))
corpus <- Corpus(VectorSource(txt))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))

# stem words
d$stem <- wordStem(row.names(d), language = "english")

# and put words to column, otherwise they would be lost when aggregating
d$word <- row.names(d)

# remove web address (very long string):
d <- d[nchar(row.names(d)) < 20, ]

# aggregate frequency by word stem and keep first words..
agg_freq <- aggregate(freq ~ stem, data = d, sum)
agg_word <- aggregate(word ~ stem, data = d, function(x) x[1])
d <- cbind(freq = agg_freq[, 2], agg_word)

# sort by frequency
d <- d[order(d$freq, decreasing = TRUE), ]

# print wordcloud:
wordcloud(d$word, d$freq)

# remove files
file.remove(dir(tempdir(), full.names = TRUE))
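
The aggregation by word stem is the least obvious step of the script, so here is a minimal sketch of it in isolation, using base R only. The words, counts, and stems in the toy table are made up for illustration (in the script the stems come from wordStem), and merge() is used here instead of cbind() as one possible way to join the two results explicitly by stem rather than relying on both having the same row order.

```r
# Toy frequency table: words, counts, and stems are hypothetical,
# not taken from the speech used in the script.
d <- data.frame(
  word = c("jobs", "job", "america", "american", "americans"),
  freq = c(5, 3, 4, 2, 1),
  stem = c("job", "job", "america", "american", "american"),
  stringsAsFactors = FALSE
)

# Sort by frequency first, so the first word within each stem
# is the most frequent spelling of that stem.
d <- d[order(d$freq, decreasing = TRUE), ]

# Sum the frequencies per stem, and keep the first word per stem as its label.
agg_freq <- aggregate(freq ~ stem, data = d, FUN = sum)
agg_word <- aggregate(word ~ stem, data = d, FUN = function(x) x[1])

# Join the two results on the stem column and sort by total frequency.
result <- merge(agg_freq, agg_word, by = "stem")
result <- result[order(result$freq, decreasing = TRUE), ]
print(result)
```

In this toy table "jobs" and "job" collapse into one row labelled "jobs" with a frequency of 8, which is the effect the script relies on before drawing the word cloud.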
