[Deep Learning Study Notes] Deep learning for nlp without magic_Bengio_ppt_acl2012
Getting through more than 180 slides was really not easy. My running notes are as follows:
Five reasons to explore Deep Learning: 1. learning representations; 2. the need for distributed representations (the curse of dimensionality); 3. unsupervised feature and weight learning; 4. multi-level representations; 5. why now (RBMs, new training methods and so on have appeared)
1. the basics
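1.1 from logistic regression to neural nets
The angle taken here is interesting. Logistic regression is itself a neural network with a single neuron (a perceptron with a sigmoid output). A (three-layer) neural network is then several logistic regression models placed side by side, each producing its own output, with a softmax layer added on top to turn the whole thing into a classifier.
From Maxent Classifiers to Neural Networks: for two classes, the functional form of maximum entropy can be rewritten as a sigmoid, so maxent is likewise equivalent to a neural network with a single neuron. In practice, maxent can also be used as the softmax layer.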
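The view of a network as stacked logistic regressions can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the random weights and the names `forward`, `W1`, `W2` are assumptions, not taken from the slides. The last two lines check the two-class case, where a softmax over the scores `[z, 0]` reduces to the logistic function.

```python
import numpy as np

def logistic(z):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def forward(x, W1, b1, W2, b2):
    """Three-layer net: each row of W1 acts as one logistic-regression unit."""
    h = logistic(W1 @ x + b1)       # hidden layer: several logistic regressions side by side
    return softmax(W2 @ h + b2)     # softmax (maxent) classifier on top of the hidden units

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # hypothetical 4-dimensional input
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # 3 hidden units (arbitrary choice)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)   # 2 output classes (arbitrary choice)
probs = forward(x, W1, b1, W2, b2)
print(probs, probs.sum())

# Two-class maxent/softmax with scores [z, 0] is the same as a single sigmoid unit.
z = 1.7
two_class = softmax(np.array([z, 0.0]))[0]
print(two_class, logistic(z))
```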
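Training a neural network: (1) stochastic gradient descent (SGD); (2) conjugate gradient or L-BFGS
Why do neural networks need non-linearities? If every layer were linear, a multi-layer network would have exactly the representational power of a single-layer one, because a composition of linear maps is itself a linear map: W2(W1 x) = (W2 W1) x.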
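A quick numerical check of the collapse argument, as a sketch with arbitrary random matrices (the shapes are assumptions chosen only for illustration): two stacked linear layers give the same output as the single matrix `W2 @ W1`, while inserting a tanh between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 4))    # first "layer"
W2 = rng.normal(size=(3, 5))    # second "layer"
x = rng.normal(size=4)

two_linear_layers = W2 @ (W1 @ x)
one_linear_layer = (W2 @ W1) @ x            # the two layers folded into one matrix
collapses = np.allclose(two_linear_layers, one_linear_layer)

with_tanh = W2 @ np.tanh(W1 @ x)            # a non-linearity between the layers
still_linear = np.allclose(with_tanh, one_linear_layer)

print(collapses, still_linear)
```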
1.2 word representation
one-hot representation; distributional representation; class-based representation (hard classes from clustering, or soft classes from LDA); word embedding
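1.3 unsupervised word vector learning
feed-forward computation: how do we score a phrase s such as "cat chills on a mat"? Build a three-layer neural network: the input layer holds the words (each with its real-valued vector), then comes a hidden layer, and the output layer is a single node whose value scores the phrase (an unnormalized score rather than a true probability). For training, take an ngram window, build the network above and compute the score s of the ngram; then corrupt the current ngram to build a negative example and compute its score s_c with the same network. The objective is to minimize
J = max(0, 1 - s + s_c)
This margin-based ranking objective is the one used by Collobert and Weston (the SENNA word vectors). Google's word2vec is related in that it also contrasts observed text with sampled negatives, but its negative-sampling objective is a logistic loss, not this hinge loss. To optimize J, compute its gradient and update the network weights layer by layer with bp and gradient descent.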
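The scoring network and the ranking loss can be sketched as follows. Everything concrete here is an assumption made for illustration: the toy vocabulary, the vector size, the tanh hidden layer and the helper names `score` and `ranking_loss`. The weights are random, so the printed scores carry no meaning; only the shape of the computation matters.

```python
import numpy as np

def score(words, emb, W, b, u):
    """Score a window: concatenate its word vectors, apply one hidden layer, output a scalar."""
    x = np.concatenate([emb[w] for w in words])
    h = np.tanh(W @ x + b)          # hidden layer (tanh is an assumption)
    return float(u @ h)             # single scalar score

def ranking_loss(s, s_c):
    """J = max(0, 1 - s + s_c): zero once the true window beats the corrupted one by a margin of 1."""
    return max(0.0, 1.0 - s + s_c)

rng = np.random.default_rng(0)
dim, hidden = 4, 6
vocab = ["cat", "chills", "on", "a", "mat", "ohio"]            # toy vocabulary
emb = {w: rng.normal(scale=0.1, size=dim) for w in vocab}      # word vectors (trained in practice)
W = rng.normal(scale=0.1, size=(hidden, 5 * dim))
b = np.zeros(hidden)
u = rng.normal(scale=0.1, size=hidden)

window = ["cat", "chills", "on", "a", "mat"]
corrupt = ["cat", "chills", "ohio", "a", "mat"]                # center word replaced
s = score(window, emb, W, b, u)
s_c = score(corrupt, emb, W, b, u)
J = ranking_loss(s, s_c)
print(s, s_c, J)
```

In training, the gradient of J would flow back into `u`, `W`, `b` and also into the word vectors in `emb`, which is how the embeddings themselves get learned.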
1.4 backpropagation training
Covers the basic principles of bp.
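1.5 learning word level classifiers: pos and ner
The network structure is similar to the ngram network trained in 1.3, except that it “replaces the single scalar score with a Softmax/Maxent classifier”: the top layer is a softmax layer that serves as the classifier.
“The interesting twist in deep learning is that the input features are also learned”: unlike the conventional bp setting, the input vectors (the word embeddings) are learned as well.
Word embeddings also help share information across resources (lexicons): different information sources are fused at the level of the word.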
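1.6 sharing statistical strength
semi-supervised learning: first pretrain with unsupervised learning, then fine-tune with supervised learning. One reason pretraining can work: in principle we want the conditional probability p(c|x), while pretraining models p(x); a representation that captures the structure of p(x) is often useful for p(c|x) as well.
autoencoder: multi-level NN with output = input
pca = linear manifold = linear auto-encoder
An ordinary autoencoder corresponds to non-linear pca.
Note: the word "manifold" also means "to make copies"; the intuition here is that small changes along certain directions leave the object essentially the same as the original. Minimizing reconstruction error forces latent representation of “similar inputs” to stay on manifold.
autoencoder improvements: for discrete inputs, use cross entropy or log-likelihood as the criterion. Undercomplete, Sparsity, Denoising and Contractive variants keep the autoencoder from simply copying its input; the Sparsity variant forces the hidden-unit activations to stay near 0.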
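A minimal sketch of one autoencoder forward pass that combines three of the ideas above: an undercomplete hidden layer, denoising by masking part of the input, and a cross-entropy reconstruction criterion for binary inputs. The tied weights, the 25% masking rate, the sizes and the function names are assumptions for illustration, not details from the slides, and no training loop is shown.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruct(x, W, b_hid, b_out):
    """Encode then decode; the decoder reuses W transposed (tied weights, an assumption)."""
    h = logistic(W @ x + b_hid)          # latent representation
    return logistic(W.T @ h + b_out)     # reconstruction in (0, 1)

def cross_entropy(x, x_hat, eps=1e-12):
    """Reconstruction error for binary inputs."""
    return float(-np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps)))

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=8).astype(float)    # binary input vector
W = rng.normal(scale=0.1, size=(3, 8))          # undercomplete: 3 hidden units for 8 inputs
b_hid, b_out = np.zeros(3), np.zeros(8)

mask = rng.random(8) > 0.25                     # denoising: zero out roughly 25% of the inputs
x_noisy = x * mask
x_hat = reconstruct(x_noisy, W, b_hid, b_out)
loss = cross_entropy(x, x_hat)                  # the target is the clean input, not the noisy one
print(loss)
```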
2. recursive NN
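2.1 motivation
An RNN (here: recursive neural network) can learn the syntactic structure of a sentence, but only in the form of a binary tree.
2.2 RNN for parsing
See “learning meanings for sentences”.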
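2.3 theory: bp through structure
The treatment is brief, but the basic procedure is the same as ordinary bp. For each node in the parse tree, the node label can be computed by putting a softmax layer on top of the node's vector representation, which is then used for training and labeling. Experiments show that the method works fairly well on short sentences and worse on long ones.
Several applications are also covered: paraphrase detection and scene parsing (applying NLP-style parsing to images in order to analyze image structure).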
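The recursive computation over a binary tree can be sketched as below. The composition rule `parent = tanh(W [left; right] + b)`, the nested-tuple tree encoding, the toy sentence and all sizes are assumptions for illustration; the notes themselves only state that each node has a vector and that a softmax layer on top of it predicts the label.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def compose(tree, emb, W, b):
    """Return (vector of this node, list of all internal-node vectors), computed bottom-up.
    A tree is either a word (str) or a pair (left_subtree, right_subtree)."""
    if isinstance(tree, str):
        return emb[tree], []
    left, left_nodes = compose(tree[0], emb, W, b)
    right, right_nodes = compose(tree[1], emb, W, b)
    parent = np.tanh(W @ np.concatenate([left, right]) + b)   # same W at every node
    return parent, left_nodes + right_nodes + [parent]

rng = np.random.default_rng(0)
dim, n_labels = 4, 3
words = ["the", "cat", "sat", "on", "mat"]
emb = {w: rng.normal(scale=0.1, size=dim) for w in words}
W = rng.normal(scale=0.1, size=(dim, 2 * dim))
b = np.zeros(dim)
W_label = rng.normal(scale=0.1, size=(n_labels, dim))         # softmax layer for node labels

tree = (("the", "cat"), ("sat", ("on", ("the", "mat"))))      # binary parse of a toy sentence
root, internal = compose(tree, emb, W, b)
label_probs = [softmax(W_label @ v) for v in internal]        # one label distribution per node
print(len(internal), label_probs[-1])
```

Because the same `W` is applied at every node, the gradient for `W` in bp through structure is the sum of the contributions from all internal nodes of the tree.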
2.4 recursive auto-encoders
Similar to the RNN, except that the objective is no longer a supervised score but the reconstruction error.
Semi-supervised autoencoders add a cross-entropy term to the objective.
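2.5 applications to sentiment detection and paraphrase detection
sentiment detection: compared with bag-of-words methods, the approach here uses the automatically learned vectors (a classifier is then built on top of them to decide whether the sentiment is "positive" or "negative").
paraphrase detection: how to compare the meanings of two sentences?
recursive auto-encoder to full sentence paraphrase detection (Socher 2011): use the method of 2.3 to compute the parse tree of each of the two sentences and the vectors of the non-leaf nodes; together with the leaf nodes, compute the similarity between the nodes of the two trees to form a similarity matrix; on top of that matrix, an NN estimates how likely the pair is to be a paraphrase.
My own question: sentences differ in length, so the similarity matrices differ in size (in both dimensions). How matrices of different sizes are fed to the same NN to produce a similarity value is not explained in the slides; I will have to read Socher's original paper.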
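2.6 compositionality through recursive matrix-vector spaces
In the sections above, every internal node of the parse tree is represented by a vector; in the method of this section each node has a matrix in addition to the vector. The method is fairly complex and the presentation is brief.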
3. applications
3.1 applications
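3.1.1 neural language model
LM: Bengio 2003
ASR: Mikolov 2011; word2vec
output bottleneck: the output of an NNLM is normally a vector whose dimension depends on the vocabulary size. In the simplest case the output is a one-hot style layer over the whole vocabulary; alternatively the output is the vector of the word to be predicted in the ngram, but that vector must then be compared with every word in the vocabulary to decide which word was predicted. To address this, Mikolov borrows the idea of the class-based language model: the NNLM first outputs a word class, and p(word|context) is then recovered as p(class|context) * p(word|class, context).
SMT: also approached from the LM angle, replacing the ngram LM used in SMT with an NNLM.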
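The class-based factorization of the output layer can be sketched like this. The six-word vocabulary, the split into two classes and the names `word_prob`, `W_class`, `W_word` are hypothetical; in a real system the classes would come from word frequency or clustering and `h` from the network's hidden layer.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
hidden = 5
class_members = [[0, 1, 2], [3, 4, 5]]          # hypothetical: 6 word ids split into 2 classes
word2class = {w: c for c, ws in enumerate(class_members) for w in ws}
W_class = rng.normal(size=(len(class_members), hidden))   # scores for p(class | context)
W_word = rng.normal(size=(6, hidden))                     # scores for p(word | class, context)
h = rng.normal(size=hidden)                               # stand-in for the hidden state

def word_prob(word, h):
    """p(word | context) = p(class | context) * p(word | class, context)."""
    c = word2class[word]
    p_class = softmax(W_class @ h)[c]
    members = class_members[c]
    # The softmax runs only over the words of this class, not over the whole vocabulary.
    p_word_given_class = softmax(W_word[members] @ h)[members.index(word)]
    return p_class * p_word_given_class

total = sum(word_prob(w, h) for w in range(6))
print(word_prob(4, h), total)
```

Each prediction now normalizes over the number of classes plus the size of one class instead of the full vocabulary, which is where the saving comes from.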
3.1.2 structured embedding of knowledge bases
Bengio aaai2011
3.1.3 assorted speech and nlp applications
learn multiple word vectors: handles polysemy by representing a word with several word vectors ......
3.2 resources (tutorials and code)
- See “Neural Net Language Models” Scholarpedia entry
- Deep Learning tutorials: http://deeplearning.net/tutorials
- Stanford deep learning tutorials with simple programming assignments and reading list: http://deeplearning.stanford.edu/wiki/
- Recursive Autoencoder class project: http://cseweb.ucsd.edu/~elkan/250B/learningmeaning.pdf
- Graduate Summer School: Deep Learning, Feature Learning: http://www.ipam.ucla.edu/programs/gss2012/
- ICML 2012 Representation Learning tutorial: http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-2012.html
- Paper references in separate pdf
software
- Theano (Python CPU/GPU) mathematical and deep learning library: http://deeplearning.net/software/theano
  - Can do automatic, symbolic differentiation
- Senna: POS, Chunking, NER, SRL
  - by Collobert et al. http://ronan.collobert.com/senna/
  - State-of-the-art performance on many tasks
  - 3500 lines of C, extremely fast and using very little memory
- Recurrent Neural Network Language Model: http://www.fit.vutbr.cz/~imikolov/rnnlm/
- Recursive Neural Net and RAE models for paraphrase detection, sentiment analysis, relation classification: www.socher.org
3.3 deep learning tricks
- Stochastic gradient descent and setting learning rates
- Main hyper-parameters
- Learning rate schedule & Early stopping
- Minibatches
- Parameter initialization
- Number of hidden units
- L1 or L2 weight decay
- Sparsity regularization
- Debugging -> Finite difference gradient check (Yay)
- How to efficiently search for hyper-parameter configurations
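The finite-difference gradient check from the list above can be sketched as follows. The model (plain logistic regression), the random data and the helper names are assumptions chosen to keep the example small; the idea is simply to compare the analytic gradient with (f(w + eps) - f(w - eps)) / (2 eps) for each parameter.

```python
import numpy as np

def loss_and_grad(w, x, y):
    """Logistic-regression negative log-likelihood and its analytic gradient."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    loss = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad = x.T @ (p - y)
    return loss, grad

def numerical_grad(f, w, eps=1e-5):
    """Central finite differences, one parameter at a time."""
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))                       # 10 toy examples, 3 features
y = rng.integers(0, 2, size=10).astype(float)      # random binary labels
w = rng.normal(size=3)

_, analytic = loss_and_grad(w, x, y)
numeric = numerical_grad(lambda v: loss_and_grad(v, x, y)[0], w)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

A large `max_diff` usually points to a bug in the hand-written gradient; the check is slow, so it is meant for debugging on a tiny model, not for training.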
tanh(z) = 2 logistic(2z) - 1
tanh often works better than the logistic sigmoid as the hidden-unit non-linearity in deep networks
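The identity follows from dividing the numerator and denominator of tanh(z) = (e^z - e^-z) / (e^z + e^-z) by e^z, which gives (1 - e^-2z) / (1 + e^-2z) = 2 / (1 + e^-2z) - 1. A small numerical check, using only the standard library (the sample points are arbitrary):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

zs = [-3.0, -0.5, 0.0, 0.5, 3.0]
# Largest gap between tanh(z) and the rescaled, shifted logistic over the sample points.
max_gap = max(abs(math.tanh(z) - (2.0 * logistic(2.0 * z) - 1.0)) for z in zs)
print(max_gap)

# tanh is centered at 0, the logistic at 0.5.
print(math.tanh(0.0), logistic(0.0))
```

So tanh is a rescaled and shifted logistic whose output is centered at 0 instead of 0.5.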
Ordinary gradient descent is a batch method; it is very slow and should never be used. If a batch method is wanted, use a second-order one such as L-BFGS.
learning rate: Better results can generally be obtained by allowing learning rates to decrease, typically in O(1/t)
parameter initialization: Initialize hidden layer biases to 0 and output (or reconstruction) biases to optimal value if weights were 0
Initialize weights ~ Uniform(-r, r), r inversely proportional to fan-in (previous layer size) and fan-out (next layer size)
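The two recipes above can be sketched as follows. The concrete formulas are assumptions, since the notes only give the qualitative rule: r = sqrt(6 / (fan_in + fan_out)) is one common choice for tanh units (with a factor of 4 often applied for sigmoid units), and eps0 * tau / max(t, tau) is one common O(1/t) schedule that stays constant for the first tau steps. The default values of `eps0` and `tau` are arbitrary.

```python
import numpy as np

def init_weights(fan_in, fan_out, rng, sigmoid_units=False):
    """Draw weights from Uniform(-r, r), with r shrinking as fan-in and fan-out grow."""
    r = np.sqrt(6.0 / (fan_in + fan_out))    # assumed formula, common for tanh units
    if sigmoid_units:
        r *= 4.0                             # assumed adjustment for sigmoid units
    return rng.uniform(-r, r, size=(fan_out, fan_in)), r

def learning_rate(t, eps0=0.1, tau=100):
    """Constant for the first tau updates, then decaying as O(1/t)."""
    return eps0 * tau / max(t, tau)

rng = np.random.default_rng(0)
W, r = init_weights(fan_in=100, fan_out=50, rng=rng)
b_hidden = np.zeros(50)                      # hidden biases start at 0

print(r, np.abs(W).max())
print([learning_rate(t) for t in (1, 100, 200, 1000)])
```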