Notes on natural language processing.
A user enters the URL of a Wikipedia page; the page's revision history is retrieved and sentiment analysis is run on each revision to detect whether, and how, the article's bias has shifted over time. A rough sketch of the intended pipeline appears below, followed by the references guiding the project.
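A minimal sketch of that pipeline, assuming the MediaWiki revisions API and TextBlob for per-revision sentiment scoring; both are illustrative choices, and the response fields and wikitext cleanup would need to be verified and refined:

```python
# Rough sketch: fetch recent revisions of a Wikipedia article and score each
# one with TextBlob sentiment. Assumes the MediaWiki action API with
# formatversion=2; check field names against the current API docs.
import requests
from textblob import TextBlob

API = "https://en.wikipedia.org/w/api.php"

def revision_sentiments(title, limit=20):
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|content",
        "rvslots": "main",
        "rvlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    data = requests.get(API, params=params).json()
    page = data["query"]["pages"][0]
    scores = []
    for rev in page.get("revisions", []):
        wikitext = rev["slots"]["main"]["content"]        # raw wikitext, not plain text
        polarity = TextBlob(wikitext).sentiment.polarity  # -1 (negative) .. +1 (positive)
        scores.append((rev["timestamp"], polarity))
    return scores

if __name__ == "__main__":
    for ts, score in revision_sentiments("Natural language processing", limit=5):
        print(ts, round(score, 3))
```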
- seq2seq: A general-purpose encoder-decoder framework for TensorFlow that can be used for Machine Translation, Text Summarization, Conversational Modeling, Image Captioning, and more.
- OpenNMT: OpenNMT is an open-source ecosystem for neural machine translation and neural sequence learning.
- Harvard NLP code: Includes seq2seq, OpenNMT, and attention models.
- Attention Is All You Need: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. A small numeric sketch of scaled dot-product attention appears after the reference list.
- The Annotated Transformer (from "Attention Is All You Need"): In this post I present an “annotated” version of the paper in the form of a line-by-line implementation. I have reordered and deleted some sections from the original paper and added comments throughout. This document itself is a working notebook, and should be a completely usable implementation. In total there are 400 lines of library code which can process 27,000 tokens per second on 4 GPUs.
- TextBlob: TextBlob aims to provide access to common text-processing operations through a familiar interface. You can treat TextBlob objects as if they were Python strings that learned how to do Natural Language Processing. Good for sentiment analysis; a short usage example is included after the reference list.
- Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews.
- Global Vectors for Word Representation (GloVe): We provide an implementation of the GloVe model for learning word representations, and describe how to download web-dataset vectors or train your own. See the project page or the paper for more information on GloVe vectors. A small loading example appears after the reference list.
- Winter 2017 CS224n Assignment #2.
- Colaboratory with TensorBoard
- Word embedding summary: This first post lays the foundations by presenting current word embeddings based on language modelling. While many of these models have been discussed at length, we hope that investigating and discussing their merits in the context of past and current research will provide new insights.
- Efficient Estimation of Word Representations in Vector Space: This is word2vec. We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high-quality word vectors from a 1.6-billion-word data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities. A toy training sketch appears after the reference list.
- Distributed Representations of Sentences and Documents, by Quoc Le and Tomas Mikolov (this is doc2vec). Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, “powerful,” “strong” and “Paris” are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks. A toy doc2vec sketch appears after the reference list.
- LDAvis: A method for visualizing and interpreting topics, by Carson Sievert and Kenneth E. Shirley. We present LDAvis, a web-based interactive visualization of topics estimated using Latent Dirichlet Allocation that is built using a combination of R and D3. Our visualization provides a global view of the topics (and how they differ from each other), while at the same time allowing for a deep inspection of the terms most highly associated with each individual topic. First, we propose a novel method for choosing which terms to present to a user to aid in the task of topic interpretation, in which we define the relevance of a term to a topic. Second, we present results from a user study that suggest that ranking terms purely by their probability under a topic is suboptimal for topic interpretation. Last, we describe LDAvis, our visualization system that allows users to flexibly explore topic-term relationships using relevance to better understand a fitted LDA model. The relevance definition is sketched after the reference list.
- PyTorch at Udemy
- awesome-nlp - good collection of NLP links
- NLP Overview - Modern Deep Learning Techniques Applied to Natural Language Processing
- NLP Progress - Sebastian Ruder - Current state-of-the-art for the most common NLP tasks
- NLP Blog - Sebastian Ruder - Good explanation of NLP history and methodology
- Fuzzy C-means clustering in R - clustering example in R using the Iris data set.
- AllenNLP - high-powered NLP library.
- Speech and Language Processing by Dan Jurafsky and James H. Martin. Nice looking; covers n-grams, naive Bayes classifiers, sentiment, logistic regression, vector semantics, neural nets, part-of-speech tagging, sequence processing with recurrent networks, grammars, syntax, statistical parsing, information extraction, and hidden Markov models.
- https://en.wikipedia.org/wiki/Natural_language_processing
- https://en.wikipedia.org/wiki/Non-negative_matrix_factorization
- https://en.wikipedia.org/wiki/Graph_theory
- https://en.wikipedia.org/wiki/Bayesian_network
- https://en.wikipedia.org/wiki/Computational_linguistics
- https://en.wikipedia.org/wiki/Fuzzy_clustering
- Mathematical foundations for a compositional distributional model of meaning, by Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. We propose a mathematical framework for a unification of the distributional theory of meaning in terms of vector space models, and a compositional theory for grammatical types, for which we rely on the algebra of Pregroups, introduced by Lambek. This mathematical framework enables us to compute the meaning of a well-typed sentence from the meanings of its constituents. Concretely, the type reductions of Pregroups are ‘lifted’ to morphisms in a category, a procedure that transforms meanings of constituents into a meaning of the (well-typed) whole. Importantly, meanings of whole sentences live in a single space, independent of the grammatical structure of the sentence. Hence the inner product can be used to compare meanings of arbitrary sentences, as it is for comparing the meanings of words in the distributional model. The mathematical structure we employ admits a purely diagrammatic calculus which exposes how the information flows between the words in a sentence in order to make up the meaning of the whole sentence. A variation of our ‘categorical model’ which involves constraining the scalars of the vector spaces to the semiring of Booleans results in a Montague-style Boolean-valued semantics.
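The short sketches below expand on a few of the references above; they are illustrative assumptions on my part, not code taken from the cited projects.

Scaled dot-product attention, the core operation of the Transformer in "Attention Is All You Need", written out in NumPy:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of values

# toy example: 3 positions, dimension 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)     # (3, 4)
```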
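TextBlob usage, as referenced above (some features need a one-time `python -m textblob.download_corpora`):

```python
from textblob import TextBlob

blob = TextBlob("The new edit removed the balanced summary and added loaded language.")
print(blob.words[:4])           # string-like access to tokens
print(blob.sentiment)           # Sentiment(polarity=..., subjectivity=...)
print(blob.sentiment.polarity)  # -1.0 (negative) .. +1.0 (positive)
```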
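Loading pretrained GloVe vectors from the released text files (e.g. glove.6B.50d.txt, one word and its vector per line) and comparing two words, a minimal sketch:

```python
import numpy as np

def load_glove(path):
    # each line: word followed by its vector components, space-separated
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# vecs = load_glove("glove.6B.50d.txt")
# print(cosine(vecs["king"], vecs["queen"]))
```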
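A toy word2vec training run; this uses gensim as a stand-in implementation (the paper ships its own C code) and assumes gensim >= 4.0, where the dimension argument is `vector_size` rather than `size`:

```python
from gensim.models import Word2Vec

# toy corpus: each "sentence" is a list of tokens
sentences = [
    ["the", "edit", "added", "neutral", "wording"],
    ["the", "edit", "removed", "neutral", "wording"],
    ["reviewers", "flagged", "the", "biased", "edit"],
]

# sg=1 selects the skip-gram architecture described in the paper
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("edit", topn=3))
```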
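A toy Paragraph Vector (doc2vec) run, again via gensim 4.x as a stand-in implementation (older releases use `model.docvecs` instead of `model.dv`):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["the", "article", "reads", "neutrally"], tags=["d0"]),
    TaggedDocument(words=["this", "revision", "introduces", "loaded", "language"], tags=["d1"]),
]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# infer a fixed-length vector for an unseen piece of text
vec = model.infer_vector(["a", "new", "revision", "of", "the", "article"])
print(vec.shape)                             # (50,)
print(model.dv.most_similar([vec], topn=1))  # nearest training document
```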
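The LDAvis term-relevance definition, as I read the paper (its user study suggested lambda around 0.6):

```python
import numpy as np

def relevance(phi_kw, p_w, lam=0.6):
    # relevance(w, k | lam) = lam * log(phi_kw) + (1 - lam) * log(phi_kw / p_w)
    # phi_kw = P(w | topic k); p_w = marginal probability of w in the corpus.
    # lam = 1 ranks terms purely by in-topic probability, which the user study
    # found suboptimal for topic interpretation.
    return lam * np.log(phi_kw) + (1 - lam) * np.log(phi_kw / p_w)

# a term that is common within the topic but rare overall scores highly
print(relevance(phi_kw=0.02, p_w=0.001))
```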