NLP or natural language processing (not neuro-linguistic programming; this ambiguity makes searching on the internet a real pain) is a group of methods/algorithms that deal with text data. Sounds like text mining? NLP concentrates more on predictive modeling, while text mining focuses on exploratory data analysis. In practice some of the algorithms may be used both for prediction and exploration (LDA, for example).
As a data scientist you should care, because text data contains a lot of useful and valuable information, so discarding it just because you don’t know how to deal with it would be a huge waste ;)
Basic concepts are: document, corpus, vector and model. You’ll find their definitions here.
The input data is usually very messy. Just as in traditional machine learning (I mean here the classical, matrix-transformable tabular format), the data must be ‘clean’ so that the results don’t come out messy as well. What does ‘clean data’ mean in NLP? In general, running through the following steps should give us a fairly usable, valid dataset:
tokenization - an absolutely basic concept, and a surprisingly simple one. In practice tokenization means dividing your text into a list of words (sometimes pairs or triples of words, entire sentences or even syllables). A very brief tutorial should be enough, e.g. this one; there is also a short sketch after this list.
excluding stop words - the most popular approach (though not the best one) is to find a list of stop words for a particular language on the Internet. But… a better way is to exclude specific parts of speech, e.g. conjunctions. To do this, you have to tag each word with its part of speech (for Polish, morfeusz seems to be the best choice, accessed through spaCy’s API, so again I recommend Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras).
stemming and lemmatization - which of these you should use depends on the language you are working with. For example, for Polish (which is my first language) lemmatization works far better than stemming, and the best tool I’ve come across so far is morfeusz, with its excellent documentation, which thoroughly introduces various quirks of the Polish language. However, Morfeusz’s API is not so user friendly (IMHO), so I recommend using spaCy’s API to Morfeusz instead. A perfect place to begin your adventure with spaCy is Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras.
last but not least, regex. To be honest, even though most NLP libraries provide regex functionalities, I stick to the good old Python re package because of its great performance (thanks to the re.compile() philosophy), its huge popularity and support, its ease of use, and its support for standard regex syntax.
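Below is a minimal sketch of how these cleaning steps can be glued together. It assumes the English spaCy model en_core_web_sm is installed (for Polish you would plug in Morfeusz through spaCy instead); the regex pattern and the set of excluded parts of speech are arbitrary, illustrative choices, not a recommendation.

```python
import re

import spacy

# assumes `python -m spacy download en_core_web_sm` has been run;
# for Polish you would plug in Morfeusz through spaCy instead
nlp = spacy.load("en_core_web_sm")

# compile the pattern once, reuse it many times - the re.compile() philosophy mentioned above
CLEANER = re.compile(r"[^a-zA-Z' ]+")

def clean(text):
    """Regex clean-up, tokenization, POS-based stop word removal and lemmatization."""
    doc = nlp(CLEANER.sub(" ", text))
    return [
        tok.lemma_.lower()
        for tok in doc
        # drop conjunctions, determiners, adpositions etc. instead of using a fixed stop word list
        if tok.pos_ not in {"CCONJ", "SCONJ", "DET", "ADP", "PUNCT", "SPACE"}
    ]

# prints a list of cleaned, lemmatized tokens
print(clean("The cats were sitting on the mat and they didn't care!"))
```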
BOW (bag of words) - the simplest way of transforming words into numbers: counting the occurrences of each word in a document.
tf-idf (best resource: An Introduction to Information Retrieval, Manning, Raghavan, Schütze, chapter 6: Scoring, term weighting and the vector space model) - for a good start you should familiarize yourself with how unstructured data, especially text data, is represented in mathematical terms. For this, tf-idf is used, which, despite being a rather straightforward concept, enables representing documents as vectors. As a result it becomes possible to calculate the resemblance (proximity) of two documents by simply applying the cosine measure (a close relative of Pearson’s correlation coefficient); the first sketch after this list shows BOW, tf-idf and cosine similarity glued together in gensim.
word2vec - representing each word as a dense vector which captures the meaning of the word. The most famous example is the equation king - man + woman = queen, presented in the word2vec paper, which appears in 100% of articles about word2vec, so I had to mention it as well. However, the original paper is a little bit obscure and does not concentrate on the mathematics too much, so I recommend reading word2vec Parameter Learning Explained instead and definitely watching a lecture from Stanford conducted by the legendary Christopher D. Manning. Even though word2vec is based on a shallow neural network, I don’t use any of the famous NN libraries (tensorflow, pytorch, keras) for obtaining predictions, but simply call gensim’s word2vec model.
doc2vec is pretty much a word2vec model with a minor tweak (adding a particular document’s id as if it were another word). The official paper is rather approachable, but running through the examples from gensim’s doc2vec model will give you a good intuition of what this algorithm is capable of: you can very easily measure similarities between various documents (the second sketch below shows both word2vec and doc2vec in gensim).
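To make BOW and tf-idf a bit more tangible, here is a minimal gensim sketch on a made-up toy corpus (the three documents below are purely illustrative):

```python
from gensim import corpora, models, similarities

# a toy corpus: each document is already tokenized (see the cleaning sketch above)
docs = [
    ["cat", "sit", "mat"],
    ["dog", "sit", "mat"],
    ["stock", "market", "crash"],
]

# BOW: map each token to an id and count its occurrences per document
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# tf-idf: reweight the raw counts so that rare, informative words matter more
tfidf = models.TfidfModel(bow_corpus)
tfidf_corpus = [tfidf[bow] for bow in bow_corpus]

# cosine similarity between documents represented as tf-idf vectors
index = similarities.MatrixSimilarity(tfidf_corpus, num_features=len(dictionary))
print(index[tfidf_corpus[0]])  # similarities of document 0 to all documents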
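And a similar sketch for word2vec and doc2vec, again with gensim. The corpus is far too small to learn anything meaningful (a real “king - man + woman = queen” result needs a huge corpus), so treat it only as an illustration of the API; all hyperparameters are arbitrary toy values.

```python
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["man", "walks", "the", "dog"],
    ["woman", "walks", "the", "dog"],
]

# word2vec: each word gets a dense vector; min_count=1 only because the corpus is tiny
w2v = Word2Vec(sentences, vector_size=50, min_count=1, epochs=100, seed=42)
print(w2v.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# doc2vec: the same trick, plus a "tag" per document acting like an extra word
tagged = [TaggedDocument(words=s, tags=[str(i)]) for i, s in enumerate(sentences)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=100, seed=42)
print(d2v.dv.most_similar("0", topn=2))  # documents most similar to document "0"
```

On a corpus of millions of sentences the most_similar call above would (hopefully) return queen; here it only shows how the API is used.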
LDA
Probably the most popular algorithm for topic modeling is LDA (Latent Dirichlet Allocation); however, before diving deep into papers on this particular subject, you should first consider whether you already have the appropriate background. An easy way to get familiar with LDA is to start with:
latent semantic indexing (An Introduction to Information Retrieval, Manning, Raghavan, Schütze, chapter 18: Matrix decompositions and latent semantic indexing). Understanding LSI (Latent Semantic Indexing) is an excellent step towards understanding LDA; however, at first the concept may seem a little obscure, especially if you haven’t been using linear algebra for a while. These resources should be helpful:
singular value decomposition (SVD) and low-rank matrix approximations (CS168: The Modern Algorithmic Toolbox, Lecture #9 link)
besides SVD, it’s good to remember how PCA actually works. I’ve always found it hard to understand why eigenvalues and eigenvectors are the best choices in PCA, and by “why” I mean that it should be as obvious to me as the arithmetic mean. Excellent resources are here: one and two, also from The Modern Algorithmic Toolbox. Chapeau bas, Stanford University!
having read about SVD, the Python package that I find best for topic modelling is gensim, with its wonderful tutorials; a short LSI/LDA sketch follows below.
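For example, building on the BOW sketch above, LSI and LDA are only a couple of lines each in gensim; num_topics and passes below are arbitrary toy values, and the corpus is made up purely for illustration.

```python
from gensim import corpora, models

docs = [
    ["cat", "dog", "pet", "vet"],
    ["dog", "bark", "pet", "walk"],
    ["stock", "market", "price", "crash"],
    ["market", "price", "trade", "stock"],
]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# LSI: SVD-based topics on top of BOW (often tf-idf) vectors
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)
print(lsi.print_topics())

# LDA: probabilistic topics; each document becomes a distribution over topics
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=20, random_state=42)
print(lda.print_topics())
print(lda[bow_corpus[0]])  # topic distribution of the first document
```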
Armed with this basic intuition on LDA, you may dive deeper into more advanced resources:
LDA’s original paper - for crazy-braves only.
IMHO, a deep understanding of LDA is not necessary to make use of its functionalities and advantages.
RNN or Recurrent Neural Network is a type of artificial neural network used for working on sequence data, so it can be used not only for sentences, which are sequences of words (tokens), but also for time series or any type of observations which can be indexed reasonably. The main idea is to pass the output of the current cell on to the following one, so each cell eventually ends up with two inputs: the “original” one, i.e. a numerical representation of a word from the sentence, and the output of the previous cell.
LSTM - LSTMs are a type of recurrent neural network which solves the vanishing gradient problem (when the network does not remember what was happening many layers earlier) in a savvy way: they use their own, specially created “highways” to pass the “old” information (“old” meaning from the beginning of the sentence) throughout the whole sentence. It resembles the natural way humans understand spoken language: when we listen to someone speaking, we subconsciously summarize, remember and pass on what we have heard, so when the sentence is finished we can understand it as a whole (a minimal PyTorch sketch of an LSTM classifier follows below).
BERT - I have a very basic intuition of how this works, thanks to BERT’s original paper, and I just use it :) with BERT-pytorch.
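To make the RNN/LSTM idea a bit more concrete, here is a minimal PyTorch sketch of an LSTM-based sentence classifier; all the sizes (vocabulary, embedding and hidden dimensions, number of classes) are made-up toy values.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embeds token ids, runs them through an LSTM and classifies the whole sentence."""

    def __init__(self, vocab_size=1000, embed_dim=50, hidden_dim=64, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        # the LSTM passes its hidden state from token to token - the "highway" mentioned above
        _, (hidden, _) = self.lstm(embedded)  # hidden: (1, batch, hidden_dim)
        return self.fc(hidden[-1])            # logits: (batch, n_classes)

model = LSTMClassifier()
batch = torch.randint(0, 1000, (4, 12))  # 4 fake sentences, 12 token ids each
print(model(batch).shape)                # torch.Size([4, 2])
```

The final hidden state plays the role of the “summary” of the whole sentence described above.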
Text mining in R - basic concepts of text mining, an introduction to the tidytext package and LDA
example nlp usage in Python - a towardsdatascience article