class: center, titleslide
# Text Mining Techniques
# Accounting Research
##
Ties de Kok
## Tilburg University --- layout: true class: mainlayout --- class: tocslide .left-column[ ## Agenda ] .right-column[ ### What are we going to discuss today? 1. Positioning session 2. Terminology 3. Language 4. Jupyter 5. NLP Python tools 6. Topics: - Process and Clean text - Direct feature extraction - Represent text numerically - Machine learning ] --- class: tocslide .left-column[ ## Agenda ## Positioning ] .right-column[ ### Where does this session fit into the bigger scheme of NLP?
- Determining the relevance of textual data
- Finding sources of textual data
- Gathering textual data
.emphasized[Processing textual data]
.emphasized[Analyzing textual data] ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ] .right-column[ ### Many inter-related names and terms: - Computational Linguistics - Textual Analysis
.emphasized[Text Mining]
.emphasized[Natural Language Processing] ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ] .right-column[ ### Which programming language / software to use?
.emphasized[Python]
- R
- Perl
To get started with the Python basics see my [Python Tutorial](https://github.com/TiesdeKok/LearnPythonforResearch) ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ## Jupyter ] .right-column[ ### Project Jupyter
Try it in your browser
Install the Notebook
] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ## Jupyter ## NLP Python ] .right-column[ ### External NLP-relevant Python libraries **Standard NLP libraries**: 1. [`NLTK`](http://www.nltk.org/) and the higher-level wrapper [`TextBlob`](https://textblob.readthedocs.io/en/dev/) 2. [`Spacy`](https://spacy.io/) and the higher-level wrapper [`Textacy`](https://github.com/chartbeat-labs/textacy) **Standard machine learning library**: 1. [`scikit learn`](http://scikit-learn.org/stable/) **Topic modelling library**: 1. [`Gensim`](https://github.com/RaRe-Technologies/gensim) ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ## Jupyter ## NLP Python ## Topics ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean
]
.right-column[
### Text normalization

- Sentence segmentation
> i.e. split text up into sentences

- Word tokenization
> i.e. split a sentence up into tokens (i.e. words)

- Entity normalization
> i.e. "http://www.google.com" → "URL"

- Lemmatization & Stemming
> Convert tokens to a base representation
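A minimal sketch of the first two steps using `NLTK` (assuming the `punkt` tokenizer models have been downloaded):

```python
import nltk
# nltk.download("punkt")  # one-time download of the tokenizer models

text = "Tilburg University is located in Noord Brabant. It was founded in 1927."

# Sentence segmentation: split the text into sentences
sentences = nltk.sent_tokenize(text)

# Word tokenization: split each sentence into tokens (i.e. words)
tokens = [nltk.word_tokenize(sentence) for sentence in sentences]

print(tokens[0])  # ['Tilburg', 'University', 'is', 'located', 'in', 'Noord', 'Brabant', '.']
```
]
---
class: tocslide
.left-column[
## Process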
& Clean
]
.right-column[
### Lemmatization & Stemming

**Stemming:**
> Crude heuristic process that chops off the ends of words

**Lemmatizing:**
> Use vocabulary and morphological analysis of words to return the base or dictionary form
]

--

.right-column-next[
Example:
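A minimal sketch with `NLTK`, assuming the `wordnet` data has been downloaded:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download("wordnet")  # one-time download

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "meeting"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))

# studies  -> studi | study
# studying -> studi | studying
# meeting  -> meet  | meeting
```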
] --- class: tocslide .left-column[ ## Process
& Clean
]
.right-column[
### Language modelling

Text has a complex underlying structure that you can tap into.

- Part-of-Speech tagging
> Identify the "Word Class" of a token (e.g. noun, verb)

- Remove stop words
> Remove words that don't carry any informational value

- Uni-Gram vs. N-Grams
> Multi-word tokens: retain some of the sequential nature
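A minimal sketch of the first two steps with `NLTK` (assuming the tagger model and stop word list have been downloaded):

```python
import nltk
from nltk.corpus import stopwords
# nltk.download("averaged_perceptron_tagger"); nltk.download("stopwords")

tokens = ["Tilburg", "University", "is", "located", "in", "Noord", "Brabant"]

# Part-of-Speech tagging: attach a word class to every token
print(nltk.pos_tag(tokens))
# [('Tilburg', 'NNP'), ('University', 'NNP'), ('is', 'VBZ'), ...]

# Stop word removal: drop tokens without informational value
stop_words = set(stopwords.words("english"))
print([t for t in tokens if t.lower() not in stop_words])
# ['Tilburg', 'University', 'located', 'Noord', 'Brabant']
```
]
---
class: tocslide
.left-column[
## Process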
& Clean
]
.right-column[
### Uni-Gram vs. N-Grams

> Multi-word tokens: retain some of the sequential nature
"Tilburg University is located in Noord Brabant"
| Unigram    | Bigram             | Trigram               |
|------------|--------------------|-----------------------|
| Tilburg    | Tilburg-University | Tilburg-University-is |
| University | University-is      | University-is-located |
| is         | is-located         | is-located-in         |
| located    | located-in         | located-in-Noord      |
| in         | in-Noord           | in-Noord-Brabant      |
| Noord      | Noord-Brabant      |                       |
| Brabant    |                    |                       |
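A minimal sketch of how these could be generated with `NLTK`:

```python
from nltk import ngrams

tokens = "Tilburg University is located in Noord Brabant".split()

# Join each n-gram into a single multi-word token
bigrams = ["-".join(gram) for gram in ngrams(tokens, 2)]
trigrams = ["-".join(gram) for gram in ngrams(tokens, 3)]

print(bigrams)   # ['Tilburg-University', 'University-is', ...]
print(trigrams)  # ['Tilburg-University-is', 'University-is-located', ...]
```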
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean
## Feature Extraction
]
.right-column[
### Feature search

* Entity extraction
> e.g. extract PEOPLE / EVENTS / DATES / MONETARY VALUES

* Pattern search (`RE`)
> i.e. use [`Regular Expressions`](https://scotch.io/tutorials/an-introduction-to-regex-in-python) to look for patterns

* Term (Dictionary) counting
> i.e. count the number of times a term occurs
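For entity extraction, a minimal sketch using `Spacy` (assuming the small English model has been installed):

```python
import spacy

# Load the small English model (install once with:
# python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple paid $3 billion for Beats in May 2014.")

# Each recognized entity carries a label such as ORG, MONEY, or DATE
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Apple', 'ORG'), ('$3 billion', 'MONEY'), ('Beats', 'ORG'), ('May 2014', 'DATE')]
```
]
---
class: tocslide
.left-column[
## Process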
& Clean ## Feature Extraction ] .right-column[ ### Pattern search (`RE`)
**TIP**: Use [Pythex.org](https://pythex.org/) to try out your regular expressions.

Example on Pythex:
click here
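A minimal sketch using Python's built-in `re` module (the pattern and example text are illustrative):

```python
import re

text = "Net income was $1,250 million in 2016 and $1,100 million in 2015."

# Capture every dollar amount that is followed by the word "million"
amounts = re.findall(r"\$([\d,]+) million", text)
print(amounts)  # ['1,250', '1,100']

# Find all four-digit years (non-capturing group for the century)
years = re.findall(r"\b(?:19|20)\d{2}\b", text)
print(years)    # ['2016', '2015']
```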
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[ ### Term (Dictionary) counting
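A minimal sketch using Python's built-in `Counter` (the word list is illustrative; accounting research typically uses established dictionaries such as the Loughran-McDonald word lists):

```python
from collections import Counter

# Illustrative negative word list (a stand-in for a real dictionary)
negative_words = {"loss", "decline", "impairment", "adverse"}

tokens = "the decline in revenue resulted in a loss".split()

# Count how often each dictionary term occurs in the text
counts = Counter(token for token in tokens if token in negative_words)
print(counts)                # Counter({'decline': 1, 'loss': 1})
print(sum(counts.values())) # total count of negative terms: 2
```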
] --- class: tocslide .left-column[ ## Process
& Clean
## Feature Extraction
]
.right-column[
### Text evaluation

* Language
> i.e. detect whether text is English

* Readability
> i.e. use the [`TextStat`](https://github.com/shivam5992/textstat) package to calculate text statistics

* Text similarity
> See the awesome [`FuzzyWuzzy`](https://github.com/seatgeek/fuzzywuzzy) package for details.
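A minimal sketch of the last two items, assuming `textstat` and `fuzzywuzzy` are installed:

```python
import textstat
from fuzzywuzzy import fuzz

text = "The company reported a net loss driven by impairment charges."

# Readability: higher Flesch scores indicate easier-to-read text
print(textstat.flesch_reading_ease(text))

# Similarity: a 0-100 score based on edit distance
print(fuzz.ratio("net loss", "net losses"))          # e.g. 89
print(fuzz.partial_ratio("net loss", "a net loss"))  # e.g. 100
```
]
---
class: tocslide
.left-column[
## Process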
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean
## Feature Extraction
## Represent Numerically
]
.right-column[
### Bag of Words

Also labelled: *frequency based representation*

**Term frequency (TF)**
(Figure taken from: https://web.stanford.edu/~jurafsky/slp3/6.pdf)
] --- class: tocslide .left-column[ ## Process
& Clean
## Feature Extraction
## Represent Numerically
]
.right-column[
### Term frequency (TF) example:

> [1] "The sky is blue."
> [2] "The sun is bright today."
> [3] "The sun in the sky is bright."
> [4] "We can see the shining sun, the bright sun."
Note: the collection of all text documents is called the *corpus*
(Example taken from: http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)
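A minimal sketch of how this corpus can be turned into a term frequency matrix with `scikit learn`:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the vocabulary (columns)
print(tf.toarray())  # one row per document, one column per term
```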
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[
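### TF-IDF

Term frequency alone overweights words that are common in every document. TF-IDF therefore weights the term frequency by the inverse document frequency; in its standard form:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

where *N* is the number of documents in the corpus and df(*t*) is the number of documents that contain term *t*.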
] -- .right-column-next[
(Figure taken from: https://moz.com/blog/7-advanced-seo-concepts)
] --- class: tocslide .left-column[ ## Process
& Clean
## Feature Extraction
## Represent Numerically
]
.right-column[
### TF-IDF example:

> [1] "The sky is blue."
> [2] "The sun is bright today."
> [3] "The sun in the sky is bright."
> [4] "We can see the shining sun, the bright sun."
(Example taken from: http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)
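The same corpus with TF-IDF weights instead of raw counts, again sketched with `scikit learn`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Terms that occur in many documents (e.g. "the") get low weights,
# distinctive terms (e.g. "blue") get high weights
print(tfidf.toarray().round(2))
```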
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[ ### Word Embeddings Are there alternatives to the frequency based representation?
Yes, meet the new "secret sauce": **word embeddings**! ] -- .right-column-next[
Word embeddings are based on a "prediction based representation". Basic idea: > A word is characterized by the company it keeps:
> 1. A **Ferrari** is a fast car
> 2. A **Lamborghini** is a fast car

Note: the most well-known implementation is `Word2Vec`; one of its training architectures is called the *Continuous Bag of Words* (CBOW) model.
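A minimal sketch using `Gensim` (the toy corpus is illustrative; the `vector_size` argument follows the Gensim 4.x API):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [
    ["a", "ferrari", "is", "a", "fast", "car"],
    ["a", "lamborghini", "is", "a", "fast", "car"],
]

# Train a small Word2Vec model on the corpus
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)

print(model.wv["ferrari"])  # the 50-dimensional vector for "ferrari"
print(model.wv.similarity("ferrari", "lamborghini"))
```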
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### What is Machine Learning?
> A machine learning algorithm is not explicitly programmed.
Instead, the algorithm is trained on input + output data.

Does this sound familiar?
]

--

.right-column-next[
A linear regression is also machine learning!
]

--

.right-column-next[
### Example: sentiment analysis

Traditional method:
> manually create pos/neg word lists

Machine learning method:
> manually classify sentences with a pos/neg score;
> the pos/neg word lists are determined by the algorithm
]
---
class: tocslide
.left-column[
## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Supervised Machine Learning
> Supervised ML algorithms are trained on classified training data. ] -- .right-column-next[
### Where to get training data? 1. Use a naturally classified training set - News categories - Movie reviews - Text books for different levels of English 2. Create your own training set - Manually classify text - Crowdsource a training set ] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Crowdsource training set
It is possible to crowdsource a training set using services like Amazon Mechanical Turk. ] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Supervised Machine Learning: models
Three frequently used models for Supervised ML:

1. Naive Bayes classifier ([sklearn link](http://scikit-learn.org/stable/modules/naive_bayes.html))
2. SVM: Support Vector Machines ([sklearn link](http://scikit-learn.org/stable/modules/svm.html))
3. Decision Trees ([sklearn link](http://scikit-learn.org/stable/modules/tree.html#classification))
]

--

.right-column-next[
**My recommendation?**

Always try multiple models to see which gives you the best results.

* Naive Bayes is good for small samples and quick testing.
* SVM is more sophisticated and generally better for more complex tasks.
* Decision Trees are more intuitive but harder to train.

Regardless of the model: hyperparameter optimization is very important!
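A minimal sketch of such a classifier in `scikit learn`, including a small hyperparameter grid (the training data is purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Purely illustrative training data
texts = ["great strong results", "record profit growth",
         "severe losses reported", "weak declining revenue"] * 5
labels = ["pos", "pos", "neg", "neg"] * 5

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Hyperparameter optimization via an exhaustive grid search
grid = GridSearchCV(pipeline, {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [0.1, 1.0],
}, cv=2)
grid.fit(texts, labels)

print(grid.best_params_)
print(grid.predict(["profit growth was strong"]))  # e.g. ['pos']
```
]
---
class: tocslide
.left-column[
## Process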
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[
(Slide taken from: https://www.slideshare.net/sparktc/hyperparameter-optimization-sven-hafeneger)
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Model Selection and Evaluation
> i.e. how to select the model and hyperparameters?

#### There are two essential metrics in ML:

1. Precision
> High precision → low false positive rate

2. Recall
> High recall → low false negative rate
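In terms of true positives (TP), false positives (FP), and false negatives (FN):

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$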
For details see: [Precision-Recall](http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html)
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Unsupervised Machine Learning
> Unsupervised ML algorithms are trained using only input data. Do unsupervised ML models work for all problems?
No! Usually only for clustering / topic modelling. ] -- .right-column-next[
Examples of unsupervised models: 1. Principal Component Analysis / Factor Analysis 2. .emphasized[Latent Dirichlet Allocation (LDA)] (and Word2Vec) ] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Latent Dirichlet Allocation (LDA)
> Unsupervised topic modelling technique to discover abstract topics from a collection of documents. ] -- .right-column-next[
### LDA procedure

You define the number of topics (*N*) and the other hyperparameters. LDA then assigns each document a vector with *N* topic probabilities.

**Important:** topics are not labeled and there is a degree of randomness
> i.e. running the same model twice can result in different topic labels!
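A minimal sketch using `Gensim` (the tokenized documents are placeholders):

```python
from gensim import corpora
from gensim.models import LdaModel

# Placeholder corpus: each document is a list of cleaned tokens
texts = [
    ["earnings", "increased", "revenue", "growth"],
    ["litigation", "risk", "lawsuit", "settlement"],
    ["earnings", "revenue", "quarter", "growth"],
]

# Map tokens to ids and represent each document as a bag of words
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train LDA with N = 2 topics
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

print(lda.print_topics())  # word distributions per topic
print(lda[corpus[0]])      # topic probabilities for the first document
```
]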
--- class: tocslide .left-column[ ## Process
& Clean
## Feature Extraction
## Represent Numerically
## Machine Learning
## Neural Networks
]
.right-column[
### Neural Networks for NLP

Natural language is very complex, which makes NLP hard. Consider these ambiguous headlines:

1. The Pope's baby steps on gays
2. Scientists study whales from space
3. Boy paralyzed after tumor fights back to gain black belt
(Examples from: http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture1.pdf)
]

--

.right-column-next[
### Deep Neural Networks

DNNs make it possible to model complex phenomena, which promises progress for NLP!

Interested? Check out the Stanford course CS224n ([Syllabus](http://web.stanford.edu/class/cs224n/syllabus.html))!
]
---
class: tocslide
.left-column[
## Closing remarks
]
.right-column[
### Closing remarks

Getting started with Python / NLP can be overwhelming.
This is normal!
**General tips:** 1. Remember, Google is your friend 2. Having a hard time determining your next step?
Try to explicitly formulate what your (sub-)goal is 3. Asking for help?
Avoid the XY problem: [xyproblem.info/](http://xyproblem.info/) 4. Don't get discouraged by the abundance of mathematics ] --- class: tocslide .left-column[ ## Closing remarks ## GitHub repository ] .right-column[ ### GitHub repository Will be communicated in due time. ] --- class: tocslide .left-column[ ## Closing remarks ## GitHub repository ## Assignment ] .right-column[
] --- class: tocslide .left-column[ ## Closing remarks ## GitHub repository ## Assignment ## Questions? ] .right-column[ ### Questions? ] --- class: tocslide .left-column[ ## Closing remarks ## GitHub repository ## Assignment ## Questions? ## Extra ] .right-column[ ### Word Embeddings Can we do better than a Frequency based representation?
Yes, meet the new "secret sauce": **word embeddings**! ] -- .right-column-next[
The most well-known method is called *word2vec*:

> Word2vec creates a prediction-based representation of text with several hundred dimensions using a two-layer neural network.

> Each word in the corpus is assigned a weight on each dimension, resulting in a dense vector.

Note: one of word2vec's training architectures is referred to as the *Continuous Bag of Words* (CBOW) model.
]
---
class: tocslide
.left-column[
## Closing remarks
## GitHub repository
## Assignment
## Questions?
## Extra
]
.right-column[
### Word2Vec example

> Paris – France + Spain = Madrid
]

--
--- class: tocslide .left-column[ ## Closing remarks ## GitHub repository ## Assignment ## Questions? ## Extra ] .right-column[ ### Neural Networks Primer Linear regression in "Neural Network" representation:
] --- class: tocslide .left-column[ ## Closing remarks ## GitHub repository ## Assignment ## Questions? ## Extra ] .right-column[ ### Neural Networks Primer Shallow Neural Network representation:
] -- .right-column-next[ Why?
It allows you to model complex non-linear relationships!
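A minimal sketch of such a shallow network in `scikit learn` (the toy data is purely illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy non-linear data: y = x^2 plus noise
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.1, size=200)

# One hidden layer with 10 units: a shallow neural network
model = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
model.fit(X, y)

print(model.predict([[2.0]]))  # should be close to 4.0
```
]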