class: center, titleslide
# Text Mining Techniques
# Accounting Research
##
Ties de Kok
## Tilburg University --- layout: true class: mainlayout --- class: tocslide .left-column[ ## Agenda ] .right-column[ ### What are we going to discuss today? 1. Positioning session 2. Terminology 3. Language 4. Jupyter 5. NLP Python tools 6. Topics: - Process and Clean text - Direct feature extraction - Represent text numerically - Machine learning ] --- class: tocslide .left-column[ ## Agenda ## Positioning ] .right-column[ ### Where does this session fit into the bigger scheme of NLP?
- Determining the relevance of textual data
- Finding sources of textual data
- Gathering textual data
.emphasized[Processing textual data]
.emphasized[Analyzing textual data] ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ] .right-column[ ### Many inter-related names and terms: - Computational Linguistics - Textual Analysis
.emphasized[Text Mining]
.emphasized[Natural Language Processing] ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ] .right-column[ ### Which programming language / software to use?
.emphasized[Python]
- R
- Perl
To get started with the Python basics see my [Python Tutorial](https://github.com/TiesdeKok/LearnPythonforResearch) ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ## Jupyter ] .right-column[ ### Project Jupyter
Try it in your browser
Install the Notebook
] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ## Jupyter ## NLP Python ] .right-column[ ### External NLP-relevant Python libraries **Standard NLP libraries**: 1. [`NLTK`](http://www.nltk.org/) and the higher-level wrapper [`TextBlob`](https://textblob.readthedocs.io/en/dev/) 2. [`Spacy`](https://spacy.io/) and the higher-level wrapper [`Textacy`](https://github.com/chartbeat-labs/textacy) **Standard machine learning library**: 1. [`scikit learn`](http://scikit-learn.org/stable/) **Topic modelling library**: 1. [`Gensim`](https://github.com/RaRe-Technologies/gensim) ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ## Jupyter ## NLP Python ## Topics ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean
]
.right-column[
### Text normalization

- Sentence segmentation
> i.e. split text up into sentences

- Word tokenization
> i.e. split a sentence up into tokens (i.e. words)

- Entity normalization
> i.e. "http://www.google.com" → "URL"

- Lemmatization & Stemming
> Convert tokens to a base representation
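A minimal sketch of the first two steps using `NLTK` (assuming the `punkt` tokenizer models have been downloaded):

```python
import nltk
# nltk.download("punkt")  # one-time download of the tokenizer models

text = "Tilburg University is located in Noord Brabant. It was founded in 1927."

# Sentence segmentation: split the text into sentences
sentences = nltk.sent_tokenize(text)

# Word tokenization: split each sentence into tokens (i.e. words)
tokens = [nltk.word_tokenize(sentence) for sentence in sentences]

print(tokens[0])  # ['Tilburg', 'University', 'is', 'located', 'in', 'Noord', 'Brabant', '.']
```
]
---
class: tocslide
.left-column[
## Process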
& Clean
]
.right-column[
### Lemmatization & Stemming

**Stemming:**
> Crude heuristic process that chops off the ends of words

**Lemmatizing:**
> Use vocabulary and morphological analysis of words to return the base or dictionary form
]

--

.right-column-next[
Example:
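A minimal sketch with `NLTK`, assuming the `wordnet` data has been downloaded:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download("wordnet")  # one-time download

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "meeting"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))

# studies  -> studi | study
# studying -> studi | studying
# meeting  -> meet  | meeting
```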
] --- class: tocslide .left-column[ ## Process
& Clean
]
.right-column[
### Language modelling

Text has a complex underlying structure that you can tap into.

- Part-of-Speech tagging
> Identify the "Word Class" of a token (e.g. noun, verb)

- Remove stop words
> Remove words that don't carry any informational value

- Uni-Gram vs. N-Grams
> Multi-word tokens: retain some of the sequential nature
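A minimal sketch of the first two steps with `NLTK` (assuming the tagger model and stop word list have been downloaded):

```python
import nltk
from nltk.corpus import stopwords
# nltk.download("averaged_perceptron_tagger"); nltk.download("stopwords")

tokens = ["Tilburg", "University", "is", "located", "in", "Noord", "Brabant"]

# Part-of-Speech tagging: attach a word class to every token
print(nltk.pos_tag(tokens))
# [('Tilburg', 'NNP'), ('University', 'NNP'), ('is', 'VBZ'), ...]

# Stop word removal: drop tokens without informational value
stop_words = set(stopwords.words("english"))
print([t for t in tokens if t.lower() not in stop_words])
# ['Tilburg', 'University', 'located', 'Noord', 'Brabant']
```
]
---
class: tocslide
.left-column[
## Process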
& Clean
]
.right-column[
### Uni-Gram vs. N-Grams

> Multi-word tokens: retain some of the sequential nature
"Tilburg University is located in Noord Brabant"
| Unigram    | Bigram             | Trigram               |
|------------|--------------------|-----------------------|
| Tilburg    | Tilburg-University | Tilburg-University-is |
| University | University-is      | University-is-located |
| is         | is-located         | is-located-in         |
| located    | located-in         | located-in-Noord      |
| in         | in-Noord           | in-Noord-Brabant      |
| Noord      | Noord-Brabant      |                       |
| Brabant    |                    |                       |
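A minimal sketch of how these could be generated with `NLTK`:

```python
from nltk import ngrams

tokens = "Tilburg University is located in Noord Brabant".split()

# Join each n-gram into a single multi-word token
bigrams = ["-".join(gram) for gram in ngrams(tokens, 2)]
trigrams = ["-".join(gram) for gram in ngrams(tokens, 3)]

print(bigrams)   # ['Tilburg-University', 'University-is', ...]
print(trigrams)  # ['Tilburg-University-is', 'University-is-located', ...]
```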
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean
## Feature Extraction
]
.right-column[
### Feature search

* Entity extraction
> e.g. extract PEOPLE / EVENTS / DATES / MONETARY VALUES

* Pattern search (`RE`)
> i.e. use [`Regular Expressions`](https://scotch.io/tutorials/an-introduction-to-regex-in-python) to look for patterns

* Term (Dictionary) counting
> i.e. count the number of times a term occurs
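For entity extraction, a minimal sketch using `Spacy` (assuming the small English model has been installed):

```python
import spacy

# Load the small English model (install once with:
# python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple paid $3 billion for Beats in May 2014.")

# Each recognized entity carries a label such as ORG, MONEY, or DATE
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Apple', 'ORG'), ('$3 billion', 'MONEY'), ('Beats', 'ORG'), ('May 2014', 'DATE')]
```
]
---
class: tocslide
.left-column[
## Process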
& Clean ## Feature Extraction ] .right-column[ ### Pattern search (`RE`)
**TIP**: Use [Pythex.org](https://pythex.org/) to try out your regular expressions.

Example on Pythex:
click here
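A minimal sketch using Python's built-in `re` module (the pattern and example text are illustrative):

```python
import re

text = "Net income was $1,250 million in 2016 and $1,100 million in 2015."

# Capture every dollar amount that is followed by the word "million"
amounts = re.findall(r"\$([\d,]+) million", text)
print(amounts)  # ['1,250', '1,100']

# Find all four-digit years (non-capturing group for the century)
years = re.findall(r"\b(?:19|20)\d{2}\b", text)
print(years)    # ['2016', '2015']
```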
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[ ### Term (Dictionary) counting
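A minimal sketch using Python's built-in `Counter` (the word list is illustrative; accounting research typically uses established dictionaries such as the Loughran-McDonald word lists):

```python
from collections import Counter

# Illustrative negative word list (a stand-in for a real dictionary)
negative_words = {"loss", "decline", "impairment", "adverse"}

tokens = "the decline in revenue resulted in a loss".split()

# Count how often each dictionary term occurs in the text
counts = Counter(token for token in tokens if token in negative_words)
print(counts)                # Counter({'decline': 1, 'loss': 1})
print(sum(counts.values())) # total count of negative terms: 2
```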
] --- class: tocslide .left-column[ ## Process
& Clean
## Feature Extraction
]
.right-column[
### Text evaluation

* Language
> i.e. detect whether text is English

* Readability
> i.e. use the [`TextStat`](https://github.com/shivam5992/textstat) package to calculate text statistics

* Text similarity
> See the awesome [`FuzzyWuzzy`](https://github.com/seatgeek/fuzzywuzzy) package for details.
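A minimal sketch of the last two items, assuming `textstat` and `fuzzywuzzy` are installed:

```python
import textstat
from fuzzywuzzy import fuzz

text = "The company reported a net loss driven by impairment charges."

# Readability: higher Flesch scores indicate easier-to-read text
print(textstat.flesch_reading_ease(text))

# Similarity: a 0-100 score based on edit distance
print(fuzz.ratio("net loss", "net losses"))          # e.g. 89
print(fuzz.partial_ratio("net loss", "a net loss"))  # e.g. 100
```
]
---
class: tocslide
.left-column[
## Process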
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean
## Feature Extraction
## Represent Numerically
]
.right-column[
### Bag of Words

Also labelled: *frequency based representation*

**Term frequency (TF)**
(Figure taken from: https://web.stanford.edu/~jurafsky/slp3/6.pdf)
] --- class: tocslide .left-column[ ## Process
& Clean
## Feature Extraction
## Represent Numerically
]
.right-column[
### Term frequency (TF) example:

> [1] "The sky is blue."
> [2] "The sun is bright today."
> [3] "The sun in the sky is bright."
> [4] "We can see the shining sun, the bright sun."
Note: the collection of all text documents is called the *corpus*
(Example taken from: http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)
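A minimal sketch of how this corpus can be turned into a term frequency matrix with `scikit learn`:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the vocabulary (columns)
print(tf.toarray())  # one row per document, one column per term
```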
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[
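### TF-IDF

Term frequency alone overweights words that are common in every document. TF-IDF therefore weights the term frequency by the inverse document frequency; in its standard form:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

where *N* is the number of documents in the corpus and df(*t*) is the number of documents that contain term *t*.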
] -- .right-column-next[
(Figure taken from: https://moz.com/blog/7-advanced-seo-concepts)
] --- class: tocslide .left-column[ ## Process
& Clean
## Feature Extraction
## Represent Numerically
]
.right-column[
### TF-IDF example:

> [1] "The sky is blue."
> [2] "The sun is bright today."
> [3] "The sun in the sky is bright."
> [4] "We can see the shining sun, the bright sun."
(Example taken from: http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)
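The same corpus with TF-IDF weights instead of raw counts, again sketched with `scikit learn`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Terms that occur in many documents (e.g. "the") get low weights,
# distinctive terms (e.g. "blue") get high weights
print(tfidf.toarray().round(2))
```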
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[ ### Word Embeddings Are there alternatives to the frequency based representation?
Yes, meet the new "secret sauce": **word embeddings**! ] -- .right-column-next[
Word embeddings are based on a "prediction based representation". Basic idea: > A word is characterized by the company it keeps:
> 1. A **Ferrari** is a fast car
> 2. A **Lamborghini** is a fast car

Note: the most well-known implementation is `Word2Vec`; one of its training architectures is called the *Continuous Bag of Words* (CBOW) model.
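A minimal sketch using `Gensim` (the toy corpus is illustrative; the `vector_size` argument follows the Gensim 4.x API):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [
    ["a", "ferrari", "is", "a", "fast", "car"],
    ["a", "lamborghini", "is", "a", "fast", "car"],
]

# Train a small Word2Vec model on the corpus
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)

print(model.wv["ferrari"])  # the 50-dimensional vector for "ferrari"
print(model.wv.similarity("ferrari", "lamborghini"))
```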
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### What is Machine Learning?
> A machine learning algorithm is not explicitly programmed.
Instead, the algorithm is trained on input + output data.

Does this sound familiar?
]

--

.right-column-next[
A linear regression is also machine learning!
]

--

.right-column-next[
### Example: sentiment analysis

Traditional method:
> manually create pos/neg word lists

Machine learning method:
> manually classify sentences with a pos/neg score;
> the pos/neg word lists are determined by the algorithm
]
---
class: tocslide
.left-column[
## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Supervised Machine Learning
> Supervised ML algorithms are trained on classified training data. ] -- .right-column-next[
### Where to get training data? 1. Use a naturally classified training set - News categories - Movie reviews - Text books for different levels of English 2. Create your own training set - Manually classify text - Crowdsource a training set ] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Crowdsource training set
It is possible to crowdsource a training set using services like Amazon Mechanical Turk. ] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Supervised Machine Learning: models
Three frequently used models for Supervised ML:

1. Naive Bayes classifier ([sklearn link](http://scikit-learn.org/stable/modules/naive_bayes.html))
2. SVM: Support Vector Machines ([sklearn link](http://scikit-learn.org/stable/modules/svm.html))
3. Decision Trees ([sklearn link](http://scikit-learn.org/stable/modules/tree.html#classification))
]

--

.right-column-next[
**My recommendation?**

Always try multiple models to see which gives you the best results.

* Naive Bayes is good for small samples and quick testing.
* SVM is more sophisticated and generally better for more complex tasks.
* Decision Trees are more intuitive but harder to train.

Regardless of the model: hyperparameter optimization is very important!
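A minimal sketch of such a classifier in `scikit learn`, including a small hyperparameter grid (the training data is purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Purely illustrative training data
texts = ["great strong results", "record profit growth",
         "severe losses reported", "weak declining revenue"] * 5
labels = ["pos", "pos", "neg", "neg"] * 5

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Hyperparameter optimization via an exhaustive grid search
grid = GridSearchCV(pipeline, {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [0.1, 1.0],
}, cv=2)
grid.fit(texts, labels)

print(grid.best_params_)
print(grid.predict(["profit growth was strong"]))  # e.g. ['pos']
```
]
---
class: tocslide
.left-column[
## Process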
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[
(Slide taken from: https://www.slideshare.net/sparktc/hyperparameter-optimization-sven-hafeneger)
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Model Selection and Evaluation
> i.e. how to select the model and hyperparameters?

#### There are two essential metrics in ML:

1. Precision
> High precision → low false positive rate

2. Recall
> High recall → low false negative rate
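In terms of true positives (TP), false positives (FP), and false negatives (FN):

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$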
For details see: [Precision-Recall](http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html)
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Unsupervised Machine Learning
> Unsupervised ML algorithms are trained using only input data. Do unsupervised ML models work for all problems?
No! Usually only for clustering / topic modelling. ] -- .right-column-next[
Examples of unsupervised models: 1. Principal Component Analysis / Factor Analysis 2. .emphasized[Latent Dirichlet Allocation (LDA)] (and Word2Vec) ] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Latent Dirichlet Allocation (LDA)
> Unsupervised topic modelling technique to discover abstract topics from a collection of documents. ] -- .right-column-next[
### LDA procedure

You define the number of topics (*N*) and the other hyperparameters. LDA then assigns each document a vector with *N* topic probabilities.

**Important:** topics are not labeled and there is a degree of randomness
> i.e. running the same model twice can result in different topic labels!
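A minimal sketch using `Gensim` (the tokenized documents are placeholders):

```python
from gensim import corpora
from gensim.models import LdaModel

# Placeholder corpus: each document is a list of cleaned tokens
texts = [
    ["earnings", "increased", "revenue", "growth"],
    ["litigation", "risk", "lawsuit", "settlement"],
    ["earnings", "revenue", "quarter", "growth"],
]

# Map tokens to ids and represent each document as a bag of words
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train LDA with N = 2 topics
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

print(lda.print_topics())  # word distributions per topic
print(lda[corpus[0]])      # topic probabilities for the first document
```
]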
--- class: tocslide .left-column[ ## Process
& Clean
## Feature Extraction
## Represent Numerically
## Machine Learning
## Neural Networks
]
.right-column[
### Neural Networks for NLP

Natural language is very complex, which makes NLP hard. Consider these ambiguous headlines:

1. The Pope's baby steps on gays
2. Scientists study whales from space
3. Boy paralyzed after tumor fights back to gain black belt
(Examples from: http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture1.pdf)
]

--

.right-column-next[
### Deep Neural Networks

DNNs make it possible to model complex phenomena, which promises progress for NLP!

Interested? Check out the Stanford course CS224n ([Syllabus](http://web.stanford.edu/class/cs224n/syllabus.html))!
]
---
class: tocslide
.left-column[
## Closing remarks
]
.right-column[
### Closing remarks

Getting started with Python / NLP can be overwhelming.
This is normal!
**General tips:** 1. Remember, Google is your friend 2. Having a hard time determining your next step?
Try to explicitly formulate what your (sub-)goal is 3. Asking for help?
Avoid the XY problem: [xyproblem.info/](http://xyproblem.info/) 4. Don't get discouraged by the abundance of mathematics ] --- class: tocslide .left-column[ ## Closing remarks ## GitHub repository ] .right-column[ ### GitHub repository Will be communicated in due time. ] --- class: tocslide .left-column[ ## Closing remarks ## GitHub repository ## Assignment ] .right-column[
] --- class: tocslide .left-column[ ## Closing remarks ## GitHub repository ## Assignment ## Questions? ] .right-column[ ### Questions? ] --- class: tocslide .left-column[ ## Closing remarks ## GitHub repository ## Assignment ## Questions? ## Extra ] .right-column[ ### Word Embeddings Can we do better than a Frequency based representation?
Yes, meet the new "secret sauce": **word embeddings**! ] -- .right-column-next[
The most well-known method is called *word2vec*:

> Word2vec creates a prediction-based representation of text with several hundred dimensions using a two-layer neural network.

> Each word in the corpus is assigned a weight on each dimension, resulting in a dense vector.

Note: one of word2vec's training architectures is referred to as the *Continuous Bag of Words* (CBOW) model.
]
---
class: tocslide
.left-column[
## Closing remarks
## GitHub repository
## Assignment
## Questions?
## Extra
]
.right-column[
### Word2Vec example

> Paris – France + Spain = Madrid
]

--
--- class: tocslide .left-column[ ## Closing remarks ## GitHub repository ## Assignment ## Questions? ## Extra ] .right-column[ ### Neural Networks Primer Linear regression in "Neural Network" representation:
] --- class: tocslide .left-column[ ## Closing remarks ## GitHub repository ## Assignment ## Questions? ## Extra ] .right-column[ ### Neural Networks Primer Shallow Neural Network representation:
] -- .right-column-next[ Why?
It allows you to model complex non-linear relationships!
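A minimal sketch of such a shallow network in `scikit learn` (the toy data is purely illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy non-linear data: y = x^2 plus noise
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.1, size=200)

# One hidden layer with 10 units: a shallow neural network
model = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
model.fit(X, y)

print(model.predict([[2.0]]))  # should be close to 4.0
```
]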