class: center, titleslide
# Textual Analysis with Python
# for Accounting Research
##
Ties de Kok
## Tilburg University --- layout: true class: mainlayout --- class: tocslide .left-column[ ## About me ] .right-column[
Personal Website (TiesdeKok.com)
] --- class: tocslide .left-column[ ## Agenda ] .right-column[ ### Goal of this session: 1. Introduce you to the basic concepts of textual analysis 2. Highlight NLP techniques useful for Accounting research 3. Introduce my Python tutorial Notebooks ] -- .right-column-next[ ### What I will **not** do: 1. Focus on the technical and mathematical details 2. Throw buzzwords at you for 1 hour 3. Provide you with a comprehensive literature review ] -- .right-column-next[ ### Slides
I have excluded some slides for the sake of time.
The full presentation will be posted on the ARC platform. ([Link](http://arc.eaa-online.org/ties-de-kok)) ] --- class: tocslide .left-column[ ## Agenda ## Positioning ] .right-column[ ### Where does this session fit into the bigger scheme of NLP?
- Determining the relevance of textual data - Finding sources of textual data - Gathering textual data
.emphasized[Processing textual data]
.emphasized[Analyzing textual data] ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ] .right-column[ ### Many inter-related names and terms:
- Computational Linguistics - Textual Analysis - Text Mining
.emphasized[Natural Language Processing] ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ] .right-column[ ### Which programming language / software to use?
.emphasized[Python] - R - Perl ] -- .right-column-next[
Source: https://stackoverflow.blog/2017/09/06/incredible-growth-python/
] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ] .right-column[ ### Learn Python for Research Want to learn how to use Python?
Take a look at my GitHub repository!
Github.com/TiesdeKok/LearnPythonforResearch
[Python basics](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb) | [Data processing](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/2_handling_data.ipynb) | [Data visualization](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/3_visualizing_data.ipynb) | [Webscraping](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/4_web_scraping.ipynb) ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ## Jupyter ] .right-column[ ### Project Jupyter
Try it in your browser
Install the Notebook
] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ## Jupyter ## NLP Python ] .right-column[ ### NLP Python libraries: my recommendations
**Standard NLP libraries**: 1. [`NLTK`](http://www.nltk.org/) and the higher-level wrapper [`TextBlob`](https://textblob.readthedocs.io/en/dev/) 2. [`spaCy`](https://spacy.io/) and the higher-level wrapper [`Textacy`](https://github.com/chartbeat-labs/textacy)
**Standard machine learning library**: 1. [`scikit-learn`](http://scikit-learn.org/stable/)
**Topic modelling library** *(not covered)*: 1. [`Gensim`](https://github.com/RaRe-Technologies/gensim) ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ## Jupyter ## NLP Python ## NLP Space ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean ] .right-column[
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ] .right-column[ ### Text normalization - Sentence segmentation > i.e. split text up into sentences - Word tokenization > i.e. split a sentence up into tokens (i.e. words) - Entity normalization > i.e. "http://www.google.com" → "URL" - Lemmatization & Stemming > Convert tokens to a base representation
A code sketch follows on the next slide. ] --- exclude: false class: tocslide .left-column[ ## Process
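& Clean ] .right-column[ ### Text normalization: a sketch
*Added example, not from the original deck.* A minimal sketch of these steps with `NLTK`, plus a regex for entity normalization; the example text is made up.

```python
# Sketch: text normalization with NLTK (example text is illustrative)
import re
import nltk
nltk.download("punkt")    # sentence / word tokenizer models
nltk.download("wordnet")  # data for the WordNet lemmatizer

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

text = "The companies reported earnings. See http://www.google.com for details."
text = re.sub(r"http\S+", "URL", text)  # entity normalization: URL -> placeholder

sentences = sent_tokenize(text)       # sentence segmentation
tokens = word_tokenize(sentences[0])  # word tokenization

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
print([lemmatizer.lemmatize(t.lower()) for t in tokens])  # "companies" -> "company"
print([stemmer.stem(t.lower()) for t in tokens])          # "reported"  -> "report"
```
] --- exclude: false class: tocslide .left-column[ ## Process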
& Clean ] .right-column[ ### Language modelling Text has a complex underlying structure that you can tap into. - Part-of-Speech tagging > Identify the "Word Class" of a token (e.g. noun, verb) - Remove stop words > Remove words that don't carry any informational value - Uni-Gram vs. N-Grams > Multi-word token: retain some of the sequential nature ] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ] .right-column[ ### Uni-Gram vs. N-Grams
> Multi-word token: retain some of the sequential nature
"Tilburg University is located in Noord Brabant"
| Unigram    | Bigram             | Trigram               |
|------------|--------------------|-----------------------|
| Tilburg    | Tilburg-University | Tilburg-University-is |
| University | University-is      | University-is-located |
| is         | is-located         | is-located-in         |
| located    | located-in         | located-in-Noord      |
| in         | in-Noord           | in-Noord-Brabant      |
| Noord      | Noord-Brabant      |                       |
| Brabant    |                    |                       |
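A small added illustration of how these n-grams can be generated, e.g. with `NLTK`:

```python
# Sketch: uni-, bi-, and trigrams for the sentence above
from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Tilburg University is located in Noord Brabant")
for n in (1, 2, 3):
    print(["-".join(gram) for gram in ngrams(tokens, n)])
```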
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[ ### Feature search
* Entity extraction > e.g. extract PEOPLE / EVENTS / DATES / MONETARY VALUES * Pattern search (`RE`) > i.e. use [`Regular Expressions`](https://scotch.io/tutorials/an-introduction-to-regex-in-python) to look for patterns * Term (Dictionary) counting > i.e. count the number of times a term occurs
An entity-extraction sketch follows on the next slide. ] --- class: tocslide .left-column[ ## Process
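& Clean ## Feature Extraction ] .right-column[ ### Entity extraction: a sketch
*Added example.* spaCy's pre-trained pipeline tags entities out of the box; the model name and sentence below are illustrative assumptions, not from the original deck.

```python
# Sketch: named-entity extraction with spaCy
# (requires a model, e.g.: python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple paid Tim Cook $12 million in January 2018.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, $12 million MONEY, January 2018 DATE
```
] --- class: tocslide .left-column[ ## Process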
& Clean ## Feature Extraction ] .right-column[ ### Pattern search (`RE`)
**TIP**: Use [Pythex.org](https://pythex.org/) or [Regex101.com](https://regex101.com) to try out your regular expression
Example on Pythex:
click here
Example on Regex101:
click here
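For instance, a pattern like the following (an illustrative example, not the one behind the links above) extracts four-digit years:

```python
# Sketch: extract four-digit years with a regular expression
import re

text = "Net income increased from 2016 to 2017."
years = re.findall(r"\b(?:19|20)\d{2}\b", text)  # non-capturing group keeps full matches
print(years)  # ['2016', '2017']
```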
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[ ### Term (Dictionary) counting
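A minimal added sketch of dictionary counting; the three-word list is a toy stand-in for a real dictionary such as the ones on the next slide.

```python
# Sketch: count dictionary terms in a tokenized text
from collections import Counter
from nltk.tokenize import word_tokenize

negative_words = {"loss", "impairment", "decline"}  # toy stand-in for a real word list
tokens = [t.lower() for t in word_tokenize("The impairment loss caused a decline.")]

hits = Counter(t for t in tokens if t in negative_words)
print(sum(hits.values()), hits)  # 3 Counter({'impairment': 1, 'loss': 1, 'decline': 1})
```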
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[ ### **Accounting Research:** Term (Dictionary) counting 1. Loughran and McDonald (2011, JF)
Positive / Negative dictionaries for financial texts 2. Garcia and Norli (2012, JFE)
Geographic dispersion based on state name mentions 3. Brochet, Loumioti, and Serafeim (2015, RAST)
Count horizon related words in conference calls ] -- .right-column-next[
References:
Loughran, T., & McDonald, B. (2011).
When is a liability not a liability? Textual analysis, dictionaries, and 10‐Ks. The Journal of Finance, 66(1), 35-65.
Garcia, D., & Norli, Ø. (2012).
Geographic dispersion and stock returns. Journal of Financial Economics, 106(3), 547-565.
Brochet, F., Loumioti, M., & Serafeim, G. (2015).
Speaking of the short-term: Disclosure horizon and managerial myopia. Review of Accounting Studies, 20(3), 1122-1163.
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[ ### Text evaluation * Language > i.e. detect whether text is English * Readability > i.e. use the [`TextStat`](https://github.com/shivam5992/textstat) package to calculate text statistics * Text similarity
See the awesome [`FuzzyWuzzy`](https://github.com/seatgeek/fuzzywuzzy) package for details.
A combined code sketch follows on the next slide. ] --- class: tocslide .left-column[ ## Process
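& Clean ## Feature Extraction ] .right-column[ ### Text evaluation: a sketch
*Added example.* The deck does not name a language-detection package; `langdetect` is one common choice, and the example strings below are made up.

```python
# Sketch: language detection, readability, and fuzzy similarity
from langdetect import detect  # assumed package choice for language detection
import textstat
from fuzzywuzzy import fuzz

text = "The company reported a net loss for the fiscal year."

print(detect(text))                          # 'en'
print(textstat.flesch_reading_ease(text))    # readability (higher = easier to read)
print(fuzz.ratio("net loss", "net losses"))  # similarity score in [0, 100]
```
] --- class: tocslide .left-column[ ## Process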
& Clean ## Feature Extraction ] .right-column[ ### **Accounting Research:** Readability measures 1) Li (2008, JAE)
Basic readability metrics (Fog etc.) and earnings
2) Bonsall, Leone, Miller, Rennekamp (2017, JAE)
Proprietary "Plain English" measure ] -- .right-column-next[
References:
Li, F. (2008).
Annual report readability, current earnings, and earnings persistence. Journal of Accounting and Economics, 45(2-3)
Bonsall IV, S. B., Leone, A. J., Miller, B. P., & Rennekamp, K. (2017).
A plain English measure of financial reporting readability. Journal of Accounting and Economics, 63(2-3)
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[ ### **Accounting Research:** Similarity measures 1) Merkley (2014, TAR)
Identify the amount of repetitive R&D information based on similarity
2) Lang and Stice-Lawrence (2015, JAE)
Similarity of financial narratives based on cosine similarity ] -- .right-column-next[
References:
Merkley, K. J. (2014).
Narrative disclosure and earnings performance: Evidence from R&D disclosures. The Accounting Review, 89(2), 725-757.
Lang, M., & Stice-Lawrence, L. (2015).
Textual analysis and international financial reporting: Large sample evidence. Journal of Accounting and Economics, 60(2-3), 110-135.
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[ ### Bag of Words Also labelled: *frequency-based representation* Term frequency (TF)
(Figure taken from: https://web.stanford.edu/~jurafsky/slp3/6.pdf)
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[ ### Term frequency (TF) example:
> [1] "The sky is blue."
> [2] "The sun is bright today."
> [3] "The sun in the sky is bright."
> [4] "We can see the shining sun, the bright sun."
Note: the collection of all text documents is called the *corpus*
(Example taken from: http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)
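Not on the original slide: the corresponding term-frequency matrix takes a few lines with `scikit-learn`.

```python
# Sketch: term-frequency (bag-of-words) matrix for the four documents above
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The sky is blue.",
        "The sun is bright today.",
        "The sun in the sky is bright.",
        "We can see the shining sun, the bright sun."]

vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(docs)        # documents x terms sparse matrix
print(vectorizer.get_feature_names_out())  # get_feature_names() on older versions
print(tf.toarray())
```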
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[
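### TF-IDF
The figure originally shown here is not reproduced; for reference, one common form of the TF-IDF weight (implementations differ in smoothing and normalization) is:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}
% tf(t, d): frequency of term t in document d
% N: number of documents in the corpus; df(t): number of documents containing t
```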
] -- .right-column-next[
(Figure taken from: https://moz.com/blog/7-advanced-seo-concepts)
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[ ### TF-IDF example:
> [1] "The sky is blue."
> [2] "The sun is bright today."
> [3] "The sun in the sky is bright."
> [4] "We can see the shining sun, the bright sun."
(Example taken from: http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)
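An added `scikit-learn` sketch of the same computation; note that `TfidfVectorizer` uses a smoothed IDF by default, so the numbers differ slightly from a hand-worked example.

```python
# Sketch: TF-IDF matrix for the four documents above
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The sky is blue.",
        "The sun is bright today.",
        "The sun in the sky is bright.",
        "We can see the shining sun, the bright sun."]

tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.toarray().round(2))  # frequent words like "the" receive low weights
```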
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[ ### Word Embeddings Are there alternatives to the frequency based representation?
Yes, meet the new "secret sauce": **word embeddings**! ] -- .right-column-next[
Word embeddings are based on a "prediction-based representation". Basic idea: > A word is characterized by the company it keeps:
> 1. A **Ferrari** is a fast car > 2. A **Lamborghini** is a fast car
Notes: the most well-known implementation is `Word2Vec`. One of its training architectures is the *Continuous Bag of Words* (CBOW) model.
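A toy `Gensim` sketch of the idea (the corpus and parameters are illustrative; older Gensim versions use `size=` instead of `vector_size=`):

```python
# Sketch: training word embeddings with Word2Vec on a toy corpus
from gensim.models import Word2Vec

sentences = [["a", "ferrari", "is", "a", "fast", "car"],
             ["a", "lamborghini", "is", "a", "fast", "car"]]

model = Word2Vec(sentences, vector_size=25, window=2, min_count=1)
vec = model.wv["ferrari"]  # the 25-dimensional embedding for "ferrari"
print(model.wv.similarity("ferrari", "lamborghini"))  # with real data: words in similar contexts end up close
```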
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### What is Machine Learning?
> A machine learning algorithm is not explicitly programmed.
Instead, the algorithm is trained based on the input + output data. Does this sound familiar? ] -- .right-column-next[
A linear regression is also machine learning! ] -- .right-column-next[ ### Example: sentiment analysis
Traditional method: manually create pos/neg word lists
Machine learning method: manually classify sentences with a pos/neg score; the pos/neg word lists are then determined by the algorithm ] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Supervised Machine Learning
> Supervised ML algorithms are trained on classified training data. ] -- .right-column-next[
### Where to get training data? 1. Use a naturally classified training set - News categories - Movie reviews - Text books for different levels of English 2. Create your own training set - Manually classify text - Crowdsource a training set
Amazon Mechanical Turk is a great way to get training data! ] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Supervised Machine Learning: models
Three commonly used models for Supervised ML: 1. Naive Bayes classifier ([sklearn link](http://scikit-learn.org/stable/modules/naive_bayes.html)) 2. SVM: Support Vector Machines ([sklearn link](http://scikit-learn.org/stable/modules/svm.html)) 3. Decision Trees ([sklearn link](http://scikit-learn.org/stable/modules/tree.html#classification)) ] -- .right-column-next[ **My recommendation?** Always try multiple models to see which gives you the best results. * Naive Bayes is good for small samples and quick testing. * SVM is more sophisticated, generally better for more complex models. * Decision Trees are more intuitive but harder to train. Regardless of the model:
hyperparameter optimization is very important! ] --- exclude: false class: tocslide .left-column[ ## Process
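To make this concrete, an added toy `scikit-learn` pipeline (the training data is a made-up stand-in for a real labeled set):

```python
# Sketch: bag-of-words features + Naive Bayes text classifier
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["great quarter, earnings up", "weak results, guidance cut",
               "record profit this year", "losses widened again"]
train_labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)
print(clf.predict(["profit up on strong earnings"]))  # ['pos']
```

Swapping in `LinearSVC` or `DecisionTreeClassifier` only changes the last pipeline step.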
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[
(Slide taken from: https://www.slideshare.net/sparktc/hyperparameter-optimization-sven-hafeneger)
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Model Selection and Evaluation
> i.e. how to select the model and hyperparameters? #### There are two essential metrics in ML: 1. Precision > High precision --> low false positive rate 2. Recall > High recall --> low false negative rate
For details see: [Precision-Recall](http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html)
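Both metrics are one function call away in `scikit-learn` (the labels below are illustrative):

```python
# Sketch: precision and recall on toy predictions
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

print(precision_score(y_true, y_pred))  # 0.67: share of predicted positives that are correct
print(recall_score(y_true, y_pred))     # 0.67: share of actual positives that were found
```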
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### **Accounting Research:** Supervised Machine Learning 1) Li (2010, JAR)
Classify tone and content of forward-looking statements using Naïve Bayes
2) Jegadeesh and Wu (2013, JAE)
Term weights for tone words by "training" on abnormal returns ] -- .right-column-next[
References:
Li, F. (2010).
The information content of forward‐looking statements in corporate filings—A naïve Bayesian machine learning approach. Journal of Accounting Research, 48(5), 1049-1102.
Jegadeesh, N., & Wu, D. (2013).
Word power: A new approach for content analysis. Journal of Financial Economics, 110(3), 712-729.
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Unsupervised Machine Learning
> Unsupervised ML algorithms are trained using only input data. Do unsupervised ML models work for all problems?
No! Usually only for clustering / topic modelling. ] -- .right-column-next[
Examples of unsupervised models: 1. Principal Component Analysis / Factor Analysis 2. **Latent Dirichlet Allocation (LDA)** > Unsupervised topic model technique to discover abstract topics from a collection of documents. ] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Latent Dirichlet Allocation (LDA)
> Unsupervised topic model technique to discover abstract topics from a collection of documents. ] -- exclude: false .right-column-next[
### LDA procedure You define the number of topics (*N*) and the other hyperparameters. LDA then assigns each document a vector with *N* topic probabilities. **Important:** topics are not labeled and there is a degree of randomness > i.e. running the same model twice can result in different topic labels! ] --- class: tocslide
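.left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### LDA: a minimal sketch
*Added example.* A toy `Gensim` run (the corpus and `num_topics` are purely illustrative); fixing `random_state` mitigates the randomness noted on the previous slide.

```python
# Sketch: Latent Dirichlet Allocation with Gensim on a toy corpus
from gensim import corpora, models

docs = [["earnings", "profit", "revenue"],
        ["lawsuit", "court", "settlement"],
        ["earnings", "revenue", "guidance"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=1)
print(lda[corpus[0]])  # topic-probability vector for the first document
```
]
--- class: tocslide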
Made with the awesome pyLDAvis package
--- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### **Accounting Research:** Unsupervised Machine Learning 1) Dyer, Lang, and Stice-Lawrence (2017, JAE)
Use LDA to evaluate how the topics of 10-Ks have changed over time 2) Huang et al. (2017, MS)
Thematic content of a large sample of analyst reports using LDA 3) Bird, Karolyi, Ma (2018, SSRN)
Use LDA on 8-K documents to detect strategic misclassification ] -- .right-column-next[
References:
Dyer, T., Lang, M., & Stice-Lawrence, L. (2017).
The evolution of 10-K textual disclosure: Evidence from Latent Dirichlet Allocation. Journal of Accounting and Economics
Huang, A. H., Lehavy, R., Zang, A. Y., & Zheng, R. (2017).
Analyst information discovery and interpretation roles: A topic modeling approach. Management Science.
Bird, A., Karolyi, S. A., & Ma, P. (2018).
Strategic disclosure misclassification. SSRN
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ## Neural Networks ] .right-column[ ### Neural Networks for NLP Natural language is very complex, which makes NLP hard. Consider these ambiguous headlines: 1. The Pope's baby steps on gays 2. Scientists study whales from space 3. Boy paralyzed after tumor fights back to gain black belt
(Examples from: http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture1.pdf)
] -- exclude: false .right-column-next[ ### Deep Neural Networks DNNs make it possible to model complex phenomena, which is promising for NLP! Interested? Check out the Stanford course CS224n ([Syllabus](http://web.stanford.edu/class/cs224n/syllabus.html))! ] --- class: tocslide .left-column[ ## Notebook ] .right-column[ ### How to get started with NLP and Python?
Take a look at my NLP repository!
Github.com/TiesdeKok/Python_NLP_Tutorial
] --- class: tocslide .left-column[ ## Notebook ## Closing remarks ] .right-column[ ### Closing remarks Getting started with Python / NLP can be overwhelming.
This is normal!
**General tips:** 1. Remember, Google is your friend 2. Having a hard time determining your next step?
Try to explicitly formulate what your (sub-)goal is 3. Asking for help?
Avoid the XY problem: [xyproblem.info/](http://xyproblem.info/) 4. Don't get discouraged by the abundance of mathematics ] --- class: tocslide .left-column[ ## Notebook ## Closing
remarks ] .right-column[
Questions?
]