class: center, titleslide
# Textual Analysis with Python
# for Accounting Research
##
Ties de Kok
## Tilburg University --- layout: true class: mainlayout --- class: tocslide .left-column[ ## About me ] .right-column[
Personal Website (TiesdeKok.com)
] --- class: tocslide .left-column[ ## Agenda ] .right-column[ ### Goal of this session: 1. Introduce you to the basic concepts of textual analysis 2. Highlight NLP techniques useful for Accounting research 3. Introduce my Python tutorial Notebooks ] -- .right-column-next[ ### What I will **not** do: 1. Focus on the technical and mathematical details 2. Throw buzzwords at you for 1 hour 3. Provide you with a comprehensive literature review ] -- .right-column-next[ ### Slides
I have excluded some slides for the sake of time.
The full presentation will be posted on the ARC platform. ([Link](http://arc.eaa-online.org/ties-de-kok)) ] --- class: tocslide .left-column[ ## Agenda ## Positioning ] .right-column[ ### Where does this session fit into the bigger scheme of NLP?
- Determining the relevance of textual data - Finding sources of textual data - Gathering textual data
.emphasized[Processing textual data]
.emphasized[Analyzing textual data] ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ] .right-column[ ### Many inter-related names and terms:
- Computational Linguistics - Textual Analysis - Text Mining
.emphasized[Natural Language Processing] ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ] .right-column[ ### Which programming language / software to use?
.emphasized[Python] - R - Perl ] -- .right-column-next[
Source: https://stackoverflow.blog/2017/09/06/incredible-growth-python/
] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ] .right-column[ ### Learn Python for Research Want to learn how to use Python?
Take a look at my GitHub repository!
Github.com/TiesdeKok/LearnPythonforResearch
[Python basics](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb) | [Data processing](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/2_handling_data.ipynb) | [Data visualization](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/3_visualizing_data.ipynb) | [Webscraping](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/4_web_scraping.ipynb) ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ## Jupyter ] .right-column[ ### Project Jupyter
Try it in your browser
Install the Notebook
] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ## Jupyter ## NLP Python ] .right-column[ ### NLP Python libraries: my recommendations
**Standard NLP libraries**: 1. [`NLTK`](http://www.nltk.org/) and the higher-level wrapper [`TextBlob`](https://textblob.readthedocs.io/en/dev/) 2. [`spaCy`](https://spacy.io/) and the higher-level wrapper [`Textacy`](https://github.com/chartbeat-labs/textacy)
**Standard machine learning library**: 1. [`scikit-learn`](http://scikit-learn.org/stable/)
**Topic modelling library** *(not covered)*: 1. [`Gensim`](https://github.com/RaRe-Technologies/gensim) ] --- class: tocslide .left-column[ ## Agenda ## Positioning ## Terminology ## Language ## Jupyter ## NLP Python ## NLP Space ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean ] .right-column[
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ] .right-column[ ### Text normalization - Sentence segmentation > i.e. split text up into sentences - Word tokenization > i.e. split a sentence up into tokens (i.e. words) - Entity normalization > i.e. "http://www.google.com" → "URL" - Lemmatization & Stemming > Convert tokens to a base representation
A code sketch follows on the next slide. ] --- exclude: false class: tocslide .left-column[ ## Process
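& Clean ] .right-column[ ### Text normalization: a sketch
*Added example, not from the original deck.* A minimal sketch of these steps with `NLTK`, plus a regex for entity normalization; the example text is made up.

```python
# Sketch: text normalization with NLTK (example text is illustrative)
import re
import nltk
nltk.download("punkt")    # sentence / word tokenizer models
nltk.download("wordnet")  # data for the WordNet lemmatizer

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

text = "The companies reported earnings. See http://www.google.com for details."
text = re.sub(r"http\S+", "URL", text)  # entity normalization: URL -> placeholder

sentences = sent_tokenize(text)       # sentence segmentation
tokens = word_tokenize(sentences[0])  # word tokenization

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
print([lemmatizer.lemmatize(t.lower()) for t in tokens])  # "companies" -> "company"
print([stemmer.stem(t.lower()) for t in tokens])          # "reported"  -> "report"
```
] --- exclude: false class: tocslide .left-column[ ## Process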
& Clean ] .right-column[ ### Language modelling Text has a complex underlying structure that you can tap into. - Part-of-Speech tagging > Identify the "Word Class" of a token (e.g. noun, verb) - Remove stop words > Remove words that don't carry any informational value - Uni-Gram vs. N-Grams > Multi-word token: retain some of the sequential nature ] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ] .right-column[ ### Uni-Gram vs. N-Grams
> Multi-word token: retain some of the sequential nature
"Tilburg University is located in Noord Brabant"
| Unigram    | Bigram             | Trigram               |
|------------|--------------------|-----------------------|
| Tilburg    | Tilburg-University | Tilburg-University-is |
| University | University-is      | University-is-located |
| is         | is-located         | is-located-in         |
| located    | located-in         | located-in-Noord      |
| in         | in-Noord           | in-Noord-Brabant      |
| Noord      | Noord-Brabant      |                       |
| Brabant    |                    |                       |
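A small added illustration of how these n-grams can be generated, e.g. with `NLTK`:

```python
# Sketch: uni-, bi-, and trigrams for the sentence above
from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Tilburg University is located in Noord Brabant")
for n in (1, 2, 3):
    print(["-".join(gram) for gram in ngrams(tokens, n)])
```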
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[ ### Feature search
* Entity extraction > e.g. extract PEOPLE / EVENTS / DATES / MONETARY VALUES * Pattern search (`RE`) > i.e. use [`Regular Expressions`](https://scotch.io/tutorials/an-introduction-to-regex-in-python) to look for patterns * Term (Dictionary) counting > i.e. count the number of times a term occurs
An entity-extraction sketch follows on the next slide. ] --- class: tocslide .left-column[ ## Process
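& Clean ## Feature Extraction ] .right-column[ ### Entity extraction: a sketch
*Added example.* spaCy's pre-trained pipeline tags entities out of the box; the model name and sentence below are illustrative assumptions, not from the original deck.

```python
# Sketch: named-entity extraction with spaCy
# (requires a model, e.g.: python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple paid Tim Cook $12 million in January 2018.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, $12 million MONEY, January 2018 DATE
```
] --- class: tocslide .left-column[ ## Process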
& Clean ## Feature Extraction ] .right-column[ ### Pattern search (`RE`)
**TIP**: Use [Pythex.org](https://pythex.org/) or [Regex101.com](https://regex101.com) to try out your regular expression
Example on Pythex:
click here
Example on Regex101:
click here
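For instance, a pattern like the following (an illustrative example, not the one behind the links above) extracts four-digit years:

```python
# Sketch: extract four-digit years with a regular expression
import re

text = "Net income increased from 2016 to 2017."
years = re.findall(r"\b(?:19|20)\d{2}\b", text)  # non-capturing group keeps full matches
print(years)  # ['2016', '2017']
```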
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[ ### Term (Dictionary) counting
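A minimal added sketch of dictionary counting; the three-word list is a toy stand-in for a real dictionary such as the ones on the next slide.

```python
# Sketch: count dictionary terms in a tokenized text
from collections import Counter
from nltk.tokenize import word_tokenize

negative_words = {"loss", "impairment", "decline"}  # toy stand-in for a real word list
tokens = [t.lower() for t in word_tokenize("The impairment loss caused a decline.")]

hits = Counter(t for t in tokens if t in negative_words)
print(sum(hits.values()), hits)  # 3 Counter({'impairment': 1, 'loss': 1, 'decline': 1})
```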
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[ ### **Accounting Research:** Term (Dictionary) counting 1. Loughran and McDonald (2011, JF)
Positive / Negative dictionaries for financial texts 2. Garcia and Norli (2012, JFE)
Geographic dispersion based on state name mentions 3. Brochet, Loumioti, and Serafeim (2015, RAST)
Count horizon related words in conference calls ] -- .right-column-next[
References:
Loughran, T., & McDonald, B. (2011).
When is a liability not a liability? Textual analysis, dictionaries, and 10‐Ks. The Journal of Finance, 66(1), 35-65.
Garcia, D., & Norli, Ø. (2012).
Geographic dispersion and stock returns. Journal of Financial Economics, 106(3), 547-565.
Brochet, F., Loumioti, M., & Serafeim, G. (2015).
Speaking of the short-term: Disclosure horizon and managerial myopia. Review of Accounting Studies, 20(3), 1122-1163.
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[ ### Text evaluation * Language > i.e. detect whether text is English * Readability > i.e. use the [`TextStat`](https://github.com/shivam5992/textstat) package to calculate text statistics * Text similarity
See the awesome [`FuzzyWuzzy`](https://github.com/seatgeek/fuzzywuzzy) package for details.
A combined code sketch follows on the next slide. ] --- class: tocslide .left-column[ ## Process
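& Clean ## Feature Extraction ] .right-column[ ### Text evaluation: a sketch
*Added example.* The deck does not name a language-detection package; `langdetect` is one common choice, and the example strings below are made up.

```python
# Sketch: language detection, readability, and fuzzy similarity
from langdetect import detect  # assumed package choice for language detection
import textstat
from fuzzywuzzy import fuzz

text = "The company reported a net loss for the fiscal year."

print(detect(text))                          # 'en'
print(textstat.flesch_reading_ease(text))    # readability (higher = easier to read)
print(fuzz.ratio("net loss", "net losses"))  # similarity score in [0, 100]
```
] --- class: tocslide .left-column[ ## Process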
& Clean ## Feature Extraction ] .right-column[ ### **Accounting Research:** Readability measures 1) Li (2008, JAE)
Basic readability metrics (Fog etc.) and earnings
2) Bonsall, Leone, Miller, Rennekamp (2017, JAE)
Proprietary "Plain English" measure ] -- .right-column-next[
References:
Li, F. (2008).
Annual report readability, current earnings, and earnings persistence. Journal of Accounting and Economics, 45(2-3)
Bonsall IV, S. B., Leone, A. J., Miller, B. P., & Rennekamp, K. (2017).
A plain English measure of financial reporting readability. Journal of Accounting and Economics, 63(2-3)
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ] .right-column[ ### **Accounting Research:** Similarity measures 1) Merkley (2014, TAR)
Identify the amount of repetitive R&D information based on similarity
2) Lang and Stice-Lawrence (2015, JAE)
Similarity of financial narratives based on cosine similarity ] -- .right-column-next[
References:
Merkley, K. J. (2014).
Narrative disclosure and earnings performance: Evidence from R&D disclosures. The Accounting Review, 89(2), 725-757.
Lang, M., & Stice-Lawrence, L. (2015).
Textual analysis and international financial reporting: Large sample evidence. Journal of Accounting and Economics, 60(2-3), 110-135.
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[ ### Bag of Words Also labelled: *frequency-based representation* Term frequency (TF)
(Figure taken from: https://web.stanford.edu/~jurafsky/slp3/6.pdf)
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[ ### Term frequency (TF) example:
> [1] "The sky is blue."
> [2] "The sun is bright today."
> [3] "The sun in the sky is bright."
> [4] "We can see the shining sun, the bright sun."
Note: the collection of all text documents is called the *corpus*
(Example taken from: http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)
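Not on the original slide: the corresponding term-frequency matrix takes a few lines with `scikit-learn`.

```python
# Sketch: term-frequency (bag-of-words) matrix for the four documents above
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The sky is blue.",
        "The sun is bright today.",
        "The sun in the sky is bright.",
        "We can see the shining sun, the bright sun."]

vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(docs)        # documents x terms sparse matrix
print(vectorizer.get_feature_names_out())  # get_feature_names() on older versions
print(tf.toarray())
```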
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[
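### TF-IDF
The figure originally shown here is not reproduced; for reference, one common form of the TF-IDF weight (implementations differ in smoothing and normalization) is:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}
% tf(t, d): frequency of term t in document d
% N: number of documents in the corpus; df(t): number of documents containing t
```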
] -- .right-column-next[
(Figure taken from: https://moz.com/blog/7-advanced-seo-concepts)
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[ ### TF-IDF example:
> [1] "The sky is blue."
> [2] "The sun is bright today."
> [3] "The sun in the sky is bright."
> [4] "We can see the shining sun, the bright sun."
(Example taken from: http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)
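An added `scikit-learn` sketch of the same computation; note that `TfidfVectorizer` uses a smoothed IDF by default, so the numbers differ slightly from a hand-worked example.

```python
# Sketch: TF-IDF matrix for the four documents above
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The sky is blue.",
        "The sun is bright today.",
        "The sun in the sky is bright.",
        "We can see the shining sun, the bright sun."]

tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.toarray().round(2))  # frequent words like "the" receive low weights
```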
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ] .right-column[ ### Word Embeddings Are there alternatives to the frequency based representation?
Yes, meet the new "secret sauce": **word embeddings**! ] -- .right-column-next[
Word embeddings are based on a "prediction-based representation". Basic idea: > A word is characterized by the company it keeps:
> 1. A **Ferrari** is a fast car > 2. A **Lamborghini** is a fast car
Notes: the most well-known implementation is `Word2Vec`. One of its training architectures is the *Continuous Bag of Words* (CBOW) model.
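A toy `Gensim` sketch of the idea (the corpus and parameters are illustrative; older Gensim versions use `size=` instead of `vector_size=`):

```python
# Sketch: training word embeddings with Word2Vec on a toy corpus
from gensim.models import Word2Vec

sentences = [["a", "ferrari", "is", "a", "fast", "car"],
             ["a", "lamborghini", "is", "a", "fast", "car"]]

model = Word2Vec(sentences, vector_size=25, window=2, min_count=1)
vec = model.wv["ferrari"]  # the 25-dimensional embedding for "ferrari"
print(model.wv.similarity("ferrari", "lamborghini"))  # with real data: words in similar contexts end up close
```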
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### What is Machine Learning?
> A machine learning algorithm is not explicitly programmed.
Instead, the algorithm is trained based on the input + output data. Does this sound familiar? ] -- .right-column-next[
A linear regression is also machine learning! ] -- .right-column-next[ ### Example: sentiment analysis
Traditional method: manually create pos/neg word lists
Machine learning method: manually classify sentences with a pos/neg score; the pos/neg word lists are then determined by the algorithm ] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Supervised Machine Learning
> Supervised ML algorithms are trained on classified training data. ] -- .right-column-next[
### Where to get training data? 1. Use a naturally classified training set - News categories - Movie reviews - Text books for different levels of English 2. Create your own training set - Manually classify text - Crowdsource a training set
Amazon Mechanical Turk is a great way to get training data! ] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Supervised Machine Learning: models
Three commonly used models for Supervised ML: 1. Naive Bayes classifier ([sklearn link](http://scikit-learn.org/stable/modules/naive_bayes.html)) 2. SVM: Support Vector Machines ([sklearn link](http://scikit-learn.org/stable/modules/svm.html)) 3. Decision Trees ([sklearn link](http://scikit-learn.org/stable/modules/tree.html#classification)) ] -- .right-column-next[ **My recommendation?** Always try multiple models to see which gives you the best results. * Naive Bayes is good for small samples and quick testing. * SVM is more sophisticated, generally better for more complex models. * Decision Trees are more intuitive but harder to train. Regardless of the model:
hyperparameter optimization is very important! ] --- exclude: false class: tocslide .left-column[ ## Process
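To make this concrete, an added toy `scikit-learn` pipeline (the training data is a made-up stand-in for a real labeled set):

```python
# Sketch: bag-of-words features + Naive Bayes text classifier
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["great quarter, earnings up", "weak results, guidance cut",
               "record profit this year", "losses widened again"]
train_labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)
print(clf.predict(["profit up on strong earnings"]))  # ['pos']
```

Swapping in `LinearSVC` or `DecisionTreeClassifier` only changes the last pipeline step.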
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[
(Slide taken from: https://www.slideshare.net/sparktc/hyperparameter-optimization-sven-hafeneger)
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Model Selection and Evaluation
> i.e. how to select the model and hyperparameters? #### There are two essential metrics in ML: 1. Precision > High precision --> low false positive rate 2. Recall > High recall --> low false negative rate
For details see: [Precision-Recall](http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html)
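Both metrics are one function call away in `scikit-learn` (the labels below are illustrative):

```python
# Sketch: precision and recall on toy predictions
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

print(precision_score(y_true, y_pred))  # 0.67: share of predicted positives that are correct
print(recall_score(y_true, y_pred))     # 0.67: share of actual positives that were found
```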
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### **Accounting Research:** Supervised Machine Learning 1) Li (2010, JAR)
Classify tone and content of forward-looking statements using Naïve Bayes
2) Jegadeesh and Wu (2013, JAE)
Term weights for tone words by "training" on abnormal returns ] -- .right-column-next[
References:
Li, F. (2010).
The information content of forward‐looking statements in corporate filings—A naïve Bayesian machine learning approach. Journal of Accounting Research, 48(5), 1049-1102.
Jegadeesh, N., & Wu, D. (2013).
Word power: A new approach for content analysis. Journal of Financial Economics, 110(3), 712-729.
] --- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Unsupervised Machine Learning
> Unsupervised ML algorithms are trained using only input data. Do unsupervised ML models work for all problems?
No! Usually only for clustering / topic modelling. ] -- .right-column-next[
Examples of unsupervised models: 1. Principal Component Analysis / Factor Analysis 2. **Latent Dirichlet Allocation (LDA)** > Unsupervised topic model technique to discover abstract topics from a collection of documents. ] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### Latent Dirichlet Allocation (LDA)
> Unsupervised topic model technique to discover abstract topics from a collection of documents. ] -- exclude: false .right-column-next[
### LDA procedure You define the number of topics (*N*) and the other hyperparameters. LDA then assigns each document a vector with *N* topic probabilities. **Important:** topics are not labeled and there is a degree of randomness > i.e. running the same model twice can result in different topic labels! ] --- class: tocslide
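.left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### LDA: a minimal sketch
*Added example.* A toy `Gensim` run (the corpus and `num_topics` are purely illustrative); fixing `random_state` mitigates the randomness noted on the previous slide.

```python
# Sketch: Latent Dirichlet Allocation with Gensim on a toy corpus
from gensim import corpora, models

docs = [["earnings", "profit", "revenue"],
        ["lawsuit", "court", "settlement"],
        ["earnings", "revenue", "guidance"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=1)
print(lda[corpus[0]])  # topic-probability vector for the first document
```
]
--- class: tocslide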
Made with the awesome pyLDAvis package
--- class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ] .right-column[ ### **Accounting Research:** Unsupervised Machine Learning 1) Dyer, Lang, and Stice-Lawrence (2017, JAE)
Use LDA to evaluate how the topics of 10-Ks have changed over time 2) Huang et al. (2017, MS)
Thematic content of a large sample of analyst reports using LDA 3) Bird, Karolyi, Ma (2018, SSRN)
Use LDA on 8-K documents to detect strategic misclassification ] -- .right-column-next[
References:
Dyer, T., Lang, M., & Stice-Lawrence, L. (2017).
The evolution of 10-K textual disclosure: Evidence from Latent Dirichlet Allocation. Journal of Accounting and Economics
Huang, A. H., Lehavy, R., Zang, A. Y., & Zheng, R. (2017).
Analyst information discovery and interpretation roles: A topic modeling approach. Management Science.
Bird, A., Karolyi, S. A., & Ma, P. (2018).
Strategic disclosure misclassification. SSRN
] --- exclude: false class: tocslide .left-column[ ## Process
& Clean ## Feature Extraction ## Represent Numerically ## Machine Learning ## Neural Networks ] .right-column[ ### Neural Networks for NLP Natural language is very complex, which makes NLP hard. Consider these ambiguous headlines: 1. The Pope's baby steps on gays 2. Scientists study whales from space 3. Boy paralyzed after tumor fights back to gain black belt
(Examples from: http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture1.pdf)
] -- exclude: false .right-column-next[ ### Deep Neural Networks DNNs make it possible to model complex phenomena, which is promising for NLP! Interested? Check out the Stanford course CS224n ([Syllabus](http://web.stanford.edu/class/cs224n/syllabus.html))! ] --- class: tocslide .left-column[ ## Notebook ] .right-column[ ### How to get started with NLP and Python?
Take a look at my NLP repository!
Github.com/TiesdeKok/Python_NLP_Tutorial
] --- class: tocslide .left-column[ ## Notebook ## Closing remarks ] .right-column[ ### Closing remarks Getting started with Python / NLP can be overwhelming.
This is normal!
**General tips:** 1. Remember, Google is your friend 2. Having a hard time determining your next step?
Try to explicitly formulate what your (sub-)goal is 3. Asking for help?
Avoid the XY problem: [xyproblem.info/](http://xyproblem.info/) 4. Don't get discouraged by the abundance of mathematics ] --- class: tocslide .left-column[ ## Notebook ## Closing
remarks ] .right-column[
Questions?
]