The `>` combinator selects elements that are direct children of a parent:
CSS Selector: `div > p.title`
For a full overview I recommend checking this page:
https://www.w3schools.com/cssref/css_selectors.asp
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
]
.right-column[
#### `SelectorGadget` Chrome extension:
#### Chrome DevTools:
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
]
.right-column[
### Recap: web-scraping a page
**Step 1: determine the URL of the page you need**
> URL = https://www.tiesdekok.com
**Step 2 and Step 3: download and parse the HTML of the webpage**
Note: the `Requests-HTML` library parses the HTML automatically for you.
**Step 4: use CSS Selectors to extract information**
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
## JavaScript
Pages
]
.right-column[
### JavaScript heavy webpages
Some webpages rely heavily on JavaScript to load in data-elements:
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
## JavaScript
Pages
]
.right-column[
### JavaScript heavy webpages
Can we still scrape them?
**Sure, but with a different approach:**
**Option 1:** use browser automation tools
Two primary tools:
1. Use a headless browser (`requests-html` can do this)
2. Use `Selenium` with Chrome bindings
**Option 2:** try to reverse-engineer the HTTP Requests
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
## JavaScript
Pages
]
.right-column[
**Option 1:** use browser automation tools
Use a headless browser (`requests-html` can do this)
Note: first time you run `html.render()` it will download some dependencies.
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
## JavaScript
Pages
]
.right-column[
**Option 1:** use browser automation tools
Use `Selenium` with Chrome bindings
GIF courtesy of the PyWhatsapp GitHub page
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
## JavaScript
Pages
## HTTP
Requests
]
.right-column[
### HTTP Requests
Modern webpages often load data onto the page through background HTTP requests.
**Tip: reverse-engineer the APIs that are used and mimic them!**
]
--
.right-column-next[
### Example:
Let's say we want to get data on the approval rating for Jeff Bezos:
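Once the underlying endpoint is identified (for example, via the browser's network tools), it can be mimicked with the `requests` library. The endpoint and parameters below are hypothetical placeholders, not a real API:

```python
import requests

def get_json_data(endpoint, params=None):
    """Mimic the HTTP request a page makes and return the JSON payload."""
    headers = {'User-Agent': 'Mozilla/5.0'}  # some endpoints expect a browser UA
    r = requests.get(endpoint, params=params, headers=headers)
    r.raise_for_status()  # fail loudly on 4xx/5xx responses
    return r.json()

# Usage (hypothetical endpoint discovered in the network inspector):
# data = get_json_data('https://example.com/api/ratings',
#                      params={'name': 'Jeff Bezos'})
```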
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
## JavaScript
Pages
## HTTP
Requests
]
.right-column[
What do we see in the `Network Sniffer` Chrome extension?
]
--
.right-column-next[
]
---
class: tocslide
.left-column[
## Get
Started!
]
.right-column[
What is next?
## Demonstration:
Watch the demonstration video; see Discord for the link.
## Problems:
Solve tasks in the "web_gathering_problems.ipynb" notebook.
]
---
class: tocslide
exclude: true
.left-column[
## Closing
remarks
]
.right-column[
Questions?
]
---
class: tocslide
exclude: true
.left-column[
## Closing
remarks
## Demonstration
]
.right-column[
Demonstration
]
---
class: tocslide
exclude: true
.left-column[
## Closing
remarks
## Demonstration
## Mini-task
]
.right-column[
## Setup:
1. Download the day 3 materials from GitHub
2. Make sure you have Chrome installed
3. Install the `SelectorGadget` Chrome extension
## Mini-tasks:
**Goal:** Solve tasks in the "web_gathering_tasks.ipynb" notebook.
1. Open a Jupyter Notebook in the `limperg_python_2019` folder
2. Solve the web gathering tasks
Find them in `minitasks > day_3 > web_gathering_tasks.ipynb`
### You will need these notebooks:
]