The `>` combinator selects elements that are direct children of a parent:
CSS Selector: `div > p.title`
For a full overview I recommend checking this page:
https://www.w3schools.com/cssref/css_selectors.asp
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
]
.right-column[
#### `SelectorGadget` Chrome extension:
#### Chrome DevTools:
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
]
.right-column[
### Recap: web-scraping a page
**Step 1: determine the URL of the page you need**
> URL = https://www.tiesdekok.com
**Step 2 and Step 3: download and parse the HTML of the webpage**
Note: the `Requests-HTML` library parses the HTML automatically for you.
**Step 4: use CSS Selectors to extract information**
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
## JavaScript
Pages
]
.right-column[
### JavaScript heavy webpages
Some webpages rely heavily on JavaScript to load in data-elements:
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
## JavaScript
Pages
]
.right-column[
### JavaScript heavy webpages
Can we still scrape them?
**Sure, but with a different approach:**
**Option 1:** use browser automation tools
Two primary tools:
1. Use a headless browser (`requests-html` can do this)
2. Use `Selenium` with Chrome bindings
**Option 2:** try to reverse-engineer the HTTP Requests
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
## JavaScript
Pages
]
.right-column[
**Option 1:** use browser automation tools
Use a headless browser (`requests-html` can do this)
Note: first time you run `html.render()` it will download some dependencies.
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
## JavaScript
Pages
]
.right-column[
**Option 1:** use browser automation tools
Use `Selenium` with Chrome bindings
GIF courtesy of the PyWhatsapp GitHub page
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
## JavaScript
Pages
## HTTP
Requests
]
.right-column[
### HTTP Requests
Modern webpages often load data onto the page through background HTTP requests.
**Tip: reverse-engineer the APIs that are used and mimic them!**
]
--
.right-column-next[
### Example:
Let's say we want to get data on the approval rating for Jeff Bezos:
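Once the underlying endpoint is identified (for example, via the browser's network tools), it can be mimicked with the `requests` library. The endpoint and parameters below are hypothetical placeholders, not a real API:

```python
import requests

def get_json_data(endpoint, params=None):
    """Mimic the HTTP request a page makes and return the JSON payload."""
    headers = {'User-Agent': 'Mozilla/5.0'}  # some endpoints expect a browser UA
    r = requests.get(endpoint, params=params, headers=headers)
    r.raise_for_status()  # fail loudly on 4xx/5xx responses
    return r.json()

# Usage (hypothetical endpoint discovered in the network inspector):
# data = get_json_data('https://example.com/api/ratings',
#                      params={'name': 'Jeff Bezos'})
```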
]
---
class: tocslide
.left-column[
## API
Requests
## Web
Scraping
## JavaScript
Pages
## HTTP
Requests
]
.right-column[
What do we see in the `Network Sniffer` Chrome extension?
]
--
.right-column-next[
]
---
class: tocslide
.left-column[
## Get
Started!
]
.right-column[
What is next?
## Demonstration:
Watch the demonstration video; see Discord for the link.
## Problems:
Solve tasks in the "web_gathering_problems.ipynb" notebook.
]
---
class: tocslide
exclude: true
.left-column[
## Closing
remarks
]
.right-column[
Questions?
]
---
class: tocslide
exclude: true
.left-column[
## Closing
remarks
## Demonstration
]
.right-column[
Demonstration
]
---
class: tocslide
exclude: true
.left-column[
## Closing
remarks
## Demonstration
## Mini-task
]
.right-column[
## Setup:
1. Download the day 3 materials from GitHub
2. Make sure you have Chrome installed
3. Install the `SelectorGadget` Chrome extension
## Mini-tasks:
**Goal:** Solve tasks in the "web_gathering_tasks.ipynb" notebook.
1. Open a Jupyter Notebook in the `limperg_python_2019` folder
2. Solve the web gathering tasks
Find them in `minitasks > day_3 > web_gathering_tasks.ipynb`
### You will need these notebooks:
]