Code 401 Class 17 Reading Notes

Web Scrape with Python in 4 minutes

Web scraping is a technique for automatically accessing and extracting large amounts of information from a website, which can save a huge amount of time and effort.

Important

  1. Read through the website’s Terms and Conditions to understand how you can legally use the data. Most sites prohibit you from using the data for commercial purposes.
  2. Make sure you're not downloading data at too rapid a rate, because this may overload the website. You may also be blocked from the site.

Inspecting with Python Code

First - import the following libraries

import requests
import urllib.request
import time
from bs4 import BeautifulSoup

Second - Set the url to the website and access the site with our requests library.

url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)

Third - parse the html with BeautifulSoup so that we can work with a nicer data structure.

soup = BeautifulSoup(response.text, "html.parser")

Fourth - Use the method .findAll to locate all of our <a> tags

soup.findAll('a')

This returns every element on the page that is an <a> tag.
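
As a rough illustration of this step, the sketch below runs the same kind of search on a small, made-up HTML string instead of the live page, so it works offline. The snippet and the names sample_html, sample_soup, a_tags and hrefs are assumptions for the example. It uses find_all, the current spelling of the method; findAll is an older alias for it.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page
sample_html = """
<html><body>
  <a href="index.html">Home</a>
  <a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
  <p>No link here</p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like collection of every <a> tag, in document order
a_tags = sample_soup.find_all("a")

# Each tag behaves like a dictionary of its attributes
hrefs = [tag["href"] for tag in a_tags]

print(len(a_tags))
print(hrefs)
```

Because the result behaves like a list, a single tag can be picked out by index, which is what the next step does.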

Fifth - Find the link you want. Here it is the <a> tag at index 38 of the results.

one_a_tag = soup.findAll('a')[38]
link = one_a_tag['href']

This stores the path of a text file, data/nyct/turnstile/turnstile_180922.txt, in the variable link.

You can download the file with urllib.request.urlretrieve, which takes two parameters: the file URL and the filename to save it as.

download_url = 'http://web.mta.info/developers/'+ link
urllib.request.urlretrieve(download_url,'./'+link[link.find('/turnstile_')+1:])
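
The slice in the second line is easy to misread, so here is a small sketch of what it does to the example link from above. The variable names are made up for illustration, and posixpath.basename is shown only as an alternative way to get the same filename, not as part of the original code.

```python
import posixpath

link = 'data/nyct/turnstile/turnstile_180922.txt'

# find() returns the index of the '/' that sits just before 'turnstile_'
slash_index = link.find('/turnstile_')

# +1 skips that '/', so the slice keeps only the filename
filename = link[slash_index + 1:]
local_path = './' + filename

# Alternative (not in the original code): take everything after the last '/'
alt_filename = posixpath.basename(link)

print(local_path)
```

If '/turnstile_' is not in the link, find returns -1 and the slice keeps the whole link, so the saved filename would still contain the directory parts.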

Finally, add a line that pauses the code for a second so we are not spamming the website with requests.

time.sleep(1)

The final code looks like this:

# Import libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'http://web.mta.info/developers/turnstile.html'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# To download the whole data set, let's do a for loop through all a tags
line_count = 1 #variable to track what line you are on
for one_a_tag in soup.findAll('a'):  #'a' tags are for links
    if line_count >= 36: #code for text files starts at line 36
        link = one_a_tag['href']
        download_url = 'http://web.mta.info/developers/'+ link
        urllib.request.urlretrieve(download_url,'./'+link[link.find('/turnstile_')+1:]) 
        time.sleep(1) #pause the code for a sec
    #add 1 for next line
    line_count +=1
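
The loop above depends on the data links starting at a fixed position on the page, which stops working if the page layout changes. As one possible alternative (an assumption of these notes, not the method in the code above), the links could be chosen by what their href looks like. In the sketch below, the function name select_data_links and its marker and suffix parameters are hypothetical, with defaults based on the example link shown earlier.

```python
def select_data_links(hrefs, marker="/turnstile_", suffix=".txt"):
    """Keep only hrefs that look like turnstile data files.

    hrefs may contain None for <a> tags that have no href attribute.
    """
    return [
        href for href in hrefs
        if href and marker in href and href.endswith(suffix)
    ]


# Made-up list of hrefs standing in for the ones found on the page
sample_hrefs = [
    "index.html",
    None,
    "data/nyct/turnstile/turnstile_180922.txt",
    "data/nyct/turnstile/turnstile_180915.txt",
    "resources/nyct/turnstile/ts_Field_Description.txt",
]

data_links = select_data_links(sample_hrefs)
print(data_links)
```

With the soup object from the code above, the input list could be built as [tag.get('href') for tag in soup.findAll('a')]. tag.get returns None for an <a> tag without an href, which is why the function skips empty values.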

What is Web Scraping

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. This can be accomplished by using the Hypertext Transfer Protocol or a web browser.

Techniques

Current web scraping solutions range from the ad-hoc, requiring human effort, to fully automated systems that are able to convert entire web sites into structured information, with limitations.


How to scrape websites without getting blocked

Basic Rule: “Be Nice”

An overarching rule to keep in mind for any kind of web scraping: be good and follow a website’s crawling policies.
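
One place a site commonly publishes its crawling policy is a robots.txt file. The sketch below uses Python's standard urllib.robotparser on a made-up robots.txt so that it runs offline; the rules, the example.com URLs and the variable names are all assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
sample_robots = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# Ask whether a generic crawler ("*") may fetch each URL
allowed = parser.can_fetch("*", "http://example.com/developers/turnstile.html")
blocked = parser.can_fetch("*", "http://example.com/private/report.html")

# Seconds the site asks crawlers to wait between requests (None if not stated)
delay = parser.crawl_delay("*")

print(allowed, blocked, delay)
```

Against a real site you would call parser.set_url with the address of the site's robots.txt and then parser.read() instead of parse, and the delay value could be passed to time.sleep between downloads.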

Things I want to know more about
