Code 401 Class 17 Reading Notes
Web Scrape with Python in 4 minutes
Web Scraping is a technique to automatically access and extract large amounts of information from a website, which can save huge amount of time and effort.
Important
- Read through the website’s Terms and Conditions to understand how you can legally use the data. Most sites prohibit you from using the data for commercial proposes.
- Make sure your not downloading data at too rapid a rate because this may break the website. You may potentially be blocked from the site as well.
Inspecting with Python Code
First - import the following libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
Second - Set the url to the website and access the site with our requests library.
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
Third - parse the html with BeautifulSoup so that we can work with a nicer data structure.
soup = BeautifulSoup(response.text, “html.parser”)
Fourth - Use the method .findAll
to locate all of our <a>
tags
soup.findAll('a')
Then you will get every code that has an <a>
tag.
Fifth Found the link you want at line 38
one_a_tag = soup.findAll(‘a’)[38]
link = one_a_tag[‘href’]
This saves a a text file data/nyct/turnstile/turnstile_180922.txt
to the variable link.
You can access these with the provided request.urlretrieve with two parameters: file url and the filename.
download_url = 'http://web.mta.info/developers/'+ link
urllib.request.urlretrieve(download_url,'./'+link[link.find('/turnstile_')+1:])
Finally set a line of code that pauses our code for a second so we are not spamming the website with requests.
time.sleep(1)
Final Code Look:
# Import libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# Set the URL you want to webscrape from
url = 'http://web.mta.info/developers/turnstile.html'
# Connect to the URL
response = requests.get(url)
# Parse HTML and save to BeautifulSoup object¶
soup = BeautifulSoup(response.text, "html.parser")
# To download the whole data set, let's do a for loop through all a tags
line_count = 1 #variable to track what line you are on
for one_a_tag in soup.findAll('a'): #'a' tags are for links
if line_count >= 36: #code for text files starts at line 36
link = one_a_tag['href']
download_url = 'http://web.mta.info/developers/'+ link
urllib.request.urlretrieve(download_url,'./'+link[link.find('/turnstile_')+1:])
time.sleep(1) #pause the code for a sec
#add 1 for next line
line_count +=1
What is Web Scraping
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. This can be accomplished by using the Hypertext Transfer Protocol or a web browser.
- Web scraping a web page involves fetching it and extracting from it.
Techniques
Current web scraping solutions range from the ad-hoc, requiring human effort, to fully automated systems that are able to convert entire web sites into structured information, with limitations.
- Human copy-and-paste: Manually copying and pasting data from a web page into a text file or spreadsheet. Sometimes this is the best practice because technology cannot replace a human’s manual examination and copy-and-paste. Also websites set up barriers to prevent machine automation.
- Text pattern matching: A simple powerful approach to extract information from web pages can be based on the UNIX grep command or regular expression-matching facilities of programming languages.
- HTTP programming: Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming.
- HTML parsing: A program that detects websites with large collections of pages that are typically encoded into similar pages by a common script or template, extracts its content and translates it into a relation form called a wrapper.
- Vertical aggregation: A platform that creates and monitors a multitude of “bots” for specific verticals with no “man in the loop”, and no work related to a specific target site.
- Semantic annotation recognizing: The pages being scraped may embrace metadata or semantic mark ups and annotations, which can be used to locate specific data snippets.
- Computer vision web-page analysis: Machine learning and computer vision that attempt to identify and extract information from webpages like a human being.
How to scrap websites without getting blocked
Basic Rule: “Be Nice”
An overarching rule to keep in mind for any kind of web scraping is BE GOOD AND FOLLOW A WEBSITE’s CRAWLING POLICIES
- Making the crawling slower, and don’t slam the server, will treat website nicely. Adding a delay and payses between clicks will not add so much load ont the website.
- Incorporate random clicks and mouse movements on the page to avoid looking like a bot.
Things I want to know more about
- I would like to practice on the web scraping that we learned yesterday more.
- How to scrape data off of a site without using a json file type site, is that thing?