Python Webscraping Tutorial

About

I followed a video tutorial that uses BeautifulSoup, a Python library for parsing HTML and XML. It builds a parse tree from a page's markup, which you can then search, navigate, and modify to collect data. Specifically, this tutorial uses BeautifulSoup to extract the title and transcript from a text box on a subslikescript.com movie page and exports the data to a txt file.

Installation

Open a terminal on your local machine and run the following commands:

pip install beautifulsoup4
pip install requests
pip install lxml

Now you should have all the packages installed.

Script

Open up a new Python file in a text editor or IDE of choice. I am using VSCode.
Import the required libraries for webscraping by including these lines of code at the top of your file:

from bs4 import BeautifulSoup
import requests

Define the URL you want to scrape. We will be scraping the webpage https://subslikescript.com/movie/Titanic-120338. So, our code will look like this:

website = 'https://subslikescript.com/movie/Titanic-120338'

To request data from the webpage, we use the requests.get() function, which sends an HTTP GET request and returns a Response object. Include the following line of code:

result = requests.get(website)

Store the HTML text of the response in a variable called 'content':

content = result.text
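The two lines above assume the request succeeds. As a minimal sketch of a more defensive version, the hypothetical helper below (fetch_html is not part of the original tutorial, and the 10-second timeout is an arbitrary assumption) raises an error on a 4xx or 5xx response instead of silently returning an error page:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of a page, or raise if the request fails.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text
```

With this in place, `content = fetch_html(website)` could stand in for the two separate lines.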

Now we will parse the content with BeautifulSoup, using lxml as the HTML parser.

soup = BeautifulSoup(content, "lxml")

At any point you can print the parsed HTML to check it, using the following line of code:

print(soup.prettify())

We want to locate the box that contains the title and transcript on the webpage. Thus, we will use the following line of code:

box = soup.find('article', class_='main-article')

The find() method returns the first element that matches, and the find_all() method returns all matching elements. We used the find() method since there is only one box we want to get.

The find() method accepts a tag name and attribute filters such as the ID (id=...) or the class name (class_=...). BeautifulSoup does not support XPath, and CSS selectors are handled by the separate select() and select_one() methods. To find these values, right-click the webpage and press Inspect. The browser's developer tools then show the HTML associated with the content on the page.
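To illustrate the difference between these lookups without touching the network, here is a small sketch on an inline HTML snippet. The markup and the variable names are made up for illustration, and it uses Python's built-in "html.parser" so that it runs even without lxml:

```python
from bs4 import BeautifulSoup

# Made-up markup, loosely shaped like the page structure in this tutorial.
html = """
<article class="main-article">
  <h1>Example Title</h1>
  <p class="line">First line</p>
  <p class="line">Second line</p>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element, or None if nothing matches.
first = soup.find("p", class_="line")
missing = soup.find("div", id="does-not-exist")

# find_all() returns every matching element in a list-like ResultSet.
all_lines = soup.find_all("p", class_="line")

# CSS selectors go through select() / select_one(), not find().
by_css = soup.select_one("article.main-article h1")

print(first.get_text())   # First line
print(len(all_lines))     # 2
print(by_css.get_text())  # Example Title
print(missing)            # None
```

Because find() returns None when nothing matches, it is worth checking the result before calling methods on it when the page structure might change.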

Now that we have found the text box, we can extract the title and transcript using these lines of code:

title = box.find('h1').get_text()
transcript = box.find('div', class_='full-script').get_text(strip=True, separator=' ')

The parameters for the find() method are found by inspecting the page, following the same process described above.

The parameters for 'get_text()' are 'strip=True', which strips whitespace from the beginning and end of each piece of text in the element and drops pieces that are empty, and 'separator=" "', which joins those pieces with a single space instead of running them together. Both are formatting choices and a matter of personal preference.
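The effect of these two parameters is easiest to see side by side. The snippet below is a minimal sketch on made-up markup (the div content is an assumption, not the real transcript) and again uses the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Made-up transcript fragment with <br/> tags and stray whitespace.
html = "<div class='full-script'>  Line one <br/> Line two\n <br/>  Line three  </div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="full-script")

# Default: text pieces are concatenated as-is, whitespace included.
raw = div.get_text()

# strip=True trims each piece; separator=" " joins the pieces with a space.
clean = div.get_text(strip=True, separator=" ")

print(repr(raw))
print(repr(clean))  # 'Line one Line two Line three'
```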

Now we can export the data to a new txt file with the following lines of code:

with open(f'{title}.txt', 'w', encoding="utf-8") as file:
    file.write(transcript)

This creates a new file called Titanic (1997) - full transcript.txt containing the transcript, since that is the title extracted.
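Using the scraped title directly as a file name works for this page, but other titles may contain characters that are not allowed in file names on some systems (for example "/" or ":"). One possible safeguard is sketched below. The safe_filename helper and its replacement rules are assumptions for illustration, not part of the original tutorial:

```python
import re


def safe_filename(title, extension=".txt"):
    """Replace characters that are commonly invalid in file names.

    Hypothetical helper; the character set below is a conservative guess
    that covers the characters Windows rejects plus control characters.
    """
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", title).strip(" .")
    # Fall back to a placeholder name if nothing usable is left.
    return (cleaned or "untitled") + extension


print(safe_filename("Titanic (1997) - full transcript"))
print(safe_filename("Face/Off (1997)"))
```

The file could then be opened with `open(safe_filename(title), 'w', encoding="utf-8")`.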
