## Web Scraping with BeautifulSoup

#### What is Web Scraping?

Web scraping is the process of extracting data from websites using code.

Example use cases:
- Extracting weather data
- Scraping product prices from e-commerce sites
- Collecting research article titles from journals

#### Basic Structure of a Web Page

Web pages are written in HTML, which consists of tags like `<html>`, `<body>`, `<div>`, `<a>`, `<p>`, etc.

For more on HTML tags, see [this GeeksforGeeks overview](https://www.geeksforgeeks.org/html/what-are-html-tags/).
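
To make the tag structure concrete, the sketch below walks a tiny, made-up HTML snippet with Python's built-in `html.parser` module and prints each opening tag indented by its nesting depth. The snippet and the `TagLister` class name are illustrative assumptions, not taken from any real page.

```python
from html.parser import HTMLParser

# A tiny, made-up page used only for illustration
sample_html = """
<html>
  <body>
    <div>
      <p>Hello, <a href="https://example.com">world</a>!</p>
    </div>
  </body>
</html>
"""


class TagLister(HTMLParser):
    """Records every opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []  # list of (depth, tag_name) pairs

    def handle_starttag(self, tag, attrs):
        self.tags.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


lister = TagLister()
lister.feed(sample_html)

# Print the tags as an indented outline
for depth, tag in lister.tags:
    print("  " * depth + f"<{tag}>")
```

The outline shows that tags nest inside one another, which is the structure a scraper navigates when it searches for elements.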

#### Step 1: Install required libraries

This step only needs to be done once. If the libraries are already installed, you can skip it when re-running the notebook.

In [None]:
%pip install requests beautifulsoup4

#### Step 2: Import libraries

In [4]:
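import requests  # sends HTTP requests to fetch web pages
from bs4 import BeautifulSoup  # parses HTML into a searchable tree
import time  # used to pause between requests
import csv  # used to save extracted data to a CSV file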

#### Step 3: Fetch and parse a web page (for example, the Wikipedia page on the Python programming language)

In [5]:
# URL of the web page to scrape
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url)  # download the page
soup = BeautifulSoup(response.text, "html.parser")  # parse the HTML

print(soup.prettify())  # print the parsed HTML with indentation

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Python (programming language) - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited
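
Printing the whole tree produces a lot of output, so it is often enough to inspect a few elements instead. The sketch below does this on a small, made-up HTML string that stands in for `response.text`, so it runs without network access. The names `sample_html` and `sample_soup` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# A small stand-in for response.text (made up for illustration)
sample_html = """
<html>
  <head><title>Python (programming language) - Wikipedia</title></head>
  <body>
    <h1>Python (programming language)</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.string  # text inside <title>
main_heading = sample_soup.find("h1").get_text()  # text of the first <h1>
paragraph_count = len(sample_soup.find_all("p"))  # number of <p> tags

print(page_title)
print(main_heading)
print(paragraph_count)
```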

### Step 4: Extract and save data

#### Extract all links (`<a>` tags)

In [6]:
# Get all <a> tags (links)
links = soup.find_all('a')

for link in links:
    href = link.get('href')
    text = link.text.strip()
    print(f"Text: {text} | Link: {href}")

Text: Jump to content | Link: #bodyContent
Text: Main page | Link: /wiki/Main_Page
Text: Contents | Link: /wiki/Wikipedia:Contents
Text: Current events | Link: /wiki/Portal:Current_events
Text: Random article | Link: /wiki/Special:Random
Text: About Wikipedia | Link: /wiki/Wikipedia:About
Text: Contact us | Link: //en.wikipedia.org/wiki/Wikipedia:Contact_us
Text: Help | Link: /wiki/Help:Contents
Text: Learn to edit | Link: /wiki/Help:Introduction
Text: Community portal | Link: /wiki/Wikipedia:Community_portal
Text: Recent changes | Link: /wiki/Special:RecentChanges
Text: Upload file | Link: /wiki/Wikipedia:File_upload_wizard
Text: Special pages | Link: /wiki/Special:SpecialPages
Text:  | Link: /wiki/Main_Page
Text: Search | Link: /wiki/Special:Search
Text: Donate | Link: https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en
Text: Create account | Link: /w/index.php?title=Special:CreateAccount&returnto=Python+%28programming+language%
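
Many of the `href` values in the output are relative (`/wiki/Main_Page`), protocol-relative (`//en.wikipedia.org/...`), or page anchors (`#bodyContent`). A minimal sketch of turning them into absolute URLs with the standard library's `urljoin` follows. The `hrefs` list is a hand-copied sample of the values shown in the output, plus a `None` to represent an `<a>` tag without an `href`.

```python
from urllib.parse import urljoin

# The page the links were scraped from
base_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# A few href values copied from the output of the link-extraction cell
hrefs = [
    "#bodyContent",
    "/wiki/Main_Page",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
    None,  # link.get('href') returns None when an <a> tag has no href
]

# Skip missing hrefs and resolve the rest against the page URL
absolute_links = [urljoin(base_url, href) for href in hrefs if href]

for link in absolute_links:
    print(link)
```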

#### Extract elements by tag and class

In [7]:
# Find the first <p> tag whose class is "author"; find() returns None if there is no match
author = soup.find('p', class_='author')
if author:
    print("Author:", author.text)
else:
    print("No author tag found.")

No author tag found.
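
The Wikipedia page has no `<p class="author">` element, so the lookup returns nothing. To show what a successful match looks like, here is a sketch on a made-up HTML fragment. The fragment, the class names, and the `demo_` variable names are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment that does contain an author paragraph
demo_html = """
<div class="post">
  <p class="author">Jane Doe</p>
  <p class="summary">A short summary.</p>
</div>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches
demo_author = demo_soup.find("p", class_="author")
author_name = demo_author.get_text(strip=True) if demo_author else None

# select() takes a CSS selector and always returns a list (possibly empty)
summaries = [p.get_text(strip=True) for p in demo_soup.select("p.summary")]

# Looking up a class that does not exist gives None
missing = demo_soup.find("p", class_="date")

print(author_name)
print(summaries)
print(missing)
```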


#### Extract Wikipedia section headers

In [8]:
wiki_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(wiki_url)
soup = BeautifulSoup(response.text, "html.parser")

# Find all <h2> tags, which hold the page's top-level section headings
sections = soup.find_all('h2')

# Print the text of each non-empty heading
for h in sections:
    text = h.text.strip()
    if text:
        print(text)

Contents
History
Design philosophy and features
Syntax and semantics
Code examples
Libraries
Development environments
Implementations
Language Development
API documentation generators
Naming
Popularity
Types of Use
Languages influenced by Python
See also
Notes
References
Further reading
External links


#### Save the sections to CSV

In [9]:
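# Collect the non-empty heading texts
sections_data = [h.text.strip() for h in sections if h.text.strip()]

# Write one heading per row; UTF-8 keeps non-ASCII characters intact
with open("wiki_sections.csv", "w", newline='', encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Sections"])  # header row
    for line in sections_data:
        writer.writerow([line])

print("Saved to wiki_sections.csv")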

Saved to wiki_sections.csv
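
A quick way to confirm that a CSV file was written as intended is to read it back. The sketch below writes a few hypothetical headings (standing in for `sections_data`) to a temporary file, so that no existing file is overwritten, and then reads them back with `csv.reader`. The file name `wiki_sections_check.csv` is an assumption for illustration.

```python
import csv
import os
import tempfile

# Hypothetical heading texts standing in for sections_data
example_sections = ["History", "Syntax and semantics", "See also"]

# Write to a temporary folder so no existing file is overwritten
csv_path = os.path.join(tempfile.mkdtemp(), "wiki_sections_check.csv")

with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Sections"])  # header row
    for section in example_sections:
        writer.writerow([section])

# Read the file back: the first row is the header, the rest are data rows
with open(csv_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

header, data_rows = rows[0], rows[1:]
print(header)
print(data_rows)
```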


#### Extract ALL text (whole page) and save it in a .txt file

This gives all the text, including navigation menus, footers, etc. The saved text can be used for further analysis.


In [10]:
page_text = soup.get_text()  # all text on the page, with the HTML tags removed
print(page_text)

# Save the text from the web page to a .txt file
with open("webpage_text.txt", "w", encoding="utf-8") as file:
    file.write(page_text)





Python (programming language) - Wikipedia

Jump to content

Main menu

Main menu
move to sidebar
hide

Navigation

Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us

Contribute

HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages

Search

Search

Appearance

Donate

Create account

Log in

Personal tools

Donate Create account Log in

Pages for logged out editors learn more

ContributionsTalk

Contents
move to sidebar
hide

(Top)

1
History

2
Design philosophy and features

3
Syntax and semantics

Toggle Syntax and semantics subsection

3.1
Indentation

3.2
Statements and control flow

3.3
Expressions

3.4
Methods

3.5
Typing

3.6
Arithmetic operations

3.7
Function syntax

4
Code examples
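
Text extracted from a whole page usually contains long runs of blank lines and stray tabs. If that gets in the way of later analysis, one possible cleanup is to strip each line and drop the empty ones. The sketch below uses a short, made-up string (`raw_text`) in place of `page_text`.

```python
# A short, made-up stand-in for page_text with the same kind of blank lines
raw_text = "\n\n\nJump to content\n\n\n\n\t\tNavigation\n\t\n\n1\nHistory\n\n\n"

# Strip each line and keep only the ones that still contain text
clean_lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
clean_text = "\n".join(clean_lines)

print(clean_text)
```

The same two lines can be applied to `page_text` before writing it to the `.txt` file.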







#### Step 5: Be courteous (pause between requests to avoid overloading a website)

In [11]:
time.sleep(1)  # pause for 1 second before sending the next request
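
The pause matters most when several pages are fetched in a loop. The sketch below shows one way to structure such a loop. `fetch_politely` and `fake_fetch` are hypothetical helper names, and `fake_fetch` replaces a real download so the example runs without network access.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results


# Stand-in for a real download; in the notebook this could be
# lambda u: requests.get(u).text
def fake_fetch(url):
    return f"<html>content of {url}</html>"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch=fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))
```

Passing the download function in as an argument keeps the delay logic separate from the choice of HTTP library.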

#### Step 6: Check robots.txt

Check the website's robots.txt file to learn which parts of the site (paths) crawlers are allowed to access and which they are not.

In [12]:
# robots.txt lives at the root of the site, not at the article URL
robots_url = "https://en.wikipedia.org/robots.txt"
robots = requests.get(robots_url)
print(robots.text)
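
Besides reading robots.txt by eye, the rules can be checked in code with the standard library's `urllib.robotparser`. The sketch below parses a small, made-up rule set so that it runs offline. The rules, the `example.com` URLs, and the `my-scraper` user-agent name are assumptions for illustration and are not Wikipedia's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real file is served at <site>/robots.txt
example_rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(example_rules.splitlines())  # parse() takes a list of lines

# "my-scraper" is a hypothetical user-agent name
allowed_public = parser.can_fetch("my-scraper", "https://example.com/wiki/Page")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

print(allowed_public)
print(allowed_private)
```

For a live site, `parser.set_url(...)` with the robots.txt address followed by `parser.read()` downloads and parses the file instead of using an inline string.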