This is a simple web scraper that collects real estate agent information from the Sotheby's Realty Seattle website. It grabs details like names, job titles, company info, addresses, and contact numbers and saves them in an organized format.
💡 Why I Made This: I saw a similar project in a job posting on Upwork and thought it would be a fun way to practice web scraping!
This is a web scraper built using Python, Selenium, and BeautifulSoup to extract real estate agent information from the Sotheby's Realty Seattle website. The scraper navigates through multiple pages, extracts agent details (name, position, company, address, contact information), and saves them in structured JSON format.
- Automated navigation using Selenium WebDriver.
- Extracts structured data (name, position, company, address, contacts).
- Handles pagination dynamically.
- Saves results in properly formatted JSON.
- Runs headless (silent, without opening a browser window).
├── sothebys_scraper.py   # Main script
├── requirements.txt      # Python dependencies
├── README.md             # Documentation
└── output.json           # Extracted data
Ensure you have Python 3.x installed. Install required packages:
```
pip install selenium beautifulsoup4 html5lib
```
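Alternatively, install everything via the repo's `requirements.txt`; going by the command above, its contents boil down to:

```
selenium
beautifulsoup4
html5lib
```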
This scraper uses Microsoft Edge WebDriver. Download it from: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Place the `msedgedriver.exe` file in a directory of your choice (or add it to your system PATH).
Modify the `driver_path` in `sothebys_scraper.py` to point to your Edge WebDriver, then run:

```
python sothebys_scraper.py
```
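For example, the assignment near the top of the script might look like this (the path below is just a placeholder for wherever you put the driver):

```python
driver_path = r"C:\WebDriver\msedgedriver.exe"  # placeholder; use your actual path
```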
The scraper follows these steps:
```python
from selenium import webdriver
from selenium.webdriver.edge.options import Options as EdgeOptions
from selenium.webdriver.edge.service import Service

def initialize_webdriver(driver_path):
    # Start Edge via its driver binary, with no visible browser window.
    service = Service(driver_path)
    options = EdgeOptions()
    options.add_argument("--headless")
    return webdriver.Edge(service=service, options=options)
```
- Loads the Edge WebDriver.
- Runs in headless mode (does not open a visible browser window).
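If you want to watch the browser while debugging, a headless toggle is an easy tweak (my addition, not part of the original script):

```python
def initialize_webdriver(driver_path, headless=True):
    # Same as above, but headless mode can be switched off for debugging.
    service = Service(driver_path)
    options = EdgeOptions()
    if headless:
        options.add_argument("--headless")
    return webdriver.Edge(service=service, options=options)
```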
```python
def navigate_to_page(browser, url):
    # Load the given URL; the HTML is read later via browser.page_source.
    browser.get(url)
```
- Opens the Sotheby's Realty agents page (or any paginated URL passed to it).
- The rendered HTML is read afterwards via `browser.page_source`.
```python
import re

from bs4 import BeautifulSoup

def extract_data(browser):
    soup = BeautifulSoup(browser.page_source, 'html5lib')
    agents = []
    # Each agent card has an id matching "Entity_180".
    info_IDs = soup.find_all(id=re.compile("Entity_180"))
    for info in info_IDs:
        text = info.text.strip().split("\n")
        agent_data = {
            "name": text[0].strip(),
            "position": text[1].strip(),
            "company": text[2].strip(),
            "address": " ".join(text[3:6]).strip(),
        }
        # Phone numbers are labelled "M:" (mobile) and "O:" (office).
        contact_info = {}
        for line in text:
            if "M:" in line:
                contact_info["CONTACT M"] = line.split("M:")[1].strip()
            if "O:" in line:
                contact_info["CONTACT O"] = line.split("O:")[1].strip()
        agent_data.update(contact_info)
        agents.append(agent_data)
    # Return this page's agents; full_extraction() below combines all
    # pages and writes output.json once, so pages don't overwrite each other.
    return agents
```
- Finds agent sections in the HTML (elements whose `id` matches `Entity_180`).
- Extracts name, position, company, and address.
- Detects contact numbers dynamically.
- Formats the data into a dictionary per agent.
- Returns the list of agents for the page; the pagination step below writes the combined results to `output.json` in proper JSON format.
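To make the index arithmetic concrete, here is what one agent's text block might split into. The lines are illustrative, reconstructed from the sample output further down; the site's real markup may differ slightly:

```python
text = [
    "Jay Alan",                                  # text[0] -> name
    "BROKER",                                    # text[1] -> position
    "Realogics Sotheby's International Realty",  # text[2] -> company
    "4031 E. Madison Street",                    # text[3]
    "Seattle, WA, 98112",                        # text[4]  joined into "address"
    "United States",                             # text[5]
    "M: +1 206.391.9000",                        # matched by the "M:" contact check
]
```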
```python
import json
from time import sleep

def full_extraction(browser, url):
    # Load the first page to discover how many pages exist.
    navigate_to_page(browser, url)
    sleep(2)
    soup = BeautifulSoup(browser.page_source, 'html5lib')
    # Pagination links end in "-pg"; collect every page number in them.
    links = soup.find_all(href=re.compile("-pg"))
    pg_numbers = re.findall(r'\d+', str(links))  # \d+ handles pages >= 10
    end_page = max(map(int, pg_numbers))
    total_pages = [f"{url}/{x}-pg" for x in range(1, end_page + 1)]
    agents = []
    for page in total_pages:
        navigate_to_page(browser, page)
        sleep(4)
        agents.extend(extract_data(browser))
    # One write at the end keeps output.json valid and complete.
    with open("output.json", "w", encoding="utf-8") as f:
        json.dump(agents, f, indent=4)
```
- Detects pagination links (URLs ending in `-pg`, e.g. `/2-pg`).
- Extracts the total number of pages.
- Iterates through each page, extracting agent data, and saves the combined results to `output.json`.
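Putting it together, the entry point presumably looks something like this; the path and URL are placeholders, not the script's actual values:

```python
if __name__ == "__main__":
    driver_path = r"C:\WebDriver\msedgedriver.exe"  # placeholder path
    url = "https://www.sothebysrealty.com/..."      # placeholder agents-page URL
    browser = initialize_webdriver(driver_path)
    try:
        full_extraction(browser, url)
    finally:
        browser.quit()  # always release the driver, even on errors
```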
After running the script, `output.json` will look like this:
```json
[
    {
        "name": "Jay Alan",
        "position": "BROKER",
        "company": "Realogics Sotheby's International Realty",
        "address": "4031 E. Madison Street Seattle, WA, 98112 United States",
        "CONTACT M": "+1 206.391.9000"
    },
    {
        "name": "George Beasley",
        "position": "BROKER",
        "company": "Realogics Sotheby's International Realty",
        "address": "2715 1st Avenue Seattle, WA, 98121 United States",
        "CONTACT M": "+1 206.617.4758",
        "CONTACT O": "+1 206.538.0730"
    }
]
```
- 🔹 Add error handling for network issues.
- 🔹 Implement multi-threading for faster scraping.
- 🔹 Add a CSV export option.
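For the CSV idea, a minimal sketch could convert the existing `output.json` like this (not implemented in the script yet):

```python
import csv
import json

with open("output.json", encoding="utf-8") as f:
    agents = json.load(f)

# Take the union of keys, since not every agent has both contact numbers.
fields = sorted({key for agent in agents for key in agent})
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(agents)  # missing keys become empty cells
```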
This project is for educational purposes. Do whatever you want with it!
Found a bug? Want to improve the script? Open a pull request!
💬 Have questions? Reach out!
📧 Email: [email protected]