- Create a virtual environment
- `pip install -r requirements.txt`
The crawler has the following 3 components:
`house_list_spider` crawls the house list every day and dumps the results to `output/YYYYMMDD/house_links.csv`.
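A minimal sketch of that daily dump, assuming the link rows have already been collected; only the `output/YYYYMMDD/house_links.csv` layout comes from the description above, while the `dump_house_links` helper and the column names are hypothetical:

```python
import csv
import os
from datetime import date


def dump_house_links(rows, output_root="output"):
    """Write today's crawled house links to output/YYYYMMDD/house_links.csv."""
    day_dir = os.path.join(output_root, date.today().strftime("%Y%m%d"))
    os.makedirs(day_dir, exist_ok=True)
    path = os.path.join(day_dir, "house_links.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        # Column names are illustrative; the real CSV may carry different fields.
        writer = csv.DictWriter(f, fieldnames=["house_id", "url", "price"])
        writer.writeheader()
        writer.writerows(rows)
    return path
```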
`new_house_list_processor` scans the new house_list info after `house_list_spider` finishes and outputs a feed to be consumed by `house_info_spider`. Strategy 1 (a sketch follows this list):

1. Add newly listed houses to the `house_link` table and put them into the feed;
2. For houses with an updated price or that were previously unavailable, merge their info into `house_link` and put them into the feed;
3. For the remaining houses (existing houses without an updated price), do nothing;
4. For available houses not showing up in the latest house_list, put them into the feed.

Strategy 2 differs from strategy 1 only on point 3: it still writes these house links to the feed. Basically, this strategy always requests all available houses.
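A minimal sketch of strategy 1, under the assumption that the `house_link` table can be treated as an in-memory dict keyed by house id; the field names (`house_id`, `price`, `available`) are hypothetical:

```python
def process_house_list(latest_rows, house_link_table):
    """Diff today's house list against house_link and build the feed (strategy 1)."""
    feed = []
    seen_ids = set()
    for row in latest_rows:
        house_id, price = row["house_id"], row["price"]
        seen_ids.add(house_id)
        known = house_link_table.get(house_id)
        if known is None:
            # 1. Newly listed house: add it to house_link and to the feed.
            house_link_table[house_id] = {"price": price, "available": True}
            feed.append(house_id)
        elif known["price"] != price or not known["available"]:
            # 2. Updated price or previously unavailable: merge the new info and re-crawl.
            known["price"] = price
            feed.append(house_id)
        # 3. Otherwise: existing house without a price update, do nothing.
    for house_id, known in house_link_table.items():
        if known["available"] and house_id not in seen_ids:
            # 4. Available house missing from the latest list: re-check it.
            feed.append(house_id)
    return feed
```

Strategy 2 would simply replace the do-nothing branch in step 3 with `feed.append(house_id)`, so that every available house ends up in the feed.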
`house_info_spider` scans the feed generated by `new_house_list_processor` and tries to crawl the raw_html for each house item based on the following strategy (a sketch follows this list):

- If the page is not available, mark it as unavailable in the `house_link` table and update the unavailable_date;
- If the page is available, mark it as available in the `house_link` table and update its price in the `house_price_history` table if necessary; then save the page and run `house_info_extractor` to inject/merge new data into the `house_info` table.
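A minimal sketch of the per-item flow, assuming a plain `requests` fetch; the `db` helpers and the `extract_info` callable (standing in for `house_info_extractor`) are hypothetical, as is treating a 404 as the "not available" signal:

```python
from datetime import date

import requests


def crawl_house_item(house, db, extract_info):
    """Fetch one feed item and apply the house_info_spider strategy above.

    `house` is a dict with hypothetical keys "house_id" and "url"; `db` is any
    object exposing the table helpers used below; `extract_info` stands in for
    house_info_extractor and turns raw HTML into a dict of fields.
    """
    resp = requests.get(house["url"], timeout=30)
    if resp.status_code == 404:
        # Page not available: flag it in house_link and record the unavailable_date.
        db.mark_unavailable(house["house_id"], unavailable_date=date.today())
        return
    resp.raise_for_status()
    # Page available: flag it in house_link.
    db.mark_available(house["house_id"])
    info = extract_info(resp.text)
    if "price" in info:
        # Touch house_price_history only when the price actually changed.
        db.update_price_if_changed(house["house_id"], info["price"], date.today())
    # Save the raw page, then inject/merge the extracted data into house_info.
    db.save_raw_html(house["house_id"], resp.text)
    db.merge_house_info(house["house_id"], info)
```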
We want to write the daily stats to a `house_daily_stats` table so we can extract the most valuable pending houses per geo.
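A minimal sketch of what that aggregation could look like, computed from in-memory rows; every column name (`geo`, `status`, `price`) and the choice to rank "most valuable" by price are assumptions rather than the project's actual schema:

```python
from collections import defaultdict


def top_pending_houses_per_geo(houses, top_n=5):
    """Group pending houses by geo and keep the highest-priced ones per geo.

    `houses` is an iterable of dicts with hypothetical keys "geo", "status",
    and "price"; the result resembles the rows we would write to house_daily_stats.
    """
    by_geo = defaultdict(list)
    for h in houses:
        if h["status"] == "pending":
            by_geo[h["geo"]].append(h)
    return {
        geo: sorted(rows, key=lambda r: r["price"], reverse=True)[:top_n]
        for geo, rows in by_geo.items()
    }
```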