This repository demonstrates web scraping with Node.js libraries: it extracts information about professional opportunities from several websites and stores the results in JSON files. Two libraries, Puppeteer and Cheerio, are used to scrape three example websites.

To run a Puppeteer scraper:
- Open a Node.js file in an IDE (e.g. VS Code)
- Make sure the package.json file is in the same directory as the Node.js file
- In the package.json file, add "dependencies": { "puppeteer": "^19.6.2" }
- Navigate to that directory in the CLI/terminal
- Install Puppeteer by running "npm install puppeteer" in that directory
- Type "node {name}.js" in the terminal, where {name} is the name of the script file
- Expected behaviors:
  - The script launches a Google Chrome browser window showing the corresponding website, then closes it immediately
  - The scraped information is displayed in the terminal window as nested data structures
  - A new JSON file containing all the data is created if one does not already exist in the directory; otherwise, the existing file is overwritten
  - NOTE: each file extracts data from only one website and stores it in its own JSON file.
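
The Puppeteer steps and expected behaviors above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual script: the helper names `saveAsJson` and `scrape`, the placeholder `a` selector, the example URL, and the output file name are all hypothetical. It also assumes `headless: false`, since the behavior described above involves a visible browser window.

```javascript
const fs = require('node:fs');

// Hypothetical helper: write the scraped records to a JSON file.
// writeFileSync creates the file if it is missing and overwrites it otherwise.
function saveAsJson(filePath, records) {
  fs.writeFileSync(filePath, JSON.stringify(records, null, 2));
}

// Hypothetical scraper: collects the text and URL of every link on one page.
async function scrape(url, outFile) {
  // Loaded here so saveAsJson stays usable even without Puppeteer installed.
  const puppeteer = require('puppeteer');

  // Assumption: a visible window is wanted, so headless mode is turned off.
  const browser = await puppeteer.launch({ headless: false });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });

    // The callback runs inside the page; the 'a' selector is a placeholder.
    const records = await page.$$eval('a', (anchors) =>
      anchors.map((a) => ({ name: a.textContent.trim(), link: a.href }))
    );

    console.dir(records, { depth: null }); // show nested data in the terminal
    saveAsJson(outFile, records);
    return records;
  } finally {
    await browser.close(); // close the browser even if scraping fails
  }
}

// Usage (not run automatically; URL and file name are placeholders):
// scrape('https://example.com', 'output.json').catch(console.error);
```

A real script would replace the `a` selector with selectors that match the target website's markup.
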

To run a Cheerio scraper:
- Open a Node.js file in an IDE (e.g. VS Code)
- Make sure the package.json file is in the same directory as the Node.js file
- In the package.json file, add "dependencies": { "cheerio": "^1.0.0-rc.12", "axios": "^1.5.1" }
- Navigate to that directory in the CLI/terminal
- Install Cheerio and Axios by running "npm install cheerio axios" in that directory
- Type "node {name}.js" in the terminal, where {name} is the name of the script file
- Expected behaviors:
  - The scraped information is displayed in the terminal window as nested data structures
  - A new JSON file containing all the data is created if one does not already exist in the directory; otherwise, the existing file is overwritten
  - NOTE: each file extracts data from only one website and stores it in its own JSON file.
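
The Cheerio steps above can be sketched in a similar way. Again, this is only an illustration: the helper names `toAbsolute` and `scrape`, the placeholder `a[href]` selector, and the example URL and file name are assumptions rather than the repository's code. Unlike Puppeteer, no browser is opened; Axios downloads the HTML and Cheerio parses it.

```javascript
const fs = require('node:fs');

// Cheerio returns href attributes exactly as written in the HTML, so
// relative links are resolved against the page URL here.
function toAbsolute(href, baseUrl) {
  return new URL(href, baseUrl).href;
}

// Hypothetical scraper: collects the text and URL of every link on one page.
async function scrape(url, outFile) {
  // Loaded here so toAbsolute stays usable without the packages installed.
  const axios = require('axios');
  const cheerio = require('cheerio');

  const { data: html } = await axios.get(url); // download the page HTML
  const $ = cheerio.load(html);                // parse it for querying

  // The 'a[href]' selector is a placeholder for site-specific selectors.
  const records = $('a[href]')
    .map((i, el) => ({
      name: $(el).text().trim(),
      link: toAbsolute($(el).attr('href'), url),
    }))
    .get(); // convert the Cheerio collection into a plain array

  console.dir(records, { depth: null }); // show nested data in the terminal
  // Creates the file if it is missing, overwrites it otherwise.
  fs.writeFileSync(outFile, JSON.stringify(records, null, 2));
  return records;
}

// Usage (not run automatically; URL and file name are placeholders):
// scrape('https://example.com', 'output.json').catch(console.error);
```

Because Cheerio only sees the HTML returned by the server, this approach suits pages whose content does not depend on client-side JavaScript.
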

Example 1: NASA Jet Propulsion Laboratory Internship: https://www.jpl.nasa.gov/edu/intern/apply
- name --> title of the internship
- link --> application link for the internship
- academic level --> academic level (undergraduate/graduate)
- session --> session (time period) of the internship program

Example 2: Top 142 STEM Scholarships in October 2023: https://scholarships360.org/scholarships/stem-scholarships/
- nameText --> title of the scholarship
- linkText --> application link for the scholarship
- platOfferText --> platform offering the scholarship
- awardText --> scholarship award amount
- deadlineText --> application deadline

Example 3: The Muse Job Search Website: https://www.themuse.com/search/
- titleText --> title of the job
- appLinkText --> application link for the job
- desLinkText --> link to the job description
- NameLocateText --> name and location of the company
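
To illustrate how the field names listed above could map onto the stored JSON, here is a small sketch. The grouping keys (`jpl`, `scholarships360`, `theMuse`), the `toRecord` helper, and the assumption that each JSON file holds an array of such records are hypothetical; the sample values are placeholders, not scraped data.

```javascript
// Field names as listed for each example; the keys of FIELDS are
// hypothetical labels chosen for this sketch.
const FIELDS = {
  jpl: ['name', 'link', 'academic level', 'session'],
  scholarships360: ['nameText', 'linkText', 'platOfferText', 'awardText', 'deadlineText'],
  theMuse: ['titleText', 'appLinkText', 'desLinkText', 'NameLocateText'],
};

// Build one record from an array of scraped values given in field order.
function toRecord(example, values) {
  const fields = FIELDS[example];
  if (!fields) {
    throw new Error(`Unknown example: ${example}`);
  }
  if (values.length !== fields.length) {
    throw new Error(`Expected ${fields.length} values, got ${values.length}`);
  }
  return Object.fromEntries(fields.map((field, i) => [field, values[i]]));
}

// Placeholder values only; a real script would pass scraped strings.
const sample = toRecord('jpl', [
  '<internship title>',
  '<application URL>',
  '<undergraduate or graduate>',
  '<session>',
]);
console.log(JSON.stringify([sample], null, 2));
```
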

Responsible scraping practices followed in this project:
- Only a limited amount of data was extracted, to avoid degrading the performance of the websites' servers.
- The information was extracted for educational purposes only.
- The JSON data contains only public information; no personal or sensitive data was collected.
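
The notes above do not say how the request volume was kept low. One common way to do so, sketched here purely as an assumption, is to cap the number of pages visited and pause between requests. The names `sleep` and `scrapePolitely` and the default limits are hypothetical.

```javascript
// Pause for the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visit at most maxPages URLs, one at a time, pausing between requests.
// fetchPage is any async function that takes a URL and returns scraped data.
async function scrapePolitely(urls, fetchPage, { maxPages = 5, delayMs = 1000 } = {}) {
  const results = [];
  const limited = urls.slice(0, maxPages); // ignore URLs beyond the cap
  for (const [i, url] of limited.entries()) {
    if (i > 0) {
      await sleep(delayMs); // wait before every request except the first
    }
    results.push(await fetchPage(url));
  }
  return results;
}

// Usage (placeholders): pass a list of page URLs and a scraping function.
// scrapePolitely(pageUrls, scrapeOnePage, { maxPages: 3, delayMs: 2000 });
```
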