Hey everyone, I'm working on a really cool project that I shared with some of you on yesterday's call, and I was able to collect some really good information from it. Thank you, everyone. Now I want to scrape some data from Google.
A quick reminder about the project that I am working on:
We are creating a platform that makes it easier for students to choose the right university based on different factors, and for that we need a lot of data and a lot of automation. There are about 5400 universities in the US, and for now I have basic data (names, aliases, etc.) for 1700 of them. What I want to do now is build a scraper that takes the name of a university and searches the following query: "What programs does {name} offer?" The results show all of the programs that university offers. I want that data!
Here's the flow as I understand it:
- Search the query
- Get the URL of the results page (generated after the query is searched).
- Use bs4 to parse the HTML. After that it's easy, since the <div>s use similar names for all the universities.
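To make the flow above concrete, here is a rough sketch of what I have in mind. It only uses the Python standard library so it runs without installs: `html.parser` stands in for bs4 here, and the class name `program-name` is a made-up placeholder, not the real class on the results page. The `https://www.google.com/search?q=...` URL shape is the usual one for a Google search, but I am assuming the results can actually be fetched that way; the sketch does not fetch anything.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

QUERY_TEMPLATE = "What programs does {name} offer?"


def build_search_url(name: str) -> str:
    """Steps 1 and 2: format the query and build the results-page URL."""
    query = QUERY_TEMPLATE.format(name=name)
    return "https://www.google.com/search?" + urlencode({"q": query})


class ProgramParser(HTMLParser):
    """Step 3: collect the text of every <div> that carries a given class."""

    def __init__(self, target_class: str):
        super().__init__()
        self.target_class = target_class
        self.programs = []
        self._depth = 0    # > 0 while we are inside a matching <div>
        self._buffer = []  # text pieces seen inside the current match

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth > 0:
            self._depth += 1  # nested <div> inside a match
        elif self.target_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.programs.append(text)

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_programs(html: str, target_class: str = "program-name") -> list:
    """Return the text of all <div>s whose class list contains target_class.

    "program-name" is a hypothetical default; swap in the real class.
    """
    parser = ProgramParser(target_class)
    parser.feed(html)
    parser.close()
    return parser.programs


if __name__ == "__main__":
    print(build_search_url("Stanford University"))
    sample = '<div class="program-name">Computer Science</div>'
    print(extract_programs(sample))
```

With bs4 the parsing step would shrink to something like a `find_all` on the same class, so the parser class above is just the no-dependency version of that idea.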
I have attached a screenshot too so that you can see in detail what I am talking about.
My questions:
- How can I use crewAI to do this for me?
- Should I use crewAI just to get the URL of the web page, or should I leverage crewAI to give me all of the programs too?
- If I use crewAI, should I use OpenAI or an open-source LLM (maybe 7B) to reduce the cost?
If any of you could give me a roadmap, that would be amazing, because as of now I have to use Selenium just to get the target URL. My main concern right now is the target URL. It's hard to get.