How to leverage CrewAI for scraping? · AI Developer Accelerator

Jul '24 • CrewAI

Hey everyone, I'm working on a really cool project, which I shared with some of you on yesterday's call. I got a lot of useful information from that call, so thank you, everyone. Now I want to scrape some data from Google.

A quick reminder about the project that I am working on:

We are creating a platform that makes it easier for students to choose the right university based on different factors, and for that we need a lot of data and a lot of automation. There are about 5,400 universities in the US, and so far I have basic data (names, ALIAS, etc.) for 1,700 of them. What I want to do now is build a scraper that takes the name of a university and searches the following query: "What programs does {name} offer?" The search results show all of the programs that university offers. I want that data!
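To make the query step concrete, here is a minimal sketch of how I picture it, using only the Python standard library. The function names `build_query` and `build_search_url` are just placeholders I made up, and I'm assuming the search goes through the usual `https://www.google.com/search?q=` URL. This only builds the strings; it does not send any request.

```python
from urllib.parse import quote_plus

# The query template from the post; {name} is replaced by the university name.
QUERY_TEMPLATE = "What programs does {name} offer?"


def build_query(name: str) -> str:
    """Fill the template with a university name (hypothetical helper)."""
    return QUERY_TEMPLATE.format(name=name.strip())


def build_search_url(name: str) -> str:
    """URL-encode the query and append it to the Google search endpoint (assumed)."""
    return "https://www.google.com/search?q=" + quote_plus(build_query(name))


if __name__ == "__main__":
    # Example names only; the real list would come from the 1,700 rows I already have.
    for university in ["Stanford University", "MIT"]:
        print(build_search_url(university))
```

With this, looping over the university names gives one search URL per university, which is the input for the rest of the flow.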

Here's the flow that I understood:

Get the name of the university.
Search the query.
Get the URL of the web page (generated after the query is searched).
Use bs4 to get the HTML. After that it's easy, because the <div> elements use similar names across all the universities.
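For the last step, here is a rough sketch of the extraction, assuming the page HTML has already been downloaded. To keep it runnable without extra installs I used the standard library's `html.parser` instead of bs4; with bs4 the equivalent would be roughly `soup.find_all("div", class_="program-name")`. The class name `program-name` and the helper `extract_programs` are made up for illustration, since the real names have to be read off the actual page.

```python
from html.parser import HTMLParser


class ProgramExtractor(HTMLParser):
    """Collect the text of every <div> that carries a given class."""

    def __init__(self, target_class: str):
        super().__init__()
        self.target_class = target_class
        self.depth = 0       # > 0 while we are inside a matching <div>
        self.buffer = []     # text pieces of the current matching <div>
        self.programs = []   # cleaned text of all matching <div>s

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # Nested <div> inside a match: track it so we close at the right tag.
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.programs.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_programs(html: str, target_class: str = "program-name") -> list:
    """Return the program names found in the HTML (class name is an assumption)."""
    parser = ProgramExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.programs


if __name__ == "__main__":
    sample = (
        '<div class="program-name">Computer Science</div>'
        '<div class="other">ignore me</div>'
        '<div class="program-name card"> Mechanical <b>Engineering</b></div>'
    )
    print(extract_programs(sample))
```

If the <div> names really are the same for every university, the same `target_class` should work for all of them, and only the URL changes per university.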

I have also attached a screenshot so you can see in detail what I am talking about.

My questions:

How can I use CrewAI to do this for me?
Should I use CrewAI just to get the URL of the web page, or should I also leverage it to give me all of the programs?
If I use CrewAI, should I use OpenAI or an open-source LLM (maybe a 7B model) to reduce the cost?

If anyone could give me a roadmap, that would be amazing, because as of now I have to use Selenium just to get the target URL. My main concern right now is the target URL. It's hard to get.
