I have a goal that I think should be simple:
- Get a URL (usually the main URL of a company blog or website)
- Get all the content and throw it into a database, retrieving only the main text and not the menus, headers, and footers.
- Do some LLM magic on the extracted content
My original workflow (not pasting it here since it's all messy by now):
- Form trigger to collect the URL
- Use Firecrawl's `/map` to get a list of links (works nicely)
- Split out and Loop over items, then call `extract/{{ $json.link }}`. This request doesn't return a crawled page, but an extract ID.
- Once I've collected all the extract IDs, I start another Loop to fetch the content (Firecrawl expects you to poll their service and wait for a "completed" status).
- Collect all the results, save them to the DB, and basically be happy. (There's a plain-Python sketch of this flow right after this list.)
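For anyone who wants to see the logic outside n8n, here's a minimal Python sketch of the same map → start extract → poll flow. The endpoint paths and response fields (`/v1/map`, `/v1/extract`, `links`, `id`, `status`, `data`) are how I remember Firecrawl's v1 REST API, so treat them as assumptions and check them against the current docs:

```python
import os
import time

import requests

# Assumed Firecrawl v1 endpoints; verify against the official docs.
API_BASE = "https://api.firecrawl.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}


def map_site(url: str) -> list[str]:
    """Step 2: ask /map for the site's links."""
    resp = requests.post(f"{API_BASE}/map", json={"url": url}, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.json().get("links", [])


def start_extract(link: str) -> str:
    """Step 3: start an extract job; the response carries a job ID, not the page content."""
    resp = requests.post(f"{API_BASE}/extract", json={"urls": [link]}, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.json()["id"]


def wait_for_extract(job_id: str, poll_seconds: float = 5, max_polls: int = 60) -> dict:
    """Step 4: poll the job until Firecrawl reports a "completed" status."""
    for _ in range(max_polls):
        resp = requests.get(f"{API_BASE}/extract/{job_id}", headers=HEADERS, timeout=60)
        resp.raise_for_status()
        body = resp.json()
        if body.get("status") == "completed":
            return body.get("data", {})
        time.sleep(poll_seconds)
    raise TimeoutError(f"extract job {job_id} never completed")


if __name__ == "__main__":
    links = map_site("https://example.com")  # placeholder URL
    job_ids = [start_extract(link) for link in links]
    results = [wait_for_extract(job_id) for job_id in job_ids]
    # ...save `results` to the database and run the LLM step on them.
```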
However, I've found Firecrawl to be extremely unstable in its responses, and I wonder if I'm doing something wrong. I expected this to be straightforward and simple, but their service occasionally returns server errors, and the crawled page often comes back empty.
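In the meantime I'm considering wrapping every HTTP call in a simple retry with exponential backoff to absorb the transient 5xx responses. This is just a plain-Python sketch (not the n8n node settings), and the retry count, backoff, and the "retry anything >= 500" rule are arbitrary choices of mine:

```python
import time

import requests


def request_with_retry(method: str, url: str, *, retries: int = 3, backoff: float = 2.0, **kwargs) -> requests.Response:
    """Retry transient 5xx responses and network errors with exponential backoff."""
    for attempt in range(retries):
        try:
            resp = requests.request(method, url, timeout=60, **kwargs)
            if resp.status_code < 500:
                return resp  # success or a client error worth surfacing as-is
        except requests.RequestException:
            pass  # network hiccup: treat it like a transient failure
        time.sleep(backoff * (2 ** attempt))
    # Last try: let whatever happens propagate to the caller.
    return requests.request(method, url, timeout=60, **kwargs)
```

Even with retries, the empty-page responses would still slip through, so I'd probably also check for an empty body and re-queue those URLs.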
My list has ~150 URLs. That doesn't seem like a very big list to me, but perhaps it is?
Does anyone have experience with lists of this size when working with Firecrawl?
P.S. fwiw, Tavily's site map brought in wayyyyy too many links (hundreds), many of them anchors inside pages, and its extract function doesn't know how to strip the header/footer wrapper.
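If anyone does go the Tavily route, the anchor-link noise is at least easy to clean up before crawling; here's a small standard-library sketch that strips `#fragment` parts and de-duplicates while keeping the original order:

```python
from urllib.parse import urldefrag


def dedupe_links(links: list[str]) -> list[str]:
    """Strip #fragments and drop duplicates, preserving the original order."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for link in links:
        base, _fragment = urldefrag(link)  # "https://x.com/page#faq" -> ("https://x.com/page", "faq")
        if base and base not in seen:
            seen.add(base)
            cleaned.append(base)
    return cleaned
```

The header/footer problem is a different story, though, which is why I'd still rather get the Firecrawl extraction working reliably.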