Trying Pydantic Parsers (Just Structuring Data)

In the spirit of a Facebook group I'm in, "Breaking Into Tech", I figured I'd refactor my project to structure the output at each step of the information extraction.

The main way I'm structuring the output is with Pydantic.
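To give a feel for what "structuring output with Pydantic" means, here is a minimal sketch. The `JobRequirements` model, its field names, and the `parse_llm_output` helper are all made up for illustration and are not taken from the repo. The idea is just that a JSON reply either fits the schema or gets rejected.

```python
import json
from typing import List

from pydantic import BaseModel, Field, ValidationError


class JobRequirements(BaseModel):
    """Hypothetical schema for what gets extracted from one job description."""

    skills: List[str] = Field(default_factory=list, description="e.g. 'communication'")
    technologies: List[str] = Field(default_factory=list, description="e.g. 'AWS'")


def parse_llm_output(raw: str) -> JobRequirements:
    """Validate a JSON string (such as an LLM reply) against the schema."""
    return JobRequirements(**json.loads(raw))


# A well-formed reply parses into typed fields.
good = parse_llm_output('{"skills": ["communication"], "technologies": ["Python", "AWS"]}')
print(good.skills, good.technologies)

# A reply with the wrong shape is rejected instead of slipping through.
try:
    parse_llm_output('{"skills": "communication", "technologies": 42}')
    rejected = False
except ValidationError:
    rejected = True
print("rejected bad reply:", rejected)
```

Building the model from `json.loads(...)` keeps the sketch independent of which Pydantic major version is installed.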

Major kudos to @jxnlco on Twitter for the presentation at AI Summit.

Major kudos, also, to a friend who had a work problem they're trying to solve!

How it works:

Make an API call to an Apify actor to scrape Indeed.com for the 'job_title' I want.

Download the scraped data as a CSV file.

Load the CSV file into a pandas DataFrame.

Extract "Skills and Technologies" from the 'description' field of the DataFrame and store them in a text file.

Combine all of the extracted text.

Create JSON output of the top 5 skills and technologies mentioned.
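The last two steps (combine, then pick the top 5) can be sketched with just the standard library. This is only an illustration under assumptions: the `extracted` list is invented sample data standing in for the per-posting extraction results, the `top_n` helper is hypothetical, and the notebook itself may do this differently (for example, by asking the LLM for the summary).

```python
import json
from collections import Counter
from typing import Dict, List

# Invented sample data: one dict per job posting, standing in for what the
# extraction step produced from each 'description'.
extracted: List[Dict[str, List[str]]] = [
    {"skills": ["communication", "SQL"], "technologies": ["AWS", "Python"]},
    {"skills": ["Communication", "teamwork"], "technologies": ["Python", "Docker"]},
    {"skills": ["sql", "communication"], "technologies": ["python", "aws", "Kubernetes"]},
]


def top_n(postings: List[Dict[str, List[str]]], key: str, n: int = 5) -> List[str]:
    """Return the n terms mentioned in the most postings for the given key."""
    counts: Counter = Counter()
    for posting in postings:
        # Lowercase so "SQL" and "sql" match; a set counts each term once per posting.
        counts.update({term.strip().lower() for term in posting.get(key, [])})
    # Sort by count (descending), then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


summary = {
    "top_skills": top_n(extracted, "skills"),
    "top_technologies": top_n(extracted, "technologies"),
}
print(json.dumps(summary, indent=2))
```

Counting each term once per posting means a single long description that repeats "Python" ten times does not dominate the ranking.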

So, I tried it, and the results aren't much better than what I was getting before. For the next iteration, I'm going to look into adding some kind of classification step for each skill or technology. Skills and technologies can overlap, so the output returns things like "AWS" under both skills and technologies.
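One very simple shape such a classification step could take is a lookup that forces every term into exactly one bucket. This is a sketch of the idea, not something from the notebook: the `KNOWN_TECHNOLOGIES` set and the `classify` function are hypothetical, and a real version might ask the LLM to do the labeling instead of using a hand-written list.

```python
from typing import Dict, List, Set

# Hypothetical hand-maintained list of terms that always count as technologies.
KNOWN_TECHNOLOGIES: Set[str] = {"aws", "python", "docker", "sql"}


def classify(terms: List[str]) -> Dict[str, List[str]]:
    """Put each term in exactly one bucket, dropping case-insensitive repeats."""
    buckets: Dict[str, List[str]] = {"skills": [], "technologies": []}
    seen: Set[str] = set()
    for term in terms:
        key = term.strip().lower()
        if key in seen:
            continue  # already placed, so it cannot land in a second bucket
        seen.add(key)
        bucket = "technologies" if key in KNOWN_TECHNOLOGIES else "skills"
        buckets[bucket].append(term.strip())
    return buckets


mixed = ["AWS", "communication", "aws", "Python", "project management"]
classified = classify(mixed)
print(classified)
```

With this in place, "AWS" can only ever show up under technologies, no matter how many times or in which list the extraction step returned it.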

Anyhow, check out the results in my repo: start2.ipynb is the file.

Here are other links that I found useful along the way:

Learning to utilize Pydantic with PydanticOutputParser

https://docs.llamaindex.ai/en/stable/optimizing/advanced_retrieval/structured_outputs/pydantic_program.html

https://docs.pydantic.dev/latest/concepts/serialization/

Example of a Pydantic class (my inspiration):

https://jxnl.github.io/instructor/examples/action_items/

Learned to utilize the LLMTextCompletionProgram (Pydantic output parser as input)

https://github.com/run-llama/llama_index/blob/031b3518f3f89c5bd2837ecea77028758358536d/llama_index/program/llm_program.py
