Enhancing RAG Search Accuracy with Pre-Processed Contextual Data

I have been using a technique to improve search accuracy in RAG (Retrieval-Augmented Generation). It works like this: I receive data in JSON or XML format, and instead of indexing it directly, I first run it through a pre-processing step.

In that step, I use an AI model to turn each raw record into two pieces of text: a structured sentence and a more explicit, logical natural-language representation. In my experience this improves retrieval accuracy, because semantic search tends to struggle with isolated key-value pairs that carry little context on their own.

Example:

Suppose I have a JSON file with real estate listings:

{
  "id": 12345,
  "type": "Apartment",
  "bedrooms": 3,
  "bathrooms": 2,
  "size": 120,
  "city": "São Paulo",
  "neighborhood": "Vila Mariana",
  "price": 950000
}

The AI processes this data and generates a structured sentence:

"Apartment with 120m², featuring 3 bedrooms and 2 bathrooms, located in Vila Mariana, São Paulo, available for R$ 950,000."

It also creates a logical representation for precise searches:

"A property of type apartment, located in São Paulo, in the Vila Mariana neighborhood, with 3 bedrooms, 2 bathrooms, and a size of 120m², available for sale at R$ 950,000."

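The two outputs above can be reproduced with a small sketch. In my setup an AI model generates this text; the fixed templates below are only a deterministic stand-in so the example runs on its own. The function names (to_summary, to_logical, build_document) are hypothetical, and the templates assume every field in the example record is present.

```python
import json

# The example listing from above, as it might arrive from the data source.
RAW = """
{
  "id": 12345,
  "type": "Apartment",
  "bedrooms": 3,
  "bathrooms": 2,
  "size": 120,
  "city": "São Paulo",
  "neighborhood": "Vila Mariana",
  "price": 950000
}
"""


def to_summary(listing: dict) -> str:
    """Structured sentence (template stand-in for the AI-generated text)."""
    return (
        f"{listing['type']} with {listing['size']}m², featuring "
        f"{listing['bedrooms']} bedrooms and {listing['bathrooms']} bathrooms, "
        f"located in {listing['neighborhood']}, {listing['city']}, "
        f"available for R$ {listing['price']:,}."
    )


def to_logical(listing: dict) -> str:
    """Logical representation that names every attribute explicitly."""
    return (
        f"A property of type {listing['type'].lower()}, "
        f"located in {listing['city']}, "
        f"in the {listing['neighborhood']} neighborhood, "
        f"with {listing['bedrooms']} bedrooms, "
        f"{listing['bathrooms']} bathrooms, "
        f"and a size of {listing['size']}m², "
        f"available for sale at R$ {listing['price']:,}."
    )


def build_document(listing: dict) -> dict:
    """Pair the generated text with the record ID so a search hit can be traced back."""
    text = to_summary(listing) + " " + to_logical(listing)
    return {"id": listing["id"], "text": text}


listing = json.loads(RAW)
document = build_document(listing)
print(document["text"])
```

Keeping the ID next to the generated text is the important part: the text is what gets embedded and searched, and the ID is what links a hit back to the original record.
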
Why This Technique Works

This approach is particularly useful when searching for highly relevant results within RAG. Instead of relying solely on semantic search over the raw data, which may struggle with loosely structured records, we provide clear contextual information in advance.

Once the model returns a relevant result, I can extract the property's ID (12345 in the example). With this ID, I can run a direct database query and retrieve all the details from the original record, so the final answer does not depend on the generated text being complete.
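
A minimal sketch of this lookup step, under several assumptions: the search here is a toy word-overlap ranking standing in for real semantic search, the database is an in-memory SQLite table named listings, and the second listing (ID 67890) is invented only so the search has something to rank against. None of these names come from my actual setup.

```python
import re
import sqlite3

# Hypothetical search index: generated text stored alongside the source record's ID.
INDEX = [
    {"id": 12345, "text": "Apartment with 120m², featuring 3 bedrooms and 2 bathrooms, "
                          "located in Vila Mariana, São Paulo, available for R$ 950,000."},
    {"id": 67890, "text": "House with 300m², featuring 4 bedrooms and 3 bathrooms, "
                          "located in Moema, São Paulo, available for R$ 2,500,000."},
]


def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def search(query, index):
    """Toy stand-in for semantic search: return the entry sharing the most words."""
    query_tokens = tokens(query)
    return max(index, key=lambda doc: len(query_tokens & tokens(doc["text"])))


def make_demo_db():
    """In-memory database holding the original, structured records."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE listings (id INTEGER PRIMARY KEY, type TEXT, bedrooms INTEGER, "
        "bathrooms INTEGER, size INTEGER, city TEXT, neighborhood TEXT, price INTEGER)"
    )
    conn.executemany(
        "INSERT INTO listings VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [
            (12345, "Apartment", 3, 2, 120, "São Paulo", "Vila Mariana", 950000),
            (67890, "House", 4, 3, 300, "São Paulo", "Moema", 2500000),
        ],
    )
    return conn


def fetch_listing(conn, listing_id):
    """Direct lookup by ID; returns the full record as a dict, or None if missing."""
    row = conn.execute(
        "SELECT * FROM listings WHERE id = ?", (listing_id,)
    ).fetchone()
    return dict(row) if row is not None else None


conn = make_demo_db()
hit = search("3 bedroom apartment in Vila Mariana", INDEX)
record = fetch_listing(conn, hit["id"])
print(record)
```

The search only has to find the right ID. Everything shown to the user afterwards comes from the database row, not from the generated sentence.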

In short, the pre-processing step improves search precision because the indexed text carries its own context, and the ID lookup keeps the returned data reliable because the details come straight from the database.
