In scenarios where users send large, complex prompts full of jargon, these requests can be expensive, costing companies millions of dollars a year. To reduce this cost we can use several techniques; two such techniques are pre-filtering and post-filtering.
Pre-filtering is done before the prompt is sent. For example, in a hotel room booking app we know there will be many requests about rooms, so we can create a vector index for rooms ahead of time. Post-filtering is done after the prompt is sent, where we extract only the required entities from the prompt.
This can be done programmatically, but it can also be accomplished in the database. There, we can use vector embeddings to run a semantic search for the needed details and surface the most frequent or most popular ones, making them easier to access.
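As a minimal sketch of the programmatic route, here is a toy post-filter for the hypothetical booking app. The field names and regex patterns are illustrative assumptions, not a production parser:

```python
import re

def extract_entities(prompt: str) -> dict:
    """Post-filter: pull out only the entities the booking backend needs.
    The patterns below are illustrative, not a production-grade parser."""
    entities = {}
    rooms = re.search(r"(\d+)\s+rooms?", prompt, re.IGNORECASE)
    if rooms:
        entities["rooms"] = int(rooms.group(1))
    nights = re.search(r"(\d+)\s+nights?", prompt, re.IGNORECASE)
    if nights:
        entities["nights"] = int(nights.group(1))
    return entities

prompt = "Hi! We are a family of four hoping to book 2 rooms for 3 nights."
print(extract_entities(prompt))  # {'rooms': 2, 'nights': 3}
```

Only the extracted entities, rather than the full conversational prompt, then need to be processed downstream.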
Another technique is called prompt compression. Here, a small, low-resource model that is well suited to compressing prompts takes the large user prompt and produces a compressed, optimized version of it. This helps improve relevance and reduce cost.
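To make the idea concrete, here is a toy extractive compressor that simply drops filler words. It stands in for the small compression model described above (a real system would use a dedicated low-resource model; the filler list is an assumption for illustration):

```python
# Toy stand-in for a small compression model: strip filler words so the
# compressed prompt keeps only the informative tokens.
FILLER = {"please", "could", "kindly", "basically", "actually", "just",
          "really", "very", "that", "the", "a", "an", "so", "um", "uh"}

def compress_prompt(prompt: str) -> str:
    """Keep only words not in the filler set (punctuation-insensitive check)."""
    kept = [w for w in prompt.split() if w.lower().strip(",.!?") not in FILLER]
    return " ".join(kept)

long_prompt = "Please could you just basically summarize the very long report?"
print(compress_prompt(long_prompt))  # shorter prompt, informative words kept
```

The compressed prompt is then sent to the large model in place of the original, reducing the number of billed tokens.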
Vector search is information retrieval based on the context of the data. It is accomplished by comparing the distance between embeddings to find similar or related items in a high-dimensional vector space.
Key applications of vector search are:
In vector search:
- Data is encoded into high-dimensional vector representations.
- Vectors are stored and indexed in a specialized vector database.
- Searches are done by similarity matching between query vectors and database vectors.
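The similarity-matching step above can be sketched with cosine similarity. The three-dimensional "embeddings" and item names below are made up for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Similarity between two embeddings: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Tiny toy "vector database" of item embeddings.
database = {
    "suite":   [0.9, 0.1, 0.0],
    "double":  [0.8, 0.2, 0.1],
    "parking": [0.0, 0.1, 0.9],
}

query = [0.85, 0.15, 0.05]  # embedding of the user's query
best = max(database, key=lambda k: cosine_similarity(query, database[k]))
print(best)  # suite
```

A real vector database performs the same comparison, but uses approximate nearest-neighbor indexes so it scales to millions of vectors.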
RAG (Retrieval-Augmented Generation) is a system design pattern built on information retrieval, such as vector search. It retrieves relevant data based on the user's query, supplements the query with that retrieved data, and feeds both into the model for a better-informed output.
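A minimal sketch of the RAG pattern, using word overlap as a stand-in for vector similarity so the example stays self-contained (the corpus and scoring function are illustrative assumptions):

```python
def overlap(a: str, b: str) -> int:
    """Toy relevance score: shared words (stand-in for vector similarity)."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def build_rag_prompt(query: str, corpus: list, k: int = 2) -> str:
    """Retrieve the k most relevant snippets and prepend them to the query."""
    retrieved = sorted(corpus, key=lambda doc: overlap(query, doc), reverse=True)[:k]
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Check-in time is 3 PM and check-out is 11 AM.",
    "The hotel pool is open from 8 AM to 8 PM.",
    "Breakfast is served daily from 7 AM to 10 AM.",
]
print(build_rag_prompt("What time is check-in at the hotel?", corpus))
```

The augmented prompt, containing both the retrieved context and the original question, is what actually gets sent to the model.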
The benefits of using RAG are: