How to Build a Scalable & Production-Ready RAG System in Ruby for AI Applications

Retrieval-Augmented Generation (RAG) has emerged as a game-changing approach in AI applications, natural language processing (NLP), and machine learning. While Python dominates AI development, Ruby offers an efficient, production-ready alternative for building scalable RAG systems. In this article, we walk through a step-by-step RAG implementation in Ruby, breaking down the key retrieval and generation components and examining how they work together to form a high-performance AI solution.

Understanding RAG Architecture

Before diving into the code, let’s understand what makes our Ruby RAG implementation special:

  • Thread-safe document store: Handles concurrent requests safely
  • Flexible embedding generation: Uses OpenAI’s latest embedding models
  • Multi-provider support: Works with both OpenAI and OpenRouter
  • Efficient text chunking: Implements smart text segmentation with overlap
  • Production-ready API: Built with Sinatra for lightweight deployment

Core Components Deep Dive

Smart Text Chunking

One of the most critical aspects of RAG is proper text chunking. Our implementation uses a sophisticated approach:

def chunk_text(text, max_tokens: 800, overlap: 50)
  # Split on sentence boundaries so chunks end at natural stopping points
  sentences = text.split(/(?<=[.!?])\s+/)
  chunks = []
  current_chunk = []
  current_tokens = 0

  sentences.each do |sentence|
    sentence_tokens = TOKENIZER.encode(sentence).tokens.size

    # Flush the chunk when adding this sentence would exceed the token budget
    # (the empty? guard avoids emitting an empty chunk for an oversized first sentence)
    if current_tokens + sentence_tokens > max_tokens && !current_chunk.empty?
      chunks << build_chunk(current_chunk, overlap) # build_chunk (from the full source) assembles the sentences
      # Note: overlap is counted in sentences here, not tokens
      current_chunk = current_chunk.last(overlap)
      current_tokens = current_chunk.sum { |s| TOKENIZER.encode(s).tokens.size }
    end

    current_chunk << sentence
    current_tokens += sentence_tokens
  end

  chunks << build_chunk(current_chunk, overlap) unless current_chunk.empty?
  chunks
end

This chunking algorithm:

  • Preserves sentence boundaries for natural context
  • Maintains overlap between chunks to preserve context
  • Respects token limits while maximizing content
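As a quick sanity check, here is how the chunker might be wired up. This is a sketch: it assumes TOKENIZER is an instance from the tokenizers gem, and the model name and input file are illustrative (note that overlap counts trailing sentences, not tokens):

require "tokenizers"

# Hypothetical setup: any pretrained tokenizer with an encode(...).tokens API works
TOKENIZER = Tokenizers.from_pretrained("gpt2")

text = File.read("docs/handbook.txt")                  # illustrative input file
chunks = chunk_text(text, max_tokens: 800, overlap: 3) # overlap = sentences shared between adjacent chunks
puts "Produced #{chunks.size} chunks"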

Thread-Safe Document Store

The DocumentStore class implements thread safety using MonitorMixin from Ruby's monitor standard library:

require "monitor"

class DocumentStore
  include MonitorMixin # provides #synchronize backed by a re-entrant Monitor

  def initialize
    super # initializes the Monitor before anything else
    @documents = []
    @index = []
  end

  def add(chunk, embedding)
    synchronize do
      @documents << { chunk: chunk, embedding: embedding }
      @index << embedding
    end
  end
end

This ensures our RAG system can handle multiple concurrent requests without data corruption, which is essential for production deployments.
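The article shows only the ingestion side of the store; the retrieval half would look roughly like the sketch below, assuming a brute-force cosine-similarity scan over the stored embeddings (the search method name and signature are illustrative, not taken from the source):

require "matrix"

class DocumentStore
  # Return the k most similar stored documents as [doc, score] pairs
  def search(query_embedding, k: 3)
    synchronize do
      @documents
        .map { |doc| [doc, cosine_similarity(doc[:embedding], query_embedding)] }
        .max_by(k) { |_, score| score }
    end
  end

  private

  def cosine_similarity(a, b)
    denom = a.norm * b.norm
    denom.zero? ? 0.0 : a.inner_product(b) / denom # guard against zero fallback vectors
  end
end

Reopening the class keeps the sketch self-contained; in the real code this would live alongside add inside the synchronized store. A linear scan is fine for small corpora, which is also why the scaling tips below suggest an external store for larger deployments.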

Intelligent Response Generation

The RAGBrain class orchestrates the response generation with a carefully crafted system prompt:

SYSTEM_PROMPT = <<~SYS.freeze
  You are an assistant specialized in document analysis. Responses must:
  - Be based exclusively on the provided context
  - Be concise (three sentences maximum)
  - Include [1]-style references where applicable
  - Clearly indicate when there is not enough information

  Available context:
  %<context>s
SYS

This prompt engineering ensures:

  • Responses are grounded in provided context
  • Clear attribution through references
  • Concise and focused answers
  • Transparency about information gaps
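To make the %<context>s placeholder concrete, here is a minimal sketch of how RAGBrain might render the prompt and call the chat API. It assumes the ruby-openai client; generate_answer, MODEL_CHAT, and the chunk numbering scheme are illustrative:

def generate_answer(question, retrieved_chunks)
  # Number each chunk so the model can cite it as [1], [2], ...
  context = retrieved_chunks.each_with_index
                            .map { |chunk, i| "[#{i + 1}] #{chunk}" }
                            .join("\n\n")

  response = @client.chat(
    parameters: {
      model: MODEL_CHAT,
      messages: [
        { role: "system", content: format(SYSTEM_PROMPT, context: context) },
        { role: "user", content: question }
      ],
      temperature: 0.2 # keep answers grounded rather than creative
    }
  )

  response.dig("choices", 0, "message", "content")
end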

Production Considerations

Error Handling and Fallbacks

The implementation includes robust error handling throughout:

def embed(text)
  # Return the cached vector when we have already embedded this exact text
  return @cache[text] if @cache.key?(text)

  # Rough character budget: ~4 characters per token (truncate is ActiveSupport's String#truncate)
  clean_text = text.truncate(MAX_TOKENS * 4, omission: "")
  response = @client.embeddings(
    parameters: {
      model: MODEL_EMBEDDINGS,
      input: clean_text
    }
  )

  raise "Embedding error: #{response.dig("error", "message")}" if response["error"]

  # Vector comes from Ruby's matrix standard library
  Vector[*response.dig("data", 0, "embedding")].tap do |vec|
    @cache[text] = vec
  end
rescue => e
  warn "Embedding fallback: #{e.message}"
  # A zero vector matching the embedding dimension keeps downstream math from crashing
  Vector.elements(Array.new(1536, 0.0))
end

Key features include:

  • Caching for performance
  • Fallback vectors for error cases
  • Clear error messaging
  • Token limit handling
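For the caching above to work, @cache is assumed to be a plain Hash keyed by the input text, initialized in the client's constructor, roughly:

def initialize
  # Illustrative setup; token handling mirrors the env configuration shown later
  @client = OpenAI::Client.new(access_token: ENV.fetch("OPENAI_API_KEY"))
  @cache = {} # text => embedding Vector
end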

API Design

The Sinatra API provides a clean interface:

post "/ingest" do
  text = parse_text(request)
  emb_client = EmbeddingClient.new
  
  chunks = emb_client.chunk_text(text)
  chunks.each { |chunk| DOCUMENT_STORE.add(chunk, emb_client.embed(chunk)) }

  { status: "success", chunks: chunks.size }.to_json
end
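The companion query endpoint is not shown in the article; tying the earlier sketches together, it might look like this (the /ask route, the RAG_BRAIN constant, and the k value are all illustrative):

post "/ask" do
  content_type :json
  question = parse_text(request)
  emb_client = EmbeddingClient.new

  # Retrieve the closest chunks, then let the LLM answer from that context
  hits = DOCUMENT_STORE.search(emb_client.embed(question), k: 3)
  answer = RAG_BRAIN.generate_answer(question, hits.map { |doc, _| doc[:chunk] })

  { answer: answer }.to_json
end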

Deployment Tips

To deploy this RAG system in production:

  1. Environment Configuration:
    ENV["OPENAI_API_KEY"] ||= ENV.fetch("OPENAI_ACCESS_TOKEN", nil)
    ENV["OR_ACCESS_TOKEN"] ||= ENV.fetch("OPENROUTER_API_KEY", nil)
  2. Memory Management: The stats endpoint helps monitor system health:
    get "/stats" do
      {
        documents: DOCUMENT_STORE.size,
        memory: "%d MB" % (`ps -o rss= -p #{Process.pid}`.to_i / 1024)
      }.to_json
    end
  3. Scaling Considerations:
    • Use Redis or PostgreSQL for the document store in larger deployments
    • Implement rate limiting for API endpoints (see the Rack::Attack sketch below)
    • Consider background job processing for document ingestion
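For the rate limiting mentioned above, one option is the rack-attack gem; a minimal sketch, with illustrative limits and paths:

require "rack/attack"

# Allow at most 60 requests per IP per minute to the API endpoints
Rack::Attack.throttle("api/ip", limit: 60, period: 60) do |req|
  req.ip if req.path.start_with?("/ingest", "/ask")
end

use Rack::Attack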

Conclusion

Ruby proves to be an excellent choice for building production-ready RAG systems. The language's elegant syntax and robust concurrency support, combined with powerful gems like Sinatra, create a solid foundation for AI applications.

This implementation demonstrates that you don't need complex Python frameworks to build sophisticated RAG systems. Ruby's simplicity and productivity shine through, making it an excellent choice for teams looking to integrate RAG into their existing Ruby applications.

Remember to check the full source code for additional features and optimizations not covered in this article. The modular design makes it easy to extend and customize for your specific needs.

🚀 Want to take your AI-powered applications to the next level? Explore MagmaChat and see how it can enhance your conversational AI solutions!
