⚡ Evaluation-Driven AI Systems • Agentic Workflows • Grounded RAG

Building LLM-powered systems that retrieve, reason, and deliver grounded results.

I build evaluation-driven LLM systems that combine retrieval, reasoning, and orchestration to generate grounded, verifiable outputs from live data. My work focuses on agentic AI, RAG systems, multi-agent workflows, and production-oriented AI applications.

About Me

I build modular AI systems that combine retrieval, orchestration, evaluation, and backend design with real engineering intent.

My work sits at the intersection of LLM applications, semantic retrieval, evaluation-driven AI, and backend engineering. I’m especially interested in systems that use live data, structured reasoning, and grounded generation rather than one-shot prompting.

I’ve built systems involving multi-agent orchestration, RAG pipelines, FAISS-based retrieval, real-time web ingestion, and citation-aware generation. I also care about how these systems evolve toward production through modular design, containerization, CI/CD, observability, and AWS-aligned architecture patterns.

Focus Areas

Evaluation-Driven LLM Systems • Agentic AI Orchestration • Grounded RAG Pipelines • Semantic Retrieval & Ranking • FastAPI & Backend APIs • Vector Search (FAISS / ChromaDB) • CI/CD with GitHub Actions • Containerized AI Workflows • Applied Machine Learning • Data Pipelines & Automation • Observability Thinking • AWS-Aligned System Design

Current Working Projects

Projects in active development that are moving from experimentation toward packaging, evaluation, and deployment-ready structure.

Voice Agent for Customer Support

Current Phase: MVP Definition + Architecture Design

In Design

Actively designing and scoping an AI voice agent for customer support that combines speech-to-text, LLM reasoning, RAG-based retrieval, and text-to-speech for natural, multi-turn phone interactions.

The MVP is being structured around FAQ resolution, order-status lookup, appointment scheduling, and human handoff for complex issues, with emphasis on latency, orchestration state, and reliability.

Current engineering focus includes low-latency turn responses, state management across multi-turn conversations, and interruption-safe flows for real-time voice interactions.

Twilio • LangGraph • RAG • Qdrant / ChromaDB • Groq • Deepgram • ElevenLabs

Proposed Architecture

User Call: Inbound or outbound voice interaction
Telephony Layer: Twilio call routing and session handling
Speech-to-Text: Deepgram streaming transcription
Agent Orchestrator: LangGraph state, reasoning, and tool execution
RAG + Business Tools: Knowledge retrieval, CRM actions, scheduling, order status
Text-to-Speech: ElevenLabs natural voice generation
Caller Response: Natural conversational audio back to the user
Observability & Reliability Layer: Logging, monitoring, retries, fallbacks, human handoff, and latency tracking
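The retries-and-fallback behavior in the reliability layer can be sketched as a small wrapper; the function names, attempt counts, and backoff values here are illustrative assumptions, not part of the actual design.

```python
import time

def with_retries(primary, fallback=None, attempts=3, base_delay=0.5):
    """Call `primary`, retrying with exponential backoff; fall back if all attempts fail."""
    delay = base_delay
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            if attempt == attempts - 1:
                break
            time.sleep(delay)  # back off before the next attempt
            delay *= 2
    if fallback is not None:
        return fallback()  # e.g. route the caller to a human agent
    raise RuntimeError("all attempts failed and no fallback provided")
```

In a voice pipeline, a wrapper like this would sit around the STT, LLM, and TTS calls, with the human-handoff path as the fallback of last resort.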

Customer Complaint Classification (NLP)

Current Phase: Packaging, holdout evaluation, and deployment-oriented structuring

In Progress

Built a large-scale customer complaint intelligence workflow using a financial complaints dataset with approximately 14M source records, framed as a multi-class NLP classification problem for complaint triaging.

Created a 500K sampled experimental setup and benchmarked TF-IDF + Logistic Regression, Linear SVM, and DistilBERT, achieving around 0.80 weighted F1 under practical constraints.

Current focus is on moving from notebook experimentation toward cleaner packaging, holdout-based evaluation, and more deployment-ready structuring for modular inference and future serving workflows.

DistilBERT • TF-IDF • Logistic Regression • Linear SVM • NLP • 500K+ Sample

Proposed Architecture

Complaint Data Source: Large-scale financial complaint dataset
Sampling & Filtering: 500K subset, missing-text removal, short-text filtering
Text Preprocessing: Cleaning, normalization, label preparation, train/validation split
Baseline NLP Pipeline: TF-IDF + Logistic Regression / Linear SVM benchmarking
Transformer Pipeline: DistilBERT tokenization, fine-tuning, evaluation, model comparison
Artifacts & Inference Layer: Saved vectorizer, label encoder, and trained models for reuse
Future Deployment Path: Batch prediction, API serving, complaint triage workflow
Evaluation & Reliability Layer: Weighted F1, Macro F1, holdout testing, class-balance checks, and model-vs-cost tradeoff analysis
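The weighted and macro F1 metrics named in the evaluation layer can be computed as below; this pure-Python sketch mirrors what scikit-learn's `f1_score` does with `average='weighted'` and `average='macro'`, restricted for simplicity to the labels present in the true labels.

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_scores(y_true, y_pred):
    """Return (macro_f1, weighted_f1): unweighted mean vs. support-weighted mean."""
    counts = Counter(y_true)
    f1s = {label: per_class_f1(y_true, y_pred, label) for label in counts}
    macro = sum(f1s.values()) / len(f1s)
    weighted = sum(f1s[l] * counts[l] for l in f1s) / len(y_true)
    return macro, weighted
```

The distinction matters for imbalanced complaint categories: weighted F1 tracks performance on the dominant classes, while macro F1 exposes weak minority-class performance.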

Featured Projects

Selected projects that best represent my AI engineering, retrieval systems, evaluation workflows, and applied machine learning work.

Agentic Research Intelligence Platform

Multi-Agent RAG System

Flagship

Built an end-to-end multi-agent research system with 6 specialized agents (Planner, Search, Scraper, Retriever, Writer, Evaluator) to separate task planning, live source discovery, grounded synthesis, and output evaluation.

Designed a RAG pipeline using FAISS-based semantic retrieval over 5–20 chunks per query, processing 5–8 live external sources per run to improve grounding, traceability, and citation-backed generation.

Implemented an evaluation workflow across 50+ benchmark queries to assess relevance, faithfulness, and completeness, helping refine retrieval quality and improve the reliability of generated outputs.

Design tradeoff: modular multi-agent orchestration improved interpretability and evaluation control, while introducing higher latency than a simpler single-chain workflow.

Optimized the workflow to achieve approximately 15s end-to-end latency and structured the system for Docker-oriented execution and future CI integration.

LangChain • FAISS • MiniLM • Tavily • Streamlit • Docker • GitHub Actions

System Flow

User Query: Research-oriented prompt or question
Planner Agent: Breaks task into subqueries and workflow steps
Search Agent: Finds candidate sources from the web
Scraper Agent: Extracts and cleans source content
Retriever Agent: Indexes chunks and retrieves top relevant context
Writer Agent: Synthesizes grounded answer from retrieved evidence
Evaluator Agent: Checks relevance, completeness, and grounding
Output Layer: Grounded report with modular reasoning, source-backed synthesis, and evaluation-aware refinement
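The Retriever Agent's core step, top-k semantic retrieval over embedded chunks, can be sketched with numpy standing in for FAISS; the vectors here are toy values, and a real index would use FAISS for scale.

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=5):
    """Return indices of the k chunks most similar to the query (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                                 # cosine similarity against every chunk
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]      # unordered top-k, O(n)
    return top[np.argsort(-scores[top])]           # sorted by descending score
```

With 5–20 chunks retrieved per query, the returned indices select the context passed to the Writer Agent for grounded synthesis.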

ResearchGPT — Enterprise RAG AI Assistant

Grounded Question Answering

RAG

Developed a document-grounded Q&A system over 10,000+ chunks using embedding-based retrieval with FAISS indexing.

Improved retrieval relevance from 65% to 82% through chunking refinement, embedding strategy improvements, and retrieval-quality iteration.

Reduced hallucination risk by strengthening retrieval grounding and designing the system with modular separation of retrieval, reasoning, and API-serving layers, alongside containerized deployment workflows.

Emphasized retrieval quality and modular backend design so document answers remained better grounded, easier to serve through APIs, and more reliable for downstream querying.

Key outcome: stronger retrieval quality and cleaner API separation made grounded document question answering more reliable.

FAISS • MiniLM • FastAPI • RAG Evaluation • Docker

System Flow

Document Upload: PDF or document input enters the pipeline
Chunking Layer: Breaks long text into retrieval-ready segments
Embedding Layer: MiniLM vectors represent semantic meaning
FAISS Index: Stores searchable vector representations
User Query: Question is matched against indexed content
Relevant Chunks: Top-k context is retrieved for grounding
LLM Response: Answer is generated from retrieved evidence
Serving Layer: Grounded answer delivery through modular API workflow and containerized deployment path
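The chunking layer in the flow above can be sketched as a simple overlapping splitter; the chunk size and overlap values are illustrative assumptions, not the project's actual settings, and tuning them was part of the retrieval-relevance iteration described earlier.

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks of roughly chunk_size characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap  # each chunk repeats the tail of the previous one
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk, which is one common lever for improving retrieval relevance.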

Job2Mail — AI Cold Outreach Automation

LLM + RAG + Semantic Matching

LLM App

Built an LLM-powered outreach automation system that extracts job descriptions, retrieves relevant project context, and generates personalized cold emails aligned to target roles.

Used ChromaDB + embeddings to improve retrieval precision for generation context, matching job requirements to relevant portfolio projects before response generation.

Designed the application with modular separation of ingestion, retrieval, and generation layers, and structured it for containerized execution with Docker.

Structured the system to support repeatable outreach generation with clearer separation between job parsing, project matching, and email generation, improving maintainability for future extensions.

Key outcome: modular retrieval-first design improved maintainability and made the workflow easier to extend with new targeting logic.

LangChain • Groq • ChromaDB • Embeddings • Streamlit • Docker

System Flow

Job Description Input: Raw role description enters the workflow
Requirement Extraction: Key skills and role signals are parsed from job text
Embedding / Query Layer: Structured query is prepared for semantic retrieval
ChromaDB Retrieval: Relevant project context is retrieved from vector store
Project Matching: Most relevant portfolio work is selected for the role
Context Assembly: Matched evidence is prepared for generation
LLM Email Generation: Cold email is generated using retrieval-backed context
Workflow Goal: Turn raw job descriptions into context-aware, personalized outreach using semantic project matching and modular generation
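The requirement-extraction step can be sketched as a simple skill-keyword matcher; the skill list and function name here are hypothetical examples for illustration, not the system's actual parsing logic (which feeds an LLM-backed pipeline).

```python
import re

# Illustrative skill vocabulary; a real system would use a larger taxonomy or an LLM.
SKILL_KEYWORDS = {"python", "fastapi", "docker", "rag", "langchain", "sql"}

def extract_skills(job_text, skills=SKILL_KEYWORDS):
    """Return the known skills mentioned in a job description, lowercased and sorted."""
    tokens = set(re.findall(r"[a-z0-9+#]+", job_text.lower()))
    return sorted(skills & tokens)
```

The extracted skills then become the query terms for the ChromaDB retrieval step that matches portfolio projects to the role.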

Credit Risk Modeling

Financial Risk Classification

ML

Built a credit risk prediction workflow to identify likely loan defaulters using structured financial data, with emphasis on preprocessing quality, feature selection, and interpretable risk modeling.

Cleaned messy data containing missing values, inconsistent entries, unrealistic records such as negative income values, and extreme outliers, then reduced 30–35 initial features to 20 meaningful variables through feature analysis, correlation review, and business relevance checks.

Applied SMOTE-Tomek to address class imbalance and compared results against a baseline modeling workflow, improving the robustness of default prediction on imbalanced financial data.

Improved model interpretability by removing weak predictors and surfacing less obvious variables that showed stronger association with default risk.

Python • Classification • Feature Engineering • SMOTE-Tomek • Data Cleaning • Finance

Modeling Flow

Raw Financial Data: Borrower and financial records enter the pipeline
Cleaning & Validation: Missing values, inconsistencies, and outliers handled
Feature Selection: 30–35 inputs reduced to 20 meaningful variables
Class Balancing: SMOTE-Tomek applied for imbalanced default classes
Model Training: Risk models trained and compared against baseline setup
Risk Prediction: Default likelihood estimated with more robust signal quality
Evaluation Focus: Interpretability, class imbalance handling, and decision-oriented default risk segmentation
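The class-balancing step used SMOTE-Tomek (in practice via imbalanced-learn). The core SMOTE idea, synthesizing minority-class samples by interpolating between existing ones, can be sketched in numpy; this simplified version interpolates between random minority pairs rather than k-nearest neighbors, so it is an illustration of the idea, not the project's implementation.

```python
import numpy as np

def smote_like_samples(minority, n_new, rng=None):
    """Generate synthetic points on segments between random pairs of minority samples.

    Real SMOTE interpolates toward k-nearest neighbors; random pairing is a
    simplification for illustration.
    """
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i, j = rng.choice(len(minority), size=2, replace=False)  # two real samples
        lam = rng.random()                                       # interpolation factor
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)
```

The Tomek-links half of SMOTE-Tomek then removes borderline majority samples, cleaning the class boundary after oversampling.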

Engineering Challenges I’ve Worked Through

Technical problems that shaped how I think about AI systems beyond just building the happy path.

  • Balancing retrieval depth vs. latency in RAG pipelines to improve grounding without degrading response time.
  • Preventing agent loops and uncontrolled tool calls in multi-agent orchestration workflows.
  • Improving chunking strategy and semantic recall to raise retrieval precision and downstream answer quality.

Experience

Work across data reliability, reporting automation, AI-assisted analytics, and data-driven impact communication.

Data Scientist • Street Care

2026 – Present

Improved data reliability by 35% across 10K+ operational records through validation pipelines, and automated reporting workflows to reduce manual effort by 40%. Integrated AI-assisted workflows using Copilot and Claude to convert analytical outputs into structured summaries.

Analyzed and transformed unstructured survey and outreach data using Python and LLM-based techniques, extracting key community insights and automating report and grant content generation, enabling faster data-driven decision-making and stronger impact communication.

Member Services Supervisor • University of Florida RecSports

2024 – 2025

Used Python and Excel for data cleaning, transformation, and weekly reporting across 1,200+ attendance records, and built Power BI dashboards to track trends across 50+ programs, improving reporting consistency, visibility, and decision support for operations.

Education

Academic foundation in computer science, machine learning, and applied AI systems.

University of Florida

M.S. in Computer Science • 2024 – 2025 • USA

Graduate study in computer science with relevant coursework in Applications of NLP, AI Ethics for Tech Leaders, and Research Methods for HCC.

Bennett University

B.Tech in Computer Science • 2020 – 2024 • India

Strong undergraduate foundation in computer science, with coursework and practical exposure to machine learning, artificial intelligence, data science, and software engineering principles.

Skills

A focused stack across AI engineering, retrieval systems, data science, analytics, and foundational DevOps.

Generative AI & LLM Systems: Agentic AI, Multi-Agent Systems, RAG Pipelines, Prompt Engineering, LLM Evaluation
Retrieval & Search: FAISS, Vector Search, Semantic Retrieval, Embeddings, MiniLM, ChromaDB
Frameworks & Tools: LangChain, LangGraph, Transformers, Tavily, BeautifulSoup, Requests, Groq
Backend & APIs: FastAPI, REST APIs, Async Processing, Modular System Design
Machine Learning & Data Science: EDA, Feature Engineering, Model Evaluation, Statistical Analysis, Pandas, NumPy
Data Visualization & Reporting: Power BI, Excel, Dashboarding, Reporting Automation, Operational Analytics
DevOps & Automation: Docker, GitHub Actions, CI/CD concepts, Shell Scripting, Environment Management
Cloud & Architecture: AWS-aligned design patterns, S3-style storage flows, Lambda-style event workflows, API-based system integration

Model Stack

Hands-on model usage across generation, retrieval, and NLP workflows, with practical model evaluation during development.

Core Models Used

Used GPT-4o-mini in LLM application workflows, alongside low-latency hosted inference APIs for real-time generation. Across NLP, retrieval, and semantic search systems, worked with DistilBERT, BERT, DistilGPT2, and all-MiniLM-L6-v2.

GPT-4o-mini • Llama 3.1 8B Instant • Mistral 7B • Mixtral 8x7B • Gemini 2.5 Flash-Lite • DistilBERT • BERT • DistilGPT2 • all-MiniLM-L6-v2

Model Evaluation & Selection

During experimentation, evaluated models including Llama 3.3 70B, GPT-OSS 120B, and Llama Prompt Guard 2 22M, comparing them across response quality, latency, safety fit, and task fit before selecting the most suitable option for a given workflow.

Llama 3.3 70B • GPT-OSS 120B • Llama Prompt Guard 2 22M

Applied Across

Agentic AI workflows, grounded RAG pipelines, semantic retrieval, text generation, complaint classification, safety-aware experimentation, and question-answering systems backed by modular APIs and evaluation-driven iteration.

Contact

Open to AI Engineer, GenAI Engineer, and LLM Systems opportunities.

What I’m Looking For

Roles where I can contribute to LLM-powered products, agentic workflows, RAG systems, and production-oriented AI applications while continuing to grow across system design, model adaptation, and deployment.