OceanPhysics AI TLDR

OceanPhysics manufactures HF radar systems used in oceanography. Customers and collaborators often ask technical questions about installation, maintenance, troubleshooting, and data analysis. The answers existed in years of accumulated internal knowledge but were scattered across wiki pages, email archives, PDF manuals, and WhatsApp conversations.

I built OceanPhysics AI, an end-to-end RAG system that turns this fragmented knowledge into a deployed AI support assistant. The system ingests four source types (wiki, email, PDF, and WhatsApp), reducing an 11.2 GB raw corpus to 16,537 indexed vector chunks, and serves a public chatbot for users and a protected internal chatbot for OceanPhysics lab members and developers.

The core challenge was ingestion. The system could not simply index raw files: emails contained valuable support knowledge mixed with sensitive or irrelevant material; PDFs required reliable transcription and page-level provenance; WhatsApp chats had to be distilled from noisy conversations; wiki pages needed better metadata. Each source therefore needed a dedicated preprocessing pipeline.

The most complex pipeline was email. Starting from 153 .mbox archives and roughly 29,899 messages, the system reconstructed 14,709 conversation threads, assessed relevance and sensitivity with an LLM-assisted pipeline, and exported 8,221 cleaned technical thread records. This turned raw support conversations into usable technical knowledge rather than indexing mailbox dumps.
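To illustrate the thread-reconstruction step, here is a minimal sketch. It assumes threads are linked through the standard Message-ID, In-Reply-To, and References headers, which this summary does not confirm as the method actually used. The function name `group_threads` and the placeholder IDs for messages without a Message-ID are hypothetical.

```python
import re
from collections import defaultdict

# Matches angle-bracketed message identifiers such as <abc@example.org>.
MSG_ID = re.compile(r"<[^<>\s]+>")


def group_threads(messages):
    """Group messages into threads via Message-ID / In-Reply-To / References.

    `messages` is an iterable of objects with a dict-style .get(), such as
    the email.message.Message objects yielded by mailbox.mbox(path).
    Returns a list of threads, each a list of message IDs.
    """
    parent = {}  # union-find forest over message IDs

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    own_ids = []
    for i, msg in enumerate(messages):
        ids = MSG_ID.findall(str(msg.get("Message-ID") or ""))
        # Assumption: a message without an ID becomes its own thread.
        own = ids[0] if ids else f"<missing-{i}>"
        own_ids.append(own)
        find(own)
        linked = f'{msg.get("In-Reply-To") or ""} {msg.get("References") or ""}'
        for ref in MSG_ID.findall(linked):
            union(own, ref)

    threads = defaultdict(list)
    for own in own_ids:
        threads[find(own)].append(own)
    return list(threads.values())
```

With the standard library, `group_threads(mailbox.mbox("archive.mbox"))` would apply this to one archive (the path is a placeholder). Real mailboxes often have missing or rewritten headers, so a production pipeline would likely need fallbacks such as subject matching.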

PDF ingestion was the second major challenge. The system processed 113 technical PDFs totaling 3,658 pages by rendering pages as images and transcribing them with a multimodal LLM. It preserved page-level provenance through inline page markers and custom chunking, producing 6,049 indexed PDF chunks that can support inspectable technical answers.
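The page-provenance idea can be sketched as follows. This is an illustration under assumptions, not the project's actual chunker: the marker format `[[page N]]`, the function name `chunk_with_pages`, and the character budget are all hypothetical.

```python
import re

# Hypothetical inline marker written before each transcribed page.
PAGE_MARKER = re.compile(r"\[\[page (\d+)\]\]")


def chunk_with_pages(transcript, max_chars=1200):
    """Split a page-marked transcript into chunks that record their page span."""
    # re.split with one capture group yields [preamble, num, body, num, body, ...].
    parts = PAGE_MARKER.split(transcript)
    pages = [(int(parts[i]), parts[i + 1].strip())
             for i in range(1, len(parts) - 1, 2)]

    chunks, buf, start, end = [], [], None, None

    def flush():
        nonlocal buf, start, end
        if buf:
            chunks.append({"text": "\n\n".join(buf),
                           "page_start": start, "page_end": end})
        buf, start, end = [], None, None

    for page_no, body in pages:
        for para in filter(None, (p.strip() for p in body.split("\n\n"))):
            # Start a new chunk when adding this paragraph would exceed the budget.
            if buf and sum(len(p) for p in buf) + len(para) > max_chars:
                flush()
            if start is None:
                start = page_no
            buf.append(para)
            end = page_no
    flush()
    return chunks
```

Each chunk carries `page_start` and `page_end`, so an answer that cites the chunk can point a reader to the exact pages of the source PDF.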

At serving time, the system follows a compact pipeline:

User question → query contextualization → hybrid retrieval → cross-source reranking → answer generation

Hybrid retrieval combines dense-embedding search with BM25 keyword search. A reranker then reorders the merged candidates across sources before answer generation. The internal version exposes source filtering, citations, and tracing tools. The public version is intentionally simpler and hides internal references.
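This summary does not say how the dense and BM25 result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the method used here. The chunk IDs and the constant `k=60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first.

    Each list contributes 1 / (k + rank) per chunk, so chunks ranked highly
    by more than one retriever rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for a deterministic order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c1", "c2", "c3"]  # hypothetical dense-embedding results
bm25_hits = ["c3", "c1", "c4"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

RRF uses only rank positions, so it avoids having to normalize dense similarity scores against BM25 scores, which are on different scales. The fused list would then go to the reranker.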

The system runs on a company-owned server in Docker, with separate public and protected internal endpoints. Phoenix tracing provides observability into retrieval and LLM calls, and was useful for debugging a key hallucination issue: one unsupported citation was traced back not to answer generation, but to an earlier LLM-assisted email extraction step.

The project also covers retrieval evaluation, software tests, deployment, and observability, and it identifies paths for future improvement. The most important lesson was that production RAG is not only retrieval and prompting. It is data transformation, product judgment, evaluation, deployment, and careful control over where information enters the system.
