Building OceanPhysics AI
Turning Technical Archives into an AI Support Assistant
Here is a shorter version if you prefer. Here is a link to the final product.
OceanPhysics manufactures HF radar systems used in oceanography. Customers and collaborators often need help with installation, maintenance, troubleshooting, and data analysis, but many useful answers were scattered across internal wiki pages, email archives, PDF manuals, and WhatsApp conversations. The project turned that fragmented knowledge into an AI assistant for HF radar users. I built the system end to end: ingestion pipelines, vector indexing, retrieval, chat orchestration, public/internal interfaces, Docker deployment, evaluation, and tests. The final system unifies four source types into 16,537 indexed vector chunks from an 11.2 GB raw corpus. The main challenge was not the chat interface; it was transforming messy, sensitive, real-world technical material into a reliable knowledge base.
The stakeholder wanted an AI chatbot to help deflect recurring support questions. Users are typically universities, customers, and collaborators operating OceanPhysics radar systems. They need practical answers grounded in OceanPhysics’ own experience, not generic web knowledge. The most useful knowledge, however, was not always in polished documentation. It was often buried in support emails, chat discussions, and technical PDFs. That created the central product constraint: make internal knowledge useful to external users without exposing the raw internal archive behind it.
The first implementation used the wiki as a tracer-bullet source because it was the fastest path to a working end-to-end system. From there, the source scope expanded from wiki to email, WhatsApp, and PDFs. The stakeholder feedback loop shaped which sources mattered most, what information should be included, what should be filtered, and what should remain internal. One major product decision was the split between public and internal versions. Public users get a simple assistant without source filters or internal citations. OceanPhysics lab members and developers get a protected internal version with source filtering, references, and developer tools.
The architecture separates offline ingestion from online serving:
Ingestion runs offline. Each source has its own preprocessing pipeline, and expensive LLM-assisted steps are cached. Once clean artifacts exist, an index-management CLI builds one Qdrant collection per source: wiki, email, PDF, and WhatsApp.
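Caching the expensive LLM-assisted steps can be illustrated with a small content-addressed cache. This is a minimal sketch, not the project's code: `cached_step`, the `prompt_version` key, and the one-JSON-file-per-result layout are all hypothetical, and `fake_llm` stands in for a real model call.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_step(cache_dir: Path, prompt_version: str, text: str,
                expensive_fn: Callable[[str], dict]) -> dict:
    """Return the cached result for (prompt_version, text), computing it once."""
    # Key on the input AND the prompt version, so changing the prompt
    # invalidates old results instead of silently reusing them.
    key = hashlib.sha256(f"{prompt_version}\n{text}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = expensive_fn(text)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


# Hypothetical stand-in for an LLM call; it records each invocation.
calls = []


def fake_llm(text: str) -> dict:
    calls.append(text)
    return {"summary": text.upper()}


demo_dir = Path(tempfile.mkdtemp())
first = cached_step(demo_dir, "v1", "antenna tuning", fake_llm)
second = cached_step(demo_dir, "v1", "antenna tuning", fake_llm)  # cache hit
third = cached_step(demo_dir, "v2", "antenna tuning", fake_llm)   # new prompt version
```

Keying on the prompt version is one way to keep reruns cheap while still forcing re-extraction after a prompt fix.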
Serving is simpler. The production server only loads prepared indices; it never runs ingestion. A FastAPI backend holds the RAG core, while two lightweight Streamlit frontends expose the public and internal chat experiences.
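At query time, the system follows a compact retrieval pipeline: contextualization, hybrid retrieval over the selected source collections, merging, reranking, and answer generation.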
The contextualization step rewrites follow-up questions into standalone search queries when needed. Hybrid retrieval combines dense embeddings for semantic similarity with BM25 for exact technical terms, part names, and domain-specific phrases. Results from the selected source collections are merged and reranked before answer generation.
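One common way to merge a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF). The sketch below shows that technique for illustration only. The text above does not say which fusion method the system uses, and the document IDs and the constant `k = 60` are assumptions. The reranking step that follows the merge is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers for one query.
dense_hits = ["email-12", "pdf-3", "wiki-7"]
bm25_hits = ["pdf-3", "wa-2", "email-12"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still survives into the candidate list for reranking.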
In a domain-specific RAG system, ingestion is where much of the product behavior is defined. Retrieval quality depends on the shape of the indexed knowledge: what is extracted, what is excluded, how source context is represented, and how much provenance survives into the final chunks.
For OceanPhysics AI, ingestion also carried the main safety and product constraints. The assistant needed to use internal knowledge to answer external support questions without exposing the raw internal archive. The pipelines therefore did more than clean text: they selected relevant material, filtered or flagged sensitive content, generalized customer-specific details, and preserved metadata needed for internal inspection.
Each source required a different transformation. Wiki pages mostly needed metadata enrichment. WhatsApp chats needed distillation from informal conversation into structured technical notes. Emails required thread reconstruction, sensitivity-aware extraction, and hallucination control. PDFs required multimodal transcription, structure preservation, and page-level provenance.
This made ingestion part of the system’s reasoning and safety layer, not a one-time data-loading step. The upstream transformations had to be inspectable, cached, repeatable, and evaluated because they determined what the assistant was allowed to know.
Email was the hardest and most valuable source. The raw input was 153 .mbox archives: about 10.9 GB and roughly 29,899 messages. Indexing those directly would have produced a noisy and unsafe knowledge base. The key design choice was to index conversation threads, not individual emails. A useful support answer often emerges across several replies, so the pipeline reconstructed 14,709 threads, converted them into readable context, and used an LLM-assisted step to decide relevance, sensitivity, and what technical information should be extracted. This reduced the archive to 8,221 relevant thread records, later indexed into 9,120 vector chunks. The final email index does not contain raw mailbox dumps; it contains cleaned, generalized technical knowledge extracted from real support conversations.
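The idea of indexing threads rather than individual messages can be sketched with the standard library. This is a simplified illustration under stated assumptions: it follows only the `In-Reply-To` header, the message IDs are made up, and `make_msg` and `group_into_threads` are hypothetical helpers. For a real archive, `mailbox.mbox(path)` yields message objects with the same header interface.

```python
from collections import defaultdict
from email.message import Message
from typing import Iterable, Optional


def make_msg(msg_id: str, subject: str, in_reply_to: Optional[str] = None) -> Message:
    """Build a tiny in-memory message for the demo."""
    msg = Message()
    msg["Message-ID"] = msg_id
    msg["Subject"] = subject
    if in_reply_to:
        msg["In-Reply-To"] = in_reply_to
    msg.set_payload(f"body of {subject}")
    return msg


def group_into_threads(messages: Iterable[Message]) -> dict[str, list[str]]:
    """Map each thread's root Message-ID to the Message-IDs it contains."""
    parent: dict[str, Optional[str]] = {}
    for msg in messages:
        msg_id = str(msg["Message-ID"]).strip()
        reply_to = msg["In-Reply-To"]
        parent[msg_id] = str(reply_to).strip() if reply_to else None

    def root_of(msg_id: str) -> str:
        seen = set()
        # Walk parent links until a message has no known parent;
        # `seen` guards against malformed reply cycles.
        while parent.get(msg_id) and msg_id not in seen:
            seen.add(msg_id)
            msg_id = parent[msg_id]
        return msg_id

    threads: dict[str, list[str]] = defaultdict(list)
    for msg_id in parent:
        threads[root_of(msg_id)].append(msg_id)
    return dict(threads)


msgs = [
    make_msg("<a@example.org>", "Radar offline"),
    make_msg("<b@example.org>", "Re: Radar offline", "<a@example.org>"),
    make_msg("<c@example.org>", "Re: Radar offline", "<b@example.org>"),
    make_msg("<d@example.org>", "Firmware question"),
]
threads = group_into_threads(msgs)
```

A production pipeline would typically also consult the `References` header and handle missing or duplicated IDs. Those details are omitted here.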
The hardest lesson came from hallucination during ingestion. An early prompt gave the model too much freedom to add helpful context, and it produced plausible but fake citations. The issue appeared later in chatbot answers, but the root cause was upstream: the extraction pipeline had already inserted unsupported information into the knowledge base. I traced the problem back to the original email extraction, tightened the prompt so the model could only extract what the emails actually said, and added broader checks instead of relying only on manual spot checks. That changed how I treated LLM-assisted ingestion: it was not just preprocessing, but part of the system’s trust boundary.
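One kind of automated check that catches fabricated citations is a grounding test: any link in the extracted note must literally appear in the source thread. The sketch below is an assumption about what such a check could look like, not a description of the checks actually added. `unsupported_citations` and the sample texts are hypothetical.

```python
import re

# Matches http(s) URLs up to whitespace or a closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def unsupported_citations(source_text: str, extracted_text: str) -> list[str]:
    """Return URLs in the extraction that never occur in the source text."""
    source_lower = source_text.lower()
    flagged = []
    for match in URL_PATTERN.findall(extracted_text):
        citation = match.rstrip(".,;")  # drop trailing sentence punctuation
        if citation.lower() not in source_lower:
            flagged.append(citation)
    return flagged


source = "See the manual at https://example.org/manual.pdf for grounding the antenna."
extracted = (
    "Ground the antenna per https://example.org/manual.pdf "
    "and https://example.org/invented-guide."
)
flagged = unsupported_citations(source, extracted)
print(flagged)
```

A check like this is cheap enough to run over every extracted record, which is the point: it replaces sampling with full coverage for one specific failure mode.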
PDFs were the second major ingestion challenge. The corpus contained 113 technical PDFs (manuals, tutorials, antenna handbooks, and regulatory documents) totaling 3,658 pages. Traditional PDF text extraction was not reliable enough, so I treated each page as an image and used a multimodal LLM to transcribe it into markdown. This gave more consistent text across varied layouts and made it possible to preserve headings, paragraphs, and printed page numbers.
The main operational obstacle was Gemini’s RECITATION blocking. Detailed transcription prompts and structured outputs often triggered the safeguard. I kept Gemini because its pricing made full-corpus transcription feasible, and adapted the pipeline around minimal prompts and fallback retries.
The second challenge was provenance. Internal users needed to know where an answer came from inside a PDF, not just which PDF it came from. The pipeline inserted compact page markers into the markdown, and the custom PDF loader preserved those markers through chunking. The final PDF source became 114 clean markdown documents, 33,339 inline page markers, and 6,049 indexed chunks. The result was not just PDF-to-text conversion. It was a pipeline that preserved enough structure and page-level provenance for technical answers to be useful and inspectable.
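Carrying page markers through chunking can be sketched as follows. The marker syntax `[[page:N]]`, the one-marker-per-line layout, the character-based chunk limit, and the function name are all assumptions made for illustration. The actual marker format and loader are not shown in this write-up.

```python
import re

# Hypothetical marker format: a line consisting only of [[page:N]].
PAGE_MARKER = re.compile(r"\[\[page:(\d+)\]\]")


def chunk_with_pages(markdown: str, max_chars: int = 80) -> list[dict]:
    """Split marked-up markdown into chunks that remember their page range."""
    chunks: list[dict] = []
    buffer: list[str] = []
    pages: list = []
    current_page = None

    def flush() -> None:
        if buffer:
            chunks.append({
                "text": " ".join(buffer),
                "first_page": pages[0],
                "last_page": pages[-1],
            })
            buffer.clear()
            pages.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        marker = PAGE_MARKER.fullmatch(stripped)
        if marker:
            # Markers update the page counter but never enter the chunk text.
            current_page = int(marker.group(1))
            continue
        if not stripped:
            continue
        if buffer and sum(len(b) for b in buffer) + len(stripped) > max_chars:
            flush()
        buffer.append(stripped)
        pages.append(current_page)
    flush()
    return chunks


doc = """[[page:1]]
Antenna installation overview.
Ground the mast before connecting cables.
[[page:2]]
Check the receive cable for corrosion.
"""
narrow = chunk_with_pages(doc, max_chars=80)
wide = chunk_with_pages(doc, max_chars=200)
```

With the narrow limit each chunk stays on one page. With the wide limit a single chunk spans pages 1 to 2, which is the case where a page range matters more than a single page number.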
The retrieval system combines semantic search with exact-term search. Dense embeddings help with meaning; BM25 helps with domain-specific terms, part names, and technical phrases. Results from the selected source collections are merged and reranked before answer generation.
The internal version exposes hierarchical source filtering, so users can narrow retrieval by source, mailbox, PDF folder, wiki section, or document. This became useful both as a product feature and as a debugging tool. The public version is deliberately simpler. It searches the knowledge base but hides filters and internal references. That decision reflects the sensitivity of the source material: the assistant can use internal knowledge without exposing the internal archive itself.
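Hierarchical filtering can be pictured as prefix matching on a source path stored with each chunk. This is a minimal sketch with invented paths and a hypothetical `matches_filter` helper. In the deployed system the filter would more likely be applied inside the vector store than in Python.

```python
def matches_filter(source_path: str, selected: list[str]) -> bool:
    """True if source_path sits under any selected node (or nothing is selected)."""
    if not selected:
        return True  # no filter means search everything
    parts = source_path.split("/")
    for node in selected:
        node_parts = node.split("/")
        # Compare whole path components, so "pdf/man" does not match "pdf/manuals".
        if parts[:len(node_parts)] == node_parts:
            return True
    return False


# Hypothetical chunk metadata: source / subfolder / document.
chunk_paths = [
    "email/support/thread-0042",
    "pdf/manuals/antenna-handbook",
    "pdf/regulatory/license-guide",
    "wiki/installation/grounding",
]

selection = ["pdf/manuals", "wiki"]
kept = [p for p in chunk_paths if matches_filter(p, selection)]
print(kept)
```

Selecting a whole source (`wiki`) and a single subfolder (`pdf/manuals`) in the same query is what makes the filter useful for debugging as well as for narrowing answers.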
The system runs on a company-owned server in Docker, exposing two endpoints from the same backend: a public assistant and a protected internal assistant. Ingestion remains offline: vector indices are built locally, shipped to the server, and only loaded in production.
Phoenix tracing provides observability into retrieval, LLM calls, and answer generation. It was useful for debugging the email hallucination issue by tracing a suspicious answer back through retrieval into the indexed source material, and it also creates a path for turning weak answers or retrieval misses into future evaluation cases.
Quality is handled through two layers: retrieval evaluation and software testing. Evaluation measures whether the system retrieves the right context, using hit rate, MRR, generated test sets, and hand-written golden queries. The test suite protects the implementation itself (RAG core, source filtering, loaders, storage, prompts, ingestion caches, and CLIs) with 641 test functions across 38 files.
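For reference, hit rate and MRR can be computed as below. The metric definitions are standard. The function name, the cutoff `k`, and the toy queries are illustrative assumptions rather than the project's evaluation code.

```python
def hit_rate_and_mrr(results: list[list[str]], relevant: list[set[str]],
                     k: int = 5) -> tuple[float, float]:
    """Compute hit rate@k and mean reciprocal rank@k over a set of queries."""
    if not results or len(results) != len(relevant):
        raise ValueError("results and relevant must be non-empty and aligned")
    hits = 0
    reciprocal_rank_sum = 0.0
    for retrieved, gold in zip(results, relevant):
        for rank, doc_id in enumerate(retrieved[:k], start=1):
            if doc_id in gold:
                hits += 1
                reciprocal_rank_sum += 1.0 / rank  # only the first hit counts
                break
    n = len(results)
    return hits / n, reciprocal_rank_sum / n


# Three toy queries: relevant doc at rank 1, at rank 3, and not retrieved.
retrieved_ids = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
gold_ids = [{"a"}, {"f"}, {"z"}]

hit_rate, mrr = hit_rate_and_mrr(retrieved_ids, gold_ids, k=5)
```

Here two of three queries retrieve a relevant chunk, so the hit rate is 2/3, and the MRR is (1 + 1/3 + 0) / 3 = 4/9. Hit rate says whether the right context was found at all, while MRR also rewards finding it near the top.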
The most useful next step would be evaluation-driven improvement: use Phoenix traces, user feedback, and real questions to expand the evaluation set, then tune chunking, retrieval depth, reranking, source weighting, and dense/BM25 balance against observed failure cases.
The larger opportunity would be knowledge consolidation. Email and WhatsApp extraction currently preserves threads and conversations for provenance, but related information remains spread across many records. A derived topic-based, wiki-like layer could cluster and synthesize those extractions while preserving source links, making the knowledge base easier to retrieve from, maintain, and navigate with an agent.
OceanPhysics AI is now a deployed support assistant over the company’s technical knowledge. Public users get a simpler way to ask questions about HF radar systems. Internal users get a more inspectable tool with filtering, citations, and developer controls. From an engineering perspective, the project demonstrates an end-to-end production RAG system: source-specific ingestion, vector indexing, hybrid retrieval, reranking, chat orchestration, Docker deployment, evaluation, and testing. The part I am most proud of is the transformation itself: turning messy raw archives, especially years of email conversations, into a usable technical knowledge base.