mirror of
https://github.com/Manoj-HV30/clawrity.git
synced 2026-05-16 19:35:21 +00:00
prototype
This commit is contained in:
@@ -0,0 +1,26 @@
|
|||||||
|
# =============================================================================
|
||||||
|
# Clawrity — Environment Variables
|
||||||
|
# Copy this file to .env and fill in your values.
|
||||||
|
# NEVER commit .env to git.
|
||||||
|
# =============================================================================
|
||||||
|
|
||||||
|
# --- Groq API (free at https://console.groq.com) ---
|
||||||
|
GROQ_API_KEY=
|
||||||
|
|
||||||
|
# --- PostgreSQL + pgvector (docker-compose handles this if using defaults) ---
|
||||||
|
DATABASE_URL=postgresql://user:pass@localhost:5432/clawrity
|
||||||
|
|
||||||
|
# --- Slack Bot (Socket Mode) ---
|
||||||
|
# 1. Create app at https://api.slack.com/apps
|
||||||
|
# 2. Enable Socket Mode → generate App-Level Token (xapp-...)
|
||||||
|
# 3. OAuth & Permissions → install to workspace → copy Bot Token (xoxb-...)
|
||||||
|
# 4. Basic Information → Signing Secret
|
||||||
|
SLACK_BOT_TOKEN=
|
||||||
|
SLACK_APP_TOKEN=
|
||||||
|
SLACK_SIGNING_SECRET=
|
||||||
|
|
||||||
|
# --- Tavily Web Search (free at https://app.tavily.com) ---
|
||||||
|
TAVILY_API_KEY=
|
||||||
|
|
||||||
|
# --- Slack Webhook for digest delivery ---
|
||||||
|
ACME_SLACK_WEBHOOK=
|
||||||
+43
@@ -0,0 +1,43 @@
|
|||||||
|
# === Environment & Secrets ===
|
||||||
|
.env
|
||||||
|
*.env
|
||||||
|
|
||||||
|
# === Dataset files — never commit raw or processed data ===
|
||||||
|
data/raw/
|
||||||
|
data/processed/
|
||||||
|
|
||||||
|
# === Python ===
|
||||||
|
__pycache__/
|
||||||
|
*.py[cod]
|
||||||
|
*$py.class
|
||||||
|
*.so
|
||||||
|
*.egg-info/
|
||||||
|
dist/
|
||||||
|
build/
|
||||||
|
*.egg
|
||||||
|
|
||||||
|
# === Virtual Environment ===
|
||||||
|
venv/
|
||||||
|
.venv/
|
||||||
|
env/
|
||||||
|
|
||||||
|
# === IDE ===
|
||||||
|
.vscode/
|
||||||
|
.idea/
|
||||||
|
*.swp
|
||||||
|
*.swo
|
||||||
|
|
||||||
|
# === OS ===
|
||||||
|
.DS_Store
|
||||||
|
Thumbs.db
|
||||||
|
|
||||||
|
# === Logs ===
|
||||||
|
logs/
|
||||||
|
*.log
|
||||||
|
*.jsonl
|
||||||
|
|
||||||
|
# === Docker ===
|
||||||
|
pg_data/
|
||||||
|
|
||||||
|
# === Model Cache ===
|
||||||
|
.cache/
|
||||||
+23
@@ -0,0 +1,23 @@
|
|||||||
|
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies for psycopg2 and Prophet
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
# Copying requirements.txt before the source keeps this layer cached
# across code-only changes, so rebuilds skip pip install.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy project
COPY . .

# Create necessary directories
RUN mkdir -p data/raw data/processed logs

# Port served by uvicorn in the CMD below
EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
|
||||||
@@ -0,0 +1,213 @@
|
|||||||
|
# Clawrity
|
||||||
|
|
||||||
|
**Multi-channel AI business intelligence agent.** Enterprise clients interact via Slack (or Teams) and get data-grounded answers, daily digests, budget recommendations, ROI forecasts, and competitor/sector intelligence — all specific to their business data.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
Built on the **OpenClaw pattern**:
|
||||||
|
- **ProtocolAdapter** — normalises messages from any channel (Slack, Teams, etc.)
|
||||||
|
- **SOUL.md** — per-client personality, rules, and business context
|
||||||
|
- **HEARTBEAT.md** — autonomous daily digest scheduling
|
||||||
|
|
||||||
|
All intelligence lives in the Clawrity backend. OpenClaw layer has zero business logic.
|
||||||
|
|
||||||
|
## Tech Stack
|
||||||
|
|
||||||
|
| Component | Tool |
|
||||||
|
|---|---|
|
||||||
|
| Language | Python 3.11 |
|
||||||
|
| API Framework | FastAPI + uvicorn |
|
||||||
|
| LLM | Groq API — llama-3.3-70b-versatile |
|
||||||
|
| Embeddings | sentence-transformers all-MiniLM-L6-v2 (CPU, 384d) |
|
||||||
|
| Database | PostgreSQL + pgvector |
|
||||||
|
| Channel (dev) | Slack Bolt SDK (Socket Mode) |
|
||||||
|
| Channel (demo) | Microsoft Teams Bot Framework SDK |
|
||||||
|
| Scheduler | APScheduler AsyncIOScheduler |
|
||||||
|
| Web Search | Tavily API + DuckDuckGo fallback |
|
||||||
|
| Forecasting | Prophet |
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
### 1. Prerequisites
|
||||||
|
|
||||||
|
- Python 3.11+
|
||||||
|
- Docker & Docker Compose
|
||||||
|
- Groq API key (free: https://console.groq.com)
|
||||||
|
- Tavily API key (free: https://app.tavily.com)
|
||||||
|
|
||||||
|
### 2. Environment Setup
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cp .env.example .env
|
||||||
|
# Fill in your API keys in .env
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Start PostgreSQL + pgvector
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker compose up -d postgres
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. Install Dependencies
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m venv venv
|
||||||
|
source venv/bin/activate
|
||||||
|
pip install -r requirements.txt
|
||||||
|
```
|
||||||
|
|
||||||
|
### 5. Download Kaggle Datasets
|
||||||
|
|
||||||
|
Download these two datasets and place them in `data/raw/`:
|
||||||
|
|
||||||
|
1. **Global Superstore**: https://kaggle.com/datasets/apoorvaappz/global-super-store-dataset
|
||||||
|
2. **Marketing Campaign Performance**: https://kaggle.com/datasets/manishabhatt22/marketing-campaign-performance-dataset
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mkdir -p data/raw data/processed
|
||||||
|
# Place downloaded files in data/raw/
|
||||||
|
```
|
||||||
|
|
||||||
|
### 6. Seed Demo Data
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python scripts/seed_demo_data.py --client_id acme_corp \
|
||||||
|
--superstore data/raw/Global_Superstore2.csv \
|
||||||
|
--marketing data/raw/marketing_campaign_dataset.csv
|
||||||
|
```
|
||||||
|
|
||||||
|
### 7. Run RAG Pipeline
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python scripts/run_rag_pipeline.py --client_id acme_corp
|
||||||
|
```
|
||||||
|
|
||||||
|
### 8. Start the API
|
||||||
|
|
||||||
|
```bash
|
||||||
|
uvicorn main:app --reload --port 8000
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Slack Bot Setup (Socket Mode)
|
||||||
|
|
||||||
|
### Step 1: Create Slack App
|
||||||
|
|
||||||
|
1. Go to https://api.slack.com/apps
|
||||||
|
2. Click **Create New App** → **From scratch**
|
||||||
|
3. Name it `Clawrity` and select your workspace
|
||||||
|
|
||||||
|
### Step 2: Enable Socket Mode
|
||||||
|
|
||||||
|
1. In the left sidebar, click **Socket Mode**
|
||||||
|
2. Toggle **Enable Socket Mode** to ON
|
||||||
|
3. Click **Generate Token** — name it `clawrity-socket`
|
||||||
|
4. Copy the `xapp-...` token → paste into `.env` as `SLACK_APP_TOKEN`
|
||||||
|
|
||||||
|
### Step 3: Configure Bot Token
|
||||||
|
|
||||||
|
1. Go to **OAuth & Permissions**
|
||||||
|
2. Under **Bot Token Scopes**, add:
|
||||||
|
- `app_mentions:read`
|
||||||
|
- `chat:write`
|
||||||
|
- `channels:history`
|
||||||
|
- `channels:read`
|
||||||
|
3. Click **Install to Workspace**
|
||||||
|
4. Copy the `xoxb-...` token → paste into `.env` as `SLACK_BOT_TOKEN`
|
||||||
|
|
||||||
|
### Step 4: Enable Events
|
||||||
|
|
||||||
|
1. Go to **Event Subscriptions**
|
||||||
|
2. Toggle **Enable Events** to ON (no Request URL needed in Socket Mode)
|
||||||
|
3. Under **Subscribe to bot events**, add:
|
||||||
|
- `app_mention`
|
||||||
|
- `message.channels`
|
||||||
|
4. Click **Save Changes**
|
||||||
|
|
||||||
|
### Step 5: Get Signing Secret
|
||||||
|
|
||||||
|
1. Go to **Basic Information**
|
||||||
|
2. Under **App Credentials**, copy **Signing Secret**
|
||||||
|
3. Paste into `.env` as `SLACK_SIGNING_SECRET`
|
||||||
|
|
||||||
|
### Step 6: Invite Bot to Channel
|
||||||
|
|
||||||
|
In Slack, go to your desired channel and type:
|
||||||
|
```
|
||||||
|
/invite @Clawrity
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## API Endpoints
|
||||||
|
|
||||||
|
| Method | Path | Description |
|
||||||
|
|--------|------|-------------|
|
||||||
|
| POST | `/chat` | Send message → get AI response |
|
||||||
|
| POST | `/slack/events` | Slack webhook fallback |
|
||||||
|
| POST | `/compare` | Side-by-side RAG vs no-RAG |
|
||||||
|
| POST | `/forecast/run/{client_id}` | Trigger Prophet forecasting |
|
||||||
|
| GET | `/forecast/{client_id}/{branch}` | Get cached forecast |
|
||||||
|
| GET | `/admin/stats/{client_id}` | RAG monitoring stats |
|
||||||
|
| GET | `/health` | System status |
|
||||||
|
|
||||||
|
## Adding a New Client
|
||||||
|
|
||||||
|
1. Create `config/clients/client_newclient.yaml` (copy from `client_acme.yaml`)
|
||||||
|
2. Create `soul/newclient_soul.md`
|
||||||
|
3. Create `heartbeat/newclient_heartbeat.md`
|
||||||
|
4. Place data in `data/raw/` and run seed + RAG scripts
|
||||||
|
5. Restart — zero code changes required
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Project Structure
|
||||||
|
|
||||||
|
```
|
||||||
|
clawrity/
|
||||||
|
├── main.py # FastAPI application
|
||||||
|
├── config/ # Configuration
|
||||||
|
│ ├── settings.py # pydantic-settings from .env
|
||||||
|
│ ├── client_loader.py # YAML client config loader
|
||||||
|
│ └── clients/client_acme.yaml # Per-client config
|
||||||
|
├── soul/ # Per-client personality
|
||||||
|
│ ├── soul_loader.py
|
||||||
|
│ └── acme_soul.md
|
||||||
|
├── heartbeat/ # Autonomous digest scheduling
|
||||||
|
│ ├── heartbeat_loader.py
|
||||||
|
│ ├── scheduler.py
|
||||||
|
│ └── acme_heartbeat.md
|
||||||
|
├── agents/ # AI agents
|
||||||
|
│ ├── gen_agent.py # Response generation
|
||||||
|
│ ├── qa_agent.py # Quality assurance
|
||||||
|
│ ├── orchestrator.py # Pipeline coordinator
|
||||||
|
│ └── scout_agent.py # Competitor intelligence
|
||||||
|
├── skills/ # Capabilities
|
||||||
|
│ ├── postgres_connector.py # DB connection pool
|
||||||
|
│ ├── nl_to_sql.py # Natural language → SQL
|
||||||
|
│ └── web_search.py # Tavily + DuckDuckGo
|
||||||
|
├── channels/ # Message channels
|
||||||
|
│ ├── protocol_adapter.py # OpenClaw normalisation
|
||||||
|
│ ├── slack_handler.py # Slack Socket Mode
|
||||||
|
│ └── teams_handler.py # Teams stub
|
||||||
|
├── rag/ # Retrieval-augmented generation
|
||||||
|
│ ├── preprocessor.py
|
||||||
|
│ ├── chunker.py
|
||||||
|
│ ├── vector_store.py
|
||||||
|
│ ├── retriever.py
|
||||||
|
│ ├── evaluator.py
|
||||||
|
│ └── monitoring.py
|
||||||
|
├── forecasting/
|
||||||
|
│ └── prophet_engine.py
|
||||||
|
├── connectors/
|
||||||
|
│ ├── base_connector.py
|
||||||
|
│ └── csv_connector.py
|
||||||
|
├── etl/
|
||||||
|
│ └── normaliser.py
|
||||||
|
└── scripts/
|
||||||
|
├── seed_demo_data.py
|
||||||
|
└── run_rag_pipeline.py
|
||||||
|
```
|
||||||
@@ -0,0 +1,184 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — Gen Agent
|
||||||
|
|
||||||
|
Generates newsletter-style, data-grounded responses using LLM.
|
||||||
|
Supports NVIDIA NIM and Groq via OpenAI-compatible API.
|
||||||
|
Temperature 0.7 (reduced by 0.2 on each retry).
|
||||||
|
Augmented with SOUL.md + live query results + RAG chunks (Phase 2).
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from typing import List, Optional, Dict
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
from config.llm_client import get_llm_client, get_model_name
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class GenAgent:
    """Response generation agent using LLM (NVIDIA NIM or Groq).

    Builds an augmented prompt out of SOUL.md content, SQL query results,
    optional RAG chunks, and optional benchmark data, then calls the
    OpenAI-compatible chat-completions API. Each retry lowers the sampling
    temperature by 0.2 (floored at 0.1) so output stays closer to the data.
    """

    def __init__(self):
        # Client and model name are resolved from environment configuration.
        self.client = get_llm_client()
        self.model = get_model_name()
        # Starting temperature; reduced by 0.2 on each retry.
        self.base_temperature = 0.7

    def generate(
        self,
        question: str,
        soul_content: str,
        data_context: Optional[pd.DataFrame] = None,
        rag_chunks: Optional[List[Dict]] = None,
        retry_issues: Optional[List[str]] = None,
        retry_count: int = 0,
        strict_data_instruction: Optional[str] = None,
        supplementary_context: Optional[pd.DataFrame] = None,
    ) -> str:
        """
        Generate a data-grounded response.

        Args:
            question: User's original question
            soul_content: SOUL.md content for personality/rules
            data_context: DataFrame from PostgreSQL query results
            rag_chunks: Retrieved chunks with similarity scores (Phase 2)
            retry_issues: QA Agent issues from previous attempt
            retry_count: Current retry number (0-2)
            strict_data_instruction: Anti-hallucination directive injected on retry
            supplementary_context: Benchmark rows (top performers) for comparison

        Returns:
            Markdown-formatted response string, or a generic error message
            if the LLM call fails.
        """
        # Each retry lowers the temperature so the model sticks to the data.
        temperature = max(0.1, self.base_temperature - (retry_count * 0.2))

        prompt = self._build_prompt(
            question, soul_content, data_context, rag_chunks, retry_issues,
            strict_data_instruction, supplementary_context,
        )

        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": soul_content},
                    {"role": "user", "content": prompt},
                ],
                temperature=temperature,
                max_tokens=2048,
            )
            result = response.choices[0].message.content.strip()
            # Lazy %-args: the message is only built when INFO logging is on.
            logger.info(
                "Gen Agent produced %d chars (temp=%s, retry=%d)",
                len(result), temperature, retry_count,
            )
            return result

        except Exception as e:
            logger.error("Gen Agent failed: %s", e)
            # Fix: was an f-string with no placeholders.
            return "I encountered an error generating your response. Please try again."

    def generate_digest(
        self,
        soul_content: str,
        data_context: pd.DataFrame,
        rag_chunks: Optional[List[Dict]] = None,
    ) -> str:
        """Generate a daily digest newsletter.

        Args:
            soul_content: SOUL.md content used as the system prompt
            data_context: Last-7-days performance rows (may be empty or None)
            rag_chunks: Optional historical context chunks with similarity scores

        Returns:
            Newsletter-style markdown digest, or a short error notice on failure.
        """
        prompt = f"""Generate a professional daily business intelligence digest.

## Performance Data (Last 7 Days)
{data_context.to_markdown(index=False) if data_context is not None and len(data_context) > 0 else "No data available."}

"""
        if rag_chunks:
            prompt += "## Historical Context\n"
            for i, chunk in enumerate(rag_chunks, 1):
                sim = chunk.get("similarity", 0)
                prompt += f"{i}. {chunk['text']} (relevance: {sim:.2f})\n"
            prompt += "\n"

        prompt += """Format as a newsletter with:
1. **Executive Summary** — key highlights in 2-3 sentences
2. **Top Performers** — best performing branches
3. **Attention Required** — bottom 3 branches by revenue (ALWAYS include this)
4. **Channel Insights** — spending efficiency across channels
5. **Recommendations** — specific, data-backed suggestions

Use bullet points, bold key numbers, and keep it concise."""

        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": soul_content},
                    {"role": "user", "content": prompt},
                ],
                temperature=0.7,
                max_tokens=3000,
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            logger.error("Digest generation failed: %s", e)
            return "Daily digest generation encountered an error."

    def _build_prompt(
        self,
        question: str,
        soul_content: str,
        data_context: Optional[pd.DataFrame],
        rag_chunks: Optional[List[Dict]],
        retry_issues: Optional[List[str]],
        strict_data_instruction: Optional[str] = None,
        supplementary_context: Optional[pd.DataFrame] = None,
    ) -> str:
        """Build the augmented prompt for response generation.

        Sections are appended in a fixed order: strict instruction (retry
        only), data context, benchmark data, RAG chunks, retry issues,
        then the user question and a closing grounding instruction.
        """
        parts = []

        # Strict data instruction (on retry — prevents hallucination)
        if strict_data_instruction:
            parts.append(f"## ⚠️ STRICT REQUIREMENT\n{strict_data_instruction}\n")

        # Data context
        if data_context is not None and len(data_context) > 0:
            parts.append("## Data Context (query results for the user's question)")
            parts.append(data_context.to_markdown(index=False))
        else:
            parts.append("## Data Context\nNo query results available.")

        # Supplementary context (top performers for comparison)
        if supplementary_context is not None and len(supplementary_context) > 0:
            parts.append("\n## Benchmark Data (top-performing branches for comparison)")
            parts.append(supplementary_context.to_markdown(index=False))
            parts.append(
                "\nUse this benchmark data to compare the queried branch's performance "
                "against top performers. Identify which channels and strategies work "
                "best, and recommend specific, actionable improvements based on what "
                "top-performing branches are doing differently."
            )

        # RAG chunks (Phase 2)
        if rag_chunks:
            parts.append("\n## Historical Business Context (retrieved from intelligence layer)")
            if strict_data_instruction:
                parts.append("⚠️ ONLY use historical context that is about branches/entities in the Data Context above. IGNORE any historical context about other branches.")
            for i, chunk in enumerate(rag_chunks, 1):
                sim = chunk.get("similarity", 0)
                parts.append(f"{i}. {chunk['text']} (relevance: {sim:.2f})")
            parts.append("\nBase suggestions on historical context. Cite specific data points.")

        # Retry instructions
        if retry_issues:
            parts.append("\n## IMPORTANT — Previous Response Issues")
            parts.append("Your previous response had these problems. Fix them:")
            for issue in retry_issues:
                parts.append(f"- {issue}")
            parts.append("Be more precise. Only state facts supported by the data above.")
            parts.append("Do NOT introduce any new branches, cities, or figures that are not in the Data Context.")

        # User question
        parts.append(f"\n## User Question\n{question}")

        parts.append("\nProvide a professional, data-grounded response. Cite specific numbers from the data.")

        return "\n".join(parts)
|
||||||
@@ -0,0 +1,294 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — Orchestrator
|
||||||
|
|
||||||
|
Coordinates the full message pipeline:
|
||||||
|
NormalisedMessage → NL-to-SQL → PostgreSQL → (RAG Retriever) → Gen Agent → QA Agent → Response
|
||||||
|
|
||||||
|
Max 2 retries per query. Returns best attempt with confidence warning after max retries.
|
||||||
|
|
||||||
|
Context enrichment: when a query returns sparse data (≤3 rows) and the question
|
||||||
|
asks for recommendations, automatically pulls top-performing branches as comparison
|
||||||
|
context so the Gen Agent can give actionable suggestions.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import re
|
||||||
|
import logging
|
||||||
|
import time
|
||||||
|
from typing import Dict, Optional, List
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
from agents.gen_agent import GenAgent
|
||||||
|
from agents.qa_agent import QAAgent
|
||||||
|
from channels.protocol_adapter import NormalisedMessage
|
||||||
|
from config.client_loader import ClientConfig
|
||||||
|
from skills.nl_to_sql import NLToSQL
|
||||||
|
from skills.postgres_connector import get_connector
|
||||||
|
from soul.soul_loader import load_soul
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
MAX_RETRIES = 2
|
||||||
|
|
||||||
|
# Keywords that signal the user wants recommendations, not just raw data
|
||||||
|
_RECOMMENDATION_KEYWORDS = re.compile(
|
||||||
|
r"\b(improve|increase|boost|grow|fix|help|recommend|suggest|advice|strategy|"
|
||||||
|
r"what (should|can|do)|how (to|can|do|should))\b",
|
||||||
|
re.IGNORECASE,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class Orchestrator:
    """Pipeline orchestrator — the central brain of Clawrity.

    Runs: NL-to-SQL → PostgreSQL → optional context enrichment →
    optional RAG retrieval → Gen Agent → QA Agent, with up to
    MAX_RETRIES regeneration attempts when QA fails.
    """

    def __init__(self):
        self.nl_to_sql = NLToSQL()
        self.gen_agent = GenAgent()
        self.qa_agent = QAAgent()
        self.retriever = None  # Set in Phase 2 via set_retriever()

    def set_retriever(self, retriever):
        """Attach the RAG retriever (Phase 2)."""
        self.retriever = retriever

    async def process(
        self,
        message: NormalisedMessage,
        client_config: ClientConfig,
    ) -> Dict:
        """
        Process a user message through the full pipeline.

        Args:
            message: Channel-normalised user message
            client_config: Per-client configuration (client_id, thresholds)

        Returns:
            Dict with: response, qa_score, qa_passed, retries, sql,
            data_rows, rag_chunks_used, elapsed_seconds
        """
        start_time = time.time()
        db = get_connector()

        # Load SOUL — per-client personality/rules used as the system prompt
        soul_content = load_soul(client_config)

        # Step 1: NL-to-SQL
        schema_meta = db.get_spend_data_schema(client_config.client_id)
        sql = self.nl_to_sql.generate_sql(
            question=message.text,
            client_id=client_config.client_id,
            schema_metadata=schema_meta,
        )

        # Step 2: Execute SQL — any failure degrades to an empty frame
        # rather than aborting the pipeline.
        data_context = pd.DataFrame()
        if sql:
            try:
                data_context = db.execute_query(sql)
                logger.info("SQL returned %d rows", len(data_context))
            except Exception as e:
                logger.error("SQL execution failed: %s", e)
                data_context = pd.DataFrame()

        # Step 2b: Context enrichment for sparse results.
        # When data is sparse and the user wants recommendations, pull
        # top performers and channel benchmarks as supplementary context.
        supplementary_context = None
        if self._needs_enrichment(message.text, data_context):
            supplementary_context = self._enrich_context(
                db, client_config.client_id, message.text, data_context
            )
            if supplementary_context is not None:
                logger.info(
                    "Context enriched: %d supplementary rows",
                    len(supplementary_context),
                )

        # Step 3: RAG Retrieval (Phase 2) — best-effort, never fatal
        rag_chunks = None
        if self.retriever:
            try:
                rag_chunks = self.retriever.retrieve(
                    query=message.text,
                    client_id=client_config.client_id,
                )
            except Exception as e:
                logger.warning("RAG retrieval failed: %s", e)

        # Step 4: Gen Agent → QA Agent loop (max MAX_RETRIES retries).
        # When supplementary context is provided (enrichment mode), use a
        # relaxed QA threshold since the response naturally references
        # broader benchmark data.
        qa_threshold = client_config.hallucination_threshold
        if supplementary_context is not None and len(supplementary_context) > 0:
            qa_threshold = min(qa_threshold, 0.5)
            logger.info("Using relaxed QA threshold (%s) for enriched context", qa_threshold)

        best_response = None
        best_score = 0.0
        qa_result = {"score": 0, "passed": False, "issues": []}
        retries = 0

        for attempt in range(MAX_RETRIES + 1):
            retry_issues = qa_result["issues"] if attempt > 0 else None

            # On retry, add explicit data-only instruction to prevent hallucination
            strict_data_instruction = None
            if attempt > 0:
                if supplementary_context is not None and len(supplementary_context) > 0:
                    strict_data_instruction = (
                        "CRITICAL: Only use data from the Data Context and Benchmark Data "
                        "sections provided. Do NOT invent figures or branch names that are "
                        "not present in either of those sections. You MAY reference benchmark "
                        "branches for comparison and recommendations."
                    )
                else:
                    strict_data_instruction = (
                        "CRITICAL: Do NOT mention any branches, figures, or historical data "
                        "that are not in the SQL query result provided. Stick strictly to the "
                        "data. If historical context from RAG is about different branches than "
                        "what the query returned, IGNORE that context entirely."
                    )

            response = self.gen_agent.generate(
                question=message.text,
                soul_content=soul_content,
                data_context=data_context,
                rag_chunks=rag_chunks,
                retry_issues=retry_issues,
                retry_count=attempt,
                strict_data_instruction=strict_data_instruction,
                supplementary_context=supplementary_context,
            )

            qa_result = self.qa_agent.evaluate(
                response=response,
                data_context=data_context,
                threshold=qa_threshold,
                supplementary_context=supplementary_context,
                user_question=message.text,
            )

            # Track best response (prefer longer, richer responses over "no data" stubs)
            if qa_result["score"] > best_score or (
                qa_result["score"] == best_score
                and best_response is not None
                and len(response) > len(best_response)
            ):
                best_score = qa_result["score"]
                best_response = response

            if qa_result["passed"]:
                logger.info("QA passed on attempt %d", attempt + 1)
                break
            retries += 1
            logger.warning(
                "QA failed on attempt %d: score=%.2f, issues=%s",
                attempt + 1, qa_result["score"], qa_result["issues"],
            )

        # If max retries exceeded, use best response with confidence warning
        final_response = best_response or response
        if not qa_result["passed"] and retries >= MAX_RETRIES:
            # Fix: trailing segments were f-strings with no placeholders.
            final_response += (
                "\n\n---\n"
                f"⚠️ *Confidence: {best_score:.0%} — "
                "This response may contain approximations. "
                "Please verify critical numbers against your source data.*"
            )

        elapsed = time.time() - start_time

        result = {
            "response": final_response,
            "qa_score": best_score,
            "qa_passed": qa_result["passed"],
            "retries": retries,
            "sql": sql,
            "data_rows": len(data_context) if data_context is not None else 0,
            "rag_chunks_used": len(rag_chunks) if rag_chunks else 0,
            "elapsed_seconds": round(elapsed, 2),
        }

        # Log interaction
        self._log_interaction(message, client_config, result)

        return result

    def _needs_enrichment(
        self,
        question: str,
        data_context: Optional[pd.DataFrame],
    ) -> bool:
        """Check if the query result is too sparse for a recommendation question."""
        # Only enrich if data is sparse (≤3 rows; None counts as sparse)
        if data_context is not None and len(data_context) > 3:
            return False

        # Only enrich if user is asking for recommendations/improvement
        return bool(_RECOMMENDATION_KEYWORDS.search(question))

    def _enrich_context(
        self,
        db,
        client_id: str,
        question: str,
        data_context: Optional[pd.DataFrame],
    ) -> Optional[pd.DataFrame]:
        """
        Pull supplementary context: top-performing branches and channel
        benchmarks to help Gen Agent give actionable recommendations.

        Returns None on any failure or when no benchmark rows exist.
        """
        try:
            # Get top branches by ROI (last 90 days) for comparison
            enrichment_sql = """
                SELECT branch, country, channel,
                       SUM(spend) as total_spend,
                       SUM(revenue) as total_revenue,
                       SUM(leads) as total_leads,
                       SUM(conversions) as total_conversions,
                       ROUND((SUM(revenue)/NULLIF(SUM(spend),0))::numeric, 2) as roi
                FROM spend_data
                WHERE client_id = %s
                  AND date >= CURRENT_DATE - INTERVAL '90 days'
                GROUP BY branch, country, channel
                HAVING SUM(spend) > 0
                ORDER BY roi DESC
                LIMIT 10
            """
            top_performers = db.execute_query(enrichment_sql, (client_id,))

            if top_performers is not None and len(top_performers) > 0:
                logger.info("Enrichment: fetched %d top performer rows", len(top_performers))
                return top_performers

        except Exception as e:
            logger.warning("Context enrichment failed: %s", e)

        return None

    def _log_interaction(
        self,
        message: NormalisedMessage,
        client_config: ClientConfig,
        result: Dict,
    ):
        """Log interaction for monitoring (best-effort; never raises)."""
        try:
            # Imported lazily so monitoring stays optional.
            from rag.monitoring import log_interaction
            log_interaction(
                client_id=client_config.client_id,
                query=message.text,
                num_chunks=result.get("rag_chunks_used", 0),
                chunk_types_used=[],  # Populated when retriever provides this info
                qa_score=result.get("qa_score", 0),
                qa_passed=result.get("qa_passed", False),
                retries=result.get("retries", 0),
                response_length=len(result.get("response", "")),
                elapsed_seconds=result.get("elapsed_seconds", 0),
            )
        except Exception as e:
            logger.debug("Monitoring log failed: %s", e)

        logger.info(
            "[%s] Query processed: score=%.2f, passed=%s, retries=%s, time=%ss",
            client_config.client_id, result["qa_score"], result["qa_passed"],
            result["retries"], result["elapsed_seconds"],
        )
|
||||||
@@ -0,0 +1,165 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — QA Agent
|
||||||
|
|
||||||
|
Evaluates Gen Agent responses for faithfulness against data context.
|
||||||
|
Uses Groq LLM at temperature 0.1 for strict, deterministic evaluation.
|
||||||
|
Returns JSON: { score, passed, issues }
|
||||||
|
Threshold from client YAML hallucination_threshold (default 0.75).
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
from typing import Optional, List, Dict
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
from config.llm_client import get_llm_client, get_model_name
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
EVAL_PROMPT = """You are a strict quality assurance evaluator for business intelligence responses.
|
||||||
|
|
||||||
|
Your job: verify that the response ONLY contains claims supported by the provided data.
|
||||||
|
|
||||||
|
## Data Context (ground truth)
|
||||||
|
{data_context}
|
||||||
|
|
||||||
|
## Response to Evaluate
|
||||||
|
{response}
|
||||||
|
|
||||||
|
## Evaluation Criteria
|
||||||
|
|
||||||
|
### 1. Branch Name Validation (CRITICAL)
|
||||||
|
- Extract ALL branch/city names mentioned in the response
|
||||||
|
- Compare against the branch names in the Data Context above
|
||||||
|
- If ANY branch name appears in the response but NOT in the Data Context, this is a HALLUCINATION
|
||||||
|
- Deduct 0.3 from score for EACH unrelated branch mentioned
|
||||||
|
|
||||||
|
### 2. Numerical Accuracy (CRITICAL)
|
||||||
|
- ALL revenue, spend, lead, conversion, and ROI figures in the response must match the Data Context EXACTLY
|
||||||
|
- If a number is mentioned that does not appear in the Data Context, deduct 0.2 from score
|
||||||
|
- Rounded numbers are acceptable only if clearly approximate (e.g., "~$1.2M")
|
||||||
|
|
||||||
|
### 3. Historical Context Relevance
|
||||||
|
- If the response includes historical context or trends, it is acceptable ONLY if it directly supports the answer about branches/entities present in the Data Context
|
||||||
|
- Historical context about branches NOT in the current Data Context must be penalized: deduct 0.3 from score
|
||||||
|
- Example: If Data Context shows Toronto, Vancouver, Dubai but response mentions "Lawton showed 16436% growth" — this is IRRELEVANT historical context and must be penalized
|
||||||
|
|
||||||
|
### 4. Completeness
|
||||||
|
- Does the response address the user's question?
|
||||||
|
- Are key data points from the Data Context included?
|
||||||
|
|
||||||
|
### 5. Appropriate Hedging
|
||||||
|
- Does the response use uncertain language for inferences?
|
||||||
|
- Recommendations should be clearly marked as suggestions, not facts
|
||||||
|
|
||||||
|
## Scoring
|
||||||
|
Start at 1.0 and deduct points per the rules above. Minimum score is 0.0.
|
||||||
|
|
||||||
|
Return a JSON object with exactly this structure:
|
||||||
|
{{
|
||||||
|
"score": <float between 0.0 and 1.0>,
|
||||||
|
"passed": <true if score >= {threshold}>,
|
||||||
|
"issues": [<list of specific issues found, empty if none>]
|
||||||
|
}}
|
||||||
|
|
||||||
|
IMPORTANT: If score < {threshold}, include in issues list exactly which branches, figures, or historical data were mentioned that do NOT appear in the Data Context. Format as:
|
||||||
|
"Mentioned branches/figures not in current query result: [list them]"
|
||||||
|
|
||||||
|
Return ONLY the JSON. No other text."""
|
||||||
|
|
||||||
|
|
||||||
|
class QAAgent:
    """Quality assurance agent for validating Gen Agent responses.

    Sends the response plus its ground-truth data context to the LLM
    (temperature 0.1 for near-deterministic judging) and parses the
    returned JSON verdict: ``{ score, passed, issues }``. Both evaluation
    and parsing fail OPEN (passed=True, score=0.5) so an unavailable or
    misbehaving judge never blocks user responses.
    """

    def __init__(self):
        # Shared Groq client/model — see config.llm_client.
        self.client = get_llm_client()
        self.model = get_model_name()

    def evaluate(
        self,
        response: str,
        data_context: Optional[pd.DataFrame] = None,
        threshold: float = 0.75,
        supplementary_context: Optional[pd.DataFrame] = None,
        user_question: str = "",
    ) -> Dict:
        """
        Evaluate a response for faithfulness against its data context.

        Args:
            response: Gen Agent's response text
            data_context: The data the response should be grounded in
            threshold: Minimum score to pass (from client YAML)
            supplementary_context: Benchmark data (top performers) that is also valid ground truth
            user_question: The user's original question (entities mentioned here are valid context)

        Returns:
            Dict with score (float), passed (bool), issues (list[str])
        """
        # Build the ground-truth block the judge compares against.
        if data_context is not None and len(data_context) > 0:
            data_str = data_context.to_markdown(index=False)
        else:
            data_str = "No structured data available."

        # Benchmark rows count as valid ground truth too.
        if supplementary_context is not None and len(supplementary_context) > 0:
            data_str += "\n\n### Benchmark Data (also valid ground truth)\n"
            data_str += supplementary_context.to_markdown(index=False)

        # Entities the user themselves named are fair game for the
        # response to reference, even if absent from the query result.
        if user_question:
            data_str += f"\n\n### User Question Context\nThe user asked: \"{user_question}\"\nBranch/entity names mentioned in the user's question are valid to reference in the response."

        prompt = EVAL_PROMPT.format(
            data_context=data_str,
            response=response,
            threshold=threshold,
        )

        try:
            result = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "You are a strict QA evaluator. Return only valid JSON. Pay special attention to branch names and figures that appear in the response but NOT in the data context — these are hallucinations."},
                    {"role": "user", "content": prompt},
                ],
                temperature=0.1,  # low temperature for stable judging
                max_tokens=512,
            )

            raw = result.choices[0].message.content.strip()
            evaluation = self._parse_response(raw, threshold)
            logger.info(
                f"QA evaluation: score={evaluation['score']:.2f}, "
                f"passed={evaluation['passed']}, issues={len(evaluation['issues'])}"
            )
            return evaluation

        except Exception as e:
            logger.error(f"QA evaluation failed: {e}")
            # On failure, pass with warning (fail open).
            return {"score": 0.5, "passed": True, "issues": [f"QA evaluation error: {str(e)}"]}

    def _parse_response(self, raw: str, threshold: float) -> Dict:
        """Parse the JSON verdict returned by the QA LLM call.

        Robust to markdown code fences (``` or ```json), out-of-range
        scores, and JSON payloads that are valid but not an object. On
        any parse failure, fails open with a diagnostic issue entry.
        """
        try:
            cleaned = raw.strip()
            # Strip markdown code fences if present; splitting off the
            # first line also discards a "```json" language tag.
            if cleaned.startswith("```"):
                cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else cleaned[3:]
            if cleaned.endswith("```"):
                cleaned = cleaned[:-3]
            cleaned = cleaned.strip()

            data = json.loads(cleaned)
            # Clamp to [0, 1]: the model occasionally emits out-of-range
            # scores, which would otherwise distort threshold comparisons
            # and downstream reporting.
            score = max(0.0, min(1.0, float(data.get("score", 0.5))))
            issues = data.get("issues", [])
            if not isinstance(issues, list):
                issues = [str(issues)]
            return {
                "score": score,
                # Recompute pass/fail locally rather than trusting the
                # model's own "passed" field.
                "passed": score >= threshold,
                "issues": issues,
            }
        except (json.JSONDecodeError, ValueError, TypeError, AttributeError) as e:
            # TypeError/AttributeError cover valid-but-non-object JSON
            # (e.g. a bare list or string), where .get()/float() would
            # otherwise raise uncaught through this method.
            logger.warning(f"Could not parse QA response: {e}. Raw: {raw[:200]}")
            return {"score": 0.5, "passed": True, "issues": ["QA response parsing failed"]}
|
||||||
@@ -0,0 +1,214 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — Scout Agent
|
||||||
|
|
||||||
|
Fetches real-time competitor updates and sector-specific news.
|
||||||
|
Runs inside HEARTBEAT digest job ONLY — never on ad-hoc /chat queries.
|
||||||
|
Appends "Market Intelligence" section to morning digest.
|
||||||
|
|
||||||
|
If nothing relevant is found, the section is omitted entirely — no filler.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from datetime import datetime
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
from config.llm_client import get_llm_client, get_model_name
|
||||||
|
from config.client_loader import ClientConfig
|
||||||
|
from config.settings import get_settings
|
||||||
|
from skills.web_search import web_search
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
# Digest-time prompt: summarise raw web-search results into a
# "Market Intelligence" section. Placeholders: {client_name}, {sector},
# {competitors}, {lookback}, {search_results}. The NO_RELEVANT_NEWS
# sentinel lets callers detect (and omit) an empty section — no filler.
SCOUT_PROMPT = """You are a business intelligence scout for {client_name}.
Their sector: {sector}
Their competitors: {competitors}

Below are web search results from the last {lookback} day(s).
Extract ONLY what is directly relevant to this client's business.
Ignore anything generic or unrelated to their sector.
If nothing is relevant, respond with exactly: NO_RELEVANT_NEWS

Format relevant findings as a clean "Market Intelligence" section with bullet points.
Each bullet should summarize one key finding with its source.

Results:
{search_results}"""


# Ad-hoc query prompt (used by the /scout endpoint): same contract and
# NO_RELEVANT_NEWS sentinel as SCOUT_PROMPT, but scoped to one user
# question. Placeholders: {client_name}, {sector}, {competitors},
# {query}, {search_results}.
QUERY_PROMPT = """You are a business intelligence scout for {client_name}.
Sector: {sector}
Competitors: {competitors}

The user asked: "{query}"

Below are web search results. Extract ONLY what is directly relevant to the
user's question and this client's business context. Ignore generic or unrelated content.
If nothing is relevant, respond with exactly: NO_RELEVANT_NEWS

Format findings as concise bullet points with sources.

Results:
{search_results}"""
|
||||||
|
|
||||||
|
|
||||||
|
class ScoutAgent:
    """Competitor and sector intelligence agent.

    Gathers recent web-search results (competitor names + sector
    keywords), then asks the LLM to distil only client-relevant findings.
    The LLM signals "nothing useful" with the NO_RELEVANT_NEWS sentinel,
    in which case callers omit the section entirely.
    """

    def __init__(self):
        # Shared Groq client/model — see config.llm_client.
        self.client = get_llm_client()
        self.model = get_model_name()

    # ------------------------------------------------------------------
    # Internal helpers shared by the digest and ad-hoc query paths
    # ------------------------------------------------------------------

    @staticmethod
    def _dedupe(results):
        """Drop duplicate search hits, keeping the first occurrence per URL."""
        seen_urls = set()
        unique_results = []
        for r in results:
            if r["url"] not in seen_urls:
                seen_urls.add(r["url"])
                unique_results.append(r)
        return unique_results

    @staticmethod
    def _format_results(results):
        """Render search hits as markdown blocks for the LLM prompt."""
        return "\n\n".join(
            f"**{r['title']}** ({r['url']})\n{r['content']}"
            for r in results
        )

    def _summarise(self, prompt: str) -> Optional[str]:
        """Run one LLM summarisation call.

        Returns the summary text, or None when the model answered with
        the NO_RELEVANT_NEWS sentinel. Exceptions propagate to the caller,
        which decides how to log them.
        """
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "You are a business intelligence scout."},
                {"role": "user", "content": prompt},
            ],
            temperature=0.3,
            max_tokens=1024,
        )
        result = response.choices[0].message.content.strip()
        if result == "NO_RELEVANT_NEWS":
            return None
        return result

    async def gather_intelligence(
        self,
        client_config: ClientConfig,
    ) -> Optional[str]:
        """
        Fetch and summarize competitor/sector news for digest.

        Args:
            client_config: Client config with scout section

        Returns:
            Formatted "Market Intelligence" markdown section, or None if nothing relevant
        """
        scout_config = client_config.scout
        if not scout_config.sector and not scout_config.competitors:
            logger.info(f"[{client_config.client_id}] No scout config — skipping")
            return None

        lookback = scout_config.news_lookback_days
        today = datetime.now().strftime("%Y-%m-%d")

        # Gather search results.
        # NOTE(review): web_search is called synchronously inside this
        # async method and will block the event loop while searching —
        # acceptable in the HEARTBEAT job, but confirm before reusing
        # this on a latency-sensitive path.
        all_results = []

        # One search per competitor name.
        for competitor in scout_config.competitors:
            query = f"{competitor} latest news"
            results = web_search(query, max_results=3, lookback_days=lookback)
            all_results.extend(results)

        # One search per sector keyword.
        for keyword in scout_config.keywords[:3]:  # Limit to 3 keywords
            query = f"{keyword} news {today}"
            results = web_search(query, max_results=3, lookback_days=lookback)
            all_results.extend(results)

        if not all_results:
            logger.info(f"[{client_config.client_id}] No search results found")
            return None

        # Deduplicate by URL (matches search_query behaviour):
        # overlapping competitor/keyword searches often return the same
        # article, which would waste prompt tokens and bias the summary.
        all_results = self._dedupe(all_results)

        # Summarize with Groq
        prompt = SCOUT_PROMPT.format(
            client_name=client_config.client_name,
            sector=scout_config.sector,
            competitors=", ".join(scout_config.competitors),
            lookback=lookback,
            search_results=self._format_results(all_results),
        )

        try:
            result = self._summarise(prompt)
            if result is None:
                logger.info(f"[{client_config.client_id}] Scout: no relevant news found")
                return None

            section = f"## 🔭 Market Intelligence\n\n{result}"
            logger.info(f"[{client_config.client_id}] Scout: generated intelligence section")
            return section

        except Exception as e:
            logger.error(f"Scout Agent failed: {e}")
            return None

    async def search_query(
        self,
        client_config: ClientConfig,
        query: str,
    ) -> Optional[str]:
        """
        Run a targeted scout search for a specific user query.

        Used by the /scout endpoint for ad-hoc competitor/news queries.

        Args:
            client_config: Client config with scout section
            query: User's specific question about competitors/market

        Returns:
            Formatted intelligence summary, or None if nothing relevant
        """
        scout_config = client_config.scout

        # Search with the user's query directly
        results = web_search(query, max_results=5, lookback_days=scout_config.news_lookback_days)

        # Also search with competitor names if they appear in the query
        for competitor in scout_config.competitors:
            if competitor.lower() in query.lower():
                extra = web_search(f"{competitor} latest news", max_results=3, lookback_days=scout_config.news_lookback_days)
                results.extend(extra)

        if not results:
            logger.info(f"[{client_config.client_id}] Scout query returned no results")
            return None

        prompt = QUERY_PROMPT.format(
            client_name=client_config.client_name,
            sector=scout_config.sector,
            competitors=", ".join(scout_config.competitors),
            query=query,
            search_results=self._format_results(self._dedupe(results)),
        )

        try:
            # None here means the model returned the NO_RELEVANT_NEWS sentinel.
            return self._summarise(prompt)
        except Exception as e:
            logger.error(f"Scout query failed: {e}")
            return None
|
||||||
@@ -0,0 +1,121 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — Protocol Adapter (OpenClaw Pattern)
|
||||||
|
|
||||||
|
Normalises messages from any channel into a unified NormalisedMessage.
|
||||||
|
Maps workspace/team IDs → client_id. Strips bot mentions.
|
||||||
|
Interface: any channel handler produces NormalisedMessage — adding Teams,
|
||||||
|
WhatsApp, etc. requires zero pipeline changes.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import re
|
||||||
|
import logging
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from datetime import datetime
|
||||||
|
from typing import Dict, Optional
|
||||||
|
|
||||||
|
from config.client_loader import ClientConfig
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
class NormalisedMessage:
    """Unified message format — channel-agnostic.

    Every channel handler (Slack, Teams, direct API) produces this shape,
    so the downstream pipeline never touches channel-specific payloads.
    """

    # The user's message text, with any bot-mention markup stripped.
    text: str
    # Channel/conversation ID (Slack channel, Teams channelId, or "api").
    channel: str
    # Sender's user ID in the source channel ("api_user" for API calls).
    user_id: str
    # Clawrity client this message belongs to ("unknown" if unresolvable).
    client_id: str
    # NOTE(review): datetime.utcnow produces a NAIVE timestamp and is
    # deprecated since Python 3.12 — presumably consumers expect naive
    # UTC; confirm before switching to datetime.now(timezone.utc).
    timestamp: datetime = field(default_factory=datetime.utcnow)
    # Originating channel type: "slack", "teams", "api".
    source: str = "unknown"
    # Raw upstream event payload, kept for debugging/auditing.
    raw_event: Optional[Dict] = None
||||||
|
|
||||||
|
|
||||||
|
# Pattern to match Slack bot mentions like <@U1234567890>.
# Also consumes any whitespace immediately following the mention so the
# remaining text starts cleanly after substitution.
SLACK_MENTION_PATTERN = re.compile(r"<@[A-Z0-9]+>\s*")
||||||
|
|
||||||
|
|
||||||
|
class ProtocolAdapter:
    """Normalises raw channel events into NormalisedMessages.

    One adapter instance serves all channels: each normalise_* method maps
    a channel-specific payload onto the shared NormalisedMessage shape and
    resolves the owning client, so the rest of the pipeline stays
    channel-agnostic.
    """

    def __init__(self, client_configs: Dict[str, ClientConfig]):
        """
        Args:
            client_configs: Dict of client_id → ClientConfig
        """
        self.client_configs = client_configs
        # Build workspace → client_id lookup from each client's declared
        # Slack workspace IDs.
        self._workspace_map: Dict[str, str] = {
            ws_id: cid
            for cid, config in client_configs.items()
            for ws_id in config.slack_workspace_ids
        }
        # Single-tenant convenience: with exactly one client configured,
        # unresolved events fall back to it instead of "unknown".
        self._default_client_id = (
            next(iter(client_configs)) if len(client_configs) == 1 else None
        )

    def normalise_slack(self, event: dict, team_id: Optional[str] = None) -> NormalisedMessage:
        """
        Normalise a Slack event into a NormalisedMessage.

        Args:
            event: Raw Slack event dict (from Bolt SDK)
            team_id: Slack workspace/team ID

        Returns:
            NormalisedMessage
        """
        text = event.get("text", "")
        # Strip bot mention tags (e.g. "<@U123> hello" → "hello").
        text = SLACK_MENTION_PATTERN.sub("", text).strip()

        channel = event.get("channel", "")
        user_id = event.get("user", "")

        # Map workspace to client
        client_id = self._resolve_client_id(team_id)

        return NormalisedMessage(
            text=text,
            channel=channel,
            user_id=user_id,
            client_id=client_id,
            source="slack",
            raw_event=event,
        )

    def normalise_api(self, client_id: str, message: str) -> NormalisedMessage:
        """Normalise a direct API call (POST /chat).

        The caller supplies the client_id explicitly, so no workspace
        resolution is needed.
        """
        return NormalisedMessage(
            text=message,
            channel="api",
            user_id="api_user",
            client_id=client_id,
            source="api",
        )

    def normalise_teams(self, activity: dict) -> NormalisedMessage:
        """
        Normalise a Microsoft Teams Bot Framework activity.

        # TODO: Implement full Teams normalisation when Teams handler is wired up.
        """
        text = activity.get("text", "")
        # Strip Teams bot mention (usually <at>BotName</at>)
        text = re.sub(r"<at>.*?</at>\s*", "", text).strip()

        return NormalisedMessage(
            text=text,
            channel=activity.get("channelId", "teams"),
            user_id=activity.get("from", {}).get("id", ""),
            # Teams→client mapping is not implemented yet; fall back to
            # the single-tenant default.
            client_id=self._default_client_id or "unknown",
            source="teams",
            raw_event=activity,
        )

    def _resolve_client_id(self, workspace_id: Optional[str]) -> str:
        """Resolve a workspace/team ID to a client_id.

        Resolution order: explicit workspace mapping → single-tenant
        default → "unknown" (logged, handled downstream).
        """
        if workspace_id and workspace_id in self._workspace_map:
            return self._workspace_map[workspace_id]
        if self._default_client_id:
            return self._default_client_id
        logger.warning(f"Could not resolve client for workspace: {workspace_id}")
        return "unknown"
|
||||||
@@ -0,0 +1,263 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — Slack Handler (Socket Mode)
|
||||||
|
|
||||||
|
Listens for app_mention and message events via Slack Bolt SDK.
|
||||||
|
Runs in a background thread to not block FastAPI.
|
||||||
|
|
||||||
|
=== SETUP REQUIRED ===
|
||||||
|
Before running, configure these in your .env file:
|
||||||
|
|
||||||
|
SLACK_BOT_TOKEN=xoxb-... ← OAuth & Permissions → Install to Workspace
|
||||||
|
SLACK_APP_TOKEN=xapp-... ← Socket Mode → Generate App-Level Token
|
||||||
|
SLACK_SIGNING_SECRET=... ← Basic Information → App Credentials
|
||||||
|
|
||||||
|
See README.md for detailed Slack app setup instructions.
|
||||||
|
=======================
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import logging
|
||||||
|
import threading
|
||||||
|
import time
|
||||||
|
from concurrent.futures import ThreadPoolExecutor
|
||||||
|
from typing import Dict, Optional, Set
|
||||||
|
|
||||||
|
from config.settings import get_settings
|
||||||
|
from config.client_loader import ClientConfig
|
||||||
|
from channels.protocol_adapter import ProtocolAdapter, NormalisedMessage
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
# Thread pool for running the LLM pipeline without blocking event handlers.
# Slack expects a fast ack from Socket Mode listeners; the heavy
# orchestrator work is submitted here instead.
_executor = ThreadPoolExecutor(max_workers=4, thread_name_prefix="clawrity-slack")

# Module-level guard: only one SlackHandler should be active at a time —
# a second live Socket Mode connection for the same app would receive
# (and answer) every event twice.
_active_handler: Optional["SlackHandler"] = None
|
||||||
|
|
||||||
|
|
||||||
|
class SlackHandler:
    """Slack Bot using Socket Mode via Bolt SDK.

    Lifecycle: start() validates tokens, builds the Bolt App, registers
    event listeners, then runs the Socket Mode connection in a daemon
    thread so FastAPI is never blocked. Event listeners ack immediately
    and hand the heavy pipeline work to the module-level thread pool.
    """

    def __init__(
        self,
        protocol_adapter: ProtocolAdapter,
        client_configs: Dict[str, ClientConfig],
        orchestrator,  # agents.orchestrator.Orchestrator
    ):
        # Stores collaborators and reads Slack credentials from settings;
        # no network connection is made here — that happens in start().
        self.adapter = protocol_adapter
        self.client_configs = client_configs
        self.orchestrator = orchestrator
        # Background thread running the Socket Mode loop (set in start()).
        self._thread: Optional[threading.Thread] = None

        settings = get_settings()

        # ---------------------------------------------------------------
        # Bot Token (xoxb-...) — from .env SLACK_BOT_TOKEN
        # This is the OAuth token installed to your workspace.
        # ---------------------------------------------------------------
        self.bot_token = settings.slack_bot_token

        # ---------------------------------------------------------------
        # App-Level Token (xapp-...) — from .env SLACK_APP_TOKEN
        # Required for Socket Mode. Generated in Slack app settings.
        # ---------------------------------------------------------------
        self.app_token = settings.slack_app_token

        # ---------------------------------------------------------------
        # Signing Secret — from .env SLACK_SIGNING_SECRET
        # Used to verify incoming requests from Slack.
        # ---------------------------------------------------------------
        self.signing_secret = settings.slack_signing_secret

        # Bolt App and SocketModeHandler — created lazily in _setup_app().
        self.app = None
        self.handler = None

        # Deduplication: track recently processed event keys.
        # Slack retries events if the handler is slow to ack — this
        # prevents the same user message being answered twice.
        self._processed_events: Set[str] = set()
        self._processed_lock = threading.Lock()

    def _validate_tokens(self) -> bool:
        """Check that all required Slack tokens are configured.

        Returns True when both the bot token and the app-level token are
        present; False (with a warning log) otherwise. The signing secret
        is not required here — it is passed through as optional.
        """
        if not self.bot_token:
            logger.warning(
                "SLACK_BOT_TOKEN not set. Slack bot will not start. "
                "See README.md → Slack Bot Setup for instructions."
            )
            return False
        if not self.app_token:
            logger.warning(
                "SLACK_APP_TOKEN not set. Socket Mode requires an app-level token. "
                "Go to your Slack app → Socket Mode → Generate Token."
            )
            return False
        return True

    def _is_duplicate_event(self, event: dict) -> bool:
        """Return True if this event was already processed (Slack retry dedup).

        Slack redelivers events when the handler does not ack fast enough,
        and the same user message can arrive via both the app_mention and
        message event types. Several candidate keys are built so dedup
        works across both delivery paths:
          - client_msg_id: unique per user message (message events only,
            NOT present on app_mention events)
          - event_ts: present on both event types
          - type + ts + user: composite fallback
        All keys are registered together so matching ANY one of them on a
        later delivery marks it as a duplicate.
        """
        msg_id = event.get("client_msg_id")
        event_ts = event.get("event_ts") or event.get("ts", "")
        user = event.get("user", "")

        # Build candidate keys
        keys = set()
        if msg_id:
            keys.add(f"msg:{msg_id}")
        if event_ts:
            keys.add(f"ts:{event_ts}")
        # Fallback: combine event type + ts + user for events without client_msg_id
        event_type = event.get("type", "")
        if event_ts and user:
            keys.add(f"evt:{event_type}:{event_ts}:{user}")

        if not keys:
            # No identifying fields at all — process rather than drop.
            return False

        with self._processed_lock:
            # Check ALL keys — if any match, it's a duplicate
            for key in keys:
                if key in self._processed_events:
                    logger.debug(f"Skipping duplicate event (matched key: {key})")
                    return True

            # Register ALL keys so cross-event-type dedup works
            # (app_mention and message for the same user message share event_ts)
            self._processed_events.update(keys)

            # Prune old entries (keep set from growing indefinitely).
            # NOTE(review): set iteration order is arbitrary, so this keeps
            # 200 *arbitrary* entries, not the most recent — acceptable as
            # a size cap, but not an LRU.
            if len(self._processed_events) > 500:
                self._processed_events = set(list(self._processed_events)[-200:])

        return False

    def _setup_app(self):
        """Initialize Slack Bolt App and register event handlers.

        slack_bolt is imported lazily so the dependency is only needed
        when the bot actually starts.
        """
        from slack_bolt import App
        from slack_bolt.adapter.socket_mode import SocketModeHandler

        self.app = App(
            token=self.bot_token,
            signing_secret=self.signing_secret if self.signing_secret else None,
        )

        # Track bot's own user ID to prevent self-response loops
        self._bot_user_id = None
        try:
            auth = self.app.client.auth_test()
            self._bot_user_id = auth.get("user_id", "")
            logger.info(f"Bot user ID: {self._bot_user_id}")
        except Exception as e:
            # Non-fatal: without the ID we still filter on bot_id/subtype.
            logger.warning(f"Could not fetch bot user ID: {e}")

        # --- Event: Bot mentioned in a channel ---
        @self.app.event("app_mention")
        def handle_mention(event, say, context):
            # Return IMMEDIATELY so Slack gets ack — process in background
            if self._is_duplicate_event(event):
                return
            _executor.submit(self._handle_event, event, say, context)

        # --- Event: Direct message to bot ---
        @self.app.event("message")
        def handle_message(event, say, context):
            # Ignore bot's own messages and message_changed events
            if event.get("subtype") in (
                "bot_message",
                "message_changed",
                "message_deleted",
            ):
                return
            if event.get("bot_id"):
                return
            # Ignore if this is from the bot itself
            if self._bot_user_id and event.get("user") == self._bot_user_id:
                return
            # Skip channel messages that contain a bot mention —
            # those are handled by the app_mention handler above.
            # Only process DMs here (channel_type == "im").
            channel_type = event.get("channel_type", "")
            if channel_type != "im":
                return
            if self._is_duplicate_event(event):
                return
            # Return IMMEDIATELY — process in background
            _executor.submit(self._handle_event, event, say, context)

        self.handler = SocketModeHandler(self.app, self.app_token)

    def _handle_event(self, event: dict, say, context):
        """Process an incoming Slack event (runs in a worker thread).

        Normalises the event, resolves the client, runs the async
        orchestrator pipeline on a fresh event loop (worker threads have
        no running loop), and replies via `say`. All failures are caught
        so a broken request never kills the worker thread.
        """
        try:
            team_id = context.get("team_id", None) if context else None
            message = self.adapter.normalise_slack(event, team_id=team_id)

            if not message.text:
                # Nothing left after stripping the mention — ignore.
                return

            if message.client_id == "unknown":
                say("⚠️ Could not identify your workspace. Please contact support.")
                return

            client_config = self.client_configs.get(message.client_id)
            if not client_config:
                say(f"⚠️ No configuration found for client: {message.client_id}")
                return

            # Run the orchestrator pipeline (async in sync context)
            loop = asyncio.new_event_loop()
            try:
                result = loop.run_until_complete(
                    self.orchestrator.process(message, client_config)
                )
                say(result["response"])
            finally:
                loop.close()

        except Exception as e:
            logger.error(f"Slack event handler error: {e}", exc_info=True)
            say(
                "❌ I encountered an error processing your request. "
                "Please try again or contact support."
            )

    def start(self):
        """Start the Slack bot in a background (daemon) thread.

        No-op when tokens are missing. Stops any previously active handler
        first: two live Socket Mode connections for the same app would
        each receive — and answer — every event.
        """
        global _active_handler

        if not self._validate_tokens():
            logger.info("Slack bot not started — missing tokens")
            return

        # Stop any existing handler to prevent duplicate Socket Mode connections
        if _active_handler is not None:
            logger.info("Stopping previous Slack handler before starting new one")
            _active_handler.stop()
            _active_handler = None

        try:
            self._setup_app()

            def _run():
                logger.info("Starting Slack bot (Socket Mode)...")
                self.handler.start()

            # Daemon thread: the process can exit without joining it.
            self._thread = threading.Thread(target=_run, daemon=True)
            self._thread.start()
            _active_handler = self
            logger.info("Slack bot started in background thread")

        except Exception as e:
            logger.error(f"Failed to start Slack bot: {e}")

    def stop(self):
        """Stop the Slack bot's Socket Mode connection (best-effort)."""
        if self.handler:
            try:
                self.handler.close()
                logger.info("Slack bot stopped")
            except Exception as e:
                logger.warning(f"Error stopping Slack bot: {e}")
|
||||||
@@ -0,0 +1,124 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — Microsoft Teams Handler (STUB)
|
||||||
|
|
||||||
|
Skeleton implementation of the Bot Framework adapter for Microsoft Teams.
|
||||||
|
Proves the multi-channel architecture is real — any channel handler produces
|
||||||
|
NormalisedMessage via ProtocolAdapter, so the entire pipeline works unchanged.
|
||||||
|
|
||||||
|
# TODO: Wire up Azure Bot credentials when ready for Teams demo.
|
||||||
|
# Required: MICROSOFT_APP_ID, MICROSOFT_APP_PASSWORD
|
||||||
|
# Package: botbuilder-core, botbuilder-schema
|
||||||
|
|
||||||
|
Status: NOT IMPLEMENTED — Slack is the priority for development.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from typing import Dict, Optional
|
||||||
|
|
||||||
|
from channels.protocol_adapter import ProtocolAdapter, NormalisedMessage
|
||||||
|
from config.client_loader import ClientConfig
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class TeamsHandler:
    """
    Microsoft Teams bot handler stub.

    Architecture:
        Teams Activity → ProtocolAdapter.normalise_teams() → Orchestrator → Response

    The same pipeline used by Slack — zero business logic in this layer.
    Status: NOT IMPLEMENTED — Slack is the development priority; this class
    exists to prove the multi-channel architecture.
    """

    def __init__(
        self,
        protocol_adapter: ProtocolAdapter,
        client_configs: Dict[str, ClientConfig],
        orchestrator,  # agents.orchestrator.Orchestrator
    ):
        self.adapter = protocol_adapter
        self.client_configs = client_configs
        self.orchestrator = orchestrator
        # TODO: Wire up Azure Bot credentials from .env
        # (MICROSOFT_APP_ID / MICROSOFT_APP_PASSWORD → self.app_id / self.app_password)

    async def handle_activity(self, activity: dict) -> str:
        """
        Process an incoming Teams Bot Framework activity.

        Flow: normalise the raw activity via the ProtocolAdapter, look up the
        client's config, and hand the message to the Orchestrator; the
        orchestrator's response text is returned to Teams.

        Args:
            activity: Raw Bot Framework activity dict

        Returns:
            Response text to send back to Teams
        """
        msg = self.adapter.normalise_teams(activity)

        cfg = self.client_configs.get(msg.client_id)
        if not cfg:
            return f"No configuration found for client: {msg.client_id}"

        outcome = await self.orchestrator.process(msg, cfg)
        return outcome["response"]

    def setup_routes(self, app):
        """
        Register Teams webhook endpoint with FastAPI.

        Currently a stub: logs how to enable Teams and registers nothing.
        A real implementation needs the botbuilder-core package, a
        BotFrameworkAdapter built from app_id + app_password, and a
        POST /api/teams/messages route that deserializes each Activity and
        feeds it through handle_activity() via the Bot Framework turn context.
        """
        logger.info(
            "Teams handler stub loaded. "
            "To enable Teams: install botbuilder-core, set Azure Bot credentials."
        )
|
||||||
@@ -0,0 +1,158 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — Client Configuration Loader
|
||||||
|
|
||||||
|
Scans config/clients/ for YAML files and parses each into a ClientConfig model.
|
||||||
|
Supports ${ENV_VAR} interpolation in YAML values.
|
||||||
|
New client = new YAML file. Zero code changes.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
import glob
|
||||||
|
import logging
|
||||||
|
from typing import Dict, List, Optional
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import yaml
|
||||||
|
from pydantic import BaseModel
|
||||||
|
|
||||||
|
from config.settings import get_settings
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Pydantic models for client YAML structure
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
class DataSourceConfig(BaseModel):
    """Where a client's raw data lives and which connector loads it."""

    # Connector type; "csv" is handled by CSVConnector (covers CSV and Excel).
    type: str = "csv"
    # Path to the data file, relative to the repo root.
    path: str = ""
|
||||||
|
|
||||||
|
|
||||||
|
class DatabaseConfig(BaseModel):
    """Per-client PostgreSQL connection details."""

    # Full connection URL; typically "${DATABASE_URL}" interpolated from .env.
    url: str = ""
    # Called schema_name because 'schema' is a Pydantic reserved attr;
    # load_client_configs() maps the YAML key 'schema' onto this field.
    schema_name: str = ""
|
||||||
|
|
||||||
|
|
||||||
|
class ScoutConfig(BaseModel):
    """Settings for the Scout (market-intelligence) agent."""

    # Industry sector used to scope news searches.
    sector: str = ""
    # Competitor names to track in news results.
    competitors: List[str] = []
    # Search keywords for sector news queries.
    keywords: List[str] = []
    # How many days back to look for news items.
    news_lookback_days: int = 1
|
||||||
|
|
||||||
|
|
||||||
|
class ClientConfig(BaseModel):
    """Full per-client configuration parsed from one YAML file.

    New client = new YAML file; this model is the only contract the rest of
    the pipeline sees.
    """

    # Unique key for this client; required — files without it are skipped.
    client_id: str
    client_name: str = ""

    # Where the client's data lives and how to reach its database.
    data_source: DataSourceConfig = DataSourceConfig()
    database: DatabaseConfig = DatabaseConfig()

    # Markets the client operates in.
    countries: List[str] = []
    # Thresholds consumed by downstream agents (fractions; TODO confirm ranges).
    risk_threshold: float = 0.15
    hallucination_threshold: float = 0.75

    # Daily digest delivery time ("HH:MM") and its timezone name.
    digest_schedule: str = "08:00"
    timezone: str = "UTC"

    # Delivery channels, e.g. {"slack_webhook": "<url>"}.
    channels: Dict[str, str] = {}

    # Markdown files shaping the client's persona and scheduled heartbeat.
    soul_file: str = ""
    heartbeat_file: str = ""

    # Maps raw source column names → canonical names (see ETL normaliser).
    column_mapping: Dict[str, str] = {}

    # Market-intelligence (Scout agent) settings.
    scout: ScoutConfig = ScoutConfig()

    # Runtime: workspace/team ID → client_id mapping for ProtocolAdapter
    slack_workspace_ids: List[str] = []
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Environment variable interpolation
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
# Matches ${VAR_NAME} placeholders inside YAML string values.
_ENV_PATTERN = re.compile(r"\$\{(\w+)\}")


def _interpolate_env(value: str) -> str:
    """Replace ${ENV_VAR} placeholders with actual environment variable values.

    Unset variables are left as the literal ``${VAR}`` text, so a missing
    secret stays visible instead of silently becoming "". Non-string values
    pass through unchanged.
    """
    def _replace(match):
        var_name = match.group(1)
        # Fall back to the original placeholder when the variable is unset.
        return os.environ.get(var_name, match.group(0))

    if isinstance(value, str):
        return _ENV_PATTERN.sub(_replace, value)
    return value


def _interpolate_dict(d: dict) -> dict:
    """Recursively interpolate environment variables in a dictionary.

    Recurses into nested dicts and into list items — including dicts inside
    lists (a YAML "list of mappings"), which the previous version skipped.
    All other value types pass through untouched.
    """
    result = {}
    for key, value in d.items():
        if isinstance(value, dict):
            result[key] = _interpolate_dict(value)
        elif isinstance(value, list):
            # _interpolate_env() already passes non-strings through unchanged.
            result[key] = [
                _interpolate_dict(v) if isinstance(v, dict) else _interpolate_env(v)
                for v in value
            ]
        elif isinstance(value, str):
            result[key] = _interpolate_env(value)
        else:
            result[key] = value
    return result
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Loader
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def load_client_configs(config_dir: Optional[str] = None) -> Dict[str, ClientConfig]:
    """
    Load all client YAML files from the config directory.

    Accepts both .yaml and .yml extensions (the old code silently ignored
    .yml files) and loads in sorted path order so results are deterministic
    across platforms. A file that fails to parse is logged and skipped; it
    never aborts loading of the other clients.

    Args:
        config_dir: Directory to scan; defaults to settings.clients_config_dir.

    Returns:
        Dict mapping client_id → ClientConfig
    """
    if config_dir is None:
        config_dir = get_settings().clients_config_dir

    configs: Dict[str, ClientConfig] = {}
    # Both common YAML extensions; sorted for a deterministic load order.
    yaml_paths = sorted(
        glob.glob(os.path.join(config_dir, "*.yaml"))
        + glob.glob(os.path.join(config_dir, "*.yml"))
    )

    for yaml_path in yaml_paths:
        try:
            with open(yaml_path, "r") as f:
                raw = yaml.safe_load(f)

            # client_id is mandatory — it keys every downstream lookup.
            if not raw or "client_id" not in raw:
                logger.warning(f"Skipping {yaml_path}: missing client_id")
                continue

            # Interpolate ${ENV_VAR} placeholders before validation.
            interpolated = _interpolate_dict(raw)

            # Handle 'schema' → 'schema_name' mapping for Pydantic
            # ('schema' is a reserved attribute on BaseModel).
            if "database" in interpolated and "schema" in interpolated["database"]:
                interpolated["database"]["schema_name"] = interpolated["database"].pop("schema")

            config = ClientConfig(**interpolated)
            configs[config.client_id] = config
            logger.info(f"Loaded client config: {config.client_id} from {yaml_path}")

        except Exception as e:
            logger.error(f"Error loading {yaml_path}: {e}")

    if not configs:
        logger.warning(f"No client configs found in {config_dir}")

    return configs
|
||||||
|
|
||||||
|
|
||||||
|
def get_client_config(client_id: str, configs: Optional[Dict[str, ClientConfig]] = None) -> Optional[ClientConfig]:
    """Look up one client's config by ID, loading all configs when none are given."""
    lookup = load_client_configs() if configs is None else configs
    return lookup.get(client_id)
|
||||||
@@ -0,0 +1,36 @@
|
|||||||
|
client_id: acme_corp
|
||||||
|
client_name: ACME Corporation
|
||||||
|
|
||||||
|
data_source:
|
||||||
|
type: "csv"
|
||||||
|
path: "data/processed/acme_merged.csv"
|
||||||
|
|
||||||
|
database:
|
||||||
|
url: "${DATABASE_URL}"
|
||||||
|
schema: "acme"
|
||||||
|
|
||||||
|
countries: ["US", "Canada", "MENA"]
|
||||||
|
risk_threshold: 0.15
|
||||||
|
hallucination_threshold: 0.75
|
||||||
|
|
||||||
|
digest_schedule: "08:00"
|
||||||
|
timezone: "Asia/Kolkata"
|
||||||
|
|
||||||
|
channels:
|
||||||
|
slack_webhook: "${ACME_SLACK_WEBHOOK}"
|
||||||
|
|
||||||
|
soul_file: "soul/acme_soul.md"
|
||||||
|
heartbeat_file: "heartbeat/acme_heartbeat.md"
|
||||||
|
|
||||||
|
column_mapping:
|
||||||
|
Order Date: date
|
||||||
|
Country: country
|
||||||
|
City: branch
|
||||||
|
Sales: revenue
|
||||||
|
Profit: profit
|
||||||
|
|
||||||
|
scout:
|
||||||
|
sector: "global retail"
|
||||||
|
competitors: ["IKEA", "Amazon", "Walmart", "Staples"]
|
||||||
|
keywords: ["retail supply chain", "furniture market trends", "office supplies demand", "global retail ecommerce"]
|
||||||
|
news_lookback_days: 1
|
||||||
@@ -0,0 +1,76 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — LLM Client Factory
|
||||||
|
|
||||||
|
Provides a unified LLM client that works with both NVIDIA NIM and Groq.
|
||||||
|
Both are OpenAI-compatible APIs, so we use the OpenAI client with different
|
||||||
|
base URLs and API keys.
|
||||||
|
|
||||||
|
Auto-detects provider from settings:
|
||||||
|
- NVIDIA NIM: base_url="https://integrate.api.nvidia.com/v1"
|
||||||
|
- Groq: base_url="https://api.groq.com/openai/v1"
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from functools import lru_cache
|
||||||
|
|
||||||
|
from openai import OpenAI
|
||||||
|
|
||||||
|
from config.settings import get_settings
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
# Provider configs: OpenAI-compatible endpoints and their default models.
_PROVIDERS = {
    "nvidia": {
        "base_url": "https://integrate.api.nvidia.com/v1",
        "default_model": "meta/llama-3.3-70b-instruct",
    },
    "groq": {
        "base_url": "https://api.groq.com/openai/v1",
        "default_model": "llama-3.3-70b-versatile",
    },
}


def get_llm_client() -> OpenAI:
    """
    Construct an OpenAI-compatible client for the active LLM provider.

    NVIDIA NIM and Groq both speak the OpenAI wire protocol, so only the
    base URL and API key differ between them.

    Raises:
        ValueError: for an unknown provider or a missing API key.
    """
    settings = get_settings()
    provider = settings.active_llm_provider

    if provider not in _PROVIDERS:
        raise ValueError(f"Unknown LLM provider: {provider}")

    # Settings exposes one key per provider: nvidia_api_key / groq_api_key.
    api_key = getattr(settings, f"{provider}_api_key")
    if not api_key:
        raise ValueError(
            f"No API key configured for LLM provider '{provider}'. "
            f"Set {'NVIDIA_API_KEY' if provider == 'nvidia' else 'GROQ_API_KEY'} in .env"
        )

    provider_cfg = _PROVIDERS[provider]
    client = OpenAI(
        api_key=api_key,
        base_url=provider_cfg["base_url"],
    )

    logger.info(f"LLM client: {provider} ({provider_cfg['base_url']})")
    return client
|
||||||
|
|
||||||
|
|
||||||
|
def get_model_name() -> str:
    """Return the model name to use for the active provider.

    If the configured model is the *other* provider's default (meaning the
    user never chose one explicitly), substitute the active provider's own
    default; an explicitly chosen custom model is always respected.
    """
    settings = get_settings()
    provider = settings.active_llm_provider
    model = settings.llm_model

    own_default = _PROVIDERS.get(provider, {}).get("default_model")
    foreign_defaults = {
        cfg["default_model"]
        for name, cfg in _PROVIDERS.items()
        if name != provider
    }
    if own_default and model in foreign_defaults:
        model = own_default

    return model
|
||||||
@@ -0,0 +1,72 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — Application Settings
|
||||||
|
|
||||||
|
Loads environment variables via pydantic-settings.
|
||||||
|
All secrets read from .env file — nothing is hardcoded.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import os
|
||||||
|
from functools import lru_cache
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
from pydantic_settings import BaseSettings
|
||||||
|
|
||||||
|
|
||||||
|
class Settings(BaseSettings):
    """Application settings loaded from environment variables.

    All secrets come from .env (see model_config); nothing is hardcoded.
    Field names map case-insensitively to env var names (e.g. GROQ_API_KEY).
    """

    # --- Database ---
    # Default matches the docker-compose postgres service; override in .env.
    database_url: str = "postgresql://user:pass@localhost:5432/clawrity"

    # --- LLM Providers ---
    # Only one key is needed; active_llm_provider picks whichever is set.
    groq_api_key: str = ""
    nvidia_api_key: str = ""

    # --- Slack (Socket Mode) ---
    # Bot Token (xoxb-...) — OAuth & Permissions → Install to Workspace
    slack_bot_token: str = ""
    # App-Level Token (xapp-...) — Socket Mode → Generate Token
    slack_app_token: str = ""
    # Signing Secret — Basic Information → App Credentials
    slack_signing_secret: str = ""

    # --- Tavily Web Search ---
    tavily_api_key: str = ""

    # --- Slack Webhook for digest delivery ---
    acme_slack_webhook: str = ""

    # --- Paths (relative to the repo root) ---
    data_raw_dir: str = "data/raw"
    data_processed_dir: str = "data/processed"
    logs_dir: str = "logs"
    clients_config_dir: str = "config/clients"

    # --- Model Defaults ---
    llm_model: str = "meta/llama-3.3-70b-instruct"
    llm_provider: str = ""  # auto-detected: "nvidia" or "groq"
    embedding_model: str = "all-MiniLM-L6-v2"
    embedding_dim: int = 384

    @property
    def active_llm_provider(self) -> str:
        """Auto-detect which LLM provider to use based on available keys.

        Precedence: explicit llm_provider setting, then an NVIDIA key,
        then a Groq key, then "nvidia" as the final fallback.
        """
        if self.llm_provider:
            return self.llm_provider
        if self.nvidia_api_key:
            return "nvidia"
        if self.groq_api_key:
            return "groq"
        return "nvidia"  # default

    # pydantic-settings v2 configuration: read .env, match vars case-insensitively.
    model_config = {
        "env_file": ".env",
        "env_file_encoding": "utf-8",
        "case_sensitive": False,
    }
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache()
def get_settings() -> Settings:
    """Singleton settings instance. Cached after first call.

    lru_cache gives every caller the same Settings object, so the .env file
    is read only once per process.
    """
    return Settings()
|
||||||
@@ -0,0 +1,42 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — Base Data Connector
|
||||||
|
|
||||||
|
Abstract interface for data connectors.
|
||||||
|
All connectors implement load() → pd.DataFrame.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from abc import ABC, abstractmethod
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
|
||||||
|
class BaseConnector(ABC):
    """Contract for data-source connectors.

    Every concrete connector turns some storage format into a pandas
    DataFrame via load() and can sanity-check the result via validate().
    """

    @abstractmethod
    def load(self, path: str, **kwargs) -> pd.DataFrame:
        """Read the source at *path* into a DataFrame.

        Args:
            path: Location of the data source.
            **kwargs: Connector-specific options.

        Returns:
            The loaded data as a pandas DataFrame.
        """

    @abstractmethod
    def validate(self, df: pd.DataFrame, required_columns: list) -> bool:
        """Check that *df* contains every column in *required_columns*.

        Args:
            df: DataFrame to check.
            required_columns: Column names that must be present.

        Returns:
            True when all required columns are found.
        """
|
||||||
@@ -0,0 +1,88 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — CSV/Excel Data Connector
|
||||||
|
|
||||||
|
Auto-detects file format based on extension:
|
||||||
|
.csv → pandas read_csv
|
||||||
|
.xlsx / .xls → pandas read_excel (via openpyxl)
|
||||||
|
|
||||||
|
Supports both formats since Kaggle datasets vary by download version.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
from connectors.base_connector import BaseConnector
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class CSVConnector(BaseConnector):
    """Connector for CSV and Excel files with auto-detection."""

    def load(self, path: str, **kwargs) -> pd.DataFrame:
        """
        Load data from a CSV or Excel file, auto-detected by extension.

        Args:
            path: Path to the file (.csv, .xlsx, .xls)
            **kwargs: Passed through to the pandas read function.
                Useful kwargs: sheet_name, encoding, sep

        Returns:
            pandas DataFrame

        Raises:
            FileNotFoundError: if the file does not exist.
            ValueError: for an unsupported file extension.
        """
        file_path = Path(path)

        if not file_path.exists():
            raise FileNotFoundError(f"Data file not found: {path}")

        ext = file_path.suffix.lower()

        if ext == ".csv":
            logger.info(f"Loading CSV: {path}")
            # latin-1 is a forgiving default for Kaggle exports, but an
            # explicit encoding kwarg must win. (Passing encoding= used to
            # raise "got multiple values for keyword argument 'encoding'".)
            kwargs.setdefault("encoding", "latin-1")
            df = pd.read_csv(path, **kwargs)
        elif ext in (".xlsx", ".xls"):
            logger.info(f"Loading Excel ({ext}): {path}")
            # Default to first sheet unless specified
            sheet_name = kwargs.pop("sheet_name", 0)
            df = pd.read_excel(path, sheet_name=sheet_name, engine="openpyxl", **kwargs)
        else:
            raise ValueError(
                f"Unsupported file format: {ext}. "
                f"Supported: .csv, .xlsx, .xls"
            )

        logger.info(f"Loaded {len(df)} rows, {len(df.columns)} columns from {file_path.name}")
        return df

    def validate(self, df: pd.DataFrame, required_columns: list) -> bool:
        """
        Validate that the DataFrame has all required columns.
        Matching is case-insensitive and ignores surrounding whitespace.

        Args:
            df: DataFrame to validate
            required_columns: List of column names that must be present

        Returns:
            True if all required columns found
        """
        df_cols_lower = {col.lower().strip() for col in df.columns}
        missing = [
            col for col in required_columns
            if col.lower().strip() not in df_cols_lower
        ]

        if missing:
            logger.error(
                f"Missing required columns: {missing}. "
                f"Available: {list(df.columns)}"
            )
            return False

        return True
|
||||||
@@ -0,0 +1,38 @@
|
|||||||
|
services:
|
||||||
|
clawrity-api:
|
||||||
|
build: .
|
||||||
|
ports:
|
||||||
|
- "8000:8000"
|
||||||
|
environment:
|
||||||
|
- DATABASE_URL=postgresql://user:pass@postgres:5432/clawrity
|
||||||
|
- GROQ_API_KEY=${GROQ_API_KEY}
|
||||||
|
- SLACK_BOT_TOKEN=${SLACK_BOT_TOKEN}
|
||||||
|
- SLACK_APP_TOKEN=${SLACK_APP_TOKEN}
|
||||||
|
- SLACK_SIGNING_SECRET=${SLACK_SIGNING_SECRET}
|
||||||
|
- TAVILY_API_KEY=${TAVILY_API_KEY}
|
||||||
|
- ACME_SLACK_WEBHOOK=${ACME_SLACK_WEBHOOK}
|
||||||
|
depends_on:
|
||||||
|
postgres:
|
||||||
|
condition: service_healthy
|
||||||
|
volumes:
|
||||||
|
- ./data:/app/data
|
||||||
|
- ./logs:/app/logs
|
||||||
|
|
||||||
|
postgres:
|
||||||
|
image: ankane/pgvector
|
||||||
|
environment:
|
||||||
|
POSTGRES_DB: clawrity
|
||||||
|
POSTGRES_USER: user
|
||||||
|
POSTGRES_PASSWORD: pass
|
||||||
|
volumes:
|
||||||
|
- pg_data:/var/lib/postgresql/data
|
||||||
|
ports:
|
||||||
|
- "5432:5432"
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD-SHELL", "pg_isready -U user -d clawrity"]
|
||||||
|
interval: 5s
|
||||||
|
timeout: 5s
|
||||||
|
retries: 5
|
||||||
|
|
||||||
|
volumes:
|
||||||
|
pg_data:
|
||||||
@@ -0,0 +1,82 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — ETL Normaliser
|
||||||
|
|
||||||
|
Applies column mappings from client YAML, normalises data types,
|
||||||
|
cleans strings, and handles nulls.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from typing import Dict
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
def normalise_dataframe(
    df: pd.DataFrame,
    column_mapping: Dict[str, str],
    date_column: str = "date",
) -> pd.DataFrame:
    """Normalise a DataFrame using the client's column mapping.

    Renames columns (exact stripped match first, then case-insensitive),
    parses dates and drops rows whose date cannot be parsed, tidies
    categorical string columns, coerces metric columns to numeric, and
    removes duplicate rows.
    """
    df = df.copy()
    original_len = len(df)

    # Step 1: build source → canonical rename map (case-insensitive fallback).
    stripped_to_actual = {col.strip(): col for col in df.columns}
    rename_map = {}
    for source, target in column_mapping.items():
        if source in stripped_to_actual:
            rename_map[stripped_to_actual[source]] = target
            continue
        lowered = source.lower()
        matched = next(
            (actual for stripped, actual in stripped_to_actual.items()
             if stripped.lower() == lowered),
            None,
        )
        if matched is not None:
            rename_map[matched] = target
    if rename_map:
        df = df.rename(columns=rename_map)
        logger.info(f"Renamed columns: {rename_map}")

    # Step 2: parse dates; unparseable rows are discarded.
    if date_column in df.columns:
        df[date_column] = pd.to_datetime(df[date_column], errors="coerce")
        df = df.dropna(subset=[date_column])
        df[date_column] = df[date_column].dt.date

    # Step 3: tidy categorical string columns (title-case, null-out blanks).
    for col in ("country", "branch", "channel"):
        if col in df.columns:
            cleaned = df[col].astype(str).str.strip().str.title()
            df[col] = cleaned.replace({"Nan": None, "None": None, "": None})

    # Step 4: coerce metric columns to numeric (bad values become NaN).
    for col in ("spend", "revenue", "profit", "leads", "conversions"):
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    # Step 5: drop exact duplicate rows.
    df = df.drop_duplicates()
    dropped = original_len - len(df)
    if dropped > 0:
        logger.info(f"Removed {dropped} duplicate rows")

    logger.info(f"Normalisation complete: {len(df)} rows")
    return df
|
||||||
|
|
||||||
|
|
||||||
|
def remove_outliers(df: pd.DataFrame, columns: list, n_std: float = 3.0) -> pd.DataFrame:
    """Drop rows whose value in any listed column lies more than *n_std*
    standard deviations from that column's mean.

    Columns that are absent or non-numeric are skipped. Filtering is applied
    column by column, so later columns see statistics of the already-filtered
    frame. A zero standard deviation disables filtering for that column.
    """
    df = df.copy()
    before = len(df)

    for col in columns:
        if col not in df.columns:
            continue
        if not pd.api.types.is_numeric_dtype(df[col]):
            continue
        mean = df[col].mean()
        std = df[col].std()
        if std > 0:
            within_bounds = (df[col] - mean).abs() <= n_std * std
            df = df[within_bounds]

    removed = before - len(df)
    if removed > 0:
        logger.info(f"Removed {removed} outlier rows (>{n_std} std devs)")
    return df
|
||||||
@@ -0,0 +1,173 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — Prophet Forecasting Engine
|
||||||
|
|
||||||
|
Trains Prophet models on branch-level monthly revenue time series.
|
||||||
|
Forecasts 6 months ahead. Caches results in PostgreSQL forecasts table.
|
||||||
|
|
||||||
|
Limitations (be explicit):
|
||||||
|
- Predicts revenue TRENDS only
|
||||||
|
- Does NOT claim ROI-per-dollar forecasting (spend→revenue is approximate)
|
||||||
|
- Requires minimum 2 years of data per branch
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
from datetime import datetime
|
||||||
|
from typing import Dict, List, Optional
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
from skills.postgres_connector import get_connector
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
MIN_MONTHS = 24 # Minimum 2 years of data
|
||||||
|
FORECAST_MONTHS = 6
|
||||||
|
|
||||||
|
|
||||||
|
class ProphetEngine:
    """Time series forecasting using Facebook Prophet.

    Trains one model per (branch, country) pair on monthly revenue and
    caches a FORECAST_MONTHS-month forecast in the PostgreSQL ``forecasts``
    table. Forecasts revenue trends only — not per-dollar ROI.
    """

    def train_and_forecast(self, client_id: str) -> List[Dict]:
        """
        Train Prophet models for each branch and cache forecasts.

        Branches with fewer than MIN_MONTHS months of history are skipped,
        and a failure on one branch is logged without aborting the others.

        Args:
            client_id: Client to forecast for

        Returns:
            List of forecast result dicts (one per branch)
        """
        # Imported lazily: prophet is heavy and only needed when training.
        from prophet import Prophet

        db = get_connector()

        # Aggregate raw spend_data into a monthly revenue series per branch.
        sql = """
            SELECT branch, country,
                   DATE_TRUNC('month', date) AS month,
                   SUM(revenue) AS monthly_revenue
            FROM spend_data
            WHERE client_id = %s
            GROUP BY branch, country, DATE_TRUNC('month', date)
            ORDER BY branch, month
        """
        df = db.execute_query(sql, (client_id,))

        if df.empty:
            logger.warning(f"No data for forecasting: {client_id}")
            return []

        results = []
        branches = df.groupby(["branch", "country"])

        for (branch, country), group in branches:
            group = group.sort_values("month").reset_index(drop=True)

            # Enforce the minimum-history requirement (2 years of months).
            if len(group) < MIN_MONTHS:
                logger.info(
                    f"Skipping {branch} ({country}): only {len(group)} months "
                    f"(need {MIN_MONTHS})"
                )
                continue

            try:
                # Prepare Prophet format: ds (date), y (value)
                prophet_df = pd.DataFrame({
                    "ds": pd.to_datetime(group["month"]),
                    "y": group["monthly_revenue"].astype(float),
                })

                # Monthly data: only yearly seasonality is meaningful.
                model = Prophet(
                    yearly_seasonality=True,
                    weekly_seasonality=False,
                    daily_seasonality=False,
                )
                model.fit(prophet_df)

                # Extend FORECAST_MONTHS month-starts ("MS") past the history.
                future = model.make_future_dataframe(
                    periods=FORECAST_MONTHS, freq="MS"
                )
                forecast = model.predict(future)

                # predict() returns history + future; keep only the future rows.
                forecast_only = forecast.tail(FORECAST_MONTHS)

                # JSON-serialisable payload; yhat_lower/yhat_upper are
                # Prophet's uncertainty-interval bounds.
                forecast_data = {
                    "branch": branch,
                    "country": country,
                    "horizon_months": FORECAST_MONTHS,
                    "dates": forecast_only["ds"].dt.strftime("%Y-%m-%d").tolist(),
                    "forecast_revenue": forecast_only["yhat"].round(2).tolist(),
                    "lower_bound": forecast_only["yhat_lower"].round(2).tolist(),
                    "upper_bound": forecast_only["yhat_upper"].round(2).tolist(),
                    "computed_at": datetime.utcnow().isoformat(),
                }

                # Cache in PostgreSQL
                self._cache_forecast(client_id, forecast_data)
                results.append(forecast_data)

                logger.info(
                    f"Forecast generated for {branch} ({country}): "
                    f"{FORECAST_MONTHS} months ahead"
                )

            except Exception as e:
                logger.error(f"Prophet failed for {branch} ({country}): {e}")

        logger.info(f"Forecasting complete: {len(results)} branches forecast")
        return results

    def get_cached_forecast(
        self,
        client_id: str,
        branch: str,
    ) -> Optional[Dict]:
        """Get the most recent cached forecast for a branch.

        Returns None when nothing has been cached yet. forecast_data may
        arrive from the driver as JSON text or an already-parsed dict;
        both are handled.
        """
        db = get_connector()

        sql = """
            SELECT forecast_data, computed_at
            FROM forecasts
            WHERE client_id = %s AND branch = %s
            ORDER BY computed_at DESC
            LIMIT 1
        """
        rows = db.execute_raw(sql, (client_id, branch))

        if not rows:
            return None

        row = rows[0]
        data = row["forecast_data"]
        # Some drivers return JSONB as a dict, others as a string.
        if isinstance(data, str):
            data = json.loads(data)

        # Surface the row's timestamp (authoritative) over the payload's.
        data["computed_at"] = str(row["computed_at"])
        return data

    def _cache_forecast(self, client_id: str, forecast_data: Dict):
        """Store forecast in PostgreSQL, replacing any earlier forecast for
        the same (client, branch, country)."""
        db = get_connector()

        # Delete old forecast for this branch
        db.execute_write(
            "DELETE FROM forecasts WHERE client_id = %s AND branch = %s AND country = %s",
            (client_id, forecast_data["branch"], forecast_data["country"]),
        )

        # Insert new
        db.execute_write(
            """INSERT INTO forecasts (client_id, branch, country, horizon_months, forecast_data)
               VALUES (%s, %s, %s, %s, %s)""",
            (
                client_id,
                forecast_data["branch"],
                forecast_data["country"],
                forecast_data["horizon_months"],
                json.dumps(forecast_data),
            ),
        )
|
||||||
@@ -0,0 +1,18 @@
|
|||||||
|
# HEARTBEAT — ACME Corporation
|
||||||
|
|
||||||
|
## Schedule
|
||||||
|
- trigger: daily
|
||||||
|
- time: "08:00"
|
||||||
|
- timezone: "Asia/Kolkata"
|
||||||
|
|
||||||
|
## Digest Tasks
|
||||||
|
1. Pull last 7 days spend + revenue per branch
|
||||||
|
2. Identify bottom 3 performing branches by revenue
|
||||||
|
3. Generate newsletter-style summary via Gen Agent → QA Agent
|
||||||
|
4. Run Scout Agent for competitor + sector news
|
||||||
|
5. Append Market Intelligence section to digest
|
||||||
|
6. Push complete digest to Slack channel
|
||||||
|
|
||||||
|
## Retry
|
||||||
|
- on_failure: retry after 15 minutes
|
||||||
|
- max_retries: 3
|
||||||
@@ -0,0 +1,124 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — HEARTBEAT Loader
|
||||||
|
|
||||||
|
Parses HEARTBEAT.md files to extract schedule, digest tasks, and retry config.
|
||||||
|
HEARTBEAT.md drives autonomous daily digest generation per client.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import re
|
||||||
|
import logging
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Optional, Dict, Any
|
||||||
|
|
||||||
|
from config.client_loader import ClientConfig
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class HeartbeatConfig:
    """Parsed heartbeat configuration: schedule, retry policy, and tasks."""

    def __init__(self):
        # Defaults apply when HEARTBEAT.md is absent or fails to parse.
        self.trigger: str = "daily"
        self.time: str = "08:00"
        self.timezone: str = "UTC"
        self.retry_delay_minutes: int = 15
        self.max_retries: int = 3
        self.tasks: list = []
        self.raw_content: str = ""

    @property
    def hour(self) -> int:
        """Hour component of the scheduled "HH:MM" time string."""
        pieces = self.time.split(":")
        return int(pieces[0])

    @property
    def minute(self) -> int:
        """Minute component of the scheduled "HH:MM" time string."""
        pieces = self.time.split(":")
        return int(pieces[1])
|
||||||
|
|
||||||
|
|
||||||
|
def load_heartbeat(client_config: ClientConfig) -> HeartbeatConfig:
    """
    Load and parse the HEARTBEAT.md file for a client.

    Args:
        client_config: The client's configuration containing heartbeat_file path.

    Returns:
        Parsed HeartbeatConfig with schedule, tasks, and retry settings.
        Falls back to the client YAML digest schedule when the file is
        missing or cannot be parsed.
    """
    config = HeartbeatConfig()
    # The client YAML timezone is the baseline; HEARTBEAT.md may override it.
    config.timezone = client_config.timezone

    heartbeat_path = Path(client_config.heartbeat_file)
    if not heartbeat_path.exists():
        logger.warning(
            f"HEARTBEAT file not found at {heartbeat_path} for client "
            f"{client_config.client_id}. Using defaults from client YAML."
        )
        config.time = client_config.digest_schedule
        return config

    try:
        raw = heartbeat_path.read_text(encoding="utf-8")
        config.raw_content = raw
        _parse_heartbeat(raw, config)
        logger.info(
            f"Loaded HEARTBEAT for {client_config.client_id}: "
            f"{config.trigger} at {config.time} {config.timezone}"
        )
    except Exception as e:
        logger.error(f"Error parsing HEARTBEAT file {heartbeat_path}: {e}")
        config.time = client_config.digest_schedule

    return config
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_heartbeat(content: str, config: HeartbeatConfig) -> None:
|
||||||
|
"""Parse markdown content and extract structured config."""
|
||||||
|
lines = content.split("\n")
|
||||||
|
|
||||||
|
current_section = None
|
||||||
|
task_lines = []
|
||||||
|
|
||||||
|
for line in lines:
|
||||||
|
stripped = line.strip()
|
||||||
|
|
||||||
|
# Detect section headers
|
||||||
|
if stripped.startswith("## "):
|
||||||
|
current_section = stripped[3:].strip().lower()
|
||||||
|
continue
|
||||||
|
|
||||||
|
if current_section == "schedule":
|
||||||
|
# Parse key-value pairs like "- trigger: daily"
|
||||||
|
match = re.match(r"-\s*(\w+):\s*\"?([^\"]+)\"?", stripped)
|
||||||
|
if match:
|
||||||
|
key, value = match.group(1).strip(), match.group(2).strip()
|
||||||
|
if key == "trigger":
|
||||||
|
config.trigger = value
|
||||||
|
elif key == "time":
|
||||||
|
config.time = value
|
||||||
|
elif key == "timezone":
|
||||||
|
config.timezone = value
|
||||||
|
|
||||||
|
elif current_section == "digest tasks":
|
||||||
|
# Parse numbered list items
|
||||||
|
match = re.match(r"\d+\.\s+(.*)", stripped)
|
||||||
|
if match:
|
||||||
|
config.tasks.append(match.group(1).strip())
|
||||||
|
|
||||||
|
elif current_section == "retry":
|
||||||
|
# Parse retry config
|
||||||
|
match = re.match(r"-\s*(\w+):\s*(.+)", stripped)
|
||||||
|
if match:
|
||||||
|
key, value = match.group(1).strip(), match.group(2).strip()
|
||||||
|
if "retry" in key and "after" in value:
|
||||||
|
# Extract minutes from "retry after 15 minutes"
|
||||||
|
mins = re.search(r"(\d+)", value)
|
||||||
|
if mins:
|
||||||
|
config.retry_delay_minutes = int(mins.group(1))
|
||||||
|
elif key == "max_retries":
|
||||||
|
config.max_retries = int(value)
|
||||||
@@ -0,0 +1,295 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — HEARTBEAT Scheduler
|
||||||
|
|
||||||
|
APScheduler AsyncIOScheduler fires digest jobs per client at configured times.
|
||||||
|
Schedule: ETL at 02:00 → RAG re-index at 03:00 → Digest + Scout at configured time.
|
||||||
|
Retry: on failure, retry after N minutes, max retries from HEARTBEAT.md.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
from datetime import datetime
|
||||||
|
from typing import Dict, Optional
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
from apscheduler.schedulers.asyncio import AsyncIOScheduler
|
||||||
|
from apscheduler.triggers.cron import CronTrigger
|
||||||
|
|
||||||
|
from agents.orchestrator import Orchestrator
|
||||||
|
from channels.protocol_adapter import NormalisedMessage
|
||||||
|
from config.client_loader import ClientConfig
|
||||||
|
from config.settings import get_settings
|
||||||
|
from heartbeat.heartbeat_loader import load_heartbeat
|
||||||
|
from skills.postgres_connector import get_connector
|
||||||
|
from soul.soul_loader import load_soul
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
async def run_digest(
    client_config: ClientConfig,
    orchestrator: Orchestrator,
    retry_count: int = 0,
) -> Optional[str]:
    """
    Run the daily digest for a client.

    Steps:
        1. Query bottom 3 branches by revenue (last 7 days)
        2. Gen Agent → QA Agent pipeline for digest
        3. Scout Agent for competitor/sector news
        4. Push to Slack webhook
        5. Log success/failure to JSONL

    Args:
        client_config: Per-client settings (client_id, soul, Slack webhook, ...).
        orchestrator: Shared orchestrator; its optional retriever supplies RAG context.
        retry_count: Zero-based attempt counter, incremented on recursive retries.

    Returns:
        Full digest text if successful, None on failure
    """
    # NOTE(review): imported lazily rather than at module top — presumably to
    # avoid a circular import at load time; confirm before hoisting.
    from agents.gen_agent import GenAgent
    from agents.qa_agent import QAAgent

    client_id = client_config.client_id
    logger.info(f"[{client_id}] Running daily digest (attempt {retry_count + 1})")

    db = get_connector()

    try:
        # Step 1: Get bottom 3 branches by revenue with ROI.
        # NULLIF guards the ROI division when a branch has zero spend.
        bottom_sql = """
            SELECT branch, country,
                   SUM(revenue) as total_revenue,
                   SUM(spend) as total_spend,
                   SUM(leads) as total_leads,
                   ROUND((SUM(revenue)/NULLIF(SUM(spend),0))::numeric, 2) as roi
            FROM spend_data
            WHERE client_id = %s
              AND date >= CURRENT_DATE - INTERVAL '7 days'
            GROUP BY branch, country
            ORDER BY total_revenue ASC
            LIMIT 3
        """
        data = db.execute_query(bottom_sql, (client_id,))

        # Step 2: Generate digest via Gen Agent with specific prompt
        soul_content = load_soul(client_config)
        gen_agent = GenAgent()
        qa_agent = QAAgent()

        # Retrieve RAG chunks for digest context. Best-effort: a retrieval
        # failure degrades the digest rather than aborting it.
        rag_chunks = None
        if orchestrator.retriever:
            try:
                rag_chunks = orchestrator.retriever.retrieve(
                    query="weekly performance bottom performers budget recommendations",
                    client_id=client_id,
                )
            except Exception as e:
                logger.warning(f"RAG retrieval for digest failed: {e}")

        # Generate digest with explicit prompt
        digest = gen_agent.generate(
            question="Generate morning business digest. Highlight bottom 3 branches. Suggest where to focus budget. Newsletter style.",
            soul_content=soul_content,
            data_context=data,
            rag_chunks=rag_chunks,
        )

        # Step 2b: QA pass on digest (more lenient threshold for digest)
        qa_result = qa_agent.evaluate(
            response=digest,
            data_context=data,
            threshold=0.6,  # More lenient for digest
        )

        if not qa_result["passed"]:
            logger.warning(
                f"[{client_id}] Digest QA failed (score={qa_result['score']:.2f}), "
                f"retrying with strict instruction"
            )
            # Retry digest generation with strict instruction. Note: the
            # second attempt is used as-is — it is not re-scored by QA.
            digest = gen_agent.generate(
                question="Generate morning business digest. Highlight bottom 3 branches. Suggest where to focus budget. Newsletter style.",
                soul_content=soul_content,
                data_context=data,
                rag_chunks=rag_chunks,
                retry_issues=qa_result["issues"],
                retry_count=1,
                strict_data_instruction=(
                    "CRITICAL: Only mention branches and figures that appear in the "
                    "Data Context. Do not reference any other branches or historical data."
                ),
            )

        # Step 3: Scout Agent for competitor/sector news (optional section;
        # failures are logged and the digest ships without it)
        scout_section = None
        try:
            from agents.scout_agent import ScoutAgent
            scout = ScoutAgent()
            scout_section = await scout.gather_intelligence(client_config)
        except Exception as e:
            logger.warning(f"Scout Agent failed: {e}")

        # Step 4: Assemble full digest
        full_digest = f"📊 **Clawrity Daily Digest — {client_config.client_name}**\n"
        full_digest += f"*{datetime.now().strftime('%B %d, %Y')}*\n\n"
        full_digest += digest

        if scout_section:
            full_digest += f"\n\n---\n\n{scout_section}"

        # Step 5: Push to Slack webhook
        webhook_url = client_config.channels.get("slack_webhook", "")
        if webhook_url:
            await _push_to_slack(webhook_url, full_digest)
        else:
            logger.warning(f"[{client_id}] No Slack webhook configured")

        # Step 6: Log success to JSONL
        _log_digest_event(client_id, "success", {
            "qa_score": qa_result["score"],
            "qa_passed": qa_result["passed"],
            "scout_included": scout_section is not None,
            "digest_length": len(full_digest),
        })

        logger.info(f"[{client_id}] Digest completed successfully")
        return full_digest

    except Exception as e:
        logger.error(f"[{client_id}] Digest failed: {e}", exc_info=True)
        _log_digest_event(client_id, "failure", {"error": str(e), "attempt": retry_count + 1})

        # Retry policy comes from HEARTBEAT.md (delay + max attempts).
        heartbeat = load_heartbeat(client_config)

        if retry_count < heartbeat.max_retries:
            delay_minutes = heartbeat.retry_delay_minutes
            logger.info(
                f"[{client_id}] Scheduling digest retry in {delay_minutes} minutes "
                f"(attempt {retry_count + 2}/{heartbeat.max_retries + 1})"
            )
            # NOTE(review): this in-place sleep keeps the coroutine alive for
            # the whole delay instead of rescheduling a job — confirm that is
            # acceptable for the scheduler's job slot.
            await asyncio.sleep(delay_minutes * 60)
            return await run_digest(client_config, orchestrator, retry_count + 1)
        else:
            logger.error(f"[{client_id}] Digest failed after {heartbeat.max_retries + 1} attempts")
            # Post failure notification to Slack
            webhook_url = client_config.channels.get("slack_webhook", "")
            if webhook_url:
                await _push_to_slack(
                    webhook_url,
                    "Clawrity digest unavailable. Backend may be offline."
                )
            return None
|
||||||
|
|
||||||
|
|
||||||
|
async def _push_to_slack(webhook_url: str, message: str):
    """POST *message* to a Slack incoming webhook, logging the outcome."""
    try:
        async with httpx.AsyncClient() as http:
            resp = await http.post(
                webhook_url,
                json={"text": message},
                timeout=30,
            )
        if resp.status_code == 200:
            logger.info("Digest pushed to Slack successfully")
        else:
            logger.error(f"Slack webhook returned {resp.status_code}: {resp.text}")
    except Exception as e:
        logger.error(f"Failed to push digest to Slack: {e}")
|
||||||
|
|
||||||
|
|
||||||
|
def _log_digest_event(client_id: str, status: str, details: dict):
    """Append a digest lifecycle event to the client's JSONL monitoring log.

    Args:
        client_id: Client the event belongs to (also names the log file).
        status: "success" or "failure".
        details: Extra fields merged into the JSON entry (e.g. qa_score, error).

    Logging failures are swallowed (logged only) so monitoring can never
    break the digest pipeline itself.
    """
    # Local import: the module-level datetime import only brings in the class.
    from datetime import timezone

    settings = get_settings()
    logs_dir = settings.logs_dir
    os.makedirs(logs_dir, exist_ok=True)
    log_path = os.path.join(logs_dir, f"{client_id}_digest.jsonl")

    entry = {
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated and
        # produced a naive datetime.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "client_id": client_id,
        "event": "digest",
        "status": status,
        **details,
    }

    try:
        with open(log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")
    except Exception as e:
        logger.error(f"Failed to log digest event: {e}")
|
||||||
|
|
||||||
|
|
||||||
|
def start_scheduler(
    client_configs: Dict[str, ClientConfig],
    orchestrator: Orchestrator,
) -> AsyncIOScheduler:
    """
    Start the APScheduler with digest jobs for all clients.

    Schedule per client:
        - Digest at configured time (from HEARTBEAT.md)
        - ETL sync at 02:00 (placeholder)
        - RAG re-index at 03:00 (placeholder)
    """
    scheduler = AsyncIOScheduler()

    for client_id, cfg in client_configs.items():
        hb = load_heartbeat(cfg)
        tz = hb.timezone

        # One digest job per client, fired at the HEARTBEAT-configured time.
        scheduler.add_job(
            run_digest,
            CronTrigger(hour=hb.hour, minute=hb.minute, timezone=tz),
            args=[cfg, orchestrator],
            id=f"digest_{client_id}",
            name=f"Daily Digest — {cfg.client_name}",
            replace_existing=True,
        )
        logger.info(
            f"Scheduled digest for {client_id}: "
            f"{hb.time} {tz}"
        )

        # Nightly maintenance jobs (placeholders for now): ETL at 02:00,
        # RAG re-index at 03:00, both in the client's timezone.
        nightly_jobs = (
            (_etl_sync_placeholder, 2, f"etl_{client_id}", f"ETL Sync — {cfg.client_name}"),
            (_rag_reindex_placeholder, 3, f"rag_reindex_{client_id}", f"RAG Re-index — {cfg.client_name}"),
        )
        for job_fn, job_hour, job_id, job_label in nightly_jobs:
            scheduler.add_job(
                job_fn,
                CronTrigger(hour=job_hour, minute=0, timezone=tz),
                args=[client_id],
                id=job_id,
                name=job_label,
                replace_existing=True,
            )

    scheduler.start()
    return scheduler
|
||||||
|
|
||||||
|
|
||||||
|
async def _etl_sync_placeholder(client_id: str):
    """Placeholder for the nightly ETL data sync job.

    Currently only logs the trigger; the actual extract/load logic is TBD.
    """
    logger.info(f"[{client_id}] ETL sync triggered (placeholder)")
|
||||||
|
|
||||||
|
|
||||||
|
async def _rag_reindex_placeholder(client_id: str):
    """Placeholder for nightly RAG re-indexing.

    Delegates to the offline pipeline script when importable; any failure
    (import or run) is logged and swallowed so the scheduler keeps running.
    """
    logger.info(f"[{client_id}] RAG re-index triggered (placeholder)")
    try:
        from scripts.run_rag_pipeline import run_pipeline
    except Exception as e:
        logger.warning(f"RAG re-index failed: {e}")
        return
    try:
        run_pipeline(client_id)
    except Exception as e:
        logger.warning(f"RAG re-index failed: {e}")
|
||||||
@@ -0,0 +1,345 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — FastAPI Application
|
||||||
|
|
||||||
|
Main entry point. Initializes database, loads client configs,
|
||||||
|
starts Slack bot, and exposes REST endpoints.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import logging
|
||||||
|
from contextlib import asynccontextmanager
|
||||||
|
from typing import Dict, Optional
|
||||||
|
|
||||||
|
from fastapi import FastAPI, HTTPException
|
||||||
|
from fastapi.middleware.cors import CORSMiddleware
|
||||||
|
from pydantic import BaseModel
|
||||||
|
|
||||||
|
from agents.orchestrator import Orchestrator
|
||||||
|
from channels.protocol_adapter import ProtocolAdapter, NormalisedMessage
|
||||||
|
from channels.slack_handler import SlackHandler
|
||||||
|
from config.client_loader import ClientConfig, load_client_configs
|
||||||
|
from config.settings import get_settings
|
||||||
|
from skills.postgres_connector import get_connector
|
||||||
|
|
||||||
|
logging.basicConfig(
|
||||||
|
level=logging.INFO,
|
||||||
|
format="%(asctime)s [%(levelname)s] %(name)s — %(message)s",
|
||||||
|
)
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Global state
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
client_configs: Dict[str, ClientConfig] = {}
|
||||||
|
orchestrator: Optional[Orchestrator] = None
|
||||||
|
protocol_adapter: Optional[ProtocolAdapter] = None
|
||||||
|
slack_handler: Optional[SlackHandler] = None
|
||||||
|
scheduler = None # Set by heartbeat.scheduler
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Lifespan
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
@asynccontextmanager
async def lifespan(app: FastAPI):
    """Startup and shutdown logic.

    Startup order matters: DB schema → client configs → orchestrator →
    optional RAG retriever → protocol adapter → Slack bot → scheduler.
    Shutdown reverses it: stop Slack, stop scheduler, close DB.
    """
    global client_configs, orchestrator, protocol_adapter, slack_handler, scheduler

    logger.info("=== Clawrity starting up ===")

    # 1. Init database schema
    db = get_connector()
    db.init_schema()
    logger.info("Database schema ready")

    # 2. Load client configs
    client_configs = load_client_configs()
    logger.info(f"Loaded {len(client_configs)} client(s): {list(client_configs.keys())}")

    # 3. Init orchestrator
    orchestrator = Orchestrator()

    # 4. Try to attach RAG retriever — optional: failure is logged at INFO
    # and the app runs without RAG.
    try:
        from rag.retriever import Retriever
        retriever = Retriever()
        orchestrator.set_retriever(retriever)
        logger.info("RAG retriever attached to orchestrator")
    except Exception as e:
        logger.info(f"RAG retriever not available (Phase 2): {e}")

    # 5. Init protocol adapter
    protocol_adapter = ProtocolAdapter(client_configs)

    # 6. Start Slack bot
    slack_handler = SlackHandler(protocol_adapter, client_configs, orchestrator)
    slack_handler.start()

    # 7. Start scheduler — a failure here is logged and startup continues,
    # so the REST API stays available without scheduled digests.
    try:
        from heartbeat.scheduler import start_scheduler
        scheduler = start_scheduler(client_configs, orchestrator)
        logger.info("HEARTBEAT scheduler started")
    except Exception as e:
        logger.warning(f"Scheduler not started: {e}")

    logger.info("=== Clawrity ready ===")

    yield  # App runs here

    # Shutdown
    logger.info("=== Clawrity shutting down ===")
    if slack_handler:
        slack_handler.stop()
    if scheduler:
        scheduler.shutdown(wait=False)
    db.close()
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# FastAPI App
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# FastAPI application; startup/shutdown wiring lives in `lifespan` above.
app = FastAPI(
    title="Clawrity",
    description="Multi-channel AI business intelligence agent",
    version="1.0.0",
    lifespan=lifespan,
)

# NOTE(review): wide-open CORS (all origins + credentials + all methods) —
# acceptable for a prototype, but tighten allow_origins before production.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Request/Response Models
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
class ChatRequest(BaseModel):
    """Inbound /chat payload: which client is asking, and the message text."""

    client_id: str
    message: str
|
||||||
|
|
||||||
|
|
||||||
|
class ChatResponse(BaseModel):
    """Outbound /chat payload: answer text plus QA and telemetry fields."""

    response: str
    qa_score: float
    qa_passed: bool
    retries: int
    sql: Optional[str] = None
    data_rows: int = 0
    rag_chunks_used: int = 0
    elapsed_seconds: float = 0.0
|
||||||
|
|
||||||
|
|
||||||
|
class CompareRequest(BaseModel):
    """Inbound /compare payload: client and message to run with/without RAG."""

    client_id: str
    message: str
|
||||||
|
|
||||||
|
|
||||||
|
class CompareResponse(BaseModel):
    """Outbound /compare payload: the same query answered without and with RAG."""

    without_rag: ChatResponse
    with_rag: ChatResponse
|
||||||
|
|
||||||
|
|
||||||
|
class ScoutRequest(BaseModel):
    """Inbound /scout payload: client plus a free-text search query."""

    client_id: str
    query: str
|
||||||
|
|
||||||
|
|
||||||
|
class ClientRequest(BaseModel):
    """Minimal payload for endpoints that only need a client identifier."""

    client_id: str
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Endpoints
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@app.post("/chat", response_model=ChatResponse)
|
||||||
|
async def chat(request: ChatRequest):
|
||||||
|
"""Send a message and get an AI response."""
|
||||||
|
if request.client_id not in client_configs:
|
||||||
|
raise HTTPException(status_code=404, detail=f"Client not found: {request.client_id}")
|
||||||
|
|
||||||
|
config = client_configs[request.client_id]
|
||||||
|
message = protocol_adapter.normalise_api(request.client_id, request.message)
|
||||||
|
|
||||||
|
result = await orchestrator.process(message, config)
|
||||||
|
return ChatResponse(**result)
|
||||||
|
|
||||||
|
|
||||||
|
@app.post("/compare", response_model=CompareResponse)
|
||||||
|
async def compare(request: CompareRequest):
|
||||||
|
"""Side-by-side comparison: with RAG vs without RAG."""
|
||||||
|
if request.client_id not in client_configs:
|
||||||
|
raise HTTPException(status_code=404, detail=f"Client not found: {request.client_id}")
|
||||||
|
|
||||||
|
config = client_configs[request.client_id]
|
||||||
|
message = protocol_adapter.normalise_api(request.client_id, request.message)
|
||||||
|
|
||||||
|
# Without RAG
|
||||||
|
saved_retriever = orchestrator.retriever
|
||||||
|
orchestrator.retriever = None
|
||||||
|
result_no_rag = await orchestrator.process(message, config)
|
||||||
|
orchestrator.retriever = saved_retriever
|
||||||
|
|
||||||
|
# With RAG
|
||||||
|
result_with_rag = await orchestrator.process(message, config)
|
||||||
|
|
||||||
|
return CompareResponse(
|
||||||
|
without_rag=ChatResponse(**result_no_rag),
|
||||||
|
with_rag=ChatResponse(**result_with_rag),
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@app.post("/scout")
|
||||||
|
async def scout(request: ScoutRequest):
|
||||||
|
"""Run a targeted scout search for competitor/market intelligence."""
|
||||||
|
if request.client_id not in client_configs:
|
||||||
|
raise HTTPException(status_code=404, detail=f"Client not found: {request.client_id}")
|
||||||
|
|
||||||
|
config = client_configs[request.client_id]
|
||||||
|
|
||||||
|
try:
|
||||||
|
from agents.scout_agent import ScoutAgent
|
||||||
|
scout_agent = ScoutAgent()
|
||||||
|
result = await scout_agent.search_query(config, request.query)
|
||||||
|
|
||||||
|
if result is None:
|
||||||
|
return {"response": "No relevant competitor or market news found for this query.", "has_results": False}
|
||||||
|
|
||||||
|
return {"response": result, "has_results": True}
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Scout endpoint failed: {e}")
|
||||||
|
raise HTTPException(status_code=500, detail=str(e))
|
||||||
|
|
||||||
|
|
||||||
|
@app.post("/scout/digest")
|
||||||
|
async def scout_digest(request: ClientRequest):
|
||||||
|
"""Run full scout agent digest for a client."""
|
||||||
|
if request.client_id not in client_configs:
|
||||||
|
raise HTTPException(status_code=404, detail=f"Client not found: {request.client_id}")
|
||||||
|
|
||||||
|
config = client_configs[request.client_id]
|
||||||
|
|
||||||
|
try:
|
||||||
|
from agents.scout_agent import ScoutAgent
|
||||||
|
scout_agent = ScoutAgent()
|
||||||
|
result = await scout_agent.gather_intelligence(config)
|
||||||
|
|
||||||
|
if result is None:
|
||||||
|
return {"response": "No relevant market intelligence found.", "has_results": False}
|
||||||
|
|
||||||
|
return {"response": result, "has_results": True}
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Scout digest failed: {e}")
|
||||||
|
raise HTTPException(status_code=500, detail=str(e))
|
||||||
|
|
||||||
|
|
||||||
|
@app.post("/digest")
|
||||||
|
async def trigger_digest(request: ClientRequest):
|
||||||
|
"""Manually trigger the daily digest pipeline (same as scheduled job)."""
|
||||||
|
if request.client_id not in client_configs:
|
||||||
|
raise HTTPException(status_code=404, detail=f"Client not found: {request.client_id}")
|
||||||
|
|
||||||
|
config = client_configs[request.client_id]
|
||||||
|
|
||||||
|
try:
|
||||||
|
from heartbeat.scheduler import run_digest
|
||||||
|
digest_text = await run_digest(config, orchestrator)
|
||||||
|
|
||||||
|
if digest_text is None:
|
||||||
|
raise HTTPException(status_code=500, detail="Digest generation failed after all retries")
|
||||||
|
|
||||||
|
return {"response": digest_text, "status": "success"}
|
||||||
|
except HTTPException:
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Manual digest trigger failed: {e}")
|
||||||
|
raise HTTPException(status_code=500, detail=str(e))
|
||||||
|
|
||||||
|
|
||||||
|
@app.get("/admin/stats/{client_id}")
|
||||||
|
async def admin_stats(client_id: str):
|
||||||
|
"""RAG monitoring stats for a client."""
|
||||||
|
if client_id not in client_configs:
|
||||||
|
raise HTTPException(status_code=404, detail=f"Client not found: {client_id}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
from rag.monitoring import get_stats
|
||||||
|
return get_stats(client_id)
|
||||||
|
except Exception as e:
|
||||||
|
return {"error": str(e), "message": "Monitoring not yet configured"}
|
||||||
|
|
||||||
|
|
||||||
|
@app.post("/forecast/run/{client_id}")
|
||||||
|
async def run_forecast(client_id: str):
|
||||||
|
"""Trigger Prophet forecasting for a client."""
|
||||||
|
if client_id not in client_configs:
|
||||||
|
raise HTTPException(status_code=404, detail=f"Client not found: {client_id}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
from forecasting.prophet_engine import ProphetEngine
|
||||||
|
engine = ProphetEngine()
|
||||||
|
results = engine.train_and_forecast(client_id)
|
||||||
|
return {"status": "success", "branches_forecast": len(results)}
|
||||||
|
except Exception as e:
|
||||||
|
raise HTTPException(status_code=500, detail=str(e))
|
||||||
|
|
||||||
|
|
||||||
|
@app.get("/forecast/{client_id}/{branch}")
|
||||||
|
async def get_forecast(client_id: str, branch: str):
|
||||||
|
"""Get cached forecast for a branch."""
|
||||||
|
if client_id not in client_configs:
|
||||||
|
raise HTTPException(status_code=404, detail=f"Client not found: {client_id}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
from forecasting.prophet_engine import ProphetEngine
|
||||||
|
engine = ProphetEngine()
|
||||||
|
forecast = engine.get_cached_forecast(client_id, branch)
|
||||||
|
if not forecast:
|
||||||
|
raise HTTPException(status_code=404, detail=f"No forecast found for {branch}")
|
||||||
|
return forecast
|
||||||
|
except HTTPException:
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
raise HTTPException(status_code=500, detail=str(e))
|
||||||
|
|
||||||
|
|
||||||
|
@app.get("/health")
|
||||||
|
async def health():
|
||||||
|
"""System health check."""
|
||||||
|
db = get_connector()
|
||||||
|
db_connected = False
|
||||||
|
try:
|
||||||
|
db.execute_raw("SELECT 1")
|
||||||
|
db_connected = True
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
scheduled_jobs = []
|
||||||
|
if scheduler and hasattr(scheduler, 'get_jobs'):
|
||||||
|
try:
|
||||||
|
scheduled_jobs = [
|
||||||
|
{"id": job.id, "name": job.name, "next_run": str(job.next_run_time)}
|
||||||
|
for job in scheduler.get_jobs()
|
||||||
|
]
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
return {
|
||||||
|
"status": "healthy" if db_connected else "degraded",
|
||||||
|
"database": "connected" if db_connected else "disconnected",
|
||||||
|
"clients": list(client_configs.keys()),
|
||||||
|
"scheduler_running": scheduler is not None and scheduler.running if scheduler else False,
|
||||||
|
"scheduled_jobs": scheduled_jobs,
|
||||||
|
"slack_active": slack_handler is not None and slack_handler._thread is not None,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@app.post("/slack/events")
|
||||||
|
async def slack_events():
|
||||||
|
"""Slack webhook endpoint (HTTP mode fallback). Socket Mode is primary."""
|
||||||
|
return {"message": "Slack events are handled via Socket Mode. This endpoint is a fallback."}
|
||||||
+287
@@ -0,0 +1,287 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — RAG Chunker
|
||||||
|
|
||||||
|
Aggregation-based semantic chunking — NOT fixed-size, NOT sliding window.
|
||||||
|
Source is structured tabular data. We aggregate rows into business-meaningful
|
||||||
|
units and write natural language narratives.
|
||||||
|
|
||||||
|
Three chunk types:
|
||||||
|
1. branch_weekly — GROUP BY branch, country, week
|
||||||
|
2. channel_monthly — GROUP BY channel, country, month
|
||||||
|
3. trend_qoq — GROUP BY branch, country, quarter (QoQ delta COMPUTED)
|
||||||
|
|
||||||
|
Plus Faker-generated narrative summaries reflecting real patterns.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import hashlib
|
||||||
|
import logging
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from typing import Dict, List, Optional
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
|
from faker import Faker
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
fake = Faker()
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
class Chunk:
    """A single RAG chunk: a narrative text plus identifying metadata."""
    id: str          # deterministic hash-derived identifier
    client_id: str   # owning client
    chunk_type: str  # aggregation kind (e.g. branch_weekly)
    text: str        # natural-language narrative handed to the embedder
    metadata: Dict   # structured fields for filtering (branch, week, roi, ...)

    def to_dict(self) -> Dict:
        """Serialise to a plain dict (e.g. for JSON storage)."""
        keys = ("id", "client_id", "chunk_type", "text", "metadata")
        return {k: getattr(self, k) for k in keys}
|
||||||
|
|
||||||
|
|
||||||
|
def generate_chunks(df: pd.DataFrame, client_id: str) -> List[Chunk]:
    """Generate all chunk types from preprocessed data."""
    frame = df.copy()
    frame["date"] = pd.to_datetime(frame["date"])

    # Run every chunk builder in order and concatenate their output.
    builders = (_branch_weekly, _channel_monthly, _trend_qoq, _faker_narratives)
    chunks: List[Chunk] = []
    for build in builders:
        chunks.extend(build(frame, client_id))

    logger.info(f"Generated {len(chunks)} total chunks for {client_id}")
    return chunks
|
||||||
|
|
||||||
|
|
||||||
|
def _chunk_id(client_id: str, chunk_type: str, *parts) -> str:
|
||||||
|
"""Generate a deterministic chunk ID."""
|
||||||
|
raw = f"{client_id}:{chunk_type}:" + ":".join(str(p) for p in parts)
|
||||||
|
return hashlib.md5(raw.encode()).hexdigest()[:16]
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Chunk Type 1: Branch Weekly
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _branch_weekly(df: pd.DataFrame, client_id: str) -> List[Chunk]:
    """GROUP BY branch, country, week. One chunk per branch per week."""
    out: List[Chunk] = []
    frame = df.copy()
    # ISO week number; month/year kept for human-readable chunk text.
    frame["week"] = frame["date"].dt.isocalendar().week.astype(int)
    frame["month"] = frame["date"].dt.month_name()
    frame["year"] = frame["date"].dt.year

    weekly = frame.groupby(["branch", "country", "year", "week", "month"]).agg(
        spend=("spend", "sum"),
        revenue=("revenue", "sum"),
        leads=("leads", "sum"),
        conversions=("conversions", "sum"),
    ).reset_index()

    for _, rec in weekly.iterrows():
        week_spend = rec["spend"]
        week_revenue = rec["revenue"]
        # Guard divide-by-zero for weeks with no spend / no leads.
        roi = round(week_revenue / week_spend, 2) if week_spend > 0 else 0
        conv_rate = (
            round(rec["conversions"] / rec["leads"] * 100, 1) if rec["leads"] > 0 else 0
        )

        summary = (
            f"{rec['branch']} ({rec['country']}) in week {rec['week']} of "
            f"{rec['month']} {rec['year']}: spent ${week_spend:,.0f}, earned "
            f"${week_revenue:,.0f}, ROI {roi}x, {rec['leads']} leads, "
            f"{conv_rate}% conversion rate."
        )

        out.append(Chunk(
            id=_chunk_id(client_id, "branch_weekly", rec["branch"], rec["year"], rec["week"]),
            client_id=client_id,
            chunk_type="branch_weekly",
            text=summary,
            metadata={
                "branch": rec["branch"],
                "country": rec["country"],
                "week": int(rec["week"]),
                "month": rec["month"],
                "year": int(rec["year"]),
                "roi": roi,
            },
        ))

    logger.info(f"Generated {len(out)} branch_weekly chunks")
    return out
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Chunk Type 2: Channel Monthly
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _channel_monthly(df: pd.DataFrame, client_id: str) -> List[Chunk]:
    """GROUP BY channel, country, month, quarter."""
    results: List[Chunk] = []
    frame = df.copy()
    frame["month"] = frame["date"].dt.month_name()
    frame["quarter"] = "Q" + frame["date"].dt.quarter.astype(str)
    frame["year"] = frame["date"].dt.year

    monthly = frame.groupby(["channel", "country", "year", "month", "quarter"]).agg(
        spend=("spend", "sum"),
        revenue=("revenue", "sum"),
        leads=("leads", "sum"),
        conversions=("conversions", "sum"),
    ).reset_index()

    for _, rec in monthly.iterrows():
        month_spend = rec["spend"]
        month_revenue = rec["revenue"]
        # Guard divide-by-zero for channels with no spend that month.
        roi = round(month_revenue / month_spend, 2) if month_spend > 0 else 0

        summary = (
            f"{rec['channel']} in {rec['country']} during {rec['month']} "
            f"({rec['quarter']}) {rec['year']}: ${month_spend:,.0f} spent, "
            f"${month_revenue:,.0f} revenue, ROI {roi}x."
        )

        results.append(Chunk(
            id=_chunk_id(client_id, "channel_monthly", rec["channel"], rec["country"], rec["year"], rec["month"]),
            client_id=client_id,
            chunk_type="channel_monthly",
            text=summary,
            metadata={
                "channel": rec["channel"],
                "country": rec["country"],
                "month": rec["month"],
                "quarter": rec["quarter"],
                "year": int(rec["year"]),
                "roi": roi,
            },
        ))

    logger.info(f"Generated {len(results)} channel_monthly chunks")
    return results
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Chunk Type 3: QoQ Trend (Most Important)
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _trend_qoq(df: pd.DataFrame, client_id: str) -> List[Chunk]:
    """GROUP BY branch, country, quarter. Compute quarter-over-quarter delta.

    Fix: a zero delta — including the fallback used when the previous
    quarter had no revenue — was previously worded as "declined 0%";
    it is now reported as "was flat".
    """
    chunks = []
    df = df.copy()
    df["quarter"] = df["date"].dt.to_period("Q").astype(str)

    grouped = df.groupby(["branch", "country", "quarter"]).agg(
        spend=("spend", "sum"),
        revenue=("revenue", "sum"),
    ).reset_index()

    # Sort so consecutive rows within a (branch, country) are adjacent quarters.
    grouped = grouped.sort_values(["branch", "country", "quarter"])

    for (branch, country), group in grouped.groupby(["branch", "country"]):
        group = group.sort_values("quarter").reset_index(drop=True)

        # Compare each quarter to the one immediately before it.
        for i in range(1, len(group)):
            prev = group.iloc[i - 1]
            curr = group.iloc[i]

            prev_rev = prev["revenue"]
            curr_rev = curr["revenue"]

            # Percentage change; undefined when the base quarter had no revenue.
            if prev_rev > 0:
                delta = round((curr_rev - prev_rev) / prev_rev * 100, 1)
            else:
                delta = 0

            if delta > 0:
                movement = f"grew {abs(delta)}%"
            elif delta < 0:
                movement = f"declined {abs(delta)}%"
            else:
                movement = "was flat"

            text = (
                f"{branch} ({country}) revenue {movement} "
                f"in {curr['quarter']} vs {prev['quarter']}. "
                f"Total spend: ${curr['spend']:,.0f}, revenue: ${curr_rev:,.0f}."
            )

            chunks.append(Chunk(
                id=_chunk_id(client_id, "trend_qoq", branch, country, curr["quarter"]),
                client_id=client_id,
                chunk_type="trend_qoq",
                text=text,
                metadata={
                    "branch": branch,
                    "country": country,
                    "quarter": curr["quarter"],
                    "prev_quarter": prev["quarter"],
                    "delta_pct": delta,
                },
            ))

    logger.info(f"Generated {len(chunks)} trend_qoq chunks")
    return chunks
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Faker Narrative Chunks
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _faker_narratives(df: pd.DataFrame, client_id: str) -> List[Chunk]:
    """Generate plausible narrative chunks reflecting real data patterns."""
    narratives: List[Chunk] = []
    frame = df.copy()
    frame["quarter"] = frame["date"].dt.to_period("Q").astype(str)

    # Quarterly rollup per branch/country feeds the narrative templates.
    quarterly = frame.groupby(["branch", "country", "quarter"]).agg(
        revenue=("revenue", "sum"),
        spend=("spend", "sum"),
        leads=("leads", "sum"),
    ).reset_index()

    templates = [
        "{branch} branch demonstrated strong {quarter} performance driven by {channel} efficiency, outperforming regional averages.",
        "In {quarter}, {branch} ({country}) showed {trend} momentum with revenue reaching ${revenue:,.0f}, primarily through {channel} campaigns.",
        "{branch} branch in {country} maintained steady growth in {quarter}, with lead generation up and conversion rates holding above {conv_rate:.1f}%.",
        "Cost efficiency at {branch} ({country}) improved in {quarter}, with spend-to-revenue ratio tightening to {ratio:.2f}x.",
    ]

    # Fallback channel list when the data has no channel column values.
    channels = df["channel"].dropna().unique().tolist() or ["Paid Search", "Social Media", "Email"]

    for _, rec in quarterly.iterrows():
        roi = rec["revenue"] / rec["spend"] if rec["spend"] > 0 else 0
        # NOTE: RNG draws are kept in this exact order so seeded runs reproduce.
        conv_rate = np.random.uniform(5, 20)
        trend = "positive" if roi > 1.5 else "moderate" if roi > 1 else "challenging"
        channel = np.random.choice(channels)
        template = np.random.choice(templates)

        narratives.append(Chunk(
            id=_chunk_id(client_id, "narrative", rec["branch"], rec["country"], rec["quarter"]),
            client_id=client_id,
            chunk_type="narrative",
            text=template.format(
                branch=rec["branch"],
                country=rec["country"],
                quarter=rec["quarter"],
                channel=channel,
                revenue=rec["revenue"],
                trend=trend,
                conv_rate=conv_rate,
                ratio=1 / roi if roi > 0 else 0,
            ),
            metadata={
                "branch": rec["branch"],
                "country": rec["country"],
                "quarter": rec["quarter"],
                "source": "generated_narrative",
            },
        ))

    logger.info(f"Generated {len(narratives)} narrative chunks")
    return narratives
|
||||||
@@ -0,0 +1,123 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — RAG Evaluator
|
||||||
|
|
||||||
|
Lightweight Groq-based evaluation (no OpenAI, no full RAGAs).
|
||||||
|
Four metrics: faithfulness, answer_relevancy, context_precision, context_recall.
|
||||||
|
Single Groq call with structured JSON output.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
from dataclasses import dataclass
|
||||||
|
from typing import Dict, List, Optional
|
||||||
|
|
||||||
|
from groq import Groq
|
||||||
|
|
||||||
|
from config.settings import get_settings
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
EVAL_PROMPT = """Evaluate this RAG-augmented response on four criteria.
|
||||||
|
|
||||||
|
## User Query
|
||||||
|
{query}
|
||||||
|
|
||||||
|
## Retrieved Context Chunks
|
||||||
|
{chunks}
|
||||||
|
|
||||||
|
## Generated Response
|
||||||
|
{response}
|
||||||
|
|
||||||
|
## Evaluation Criteria (score each 0.0 to 1.0)
|
||||||
|
|
||||||
|
1. **Faithfulness**: Does the response ONLY contain information from the retrieved chunks? No hallucination?
|
||||||
|
2. **Answer Relevancy**: Does the response directly address the user's question?
|
||||||
|
3. **Context Precision**: Were the retrieved chunks actually relevant to the question?
|
||||||
|
4. **Context Recall**: Did the retrieval capture enough context to answer the question fully?
|
||||||
|
|
||||||
|
Return ONLY a JSON object:
|
||||||
|
{{
|
||||||
|
"faithfulness": <float>,
|
||||||
|
"answer_relevancy": <float>,
|
||||||
|
"context_precision": <float>,
|
||||||
|
"context_recall": <float>,
|
||||||
|
"overall": <float (average of all four)>,
|
||||||
|
"notes": "<brief explanation>"
|
||||||
|
}}"""
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
class EvalResult:
    """Scores (0.0–1.0) for one evaluated RAG interaction."""
    faithfulness: float = 0.0
    answer_relevancy: float = 0.0
    context_precision: float = 0.0
    context_recall: float = 0.0
    overall: float = 0.0
    notes: str = ""


class RAGEvaluator:
    """Evaluates RAG pipeline quality using Groq LLM."""

    def __init__(self):
        cfg = get_settings()
        self.client = Groq(api_key=cfg.groq_api_key)
        self.model = cfg.llm_model

    def evaluate(
        self,
        query: str,
        chunks: List[Dict],
        response: str,
    ) -> EvalResult:
        """Evaluate a RAG response.

        Returns a zeroed EvalResult (with an explanatory note) if the
        Groq call itself fails.
        """
        if chunks:
            numbered = []
            for idx, chunk in enumerate(chunks, start=1):
                numbered.append(
                    f"{idx}. {chunk.get('text', '')} (similarity: {chunk.get('similarity', 0):.2f})"
                )
            chunks_text = "\n".join(numbered)
        else:
            chunks_text = "No chunks retrieved."

        prompt = EVAL_PROMPT.format(
            query=query,
            chunks=chunks_text,
            response=response,
        )

        try:
            completion = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "You are a RAG evaluation expert. Return only valid JSON."},
                    {"role": "user", "content": prompt},
                ],
                temperature=0.1,
                max_tokens=512,
            )
            return self._parse(completion.choices[0].message.content.strip())
        except Exception as e:
            logger.error(f"RAG evaluation failed: {e}")
            return EvalResult(notes=f"Evaluation error: {str(e)}")

    def _parse(self, raw: str) -> EvalResult:
        """Parse the model's JSON reply, tolerating ``` code fences."""
        try:
            cleaned = raw.strip()
            if cleaned.startswith("```"):
                # Drop the opening fence line (e.g. ```json).
                cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else cleaned[3:]
            if cleaned.endswith("```"):
                cleaned = cleaned[:-3]

            data = json.loads(cleaned.strip())
            score_keys = (
                "faithfulness", "answer_relevancy",
                "context_precision", "context_recall", "overall",
            )
            scores = {key: float(data.get(key, 0)) for key in score_keys}
            return EvalResult(notes=data.get("notes", ""), **scores)
        except Exception as e:
            logger.warning(f"Could not parse evaluation: {e}")
            return EvalResult(notes="Parse error")
|
||||||
@@ -0,0 +1,105 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — RAG Monitoring
|
||||||
|
|
||||||
|
Logs every interaction to JSONL and provides aggregated stats.
|
||||||
|
Exposes data for /admin/stats/{client_id} endpoint.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
from datetime import datetime
|
||||||
|
from typing import Dict, Optional
|
||||||
|
|
||||||
|
from config.settings import get_settings
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
def _log_path(client_id: str) -> str:
    """Get the JSONL log file path for a client, creating the log dir if needed."""
    directory = get_settings().logs_dir
    os.makedirs(directory, exist_ok=True)
    filename = f"{client_id}_interactions.jsonl"
    return os.path.join(directory, filename)
|
||||||
|
|
||||||
|
|
||||||
|
def log_interaction(
    client_id: str,
    query: str,
    num_chunks: int,
    chunk_types_used: list,
    qa_score: float,
    qa_passed: bool,
    retries: int,
    response_length: int,
    elapsed_seconds: float = 0.0,
):
    """Append one interaction record to the client's JSONL log.

    Failures are logged and swallowed: monitoring must never break the
    request path.
    """
    record = {"timestamp": datetime.utcnow().isoformat()}
    record.update(
        client_id=client_id,
        query=query,
        num_chunks=num_chunks,
        chunk_types_used=chunk_types_used,
        qa_score=qa_score,
        qa_passed=qa_passed,
        retries=retries,
        response_length=response_length,
        elapsed_seconds=elapsed_seconds,
    )

    try:
        with open(_log_path(client_id), "a") as fh:
            fh.write(json.dumps(record) + "\n")
    except Exception as e:
        logger.error(f"Failed to log interaction: {e}")
|
||||||
|
|
||||||
|
|
||||||
|
def get_stats(client_id: str) -> Dict:
    """
    Get aggregated monitoring stats for a client.

    Returns:
        Dict with: total_queries, pass_rate, avg_qa_score, avg_retries,
        queries_needing_retry
    """
    path = _log_path(client_id)
    if not os.path.exists(path):
        # No interactions logged yet — return an all-zero summary.
        return {
            "client_id": client_id,
            "total_queries": 0,
            "pass_rate": 0.0,
            "avg_qa_score": 0.0,
            "avg_retries": 0.0,
            "queries_needing_retry": 0,
        }

    try:
        with open(path, "r") as fh:
            entries = [json.loads(line) for line in fh if line.strip()]
    except Exception as e:
        logger.error(f"Error reading log file: {e}")
        return {"error": str(e)}

    if not entries:
        return {"client_id": client_id, "total_queries": 0}

    total = len(entries)
    passed = sum(1 for entry in entries if entry.get("qa_passed", False))
    score_sum = sum(entry.get("qa_score", 0) for entry in entries)
    retry_counts = [entry.get("retries", 0) for entry in entries]
    needed_retry = sum(1 for r in retry_counts if r > 0)

    return {
        "client_id": client_id,
        "total_queries": total,
        "pass_rate": round(passed / total * 100, 1) if total > 0 else 0,
        "avg_qa_score": round(score_sum / total, 3) if total > 0 else 0,
        "avg_retries": round(sum(retry_counts) / total, 2) if total > 0 else 0,
        "queries_needing_retry": needed_retry,
    }
|
||||||
@@ -0,0 +1,72 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — RAG Preprocessor
|
||||||
|
|
||||||
|
Fetches data from PostgreSQL, cleans it for RAG chunking:
|
||||||
|
- Removes nulls, outliers > 3 std devs, duplicates
|
||||||
|
- Normalises string columns
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
from etl.normaliser import remove_outliers
|
||||||
|
from skills.postgres_connector import get_connector
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
def preprocess_for_rag(
    client_id: str,
    days: int = 365,
) -> pd.DataFrame:
    """
    Fetch and preprocess data for RAG chunking.

    Removes the dead, unused `sql` string the original defined alongside
    `safe_sql` — only the interval-interpolated statement was ever executed.

    Args:
        client_id: Client to fetch data for
        days: Number of days of data to fetch (default 365)

    Returns:
        Clean DataFrame ready for chunking
    """
    db = get_connector()

    # Postgres INTERVAL can't take a bound parameter directly; `days` is
    # cast to int before interpolation, so this stays injection-safe.
    safe_sql = f"""
        SELECT date, country, branch, channel, spend, revenue, leads, conversions
        FROM spend_data
        WHERE client_id = %s AND date >= CURRENT_DATE - INTERVAL '{int(days)} days'
        ORDER BY date
    """
    df = db.execute_query(safe_sql, (client_id,))
    logger.info(f"Fetched {len(df)} rows for RAG preprocessing")

    if df.empty:
        logger.warning(f"No data found for client {client_id}")
        return df

    # Rows missing any critical field can't be chunked meaningfully.
    critical_cols = ["date", "branch", "country", "revenue"]
    df = df.dropna(subset=[c for c in critical_cols if c in df.columns])

    # Remove outliers on numeric columns.
    df = remove_outliers(df, ["spend", "revenue", "leads", "conversions"])

    # Normalise string columns so grouping keys are consistent.
    for col in ["country", "branch", "channel"]:
        if col in df.columns:
            df[col] = df[col].astype(str).str.strip().str.title()

    # Remove duplicates
    df = df.drop_duplicates()

    logger.info(f"Preprocessed: {len(df)} rows ready for chunking")
    return df
|
||||||
@@ -0,0 +1,95 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — RAG Retriever
|
||||||
|
|
||||||
|
Detects query intent → selects chunk_type → searches pgvector.
|
||||||
|
Intent detection based on keywords:
|
||||||
|
- "should/recommend/allocate/shift" → trend_qoq
|
||||||
|
- "channel/paid/email/social" → channel_monthly
|
||||||
|
- everything else → branch_weekly
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import re
|
||||||
|
from typing import List, Dict, Optional
|
||||||
|
|
||||||
|
from rag.vector_store import search
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
# Intent → chunk_type mapping based on keywords
|
||||||
|
# Intent → chunk_type mapping based on keywords
INTENT_PATTERNS = {
    "trend_qoq": [
        "should", "recommend", "allocate", "shift", "increase", "decrease",
        "budget", "realloc", "invest", "optimize", "growth", "trend",
        "quarter", "qoq", "forecast", "predict",
    ],
    "channel_monthly": [
        "channel", "paid", "email", "social", "search", "display",
        "organic", "referral", "campaign", "marketing", "roi",
        "spend", "advertising",
    ],
}


class Retriever:
    """RAG retriever with intent-based chunk type filtering."""

    def retrieve(
        self,
        query: str,
        client_id: str,
        top_k: int = 5,
        chunk_type_override: Optional[str] = None,
    ) -> List[Dict]:
        """
        Retrieve relevant chunks based on query intent.

        Args:
            query: User's natural language query
            client_id: Client to search within
            top_k: Number of chunks to retrieve
            chunk_type_override: Force a specific chunk type

        Returns:
            List of dicts with text, metadata, similarity
        """
        chunk_type = chunk_type_override or self._detect_intent(query)
        logger.info(f"Detected intent → chunk_type: {chunk_type}")

        hits = search(
            query=query,
            client_id=client_id,
            chunk_type=chunk_type,
            top_k=top_k,
        )

        if not hits:
            # The detected type may simply have no indexed chunks yet;
            # retry once without the type filter.
            logger.info(f"No results for {chunk_type}, falling back to all types")
            hits = search(
                query=query,
                client_id=client_id,
                chunk_type=None,
                top_k=top_k,
            )

        return hits

    def _detect_intent(self, query: str) -> str:
        """Detect query intent from keyword hits; default to branch_weekly."""
        lowered = query.lower()

        best_type = "branch_weekly"
        best_score = 0
        for chunk_type, keywords in INTENT_PATTERNS.items():
            hits = sum(kw in lowered for kw in keywords)
            # Strict > keeps the first-listed type on ties, matching dict order.
            if hits > best_score:
                best_type, best_score = chunk_type, hits

        return best_type
|
||||||
@@ -0,0 +1,135 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — RAG Vector Store
|
||||||
|
|
||||||
|
Embeds chunks using sentence-transformers all-MiniLM-L6-v2 (CPU, 384 dims).
|
||||||
|
Stores and searches via pgvector in PostgreSQL.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from typing import List, Optional
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
from rag.chunker import Chunk
|
||||||
|
from skills.postgres_connector import get_connector
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
_model = None
|
||||||
|
|
||||||
|
|
||||||
|
def _get_embedding_model():
    """Lazy-load the embedding model (CPU only, ~90MB)."""
    global _model
    if _model is not None:
        return _model
    # Import here so the heavy dependency loads only on first use.
    from sentence_transformers import SentenceTransformer
    _model = SentenceTransformer("all-MiniLM-L6-v2")
    logger.info("Loaded embedding model: all-MiniLM-L6-v2 (384 dims)")
    return _model
|
||||||
|
|
||||||
|
|
||||||
|
def embed_texts(texts: List[str], batch_size: int = 100) -> np.ndarray:
    """
    Embed a list of texts using MiniLM.

    Args:
        texts: List of text strings to embed
        batch_size: Batch size for encoding (default 100)

    Returns:
        numpy array of shape (len(texts), 384)
    """
    encoder = _get_embedding_model()
    vectors = encoder.encode(
        texts,
        batch_size=batch_size,
        # A progress bar only pays off on larger jobs.
        show_progress_bar=len(texts) > 100,
        normalize_embeddings=True,
    )
    logger.info(f"Embedded {len(texts)} texts → shape {vectors.shape}")
    return vectors
|
||||||
|
|
||||||
|
|
||||||
|
def embed_query(query: str) -> np.ndarray:
    """Embed a single query string (normalized, same model as chunks)."""
    encoder = _get_embedding_model()
    return encoder.encode(query, normalize_embeddings=True)
|
||||||
|
|
||||||
|
|
||||||
|
def store_chunks(chunks: List[Chunk], embeddings: np.ndarray):
    """
    Upsert chunks + embeddings into pgvector.
    Uses ON CONFLICT DO UPDATE for safe nightly re-indexing.
    """
    # De-duplicate by chunk ID, keeping the first occurrence of each.
    deduped = {}
    for chunk, emb in zip(chunks, embeddings):
        if chunk.id not in deduped:
            deduped[chunk.id] = (chunk, emb)

    db = get_connector()

    rows = [
        {
            "id": chunk.id,
            "client_id": chunk.client_id,
            "chunk_type": chunk.chunk_type,
            "text": chunk.text,
            "metadata": chunk.metadata,
            "embedding": emb.tolist(),
        }
        for chunk, emb in deduped.values()
    ]

    # Upsert in batches to bound statement size.
    batch_size = 100
    for start in range(0, len(rows), batch_size):
        db.upsert_embeddings(rows[start:start + batch_size])

    logger.info(f"Stored {len(rows)} chunks in pgvector")

    # Best-effort: IVFFlat index creation fails until enough rows exist.
    try:
        db.create_vector_index()
    except Exception:
        pass
|
||||||
|
|
||||||
|
|
||||||
|
def search(
    query: str,
    client_id: str,
    chunk_type: Optional[str] = None,
    top_k: int = 5,
) -> List[dict]:
    """
    Search pgvector for similar chunks.

    Args:
        query: Natural language query
        client_id: Client to search within
        chunk_type: Optional filter (branch_weekly, channel_monthly, trend_qoq)
        top_k: Number of results

    Returns:
        List of dicts with text, metadata, similarity
    """
    vector = embed_query(query)
    db = get_connector()

    matches = db.search_embeddings(
        query_embedding=vector,
        client_id=client_id,
        chunk_type=chunk_type,
        top_k=top_k,
    )

    logger.info(
        f"Vector search: query='{query[:50]}...', "
        f"chunk_type={chunk_type}, results={len(matches)}"
    )
    return matches
|
||||||
@@ -0,0 +1,42 @@
|
|||||||
|
# === Core Framework ===
|
||||||
|
fastapi>=0.115.0
|
||||||
|
uvicorn[standard]>=0.30.0
|
||||||
|
python-dotenv>=1.0.0
|
||||||
|
|
||||||
|
# === LLM ===
|
||||||
|
groq>=0.11.0
|
||||||
|
|
||||||
|
# === Embeddings (CPU only — all-MiniLM-L6-v2, 384 dims, ~90MB) ===
|
||||||
|
sentence-transformers>=3.0.0
|
||||||
|
|
||||||
|
# === Database — PostgreSQL + pgvector ===
|
||||||
|
psycopg2-binary>=2.9.9
|
||||||
|
pgvector>=0.3.0
|
||||||
|
asyncpg>=0.29.0
|
||||||
|
|
||||||
|
# === Channel — Slack (Socket Mode) ===
|
||||||
|
slack-bolt>=1.20.0
|
||||||
|
|
||||||
|
# === Scheduler ===
|
||||||
|
apscheduler>=3.10.0
|
||||||
|
|
||||||
|
# === Web Search (Scout Agent) ===
|
||||||
|
tavily-python>=0.5.0
|
||||||
|
duckduckgo-search>=6.0.0
|
||||||
|
|
||||||
|
# === Forecasting ===
|
||||||
|
prophet>=1.1.5
|
||||||
|
|
||||||
|
# === Data Processing ===
|
||||||
|
pandas>=2.2.0
|
||||||
|
numpy>=1.26.0
|
||||||
|
openpyxl>=3.1.0
|
||||||
|
faker>=28.0.0
|
||||||
|
|
||||||
|
# === Config ===
|
||||||
|
pydantic>=2.9.0
|
||||||
|
pydantic-settings>=2.5.0
|
||||||
|
pyyaml>=6.0.2
|
||||||
|
|
||||||
|
# === HTTP Client ===
|
||||||
|
httpx>=0.27.0
|
||||||
@@ -0,0 +1,67 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — RAG Pipeline Script
|
||||||
|
|
||||||
|
CLI to run the full RAG pipeline: preprocess → chunk → embed → store in pgvector.
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python scripts/run_rag_pipeline.py --client_id acme_corp
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import logging
|
||||||
|
import sys
|
||||||
|
import os
|
||||||
|
|
||||||
|
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||||
|
|
||||||
|
from rag.preprocessor import preprocess_for_rag
|
||||||
|
from rag.chunker import generate_chunks
|
||||||
|
from rag.vector_store import embed_texts, store_chunks
|
||||||
|
|
||||||
|
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
def run_pipeline(client_id: str, days: int = 365):
    """Run the full RAG pipeline for a client: preprocess → chunk → embed → store."""
    logger.info(f"=== RAG Pipeline: {client_id} ===")

    logger.info("Step 1/4: Preprocessing data...")
    df = preprocess_for_rag(client_id, days=days)
    if df.empty:
        logger.error("No data to process. Run seed_demo_data.py first.")
        return

    logger.info("Step 2/4: Generating chunks...")
    chunks = generate_chunks(df, client_id)
    logger.info(f"Generated {len(chunks)} chunks")
    if not chunks:
        logger.error("No chunks generated.")
        return

    logger.info("Step 3/4: Embedding chunks (CPU, batch_size=100)...")
    embeddings = embed_texts([c.text for c in chunks], batch_size=100)

    logger.info("Step 4/4: Upserting into pgvector...")
    store_chunks(chunks, embeddings)

    logger.info(f"=== RAG Pipeline complete: {len(chunks)} chunks indexed ===")
|
||||||
|
|
||||||
|
|
||||||
|
def main():
    """Parse CLI arguments and run the pipeline."""
    parser = argparse.ArgumentParser(description="Run RAG pipeline")
    parser.add_argument("--client_id", required=True, help="Client ID")
    parser.add_argument("--days", type=int, default=365, help="Days of data to process")
    ns = parser.parse_args()
    run_pipeline(ns.client_id, ns.days)


if __name__ == "__main__":
    main()
|
||||||
@@ -0,0 +1,214 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — Demo Data Seeder
|
||||||
|
|
||||||
|
Merges Global Superstore + Marketing Campaign datasets with Faker gap-filling.
|
||||||
|
Inserts into PostgreSQL spend_data table.
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python scripts/seed_demo_data.py --client_id acme_corp \
|
||||||
|
--superstore data/raw/Global_Superstore2.csv \
|
||||||
|
--marketing data/raw/marketing_campaign_dataset.csv
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import logging
|
||||||
|
import random
|
||||||
|
import sys
|
||||||
|
import os
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
|
from faker import Faker
|
||||||
|
|
||||||
|
# Add project root to path
|
||||||
|
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||||
|
|
||||||
|
from connectors.csv_connector import CSVConnector
|
||||||
|
from etl.normaliser import normalise_dataframe
|
||||||
|
from skills.postgres_connector import PostgresConnector
|
||||||
|
|
||||||
|
# Root logger configuration for the seeding script.
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)

# Deterministic randomness: seed Faker and both RNGs so repeated runs
# produce identical synthetic data.
fake = Faker()
Faker.seed(42)
random.seed(42)
np.random.seed(42)

# Marketing channels to assign
# (fallback channel names used when the marketing dataset provides none).
CHANNELS = ["Paid Search", "Social Media", "Email", "Display", "Organic", "Referral"]

# Column mapping for Global Superstore
# Maps raw Global Superstore headers to the normalised spend_data schema.
SUPERSTORE_MAPPING = {
    "Order Date": "date",
    "Country": "country",
    "City": "branch",
    "Sales": "revenue",
    "Profit": "profit",
}
|
||||||
|
|
||||||
|
|
||||||
|
def load_superstore(path: str) -> pd.DataFrame:
    """Load and normalize the Global Superstore dataset."""
    loader = CSVConnector()
    frame = loader.load(path)
    logger.info(f"Superstore columns: {list(frame.columns)}")

    # Rename raw headers to the normalised schema.
    frame = normalise_dataframe(frame, SUPERSTORE_MAPPING)

    # Restrict to the schema columns that actually survived the mapping.
    wanted = ["date", "country", "branch", "revenue", "profit"]
    frame = frame[[c for c in wanted if c in frame.columns]].copy()

    logger.info(f"Superstore: {len(frame)} rows after normalisation")
    return frame
|
||||||
|
|
||||||
|
|
||||||
|
def load_marketing(path: str) -> pd.DataFrame:
    """Load the Marketing Campaign Performance dataset."""
    loader = CSVConnector()
    frame = loader.load(path)
    logger.info(f"Marketing columns: {list(frame.columns)}")

    # Substring rules for standardising column names; first match wins,
    # in the same precedence order as before.
    rules = [
        (("channel",), "channel"),
        (("spend", "budget"), "spend"),
        (("click",), "leads"),
        (("conversion",), "conversions"),
        (("roi",), "roi_raw"),
        (("impression",), "impressions"),
    ]
    renames = {}
    for original in frame.columns:
        lowered = original.lower().strip()
        for needles, target in rules:
            if any(n in lowered for n in needles):
                renames[original] = target
                break

    frame = frame.rename(columns=renames)
    logger.info(f"Marketing: {len(frame)} rows, mapped columns: {list(frame.columns)}")
    return frame
|
||||||
|
|
||||||
|
|
||||||
|
def merge_datasets(superstore: pd.DataFrame, marketing: pd.DataFrame) -> pd.DataFrame:
    """
    Merge superstore (base) with marketing channel metrics.

    Each superstore row receives a channel plus synthetic
    spend/leads/conversions derived from the marketing data's averages.
    """
    merged = superstore.copy()

    # Channel universe: real channels from the marketing data when present,
    # otherwise the static CHANNELS fallback.
    channel_list = CHANNELS
    if "channel" in marketing.columns:
        observed = marketing["channel"].dropna().unique().tolist()
        if observed:
            channel_list = observed

    # Deterministic round-robin assignment keyed on row position.
    merged["channel"] = [channel_list[i % len(channel_list)] for i in range(len(merged))]

    # Per-channel averages from the marketing data, used to scale the
    # synthetic noise below.
    channel_stats = {}
    if "spend" in marketing.columns and "channel" in marketing.columns:
        for ch in channel_list:
            subset = marketing[marketing["channel"] == ch]
            channel_stats[ch] = {
                "avg_spend": subset["spend"].mean() if "spend" in subset.columns and len(subset) > 0 else 500,
                "avg_leads": subset["leads"].mean() if "leads" in subset.columns and len(subset) > 0 else 50,
                "avg_conv": subset["conversions"].mean() if "conversions" in subset.columns and len(subset) > 0 else 5,
            }

    # Fill spend/leads/conversions row by row with seeded random variation.
    spend_col, lead_col, conv_col = [], [], []
    for _, record in merged.iterrows():
        stats = channel_stats.get(record["channel"], {"avg_spend": 500, "avg_leads": 50, "avg_conv": 5})

        rev = record.get("revenue", 1000)
        # Spend is a noisy fraction of revenue; the Gaussian noise scales
        # with the channel's average spend.
        spend = max(10, rev * random.uniform(0.3, 0.6) + random.gauss(0, stats["avg_spend"] * 0.1))
        leads = max(1, int(spend / random.uniform(15, 40)))
        conversions = max(0, int(leads * random.uniform(0.05, 0.20)))

        spend_col.append(round(spend, 2))
        lead_col.append(leads)
        conv_col.append(conversions)

    merged["spend"] = spend_col
    merged["leads"] = lead_col
    merged["conversions"] = conv_col

    # The spend_data schema has no profit column.
    if "profit" in merged.columns:
        merged = merged.drop(columns=["profit"])

    logger.info(f"Merged dataset: {len(merged)} rows, columns: {list(merged.columns)}")
    return merged
|
||||||
|
|
||||||
|
|
||||||
|
def seed_to_postgres(df: pd.DataFrame, client_id: str):
    """Insert merged data into PostgreSQL spend_data table."""
    connector = PostgresConnector()
    connector.init_schema()

    # Re-seeding is destructive per client: wipe this client's rows first.
    connector.execute_write(
        "DELETE FROM spend_data WHERE client_id = %s", (client_id,)
    )
    logger.info(f"Cleared existing data for client: {client_id}")

    # Tag every row with the owning client.
    df["client_id"] = client_id

    # Batch insert via execute_values (VALUES %s placeholder).
    insert_sql = """
        INSERT INTO spend_data (date, country, branch, channel, spend, revenue, leads, conversions, client_id)
        VALUES %s
    """
    rows = [
        (
            row["date"], row["country"], row["branch"], row["channel"],
            row["spend"], row["revenue"], row["leads"], row["conversions"],
            row["client_id"]
        )
        for _, row in df.iterrows()
    ]

    connector.execute_batch(insert_sql, rows, page_size=2000)

    total = connector.get_table_count("spend_data", client_id)
    logger.info(f"Seeded {total} rows into spend_data for client: {client_id}")

    # Keep a CSV copy of exactly what was inserted.
    os.makedirs("data/processed", exist_ok=True)
    output_path = f"data/processed/{client_id}_merged.csv"
    df.to_csv(output_path, index=False)
    logger.info(f"Saved processed data to {output_path}")

    connector.close()
|
||||||
|
|
||||||
|
|
||||||
|
def main():
    """CLI entry point for the demo-data seeder."""
    parser = argparse.ArgumentParser(description="Seed demo data into PostgreSQL")
    parser.add_argument("--client_id", default="acme_corp", help="Client ID")
    parser.add_argument("--superstore", required=True, help="Path to Global Superstore CSV/XLSX")
    parser.add_argument("--marketing", required=True, help="Path to Marketing Campaign CSV")
    args = parser.parse_args()

    logger.info(f"=== Seeding data for client: {args.client_id} ===")

    # Load both sources, merge, then persist.
    base = load_superstore(args.superstore)
    campaigns = load_marketing(args.marketing)
    seed_to_postgres(merge_datasets(base, campaigns), args.client_id)

    logger.info("=== Seeding complete ===")


if __name__ == "__main__":
    main()
|
||||||
@@ -0,0 +1,140 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — NL-to-SQL Engine
|
||||||
|
|
||||||
|
Converts natural language questions into valid PostgreSQL SELECT queries.
|
||||||
|
Uses LLM at temperature 0.1 for deterministic SQL generation.
|
||||||
|
Safety: Only SELECT queries allowed. INSERT/UPDATE/DELETE/DROP rejected.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import re
|
||||||
|
import logging
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
from config.llm_client import get_llm_client, get_model_name
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
# Dangerous SQL patterns — reject anything that isn't a SELECT
|
||||||
|
# Dangerous SQL patterns — reject anything that isn't a SELECT.
# Word-boundary, case-insensitive match so e.g. "DROPPED" is not flagged.
UNSAFE_PATTERNS = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|CREATE|GRANT|REVOKE|EXEC)\b",
    re.IGNORECASE
)

# System prompt template for the LLM. The placeholders ({countries},
# {branches}, {channels}, {date_min}, {date_max}, {client_id}, {n}) are
# filled by NLToSQL.generate_sql via str.format.
SYSTEM_PROMPT = """You are a PostgreSQL SQL generator. Generate ONLY a valid SELECT query.
Return ONLY the raw SQL — no markdown, no explanation, no code fences.

Table: spend_data
Columns:
- id: SERIAL PRIMARY KEY
- date: DATE
- country: VARCHAR(100)
- branch: VARCHAR(100)
- channel: VARCHAR(100)
- spend: FLOAT
- revenue: FLOAT
- leads: INT
- conversions: INT
- client_id: VARCHAR(100)

Available countries: {countries}
Available branches (sample): {branches}
Available channels: {channels}
Date range: {date_min} to {date_max}

RULES:
1. ALWAYS include WHERE client_id = '{client_id}' in your queries
2. Use standard PostgreSQL syntax
3. For date ranges, use DATE type comparisons
4. For "last N days", use: date >= CURRENT_DATE - INTERVAL '{n} days'
5. For "last month", use: date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
6. Return meaningful aggregations with GROUP BY when appropriate
7. Use aliases for computed columns (e.g., SUM(revenue) AS total_revenue)
8. LIMIT results to 50 rows maximum unless the user asks for all
9. For "bottom N" use ASC ordering, for "top N" use DESC ordering
"""
|
||||||
|
|
||||||
|
|
||||||
|
class NLToSQL:
    """Natural language to SQL converter using LLM."""

    def __init__(self):
        self.client = get_llm_client()
        self.model = get_model_name()

    def generate_sql(
        self,
        question: str,
        client_id: str,
        schema_metadata: dict,
    ) -> Optional[str]:
        """
        Convert a natural language question to a PostgreSQL SELECT query.

        Args:
            question: User's natural language question
            client_id: Client ID for filtering
            schema_metadata: Dict with countries, branches, channels, date_min, date_max

        Returns:
            Valid SQL SELECT string, or None on failure
        """
        # Interpolate live schema context into the prompt template.
        prompt_context = {
            "countries": ", ".join(schema_metadata.get("countries", [])[:20]),
            "branches": ", ".join(schema_metadata.get("branches", [])[:20]),
            "channels": ", ".join(schema_metadata.get("channels", [])),
            "date_min": schema_metadata.get("date_min", "unknown"),
            "date_max": schema_metadata.get("date_max", "unknown"),
            "client_id": client_id,
            "n": "7",  # Default for interval template
        }
        system = SYSTEM_PROMPT.format(**prompt_context)

        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": system},
                    {"role": "user", "content": question},
                ],
                temperature=0.1,
                max_tokens=1024,
            )

            candidate = self._clean_sql(response.choices[0].message.content.strip())

            if self._validate_sql(candidate):
                logger.info(f"Generated SQL: {candidate}")
                return candidate

            logger.warning(f"Generated SQL failed validation: {candidate}")
            return None

        except Exception as e:
            logger.error(f"NL-to-SQL generation failed: {e}")
            return None

    def _clean_sql(self, raw: str) -> str:
        """Extract SQL from LLM response, stripping markdown code fences."""
        # Drop any ``` / ```sql fences anywhere in the response.
        without_fences = re.sub(r"```(?:sql)?\s*", "", raw)
        without_fences = re.sub(r"```\s*$", "", without_fences)
        # Normalise to exactly one trailing semicolon.
        return without_fences.strip().rstrip(";") + ";"

    def _validate_sql(self, sql: str) -> bool:
        """Validate that the SQL is a safe SELECT query."""
        if not sql or len(sql) < 10:
            return False

        # Only read queries are allowed.
        if not sql.strip().upper().startswith("SELECT"):
            logger.warning("SQL does not start with SELECT")
            return False

        # Reject any write/DDL keywords anywhere in the statement.
        if UNSAFE_PATTERNS.search(sql):
            logger.warning("SQL contains unsafe operations")
            return False

        return True
|
||||||
@@ -0,0 +1,384 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — PostgreSQL + pgvector Connector
|
||||||
|
|
||||||
|
Connection pool management, schema initialization, and query execution.
|
||||||
|
Single database handles both structured queries (NL-to-SQL) and vector search (pgvector).
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import time
|
||||||
|
from typing import Any, Dict, List, Optional, Tuple
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
|
import psycopg2
|
||||||
|
import psycopg2.extras
|
||||||
|
from pgvector.psycopg2 import register_vector
|
||||||
|
|
||||||
|
from config.settings import get_settings
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
# Schema DDL
# ---------------------------------------------------------------------------

# Idempotent DDL (IF NOT EXISTS throughout) — safe to run on every startup.
# embedding vector(384) matches the sentence-transformer dimensionality used
# elsewhere in the pipeline — TODO confirm against the embedding model.
INIT_SCHEMA_SQL = """
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Structured business data (replaces BigQuery)
CREATE TABLE IF NOT EXISTS spend_data (
    id SERIAL PRIMARY KEY,
    date DATE,
    country VARCHAR(100),
    branch VARCHAR(100),
    channel VARCHAR(100),
    spend FLOAT,
    revenue FLOAT,
    leads INT,
    conversions INT,
    client_id VARCHAR(100)
);

-- Vector embeddings (replaces ChromaDB)
CREATE TABLE IF NOT EXISTS embeddings (
    id VARCHAR(200) PRIMARY KEY,
    client_id VARCHAR(100),
    chunk_type VARCHAR(50),
    text TEXT,
    metadata JSONB,
    embedding vector(384)
);

-- Forecast cache
CREATE TABLE IF NOT EXISTS forecasts (
    id SERIAL PRIMARY KEY,
    client_id VARCHAR(100),
    branch VARCHAR(100),
    country VARCHAR(100),
    horizon_months INT,
    forecast_data JSONB,
    computed_at TIMESTAMP DEFAULT NOW()
);

-- Indexes
CREATE INDEX IF NOT EXISTS idx_spend_data_client
    ON spend_data (client_id);
CREATE INDEX IF NOT EXISTS idx_spend_data_date
    ON spend_data (client_id, date);
CREATE INDEX IF NOT EXISTS idx_embeddings_client_type
    ON embeddings (client_id, chunk_type);
CREATE INDEX IF NOT EXISTS idx_forecasts_client
    ON forecasts (client_id, branch, country);
"""

# IVFFlat index requires rows to exist — created separately after data load
IVFFLAT_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS idx_embeddings_cosine
ON embeddings USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
"""
|
||||||
|
|
||||||
|
|
||||||
|
class PostgresConnector:
    """PostgreSQL + pgvector connection manager.

    Holds one lazily-created psycopg2 connection per instance, with helpers
    for schema setup, structured queries, batch writes, and cosine-similarity
    search against the ``embeddings`` table.
    """

    def __init__(self, database_url: Optional[str] = None):
        # Fall back to the configured DATABASE_URL when none is given.
        self.database_url = database_url or get_settings().database_url
        self._conn: Optional[psycopg2.extensions.connection] = None

    def _get_connection(self) -> psycopg2.extensions.connection:
        """Get or create a database connection with retry logic.

        Retries up to 3 times with exponential backoff (1s, then 2s).
        Raises:
            ConnectionError: chained to the last psycopg2 error when every
                attempt fails.
        """
        if self._conn is None or self._conn.closed:
            max_retries = 3
            last_error: Optional[Exception] = None
            for attempt in range(max_retries):
                try:
                    self._conn = psycopg2.connect(self.database_url)
                    # Teach psycopg2 to adapt numpy arrays <-> pgvector values.
                    register_vector(self._conn)
                    logger.info("Connected to PostgreSQL with pgvector support")
                    return self._conn
                except psycopg2.OperationalError as e:
                    last_error = e
                    logger.warning(
                        f"DB connection attempt {attempt + 1}/{max_retries} failed: {e}."
                    )
                    # Back off only when another attempt remains (previously
                    # the code also slept after the final failure).
                    if attempt < max_retries - 1:
                        wait = 2**attempt
                        logger.warning(f"Retrying in {wait}s...")
                        time.sleep(wait)
            # Preserve the root cause for callers/debuggers.
            raise ConnectionError(
                f"Failed to connect to PostgreSQL after {max_retries} attempts"
            ) from last_error
        return self._conn

    def close(self):
        """Close the database connection (safe to call repeatedly)."""
        if self._conn and not self._conn.closed:
            self._conn.close()
            logger.info("PostgreSQL connection closed")

    def init_schema(self):
        """Create tables, indexes, and the pgvector extension if missing.

        Idempotent: the DDL uses IF NOT EXISTS throughout. Raises on failure
        after rolling back.
        """
        conn = self._get_connection()
        try:
            with conn.cursor() as cur:
                cur.execute(INIT_SCHEMA_SQL)
                conn.commit()
            logger.info("Database schema initialized successfully")
        except Exception as e:
            conn.rollback()
            logger.error(f"Schema initialization failed: {e}")
            raise

    def create_vector_index(self):
        """Create IVFFlat index — call AFTER data has been loaded into embeddings.

        IVFFlat derives its lists from existing rows, so this is deliberately
        best-effort: failures are logged, not raised.
        """
        conn = self._get_connection()
        try:
            with conn.cursor() as cur:
                cur.execute(IVFFLAT_INDEX_SQL)
                conn.commit()
            logger.info("IVFFlat vector index created")
        except Exception as e:
            conn.rollback()
            logger.warning(f"Could not create IVFFlat index (may need more rows): {e}")

    # ------------------------------------------------------------------
    # Query execution
    # ------------------------------------------------------------------

    def execute_query(self, sql: str, params: Optional[tuple] = None) -> pd.DataFrame:
        """
        Execute a SELECT query and return results as a DataFrame.

        Args:
            sql: SQL query string (must be SELECT only)
            params: Query parameters for parameterised queries

        Returns:
            pandas DataFrame with query results
        """
        conn = self._get_connection()
        try:
            df = pd.read_sql_query(sql, conn, params=params)
            # End the implicit read transaction so the connection does not
            # sit "idle in transaction".
            conn.rollback()
            logger.debug(f"Query returned {len(df)} rows")
            return df
        except Exception as e:
            logger.error(f"Query execution failed: {e}")
            conn.rollback()
            raise

    def execute_raw(self, sql: str, params: Optional[tuple] = None) -> List[Dict]:
        """Execute a query and return raw dictionaries.

        Statements that return rows are read-only from this method's point of
        view (rolled back); statements with no result set are committed, so
        this can also run DDL/DML.
        """
        conn = self._get_connection()
        try:
            with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
                cur.execute(sql, params)
                if cur.description:
                    results = [dict(row) for row in cur.fetchall()]
                    conn.rollback()
                    return results
            conn.commit()
            return []
        except Exception as e:
            conn.rollback()
            logger.error(f"Raw query execution failed: {e}")
            raise

    def execute_write(self, sql: str, params: Optional[tuple] = None):
        """Execute an INSERT/UPDATE/DELETE statement and commit it."""
        conn = self._get_connection()
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
            conn.commit()
        except Exception as e:
            conn.rollback()
            logger.error(f"Write execution failed: {e}")
            raise

    def execute_batch(self, sql: str, data: List[tuple], page_size: int = 1000):
        """Execute a batch INSERT using execute_values for performance.

        Args:
            sql: INSERT statement containing a single ``VALUES %s`` placeholder.
            data: Rows as tuples matching the column list in *sql*.
            page_size: Rows per execute_values page.
        """
        conn = self._get_connection()
        try:
            with conn.cursor() as cur:
                psycopg2.extras.execute_values(cur, sql, data, page_size=page_size)
            conn.commit()
            logger.info(f"Batch insert: {len(data)} rows")
        except Exception as e:
            conn.rollback()
            logger.error(f"Batch execution failed: {e}")
            raise

    # ------------------------------------------------------------------
    # pgvector operations
    # ------------------------------------------------------------------

    def upsert_embeddings(self, embeddings_data: List[Dict[str, Any]]):
        """
        Upsert embedding records into the embeddings table.

        Existing ids get their text/metadata/embedding overwritten.

        Args:
            embeddings_data: List of dicts with keys:
                id, client_id, chunk_type, text, metadata, embedding
        """
        conn = self._get_connection()
        sql = """
            INSERT INTO embeddings (id, client_id, chunk_type, text, metadata, embedding)
            VALUES %s
            ON CONFLICT (id) DO UPDATE SET
                text = EXCLUDED.text,
                metadata = EXCLUDED.metadata,
                embedding = EXCLUDED.embedding
        """
        data = [
            (
                d["id"],
                d["client_id"],
                d["chunk_type"],
                d["text"],
                psycopg2.extras.Json(d["metadata"]),  # dict -> JSONB
                np.array(d["embedding"]),  # list -> vector via register_vector
            )
            for d in embeddings_data
        ]
        try:
            with conn.cursor() as cur:
                psycopg2.extras.execute_values(cur, sql, data, page_size=100)
            conn.commit()
            logger.info(f"Upserted {len(data)} embeddings")
        except Exception as e:
            conn.rollback()
            logger.error(f"Embedding upsert failed: {e}")
            raise

    def search_embeddings(
        self,
        query_embedding: np.ndarray,
        client_id: str,
        chunk_type: Optional[str] = None,
        top_k: int = 5,
    ) -> List[Dict]:
        """
        Search for similar embeddings using pgvector cosine similarity.

        Args:
            query_embedding: Query vector (384 dims)
            client_id: Filter by client
            chunk_type: Optional filter by chunk type
            top_k: Number of results to return

        Returns:
            List of dicts with text, metadata, and similarity score
        """
        conn = self._get_connection()
        query_vec = np.array(query_embedding)

        # <=> is pgvector's cosine distance; 1 - distance = similarity.
        if chunk_type:
            sql = """
                SELECT text, metadata, 1 - (embedding <=> %s) AS similarity
                FROM embeddings
                WHERE client_id = %s AND chunk_type = %s
                ORDER BY embedding <=> %s
                LIMIT %s
            """
            params = (query_vec, client_id, chunk_type, query_vec, top_k)
        else:
            sql = """
                SELECT text, metadata, 1 - (embedding <=> %s) AS similarity
                FROM embeddings
                WHERE client_id = %s
                ORDER BY embedding <=> %s
                LIMIT %s
            """
            params = (query_vec, client_id, query_vec, top_k)

        try:
            with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
                cur.execute(sql, params)
                results = [dict(row) for row in cur.fetchall()]
                logger.debug(f"Vector search returned {len(results)} results")
                return results
        except Exception as e:
            logger.error(f"Vector search failed: {e}")
            raise

    # ------------------------------------------------------------------
    # Utility
    # ------------------------------------------------------------------

    def get_table_count(self, table: str, client_id: Optional[str] = None) -> int:
        """Get row count for a table, optionally filtered by client_id.

        Returns 0 on any query failure.

        NOTE(review): ``table`` is interpolated into the SQL — only pass
        trusted, internal table names, never user input.
        """
        conn = self._get_connection()
        try:
            with conn.cursor() as cur:
                if client_id:
                    cur.execute(
                        f"SELECT COUNT(*) FROM {table} WHERE client_id = %s",
                        (client_id,),
                    )
                else:
                    cur.execute(f"SELECT COUNT(*) FROM {table}")
                return cur.fetchone()[0]
        except Exception as e:
            logger.error(f"Count query failed: {e}")
            return 0

    def get_spend_data_schema(self, client_id: str) -> Dict:
        """Get metadata about available data for a client — used by NL-to-SQL.

        Returns:
            Dict with countries, branches, channels (lists) and
            date_min/date_max (ISO strings or None). On failure returns the
            same shape with empty lists and None dates.
        """
        conn = self._get_connection()
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT DISTINCT country FROM spend_data WHERE client_id = %s ORDER BY country",
                    (client_id,),
                )
                countries = [row[0] for row in cur.fetchall()]

                cur.execute(
                    "SELECT DISTINCT branch FROM spend_data WHERE client_id = %s ORDER BY branch",
                    (client_id,),
                )
                branches = [row[0] for row in cur.fetchall()]

                cur.execute(
                    "SELECT DISTINCT channel FROM spend_data WHERE client_id = %s ORDER BY channel",
                    (client_id,),
                )
                channels = [row[0] for row in cur.fetchall()]

                cur.execute(
                    "SELECT MIN(date), MAX(date) FROM spend_data WHERE client_id = %s",
                    (client_id,),
                )
                date_range = cur.fetchone()

                return {
                    "countries": countries,
                    "branches": branches,
                    "channels": channels,
                    "date_min": str(date_range[0]) if date_range[0] else None,
                    "date_max": str(date_range[1]) if date_range[1] else None,
                }
        except Exception as e:
            logger.error(f"Schema metadata query failed: {e}")
            return {
                "countries": [],
                "branches": [],
                "channels": [],
                "date_min": None,
                "date_max": None,
            }
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
# Module-level singleton
# ---------------------------------------------------------------------------

# Lazily-created shared connector instance (one per process).
_connector: Optional[PostgresConnector] = None


def get_connector() -> PostgresConnector:
    """Get the shared PostgresConnector singleton."""
    # NOTE(review): not thread-safe — two concurrent first calls could each
    # build a connector; acceptable if first use happens on a single thread.
    global _connector
    if _connector is None:
        _connector = PostgresConnector()
    return _connector
|
||||||
@@ -0,0 +1,139 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — Web Search Skill
|
||||||
|
|
||||||
|
Primary: Tavily API (clean, summarised results built for LLM agents)
|
||||||
|
Fallback: duckduckgo-search (no API key, no rate limits, free)
|
||||||
|
|
||||||
|
Auto-fallback: if Tavily errors or quota exceeded, silently switch to DuckDuckGo.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
from typing import List, Dict, Optional
|
||||||
|
|
||||||
|
from config.settings import get_settings
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
def web_search(
    query: str,
    max_results: int = 5,
    lookback_days: int = 1,
) -> List[Dict]:
    """
    Search the web using Tavily (primary) or DuckDuckGo (fallback).

    Args:
        query: Search query string
        max_results: Maximum number of results
        lookback_days: Only keep results from the last N days

    Returns:
        List of dicts with: title, url, content, date
    """
    hits = _tavily_search(query, max_results)

    # Fall back when the primary returns nothing (error, quota, no key).
    if not hits:
        logger.info("Tavily returned no results, falling back to DuckDuckGo")
        hits = _ddg_search(query, max_results)

    # Drop stale items when a recency window was requested.
    return _filter_recent(hits, lookback_days) if lookback_days > 0 else hits
|
||||||
|
|
||||||
|
|
||||||
|
def _tavily_search(query: str, max_results: int = 5) -> List[Dict]:
    """Search using Tavily API."""
    settings = get_settings()

    if not settings.tavily_api_key:
        logger.info("Tavily API key not configured, skipping")
        return []

    try:
        from tavily import TavilyClient

        client = TavilyClient(api_key=settings.tavily_api_key)
        response = client.search(
            query=query,
            search_depth="advanced",
            max_results=max_results,
        )

        # Normalise Tavily's payload to the shared result shape.
        hits = [
            {
                "title": item.get("title", ""),
                "url": item.get("url", ""),
                "content": item.get("content", ""),
                "date": item.get("published_date", ""),
                "source": "tavily",
            }
            for item in response.get("results", [])
        ]

        logger.info(f"Tavily returned {len(hits)} results for: {query[:50]}")
        return hits

    except Exception as e:
        logger.warning(f"Tavily search failed: {e}")
        return []
|
||||||
|
|
||||||
|
|
||||||
|
def _ddg_search(query: str, max_results: int = 5) -> List[Dict]:
    """Search using DuckDuckGo (fallback — no API key needed)."""
    try:
        from duckduckgo_search import DDGS

        # Normalise DDG's payload to the shared result shape; DDG gives no
        # publication date.
        with DDGS() as ddgs:
            hits = [
                {
                    "title": r.get("title", ""),
                    "url": r.get("href", ""),
                    "content": r.get("body", ""),
                    "date": "",
                    "source": "duckduckgo",
                }
                for r in ddgs.text(query, max_results=max_results)
            ]

        logger.info(f"DuckDuckGo returned {len(hits)} results for: {query[:50]}")
        return hits

    except Exception as e:
        logger.warning(f"DuckDuckGo search failed: {e}")
        return []
|
||||||
|
|
||||||
|
|
||||||
|
def _filter_recent(results: List[Dict], lookback_days: int) -> List[Dict]:
|
||||||
|
"""Filter results to only include items from the last N days."""
|
||||||
|
if not results:
|
||||||
|
return results
|
||||||
|
|
||||||
|
cutoff = datetime.utcnow() - timedelta(days=lookback_days)
|
||||||
|
filtered = []
|
||||||
|
|
||||||
|
for r in results:
|
||||||
|
date_str = r.get("date", "")
|
||||||
|
if not date_str:
|
||||||
|
# No date info — include it (benefit of the doubt)
|
||||||
|
filtered.append(r)
|
||||||
|
continue
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Try common date formats
|
||||||
|
for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d", "%B %d, %Y"):
|
||||||
|
try:
|
||||||
|
dt = datetime.strptime(date_str[:19], fmt)
|
||||||
|
if dt >= cutoff:
|
||||||
|
filtered.append(r)
|
||||||
|
break
|
||||||
|
except ValueError:
|
||||||
|
continue
|
||||||
|
else:
|
||||||
|
# Can't parse date, include it
|
||||||
|
filtered.append(r)
|
||||||
|
except Exception:
|
||||||
|
filtered.append(r)
|
||||||
|
|
||||||
|
return filtered
|
||||||
@@ -0,0 +1,17 @@
|
|||||||
|
# SOUL — ACME Corporation
|
||||||
|
|
||||||
|
## Identity
|
||||||
|
You are Clawrity, ACME's business intelligence assistant.
|
||||||
|
Speak professionally but conversationally.
|
||||||
|
Always ground answers in data. Never speculate.
|
||||||
|
|
||||||
|
## Business Context
|
||||||
|
- Operates in: US, Canada, MENA
|
||||||
|
- Primary metric: Revenue per lead
|
||||||
|
- Risk tolerance: Conservative (max 15% budget reallocation per suggestion)
|
||||||
|
|
||||||
|
## Rules
|
||||||
|
- If data unavailable, say "I don't have that data right now"
|
||||||
|
- Always surface bottom 3 branches in daily digests
|
||||||
|
- Budget suggestions must cite specific historical data points
|
||||||
|
- Never compare to competitors by name unless from Scout Agent
|
||||||
@@ -0,0 +1,56 @@
|
|||||||
|
"""
|
||||||
|
Clawrity — SOUL Loader
|
||||||
|
|
||||||
|
Reads the SOUL.md file for a client and returns raw text for prompt injection.
|
||||||
|
SOUL.md defines the AI's personality, business context, and rules per client.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
from config.client_loader import ClientConfig
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
def load_soul(client_config: ClientConfig) -> str:
    """
    Load the SOUL.md content for a client.

    Args:
        client_config: The client's configuration containing soul_file path.

    Returns:
        Raw markdown text of the SOUL file, or a default prompt if the file
        is missing or unreadable.
    """
    soul_path = Path(client_config.soul_file)

    # Missing file is an expected condition — warn and fall back.
    if not soul_path.exists():
        logger.warning(
            f"SOUL file not found at {soul_path} for client {client_config.client_id}. "
            f"Using default personality."
        )
        return _default_soul(client_config)

    try:
        content = soul_path.read_text(encoding="utf-8")
    except Exception as e:
        # Unreadable file (permissions, encoding, ...) also falls back.
        logger.error(f"Error reading SOUL file {soul_path}: {e}")
        return _default_soul(client_config)

    logger.info(f"Loaded SOUL for {client_config.client_id} from {soul_path}")
    return content
|
||||||
|
|
||||||
|
|
||||||
|
def _default_soul(client_config: ClientConfig) -> str:
    """Generate a minimal default SOUL if the file is missing."""
    name = client_config.client_name
    # Assembled line by line; the result is byte-identical to the
    # triple-quoted template this replaces.
    return (
        f"# SOUL — {name}\n"
        "\n"
        "## Identity\n"
        f"You are Clawrity, {name}'s business intelligence assistant.\n"
        "Speak professionally. Always ground answers in data. Never speculate.\n"
        "\n"
        "## Rules\n"
        '- If data unavailable, say "I don\'t have that data right now"\n'
        "- Always cite specific data points in your responses\n"
    )
|
||||||
Reference in New Issue
Block a user