System Design Document

William Garneau edited this page Nov 4, 2024 · 15 revisions

1. Introduction

This document outlines the system design for the Veracity Evaluation Backend, a platform that detects and mitigates misinformation using language models and web search.

1.1 Purpose

The purpose of this system is to provide a backend service that can analyze claims, evaluate their veracity, and return detailed responses to combat misinformation.

1.2 Scope

This system will handle user queries, interact with language models, perform web searches, store and retrieve data, and manage user interactions.

2. System Architecture

2.1 High-Level Architecture Diagram

graph TD
    A[Frontend/Browser Extension] -->|RESTful API| B[FastAPI Backend]
    B --> C[LLM - Llama 3.1 70B]
    B --> D[Google Web Search API]
    B --> E[(Cloud SQL PostgreSQL)]
    F[Google Kubernetes Engine] -->|Orchestration| B
    F -->|Orchestration| C
    F -->|Orchestration| D
    G[Google Cloud Platform] -->|Hosts| F
    G -->|Hosts| E
    H[Memorystore for Redis] -->|Fast Data Access| B
    I[Cloud Load Balancing] -->|Load Distribution| B
    J[Secret Manager] -->|Secrets Management| B
    K[Certificate Manager] -->|SSL/TLS| I
    L[Auth0] -->|Authentication| B
    M[Cloud CDN] -->|Static Assets| A

2.2 Component Description

  1. FastAPI Backend: Core application logic, API endpoints, and request handling.
  2. LLM (Llama 3.1 70B): Large language model for advanced text analysis and generation.
  3. Google Web Search API: Provides real-time web search results to enrich responses.
  4. Cloud SQL (PostgreSQL): Main database for storing user data, claims, and analysis results.
  5. Google Kubernetes Engine: Orchestrates and manages containerized application components.
  6. Memorystore for Redis: In-memory cache for fast data retrieval and temporary storage.
  7. Cloud Load Balancing: Distributes incoming traffic across multiple backend instances.
  8. Secret Manager: Securely stores and manages sensitive information like API keys.
  9. Certificate Manager: Handles SSL/TLS certificates for secure communications.
  10. Auth0: Manages user authentication and authorization.
  11. Cloud CDN: Delivers static assets with low latency.

3. Data Flow

3.1 Query Processing Flow

  1. User submits a query through the frontend or browser extension.
  2. Request is received by FastAPI backend via RESTful API.
  3. Backend validates the request and authenticates the user.
  4. Query is sent to the LLM for initial analysis.
  5. Relevant keywords are extracted and sent to Google Web Search API.
  6. Search results are processed and combined with LLM analysis.
  7. Final response is generated and sent back to the user.
  8. Query and results are stored in the database for future reference.
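
The steps above can be sketched as a single orchestration function. This is a minimal illustration, not the project's actual code: the names `process_query` and `VeracityResponse` are hypothetical, and the LLM, search, authentication, and storage calls are passed in as plain callables so the sketch runs without any external service. It also assumes keywords are extracted from the claim text itself, which the flow does not specify, and it stores the result just before returning rather than after the response is sent.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VeracityResponse:
    """Hypothetical response shape; the real API schema may differ."""
    claim: str
    analysis: str
    sources: List[str] = field(default_factory=list)


def process_query(
    claim: str,
    token: str,
    authenticate: Callable[[str], bool],
    analyze: Callable[[str], str],
    extract_keywords: Callable[[str], List[str]],
    search: Callable[[List[str]], List[str]],
    store: Callable[[VeracityResponse], None],
) -> VeracityResponse:
    # Step 3: validate the request and authenticate the user.
    if not claim.strip():
        raise ValueError("claim must not be empty")
    if not authenticate(token):
        raise PermissionError("invalid or missing token")

    # Step 4: initial analysis by the LLM (stubbed by the caller here).
    analysis = analyze(claim)

    # Step 5: extract keywords and run a web search with them.
    sources = search(extract_keywords(claim))

    # Steps 6-7: combine search results with the LLM analysis.
    response = VeracityResponse(claim=claim, analysis=analysis, sources=sources)

    # Step 8: persist the query and its results.
    store(response)
    return response


# Usage with stand-in callables (all values are illustrative).
saved: List[VeracityResponse] = []
result = process_query(
    "The moon is made of cheese",
    token="demo-token",
    authenticate=lambda t: t == "demo-token",
    analyze=lambda c: f"stub analysis of: {c}",
    extract_keywords=lambda c: c.lower().split()[:3],
    search=lambda kws: [f"https://example.com/search?q={'+'.join(kws)}"],
    store=saved.append,
)
```

Injecting the collaborators keeps each step replaceable in tests; in a FastAPI application they would more likely be provided through the framework's dependency mechanism.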

3.2 User Authentication Flow

  1. User initiates login or registration process.
  2. Request is sent to Auth0 for authentication.
  3. Upon successful authentication, a JSON Web Token (JWT) is generated.
  4. Token is sent back to the client for use in subsequent API calls.
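
To make the token handling concrete, the sketch below signs and verifies a JWT using only the Python standard library. It is an illustration under stated assumptions: it uses the symmetric HS256 algorithm with a shared secret, whereas tokens issued by an identity provider such as Auth0 are often signed with an asymmetric algorithm (for example RS256), and a production backend would normally rely on a maintained JWT library rather than hand-rolled verification. The function names `sign_token` and `verify_token` are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(data: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that the JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_token(claims: dict, secret: bytes) -> str:
    """Build header.payload.signature for an HS256 token (demo only)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return f"{signing_input}.{_b64url_encode(_signature(signing_input, secret))}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if the signature and expiry check out, else raise ValueError."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = _signature(f"{header_b64}.{payload_b64}", secret)
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        raise ValueError("token expired")
    return claims


# Usage: the secret, subject, and expiry below are placeholder values.
demo_secret = b"demo-secret"
demo_token = sign_token({"sub": "user-1", "exp": 2000}, demo_secret)
demo_claims = verify_token(demo_token, demo_secret, now=1000)
```

The client would send the token on subsequent API calls, typically in an `Authorization: Bearer <token>` header, and the backend would run a check like `verify_token` before handling the request.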

4. Database Schema (High-Level)

4.1 Users Table

  • id (UUID)
  • username (String)
  • email (String)
  • auth0_id (String)
  • created_at (Timestamp)
  • last_login (Timestamp)

4.2 Claims Table

  • id (UUID)
  • user_id (UUID, foreign key to Users)
  • claim_text (Text)
  • context (Text)
  • created_at (Timestamp)

4.3 Analysis Table

  • id (UUID)
  • claim_id (UUID, foreign key to Claims)
  • veracity_score (Float)
  • confidence_score (Float)
  • analysis_text (Text)
  • created_at (Timestamp)

4.4 Sources Table

  • id (UUID)
  • analysis_id (UUID, foreign key to Analysis)
  • url (String)
  • title (String)
  • snippet (Text)
  • credibility_score (Float)

4.5 Feedback Table

  • id (UUID)
  • analysis_id (UUID, foreign key to Analysis)
  • user_id (UUID, foreign key to Users)
  • rating (Float, rating >= 1 and rating <= 5)
  • comment (Text)
  • created_at (Timestamp)
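
As a rough illustration of the rating constraint, the Feedback row could be modeled in application code as a dataclass that rejects out-of-range ratings before anything reaches the database. The class below is a sketch, not the project's ORM model; treating `comment` as optional is an assumption, since the schema does not state its nullability.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Feedback:
    """Illustrative in-memory mirror of the Feedback table."""
    analysis_id: uuid.UUID
    user_id: uuid.UUID
    rating: float
    comment: Optional[str] = None  # assumption: comment may be omitted
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self) -> None:
        # Mirrors the schema constraint: rating >= 1 and rating <= 5.
        if not 1 <= self.rating <= 5:
            raise ValueError(f"rating must be between 1 and 5, got {self.rating}")


# Usage with freshly generated placeholder identifiers.
ok_feedback = Feedback(analysis_id=uuid.uuid4(), user_id=uuid.uuid4(), rating=4.5)
```

Validating in the application gives a clearer error message, but the same rule would still be worth enforcing as a database-level constraint so that other writers cannot bypass it.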

4.6 Conversations Table

  • id (UUID)
  • user_id (UUID, foreign key to Users)
  • start_time (Timestamp)
  • end_time (Timestamp, nullable)
  • status (String, default 'active')

4.7 Messages Table

Note: Either conversation_id or claim_conversation_id must be non-null, but not both.

  • id (UUID)
  • conversation_id (UUID, nullable, foreign key to Conversations)
  • claim_conversation_id (UUID, nullable, foreign key to Claim_Conversations)
  • sender_type (String, enum: 'user' or 'bot')
  • content (Text)
  • timestamp (Timestamp)
  • claim_id (UUID, foreign key to Claims, nullable)
  • analysis_id (UUID, foreign key to Analysis, nullable)
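
The "exactly one of the two parent IDs" rule for Messages can be enforced with a table-level CHECK constraint. The sketch below demonstrates the idea against an in-memory SQLite database so that it runs anywhere; the production database is PostgreSQL, where a similar expression should be usable, but the exact DDL here is an assumption and the table is trimmed to the columns relevant to the rule. The helper `try_insert` is hypothetical.

```python
import sqlite3
import uuid
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE messages (
        id TEXT PRIMARY KEY,
        conversation_id TEXT,
        claim_conversation_id TEXT,
        sender_type TEXT NOT NULL CHECK (sender_type IN ('user', 'bot')),
        content TEXT NOT NULL,
        -- Exactly one of the two parent references must be set.
        CHECK ((conversation_id IS NULL) <> (claim_conversation_id IS NULL))
    )
    """
)


def try_insert(
    conversation_id: Optional[str],
    claim_conversation_id: Optional[str],
    sender_type: str = "user",
) -> bool:
    """Insert a message; return False if a constraint rejects the row."""
    try:
        conn.execute(
            "INSERT INTO messages VALUES (?, ?, ?, ?, ?)",
            (
                str(uuid.uuid4()),
                conversation_id,
                claim_conversation_id,
                sender_type,
                "hello",
            ),
        )
        return True
    except sqlite3.IntegrityError:
        return False


general_ok = try_insert("conv-1", None)           # general conversation message
claim_ok = try_insert(None, "claim-conv-1")       # claim-specific message
neither = try_insert(None, None)                  # violates the rule
both = try_insert("conv-1", "claim-conv-1")       # violates the rule
bad_sender = try_insert("conv-1", None, "admin")  # not 'user' or 'bot'
```

Putting the rule in the schema means it holds regardless of which service writes to the table.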

4.8 Domains Table

  • id (UUID)
  • domain_name (String, Unique)
  • credibility_score (Float)
  • is_reliable (Boolean)
  • description (Text, Nullable)
  • created_at (Timestamp)
  • updated_at (Timestamp)
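
One way the Domains table might be used is to look up a credibility score for a source URL by its host name. The sketch below is an assumption about that usage: the helper names are hypothetical, the scores are made-up placeholder values, and stripping a leading "www." is a simplification that a real implementation may want to replace with proper registrable-domain handling.

```python
from typing import Dict, Optional
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def source_credibility(
    url: str,
    domain_scores: Dict[str, float],
    default: Optional[float] = None,
) -> Optional[float]:
    """Look up the credibility score for the URL's domain, if known."""
    return domain_scores.get(domain_of(url), default)


# Placeholder data standing in for rows of the Domains table.
known_domains = {"example.org": 0.9, "example.net": 0.2}

score_known = source_credibility("https://www.Example.org/article?id=1", known_domains)
score_unknown = source_credibility("https://unknown.example/page", known_domains)
```

Returning `None` for unknown domains leaves it to the caller to decide whether a missing score should count as neutral or be flagged for review.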

4.9 Claim_Conversations Table

  • id (UUID, primary key)
  • conversation_id (UUID, foreign key to Conversations)
  • claim_id (UUID, foreign key to Claims)
  • start_time (Timestamp)
  • end_time (Timestamp, nullable)
  • status (String, default 'active')

5. API Endpoints

Refer to the API Specification for detailed information on available endpoints.

6. Security Considerations

  • All communications are encrypted using SSL/TLS.
  • API keys and sensitive configurations are stored in Secret Manager.
  • Regular security audits and penetration testing will be conducted.

7. Scalability and Performance

  • Kubernetes allows for easy horizontal scaling of application components.
  • Redis cache reduces database load for frequently accessed data.
  • Cloud Load Balancing ensures efficient distribution of incoming requests.
  • Cloud CDN minimizes latency for static asset delivery.
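
The caching point can be illustrated with the cache-aside pattern: check the cache first, fall back to the database on a miss, then populate the cache. The sketch below uses a small in-memory class as a stand-in for Redis so that it runs without a server; `TTLCache`, `get_analysis`, and the fake database loader are hypothetical names, and the 60-second expiry is an arbitrary example value.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory stand-in for a Redis cache with per-key expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def get_analysis(claim_id: str, cache: TTLCache, load_from_db: Callable[[str], Any]) -> Any:
    """Cache-aside read: serve from cache, otherwise load and cache."""
    cached = cache.get(claim_id)
    if cached is not None:
        return cached
    value = load_from_db(claim_id)
    cache.set(claim_id, value)
    return value


# Usage with a controllable clock and a fake database that counts reads.
fake_now = [0.0]
db_reads = []


def fake_db_load(claim_id: str) -> dict:
    db_reads.append(claim_id)
    return {"claim_id": claim_id, "veracity_score": 0.5}


cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
first = get_analysis("claim-1", cache, fake_db_load)   # miss, reads the database
second = get_analysis("claim-1", cache, fake_db_load)  # hit, no database read
fake_now[0] = 61.0
third = get_analysis("claim-1", cache, fake_db_load)   # expired, reads again
```

A short expiry bounds how stale a cached analysis can become; entries could also be invalidated explicitly when an analysis is updated.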

8. Monitoring and Logging

  • Application logs will be centralized and analyzed for performance and error tracking.
  • Key metrics (response times, error rates, etc.) will be monitored, with alerts raised when they degrade.
  • Regular performance reviews will be conducted to identify optimization opportunities.
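
A lightweight way to capture the metrics mentioned above is a decorator that logs each call's duration and counts errors. This is a sketch using only the standard library; the logger name, the in-process counters, and the `monitored` decorator are illustrative assumptions, and a real deployment would more likely export such metrics to a dedicated monitoring system.

```python
import logging
import time
from collections import defaultdict
from functools import wraps
from typing import Any, Callable, DefaultDict

logger = logging.getLogger("veracity.metrics")  # hypothetical logger name
call_counts: DefaultDict[str, int] = defaultdict(int)
error_counts: DefaultDict[str, int] = defaultdict(int)


def monitored(func: Callable[..., Any]) -> Callable[..., Any]:
    """Log the duration of each call and count calls and failures."""

    @wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        call_counts[func.__name__] += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            error_counts[func.__name__] += 1
            logger.exception("%s failed", func.__name__)
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("%s took %.1f ms", func.__name__, elapsed_ms)

    return wrapper


def error_rate(name: str) -> float:
    """Fraction of calls to `name` that raised an exception."""
    calls = call_counts[name]
    return error_counts[name] / calls if calls else 0.0


@monitored
def score_claim(text: str) -> float:
    # Stand-in for real analysis work; rejects empty input.
    if not text:
        raise ValueError("empty claim")
    return 0.5


for sample in ("a", "b", "c"):
    score_claim(sample)
try:
    score_claim("")
except ValueError:
    pass
```

An alerting rule could then compare `error_rate("score_claim")` against a threshold chosen by the team.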

9. Disaster Recovery and Backup

  • Regular database backups will be performed and stored in a separate geographic region.
  • A disaster recovery plan will be developed and tested periodically.

10. Relationships

  • Users can create multiple Claims, submit multiple Feedback entries, and initiate multiple Conversations.
  • Each Claim can have one Analysis.
  • Each Analysis can have multiple Sources and Feedback.
  • Each Conversation can have multiple Messages.
  • Messages can optionally be associated with a Claim and an Analysis.
  • Domains are standalone entities used to evaluate the credibility of Sources.
flowchart TD
    USER --> |"initiates (1:n)"| CONVERSATION
    USER --> |"sends (1:n)"| MESSAGE
    USER --> |"provides (1:n)"| FEEDBACK
    
    CONVERSATION --> |"contains (1:n)"| CLAIM_CONVERSATION
    CONVERSATION --> |"has general (1:n)"| MESSAGE
    
    CLAIM_CONVERSATION --> |"has specific (1:n)"| MESSAGE
    CLAIM_CONVERSATION --> |"is about (1:1)"| CLAIM
    
    CLAIM --> |"has (1:1)"| ANALYSIS
    
    ANALYSIS --> |"cites (1:n)"| SOURCE
    ANALYSIS --> |"receives (1:n)"| FEEDBACK

11. Future Considerations

  • Implementing a feedback loop for continuous improvement of the system's accuracy.

This document serves as a high-level overview of the system design and will be updated as the project evolves.