WFU-Gmail

A Python CLI tool that extracts, consolidates, and classifies support data from three sources into a unified SQLite database for the WordPress File Upload plugin:

  1. Gmail — Fetches all emails via the Gmail API (OAuth2, batch requests, resumable)
  2. WordPress comments — Parses a wp_comments MySQL dump from iptanus.com
  3. WordPress.org forum — Scrapes the plugin's support forum at wordpress.org

All data is stored in a single messages table with full-text search (FTS5), allowing unified querying across all sources. A two-pass classification pipeline using Gemini 2.5 Flash on Vertex AI then tags each message (Pass 1: edition + topic) and extracts Q&A pairs from support threads (Pass 2), producing a knowledge base of 3,918 Q&A pairs from 24,205 analyzed threads.

Project Structure

git/wfu-ai/
├── pyproject.toml                  # Package config, dependencies, CLI entry point
├── .env.example                    # Environment variable template
├── .gitignore
├── SETUP_GOOGLE_CLOUD.md           # Google Cloud OAuth setup instructions
├── src/
│   └── wfu_gmail/
│       ├── __init__.py
│       ├── cli.py                  # CLI with 18 subcommands
│       ├── config.py               # .env loading, Config dataclass
│       ├── auth.py                 # Google OAuth2 flow + token management
│       ├── fetcher.py              # Gmail API batch fetching with retry
│       ├── parser.py               # MIME tree walker, base64 decoding
│       ├── models.py               # EmailMessage / EmailAttachment dataclasses
│       ├── storage_sqlite.py       # SQLite schema, FTS5, migrations, CRUD
│       ├── storage_json.py         # Raw JSON file storage (data/raw/YYYY/MM/)
│       ├── rate_limiter.py         # Per-second API throttle
│       ├── wp_comments.py          # MySQL dump parser + DB importer
│       ├── wp_org_scraper.py       # wordpress.org forum scraper
│       ├── classifier.py           # Pass 1: Gemini batch classification (edition + topic)
│       └── pass2_analyzer.py       # Pass 2: Gemini Q&A extraction per thread
├── scripts/
│   ├── parse_wp_comments.py        # Standalone script for WP comment stats
│   ├── test_gemini_qa.py           # Gemini Q&A prompt testing (v1/v2/v3)
│   └── pass2_qa_report.py          # Q&A extraction report generator
├── data/                           # Created at runtime, gitignored
│   ├── raw/                        # JSON: data/raw/2024/01/msg_{id}.json
│   ├── wfu_emails.db               # SQLite database
│   └── wp_comments.sql             # MySQL dump (user-provided)
└── credentials/                    # Gitignored
    ├── client_secret.json          # From Google Cloud Console
    └── token.json                  # Generated by OAuth flow

Installation

Requires Python 3.12+.

cd git/wfu-ai
pip install -e .

Dependencies: google-api-python-client, google-auth-oauthlib, google-auth, python-dotenv, beautifulsoup4, requests, google-genai.

CLI Commands

Run via wfu-gmail <command> or python -m wfu_gmail.cli <command>.

Core Gmail Commands

Command Description
auth Run OAuth2 flow, verify connection, print email address and mailbox size
fetch [-q QUERY] Fetch all emails (or filtered by Gmail search query). Resumable — skips already-fetched messages
stats Print total messages, threads, attachments, date range, top 10 senders
labels Fetch Gmail labels from API, save to DB, show with local message counts
backfill-threads Re-parse raw JSON files to backfill In-Reply-To and References headers

Email Filtering Commands

These commands work together in a checklist-based workflow. Generate a text file, edit it to mark items with [*], then apply it.

Command Description
filter-recipients Generate recipients_checklist.txt — mark unrelated recipient addresses
apply-recipients Import checked addresses from the checklist as unrelated
filter-labels Generate labels_checklist.txt — mark unrelated Gmail labels
apply-labels Import checked labels from the checklist as unrelated
select-recipients Generate recipients_select.txt — mark related recipient addresses (whitelist)
apply-selected-recipients Import checked addresses as related
select-labels Generate labels_select.txt — mark related Gmail labels (whitelist)
apply-selected-labels Import checked labels as related, compute included message counts with thread propagation
apply-keys Read keywords from keys_select.txt, search all messages, store keyword match counts
count-related Combine all filters (labels + recipients + keywords), apply thread propagation, mark is_related=1 in DB
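The checklist files are plain text with one item per line, and checked items are marked with [*]. A minimal sketch of how such a file might be parsed (the parse_checklist helper and the exact line format are illustrative, not the tool's actual code):

```python
def parse_checklist(text: str) -> list[str]:
    """Return the items whose lines are marked with [*] (assumed format)."""
    selected = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[*]"):
            selected.append(line[3:].strip())
    return selected

example = """\
[ ] newsletter@example.com
[*] support@iptanus.com
[*] info@iptanus.com
"""
print(parse_checklist(example))  # the two checked addresses
```

The corresponding apply-* command would then write the selected items into the matching filter table.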

Data Import Commands

Command Description
import-comments Parse data/wp_comments.sql MySQL dump and import approved comments
import-wp-org Scrape all support topics from wordpress.org plugin forum and import

Classification Commands

Command Description
classify [--sample N] Pass 1: Batch-classify emails with Gemini 2.5 Flash (edition + topic tag). Processes 200 emails per API call.
pass2 [--milestone N] Pass 2: Deep Q&A extraction per thread with Gemini 2.5 Flash. Processes up to N threads (default 1000) then stops for review. Resumable.
pass2 --dry-run Show pending thread counts and cost estimate without making API calls

Database Schema

All data sources are stored in one unified messages table:

Column Type Description
message_id TEXT PK Unique ID (msg_id for email, wp_comment_{id}, wporg_{id})
thread_id TEXT Groups related messages (thread_id for email, wp_post_{id}, wporg_topic_{slug})
subject TEXT Email subject or forum topic title
sender TEXT From header or comment author
to_recipients TEXT To header or owner email
date_utc TEXT ISO timestamp
body_plain TEXT Plain text content
body_html TEXT HTML content (emails only)
label_ids TEXT JSON array of Gmail labels or ["wp_comment"]/["wp_org_topic"]
category TEXT email, wp_comment, or wp_org_topic
edition TEXT free, pro, general, or unknown
is_related INTEGER 1 if message is relevant to plugin support, 0 otherwise
json_path TEXT Path to raw JSON file (emails) or source URL (wp.org)
in_reply_to TEXT Parent message ID for threading
fetched_at TEXT When the record was imported

Additional tables: attachments (email attachment metadata), labels (Gmail label mappings), related_recipients, related_labels, related_keywords, unrelated_recipients, unrelated_labels, sync_metadata, message_classifications (Pass 1 results), thread_analyses (Pass 2 results).

thread_analyses Table (Pass 2)

Column Type Description
thread_id TEXT PK References messages.thread_id
thread_type TEXT support, billing, feature_request, other, skip
topic TEXT config, error, upload, hooks, styling, compatibility, cloud, feature, security, performance, other, skip
sub_topic TEXT Specific sub-topic description
edition TEXT free, pro, both, unknown
summary TEXT Thread summary
resolution TEXT How the issue was resolved (if applicable)
resolved INTEGER 1 if resolved, 0 if not
qa_question TEXT Extracted Q&A question (null if no Q&A)
qa_answer TEXT Extracted Q&A answer (null if no Q&A)
environment TEXT WordPress version, PHP version, etc.
keywords TEXT Comma-separated keywords
model_used TEXT gemini-2.5-flash, claude-opus-4-6, or rule-based
classified_at TEXT ISO timestamp

Full-text search is provided via a messages_fts FTS5 virtual table indexing subject, body_plain, and sender, kept in sync via triggers.
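The external-content FTS5 pattern described above can be sketched as follows; the table and trigger definitions are a simplified illustration (insert/delete only), not the project's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (
    message_id TEXT PRIMARY KEY,
    subject TEXT, body_plain TEXT, sender TEXT
);
-- External-content FTS5 table: index lives in messages_fts, rows in messages.
CREATE VIRTUAL TABLE messages_fts USING fts5(
    subject, body_plain, sender,
    content='messages', content_rowid='rowid'
);
-- Triggers keep the index in sync with the content table.
CREATE TRIGGER messages_ai AFTER INSERT ON messages BEGIN
    INSERT INTO messages_fts(rowid, subject, body_plain, sender)
    VALUES (new.rowid, new.subject, new.body_plain, new.sender);
END;
CREATE TRIGGER messages_ad AFTER DELETE ON messages BEGIN
    INSERT INTO messages_fts(messages_fts, rowid, subject, body_plain, sender)
    VALUES ('delete', old.rowid, old.subject, old.body_plain, old.sender);
END;
""")
conn.execute("INSERT INTO messages VALUES "
             "('msg_1', 'Upload fails', '413 on large files', 'user@example.com')")
hits = conn.execute(
    "SELECT m.message_id FROM messages_fts "
    "JOIN messages m ON m.rowid = messages_fts.rowid "
    "WHERE messages_fts MATCH 'upload'"
).fetchall()
print(hits)  # matches 'Upload' in the subject, case-folded
```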

Module Reference

cli.py — Command-Line Interface

Entry point for all operations. Uses argparse with 18 subcommands. Each command is implemented as a _cmd_* function that loads config, initializes the database, and orchestrates the relevant modules.

Key functions:

  • main() — Parses arguments, sets up logging, dispatches to the appropriate command handler.
  • _cmd_fetch(config, query) — Lists all Gmail message IDs, filters out already-fetched ones, then batch-fetches the rest. Each message is saved as raw JSON and inserted into SQLite. Tracks sync metadata for resumability.
  • _cmd_count_related(config) — The main filtering pipeline. Loads all whitelist filters (labels, recipients, keywords), scans every message, applies thread propagation (if any message in a thread matches, the whole thread is included), then marks is_related=1 in the database.
  • _cmd_import_comments(config) — Reads data/wp_comments.sql, parses it with wp_comments.load_wp_comments(), inserts into the messages table.
  • _cmd_import_wp_org(config) — Scrapes the wordpress.org forum topic listing, then scrapes each individual topic page, importing all posts.
  • _cmd_classify(config, sample) — Pass 1 classification. Batches 200 emails per Gemini API call to tag edition and topic.
  • _cmd_pass2(config, milestone, dry_run) — Pass 2 Q&A extraction. Processes one thread per API call, stops at milestone for cost review.
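The dispatch pattern, argparse subparsers routing each subcommand to a _cmd_* handler, might look like this minimal sketch (the handler body and flags are illustrative):

```python
import argparse

def _cmd_stats(args):
    # Placeholder handler; the real command queries SQLite for totals.
    print("stats for db:", args.db)

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="wfu-gmail")
    parser.add_argument("--db", default="data/wfu_emails.db")
    sub = parser.add_subparsers(dest="command", required=True)

    p_fetch = sub.add_parser("fetch", help="Fetch emails via the Gmail API")
    p_fetch.add_argument("-q", "--query", default=None)

    p_stats = sub.add_parser("stats", help="Print database statistics")
    p_stats.set_defaults(func=_cmd_stats)
    return parser

args = build_parser().parse_args(["stats"])
args.func(args)
```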

config.py — Configuration

Loads settings from a .env file and provides them as a Config dataclass.

  • Config — Dataclass with fields for OAuth paths, data directory, SQLite path, Gmail API parameters (batch size, max results, query), and log level. __post_init__ resolves all paths relative to project_root and creates required directories.
  • load_config(project_root) — Reads .env via dotenv_values(), constructs and returns a Config instance with environment overrides.
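The path-resolving __post_init__ behavior can be sketched like so; field names here are illustrative (the real Config has more settings):

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Config:
    project_root: Path
    data_dir: Path = Path("data")
    sqlite_path: Path = Path("data/wfu_emails.db")
    batch_size: int = 50

    def __post_init__(self):
        # Resolve relative paths against project_root and create directories,
        # as the section above describes.
        self.data_dir = self.project_root / self.data_dir
        self.sqlite_path = self.project_root / self.sqlite_path
        self.data_dir.mkdir(parents=True, exist_ok=True)

cfg = Config(project_root=Path(tempfile.mkdtemp()))
print(cfg.data_dir.exists())  # directory was created by __post_init__
```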

auth.py — Gmail OAuth2 Authentication

Handles the Google OAuth2 desktop flow for Gmail API access using gmail.readonly scope.

  • get_gmail_service(config) — Returns an authenticated Gmail API Resource object. Loads saved credentials or triggers the OAuth flow.
  • _load_or_create_credentials(config) — Tries to load credentials/token.json. If the token exists but is expired, refreshes it. If refresh fails or no token exists, runs the interactive OAuth flow.
  • _run_oauth_flow(config) — Opens a browser window for Google OAuth consent. Requires credentials/client_secret.json from Google Cloud Console. Saves the resulting token for reuse.
  • _save_token(creds, token_path) — Writes credentials to a JSON file.

fetcher.py — Gmail API Batch Fetching

Fetches full message data from the Gmail API using batch HTTP requests for efficiency.

  • list_all_message_ids(service, query, max_results_per_page) — Paginates through messages.list to collect all message IDs matching an optional Gmail search query. Returns a list of dicts with id and threadId.
  • fetch_messages(service, message_ids, on_message, batch_size, max_retries) — Processes message IDs in batches of 50. For each batch, builds a BatchHttpRequest with messages.get(format='full') calls. Parses each response into an EmailMessage and passes it to the on_message callback. Logs progress as percentage. Returns a stats dict with fetched, errors, and total counts.
  • _fetch_batch_with_retry(service, message_ids, max_retries) — Executes a single batch request. On HTTP 429 (rate limit) errors, retries with exponential backoff (2, 4, 8 seconds). Returns a dict mapping message IDs to either their response data or an Exception.
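The retry-with-backoff logic can be sketched independently of the Gmail client; RateLimitError and the injected sleep/do_batch callables are illustrative stand-ins, not the module's real signatures:

```python
import time

class RateLimitError(Exception):
    """Stands in for an HTTP 429 response from the Gmail API."""

def fetch_with_retry(do_batch, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Run do_batch(), retrying on rate limits with 2s/4s/8s backoff."""
    for attempt in range(max_retries + 1):
        try:
            return do_batch()
        except RateLimitError:
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))  # exponential backoff

# Simulate a batch that fails twice with 429 before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return {"fetched": 50}

delays = []
result = fetch_with_retry(flaky, sleep=delays.append)
print(result, delays)  # {'fetched': 50} [2.0, 4.0]
```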

parser.py — MIME Message Parser

Converts Gmail API format='full' responses into EmailMessage objects by walking the MIME tree.

  • parse_message(raw) — Top-level parser. Extracts headers, parses the date, then walks the MIME tree to collect text bodies and attachment metadata. Returns an EmailMessage.
  • _extract_headers(headers) — Converts Gmail's [{"name": ..., "value": ...}] header list to a dict (first occurrence wins for duplicate headers).
  • _parse_date(date_str) — Parses RFC 2822 email date headers into UTC datetime objects. Falls back to current time on parse failure.
  • _walk_parts(part, bodies, attachments) — Recursively traverses the MIME part tree. Parts with filenames become EmailAttachment entries. multipart/* parts are recursed into. text/plain and text/html leaf parts are base64-decoded with charset detection.
  • _decode_body(data, charset) — Decodes base64url-encoded body data. Tries the declared charset first, then falls back through UTF-8 and Latin-1.
  • _get_charset(mime_type_header) — Extracts the charset= parameter from a Content-Type header string.

models.py — Data Models

  • EmailMessage — Dataclass representing a parsed email with 20 fields: message_id, thread_id, history_id, label_ids, snippet, subject, sender, to, cc, bcc, date_header, date_utc, message_id_header, in_reply_to, references, body_plain, body_html, raw_size_estimate, attachments, and raw_payload.
  • EmailAttachment — Dataclass with filename, mime_type, size, and attachment_id. Only metadata is stored — attachment content is not downloaded.

storage_sqlite.py — SQLite Database

Manages the SQLite database including schema creation, migrations, message insertion, and filtering queries.

  • init_db(db_path) — Creates the database file and executes the schema SQL (WAL mode, tables, indexes, FTS5, triggers). Calls _migrate() to add any new columns.
  • _migrate(conn) — Checks existing columns via PRAGMA table_info and adds any missing ones (message_id_hdr, in_reply_to, references_hdr, is_related, category, edition). This allows seamless schema evolution without dropping data.
  • insert_message(conn, msg, json_path) — Inserts an EmailMessage into the messages table using INSERT OR IGNORE for idempotency. Also inserts any attachment metadata. Returns True if the insert happened, False if it was a duplicate.
  • get_fetched_message_ids(conn) — Returns the set of all message IDs in the database. Used by fetch to skip already-fetched messages.
  • mark_related_messages(conn, related_ids) — Resets all messages to is_related=0, then batch-updates the given IDs to is_related=1 in chunks of 500 to avoid SQLite's variable limit.
  • get_stats(conn) — Returns a summary dict: total messages, threads, attachments, date range, top 10 senders.
  • Filter management functions: save_unrelated_recipients, save_unrelated_labels, save_related_recipients, save_related_labels, save_related_keywords, and their corresponding getters. Each uses a clear-and-reinsert pattern for idempotent updates.
  • update_sync_metadata(conn, key, value) — Upserts key-value pairs for tracking fetch state (last fetch start/end, total listed IDs).
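The chunked-update pattern behind mark_related_messages can be sketched as follows (a simplified illustration of the technique, not the module's exact code):

```python
import sqlite3

CHUNK = 500  # stay well under SQLite's default host-parameter limit

def mark_related(conn, related_ids):
    """Reset is_related everywhere, then set it for the given IDs in chunks."""
    conn.execute("UPDATE messages SET is_related = 0")
    ids = list(related_ids)
    for i in range(0, len(ids), CHUNK):
        chunk = ids[i:i + CHUNK]
        placeholders = ",".join("?" * len(chunk))
        conn.execute(
            f"UPDATE messages SET is_related = 1 "
            f"WHERE message_id IN ({placeholders})",
            chunk,
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages "
             "(message_id TEXT PRIMARY KEY, is_related INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO messages (message_id) VALUES (?)",
                 [(f"msg_{i}",) for i in range(1200)])
mark_related(conn, [f"msg_{i}" for i in range(0, 1200, 2)])
count = conn.execute("SELECT COUNT(*) FROM messages WHERE is_related = 1").fetchone()[0]
print(count)  # 600
```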

storage_json.py — Raw JSON Storage

Saves the raw Gmail API response as JSON files organized by date for lossless archival.

  • save_message_json(msg, data_dir) — Writes msg.raw_payload to data/raw/YYYY/MM/msg_{id}.json. Skips if the file already exists (resumability). Returns the relative path for storage in the database's json_path column.
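The date-partitioned layout and skip-if-exists behavior might look like this sketch (the argument list is illustrative; the real function takes the EmailMessage itself):

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def save_message_json(msg_id, payload, date_utc, data_dir: Path) -> Path:
    """Write the raw payload to data_dir/raw/YYYY/MM/msg_{id}.json,
    skipping files that already exist (resumability)."""
    path = data_dir / "raw" / f"{date_utc:%Y}" / f"{date_utc:%m}" / f"msg_{msg_id}.json"
    if not path.exists():  # never rewrite an archived message
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(payload))
    # Relative path goes into the database's json_path column.
    return path.relative_to(data_dir)

root = Path(tempfile.mkdtemp())
rel = save_message_json("abc123", {"id": "abc123"},
                        datetime(2024, 1, 5, tzinfo=timezone.utc), root)
print(rel)  # raw/2024/01/msg_abc123.json
```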

rate_limiter.py — API Rate Limiter

  • RateLimiter(calls_per_second) — Simple time-based throttle. Gmail allows 250 quota units/second; each messages.get costs 5 units, so batches of 50 (= 250 units) are limited to ~1 per second.
  • wait() — Blocks until enough time has passed since the last call.
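A time-based throttle like this can be sketched in a few lines (a minimal stand-in for the module, not its exact code):

```python
import time

class RateLimiter:
    """Block so that calls happen at most calls_per_second times per second."""
    def __init__(self, calls_per_second: float):
        self.min_interval = 1.0 / calls_per_second
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

# Per the quota note above: one 50-message batch (250 units) per second.
limiter = RateLimiter(calls_per_second=1)
```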

wp_comments.py — WordPress Comment Importer

Parses a MySQL wp_comments table dump exported from phpMyAdmin and imports approved comments into the messages table.

  • parse_mysql_values(sql_text) — Character-by-character parser for MySQL INSERT ... VALUES (...) syntax. Handles MySQL-specific backslash escaping (\', \\, \n, \r, \t). Returns a list of tuples, one per row. Only accepts rows with exactly 15 fields (matching the wp_comments schema).
  • extract_values_blocks(data) — Extracts the VALUES ...; portion from each INSERT statement. Tracks whether the cursor is inside a quoted string to avoid matching semicolons within comment content as statement terminators.
  • load_wp_comments(sql_path) — Reads the SQL file, extracts values blocks, parses all rows, then filters to only comment_type='comment' and comment_approved='1' (excluding spam, trash, order notes, reviews, pingbacks). Decodes HTML entities in content and author names. Returns a list of comment dicts.
  • import_comments_to_db(conn, comments, owner_email) — Maps each comment to the unified message schema:
      • message_id = wp_comment_{comment_id}
      • thread_id = wp_post_{post_id}
      • sender = "Author <email>" format
      • in_reply_to = wp_comment_{parent_id} (for threaded replies)
      • category = wp_comment, edition = free, is_related = 1
      • Uses INSERT OR IGNORE for idempotent re-runs
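The character-by-character approach to MySQL escape handling can be sketched as follows. This simplified parser handles a single (...) tuple body only; the real parse_mysql_values additionally extracts full statements, handles doubled quotes, and enforces the 15-field row shape:

```python
def parse_values_row(row: str) -> list:
    """Parse one MySQL values tuple body, honouring backslash escapes."""
    fields, buf, in_str, i = [], [], False, 0
    escapes = {"n": "\n", "r": "\r", "t": "\t",
               "'": "'", '"': '"', "\\": "\\", "0": "\0"}
    while i < len(row):
        ch = row[i]
        if in_str:
            if ch == "\\" and i + 1 < len(row):
                buf.append(escapes.get(row[i + 1], row[i + 1]))
                i += 2
                continue
            if ch == "'":
                in_str = False  # closing quote
            else:
                buf.append(ch)
        elif ch == "'":
            in_str = True       # opening quote
        elif ch == ",":
            fields.append("".join(buf))
            buf = []
        elif not ch.isspace():
            buf.append(ch)      # bare token (numbers, NULL)
        i += 1
    fields.append("".join(buf))
    return fields

row = "123, 45, 'It\\'s a test', 'line1\\nline2', NULL"
print(parse_values_row(row))
```

Note how a semicolon or comma inside a quoted string never splits a field, which is exactly why extract_values_blocks must track quoting state.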

wp_org_scraper.py — WordPress.org Forum Scraper

Scrapes the plugin's support forum on wordpress.org using requests + BeautifulSoup to parse the bbPress HTML.

  • create_session() — Creates a requests.Session with a custom User-Agent and Accept headers.
  • scrape_topic_urls(session) — Iterates through all pages of the topic listing at https://wordpress.org/support/plugin/wp-file-upload/. Uses CSS selectors to find topic links: first tries ul.bbp-topics li a.bbp-topic-permalink, then falls back to td.column-primary a, then a broad link scan. Detects next-page links to paginate. Returns a list of dicts with url and title.
  • scrape_topic(url, session) — Fetches a single topic page and all its reply pages. Extracts the topic title from <h1>, then calls _parse_page_posts() for each page. Follows pagination links within the topic. Returns a dict with url, title, and posts list.
  • _parse_page_posts(soup) — Extracts posts from a single page of a topic:
      • Lead topic: found in ul.bbp-lead-topic. Extracts author from div.bbp-topic-author, content from div.bbp-topic-content, date from p.bbp-topic-post-date.
      • Replies: found in div[id^="post-"] elements. Extracts author from div.bbp-reply-author, content from div.bbp-reply-content, date from p.bbp-reply-post-date.
  • _parse_post_author(post_el) — Finds <a> links with /support/users/ in the href. Extracts the username from the URL slug and the display name from the link text. Returns (display_name, username).
  • _parse_post_date(post_el) — Tries three strategies: <time datetime="..."> attribute, then timestamp link title attributes with format "January 1, 2024 at 10:27 am", then a regex scan of the text content.
  • _parse_post_content(post_el) — Looks for div.bbp-reply-content or div.bbp-topic-content, falls back to collecting all <p> tags.
  • import_topics_to_db(conn, topics, owner_email) — Maps each post to the unified message schema:
      • message_id = wporg_{post_id}
      • thread_id = wporg_topic_{slug}
      • sender = "Author (@username)"
      • in_reply_to = previous post's message_id (preserves conversation order)
      • category = wp_org_topic, edition = free, is_related = 1
  • get_scraped_topic_urls(conn) — Returns the set of topic URLs already imported (stored in json_path), enabling resumable scraping.
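The selector-cascade idea (try the precise bbPress selector first, then fall back) can be sketched with BeautifulSoup against a static snippet; the helper name and HTML are illustrative:

```python
from bs4 import BeautifulSoup

HTML = """
<ul class="bbp-topics">
  <li><a class="bbp-topic-permalink"
         href="https://wordpress.org/support/topic/upload-fails/">Upload fails</a></li>
  <li><a class="bbp-topic-permalink"
         href="https://wordpress.org/support/topic/pro-question/">Pro question</a></li>
</ul>
"""

def extract_topics(html: str) -> list[dict]:
    """Try each CSS selector in order; return results from the first that matches."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in ("ul.bbp-topics li a.bbp-topic-permalink",
                     "td.column-primary a"):
        links = soup.select(selector)
        if links:
            return [{"url": a["href"], "title": a.get_text(strip=True)}
                    for a in links]
    return []

topics = extract_topics(HTML)
print(topics)
```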

classifier.py — Pass 1 Batch Classification

Classifies emails in batches of 200 using Gemini 2.5 Flash on Vertex AI. Tags each email with edition (free/pro/both/unknown) and topic_tag.

  • run_classification(conn, sample) — Main loop: queries unclassified emails, batches them into groups of 200, sends each batch to Gemini with a structured prompt, parses the JSON response, and stores results in message_classifications. Resumable via LEFT JOIN to skip already-classified messages.
  • classify_batch(client, emails) — Builds a prompt with 200 email snippets (subject + first 200 chars of body), calls Gemini 2.5 Flash (temp=0.1), returns parsed JSON array of classifications.
  • Uses google.genai SDK with us-central1 location and Vertex AI backend.
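The batching side of this (prompt assembly and response parsing, the parts that run without any API call) might look like this sketch; the prompt wording and JSON keys are assumptions, not the module's actual format:

```python
import json

def build_batch_prompt(emails: list[dict]) -> str:
    """Number each email and include subject + first 200 chars of body."""
    lines = [
        "Classify each email. Reply with a JSON array of objects "
        "with keys i, edition, topic_tag.",
        "",
    ]
    for i, em in enumerate(emails):
        snippet = (em.get("body_plain") or "")[:200]
        lines.append(f"[{i}] Subject: {em['subject']}\n{snippet}")
    return "\n".join(lines)

def parse_batch_response(text: str) -> list[dict]:
    """Strip an optional ```json fence before parsing the array."""
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(text)

reply = '```json\n[{"i": 0, "edition": "pro", "topic_tag": "upload-error"}]\n```'
print(parse_batch_response(reply))
```

The actual Gemini call then sends the assembled prompt through the google.genai client with the Vertex AI backend, as noted above.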

pass2_analyzer.py — Pass 2 Q&A Extraction

Processes one support thread at a time through Gemini 2.5 Flash to extract structured Q&A pairs, summaries, and metadata.

  • run_pass2(conn, milestone, dry_run) — Main pipeline. Queries pending threads (not yet in thread_analyses), builds 5 few-shot examples, processes each thread sequentially with cost tracking. Stops after milestone threads for user review. Returns stats dict with counts, tokens, and cost.
  • get_pending_threads(conn, limit) — Queries threads ordered by message count ascending (smallest first) that have no entry in thread_analyses.
  • build_fewshot_messages() — Constructs 5 conversation-turn examples from known thread IDs for few-shot prompting.
  • analyze_thread(client, fewshot_messages, thread_text) — Sends one thread to Gemini with the few-shot context, parses the JSON response into topic, summary, Q&A, and metadata fields.
  • get_thread_full(conn, thread_id) — Concatenates all messages in a thread chronologically, truncating individual emails longer than 3,000 chars.
  • Retry logic: 3 retries with exponential backoff (2, 4, 8 seconds). JSON parse errors are logged but not stored, allowing re-processing on next run.
  • Cost: Gemini 2.5 Flash at $0.15/MTok input, $0.60/MTok output.
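Using the per-million-token rates above, the --dry-run cost estimate reduces to simple arithmetic; the per-thread token counts in the example are assumed, purely for illustration:

```python
INPUT_PER_MTOK = 0.15   # USD per million input tokens (from the notes above)
OUTPUT_PER_MTOK = 0.60  # USD per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost for a given token usage at Gemini 2.5 Flash rates."""
    return (input_tokens * INPUT_PER_MTOK
            + output_tokens * OUTPUT_PER_MTOK) / 1_000_000

# Hypothetical: 24,205 threads at ~4,000 input / ~400 output tokens each.
print(round(estimate_cost(24_205 * 4_000, 24_205 * 400), 2))  # 20.33
```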

Data Pipeline

1. Gmail fetch          --> raw JSON files + SQLite (category='email', edition='unknown')
2. WP comments import   --> SQLite only (category='wp_comment', edition='free')
3. WP.org forum scrape  --> SQLite only (category='wp_org_topic', edition='free')
4. Filtering pipeline   --> marks is_related=1 on relevant messages
   (labels + recipients + keywords + thread propagation)
5. Pass 1: classify     --> edition + topic tag per email (message_classifications table)
6. Pass 2: pass2        --> Q&A extraction per thread (thread_analyses table)

The filtering pipeline uses three whitelist criteria (related labels, related recipients, keyword matches) combined with thread propagation — if any message in a thread is relevant, all messages in that thread are included. WordPress comments and forum posts are automatically marked as related.
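The thread-propagation step can be sketched as a pure function over (message_id, thread_id) pairs; the function name is illustrative:

```python
def propagate_threads(messages: list[dict], matched_ids: set) -> set:
    """If any message in a thread matched a filter, include the whole thread."""
    matched_threads = {m["thread_id"] for m in messages
                       if m["message_id"] in matched_ids}
    return {m["message_id"] for m in messages
            if m["thread_id"] in matched_threads}

msgs = [
    {"message_id": "a", "thread_id": "t1"},
    {"message_id": "b", "thread_id": "t1"},
    {"message_id": "c", "thread_id": "t2"},
]
print(sorted(propagate_threads(msgs, {"a"})))  # ['a', 'b']
```

Only "a" matched a filter directly, but "b" is pulled in because it shares thread t1.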

Classification Results

Metric Count
Total threads analyzed 24,205
Q&A pairs extracted 3,918
By Gemini 2.5 Flash 9,915
By Claude Opus (manual) 262
Rule-based skips 14,028
Email threads 23,303
WP comment threads 86
WP.org topics 816