A Python CLI tool that extracts, consolidates, and classifies support data for the WordPress File Upload plugin, pulling from three sources into a unified SQLite database:

- the owner's Gmail support mailbox (fetched via the Gmail API)
- a wp_comments MySQL dump from iptanus.com
- the plugin's wordpress.org support forum

All data is stored in a single `messages` table with full-text search (FTS5), allowing unified querying across all sources. A two-pass classification pipeline using Gemini 2.5 Flash on Vertex AI then tags each message (Pass 1: edition + topic) and extracts Q&A pairs from support threads (Pass 2), producing a knowledge base of 3,918 Q&A pairs from 24,205 analyzed threads.
```
git/wfu-ai/
├── pyproject.toml            # Package config, dependencies, CLI entry point
├── .env.example              # Environment variable template
├── .gitignore
├── SETUP_GOOGLE_CLOUD.md     # Google Cloud OAuth setup instructions
├── src/
│   └── wfu_gmail/
│       ├── __init__.py
│       ├── cli.py            # CLI with 18 subcommands
│       ├── config.py         # .env loading, Config dataclass
│       ├── auth.py           # Google OAuth2 flow + token management
│       ├── fetcher.py        # Gmail API batch fetching with retry
│       ├── parser.py         # MIME tree walker, base64 decoding
│       ├── models.py         # EmailMessage / EmailAttachment dataclasses
│       ├── storage_sqlite.py # SQLite schema, FTS5, migrations, CRUD
│       ├── storage_json.py   # Raw JSON file storage (data/raw/YYYY/MM/)
│       ├── rate_limiter.py   # Per-second API throttle
│       ├── wp_comments.py    # MySQL dump parser + DB importer
│       ├── wp_org_scraper.py # wordpress.org forum scraper
│       ├── classifier.py     # Pass 1: Gemini batch classification (edition + topic)
│       └── pass2_analyzer.py # Pass 2: Gemini Q&A extraction per thread
├── scripts/
│   ├── parse_wp_comments.py  # Standalone script for WP comment stats
│   ├── test_gemini_qa.py     # Gemini Q&A prompt testing (v1/v2/v3)
│   └── pass2_qa_report.py    # Q&A extraction report generator
├── data/                     # Created at runtime, gitignored
│   ├── raw/                  # JSON: data/raw/2024/01/msg_{id}.json
│   ├── wfu_emails.db         # SQLite database
│   └── wp_comments.sql       # MySQL dump (user-provided)
└── credentials/              # Gitignored
    ├── client_secret.json    # From Google Cloud Console
    └── token.json            # Generated by OAuth flow
```
Requires Python 3.12+.
```
cd git/wfu-ai
pip install -e .
```
Dependencies: google-api-python-client, google-auth-oauthlib, google-auth, python-dotenv, beautifulsoup4, requests, google-genai.
Run via `wfu-gmail <command>` or `python -m wfu_gmail.cli <command>`.
| Command | Description |
|---|---|
| `auth` | Run OAuth2 flow, verify connection, print email address and mailbox size |
| `fetch [-q QUERY]` | Fetch all emails (or filtered by a Gmail search query). Resumable — skips already-fetched messages |
| `stats` | Print total messages, threads, attachments, date range, top 10 senders |
| `labels` | Fetch Gmail labels from the API, save to the DB, show with local message counts |
| `backfill-threads` | Re-parse raw JSON files to backfill In-Reply-To and References headers |
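The resumable behaviour of `fetch` boils down to a set difference between the listed message IDs and the IDs already stored locally, followed by chunking into batch requests. A minimal sketch of that logic (helper names here are illustrative, not the module's actual API):

```python
def pending_message_ids(listed, fetched_ids):
    """Keep only messages not yet stored locally (makes fetch resumable)."""
    return [m for m in listed if m["id"] not in fetched_ids]

def chunk(ids, size=50):
    """Split IDs into batch-request groups (the tool uses batches of 50)."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

listed = [{"id": f"m{i}", "threadId": "t1"} for i in range(120)]
already = {"m0", "m1", "m2"}
batches = chunk(pending_message_ids(listed, already))
```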
These commands work together in a checklist-based workflow. Generate a text file, edit it to mark items with [*], then apply it.
| Command | Description |
|---|---|
| `filter-recipients` | Generate `recipients_checklist.txt` — mark unrelated recipient addresses |
| `apply-recipients` | Import checked addresses from the checklist as unrelated |
| `filter-labels` | Generate `labels_checklist.txt` — mark unrelated Gmail labels |
| `apply-labels` | Import checked labels from the checklist as unrelated |
| `select-recipients` | Generate `recipients_select.txt` — mark related recipient addresses (whitelist) |
| `apply-selected-recipients` | Import checked addresses as related |
| `select-labels` | Generate `labels_select.txt` — mark related Gmail labels (whitelist) |
| `apply-selected-labels` | Import checked labels as related, compute included message counts with thread propagation |
| `apply-keys` | Read keywords from `keys_select.txt`, search all messages, store keyword match counts |
| `count-related` | Combine all filters (labels + recipients + keywords), apply thread propagation, mark `is_related=1` in the DB |
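A sketch of the two mechanisms behind these commands: parsing `[*]` marks out of a checklist file, and propagating a match to every message in its thread (helper names are illustrative):

```python
def parse_checklist(text):
    """Return the items whose line is marked with [*]."""
    checked = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[*]"):
            checked.append(line[3:].strip())
    return checked

def propagate_threads(matched_ids, msg_to_thread):
    """If any message in a thread matches, include the whole thread."""
    hit_threads = {msg_to_thread[m] for m in matched_ids}
    return {m for m, t in msg_to_thread.items() if t in hit_threads}

checklist = "[ ] noreply@example.com\n[*] support@example.com\n"
threads = {"a": "t1", "b": "t1", "c": "t2"}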
| Command | Description |
|---|---|
| `import-comments` | Parse the `data/wp_comments.sql` MySQL dump and import approved comments |
| `import-wp-org` | Scrape all support topics from the wordpress.org plugin forum and import them |
| Command | Description |
|---|---|
| `classify [--sample N]` | Pass 1: Batch-classify emails with Gemini 2.5 Flash (edition + topic tag). Processes 200 emails per API call. |
| `pass2 [--milestone N]` | Pass 2: Deep Q&A extraction per thread with Gemini 2.5 Flash. Processes up to N threads (default 1000), then stops for review. Resumable. |
| `pass2 --dry-run` | Show pending thread counts and a cost estimate without making API calls |
All data sources are stored in one unified messages table:
| Column | Type | Description |
|---|---|---|
| `message_id` | TEXT PK | Unique ID (`msg_id` for email, `wp_comment_{id}`, `wporg_{id}`) |
| `thread_id` | TEXT | Groups related messages (`thread_id` for email, `wp_post_{id}`, `wporg_topic_{slug}`) |
| `subject` | TEXT | Email subject or forum topic title |
| `sender` | TEXT | From header or comment author |
| `to_recipients` | TEXT | To header or owner email |
| `date_utc` | TEXT | ISO timestamp |
| `body_plain` | TEXT | Plain-text content |
| `body_html` | TEXT | HTML content (emails only) |
| `label_ids` | TEXT | JSON array of Gmail labels, or `["wp_comment"]`/`["wp_org_topic"]` |
| `category` | TEXT | `email`, `wp_comment`, or `wp_org_topic` |
| `edition` | TEXT | `free`, `pro`, `general`, or `unknown` |
| `is_related` | INTEGER | 1 if the message is relevant to plugin support, 0 otherwise |
| `json_path` | TEXT | Path to the raw JSON file (emails) or source URL (wp.org) |
| `in_reply_to` | TEXT | Parent message ID for threading |
| `fetched_at` | TEXT | When the record was imported |
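Because all three sources share one `messages` table, cross-source queries are plain SQL. A minimal sketch (columns reduced for brevity) of what unified querying looks like:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE messages (
    message_id TEXT PRIMARY KEY, thread_id TEXT,
    subject TEXT, category TEXT, edition TEXT, is_related INTEGER)""")
rows = [
    ("msg_1", "thr_1", "Upload fails", "email", "pro", 1),
    ("wp_comment_9", "wp_post_4", "Shortcode question", "wp_comment", "free", 1),
    ("wporg_7", "wporg_topic_css", "CSS issue", "wp_org_topic", "free", 1),
]
conn.executemany("INSERT OR IGNORE INTO messages VALUES (?,?,?,?,?,?)", rows)

# One query spans every source via the category column.
per_source = dict(conn.execute(
    "SELECT category, COUNT(*) FROM messages WHERE is_related=1 GROUP BY category"))
```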
Additional tables: attachments (email attachment metadata), labels (Gmail label mappings), related_recipients, related_labels, related_keywords, unrelated_recipients, unrelated_labels, sync_metadata, message_classifications (Pass 1 results), thread_analyses (Pass 2 results).
| Column | Type | Description |
|---|---|---|
| `thread_id` | TEXT PK | References `messages.thread_id` |
| `thread_type` | TEXT | `support`, `billing`, `feature_request`, `other`, `skip` |
| `topic` | TEXT | `config`, `error`, `upload`, `hooks`, `styling`, `compatibility`, `cloud`, `feature`, `security`, `performance`, `other`, `skip` |
| `sub_topic` | TEXT | Specific sub-topic description |
| `edition` | TEXT | `free`, `pro`, `both`, `unknown` |
| `summary` | TEXT | Thread summary |
| `resolution` | TEXT | How the issue was resolved (if applicable) |
| `resolved` | INTEGER | 1 if resolved, 0 if not |
| `qa_question` | TEXT | Extracted Q&A question (null if no Q&A) |
| `qa_answer` | TEXT | Extracted Q&A answer (null if no Q&A) |
| `environment` | TEXT | WordPress version, PHP version, etc. |
| `keywords` | TEXT | Comma-separated keywords |
| `model_used` | TEXT | `gemini-2.5-flash`, `claude-opus-4-6`, or `rule-based` |
| `classified_at` | TEXT | ISO timestamp |
Full-text search is provided via a `messages_fts` FTS5 virtual table indexing subject, body_plain, and sender, kept in sync via triggers.
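The FTS setup can be sketched like this (schema reduced to the indexed columns; the real schema also has triggers for UPDATE and DELETE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (message_id TEXT PRIMARY KEY, subject TEXT,
                       body_plain TEXT, sender TEXT);
-- External-content FTS5 table indexing the three searchable columns.
CREATE VIRTUAL TABLE messages_fts USING fts5(
    subject, body_plain, sender, content='messages', content_rowid='rowid');
-- Trigger keeps the index in sync on insert.
CREATE TRIGGER messages_ai AFTER INSERT ON messages BEGIN
    INSERT INTO messages_fts(rowid, subject, body_plain, sender)
    VALUES (new.rowid, new.subject, new.body_plain, new.sender);
END;
""")
conn.execute(
    "INSERT INTO messages VALUES ('m1', 'Upload error', 'PHP max size', 'a@b.c')")
hits = conn.execute(
    "SELECT m.message_id FROM messages_fts f "
    "JOIN messages m ON m.rowid = f.rowid "
    "WHERE messages_fts MATCH 'upload'").fetchall()
```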
### cli.py

Entry point for all operations. Uses argparse with 18 subcommands. Each command is implemented as a `_cmd_*` function that loads config, initializes the database, and orchestrates the relevant modules.
Key functions:
- `main()` — Parses arguments, sets up logging, dispatches to the appropriate command handler.
- `_cmd_fetch(config, query)` — Lists all Gmail message IDs, filters out already-fetched ones, then batch-fetches the rest. Each message is saved as raw JSON and inserted into SQLite. Tracks sync metadata for resumability.
- `_cmd_count_related(config)` — The main filtering pipeline. Loads all whitelist filters (labels, recipients, keywords), scans every message, applies thread propagation (if any message in a thread matches, the whole thread is included), then marks `is_related=1` in the database.
- `_cmd_import_comments(config)` — Reads `data/wp_comments.sql`, parses it with `wp_comments.load_wp_comments()`, inserts into the `messages` table.
- `_cmd_import_wp_org(config)` — Scrapes the wordpress.org forum topic listing, then scrapes each individual topic page, importing all posts.
- `_cmd_classify(config, sample)` — Pass 1 classification. Batches 200 emails per Gemini API call to tag edition and topic.
- `_cmd_pass2(config, milestone, dry_run)` — Pass 2 Q&A extraction. Processes one thread per API call, stops at milestone for cost review.

### config.py

Loads settings from a `.env` file and provides them as a `Config` dataclass.
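A stdlib-only sketch of this load-and-resolve pattern, leaving out the dotenv step (field names are illustrative, not the module's exact ones):

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Config:
    project_root: Path
    data_dir: Path = Path("data")
    db_path: Path = Path("data/wfu_emails.db")
    log_level: str = "INFO"

    def __post_init__(self):
        # Resolve relative paths against the project root, then create dirs.
        self.data_dir = self.project_root / self.data_dir
        self.db_path = self.project_root / self.db_path
        self.data_dir.mkdir(parents=True, exist_ok=True)

root = Path(tempfile.mkdtemp())
cfg = Config(project_root=root)
```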
- `Config` — Dataclass with fields for OAuth paths, data directory, SQLite path, Gmail API parameters (batch size, max results, query), and log level. `__post_init__` resolves all paths relative to `project_root` and creates required directories.
- `load_config(project_root)` — Reads `.env` via `dotenv_values()`, constructs and returns a `Config` instance with environment overrides.

### auth.py

Handles the Google OAuth2 desktop flow for Gmail API access using the `gmail.readonly` scope.
- `get_gmail_service(config)` — Returns an authenticated Gmail API `Resource` object. Loads saved credentials or triggers the OAuth flow.
- `_load_or_create_credentials(config)` — Tries to load `credentials/token.json`. If the token exists but is expired, refreshes it. If refresh fails or no token exists, runs the interactive OAuth flow.
- `_run_oauth_flow(config)` — Opens a browser window for Google OAuth consent. Requires `credentials/client_secret.json` from Google Cloud Console. Saves the resulting token for reuse.
- `_save_token(creds, token_path)` — Writes credentials to a JSON file.

### fetcher.py

Fetches full message data from the Gmail API using batch HTTP requests for efficiency.
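The retry behaviour can be sketched independently of the Gmail client: batching plus exponential backoff on rate-limit errors, with the fetch and sleep functions injected for testability (an illustration, not the module's code):

```python
def fetch_batch_with_retry(fetch, ids, max_retries=3, sleep=lambda s: None):
    """Call fetch(ids); on a rate-limit error retry after 2, 4, 8... seconds."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(ids)
        except RuntimeError:           # stand-in for an HTTP 429 from the API
            if attempt == max_retries:
                raise
            sleep(2 ** (attempt + 1))  # 2, 4, 8 second backoff

calls = {"n": 0}
def flaky_fetch(ids):
    calls["n"] += 1
    if calls["n"] < 3:                 # fail twice, then succeed
        raise RuntimeError("429")
    return {i: {"id": i} for i in ids}

delays = []
result = fetch_batch_with_retry(flaky_fetch, ["m1", "m2"], sleep=delays.append)
```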
- `list_all_message_ids(service, query, max_results_per_page)` — Paginates through `messages.list` to collect all message IDs matching an optional Gmail search query. Returns a list of dicts with `id` and `threadId`.
- `fetch_messages(service, message_ids, on_message, batch_size, max_retries)` — Processes message IDs in batches of 50. For each batch, builds a `BatchHttpRequest` with `messages.get(format='full')` calls. Parses each response into an `EmailMessage` and passes it to the `on_message` callback. Logs progress as a percentage. Returns a stats dict with `fetched`, `errors`, and `total` counts.
- `_fetch_batch_with_retry(service, message_ids, max_retries)` — Executes a single batch request. On HTTP 429 (rate limit) errors, retries with exponential backoff (2, 4, 8 seconds). Returns a dict mapping message IDs to either their response data or an `Exception`.

### parser.py

Converts Gmail API `format='full'` responses into `EmailMessage` objects by walking the MIME tree.
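Gmail delivers bodies as base64url, and the declared charset may be wrong, so decoding falls back through UTF-8 and Latin-1. A sketch of that step (illustrative, not the module's exact code):

```python
import base64

def decode_body(data, charset=None):
    """Decode base64url body data, trying declared charset, UTF-8, then Latin-1."""
    raw = base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))  # restore padding
    for enc in filter(None, [charset, "utf-8", "latin-1"]):
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode("utf-8", errors="replace")

encoded = base64.urlsafe_b64encode("héllo".encode("utf-8")).decode().rstrip("=")
```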
- `parse_message(raw)` — Top-level parser. Extracts headers, parses the date, then walks the MIME tree to collect text bodies and attachment metadata. Returns an `EmailMessage`.
- `_extract_headers(headers)` — Converts Gmail's `[{"name": ..., "value": ...}]` header list to a dict (first occurrence wins for duplicate headers).
- `_parse_date(date_str)` — Parses RFC 2822 email date headers into UTC datetime objects. Falls back to the current time on parse failure.
- `_walk_parts(part, bodies, attachments)` — Recursively traverses the MIME part tree. Parts with filenames become `EmailAttachment` entries. `multipart/*` parts are recursed into. `text/plain` and `text/html` leaf parts are base64-decoded with charset detection.
- `_decode_body(data, charset)` — Decodes base64url-encoded body data. Tries the declared charset first, then falls back through UTF-8 and Latin-1.
- `_get_charset(mime_type_header)` — Extracts the `charset=` parameter from a Content-Type header string.

### models.py

- `EmailMessage` — Dataclass representing a parsed email with 20 fields: message_id, thread_id, history_id, label_ids, snippet, subject, sender, to, cc, bcc, date_header, date_utc, message_id_header, in_reply_to, references, body_plain, body_html, raw_size_estimate, attachments, and raw_payload.
- `EmailAttachment` — Dataclass with filename, mime_type, size, and attachment_id. Only metadata is stored — attachment content is not downloaded.

### storage_sqlite.py

Manages the SQLite database including schema creation, migrations, message insertion, and filtering queries.
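Two of the storage patterns can be sketched like this: idempotent inserts via `INSERT OR IGNORE`, and chunked updates that stay under SQLite's bound-variable limit (reduced schema, illustrative code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (message_id TEXT PRIMARY KEY, is_related INTEGER DEFAULT 0)")

def insert_message(conn, message_id):
    """INSERT OR IGNORE makes re-runs idempotent; rowcount says if it landed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO messages (message_id) VALUES (?)", (message_id,))
    return cur.rowcount == 1

def mark_related(conn, ids, chunk=500):
    """Reset all flags, then set them in chunks to avoid the variable limit."""
    conn.execute("UPDATE messages SET is_related = 0")
    for i in range(0, len(ids), chunk):
        part = ids[i:i + chunk]
        qs = ",".join("?" * len(part))
        conn.execute(
            f"UPDATE messages SET is_related = 1 WHERE message_id IN ({qs})", part)

all_ids = [f"m{i}" for i in range(1200)]
for mid in all_ids:
    insert_message(conn, mid)
dup = insert_message(conn, "m0")        # duplicate insert is a no-op
mark_related(conn, all_ids[:700])
related = conn.execute(
    "SELECT COUNT(*) FROM messages WHERE is_related = 1").fetchone()[0]
```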
- `init_db(db_path)` — Creates the database file and executes the schema SQL (WAL mode, tables, indexes, FTS5, triggers). Calls `_migrate()` to add any new columns.
- `_migrate(conn)` — Checks existing columns via `PRAGMA table_info` and adds any missing ones (`message_id_hdr`, `in_reply_to`, `references_hdr`, `is_related`, `category`, `edition`). This allows seamless schema evolution without dropping data.
- `insert_message(conn, msg, json_path)` — Inserts an `EmailMessage` into the `messages` table using `INSERT OR IGNORE` for idempotency. Also inserts any attachment metadata. Returns `True` if the insert happened, `False` if it was a duplicate.
- `get_fetched_message_ids(conn)` — Returns the set of all message IDs in the database. Used by `fetch` to skip already-fetched messages.
- `mark_related_messages(conn, related_ids)` — Resets all messages to `is_related=0`, then batch-updates the given IDs to `is_related=1` in chunks of 500 to avoid SQLite's variable limit.
- `get_stats(conn)` — Returns a summary dict: total messages, threads, attachments, date range, top 10 senders.
- `save_unrelated_recipients`, `save_unrelated_labels`, `save_related_recipients`, `save_related_labels`, `save_related_keywords` and their corresponding getters. Each uses a clear-and-reinsert pattern for idempotent updates.
- `update_sync_metadata(conn, key, value)` — Upserts key-value pairs for tracking fetch state (last fetch start/end, total listed IDs).

### storage_json.py

Saves the raw Gmail API response as JSON files organized by date for lossless archival.
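A sketch of that date-partitioned layout with the skip-if-exists resumability check (illustrative names):

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def save_message_json(msg_id, payload, date_utc, data_dir):
    """Write the raw payload to data/raw/YYYY/MM/msg_{id}.json, skipping existing files."""
    folder = Path(data_dir) / "raw" / f"{date_utc.year:04d}" / f"{date_utc.month:02d}"
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"msg_{msg_id}.json"
    if not path.exists():               # resumable: never rewrite an archived file
        path.write_text(json.dumps(payload))
    return path.relative_to(data_dir)

data_dir = Path(tempfile.mkdtemp())
when = datetime(2024, 1, 15, tzinfo=timezone.utc)
rel = save_message_json("abc", {"snippet": "hi"}, when, data_dir)
```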
- `save_message_json(msg, data_dir)` — Writes `msg.raw_payload` to `data/raw/YYYY/MM/msg_{id}.json`. Skips if the file already exists (resumability). Returns the relative path for storage in the database's `json_path` column.

### rate_limiter.py

- `RateLimiter(calls_per_second)` — Simple time-based throttle. Gmail allows 250 quota units/second; each `messages.get` costs 5 units, so batches of 50 (= 250 units) are limited to ~1 per second.
- `wait()` — Blocks until enough time has passed since the last call.

### wp_comments.py

Parses a MySQL `wp_comments` table dump exported from phpMyAdmin and imports approved comments into the `messages` table.
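The core difficulty this module handles is that comment bodies can contain commas, quotes, and semicolons, so a naive split won't work. A reduced sketch of the character-by-character approach (the real parser also handles `NULL`, numeric fields, and enforces the 15-field row shape):

```python
def parse_values_row(text):
    """Parse one MySQL (...) tuple, honouring backslash escapes inside quotes."""
    fields, buf, in_str, i = [], [], False, 1     # index 1 skips the opening '('
    while i < len(text):
        ch = text[i]
        if in_str:
            if ch == "\\" and i + 1 < len(text):  # MySQL backslash escapes
                nxt = text[i + 1]
                buf.append({"n": "\n", "r": "\r", "t": "\t"}.get(nxt, nxt))
                i += 2
                continue
            if ch == "'":
                in_str = False
            else:
                buf.append(ch)
        elif ch == "'":
            in_str = True
        elif ch in ",)":
            fields.append("".join(buf))
            buf = []
            if ch == ")":
                break
        elif not ch.isspace():
            buf.append(ch)
        i += 1
    return tuple(fields)

row = parse_values_row(r"(42, 'It\'s a test, with ; inside', '1')")
```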
- `parse_mysql_values(sql_text)` — Character-by-character parser for MySQL `INSERT ... VALUES (...)` syntax. Handles MySQL-specific backslash escaping (`\'`, `\\`, `\n`, `\r`, `\t`). Returns a list of tuples, one per row. Only accepts rows with exactly 15 fields (matching the `wp_comments` schema).
- `extract_values_blocks(data)` — Extracts the `VALUES ...;` portion from each `INSERT` statement. Tracks whether the cursor is inside a quoted string to avoid matching semicolons within comment content as statement terminators.
- `load_wp_comments(sql_path)` — Reads the SQL file, extracts values blocks, parses all rows, then filters to only `comment_type='comment'` and `comment_approved='1'` (excluding spam, trash, order notes, reviews, pingbacks). Decodes HTML entities in content and author names. Returns a list of comment dicts.
- `import_comments_to_db(conn, comments, owner_email)` — Maps each comment to the unified message schema:
  - `message_id` = `wp_comment_{comment_id}`
  - `thread_id` = `wp_post_{post_id}`
  - `sender` = `"Author <email>"` format
  - `in_reply_to` = `wp_comment_{parent_id}` (for threaded replies)
  - `category` = `wp_comment`, `edition` = `free`, `is_related` = 1
  - `INSERT OR IGNORE` for idempotent re-runs

### wp_org_scraper.py

Scrapes the plugin's support forum on wordpress.org using requests + BeautifulSoup to parse the bbPress HTML.
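To keep this sketch dependency-free it uses the stdlib `html.parser` instead of BeautifulSoup, but it shows the same idea: pull `bbp-topic-permalink` anchors out of the listing HTML (the real module uses the CSS selectors listed below):

```python
from html.parser import HTMLParser

class TopicLinkExtractor(HTMLParser):
    """Collect (url, title) for <a class="bbp-topic-permalink"> anchors."""
    def __init__(self):
        super().__init__()
        self.topics, self._href = [], None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and "bbp-topic-permalink" in a.get("class", ""):
            self._href = a.get("href")

    def handle_data(self, data):
        if self._href and data.strip():
            self.topics.append({"url": self._href, "title": data.strip()})
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

html_page = """
<ul class="bbp-topics"><li>
  <a class="bbp-topic-permalink"
     href="https://wordpress.org/support/topic/upload-fails/">Upload fails</a>
</li></ul>"""
parser = TopicLinkExtractor()
parser.feed(html_page)
```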
- `create_session()` — Creates a `requests.Session` with a custom User-Agent and Accept headers.
- `scrape_topic_urls(session)` — Iterates through all pages of the topic listing at https://wordpress.org/support/plugin/wp-file-upload/. Uses CSS selectors to find topic links: first tries `ul.bbp-topics li a.bbp-topic-permalink`, then falls back to `td.column-primary a`, then a broad link scan. Detects next-page links to paginate. Returns a list of dicts with `url` and `title`.
- `scrape_topic(url, session)` — Fetches a single topic page and all its reply pages. Extracts the topic title from `<h1>`, then calls `_parse_page_posts()` for each page. Follows pagination links within the topic. Returns a dict with `url`, `title`, and a `posts` list.
- `_parse_page_posts(soup)` — Extracts posts from a single page of a topic:
  - the lead topic from `ul.bbp-lead-topic` — author from `div.bbp-topic-author`, content from `div.bbp-topic-content`, date from `p.bbp-topic-post-date`;
  - replies from `div[id^="post-"]` elements — author from `div.bbp-reply-author`, content from `div.bbp-reply-content`, date from `p.bbp-reply-post-date`.
- `_parse_post_author(post_el)` — Finds `<a>` links with `/support/users/` in the href. Extracts the username from the URL slug and the display name from the link text. Returns `(display_name, username)`.
- `_parse_post_date(post_el)` — Tries three strategies: the `<time datetime="...">` attribute, then timestamp link title attributes with format "January 1, 2024 at 10:27 am", then a regex scan of the text content.
- `_parse_post_content(post_el)` — Looks for `div.bbp-reply-content` or `div.bbp-topic-content`, falls back to collecting all `<p>` tags.
- `import_topics_to_db(conn, topics, owner_email)` — Maps each post to the unified message schema:
  - `message_id` = `wporg_{post_id}`
  - `thread_id` = `wporg_topic_{slug}`
  - `sender` = `"Author (@username)"`
  - `in_reply_to` = previous post's `message_id` (preserves conversation order)
  - `category` = `wp_org_topic`, `edition` = `free`, `is_related` = 1
- `get_scraped_topic_urls(conn)` — Returns the set of topic URLs already imported (stored in `json_path`), enabling resumable scraping.

### classifier.py

Classifies emails in batches of 200 using Gemini 2.5 Flash on Vertex AI. Tags each email with `edition` (free/pro/both/unknown) and `topic_tag`.
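The batch mechanics can be sketched without the Vertex AI client: build a numbered prompt of truncated snippets, then parse the model's JSON array back. The prompt wording and field names here are illustrative; the real prompt lives in the module:

```python
import json

def build_batch_prompt(emails, body_chars=200):
    """Number each email and truncate bodies so 200 fit in one request."""
    lines = ['Classify each email. Reply with a JSON array of '
             '{"idx": n, "edition": ..., "topic_tag": ...} objects.']
    for i, e in enumerate(emails):
        lines.append(f"[{i}] {e['subject']} | {e['body_plain'][:body_chars]}")
    return "\n".join(lines)

def parse_classifications(text):
    """Tolerate a ```json fenced reply, then parse the array."""
    text = text.strip()
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    return json.loads(text)

emails = [{"subject": "Upload limit", "body_plain": "z" * 500}]
prompt = build_batch_prompt(emails)
parsed = parse_classifications(
    '```json\n[{"idx": 0, "edition": "pro", "topic_tag": "config"}]\n```')
```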
- `run_classification(conn, sample)` — Main loop: queries unclassified emails, batches them into groups of 200, sends each batch to Gemini with a structured prompt, parses the JSON response, and stores results in `message_classifications`. Resumable via a `LEFT JOIN` to skip already-classified messages.
- `classify_batch(client, emails)` — Builds a prompt with 200 email snippets (subject + first 200 chars of body), calls Gemini 2.5 Flash (temp=0.1), returns a parsed JSON array of classifications.
- Uses the `google.genai` SDK with the `us-central1` location and the Vertex AI backend.

### pass2_analyzer.py

Processes one support thread at a time through Gemini 2.5 Flash to extract structured Q&A pairs, summaries, and metadata.
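The thread-flattening step feeding each Gemini call can be sketched like this: messages concatenated chronologically, with any single body capped at 3,000 characters. An illustrative helper, not the module's code:

```python
def build_thread_text(messages, per_msg_limit=3000):
    """Concatenate a thread chronologically, truncating oversized bodies."""
    parts = []
    for m in sorted(messages, key=lambda m: m["date_utc"]):
        body = m["body_plain"]
        if len(body) > per_msg_limit:
            body = body[:per_msg_limit] + " [truncated]"
        parts.append(f"From: {m['sender']} ({m['date_utc']})\n{body}")
    return "\n\n---\n\n".join(parts)

thread = [
    {"sender": "user@example.com", "date_utc": "2024-01-01T10:00:00",
     "body_plain": "q" * 5000},
    {"sender": "support@iptanus.com", "date_utc": "2024-01-02T09:00:00",
     "body_plain": "Try raising upload_max_filesize."},
]
text = build_thread_text(thread)
```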
- `run_pass2(conn, milestone, dry_run)` — Main pipeline. Queries pending threads (not yet in `thread_analyses`), builds 5 few-shot examples, processes each thread sequentially with cost tracking. Stops after `milestone` threads for user review. Returns a stats dict with counts, tokens, and cost.
- `get_pending_threads(conn, limit)` — Queries threads ordered by message count ascending (smallest first) that have no entry in `thread_analyses`.
- `build_fewshot_messages()` — Constructs 5 conversation-turn examples from known thread IDs for few-shot prompting.
- `analyze_thread(client, fewshot_messages, thread_text)` — Sends one thread to Gemini with the few-shot context, parses the JSON response into topic, summary, Q&A, and metadata fields.
- `get_thread_full(conn, thread_id)` — Concatenates all messages in a thread chronologically, truncating individual emails longer than 3,000 chars.

### Data flow

1. Gmail fetch --> raw JSON files + SQLite (category='email', edition='unknown')
2. WP comments import --> SQLite only (category='wp_comment', edition='free')
3. WP.org forum scrape --> SQLite only (category='wp_org_topic', edition='free')
4. Filtering pipeline --> marks is_related=1 on relevant messages
(labels + recipients + keywords + thread propagation)
5. Pass 1: classify --> edition + topic tag per email (message_classifications table)
6. Pass 2: pass2 --> Q&A extraction per thread (thread_analyses table)
The filtering pipeline uses three whitelist criteria (related labels, related recipients, keyword matches) combined with thread propagation — if any message in a thread is relevant, all messages in that thread are included. WordPress comments and forum posts are automatically marked as related.
| Metric | Count |
|---|---|
| Total threads analyzed | 24,205 |
| Q&A pairs extracted | 3,918 |
| By Gemini 2.5 Flash | 9,915 |
| By Claude Opus (manual) | 262 |
| Rule-based skips | 14,028 |
| Email threads | 23,303 |
| WP comment threads | 86 |
| WP.org topics | 816 |