WFU-Gmail

A Python CLI tool that extracts, consolidates, and classifies support data from three sources into a unified SQLite database for the WordPress File Upload plugin:

  1. Gmail — Fetches all emails via the Gmail API (OAuth2, batch requests, resumable)
  2. WordPress comments — Parses a wp_comments MySQL dump from iptanus.com
  3. WordPress.org forum — Scrapes the plugin's support forum at wordpress.org

All data is stored in a single messages table with full-text search (FTS5), allowing unified querying across all sources. A two-pass classification pipeline using Gemini 2.5 Flash on Vertex AI then tags each message (Pass 1: edition + topic) and extracts Q&A pairs from support threads (Pass 2), producing a knowledge base of 3,918 Q&A pairs from 24,205 analyzed threads.

Project Structure

git/wfu-ai/
├── pyproject.toml                  # Package config, dependencies, CLI entry point
├── .env.example                    # Environment variable template
├── .gitignore
├── SETUP_GOOGLE_CLOUD.md           # Google Cloud OAuth setup instructions
├── src/
│   └── wfu_gmail/
│       ├── __init__.py
│       ├── cli.py                  # CLI with 18 subcommands
│       ├── config.py               # .env loading, Config dataclass
│       ├── auth.py                 # Google OAuth2 flow + token management
│       ├── fetcher.py              # Gmail API batch fetching with retry
│       ├── parser.py               # MIME tree walker, base64 decoding
│       ├── models.py               # EmailMessage / EmailAttachment dataclasses
│       ├── storage_sqlite.py       # SQLite schema, FTS5, migrations, CRUD
│       ├── storage_json.py         # Raw JSON file storage (data/raw/YYYY/MM/)
│       ├── rate_limiter.py         # Per-second API throttle
│       ├── wp_comments.py          # MySQL dump parser + DB importer
│       ├── wp_org_scraper.py       # wordpress.org forum scraper
│       ├── classifier.py           # Pass 1: Gemini batch classification (edition + topic)
│       └── pass2_analyzer.py       # Pass 2: Gemini Q&A extraction per thread
├── scripts/
│   ├── parse_wp_comments.py        # Standalone script for WP comment stats
│   ├── test_gemini_qa.py           # Gemini Q&A prompt testing (v1/v2/v3)
│   └── pass2_qa_report.py          # Q&A extraction report generator
├── data/                           # Created at runtime, gitignored
│   ├── raw/                        # JSON: data/raw/2024/01/msg_{id}.json
│   ├── wfu_emails.db               # SQLite database
│   └── wp_comments.sql             # MySQL dump (user-provided)
└── credentials/                    # Gitignored
    ├── client_secret.json          # From Google Cloud Console
    └── token.json                  # Generated by OAuth flow

Installation

Requires Python 3.12+.

cd git/wfu-ai
pip install -e .

Dependencies: google-api-python-client, google-auth-oauthlib, google-auth, python-dotenv, beautifulsoup4, requests, google-genai.

CLI Commands

Run via wfu-gmail <command> or python -m wfu_gmail.cli <command>.

Core Gmail Commands

Command Description
auth Run OAuth2 flow, verify connection, print email address and mailbox size
fetch [-q QUERY] Fetch all emails (or filtered by Gmail search query). Resumable — skips already-fetched messages
stats Print total messages, threads, attachments, date range, top 10 senders
labels Fetch Gmail labels from API, save to DB, show with local message counts
backfill-threads Re-parse raw JSON files to backfill In-Reply-To and References headers

Email Filtering Commands

These commands work together in a checklist-based workflow. Generate a text file, edit it to mark items with [*], then apply it.

Command Description
filter-recipients Generate recipients_checklist.txt — mark unrelated recipient addresses
apply-recipients Import checked addresses from the checklist as unrelated
filter-labels Generate labels_checklist.txt — mark unrelated Gmail labels
apply-labels Import checked labels from the checklist as unrelated
select-recipients Generate recipients_select.txt — mark related recipient addresses (whitelist)
apply-selected-recipients Import checked addresses as related
select-labels Generate labels_select.txt — mark related Gmail labels (whitelist)
apply-selected-labels Import checked labels as related, compute included message counts with thread propagation
apply-keys Read keywords from keys_select.txt, search all messages, store keyword match counts
count-related Combine all filters (labels + recipients + keywords), apply thread propagation, mark is_related=1 in DB
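The checklist files are plain text with one item per line, and checked items are marked with [*]. A minimal sketch of how such a file might be parsed (the parse_checklist helper and the exact line format are illustrative, not the tool's actual code):

```python
def parse_checklist(text: str) -> list[str]:
    """Return the items whose lines are marked with [*] (assumed format)."""
    selected = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[*]"):
            selected.append(line[3:].strip())
    return selected

example = """\
[ ] newsletter@example.com
[*] support@iptanus.com
[*] info@iptanus.com
"""
print(parse_checklist(example))  # the two checked addresses
```

The corresponding apply-* command would then write the selected items into the matching filter table.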

Data Import Commands

Command Description
import-comments Parse data/wp_comments.sql MySQL dump and import approved comments
import-wp-org Scrape all support topics from wordpress.org plugin forum and import

Classification Commands

Command Description
classify [--sample N] Pass 1: Batch-classify emails with Gemini 2.5 Flash (edition + topic tag). Processes 200 emails per API call.
pass2 [--milestone N] Pass 2: Deep Q&A extraction per thread with Gemini 2.5 Flash. Processes up to N threads (default 1000) then stops for review. Resumable.
pass2 --dry-run Show pending thread counts and cost estimate without making API calls

Database Schema

All data sources are stored in one unified messages table:

Column Type Description
message_id TEXT PK Unique ID (msg_id for email, wp_comment_{id}, wporg_{id})
thread_id TEXT Groups related messages (thread_id for email, wp_post_{id}, wporg_topic_{slug})
subject TEXT Email subject or forum topic title
sender TEXT From header or comment author
to_recipients TEXT To header or owner email
date_utc TEXT ISO timestamp
body_plain TEXT Plain text content
body_html TEXT HTML content (emails only)
label_ids TEXT JSON array of Gmail labels or ["wp_comment"]/["wp_org_topic"]
category TEXT email, wp_comment, or wp_org_topic
edition TEXT free, pro, general, or unknown
is_related INTEGER 1 if message is relevant to plugin support, 0 otherwise
json_path TEXT Path to raw JSON file (emails) or source URL (wp.org)
in_reply_to TEXT Parent message ID for threading
fetched_at TEXT When the record was imported

Additional tables: attachments (email attachment metadata), labels (Gmail label mappings), related_recipients, related_labels, related_keywords, unrelated_recipients, unrelated_labels, sync_metadata, message_classifications (Pass 1 results), thread_analyses (Pass 2 results).

thread_analyses Table (Pass 2)

Column Type Description
thread_id TEXT PK References messages.thread_id
thread_type TEXT support, billing, feature_request, other, skip
topic TEXT config, error, upload, hooks, styling, compatibility, cloud, feature, security, performance, other, skip
sub_topic TEXT Specific sub-topic description
edition TEXT free, pro, both, unknown
summary TEXT Thread summary
resolution TEXT How the issue was resolved (if applicable)
resolved INTEGER 1 if resolved, 0 if not
qa_question TEXT Extracted Q&A question (null if no Q&A)
qa_answer TEXT Extracted Q&A answer (null if no Q&A)
environment TEXT WordPress version, PHP version, etc.
keywords TEXT Comma-separated keywords
model_used TEXT gemini-2.5-flash, claude-opus-4-6, or rule-based
classified_at TEXT ISO timestamp

Full-text search is provided via a messages_fts FTS5 virtual table indexing subject, body_plain, and sender, kept in sync via triggers.
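The external-content FTS5 pattern described above can be sketched as follows; the table and trigger definitions are a simplified illustration (insert/delete only), not the project's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (
    message_id TEXT PRIMARY KEY,
    subject TEXT, body_plain TEXT, sender TEXT
);
-- External-content FTS5 table: index lives in messages_fts, rows in messages.
CREATE VIRTUAL TABLE messages_fts USING fts5(
    subject, body_plain, sender,
    content='messages', content_rowid='rowid'
);
-- Triggers keep the index in sync with the content table.
CREATE TRIGGER messages_ai AFTER INSERT ON messages BEGIN
    INSERT INTO messages_fts(rowid, subject, body_plain, sender)
    VALUES (new.rowid, new.subject, new.body_plain, new.sender);
END;
CREATE TRIGGER messages_ad AFTER DELETE ON messages BEGIN
    INSERT INTO messages_fts(messages_fts, rowid, subject, body_plain, sender)
    VALUES ('delete', old.rowid, old.subject, old.body_plain, old.sender);
END;
""")
conn.execute("INSERT INTO messages VALUES "
             "('msg_1', 'Upload fails', '413 on large files', 'user@example.com')")
hits = conn.execute(
    "SELECT m.message_id FROM messages_fts "
    "JOIN messages m ON m.rowid = messages_fts.rowid "
    "WHERE messages_fts MATCH 'upload'"
).fetchall()
print(hits)  # matches 'Upload' in the subject, case-folded
```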

Module Reference

cli.py — Command-Line Interface

Entry point for all operations. Uses argparse with 18 subcommands. Each command is implemented as a _cmd_* function that loads config, initializes the database, and orchestrates the relevant modules.

Key functions:

  • main() — Parses arguments, sets up logging, dispatches to the appropriate command handler.
  • _cmd_fetch(config, query) — Lists all Gmail message IDs, filters out already-fetched ones, then batch-fetches the rest. Each message is saved as raw JSON and inserted into SQLite. Tracks sync metadata for resumability.
  • _cmd_count_related(config) — The main filtering pipeline. Loads all whitelist filters (labels, recipients, keywords), scans every message, applies thread propagation (if any message in a thread matches, the whole thread is included), then marks is_related=1 in the database.
  • _cmd_import_comments(config) — Reads data/wp_comments.sql, parses it with wp_comments.load_wp_comments(), inserts into the messages table.
  • _cmd_import_wp_org(config) — Scrapes the wordpress.org forum topic listing, then scrapes each individual topic page, importing all posts.
  • _cmd_classify(config, sample) — Pass 1 classification. Batches 200 emails per Gemini API call to tag edition and topic.
  • _cmd_pass2(config, milestone, dry_run) — Pass 2 Q&A extraction. Processes one thread per API call, stops at milestone for cost review.
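The dispatch pattern, argparse subparsers routing each subcommand to a _cmd_* handler, might look like this minimal sketch (the handler body and flags are illustrative):

```python
import argparse

def _cmd_stats(args):
    # Placeholder handler; the real command queries SQLite for totals.
    print("stats for db:", args.db)

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="wfu-gmail")
    parser.add_argument("--db", default="data/wfu_emails.db")
    sub = parser.add_subparsers(dest="command", required=True)

    p_fetch = sub.add_parser("fetch", help="Fetch emails via the Gmail API")
    p_fetch.add_argument("-q", "--query", default=None)

    p_stats = sub.add_parser("stats", help="Print database statistics")
    p_stats.set_defaults(func=_cmd_stats)
    return parser

args = build_parser().parse_args(["stats"])
args.func(args)
```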

config.py — Configuration

Loads settings from a .env file and provides them as a Config dataclass.

  • Config — Dataclass with fields for OAuth paths, data directory, SQLite path, Gmail API parameters (batch size, max results, query), and log level. __post_init__ resolves all paths relative to project_root and creates required directories.
  • load_config(project_root) — Reads .env via dotenv_values(), constructs and returns a Config instance with environment overrides.
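The path-resolving __post_init__ behavior can be sketched like so; field names here are illustrative (the real Config has more settings):

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Config:
    project_root: Path
    data_dir: Path = Path("data")
    sqlite_path: Path = Path("data/wfu_emails.db")
    batch_size: int = 50

    def __post_init__(self):
        # Resolve relative paths against project_root and create directories,
        # as the section above describes.
        self.data_dir = self.project_root / self.data_dir
        self.sqlite_path = self.project_root / self.sqlite_path
        self.data_dir.mkdir(parents=True, exist_ok=True)

cfg = Config(project_root=Path(tempfile.mkdtemp()))
print(cfg.data_dir.exists())  # directory was created by __post_init__
```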

auth.py — Gmail OAuth2 Authentication

Handles the Google OAuth2 desktop flow for Gmail API access using gmail.readonly scope.

  • get_gmail_service(config) — Returns an authenticated Gmail API Resource object. Loads saved credentials or triggers the OAuth flow.
  • _load_or_create_credentials(config) — Tries to load credentials/token.json. If the token exists but is expired, refreshes it. If refresh fails or no token exists, runs the interactive OAuth flow.
  • _run_oauth_flow(config) — Opens a browser window for Google OAuth consent. Requires credentials/client_secret.json from Google Cloud Console. Saves the resulting token for reuse.
  • _save_token(creds, token_path) — Writes credentials to a JSON file.

fetcher.py — Gmail API Batch Fetching

Fetches full message data from the Gmail API using batch HTTP requests for efficiency.

  • list_all_message_ids(service, query, max_results_per_page) — Paginates through messages.list to collect all message IDs matching an optional Gmail search query. Returns a list of dicts with id and threadId.
  • fetch_messages(service, message_ids, on_message, batch_size, max_retries) — Processes message IDs in batches of 50. For each batch, builds a BatchHttpRequest with messages.get(format='full') calls. Parses each response into an EmailMessage and passes it to the on_message callback. Logs progress as percentage. Returns a stats dict with fetched, errors, and total counts.
  • _fetch_batch_with_retry(service, message_ids, max_retries) — Executes a single batch request. On HTTP 429 (rate limit) errors, retries with exponential backoff (2, 4, 8 seconds). Returns a dict mapping message IDs to either their response data or an Exception.
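The retry-with-backoff logic can be sketched independently of the Gmail client; RateLimitError and the injected sleep/do_batch callables are illustrative stand-ins, not the module's real signatures:

```python
import time

class RateLimitError(Exception):
    """Stands in for an HTTP 429 response from the Gmail API."""

def fetch_with_retry(do_batch, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Run do_batch(), retrying on rate limits with 2s/4s/8s backoff."""
    for attempt in range(max_retries + 1):
        try:
            return do_batch()
        except RateLimitError:
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))  # exponential backoff

# Simulate a batch that fails twice with 429 before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return {"fetched": 50}

delays = []
result = fetch_with_retry(flaky, sleep=delays.append)
print(result, delays)  # {'fetched': 50} [2.0, 4.0]
```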

parser.py — MIME Message Parser

Converts Gmail API format='full' responses into EmailMessage objects by walking the MIME tree.

  • parse_message(raw) — Top-level parser. Extracts headers, parses the date, then walks the MIME tree to collect text bodies and attachment metadata. Returns an EmailMessage.
  • _extract_headers(headers) — Converts Gmail's [{"name": ..., "value": ...}] header list to a dict (first occurrence wins for duplicate headers).
  • _parse_date(date_str) — Parses RFC 2822 email date headers into UTC datetime objects. Falls back to current time on parse failure.
  • _walk_parts(part, bodies, attachments) — Recursively traverses the MIME part tree. Parts with filenames become EmailAttachment entries. multipart/* parts are recursed into. text/plain and text/html leaf parts are base64-decoded with charset detection.
  • _decode_body(data, charset) — Decodes base64url-encoded body data. Tries the declared charset first, then falls back through UTF-8 and Latin-1.
  • _get_charset(mime_type_header) — Extracts the charset= parameter from a Content-Type header string.

models.py — Data Models

  • EmailMessage — Dataclass representing a parsed email with 20 fields: message_id, thread_id, history_id, label_ids, snippet, subject, sender, to, cc, bcc, date_header, date_utc, message_id_header, in_reply_to, references, body_plain, body_html, raw_size_estimate, attachments, and raw_payload.
  • EmailAttachment — Dataclass with filename, mime_type, size, and attachment_id. Only metadata is stored — attachment content is not downloaded.

storage_sqlite.py — SQLite Database

Manages the SQLite database including schema creation, migrations, message insertion, and filtering queries.

  • init_db(db_path) — Creates the database file and executes the schema SQL (WAL mode, tables, indexes, FTS5, triggers). Calls _migrate() to add any new columns.
  • _migrate(conn) — Checks existing columns via PRAGMA table_info and adds any missing ones (message_id_hdr, in_reply_to, references_hdr, is_related, category, edition). This allows seamless schema evolution without dropping data.
  • insert_message(conn, msg, json_path) — Inserts an EmailMessage into the messages table using INSERT OR IGNORE for idempotency. Also inserts any attachment metadata. Returns True if the insert happened, False if it was a duplicate.
  • get_fetched_message_ids(conn) — Returns the set of all message IDs in the database. Used by fetch to skip already-fetched messages.
  • mark_related_messages(conn, related_ids) — Resets all messages to is_related=0, then batch-updates the given IDs to is_related=1 in chunks of 500 to avoid SQLite's variable limit.
  • get_stats(conn) — Returns a summary dict: total messages, threads, attachments, date range, top 10 senders.
  • Filter management functions: save_unrelated_recipients, save_unrelated_labels, save_related_recipients, save_related_labels, save_related_keywords, and their corresponding getters. Each uses a clear-and-reinsert pattern for idempotent updates.
  • update_sync_metadata(conn, key, value) — Upserts key-value pairs for tracking fetch state (last fetch start/end, total listed IDs).
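The chunked-update pattern behind mark_related_messages can be sketched as follows (a simplified illustration of the technique, not the module's exact code):

```python
import sqlite3

CHUNK = 500  # stay well under SQLite's default host-parameter limit

def mark_related(conn, related_ids):
    """Reset is_related everywhere, then set it for the given IDs in chunks."""
    conn.execute("UPDATE messages SET is_related = 0")
    ids = list(related_ids)
    for i in range(0, len(ids), CHUNK):
        chunk = ids[i:i + CHUNK]
        placeholders = ",".join("?" * len(chunk))
        conn.execute(
            f"UPDATE messages SET is_related = 1 "
            f"WHERE message_id IN ({placeholders})",
            chunk,
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages "
             "(message_id TEXT PRIMARY KEY, is_related INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO messages (message_id) VALUES (?)",
                 [(f"msg_{i}",) for i in range(1200)])
mark_related(conn, [f"msg_{i}" for i in range(0, 1200, 2)])
count = conn.execute("SELECT COUNT(*) FROM messages WHERE is_related = 1").fetchone()[0]
print(count)  # 600
```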

storage_json.py — Raw JSON Storage

Saves the raw Gmail API response as JSON files organized by date for lossless archival.

  • save_message_json(msg, data_dir) — Writes msg.raw_payload to data/raw/YYYY/MM/msg_{id}.json. Skips if the file already exists (resumability). Returns the relative path for storage in the database's json_path column.
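The date-partitioned layout and skip-if-exists behavior might look like this sketch (the argument list is illustrative; the real function takes the EmailMessage itself):

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def save_message_json(msg_id, payload, date_utc, data_dir: Path) -> Path:
    """Write the raw payload to data_dir/raw/YYYY/MM/msg_{id}.json,
    skipping files that already exist (resumability)."""
    path = data_dir / "raw" / f"{date_utc:%Y}" / f"{date_utc:%m}" / f"msg_{msg_id}.json"
    if not path.exists():  # never rewrite an archived message
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(payload))
    # Relative path goes into the database's json_path column.
    return path.relative_to(data_dir)

root = Path(tempfile.mkdtemp())
rel = save_message_json("abc123", {"id": "abc123"},
                        datetime(2024, 1, 5, tzinfo=timezone.utc), root)
print(rel)  # raw/2024/01/msg_abc123.json
```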

rate_limiter.py — API Rate Limiter

  • RateLimiter(calls_per_second) — Simple time-based throttle. Gmail allows 250 quota units/second; each messages.get costs 5 units, so batches of 50 (= 250 units) are limited to ~1 per second.
  • wait() — Blocks until enough time has passed since the last call.
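A time-based throttle like this can be sketched in a few lines (a minimal stand-in for the module, not its exact code):

```python
import time

class RateLimiter:
    """Block so that calls happen at most calls_per_second times per second."""
    def __init__(self, calls_per_second: float):
        self.min_interval = 1.0 / calls_per_second
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

# Per the quota note above: one 50-message batch (250 units) per second.
limiter = RateLimiter(calls_per_second=1)
```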

wp_comments.py — WordPress Comment Importer

Parses a MySQL wp_comments table dump exported from phpMyAdmin and imports approved comments into the messages table.

  • parse_mysql_values(sql_text) — Character-by-character parser for MySQL INSERT ... VALUES (...) syntax. Handles MySQL-specific backslash escaping (\', \\, \n, \r, \t). Returns a list of tuples, one per row. Only accepts rows with exactly 15 fields (matching the wp_comments schema).
  • extract_values_blocks(data) — Extracts the VALUES ...; portion from each INSERT statement. Tracks whether the cursor is inside a quoted string to avoid matching semicolons within comment content as statement terminators.
  • load_wp_comments(sql_path) — Reads the SQL file, extracts values blocks, parses all rows, then filters to only comment_type='comment' and comment_approved='1' (excluding spam, trash, order notes, reviews, pingbacks). Decodes HTML entities in content and author names. Returns a list of comment dicts.
  • import_comments_to_db(conn, comments, owner_email) — Maps each comment to the unified message schema:
      • message_id = wp_comment_{comment_id}
      • thread_id = wp_post_{post_id}
      • sender = "Author <email>" format
      • in_reply_to = wp_comment_{parent_id} (for threaded replies)
      • category = wp_comment, edition = free, is_related = 1
      • Uses INSERT OR IGNORE for idempotent re-runs
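The character-by-character approach to MySQL escape handling can be sketched as follows. This simplified parser handles a single (...) tuple body only; the real parse_mysql_values additionally extracts full statements, handles doubled quotes, and enforces the 15-field row shape:

```python
def parse_values_row(row: str) -> list:
    """Parse one MySQL values tuple body, honouring backslash escapes."""
    fields, buf, in_str, i = [], [], False, 0
    escapes = {"n": "\n", "r": "\r", "t": "\t",
               "'": "'", '"': '"', "\\": "\\", "0": "\0"}
    while i < len(row):
        ch = row[i]
        if in_str:
            if ch == "\\" and i + 1 < len(row):
                buf.append(escapes.get(row[i + 1], row[i + 1]))
                i += 2
                continue
            if ch == "'":
                in_str = False  # closing quote
            else:
                buf.append(ch)
        elif ch == "'":
            in_str = True       # opening quote
        elif ch == ",":
            fields.append("".join(buf))
            buf = []
        elif not ch.isspace():
            buf.append(ch)      # bare token (numbers, NULL)
        i += 1
    fields.append("".join(buf))
    return fields

row = "123, 45, 'It\\'s a test', 'line1\\nline2', NULL"
print(parse_values_row(row))
```

Note how a semicolon or comma inside a quoted string never splits a field, which is exactly why extract_values_blocks must track quoting state.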

wp_org_scraper.py — WordPress.org Forum Scraper

Scrapes the plugin's support forum on wordpress.org using requests + BeautifulSoup to parse the bbPress HTML.

  • create_session() — Creates a requests.Session with a custom User-Agent and Accept headers.
  • scrape_topic_urls(session) — Iterates through all pages of the topic listing at https://wordpress.org/support/plugin/wp-file-upload/. Uses CSS selectors to find topic links: first tries ul.bbp-topics li a.bbp-topic-permalink, then falls back to td.column-primary a, then a broad link scan. Detects next-page links to paginate. Returns a list of dicts with url and title.
  • scrape_topic(url, session) — Fetches a single topic page and all its reply pages. Extracts the topic title from <h1>, then calls _parse_page_posts() for each page. Follows pagination links within the topic. Returns a dict with url, title, and posts list.
  • _parse_page_posts(soup) — Extracts posts from a single page of a topic:
      • Lead topic: found in ul.bbp-lead-topic. Extracts author from div.bbp-topic-author, content from div.bbp-topic-content, date from p.bbp-topic-post-date.
      • Replies: found in div[id^="post-"] elements. Extracts author from div.bbp-reply-author, content from div.bbp-reply-content, date from p.bbp-reply-post-date.
  • _parse_post_author(post_el) — Finds <a> links with /support/users/ in the href. Extracts the username from the URL slug and the display name from the link text. Returns (display_name, username).
  • _parse_post_date(post_el) — Tries three strategies: <time datetime="..."> attribute, then timestamp link title attributes with format "January 1, 2024 at 10:27 am", then a regex scan of the text content.
  • _parse_post_content(post_el) — Looks for div.bbp-reply-content or div.bbp-topic-content, falls back to collecting all <p> tags.
  • import_topics_to_db(conn, topics, owner_email) — Maps each post to the unified message schema:
      • message_id = wporg_{post_id}
      • thread_id = wporg_topic_{slug}
      • sender = "Author (@username)"
      • in_reply_to = previous post's message_id (preserves conversation order)
      • category = wp_org_topic, edition = free, is_related = 1
  • get_scraped_topic_urls(conn) — Returns the set of topic URLs already imported (stored in json_path), enabling resumable scraping.
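The selector-cascade idea (try the precise bbPress selector first, then fall back) can be sketched with BeautifulSoup against a static snippet; the helper name and HTML are illustrative:

```python
from bs4 import BeautifulSoup

HTML = """
<ul class="bbp-topics">
  <li><a class="bbp-topic-permalink"
         href="https://wordpress.org/support/topic/upload-fails/">Upload fails</a></li>
  <li><a class="bbp-topic-permalink"
         href="https://wordpress.org/support/topic/pro-question/">Pro question</a></li>
</ul>
"""

def extract_topics(html: str) -> list[dict]:
    """Try each CSS selector in order; return results from the first that matches."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in ("ul.bbp-topics li a.bbp-topic-permalink",
                     "td.column-primary a"):
        links = soup.select(selector)
        if links:
            return [{"url": a["href"], "title": a.get_text(strip=True)}
                    for a in links]
    return []

topics = extract_topics(HTML)
print(topics)
```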

classifier.py — Pass 1 Batch Classification

Classifies emails in batches of 200 using Gemini 2.5 Flash on Vertex AI. Tags each email with edition (free/pro/both/unknown) and topic_tag.

  • run_classification(conn, sample) — Main loop: queries unclassified emails, batches them into groups of 200, sends each batch to Gemini with a structured prompt, parses the JSON response, and stores results in message_classifications. Resumable via LEFT JOIN to skip already-classified messages.
  • classify_batch(client, emails) — Builds a prompt with 200 email snippets (subject + first 200 chars of body), calls Gemini 2.5 Flash (temp=0.1), returns parsed JSON array of classifications.
  • Uses google.genai SDK with us-central1 location and Vertex AI backend.
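The batching side of this (prompt assembly and response parsing, the parts that run without any API call) might look like this sketch; the prompt wording and JSON keys are assumptions, not the module's actual format:

```python
import json

def build_batch_prompt(emails: list[dict]) -> str:
    """Number each email and include subject + first 200 chars of body."""
    lines = [
        "Classify each email. Reply with a JSON array of objects "
        "with keys i, edition, topic_tag.",
        "",
    ]
    for i, em in enumerate(emails):
        snippet = (em.get("body_plain") or "")[:200]
        lines.append(f"[{i}] Subject: {em['subject']}\n{snippet}")
    return "\n".join(lines)

def parse_batch_response(text: str) -> list[dict]:
    """Strip an optional ```json fence before parsing the array."""
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(text)

reply = '```json\n[{"i": 0, "edition": "pro", "topic_tag": "upload-error"}]\n```'
print(parse_batch_response(reply))
```

The actual Gemini call then sends the assembled prompt through the google.genai client with the Vertex AI backend, as noted above.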

pass2_analyzer.py — Pass 2 Q&A Extraction

Processes one support thread at a time through Gemini 2.5 Flash to extract structured Q&A pairs, summaries, and metadata.

  • run_pass2(conn, milestone, dry_run) — Main pipeline. Queries pending threads (not yet in thread_analyses), builds 5 few-shot examples, processes each thread sequentially with cost tracking. Stops after milestone threads for user review. Returns stats dict with counts, tokens, and cost.
  • get_pending_threads(conn, limit) — Queries threads ordered by message count ascending (smallest first) that have no entry in thread_analyses.
  • build_fewshot_messages() — Constructs 5 conversation-turn examples from known thread IDs for few-shot prompting.
  • analyze_thread(client, fewshot_messages, thread_text) — Sends one thread to Gemini with the few-shot context, parses the JSON response into topic, summary, Q&A, and metadata fields.
  • get_thread_full(conn, thread_id) — Concatenates all messages in a thread chronologically, truncating individual emails longer than 3,000 chars.
  • Retry logic: 3 retries with exponential backoff (2, 4, 8 seconds). JSON parse errors are logged but not stored, allowing re-processing on next run.
  • Cost: Gemini 2.5 Flash at $0.15/MTok input, $0.60/MTok output.
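Using the per-million-token rates above, the --dry-run cost estimate reduces to simple arithmetic; the per-thread token counts in the example are assumed, purely for illustration:

```python
INPUT_PER_MTOK = 0.15   # USD per million input tokens (from the notes above)
OUTPUT_PER_MTOK = 0.60  # USD per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost for a given token usage at Gemini 2.5 Flash rates."""
    return (input_tokens * INPUT_PER_MTOK
            + output_tokens * OUTPUT_PER_MTOK) / 1_000_000

# Hypothetical: 24,205 threads at ~4,000 input / ~400 output tokens each.
print(round(estimate_cost(24_205 * 4_000, 24_205 * 400), 2))  # 20.33
```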

Data Pipeline

1. Gmail fetch          --> raw JSON files + SQLite (category='email', edition='unknown')
2. WP comments import   --> SQLite only (category='wp_comment', edition='free')
3. WP.org forum scrape  --> SQLite only (category='wp_org_topic', edition='free')
4. Filtering pipeline   --> marks is_related=1 on relevant messages
   (labels + recipients + keywords + thread propagation)
5. Pass 1: classify     --> edition + topic tag per email (message_classifications table)
6. Pass 2: pass2        --> Q&A extraction per thread (thread_analyses table)

The filtering pipeline uses three whitelist criteria (related labels, related recipients, keyword matches) combined with thread propagation — if any message in a thread is relevant, all messages in that thread are included. WordPress comments and forum posts are automatically marked as related.
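The thread-propagation step can be sketched as a pure function over (message_id, thread_id) pairs; the function name is illustrative:

```python
def propagate_threads(messages: list[dict], matched_ids: set) -> set:
    """If any message in a thread matched a filter, include the whole thread."""
    matched_threads = {m["thread_id"] for m in messages
                       if m["message_id"] in matched_ids}
    return {m["message_id"] for m in messages
            if m["thread_id"] in matched_threads}

msgs = [
    {"message_id": "a", "thread_id": "t1"},
    {"message_id": "b", "thread_id": "t1"},
    {"message_id": "c", "thread_id": "t2"},
]
print(sorted(propagate_threads(msgs, {"a"})))  # ['a', 'b']
```

Only "a" matched a filter directly, but "b" is pulled in because it shares thread t1.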

Classification Results

Metric Count
Total threads analyzed 24,205
Q&A pairs extracted 3,918
By Gemini 2.5 Flash 9,915
By Claude Opus (manual) 262
Rule-based skips 14,028
Email threads 23,303
WP comment threads 86
WP.org topics 816