How I Maintained 99.99% Uptime for MMORPG Databases
12 primary databases, 12 replicas, 100M+ rows per table, and gamers who notice 50ms of latency. Lessons from the trenches of game infrastructure.

In 2004, I joined Softon Entertainment as a database engineer for a live MMORPG. The scale was immediately humbling: 12 primary MySQL databases, 12 replication nodes, tables with 100+ million rows, and data growth of 500GB per month. The SLA was 99.99% — roughly 52 minutes of allowed downtime per year.
MMORPG data models are uniquely challenging because they combine high-frequency writes with complex relational structures. A single player entity touches character attributes, inventory items, guild memberships, quest progress, mail messages, and auction house listings. We normalized aggressively for data integrity but denormalized read-heavy paths like character profiles into materialized views that refreshed every 30 seconds.
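MySQL has no native materialized views, so a 30-second refresh like this in practice means periodically rebuilding a read-optimized summary table from the normalized sources. A minimal sketch of the flattening step, with illustrative field names that are not the original schema:

```python
# Hypothetical sketch of denormalizing a character profile from
# normalized rows into one read-optimized record. Field names are
# illustrative, not the original schema.

def build_character_profile(character, inventory_rows, guild_row):
    """Flatten normalized rows into a single profile record suitable
    for a summary table refreshed on a timer."""
    return {
        "char_id": character["id"],
        "name": character["name"],
        "level": character["level"],
        "guild_name": guild_row["name"] if guild_row else None,
        "item_count": len(inventory_rows),
    }
```

A cron-style job would run this per character and upsert the result, trading 30 seconds of staleness for single-row profile reads.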
The first thing I learned: replication lag kills games. When a player buys an item on the primary and the replica hasn't caught up, they see the gold deducted but no item in their inventory. Panic ensues. Support tickets flood in. We solved this with read-after-write consistency for critical operations — player state reads always hit the primary.
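The routing rule itself is small: anything that writes, or reads player state the player just changed, goes to the primary. A sketch with hypothetical table names (the original list of critical tables isn't given in the article):

```python
# Hypothetical read-after-write routing: critical player-state reads
# and all writes go to the primary; everything else may hit a replica.
CRITICAL_TABLES = {"inventory", "wallet", "character_state"}  # illustrative

def choose_endpoint(table, operation, primary, replica):
    """Return the host a query should be sent to."""
    if operation == "write" or table in CRITICAL_TABLES:
        return primary
    return replica
```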
Connection management across 12 primary databases required a custom pooling layer. Each application server maintained persistent connections to every primary and its replica, with health-check queries running every 5 seconds. When a connection failed the health check, it was immediately recycled and a new one established. We capped connections per application server at 50 per database to prevent connection storms during traffic spikes — a hard lesson learned after a login event caused 3,000 simultaneous connection attempts.
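The pooling logic can be sketched as follows. This is a simplified stand-in, not the production layer: the real system ran health checks on a 5-second timer, while here the check happens on acquire for brevity, and the cap is applied to idle connections only.

```python
class BoundedPool:
    """Minimal sketch of a per-database connection pool with a hard cap
    and recycle-on-failed-health-check (illustrative, not the original)."""

    def __init__(self, factory, max_size=50):
        self.factory = factory      # callable that opens a new connection
        self.max_size = max_size    # hard cap to prevent connection storms
        self.conns = []             # idle connections

    def acquire(self):
        while self.conns:
            conn = self.conns.pop()
            if conn.ping():         # health check; recycle dead connections
                return conn
            conn.close()
        return self.factory()       # pool empty: open a fresh connection

    def release(self, conn):
        if len(self.conns) < self.max_size:
            self.conns.append(conn)
        else:
            conn.close()            # over the cap: close instead of pooling
```

The cap is the important part: without it, a spike like the 3,000-connection login event turns every application server into an attacker against its own database.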
Failover was our biggest operational risk. When I arrived, failing over a primary database took 10 minutes of manual work. I built an automated failover system that detected primary failures via heartbeat monitoring and promoted a replica within 60 seconds. The key was pre-warming the replica connections — the application layer maintained standby connections to replicas that could become primaries.
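The failure-detection half of such a system reduces to counting consecutive missed heartbeats. A sketch under the assumption of a fixed miss threshold (the actual detection logic isn't described in detail):

```python
class HeartbeatMonitor:
    """Sketch of heartbeat-based primary failure detection: declare the
    primary dead after `max_missed` consecutive missed beats. With beats
    every few seconds and max_missed=3, detection plus promotion fits
    inside a 60-second budget (illustrative parameters)."""

    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self.missed = 0

    def record(self, beat_ok):
        """Record one heartbeat result; return True when failover
        should be triggered."""
        self.missed = 0 if beat_ok else self.missed + 1
        return self.missed >= self.max_missed
```

Requiring consecutive misses (rather than any single miss) avoids triggering a promotion on one dropped packet.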
Backup strategy had to balance completeness against performance impact. Full backups of our largest tables took over 4 hours and caused measurable replication lag. We ran full backups during the weekly maintenance window and supplemented with continuous binary log archiving to S3. This gave us point-in-time recovery capability to any second within the past 30 days. We tested recovery monthly, restoring a full database to a staging environment to verify backup integrity — because an untested backup is not a backup.
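Point-in-time recovery from full backup plus archived binary logs comes down to selecting which log segments to replay. A simplified sketch, assuming archives are tracked as sorted (start-time, name) pairs; a real restore also needs the backup's recorded binlog coordinates:

```python
from datetime import datetime

def binlogs_for_restore(archives, target_time):
    """Given archived binlog segments as sorted (start_time, name) pairs
    taken after the last full backup, return the segment names to replay
    up to `target_time` (sketch; replay of the final segment is further
    bounded by --stop-datetime in mysqlbinlog)."""
    return [name for start, name in archives if start <= target_time]
```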
Monitoring at this scale required prediction, not reaction. I built a Python-based system that tracked replication lag trends, query execution time distributions, and disk I/O patterns. When any metric's trend line projected a threshold breach within 4 hours, it alerted the on-call engineer. This caught 70% of potential incidents before they became user-visible.
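The core of trend-based alerting is a least-squares fit projected forward. A self-contained sketch of the idea (the original system's exact model isn't specified; this assumes a simple linear trend over hourly samples):

```python
def hours_to_breach(samples, threshold):
    """Fit a least-squares line to (hour, value) samples and return the
    hours from the last sample until the trend crosses `threshold`, or
    None if the trend is flat or improving."""
    n = len(samples)
    sx = sum(t for t, _ in samples)
    sy = sum(v for _, v in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * v for t, v in samples)
    denom = n * sxx - sx * sx
    if denom == 0:
        return None
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    if slope <= 0:
        return None                     # not trending toward the threshold
    t_breach = (threshold - intercept) / slope
    return max(0.0, t_breach - samples[-1][0])

def should_alert(samples, threshold, horizon_hours=4.0):
    """Alert when the projected breach falls inside the horizon."""
    eta = hours_to_breach(samples, threshold)
    return eta is not None and eta <= horizon_hours
```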
Query optimization on tables exceeding 100 million rows demanded a fundamentally different approach than small-scale development. Full table scans were obviously out of the question, and even indexed queries could degrade once the index itself exceeded available memory. We implemented a query review process: every new query bound for production was analyzed with EXPLAIN, and any query estimated to touch more than 10,000 rows required explicit approval. Composite indexes were designed around actual query patterns extracted from the slow query log, not theoretical access patterns.
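The 10,000-row gate can be sketched as a check over EXPLAIN's per-table `rows` estimates. For nested-loop joins, the product of those estimates approximates total rows examined; this is an illustrative check, not the original review tooling:

```python
ROW_LIMIT = 10_000  # estimated rows examined above this need sign-off

def needs_approval(explain_rows):
    """`explain_rows` holds the per-table `rows` estimates from EXPLAIN.
    Their product approximates rows examined in a nested-loop join plan
    (sketch; real plans can be cheaper with early termination)."""
    examined = 1
    for r in explain_rows:
        examined *= r
    return examined > ROW_LIMIT
```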
The international expansion to Japan doubled our concurrent users overnight. We couldn't simply scale vertically, since a single MySQL instance is bounded by the host's memory for its buffer pool and by practical connection limits. Instead, we sharded the user base geographically: Japanese players on Japan-region databases, Korean players on Korea-region databases, with a global lookup service routing connections.
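The lookup service's core is a region-to-shard map consulted at connection time. A sketch with hypothetical endpoint names (the real service also handled shard migrations and fallbacks, which are omitted here):

```python
# Hypothetical region-to-shard routing table; endpoint names are
# illustrative, not the original infrastructure.
SHARD_ENDPOINTS = {
    "JP": "mysql-jp-primary.internal",
    "KR": "mysql-kr-primary.internal",
}

def shard_for_player(player_region, endpoints=SHARD_ENDPOINTS):
    """Resolve a player's home region to their shard's primary."""
    try:
        return endpoints[player_region]
    except KeyError:
        raise ValueError(f"no shard for region {player_region!r}")
```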
Cross-region data consistency introduced latency that our game design had to accommodate. A player trading items with someone on the Japan servers experienced a 150ms round-trip to the Korean primary — unacceptable for real-time gameplay. We solved this by making all cross-region interactions asynchronous: trades completed through a mailbox system, auction house listings replicated on a 5-second delay, and guild chat used a separate low-latency messaging layer that didn't depend on database replication.
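The mailbox pattern decouples the two shards: the sender's shard commits its side locally, and delivery happens whenever the message crosses regions. A minimal in-memory sketch of that asynchronous hop (the real transport and durability guarantees are not shown):

```python
from collections import deque

class Mailbox:
    """Sketch of the asynchronous cross-region trade path: deliveries
    are queued on the sender's side and applied later on the receiver's
    shard. An in-memory deque stands in for the real transport."""

    def __init__(self):
        self.queue = deque()

    def send(self, to_player, item):
        """Enqueue a delivery; the sender's debit has already committed
        locally on their own shard."""
        self.queue.append((to_player, item))

    def deliver_all(self, inventories):
        """Apply queued deliveries to receiver inventories (normally run
        by a consumer in the receiving region)."""
        while self.queue:
            to_player, item = self.queue.popleft()
            inventories.setdefault(to_player, []).append(item)
```

The design choice is that neither player ever blocks on a cross-region round trip; the cost is that the receiver sees the item arrive "in the mail" rather than instantly.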
The most important lesson from three years of database operations: the system that keeps you up at night isn't the one that fails loudly. It's the one that degrades silently — a replication lag that grows by 100ms per day, a table that fragments 0.1% per week, a query that gets 2% slower with each million rows added. Monitoring trends matters more than monitoring thresholds.
Capacity planning became a core discipline. I maintained growth models that projected storage, IOPS, and connection requirements 6 months ahead, based on player acquisition rates and gameplay data generation patterns. When projections showed we'd hit disk IOPS limits within 3 months, we'd begin provisioning and migrating before any user felt the impact. This proactive approach transformed database operations from a reactive firefighting role into a predictable engineering practice.
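At its simplest, such a projection is a linear growth model: given the article's 500GB/month growth, runway to a capacity limit is just the remaining headroom divided by the growth rate. A sketch (the real models presumably factored in player acquisition curves rather than a constant rate):

```python
def months_until_limit(current, monthly_growth, limit):
    """Linear growth model (sketch): months until `current`, growing by
    `monthly_growth` per month, reaches `limit`. Returns 0.0 if already
    at or past the limit, None if growth is flat or negative."""
    if current >= limit:
        return 0.0
    if monthly_growth <= 0:
        return None
    return (limit - current) / monthly_growth
```

The same function applies to storage (GB), IOPS, or connection counts; when the result dips under the provisioning lead time, migration work starts.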
Tags: MySQL, Infrastructure, Gaming, High Availability