AI Reliability & App Rescue · EdTech · June 2025

Growing User Base No Longer Breaking the Platform

A B2C EdTech platform's growing user base was exposing serious performance bottlenecks and causing intermittent crashes. Kovil AI's maintenance retainer cut page load times by 55% and reduced crash rates to near zero.

55%

Page Load Reduction

P95 load time

~0%

Crash Rate

Down from 3.2% of sessions

Concurrent Users

Capacity increase

92

Lighthouse Score

Up from 51

Client type: B2C Scale-up
Timeline: Ongoing retainer (results in 45 days)
Team: 2 engineers

Tech Stack

React · Next.js · Node.js · MongoDB · AWS CloudFront · Redis · Sentry

"Our platform was literally falling over during peak hours. Kovil AI diagnosed the root causes in the first week and had measurable improvements within a month. Our Lighthouse score went from 51 to 92 — our conversion rates followed."

Sophie Laurent, CTO

The Situation

The client runs a B2C platform offering interactive coding courses to self-taught developers. Over 18 months, its user base had grown from 8,000 to 65,000 monthly active users — impressive growth that the original infrastructure was never designed to handle.

Peak hours — typically 7–10pm in North American time zones — had become a reliability crisis. Intermittent crashes, 12+ second page loads, and a course video player that frequently failed to load were generating thousands of support tickets and a growing volume of negative reviews.

Their two-person engineering team was in perpetual firefighting mode, unable to make meaningful progress on the new features that would drive the next growth phase.

The Challenge

A preliminary investigation by Kovil AI in the first week revealed a pattern of interconnected issues rather than a single root cause:

  • A MongoDB aggregation query on the course progress collection — unindexed, and running on every page load — was the primary source of the performance spikes; at 65K users, it was timing out under load.
  • All static assets (images, JavaScript bundles, course videos) were served directly from Node.js with no CDN, adding significant latency for users outside the primary data center region.
  • Client-side JavaScript bundles were 4.2MB unminified, with no code splitting; every page loaded the entire application on first visit.
  • Session state was stored in MongoDB with no caching layer, causing database reads on every authenticated request.
  • No error tracking was in place; crashes were discovered through user support tickets, not monitoring.
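The unindexed-aggregation problem above can be sketched in mongosh. This is a hedged illustration, not the client's actual query: the collection and field names (`courseProgress`, `userId`, `courseId`, `completed`) are hypothetical stand-ins.

```javascript
// mongosh sketch — names are hypothetical, not the client's schema.
// Diagnose: run the pipeline with executionStats and look for a
// COLLSCAN stage in the winning plan (a full collection scan).
db.courseProgress.explain("executionStats").aggregate([
  { $match: { userId: "u-123" } },
  { $group: { _id: "$courseId", lessonsDone: { $sum: "$completed" } } },
]);

// Fix: a compound index lets the $match stage use an IXSCAN instead
// of scanning every document on every page load.
db.courseProgress.createIndex({ userId: 1, courseId: 1 });
```

The same explain-then-index loop applies to any collection where a hot-path query shows a collection scan.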

Our Approach

We prioritized the interventions by impact-to-effort ratio and worked through them systematically over 45 days, reporting progress weekly. The internal engineering team was kept informed and involved in all decisions — we weren't making changes to their codebase without their review and sign-off.

Our guiding principle: fix infrastructure problems before touching application code. The biggest gains were almost always at the infrastructure layer.

The Solution

The 45-day engagement addressed the following:

  • Database indexing and query optimization: Added compound indexes on the three most-queried collections; rewrote the course progress aggregation query to use a pre-computed materialized view updated asynchronously. Database query time dropped 89%.
  • CDN deployment: Migrated all static assets and course videos to AWS CloudFront with edge caching. Reduced median asset load time from 2.8s to 0.4s for North American users.
  • Bundle optimization: Implemented Next.js dynamic imports and route-based code splitting. Initial bundle size dropped from 4.2MB to 680KB. Subsequent pages load incrementally.
  • Session caching: Introduced Redis for session state. Eliminated MongoDB reads on every authenticated request; session lookup time went from ~180ms to ~4ms.
  • Error tracking: Deployed Sentry with custom alerting thresholds. Within 48 hours of deployment, identified 7 undetected error sources, 3 of which were causing silent data corruption.
  • Load testing: Established a regular load testing cadence using k6, giving the team early warning before traffic spikes hit production.
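The session-caching step follows a standard cache-aside pattern. A minimal runnable sketch, with an in-memory `Map` standing in for Redis and a stubbed async loader in place of the MongoDB read (all names here are illustrative, not the client's code):

```javascript
// Cache-aside session lookup: check the cache first, fall back to the
// database on a miss, then populate the cache so the next request hits.
// A Map stands in for Redis; loadSessionFromDb stands in for MongoDB.
const cache = new Map();
let dbReads = 0; // counts fall-throughs to the "database"

async function loadSessionFromDb(sessionId) {
  dbReads += 1; // in production, this is the ~180ms MongoDB round trip
  return { sessionId, userId: "user-for-" + sessionId };
}

async function getSession(sessionId) {
  if (cache.has(sessionId)) {
    return cache.get(sessionId); // fast path (~4ms with Redis)
  }
  const session = await loadSessionFromDb(sessionId); // slow path
  cache.set(sessionId, session); // with Redis: SET key value EX <ttl>
  return session;
}

// Usage: the first call reads the database, the second is a cache hit.
//   await getSession("abc"); // dbReads -> 1
//   await getSession("abc"); // dbReads still 1
```

With a real Redis client, the `Map` operations become GET/SET with a TTL and invalidation on logout becomes a DEL; the pattern is otherwise identical.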

Results

Within 45 days, P95 page load time dropped from 11.2 seconds to 5.0 seconds — the 55% improvement from baseline. By day 60, with all optimizations in production, it had fallen further to 2.4 seconds. The crash rate, which had been affecting 3.2% of all sessions, dropped to 0.08% — effectively zero.

The Lighthouse performance score improved from 51 to 92. The platform's trial-to-paid conversion rate improved by 18% in the 60 days following the engagement — not a direct attribution, but a figure the CTO cited as closely correlated with the load time improvements. The internal engineering team was finally able to focus on product development again.

Start Your Project

See the engagement model that fits your situation.