600 concurrent users, one ₹0 server, and my first real taste of production chaos.
Let me set the scene: it is February, it is 8:47 PM, and registrations for Incident 2024, the NIT Surathkal annual fest, go live in thirteen minutes. I am in the computer science department lab, I am sweating, and the infrastructure for this whole thing is a free-tier Render deployment and a Supabase project I had set up three months ago, when 'scale' was a concept that existed in textbooks rather than a thing that was about to happen to me personally.
I had rebuilt the fest website from scratch because the old one was a static HTML file from 2019 that looked like it had been designed by someone who had heard about the internet secondhand. I used Next.js and Supabase and Tailwind and everything felt very clean and modern in development. In production, with 600 people hitting register simultaneously, it felt like I had built a skyscraper out of good intentions.
The Supabase connection pool hit its limit at 8:54 PM. Users started seeing timeout errors. The free-tier Render server cold-started four times in the first hour. My beautiful optimistic UI updates were showing users success states for registrations that had never actually committed to the database. I found that out at 9:30 PM, when a batch of confirmation emails did not go out.
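For the curious, the broken flow was roughly the shape below. This is a reconstruction from memory, not the actual fest code, and every name in it (`handleRegister`, `setStatus`, the `registrations` table) is illustrative. The point is that the success state fired before anything was confirmed.

```ts
import { createClient } from "@supabase/supabase-js";

// Hypothetical setup; names are illustrative, not the fest repo's code.
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

type RegistrationForm = { name: string; email: string; event: string };
type Status = "idle" | "loading" | "success" | "error";
declare function setStatus(s: Status): void; // stand-in for a React state setter

// The bug: the UI flipped to "success" the moment the request fired, so a
// timed-out insert (exhausted connection pool) or a dropped request
// (cold-starting server) still looked to the user like a confirmed registration.
async function handleRegister(form: RegistrationForm) {
  setStatus("success"); // optimistic, before anything has committed

  // supabase-js does not throw on failure; it returns { error },
  // which this version never even looked at.
  await supabase.from("registrations").insert(form);
}
```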
What followed was three hours of the most focused debugging of my life. I sat in that lab with my laptop and two of my juniors, whom I had texted in mild panic, and we went through logs line by line. We found the connection pool issue first. Fixed it by upgrading the Supabase tier, which cost actual money paid out of my own pocket. Found the cold start issue second and set up a cron job to ping the server every five minutes. Fixed the optimistic updates by adding a proper loading state and server-side validation before showing success.
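The client half of that last fix looks roughly like the sketch below, with the same hypothetical names as before (the server-side validation itself lived in the API route and database constraints, which I am leaving out here): show a loading state, actually await the insert, and check the error that supabase-js hands back before claiming anything.

```ts
// The fixed flow, same hypothetical names as the sketch above: loading
// state first, success only after the database confirms the write.
async function handleRegister(form: RegistrationForm) {
  setStatus("loading");

  const { error } = await supabase.from("registrations").insert(form);

  if (error) {
    setStatus("error"); // surfaced to the user instead of swallowed
    return;
  }

  setStatus("success"); // the row actually exists now
}
```

The cold-start fix was even less glamorous: a crontab entry on a lab machine, something along the lines of `*/5 * * * * curl -fsS https://<app>.onrender.com/api/health`, hitting a trivial health route so the free-tier instance never spun down long enough to cold start mid-fest.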
By 11 PM the site was stable. By midnight, the roasting had begun on the fest WhatsApp group. Someone had screenshotted the error page. Someone else had captioned it with something I will not repeat here but that was genuinely quite funny in retrospect. The fest coordinator sent me a message asking if everything was okay that had the particular quality of a message that was trying very hard not to be angry.
Here is the thing about being publicly roasted for a technical failure: it is educational in a way that no course is. I knew about connection pools in theory. After that night I understood them in the particular way that you understand something you have felt go wrong at the worst possible time. The same goes for error boundaries, graceful degradation, database query optimisation, and why you should always test your application under simulated load before going live with something people are actually depending on.
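If you want the concrete version of that last lesson: even a dozen lines of load script would have caught the pool limit a week early. Here is a sketch using autocannon, which is just one option (k6 and Artillery do the same job); the URL and numbers are made up rather than what I actually ran afterwards.

```ts
// Minimal load smoke test with autocannon (npm i -D autocannon).
// 600 connections mirrors the launch-night concurrency.
import autocannon from "autocannon";

const result = await autocannon({
  url: "https://fest.example.com/register", // illustrative URL
  connections: 600, // concurrent connections
  duration: 30, // seconds
});

// Timeouts and errors here are exactly the connection-pool symptoms
// that showed up live at 8:54 PM.
console.log(
  `${result.requests.average} req/s, ${result.errors} errors, ${result.timeouts} timeouts`
);
```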
I spent the following week writing a post-mortem, which I shared with the fest committee. They did not ask for it. I gave it to them anyway because I felt responsible for what had happened and I thought understanding it might be useful for whoever ran this next year. The post-mortem is now part of the fest committee's technical onboarding document. That felt better than getting the site right the first time would have, probably.
Written by Arpit Joshi
CS at NIT Surathkal. Got into Amazon off-campus. Rebuilt the fest website and survived the roasting. Production broke me and made me.