The Entrepreneur Forum | Financial Freedom | Starting a Business | Motivation | Money | Success

Lessons learned: Or how to wake up to burning servers and loads of traffic

stephanduq

Bronze Contributor
User Power
Value/Post Ratio
78%
Apr 7, 2013
157
122
31
Hey all,

This weekend my app project suddenly climbed into the Top 100 section of the Dutch App Store. The cause of it was an app review on a leading iPhone website. I was out, and then asleep, when it happened. Since the website never contacted me, I had no idea of the massive trouble heading my way. After a very stressful weekend of server patching, reading 1-star reviews, and generally no sleep, it's time to reflect:

My lessons learned:

• The servers stopped working, but kept accepting connections. As a result there was no message to tell users that they should try again later, or that the service would be slow. By the time I discovered the server was misbehaving, there had already been hundreds of downloads. Potential evangelists were lost, which could have been avoided with a simple alert screen.

• The server stopped because I had implemented a new database connection technology the night before. I had old versions of the server ready just in case. The lesson here: keep backups around and ready.

• Don't rush your patching. It's horrible to watch your log scroll by with tons of errors and failing user sessions, but a hot patch has the potential to do more damage. Imagine your app is not working, and a user tries it again after a few hours, only to see it crash on a new screen. I started with very quick patch updates before I rolled the server back. It just made everything worse, chaotic, and out of control.

• Some of the problems in the app were also caused by my failure to properly check the last submitted version. I figured everything would be fine, as the last patch was a simple bug fix. But somehow that fix cascaded down to the onboarding process and messed it up. I now have a checklist of everything that needs to be checked before I can submit a new version, and it's hanging next to my desk as a grim reminder.

• I immediately started tracking social media, and replied to tweets and comments made on the internet. Disgruntled users can be very understanding as soon as they know the context of what went wrong.

• Always test-market a digital/server product in a small market. I'm happy this happened in the Netherlands, where everything is manageable. If a similar event had happened in a larger market, I would not have been able to cope and turn things around.
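The first lesson (a server that keeps accepting connections but never tells users anything is wrong) could be handled with a small gate in front of the request logic. This is only a minimal sketch: `HealthGate`, `Response`, the 503 status and the message text are hypothetical names chosen for illustration, not anything from the actual app.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical sketch: HealthGate and Response are illustrative names, not the app's code.
class HealthGate {
    // Minimal response: an HTTP-style status code plus a body the app can display.
    static final class Response {
        final int status;
        final String body;

        Response(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    private final AtomicBoolean healthy = new AtomicBoolean(true);

    // Called by whatever monitors the backend, for example a periodic database ping.
    void markHealthy(boolean isHealthy) {
        healthy.set(isHealthy);
    }

    // Runs the real handler only while the backend is marked healthy. If it is not,
    // or if the handler throws, the client gets a 503 and a message to show the user.
    Response handle(Supplier<String> handler) {
        if (!healthy.get()) {
            return unavailable();
        }
        try {
            return new Response(200, handler.get());
        } catch (RuntimeException e) {
            return unavailable();
        }
    }

    private static Response unavailable() {
        return new Response(503, "We are very busy right now. Please try again in a few minutes.");
    }

    public static void main(String[] args) {
        HealthGate gate = new HealthGate();
        System.out.println(gate.handle(() -> "outfits ready").status); // healthy path
        gate.markHealthy(false);
        System.out.println(gate.handle(() -> "outfits ready").body);   // degraded path
    }
}
```

The app side would then only need to show the body of any 503 response as its alert screen, instead of hanging on a request that never completes.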

Now the good news!

Plenty of people seem to be coming back to check if the app might work. And it does! I expect that I can still salvage some of the users, and give them a valuable experience :)

I also now know some of the statistics I need to hit to rank in the App Store, so I can start marketing more effectively.

These were my lessons learned. I hope they will benefit some of you :)
 

hellolin

Bronze Contributor
Speedway Pass
User Power
Value/Post Ratio
117%
May 27, 2015
358
420
38
Sounds like you need some common-sense, sound IT management. Maybe hire someone to do this, or learn how to host your app in the cloud. Since you are not in the US, you might not have caught on to the cloud game yet. It is very rare here in the US now for someone to buy their own servers and host their own stuff, unless they are a company like a warehouse that only has one 35M connection to the internet. I think Amazon Web Services does have a region in the EU; check them out and you should be able to host your app there. If you need an even easier hosting solution, check out Heroku, https://www.heroku.com. They will even allocate the servers for you, so all you have to do is program on their platform. You could even write your app in a cloud editor like Cloud9, which would save you more time and speed up both your time to market and your patching.

It will be much easier to patch your app if you host it in the cloud, since you can build virtual servers that have exactly the same environment as the live servers. You can test your patched app in the test environment first, and once it checks out, move it to the live servers. To make sure the environment stays the same across the testing and live servers, you can learn to use tools such as Chef or Puppet, which help keep all the cloud servers you create configured identically.
 

stephanduq

Thank you, I have enough IT management and common sense in store myself though ;) But some of these things I should have caught. I think this is partly the problem with being a solo developer: at some point you are so focussed on your project, and believe so much in what you have built, that you start taking shortcuts. And you lack a pair of fresh eyes.

The main problem I encountered is that the server software I wrote is incredibly complex (partly homegrown deep learning AI and image recognition, for example). The problem is not the server capacity itself; it's completely scalable and well designed in AWS. Most of the problems came from the calculation algorithms, which caused memory and database connection leaks on a scale that is invisible in a development environment but suddenly shows up in a production environment during a traffic peak. You can forget to close one of the 500+ connections, and it won't ever show up in development or with low traffic.

The only way I feel I could have prevented this would have been to create bot clients that simulate a usage spike. At the time I decided not to, because it seemed too costly in time. And if I compare the time I would have needed to create those bots with one mad weekend, I think I made the right decision :)
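For readers wondering what such bot clients might look like, here is a minimal sketch under stated assumptions. `SpikeSimulator` is a hypothetical name, and the "server" in `main` is a stand-in lambda that starts failing after 100 calls; a real test would issue network requests against a staging server instead.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of bot clients that hit a service concurrently.
class SpikeSimulator {
    // Starts `clients` bots, each making `requestsPerClient` calls, and returns
    // the number of calls that failed by throwing an exception.
    static int run(int clients, int requestsPerClient, Callable<?> request) {
        AtomicInteger failures = new AtomicInteger();
        ExecutorService bots = Executors.newFixedThreadPool(clients);
        for (int i = 0; i < clients; i++) {
            bots.execute(() -> {
                for (int r = 0; r < requestsPerClient; r++) {
                    try {
                        request.call();
                    } catch (Exception e) {
                        failures.incrementAndGet();
                    }
                }
            });
        }
        bots.shutdown();
        try {
            if (!bots.awaitTermination(1, TimeUnit.MINUTES)) {
                bots.shutdownNow();
                throw new IllegalStateException("spike test timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("spike test interrupted", e);
        }
        return failures.get();
    }

    public static void main(String[] args) {
        // Stand-in for a server that runs out of some resource after 100 calls.
        AtomicInteger served = new AtomicInteger();
        int failures = run(20, 10, () -> {
            if (served.incrementAndGet() > 100) {
                throw new IllegalStateException("no connections available");
            }
            return "ok";
        });
        System.out.println("failed requests: " + failures);
    }
}
```

Even a crude harness like this exercises the "many requests at once" condition that a single developer clicking through the app never reproduces.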
 

tafy

Gold Contributor
Speedway Pass
User Power
Value/Post Ratio
116%
Aug 21, 2013
1,647
1,912
UK
You can spin this to your advantage easily tho, write some interesting articles on what happened and you will be better off than if your server was perfect.
 

Deleted21961

Guest
I would love to read a whitepaper or watch a YT talk about this problem you encountered. Could you at least tell us more about the technology you use? (I mean, your stack)
 

stephanduq

What app are we talking about?

http://welldressed-app.com/ (just made a new landing page)

I would love to read a whitepaper or watch a YT talk about this problem you encountered. Could you at least tell us more about the technology you use? (I mean, your stack)

Sure! I have several servers hanging in AWS: an RDS instance with MySQL, and two standard Linux systems. One server handles processing of the datafeeds I pull from affiliate networks, and the other deals with all the logic for the users. Both servers are equipped with OpenCV, and it's quite basic beyond this. The user server is load-balanced and part of an autoscale group. Oh, and the server is written in Java.

The server that deals with users essentially has one goal: create at least 100,000 outfits every second for every server thread. To maximise efficiency, I wrote most of the server from the ground up, to prevent bloated frameworks from causing performance issues. The problem I encountered was a ridiculously simple one that I wouldn't have had if I had used more readily available frameworks. I used a database connection pool of my own design to make sure there is always a connection to the RDS available. But it failed sporadically: at random times there would be no connections available, causing a thread to crash. So I switched to HikariCP, which is very stable, and which would give me some breathing space to rewrite my own CP. As you can imagine, there are a lot of database connections being opened and closed. Somewhere there was one method that requested a connection that I forgot to close. Normally, with less traffic, it would have timed out (which is why I never saw it in the development environment), but with the sudden spike it started pulling all the connections from the pool.
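The forgotten `close()` described above is easy to reproduce in miniature. The following is a sketch only: `TinyPool` and `PooledConnection` are hypothetical stand-ins built on a `Semaphore`, not HikariCP and not the author's own pool, and the sketch does not model the connection timeout mentioned in the post. It contrasts a leaky lookup with one that uses try-with-resources.

```java
import java.util.concurrent.Semaphore;

// Hypothetical miniature of a connection pool, used only to show the leak pattern.
class LeakDemo {
    // Stand-in for a pooled database connection: closing it returns its slot to the pool.
    static final class PooledConnection implements AutoCloseable {
        private final Semaphore permits;
        private boolean closed;

        PooledConnection(Semaphore permits) {
            this.permits = permits;
        }

        String query(String sql) {
            return "result of " + sql;
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                permits.release();
            }
        }
    }

    static final class TinyPool {
        private final Semaphore permits;

        TinyPool(int size) {
            permits = new Semaphore(size);
        }

        // Fails immediately when every connection is already checked out.
        PooledConnection borrow() {
            if (!permits.tryAcquire()) {
                throw new IllegalStateException("no connections available");
            }
            return new PooledConnection(permits);
        }

        int available() {
            return permits.availablePermits();
        }
    }

    // Buggy: the connection is never closed, so each call removes one slot for good.
    static String leakyLookup(TinyPool pool) {
        PooledConnection connection = pool.borrow();
        return connection.query("SELECT 1");
    }

    // Fixed: try-with-resources closes the connection on every path, including exceptions.
    static String safeLookup(TinyPool pool) {
        try (PooledConnection connection = pool.borrow()) {
            return connection.query("SELECT 1");
        }
    }

    public static void main(String[] args) {
        TinyPool pool = new TinyPool(3);
        for (int i = 0; i < 1000; i++) {
            safeLookup(pool);
        }
        System.out.println("available after safe lookups: " + pool.available());
        leakyLookup(pool);
        System.out.println("available after one leaky lookup: " + pool.available());
    }
}
```

With low traffic the leaky version appears to work, because the pool is rarely empty; under a spike every slot is gone within seconds. HikariCP also has a `leakDetectionThreshold` setting that logs a warning when a connection stays out of the pool longer than the configured time, which can help surface this kind of bug earlier.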

On top of this, I was working on a neural network that analyses users and finds users with similar tastes, looks and preferences. It tries to select garments for outfits that fit the tastes of this user group. It was supposed to be turned off on the production server, but it wasn't, and it was leaking, causing the server to run out of memory. It leaked something like 3b per call: impossible to find in development, but in production, given enough time, it would crash the server. This one was easy to find. Because of the efficiency demands I have, I spend a lot of time on preventing leaks, so it had to be something I had just created.

In my experience so far, most critical problems in a live environment tend to be silly human mistakes that don't scale well. Extremely easy to fix, but very annoying to find when stressed out. The good thing is that you can prepare for them with the right debug logging, and good style.
 

Deleted21961

Guest
You are using at least one separate thread per user? What is your average and max user load this java app can handle? You could easily make it 10x without separate threads.
 

stephanduq

You are using at least one separate thread per user? What is your average and max user load this java app can handle? You could easily make it 10x without separate threads.

Yeah, this was my first server project, and this is one of the oldest bits in there that really needs replacement. A thread pool is on my todo list for this month, but since the traffic was always quite low, I didn't prioritise it much.
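As a rough illustration of the thread pool idea, the sketch below pushes many requests through a fixed number of worker threads instead of starting one thread per user. `BoundedWorkers` and the pool size are hypothetical; the right size for a real server depends on its workload and would need measuring.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: many requests share a small, fixed set of worker threads.
class BoundedWorkers {
    // Submits `requests` tasks to a pool of `poolSize` threads and returns how many
    // distinct threads actually did the work.
    static int distinctThreadsUsed(int poolSize, int requests) {
        Set<String> workerNames = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < requests; i++) {
            // Each task stands in for handling one user request.
            pool.execute(() -> workerNames.add(Thread.currentThread().getName()));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return workerNames.size();
    }

    public static void main(String[] args) {
        // 10,000 requests never use more than 8 threads, however many users arrive.
        System.out.println("threads used: " + distinctThreadsUsed(8, 10_000));
    }
}
```

The point of the bound is that a traffic spike queues work rather than creating thousands of threads, so memory use stays predictable.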
 

stephanduq

Ok, but tell me - why Java?

Hahaha, I follow the lean methodology, so I want to push out the MVP and iterations as quickly as I can. I was familiar with Java, so it seemed the fastest way to get the MVP out there and start validating :)

Lean is at the same time the reason why I have made a lot of decisions that I know will not work in the long term. But doing things properly before validation could result in a lot of wasted time if the idea turns out to be a bad one.
 

Weaponize

Workin on it!
Read Fastlane!
Speedway Pass
User Power
Value/Post Ratio
185%
Nov 15, 2014
266
491
The cause of it was an app review on a leading iPhone website.

First off, congrats!

Second, thanks for sharing the gory technical details :)

You can spin this to your advantage easily tho, write some interesting articles on what happened and you will be better off than if your server was perfect.

^^^^ this

Use this as PR
 
