Often entrepreneurs ask me 'What technology should I build my startup on?' There is no right or wrong answer to this question. It's a decision every company makes for itself, depending on what it's trying to build and the skills of its cofounders. Nonetheless, there are a few rules that one should adhere to. We discuss them in this blog post.
What happens in your company when a production incident occurs? Usually in a typical startup, you will see engineers running around frantically trying to resolve the problem. However, as soon as the incident is resolved, they forget about it and go back to their usual business. A good incident response policy can help bring order into chaos. We provide a sample template in this blog post.
We discuss why software deadlines usually don't make sense.
We discuss a number of freely available online tools which can be used to analyze bottlenecks in your website.
In this article, we show that security is both important and achievable for smaller companies without breaking the bank.
Wednesday, November 28, 2012
Monday, November 26, 2012
- Can you draw a systems diagram for me?
- How will this work on 4 or more boxes? How will you load balance requests between them?
- What's the average latency for a request? What can you cache? (Again, if a person hasn't thought this through, then the system isn't ready.)
- How will you test this?
- What can fail? How can we build a system so that it degrades gracefully when failures happen?
- What are the security risks?
Sunday, November 25, 2012
- Over 50 million page views a month
- 50,000 hours of audio content created
- 15,000,000 media streams
- 175,000,000 ad impressions
- Peak rate of 40,000 concurrent requests per second
- Many TB/day of data stored in MSSQL, Redis, and ElasticSearch clusters
- Around 100 hardware nodes in production.
- Production website is run from the data center in Brooklyn. We like to control our own destiny instead of relegating data to the cloud.
- Amazon EC2 instances are used mostly for QA and Staging environments.
- About 50 web servers
- 15 MS SQL database servers
- 2 Redis NOSQL key value servers
- 2 NodeJS servers
- 2 servers for elastic search cluster
- .NET 4 C# : ASP.NET and MVC3
- Visual Studio 2010 Team Suite as an IDE
- StyleCop, Resharper for enforcing code standards
- Agile development methodology, with Scrum used for large features and Kanban taskboard for smaller tasks
- Jenkins + Nunit for testing and continuous integration
- Sauce On Demand – Selenium for automation testing
Software And Technologies Used
- Windows Server 2008 R2 x64: Operating System
- SQL Server 2005 running under Microsoft Windows Server 2008 Web Server
- Equalizer load balancers: for load balancing
- REDIS: used as the distributed caching layer and for message pub-sub queue
- NODEJS for real-time analytics and updating studio dashboard
- ElasticSearch : for show search
- Sawmill + custom parser scripts: for log analysis
- NewRelic for performance monitoring
- Chartbeat for impact of performance on KPI (conversions, page views)
- Gomez, WhatsupGold, Nagios for various alerting
- SQL Monitor: from Red Gate - for SQL Server monitoring
- “Be brief, be bright, be gone” : Respect another person’s time. Don’t come with problems, come with solutions.
- Don’t go chasing hot technologies of the day. Instead, ‘mitigate your top problems’. We adopt new technologies, but only when the business case requires it. The appetite for production outages decreases significantly when you have millions of users.
- Achieve “essential”, then worry about “excellent”.
- Be a “how team” instead of a “no team”.
- Build security into the software development lifecycle. You need to train developers on how to write secure software and make it a business priority from the start.
- Separate clusters of web servers serve requests from regular users and requests from ad users, differentiated by a cookie.
- We are moving towards a service-oriented architecture where key pieces of the system, such as search, authentication, and caching, are RESTful services implemented in various languages. These services also provide a caching layer.
- REDIS NOSQL key-value store (redis.io) is used as a cache layer before database calls.
- Scaleout is used to maintain session state across a garden of web servers. However, we are considering switching to REDIS.
- Text search in the SQL Server database doesn’t work well. It was clogging up the CPU, so we switched to ElasticSearch (which is built on Lucene).
- The built-in session module by Microsoft is prone to deadlocks, so we ended up replacing it with the AngiesList session module, which stores data in REDIS.
- Logging is key to detecting problems.
- Reinventing the wheel can be a good thing. For example, initially we used a vendor product for bundling JS/CSS together which started causing performance issues. We then rewrote bundling ourselves, and significantly improved performance of our site.
- Not all data is relational, so a database isn’t always a good medium. A good analogy is “Imagine you have water flowing down a pipe. The pipe is wide at the top but gets narrow towards the bottom.” The top is the web servers (there are many of them), the bottom is the databases (there are few and they get clogged up).
- Not using metrics in your development process is like trying to land a plane in a storm with your altimeter not working. Throughout your development process, compute metrics such as site throughput, time to fix Blocker/Critical bugs, code coverage and use them to gauge your performance.
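The "cache layer before database calls" item above follows what is commonly called the cache-aside pattern. Below is a minimal Python sketch of that pattern, not our production code. To keep it runnable without a Redis server, it uses a small in-memory stand-in; the names `InMemoryCache` and `get_show_title`, the key format, and the 60-second expiry are all hypothetical choices made for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryCache:
    """Hypothetical stand-in for a Redis client: string values with an expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # The entry has expired: drop it and report a miss.
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)


def get_show_title(
    show_id: int,
    cache: InMemoryCache,
    load_from_db: Callable[[int], str],
    ttl_seconds: float = 60.0,
) -> str:
    """Cache-aside read: try the cache first, hit the database only on a miss."""
    key = f"show:{show_id}:title"  # hypothetical key naming scheme
    cached = cache.get(key)
    if cached is not None:
        return cached
    title = load_from_db(show_id)       # the expensive call we want to avoid
    cache.set(key, title, ttl_seconds)  # populate the cache for later readers
    return title
```

The expiry keeps stale entries from living forever; how long it should be depends on how quickly the underlying data changes.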
The abstract is below. If you are in San Francisco, do stop by. Thanks to Vanessa Alvarez from Forrester for moderating:
Cinchcast is a cloud-based, enterprise solution for webcasts and conference calls of any size. On a monthly basis, Cinchcast powers 15 million audio streams and attracts over 36 million unique visitors. In this talk, we’ll discuss how Cinchcast's development and production environments operate and the role of New Relic in scaling the Cinchcast platform to meet event demands. Dr. Yampolskiy will explain how Cinchcast maintains agile release cycles, while monitoring for performance and security issues. He will give some concrete examples of how a drastic drop in page views was discovered through a monitoring tool, or how his team thwarted a DDoS attack through cloud provisioning.
Saturday, November 10, 2012
Friday, November 9, 2012
What happens in your company when a production incident occurs?
Usually in a typical startup, you will see engineers running around frantically trying to resolve the problem. However, as soon as the incident is resolved, they forget about it and go back to their usual business.
A good incident response policy can help bring order into chaos. There are a few best practices that one should keep in mind when production outages occur:
- Having a procedure in place helps reduce the panic. Security incidents should be treated differently than production outages.
- In the report, explain a response timeline and how the problem was discovered.
- An incident report should be written the same day the incident occurred. Otherwise, you risk forgetting what happened.
- It should have concrete follow-up actions, tracked as JIRA tickets. If you don't do this, then engineers will not follow up.
- Put up incident reports in a public location and compute metrics. Are incidents happening less frequently this month than the previous one? Is there any correlation between incidents? Are follow-up actions being addressed?
Attached is a sample incident response template that I've used.
Incident Analysis Report
Time of Incident:
Time of Recovery:
Date Issue First Identified:
Incident Report Prepared By:
I. Description of Incident:
II. AWS Statement
2:40 AM PDT We are investigating connectivity issues for EC2 in the US-EAST-1 region.
3:03 AM PDT Between 2:22 AM and 2:43 AM PDT internet connectivity was impaired in the US-EAST-1 region. Full connectivity has been restored. The service is operating normally.
III. Business Impact: Customers were frustrated because the ACME website was inaccessible.
VI. Event Timeline:
VII. Lessons Learned:
- We need to know the business impact for each server on Amazon and put DR policies and procedures in place for outages. We could also leverage the California EC2 Cloud to potentially help with outages that affect only Virginia.
VIII. Action Items:
1. Called EC2; they will let us know what they find out about the issue (INFRA-123)
2. Identify what we can and can’t do if EC2 goes down (INFRA-345)
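The metrics suggested in the best practices above (are incidents happening less often, are follow-up actions being addressed) can be computed directly from the template's Time of Incident and Time of Recovery fields. Here is a rough Python sketch, assuming each report has been entered as a record; the `Incident` class, its field names, and the helper functions are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Incident:
    """One incident report (hypothetical record mirroring the template fields)."""
    time_of_incident: datetime
    time_of_recovery: datetime
    open_action_items: int = 0  # follow-up tickets that are not yet closed


def incidents_per_month(incidents: List[Incident]) -> Dict[str, int]:
    """Count incidents per calendar month, keyed as 'YYYY-MM'."""
    counts: Dict[str, int] = defaultdict(int)
    for incident in incidents:
        counts[incident.time_of_incident.strftime("%Y-%m")] += 1
    return dict(counts)


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """Average time between the start of an incident and its recovery."""
    if not incidents:
        return timedelta(0)
    total = sum(
        (i.time_of_recovery - i.time_of_incident for i in incidents),
        timedelta(0),
    )
    return total / len(incidents)


def open_follow_ups(incidents: List[Incident]) -> int:
    """Total number of follow-up actions still waiting to be addressed."""
    return sum(i.open_action_items for i in incidents)
```

Comparing `incidents_per_month` for this month against the previous one answers the first question in the list; a non-zero `open_follow_ups` that never shrinks is a sign that the follow-up tickets are being ignored.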
Thursday, November 8, 2012
Icecast is a server program used to stream in MP3 or Ogg Vorbis formats, and it is very popular in the Internet radio community. Many CDNs, including Limelight, use it to stream live MP3 streams. I've been browsing the web for typical vulnerabilities afflicting Icecast. It looks like the trend is positive. According to CVEdetails, the last vulnerability in the database dates from 2007, and the number of vulnerabilities has been declining:
Vulnerabilities by year:
- 2001: 5
- Later years: 2, 3, 2, 1
Vulnerabilities by type:
- Denial of Service: 5
- Execute Code: 7
- Directory Traversal: 2
- Bypass Something: 1
 Illustration from http://livestream123.com/wp-content/uploads/icecast.jpg
"Registered participants receive a unique PIN code to access the audio conferencing portion of corporate events hosted on the Cinchcast platform. Event participants do not have to wait on hold to be screened by operators prior to entering events. In addition, for users who may attend multiple corporate events (Employee Town Halls, Team Meetings, Earnings /Analyst Calls), once an individual has registered on the Cinchcast platform, their unique PIN code will always be the same." 
Now you no longer have to guess who is on the call because names of attendees are displayed in our studio. You will see in real-time the number of listeners on the web and callers on the phone.
Our player is HTML5 compliant and requires no browser plugins, works over regular HTTP port 80 so you don't need to poke holes in a firewall, and requires minimal bandwidth (15x-20x less than a video stream). So it turned out to be a great product.
If you are interested in trying it out, please drop us a line: http://cinchcast.com/contact/
There exist dev, qa, and staging branches. All development starts locally and then gets merged into the dev branch. After testing, the QA team merges it into the qa branch. Finally, when the code is ready to be released, it gets merged into the staging branch:
When we work on new releases, we follow one of two approaches:
1. Release branches. A separate branch is created for each release.
For example, FOO_3_1_2 branch would be created for all work done on release 3.1.2 of the FOO project.
2. Feature branches. A separate branch is created for each large component. Typically these components require isolated testing, and are merged into the main branch only at the end. The naming convention is AY_MODULE, where AY is the developer's initials and MODULE is the name of the component.
All new branches are created off the staging branch, which should mimic the code that's running in production.
Any urgent hotfixes are typically made directly on the staging branch, and then backported into the other branches.
Any load testing or security analysis is typically done during the QA stage, when the code has been merged into the qa branch. We have a variety of scanners running 24x7 against our qa and production environments, such as McAfee Secure scanning for dynamic security vulnerabilities and NewRelic continuously checking the performance of the application. If any issues are found, then the code is rolled back and cannot go into Production.
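The branch naming conventions and the dev, qa, staging promotion order described above can be written down as a few small helpers. This is only an illustrative Python sketch; the function names are hypothetical, and upper-casing the result is an assumption based on the FOO_3_1_2 and AY_MODULE examples.

```python
import re
from typing import Optional

# Promotion order described above: code moves from dev to qa to staging.
PROMOTION_ORDER = ["dev", "qa", "staging"]


def release_branch_name(project: str, version: str) -> str:
    """Build a release branch name, e.g. project FOO, version 3.1.2 gives FOO_3_1_2."""
    if not re.fullmatch(r"\d+(\.\d+)*", version):
        raise ValueError(f"unexpected version format: {version!r}")
    return f"{project.upper()}_{version.replace('.', '_')}"


def feature_branch_name(initials: str, module: str) -> str:
    """Build a feature branch name from developer initials and a component name."""
    return f"{initials.upper()}_{module.upper()}"


def next_branch(current: str) -> Optional[str]:
    """Return the branch that code is promoted to next, or None after staging."""
    index = PROMOTION_ORDER.index(current)  # raises ValueError for unknown branches
    if index + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[index + 1]
    return None
```

Helpers like these are mostly useful in build scripts, where a mistyped branch name would otherwise go unnoticed until a merge lands in the wrong place.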
Note: We are always looking to hire great software engineers. So if you are one, and are looking for an exciting environment to work at, email us at firstname.lastname@example.org
Wednesday, November 7, 2012
I tried it out and within 30 minutes learned that:
- most of my emails have between 1 and 100 words (I do like to cut right to the point)
- I get a lot of emails (already knew that)
- I respond to 15% of my emails in under 5 minutes (now that's scary)
- only 59% of emails are addressed directly to me
- the number of emails I send spikes after 6pm (logical with two little kids in the house)
Overall, GMailMeter seemed like a very useful tool and I recommend that everyone try it.
Now I just need to figure out what to do with these statistics.
In the past month:
- 660 were important
- 47 have been starred
- I have started 20.29% of them and have replied to 6.12% of the others
- 2487 emails received, from 580 people
- 59.07% were sent directly to me
- 739 emails sent, to 138 people
Saturday, November 3, 2012
2. 'Don't chase hot technologies of the day'.
Friday, November 2, 2012
Thursday, November 1, 2012
It's a touching documentary about an 85-year-old sushi chef, Jiro Ono, and his quest for perfect sushi. His hole-in-the-wall restaurant possesses the coveted 3-star Michelin rating because of his attention to detail, love for his work, and constant striving for perfection.
A great chef generally has the following five attributes. First, they take their work very seriously and consistently strive to perform at the highest level. Second, they aspire to continually improve their skills. To be better today than yesterday. To be better tomorrow than today. Third, cleanliness. If the restaurant doesn’t feel clean, the food isn’t going to taste good. The fourth attribute is impatience. They are not prone to collaboration. They’re stubborn and insist on having things their own way. What ties these attributes together is passion. That’s what makes a great chef.