(This article originally appeared on High Scalability website a few months back: http://highscalability.com/blog/2012/7/16/cinchcast-architecture-producing-1500-hours-of-audio-every-d.html)
- Over 50 million page views a month
- 50,000 hours of audio content created
- 15,000,000 media streams
- 175,000,000 ad impressions
- Peak rate of 40,000 concurrent requests per second
- Many TB/day of data stored in MSSQL, Redis, and ElasticSearch clusters
- Around a 100 hardware nodes in production.
- Production website is run from the data center in Brooklyn. We like to control our own destiny instead of relegating data to the cloud.
- Amazon EC2 instances are used mostly for QA and Staging environments.
- About 50 web servers
- 15 MS SQL database servers
- 2 Redis NOSQL key value servers
- 2 NodeJS servers
- 2 servers for elastic search cluster
- .NET 4 C# : ASP.NET and MVC3
- Visual Studio 2010 Team Suite as an IDE
- StyleCop, Resharper for enforcing code standards
- Agile development methodology, with Scrum used for large features and Kanban taskboard for smaller tasks
- Jenkins + Nunit for testing and continuous integration
- Sauce On Demand – Selenium for automation testing
Software And Technologies Used
- Windows Server 2008 R2 x64: Operating System
- SQL Server 2005 running under Microsoft Windows Server 2008 Web Server
- Equalizer load balancers: for load balancing
- REDIS: used as the distributed caching layer and for message pub-sub queue
- NODEJS for real-time analytics and updating studio dashboard
- ElasticSearch : for show search
- Sawmill + custom parser scripts: for log analysis
- NewRelic for performance monitoring
- Chartbeat for impact of performance on KPI (conversions, page views)
- Gomez, WhatsupGold, Nagios for various alerting
- SQL Monitor: from Red Gate - for SQL Server monitoring
- “Be brief, be bright, be gone” : Respect another person’s time. Don’t come with problems, come with solutions.
- Don’t go chasing hot technologies of the day. Instead ‘mitigate your top problems’. We adopt new technologies but do so, when the business case requires it. Appetite for Production outages decreases significantly when you have millions of users.
- Achieve “essential”, then worry about “excellent”.
- Be a “how team” instead of a “no team”.
- Build security into the software development lifecycle. You need to train developers on how to write secure software and make it a business priority from the start.
- Separate cluster of web servers is used to serve requests for regular users and requests for ad users, differentiated by a cookie.
- We are moving towards a service-oriented architecture where key pieces of the system, such as search, authentication, caching, are RESTFUL services implemented in various languages. These services also provide a caching layer.
- REDIS NOSQL key-value store (redis.io) is used as a cache layer before database calls.
- Scaleout is used to maintain a session state across a garden of web servers. However, we are considering switching onto REDIS.
- Text search in SQL server database doesn’t work well. It was clogging up the CPU so we switched to ElasticSearch (a Lucene derivative).
- The built-in session module by Microsoft is prone to deadlocks, so we ended up replacing it with AngiesList session module, storing data to REDIS.
- Logging is key to detecting problems.
- Reinventing the wheel can be a good thing. For example, initially we used a vendor product for bundling JS/CSS together which started causing performance issues. We then rewrote bundling ourselves, and significantly improved performance of our site.
- Not all data is relational, so database isn’t always a good medium. A good analogy is “Imagine you have water flowing down the pipe. The pipe is wide at the top but gets narrow towards the bottom.” The top is the web servers (there are many of them), the bottom is the databases (there are few and they get clogged up).
- Not using metrics in your development process is like trying to land a plane in a storm with your altimeter not working. Throughout your development process, compute metrics such as site throughput, time to fix Blocker/Critical bugs, code coverage and use them to gauge your performance.