Wednesday, May 22, 2013

Why is it important to follow "open standards"

An open standard is a standard that is publicly available and has various rights to use associated with it, and may also have various properties of how it was designed (e.g. open process). There is no single definition and interpretations vary with usage. --- Wikipedia

Why do I promote "open standards" in my team?

1. To avoid reinventing the wheel.

If a public solution works well enough that many people have already adopted it, there's no reason to rebuild it.  For example, the major programming languages already ship implementations of merge sort or quicksort.  I would not want my developers to rewrite those sort algorithms other than during interviews.

2. Free test coverage.
Open standards are usually used by many individuals and organizations.  If I write a library function that closes file handles "silently", I have to write a lot of unit tests.  The test team will have to write a lot of functional tests.  We all have to run tests in multiple environments (Linux, Mac, Windows, etc.).  We also need to create many stress test cases to cover failures.  Apache Commons IO already offers this function, and it has been tested and used by many developers.  I don't need to test it as much.

3. Open standards usually work well with each other.
Unlike a "proprietary standard", an "open standard" is meant to be shared.  Therefore, many technologies built on open standards work very similarly.  Recently, one of my devs ported a traditional web services application over to a smart compute grid infrastructure.  He changed fewer than 500 lines of code, and that was only possible because both the original web services container and the compute grid share the same "open standard".

4. Open standards expand one's horizon.
A lot of developers who have only worked with certain proprietary technologies are fairly narrow-minded.  I have seen enough .NET developers who believe that every data storage problem should be solved by SQL Server.  And many C#-only developers do not know IoC (inversion of control).  SQL Server is a solid relational DB, and C# is a useful language.  However, there are plenty of alternative open technologies that are equally good, or even better in certain cases.

5. Knowing open standards makes you more marketable.
For all the reasons above, and from my previous post, everyone should learn open standards to make themselves more marketable.

Sunday, May 5, 2013

Test in Production!

Does anyone have a problem with the picture above?  I certainly don't!  However, I'll add this: "when I test on production, I test it carefully".

In software service development, testing is critical.  Services often go through multiple tiers of testing environments before the bits are finally released to the customer.  But why can't we do one-stop testing directly on production?

The typical answer to this question is "NO, YOU CANNOT IMPACT CUSTOMERS!".   This is because the traditional practice of "testing on production" is to take a slice of the production traffic and route it to a pre-release version of the service that sits behind the VIP alongside the current version services.  This way, only a small portion of customers are impacted.  But some customers ARE impacted!

This is where "port forwarding" comes in handy.  To ensure that NO customers are impacted at all, we can "clone" a slice of the production traffic and send the copy to a pre-release version running alongside the current version services.  The cloned traffic is one-way only: it goes into the pre-release version service, but the response never goes back to the customer.  This way, NO customers are impacted by the behavior of the "un-tested" pre-release service.

What do we get out of this?

  • Free stress test!  Instead of trying to set up a stress testing environment that simulates production volumes and request patterns, you will have a service running in production, getting production requests and throughput.
  • Free regression test! If you compare the responses from the current version service and the pre-release version service, you get yourself a simple regression test suite.
  • Frequent regression test!  Hook this into your favorite CI framework and you get to run regression tests 24/7, given that you have production traffic 24/7.
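
The regression comparison in the list above can be sketched in a few lines. This is only an illustration under assumptions: `compareResponses` and the `{ status, body }` response shape are hypothetical names picked for this sketch, and real responses usually need noisy fields (timestamps, request IDs) stripped before they are compared.

```javascript
// Hypothetical helper: compare the response from the current version with
// the response the pre-release version produced for the same cloned request.
// Both arguments are assumed to look like { status: 200, body: "..." }.
function compareResponses(current, candidate) {
  var differences = [];
  if (current.status !== candidate.status) {
    differences.push("status: " + current.status + " vs " + candidate.status);
  }
  if (current.body !== candidate.body) {
    differences.push("body differs");
  }
  return differences; // an empty array means no regression was detected
}
```

A CI job could fail the build whenever the returned array is non-empty.
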
There are multiple ways to do "port forwarding".  Advanced load balancers have built-in features to clone a subset of requests.  If you are running Linux, you can just configure iptables to forward specific ports.  Since I'm not a sysadmin, I prefer software solutions that I can modify and manage.

Node.js has a nice plugin, node-proxy.  You can run a proxy service that bridges traffic to a "target" service and a "forwarding" service.  The target service is the one that handles real traffic; its responses are sent back to the customer through the proxy.  The forwarding service is one-way only.  It gets the same requests as the target service, but never returns anything back to the customer.  With this setup, you can TEST on PRODUCTION!
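
Here is a rough sketch of the target/forwarding idea. It deliberately uses only Node's core `http` module rather than the node-proxy API, so it should not be read as how that plugin works. The names `forwardOptions` and `createCloningProxy`, and the host/port values in the usage comment, are assumptions made for this illustration.

```javascript
var http = require("http");

// Build the options for http.request() that replay an incoming request
// against another host. The destination is supplied by the caller.
function forwardOptions(request, destination) {
  return {
    hostname: destination.hostname,
    port: destination.port,
    method: request.method,
    path: request.url,
    headers: request.headers
  };
}

// Create (but do not start) a proxy that sends every request to `target`
// and a copy to `shadow`. Only the target's response reaches the customer.
function createCloningProxy(target, shadow) {
  return http.createServer(function (request, response) {
    var toTarget = http.request(forwardOptions(request, target), function (targetResponse) {
      response.writeHead(targetResponse.statusCode, targetResponse.headers);
      targetResponse.pipe(response);
    });
    toTarget.on("error", function () {
      if (!response.headersSent) {
        response.writeHead(502);
      }
      response.end();
    });

    var toShadow = http.request(forwardOptions(request, shadow), function (shadowResponse) {
      shadowResponse.resume(); // read and discard: the clone is one-way
    });
    toShadow.on("error", function () {}); // shadow failures never reach the customer

    // Send the request body to both destinations.
    request.pipe(toTarget);
    request.pipe(toShadow);
  });
}

// Usage (hypothetical ports):
// createCloningProxy({ hostname: "localhost", port: 8001 },
//                    { hostname: "localhost", port: 8002 }).listen(8000);
```

One limitation of this sketch: because the request body is piped to both services, a slow pre-release service can slow down the upload to the real one. A production setup would buffer or sample the cloned traffic instead.
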

Sunday, December 23, 2012

Create FilterChain in node.js

I often try to learn something interesting (mostly programming related) whenever I get a long break from work.  Last Thanksgiving, I wrote an iPhone app that syncs photos among S3, Flickr and Facebook.  This Xmas, I took on writing my first node.js app.

I used as a starting point.  5 minutes into the tutorial, I encountered a strange problem.  For every request sent from the Chrome browser to my node.js server, my service recorded TWO requests.  A quick Google search indicated that Chrome ALWAYS sends an additional "/favicon.ico" request to an HTTP server if it cannot locate an icon for that server.

This was really annoying because it messed up my global debugging counter.  The solution was simple: just ignore all requests for "/favicon.ico".  But "SIMPLE" solutions are no fun, especially while learning.  If I were to do this in Java, I'd use a ServletFilter to "preFilter" out all unqualified requests.  So I put my JavaScript and Java skills to the test, and wrote this simple FilterChain function in node.js.  Enjoy!

var http = require("http");

/**
 * Manage all filters.
 */
var filterChain = {
  filters: [],
  add: function(filter) {
    this.filters.push(filter);
  },
  applyAll: function(request, response) {
    return this.apply(request, response, 0);
  },
  apply: function(request, response, i) {
    // every preFilter passed: process the request itself
    if (i === this.filters.length) {
      return processRequest(request, response);
    }
    var filter = this.filters[i];

    // call preFilter and exit if it fails
    console.log(filter.name + ".preFilter");
    var success = filter.preFilter(request, response);
    if (!success) {
      return false;
    }
    // call the next filter and exit if it fails
    success = this.apply(request, response, i + 1);
    if (!success) {
      return false;
    }
    // call postFilter and exit
    console.log(filter.name + ".postFilter");
    success = filter.postFilter(request, response);
    return success;
  }
};

/**
 * Filters
 * All filters must implement 3 things:
 * name - String unique name for this filter
 * preFilter() - Executed before processing request
 * postFilter() - Executed after processing request
 */
var faviconFilter = {
  name: "favicon",
  preFilter: function(request, response) {
    if (request.url === '/favicon.ico') {
      response.writeHead(200, {'Content-Type': 'image/x-icon'});
      response.end(); // finish the response so the browser is not left waiting
      console.log('favicon request, filtered out!');
      return false;
    } else {
      return true;
    }
  },
  postFilter: function(request, response) { return true; }
};

var latencyFilter = {
  name: "latency",
  timer: null,
  preFilter: function(request, response) {
    this.timer = process.hrtime();
    return true;
  },
  postFilter: function(request, response) {
    var diff = process.hrtime(this.timer);
    console.log("<%s>%ds%dns", request.url, diff[0], diff[1]);
    return true;
  }
};

// Register the filters. This order is an assumption; it drops favicon
// requests before they are timed.
filterChain.add(faviconFilter);
filterChain.add(latencyFilter);

/**
 * This is the actual function that processes the request
 */
function processRequest(request, response) {
  response.writeHead(200, {"Content-Type": "text/plain"});
  response.write("Hello World");
  response.end();
  console.log("Response sent.");
  return true;
}

function onRequest(request, response) {
  filterChain.applyAll(request, response);
}

// Port 8888 is an assumption: the listing as published lost the line that
// starts the server.
http.createServer(onRequest).listen(8888);
console.log("Server has started.");

Thursday, July 26, 2012

How Marketable Are You?

Programming Job Market Comparison Based on Data

Database Job Market Comparison Based on Data

Tuesday, June 5, 2012

MVC now and then.

The history of MVC can be traced back to the early 80's. It was a key component of Smalltalk.

In the past 10 years, MVC has become a standard way to write web applications. Take Java for example: the usage of Struts, a popular Apache project for writing web applications, reached its peak in 2005. I still remember that everyone with Struts experience on their resume would easily get interviews around that time.

Fast forward to 2007/2008: RoR became a mainstream MVC framework, partially because it was made available on the Mac.

With the latest hype around HTML5 and JavaScript, a new breed of browser-level MVC frameworks has emerged.  The general concept is to treat client browsers as full "applications" instead of just frontend UIs or views.   AJAX/JSON is the new model, HTML and CSS are the view, and JavaScript is the controller.  There are several popular JS-level MVC frameworks, such as SproutCore and Backbone.js.  I haven't had a chance to play with them.  But in all the frontend projects that I've done in the past 12 months, I definitely tried to push the model and controller all the way to the browser level.

Another revolution is that, with MVC moving to the browser, the backend data layer has also evolved into its own MVC.  Backend data is still the model, JSON is the new view, and Java/PHP/C# or other backend logic is the controller.
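
To make the browser-level split concrete, here is a tiny framework-free sketch. It is not how SproutCore or Backbone.js structure things. The names `model`, `renderList` and `addItem` and the sample data are made up for illustration, and the view returns a string so the sketch also runs outside a browser.

```javascript
// Model: plain data, as it would arrive from the backend as JSON.
var model = JSON.parse('{"items": ["struts", "rails"]}');

// View: turns the model into markup. In a page the returned string would be
// assigned to an element. No HTML escaping is done in this sketch.
function renderList(model) {
  return "<ul>" + model.items.map(function (item) {
    return "<li>" + item + "</li>";
  }).join("") + "</ul>";
}

// Controller: reacts to user input, updates the model, re-renders the view.
function addItem(model, item) {
  model.items.push(item);
  return renderList(model);
}
```

On the backend side of the same picture, the JSON string passed to `JSON.parse` plays the role of the "view" that the server-side controller produces.
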


There has been a cross-organization initiative to define and commit to TP99-based SLAs.  Looking back at the post I did last Sept, I really wanted my team to understand our SLAs, and to communicate with clients using proper SLAs and monitoring tools.

Before this initiative, most of the teams tracked their performance (latency) using average processing time.  The problem is that if performance has a large variance, poor latency is hidden by the mean or median.  Max latency is also not very relevant.  Imagine a Java-based service that does a GC once every 10 minutes and a full GC once every few hours: max latency only reflects the worst latency during GC.

That's why "top percentile" or TP-based latency makes more sense.  When you have 100 requests to your service, you can sort all the request times in ascending order.  The 99th request in the list is your TP99: 99% of the requests completed at least that fast.
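
The sort-and-pick rule above is the nearest-rank percentile, and it fits in a few lines. This is a minimal sketch; the function name `percentile` is chosen for illustration.

```javascript
// Nearest-rank percentile: sort ascending and take the value at
// rank ceil(p * n / 100), counting from 1.
function percentile(latencies, p) {
  if (latencies.length === 0) {
    throw new Error("no latency samples");
  }
  var sorted = latencies.slice().sort(function (a, b) { return a - b; });
  var rank = Math.ceil(p * sorted.length / 100);
  return sorted[Math.max(rank, 1) - 1];
}
```

For 100 requests where 98 take 10 ms and 2 take 1000 ms, the mean is 29.8 ms, while `percentile(samples, 99)` reports 1000 ms. That is exactly the kind of tail the average hides.
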

To design TP99 SLAs, you need to keep a few things in mind:
  1. Define a time span -- You have to get latency data for every single request, sort it and find the 99th percentile data point.  So you want to have a reasonable amount of data collected for the time span.  If you have a low volume service which gets fewer than 100 requests per 10 minutes, you do not want to define a 10-minute SLA.  If you do, a single slow request becomes your TP99, so you'll hit all the worst cases.  If you have a very large volume service, you don't want to define a daily TP99, because you'll end up hiding the real problem.
  2. Watch out for the extra cost of logging -- In order to calculate TP99, you have to log every single request.  If the logging system is not designed properly, you might actually degrade the overall system performance by logging too much.  My recommendation is to truly separate the core business logic from the operation/system level logic, so the application doesn't have to worry about logging or calculating.  I've seen an existing solution that logs into SQL Server.  Perf data has little or no relational structure, so writing it to SQL Server just doesn't make any sense.  I recommend simply writing to a local file or a local service, and doing the data aggregation offline or during off hours.
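
The "write locally, aggregate offline" recommendation could look roughly like this. It is a sketch under assumptions: the one-line-per-request format, the file path passed by the caller, and the names `formatRecord`, `logLatency` and `tp99PerWindow` are all invented for illustration.

```javascript
var fs = require("fs");

// One line per request: "<epoch millis> <latency millis>".
function formatRecord(timestampMs, latencyMs) {
  return timestampMs + " " + latencyMs + "\n";
}

// Called by the service: a cheap asynchronous append, no calculation.
function logLatency(logPath, latencyMs) {
  fs.appendFile(logPath, formatRecord(Date.now(), latencyMs), function (err) {
    if (err) { console.error("latency log write failed: " + err.message); }
  });
}

// Offline step: group logged lines into fixed time spans and compute the
// nearest-rank TP99 of each span, keyed by the span's start time.
function tp99PerWindow(lines, windowMs) {
  var windows = {};
  lines.forEach(function (line) {
    var parts = line.trim().split(" ");
    if (parts.length !== 2) { return; } // skip blank or malformed lines
    var start = Math.floor(Number(parts[0]) / windowMs) * windowMs;
    (windows[start] = windows[start] || []).push(Number(parts[1]));
  });
  var result = {};
  Object.keys(windows).forEach(function (start) {
    var sorted = windows[start].sort(function (a, b) { return a - b; });
    var rank = Math.max(Math.ceil(99 * sorted.length / 100), 1);
    result[start] = sorted[rank - 1];
  });
  return result;
}
```

The aggregation step would run from a scheduled job over the contents of the log file, for example `tp99PerWindow(fs.readFileSync(logPath, "utf8").split("\n"), 600000)` for 10-minute spans.
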

Tuesday, September 13, 2011

Update on the New Job

Time flies. All of a sudden, I'm well into the fourth week of my new job. Here is a quick update.

Changes are difficult, and a job change is no exception. I have switched jobs many times in the past, and I found that coming in as a dev manager is especially difficult. Here is a list I made for myself before I took the job.
  • Technology
  1. Code base
  2. Infrastructure (system and hardware)
  3. Production SLA and monitoring

  • Process
  1. Development life cycle
  2. Deployment process
  3. Troubleshooting process
  4. Support and escalation
  5. Any relevant company policies

  • Domain knowledge
  1. High level business logic for all key components
  2. Product/service in relation to revenue

  • My team
  1. Skill set
  2. Career goals
  3. Interests

  • Management
  1. Who has influence over my team, directly or indirectly

  • Peer teams
  1. Whom my team needs to work with
  2. History between my team and peer teams

Checking my progress against the list, I still have a long way to go.

Tuesday, August 23, 2011

New Beginning

Startup is exciting! Startup is hard! Startup is crazy!

After spending the past 4 years at Livemocha, I'm finally taking a break from the startup world. Yesterday, I started my new job at Expedia.
It's a new company, a new team, new technology stacks, and new business domains. It's going to be exciting, it's going to be hard, and it's also going to be crazy!

Looking forward to the new beginning!

Monday, August 1, 2011

More on Amazon Cloud

I recently started working more closely with Amazon Cloud. I got the opportunity to play with Elastic Load Balancer and Relational Database Service. Here are some thoughts:

  • ELB just works. It's very similar to the Amazon internal LB tool that I have used in the past, at least from the self-managing point of view.
  • I need to figure out how to add CNAME to ELB.
  • You can only attach each EC2 instance to one ELB, not multiple ELBs. This is limiting. For example, if I want to configure virtual hosting in Apache so that the same EC2 instance serves two different web instances, I simply can't do it.
  • You can only do direct URL mapping ( => internal.server1/eft). You can't rewrite URLs on the LB level. Mod_rewrite is your friend. :P

  • RDS is very easy to set up. However, linking EC2 to RDS is not as straightforward as it should be. You have to remember the name of the security group that you want to grant access to. Why not just use a dropdown list?
  • The setup process only supports one root admin account. I'm sure that once you create your database instance, you can create more users. But it's an extra step. You also can't create access rules for different users.

Wednesday, July 28, 2010

To "cloud" or not to "cloud"

Lately I have interviewed many candidates for our engineering positions. A common question from the interviewees is always "why don't you host your service on the cloud"?

This is actually a question that we often ask ourselves. We love the idea of hosting all of the services on the cloud, so we don't have to manage hardware. But why haven't we done so?
  1. We started Livemocha in early 2007. Cloud computing wasn't mature enough. The only cloud service out there was Amazon S3. We simply could not set up an entire DC on the cloud.
  2. Better control of hardware specs. Most cloud computing services use VMs. We still can't have full control of the hardware spec: number of CPUs, size of hard disk, speed of hard disk, memory size, etc.
  3. NFS solution. S3 is the most mature cloud file storage system. Up till today, it still can't replace the good old simple NFS.
    1. S3 can be mounted to multiple EC2 instances, but it's slow. You can't stream data to S3 drives.
    2. There is no good solution to back up S3 data. With traditional NFS, we can use either hardware or software solutions to back up an entire disk in real time.
  4. No LB support. Amazon just started offering LB last year. But its LB configuration is very simple. There is not much you can do besides simple round robin load balancing. We use an F5 LB, which can be configured to do hardware-based HTTPS acceleration, reverse proxying, and dynamic caching.
Here is a list of things that we do use on the cloud:
  1. EC2 computing on demand. If we want to generate tons of PDFs or video, we request new instances of EC2 and schedule jobs there.
  2. S3 as secondary storage. We keep a copy of all user data on our NFS, then transfer duplicates to S3.
  3. CloudFront. CloudFront is awesome. It's cheap, and it's fast.
  4. SQS. We have more than 1000 queues running in SQS. They are persistent, and delivery is guaranteed.