Saturday, October 15, 2005
Scalability Story
Here at SourceGear, I wear several different hats. One of
my roles is that I'm "the guy who talks to the prospective Vault customers with
large teams". If you've got a team of 25 people, you can simply come to our
online store and buy the product. But if you want to use Vault for a team of
300 people, I want to have a conversation with you.
Don't call me a Sales Guy or any other
vulgar names. I don't make cold calls. I don't pester anyone. I just make
myself available to talk with larger customers who are trying to make a
decision about Vault. I like to understand the customer's situation, make sure
Vault is the right choice, and help ensure a smooth transition.
#ifdef marketing_digression
Sometimes I say "no". Earlier this year, I had a
conversation with a prospective Vault customer who wanted to put several thousand
users on a single Vault server. In this case, saying "no" was the right
decision. That was a perfect example of a customer we don't want.
Right now you might be thinking how ridiculous that sounds. Don't
we want every customer we can get?
Actually, good marketing usually results in a clear
understanding of which customers you want and which customers you don't. It's
very difficult to be the best product for your target niche if you are not
genuinely prepared to say "no" to everybody who is outside that niche. If you cannot
easily identify the customers that you don't want, then you probably need to
give your strategy more thought.
Consider McDonald's. They know what they do, and they do it
very well.
Suppose a customer comes in to McDonald's and says, "I'd
like a burger and fries and a pint of Guinness and I want to be seated at a
table in the smoking section with a big-screen TV so I can watch the game."
Does the manager worry about turning away a customer? Does he go back to
his office and wonder whether the entire McDonald's strategy is all wrong?
Certainly not. He simply tells the customer no: "That's
not what we do here."
And so it is with SourceGear as well. Our products are
designed to meet the needs of the software teams in the professional segment,
extending up somewhat into the small enterprise. Most of our sites have fewer
than 100 developers. Our larger sites have teams numbering in the hundreds,
not in the thousands. Customers with 5,000 developers tend to have a very
different class of needs. That's not what we do here.
#endif
Anyway, since I am often talking to our bigger customers, I
am often thinking about Vault's level of scalability.
Scalability
I define "scalability" as "the ability of a software system
to cope as the size of the problem increases".
Note that scalability and performance are related, but
different. Explaining these terms in the language of calculus, if performance
is a function, scalability is about the first derivative of that function. A
piece of software can be very scalable, even if its performance is poor for
small values of n. For example, I've heard people describe Exchange that way.
Apparently an Exchange server with 5 users is kind of slow, but an Exchange
server with 5,000 users is no slower. (I've never used Exchange so I can't
really say if that's true.)
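One way to write the calculus analogy down, as a sketch only: assume a hypothetical function T(n) giving the time an operation takes when the problem size is n. Neither T nor n is measured anywhere in this article; they are just notation for the idea.

```latex
\[
\text{performance} \;\sim\; T(n),
\qquad
\text{scalability} \;\sim\; \frac{dT}{dn}
\]
\[
T(5) \approx T(5000)
\;\Longrightarrow\;
\frac{T(5000) - T(5)}{5000 - 5} \approx 0
\]
```

The second line is the Exchange anecdote restated: T(5) may be large (slow for a few users), but the average slope between 5 and 5,000 users is close to zero, so the system scales well.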
When I say "the size of the problem", I am speaking very
generally. Continuing my notion of a mathematical curve, scalability can
involve different variables on the X axis. A product might scale very well for
large numbers of users but very poorly for large quantities of data. A
database might scale very well for large numbers of rows but very poorly when
the individual rows are large.
Finally, let's not confine scalability to things which can
be measured with a stopwatch. Sometimes scalability problems are a bit more
qualitative. For example, the Vault Admin tool presents the list of users in a
regular Windows listbox control. That's fine with 100 users, but it's not
exactly the right UI for a system with 5,000 users.
Scalability Issues for Version Control Tools
Thousands of teams are using Vault, and the vast majority of
them are very happy. However, we've had a few customers over the years who
have struggled with Vault, and scalability has been a common theme. It turns
out that for a version control system, scalability can bring some very
challenging problems.
Simple things become grotesque. For example, one of the
fundamental things a version control system must do is figure out if the user
has made any modifications to their local copies of the files. Sounds simple,
right? It mostly is. But if you implement a simple solution and test it on a
tree with 100 files, you will probably find later that it doesn't work quite so
well when somebody tries it on their tree with 25,000 files.
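To illustrate why the simple approach stops being simple, here is a minimal sketch in Python. It is not how Vault detects changes; the function names and the size-plus-timestamp shortcut are assumptions chosen for illustration. The naive idea is to re-read and hash every file, which is fine for 100 files and painful for 25,000. The sketch skips the expensive read whenever a file's size and modification time match a recorded baseline.

```python
import hashlib
import os

def file_hash(path):
    """SHA-1 of the file contents, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def snapshot(paths):
    """Record (size, mtime_ns, hash) for each file, e.g. right after a get."""
    baseline = {}
    for path in paths:
        st = os.stat(path)
        baseline[path] = (st.st_size, st.st_mtime_ns, file_hash(path))
    return baseline

def modified_files(baseline):
    """Return the paths whose contents differ from the baseline.

    Assumption: a file with unchanged size and mtime is treated as
    unmodified and is not re-read. That trades a little certainty for
    a large saving on big trees.
    """
    changed = []
    for path, (size, mtime_ns, digest) in baseline.items():
        st = os.stat(path)
        if st.st_size == size and st.st_mtime_ns == mtime_ns:
            continue  # cheap check passed; skip the expensive read
        if file_hash(path) != digest:
            changed.append(path)
    return changed
```

With this shape, the cost of a status check is one `os.stat` per file plus one full read per file that looks touched, rather than one full read of every file in the tree.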
Back in the days of Vault 1.0, we were courting a certain
banner name customer who was interested in migrating from SourceSafe to Vault.
We spent a lot of time and effort, including trips to their location. In the
end, we lost the deal, largely because our product just didn't work very well
for a team their size. This team had 75 developers working on a tree with
around 70,000 files. Today, Vault 3.1 can usually handle problems like that
without breaking a sweat.
The thing about source control is that there are so many
different scalability factors to consider. Some people have lots of users.
Some have lots of files. Some have really big files. Some teams lock
thousands of files even if they don't plan to edit them. The variable on the X
axis varies from customer to customer.
We have worked to make our product more scalable with every
release. In fact, Vault 3.1 was a major step forward from Vault 3.0. Since I
played a part in this, I would like to tell the story of how it happened. I'm
mostly just a marketing guy these days, so on the all-too-rare occasion that I
break out Visual Studio for some development work, I want to talk about it. :-)
We love automated testing
We are big believers in automated testing. We've got unit
tests and "smoke tests" that run on every build. We've got a test repository
with 500,000 files in it. We've got tests that randomly retrieve old versions
to make sure they're still okay. We have a "Combinatorial Test" which randomly
performs weird operations and finds ridiculously arcane edge cases. We like
automated test apps. Every so often we write another one.
During the 3.1 development cycle, I was feeling bored and a
little ornery. I didn't have much involvement in the 3.1 code, and I was
itching to use a compiler, so I took some time and wrote a test application we
call "Crowd Test". My goal was to exercise the Vault server under seriously
heavy load from many simultaneous users. I wanted Crowd Test to be sadistic
and cruel and unfair.
How it Works
Crowd Test is a Vault client which runs in an infinite loop that
looks something like this:
1. Randomly choose an action to perform.
2. Sleep for a random amount of time.
3. Go back to step 1.
The list of known actions includes most everything that can
be done to a Vault server. For example, there are actions to add files, create
folders, delete things, run history queries, and apply labels.
The algorithm which randomly selects an action uses weights
so that some actions are more likely to happen than others. The weights are
tuned to roughly simulate the usage pattern of real users. For example, modifying
a file is more likely to happen than deleting a folder or creating a branch.
We usually set the sleep time fairly low, averaging around
30 seconds between actions. Real users don't perform a source control
operation that often. However, with a low idle time between actions, when we run
a simple test with say 10 Crowd Test clients simultaneously, we are actually
burdening the server with a load that is higher than that which would be caused
by 10 human users.
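The loop, the weights, and the sleep can be sketched in a few lines of Python. This is not the Crowd Test source. The action names, the weight values, and the choice of an exponential distribution for the idle time are all assumptions made for illustration; the article only says that the weights favor common operations and that the sleep averages around 30 seconds.

```python
import random
import time

# Hypothetical actions and weights; the real Crowd Test values are not published.
ACTION_WEIGHTS = {
    "modify_file": 50,
    "add_file": 20,
    "history_query": 15,
    "create_folder": 8,
    "apply_label": 4,
    "delete_folder": 2,
    "create_branch": 1,
}

def choose_action(rng):
    """Pick one action name with probability proportional to its weight."""
    names = list(ACTION_WEIGHTS)
    weights = [ACTION_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

def random_sleep_seconds(rng, mean=30.0):
    """Random idle time averaging `mean` seconds (the distribution is an assumption)."""
    return rng.expovariate(1.0 / mean)

def run_client(perform, iterations, rng=None, sleep=time.sleep):
    """Run the choose/perform/sleep loop.

    The client described above loops forever; this sketch is bounded by
    `iterations` so that it can be run and tested.
    """
    rng = rng or random.Random()
    for _ in range(iterations):
        perform(choose_action(rng))
        sleep(random_sleep_seconds(rng))
```

Here `perform` stands in for whatever actually talks to the server. Passing `sleep=lambda s: None` makes a dry run finish immediately, which is handy for checking that the weights produce the intended mix of actions.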
The actions are also tuned to ensure that the size of the
repository is always increasing. The longer Crowd Test runs, the larger the
repository gets.
Each Crowd Test client records the elapsed time for every
action and plots the resulting data with simple line graphs. In this way, we
can get a visual depiction of just how the performance of each operation
changes as the repository grows. (BTW, for the plotting I used ZedGraph, which is very cool.)
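Recording the elapsed time per action might look something like the following Python sketch. The class and method names are hypothetical, and the plotting itself (done with ZedGraph in the real tool) is left out; the sketch only collects the series that a line graph would be drawn from.

```python
import time
from collections import defaultdict

class ActionTimer:
    """Collects elapsed wall-clock time per action, in the order recorded."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.samples = defaultdict(list)  # action name -> list of seconds

    def timed(self, name, func):
        """Run func(), record how long it took under `name`, and return its result."""
        start = self._clock()
        try:
            return func()
        finally:
            # Recorded even if func() raises, so failures still show up in the data.
            self.samples[name].append(self._clock() - start)

    def series(self, name):
        """(sample index, seconds) pairs, ready to hand to a plotting library."""
        return list(enumerate(self.samples[name]))
```

Because the samples are kept in order, a rising line for one action name is a direct picture of that operation slowing down as the repository grows.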
Here's a screen shot of a Crowd Test client:
[Image: screen shot of a Crowd Test client]
I spent a couple of weeks writing the initial version of Crowd
Test. All of my early test runs took place on my main development machine. I
remember running one Crowd Test client overnight, returning the next morning to
find 14 hours of results, nicely graphed. Everything was going exactly as
planned, so I was ready to try a test with multiple clients hitting a real
server. For my test server machine, I grabbed a dual-P3 with a gig of RAM, not
too beefy, not too wimpy.
I had high expectations. Nobody had ever tortured a Vault
server this badly before! I started a bunch of Crowd Test clients and went to
lunch. After lunch, I would start examining the graphs and begin
identifying all the tweaks which could be made to improve scalability.
The actual results were not at all what we expected. When I
designed Crowd Test to be sadistic, I succeeded.
Nightmare on Farber Street
The Vault server died before I ever got my chicken
sandwich. When I returned from lunch, I discovered that the server had gone
FUBAR not long after the test started. From that point on, it was still
listening for requests but it was immediately returning failure codes for
every action.
I stopped the test and started looking for some kind of
configuration problem. At first, I didn't really consider the possibility that
I had found a bug in the server. But the more I investigated, the more
horrified I became. Within a couple days, the truth was painfully clear:
Vault 3.0 had a terrible problem.
I recruited some help from Jeff (a Vault server developer)
and we got that bug fixed in reasonably short order. Embarrassingly, it was an
issue of thread safety. During the development of Vault 3.0, we did several
things to increase concurrency from 2.0. Apparently we left a couple of places
where the server wasn't thread-safe. Under typical usage, the problem rarely
if ever appeared, but Crowd Test created conditions which made this bug appear
in minutes. (This problem was fixed for 3.0 users with the 3.0.7 maintenance
release.)
We returned to the Crowd Test effort, assuming that we were
now ready to proceed with the fine tuning and optimization work. Once again,
things didn't work out the way we expected.
I'll spare you the play calling and give you the box score:
We dragged Ian (another Vault server developer) into our little team, and the
three of us spent most of April, May and June working with Crowd Test. Mostly
I stayed on the client side and let the other two guys do all the server work,
but I still ended up learning a whole lot more about SQL Server than I ever
wanted to know. We found and fixed a whole bunch of problems, including the Big
Ugly Thread Concurrency Bug and a Huge Memory Leak and a Bunch of Deadlocks in
the SQL Layer and a Big Performance Problem with Folder Security Checks and
more.
I don't know if this experience really fits the definition
of "Test Driven Development", but we were definitely feeling "driven" by Crowd
Test. :-)
It was a bittersweet experience. On the one hand, we were
horrified each time we discovered another hideous scalability problem in our
shipping product. On the other hand, the improvement we were making in the
quality of Vault was incredible.
The Bottom Line
Fortunately, the actual negative impact of all the problems we
found was rather small. None of our customers abuse their Vault server as
badly as Crowd Test does. Our product was taxing the patience of some
customers, but nobody lost any data or anything like that. Most users were
completely unaffected.
On the positive side, as our larger sites upgrade to 3.1,
they are getting much happier.
The efforts we started here will continue. In fact, as I
was proofreading a final draft of this article, Ian walked in and asked for
more hardware for Crowd testing. :-)
That's good, because we will probably always have more work
to do. For example, in Vault 4.0 we really need to speed up the branch
command. It's not a problem for most users, but we've got one medium-sized customer
whose development process has them creating branches extremely often, like every
day. Our branching design is sound, but the implementation was done with
the assumption of it being a rather infrequent operation. A bit of localized
surgery on the branch command will help a lot.
Today Vault scalability is very good, better than it has
ever been. When I talk with prospective customers, I've got more confidence
than ever, but the journey is a little bumpy sometimes.
These experiences allow me to feel a certain measure of identification
with the developers working on Team Foundation Server at Microsoft. They've
accomplished a lot already, but to some extent they are just now embarking on
this journey, especially
because their scalability goals aim much higher. Unlike SourceGear, the Team
System folks actually do want to create a product which is well suited
to teams with 5,000 developers. Perhaps more importantly, I daresay they want their
product to eventually be used throughout Microsoft itself, especially on the
massive teams which develop Office and Windows. With their 1.0 release
apparently only a few months away, they have some very interesting times ahead
of them. Best wishes and a pleasant journey to Brian Harry and his team.
So Eric, are you trying to make a point here?
Nah, not really. I'm just telling stories. I believe in transparency.