10/26/2009

building all the time

Recently, when Patrick Galbraith and I put together the next moxi release, we spent a bit of time getting the build clean on a number of platforms and with a number of compilers.  Continually building and testing on multiple platforms helps ensure the usefulness, quality and longevity of the code.
This is something all of us at NorthScale believe is good for the projects we lead and contribute to.  Dustin Sallings has long been doing this for memcached, as you can see from the memcached wiki and the build farm itself.  All of us at NorthScale have continued this effort, joined by community contributions.  As you can see, it's quite a comprehensive list.  We do this for other projects too.  For the memcached proxy, moxi, we have another build farm.
For those not familiar with continuous integration, buildbot allows us to shorten the time between new code landing and issues being found in various build environments.  Every time a developer commits a change*, all of these platforms will try to build and test memcached.  If there is a problem, we can spot it right away and fix it, so build problems on other platforms don't linger.  There are many benefits: it keeps the code 'ready to release', and some platforms catch errors at compile time that other platforms miss.
To give an example of how this has worked for us in practice, if someone happens upon a platform with an issue, we of course address the issue itself (usually by asking them to file bugs and provide a test case), but we also ask for help adding to the build farm.  The Gentoo project, for instance, bundles memcached and covers many, many platforms.  An issue came up recently with Gentoo for the ARM architecture.  In the process, Dustin asked the contributor, Robin Johnson, if they could set up a builder, so we now have a builder for Gentoo on ARM.
For Patrick and me in this last moxi release, this meant adding some new builders to moxi's build farm, turning up the compiler warnings and fixing a number of bite-sized bugs.  While many of these bugs are more along the lines of warnings that only compilers in a pedantic mode would complain about, others pointed to type-safety problems, which are exactly the kinds of places where software bugs like to hide.  Some caulking and sealing will keep the bugs out.
In the end, NorthScale's goal here is to maintain high quality output for memcached, moxi, libmemcached and other projects we either lead or contribute to.  As Dustin's report card blog post shows, we've made progress against our goals already; buildbot's steady watch on our tree should help keep us there.

* Technically, the process we try to follow is to run a "buildbot try" (also contributed to by Dustin) first when making changes, so we keep a tree which is buildable on all platforms all of the time.  It allows a developer who hasn't even committed a change to test it.  It works with git to generate a diff, then applies that diff as a patch to a tree at some common point in history.

memcached 1.4.1 on Windows!

This past week at NorthScale has been a great week of productivity! Just this weekend Dustin released SASL support for memcached, so the statement in my previous book that memcached has no authentication is no longer valid! I have been busy working on building memcached on Windows using the mingw compiler. Alan (Dormando) has a good friend who was kind enough to put his Windows development expertise into providing a patch to help get memcached to run. Alan and I were both trying to cross compile a Windows binary on Linux. We both got it to build, but there is some problem running the resulting binary on Windows. Interestingly, you can run this Windows binary on Linux, and without using Wine, which surprised me. That is because all the libraries it needs to run are there, despite it being built for a different OS. What I ended up doing was trying the same tree, but compiling it on Windows (along with some changes to the Makefile), and I had great success! This image shows memcached 1.4.1 happily running on a Windows AMI I used for my testing.

The git repository for this can be found at:

git://github.com/CaptTofu/memcached.git

and check out the win32 branch:

git checkout -b win32 origin/win32

To build this, you will need to install mingw, libevent and pthreads. Some kind soul (Dustin) had pre-installed all this for me on this Windows instance, so you will need to do a Google search for instructions on these installations. Also, you have to examine the file Makefile.mingw and edit it so that it can find the paths to both the libevent and pthreads includes and libraries.

Our goal at NorthScale is to keep this moving forward. Steve Yen is busily working on an installer that will further simplify running memcached on Windows!

10/23/2009

Third Stage (and not the repeatedly-delayed album from Boston)

As a nerdy adolescent boy growing up in Baton Rouge, Louisiana I discovered that I really loved computers – and computer software in particular. Creating something from nothing and instantly seeing the results in all their black-and-green, 64 column-by-16 line, ASCII-character glory gave me a level of satisfaction very few other things did. While I was the last kid picked for sports teams in junior-high school PE class (behind the girls – and I’m not exaggerating), put a keyboard in front of me and I could do things few others could. I dreamed of one day living and working in Silicon Valley – helping to create the technology that brought much joy to my life.

Fast forward about twenty-five years, and I am living the dream. I can remember like it was yesterday the first time I drove down Interstate 280 in Cupertino and saw the Apple campus out the driver’s side window; then looped up highway 85 and down 101 to see Intel’s headquarters. I consider myself one of the luckiest guys alive. I get paid to do what I truly love doing – creating great software and building great software companies that allow others to do the same.

Over the course of my career, I’ve been fortunate to directly participate in two fundamental shifts in computing technology.

The first was the transition from mini- to micro-computing. Although I logged my fair share of VAX and Data General Eclipse time, I joined the fray in earnest as microcomputers were already entering the scene. I was less a creator, and more a participant in and beneficiary of that transition.

The shift from character-based to GUI-based user interfaces; the emergence of PC LANs and networked storage; the adoption of object-oriented design and programming languages; the emergence of client-server and subsequent shift to n-tier application architectures; and the proliferation of virtual machine software for Intel-architecture platforms were all what I would consider evolutionary steps in computing technology. Very cool stuff - enjoyed helping drive those transitions - but evolutionary.

The second fundamental shift was the emergence of the Internet, enabling global connectivity of computing devices through simple, open network protocols; and the establishment of the World Wide Web which rides on top. Having made the utterly ridiculous decisions to return to graduate business school and to pursue a sideshow career in investment banking just as all that good stuff was going down, I was once again more participant than instigator. Hindsight is 20-20, I suppose.

But a third major shift is happening; a shift that will mark the third stage of my career. And I’m not missing this one. Fortunately, I’ve been at the right place, at the right time, to help play a leadership role in driving this transition. The emergence of “cloud computing” will be bigger and more impactful on the computing landscape than all the previous transitions above combined. As a jaded, buzzword-overloaded, skeptical, long-term member of this community, I actually believe that assertion right down to my core.

In 2004 I started a company, with Xun Wilson Huang, called Akimbi Systems. Acquired in 2006 by VMware (where I remained for a couple years), the technology we built is now being used as the foundation of VMware’s cloud provisioning platform. Virtualization (server, storage and networking) is a key enabling technology making the drive to cloud computing possible, but there are other, key missing ingredients – mostly “up the stack.”

About a month ago I joined the team here at NorthScale to help build and bring to market some of those missing ingredients; and to help enterprise IT organizations understand and embrace the cloud computing model. I’ve been preparing my whole life for this opportunity and we are going to do it right.

We aren’t yet ready to fully detail our vision, strategy and products to the market, but we are doing some pretty amazing things here and we can’t wait to tell the world.

“but that's not what I came to tell you about.

Came to talk about the draft.”

The development of critical infrastructure software using the open source model is the future, and increasingly the present, of software development. We plan to embrace and actively support a number of open source projects in our work here at NorthScale, and memcached is one of them.

We believe deeply in the power of the open source software development model and we are going to do everything in our power to respect, support, contribute back to and enhance the vitality of the memcached community; and the same goes for any other project we participate in.

The memcached community deserves credit for creating and enhancing a software system that is currently used by thousands of web applications, including substantially all of the top 20 web applications (by traffic count) on this planet (including Facebook, MySpace, New York Times, Google, Yahoo!, LinkedIn, Craigslist, eBay, salesforce.com). Rather than try to co-opt or claim credit for the work of the community, our goal is to recognize that great work; and to continue to do our part to support the efforts of these people while helping to improve the software and contribute those improvements right back to the project. It is the right thing to do.

I can’t wait to share my experiences with you in the coming months and years and I’d love to hear from anyone that shares our passion for cloud computing and open source software development.

10/21/2009

More memcached

Our intrepid NorthScalers have been doing some interesting work recently in memcached land...

Last week, Dustin Sallings announced his memcached server implementation in Erlang, called EMemcached.  Besides being a cool project, there's a surprising amount of interest in the mixture of memcached and Erlang, as you can see from the comments on Dustin's announcement post.

Today, Patrick Galbraith announced that the memcached UDFs are now integrated into the Drizzle project's mainline.  Drizzle is an interesting fork of MySQL, and these memcached UDFs (which were originally inspired by the memcached UDFs for MySQL) make it easier than ever to work with memcached from the Drizzle RDBMS.

As a last note, we're heads down cranking on incredibly cool Scale Out Data infrastructure.  Stay tuned!

10/16/2009

moxi 0.10.0 Released!

I'm pleased to announce the release of moxi 0.10.0, now available at http://labs.northscale.com/moxi/moxi-0.10.0.tar.gz or via Github using git:

$ git clone git://github.com/northscale/moxi

If you don't already know, moxi is a memcached proxy with several features which can help keep the memcached contract whole in complicated environments by making it easier to deal with the complexity of multiple servers in a pool from a single connection. For more information, see http://labs.northscale.com/moxi/.

This release has some new features such as:

  • Move to new internal hashtable, removal of glib requirement
  • Simplification of build
  • Updated and synced with memcached 1.4.1
  • Cleaner support for new platforms

 

Special thanks go to the NorthScale team for this release, including Aliaksey Kandratsenka and Matt Ingenthron.

Go forth now, and get the code! And thank you for using moxi. You'll have a lot of enjoyment using it.

10/12/2009

scaling data at Los Angeles CloudCamp

It's just over a week ago now, but I had a good time and learned a few things from both the session and the discussions at the Los Angeles CloudCamp.

I proposed, and ultimately led, a session on scaling data.  Actually, I proposed it as "Scaling Your Data: Both Before and After you Need to".  Part of the reason I proposed it is that there was discussion of the CAP theorem during the proposal of sessions, and what was stated was a bit off.

CAP is well discussed in a lot of places (one good blog is from Jonathan Gray, CTO and co-founder of Streamy).  CAP stands for Consistency, Availability and tolerance to network Partition.  Interestingly, many folks kept trying to make the last P performance or introduce performance into the mix.  Performance (or at least the tradeoffs involved) is better discussed with another set of letters: W+R>N.
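For anyone who hasn't run into those letters before, here is a minimal sketch of what W+R>N expresses, under the usual reading: N replicas hold each item, a write waits for W of them to acknowledge, and a read consults R of them.  When W+R>N, every read set has to overlap every write set, so a read always touches at least one replica that saw the latest write.  The function names are mine, purely for illustration.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """The W+R>N rule: True when any read quorum must meet any write quorum."""
    return w + r > n


def brute_force_overlap(n: int, w: int, r: int) -> bool:
    """Check the same property the slow way, by trying every pair of quorums."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


if __name__ == "__main__":
    # N=3: W=2, R=2 overlaps; W=1, R=1 can miss the latest write.
    print(quorums_overlap(3, 2, 2), quorums_overlap(3, 1, 1))
```

Lowering W or R makes each operation wait on fewer replicas, which is where the performance tradeoff comes in.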

We had one room at Microsoft's downtown LA digs pretty much dedicated to scaling data.  The first session was led by .... and covered the CAP theorem.  It was a bit contentious at first, as there were some assertions that you could get all three with some judicious use of Microsoft technology, but through the discussion in the room we ended up in the right place and in agreement with each other.  That session was led by Abhijit Gadkari from ZimbaTech.

After that, Microsoft's own SoCalDevGal (Lynn Langit) did a presentation on SQL Azure.  I certainly learned a bit about what Microsoft is delivering and what their developers are asking for.  Heck, I even learned a few new three-letter acronyms.  :)

SQL Azure is Microsoft's hosted relational database as a service.  From the presentation, it became clear that Microsoft was initially not planning a fully SQL-compliant hosted data store (which would have made them more like Amazon's SimpleDB or Google's BigTable as accessed through Google's App Engine).  In the end, customer feedback led them back to providing as much SQL as they could, including integration with all of the Microsoft developer tools.

Interestingly, Microsoft is suggesting data sharding for scale.  This is in part because the initial product is limited to 10 GB databases.  SoCalDevGal was really clear that this is just an initial limitation, and a few things had to give to meet their schedule.  The other really interesting bit about SQL Azure, which I thought was pertinent given the rest of the discussion, was the apparent decision to make SQL Azure a CA system with regard to CAP.  I even asked to verify, and that was in fact the intent.

The reason this is so significant to me is that the design decision here, coupled with the replication across datacenters, means the only way to do this correctly is to take the system (or a portion of it, for safety) offline in the event a datacenter is cut off from the 'net.  Since all data is in at least two datacenters, and you don't know which two, it's possible the datacenters are more than "pairs", meaning a complete partition of a single datacenter could mean an outage for at least two, and maybe more.  This is good for their users in that it's completely compatible with their applications, but it also means there could be a "CNN moment" someday, in which a couple of telco failures take a large number of SQL Azure customers offline.

Just so this isn't misinterpreted, this isn't a criticism of Microsoft's technology or some deficiency they should have overcome.  It was just a design choice made by the SQL Azure team.  This is where the "engineering" is in our discipline... there are tradeoffs in any design.

The final session in the room in the evening ended up being mine.  I did a quick introduction of myself and a quick assessment of who had ever even looked into stuff like memcached, gearman, Hadoop HBase, Cassandra and the like.  It turned out almost no one had thought about this.  I think this is, in part, because Microsoft did a great job putting butts in seats for this event.  Given the audience's experience with such things, I ran the session as a bit of a cerebral exercise in understanding why someone may want to look into different ways of storing data.

I'm sure I didn't cover everything, but I did assert a few reasons:

  • Scale - Moving some of the intelligence to the clients in a distributed computing architecture makes it simpler to split components, and thus scale them.  It also becomes much easier to move caching closer to where the data is used, helping scale.
  • Developer Productivity - This cuts both ways in that having to learn a new way of storing and retrieving data is a productivity sapper, but being able to evolve data structures and persistence mechanisms without ever having to declare changes to a data model (get DBAs involved, etc.) is liberating.
  • Availability - For some classes of applications, it may entirely make sense for most of the data to be available, even if a small portion is currently unavailable.
  • Geographic Distribution - If you begin to think about the data a little differently, it becomes easy to geographically distribute the data.
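
To make the "intelligence in the clients" point a little more concrete, here is a minimal sketch of a client deciding for itself which cache server owns a key, so that no central coordinator is needed.  The server names are hypothetical, and the simple hash-modulo scheme is only an assumption for illustration; real memcached clients often use consistent hashing instead, so that adding a node moves fewer keys.

```python
import hashlib

# Hypothetical pool of cache servers; every client is configured with the same list.
SERVERS = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]


def pick_server(key: str, servers=SERVERS) -> str:
    """Map a key to one server using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]


if __name__ == "__main__":
    for key in ("user:42", "session:abc", "page:/home"):
        print(key, "->", pick_server(key))
```

Because every client computes the same answer from the key alone, splitting the data across more machines is mostly a matter of changing the server list.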

I'll have to write more about this in the not too distant future.  I've been looking into a lot of previous work (even some academic papers) and there are some interesting ideas out there.

p.s.: the TechZulu guys recorded the session.  It may be on their site at some point... I checked to link but it's not there yet.

10/02/2009

memcached talk at SV Code Camp

Just a quick note, I'll be at Silicon Valley Code Camp this weekend, giving a talk on Using memcached to Scale Out Your Website.

10/01/2009

NoSQL is a horseless carriage

There's a lot of buzz on the NoSQL, or anti-RDBMS, meme lately.  But, I don't like those category names.

NoSQL?

It's a backwards ref, instead of a forwards-looking connotation.

Folks used to use the words "horseless carriage" when referring to those fancy new contraptions that you rode around in without using a horse.  Before long, though, new terms like "automobile" and "car" popped up.  Fast forward to today, and it's now safe to call these things Mustangs, Broncos or Pintos without confusing anyone.

I would already like to get a new category name for all these new kinds of datastores. I've heard of "AltDB", but it's still a backwards ref.  Alternative to RDBMS?  But, the truth is, you'll probably never get rid of your RDBMS.  These new things are complements to RDBMS databases.

The common refrain to me is this: you can scale out these systems by just adding more nodes or servers.

How about a category label that connotes that benefit: JustAddMoreNodesDB -- Jam-N-DB.  Jam-N-Database.

By the way, there's one more reason why I dislike the NoSQL/anti-RDBMS labeling...

In the long term, my bet is these just-add-more-node systems will pick up more and more little bits of relational and SQL-like functionality. 

For example, that's already happening in Hadoop-land, which plays in the batch processing, analytics world.  In Hadoop, you write map-reduce programs instead of SQL queries.  But, additional pieces like Hive let us write SQL-like queries, which get automatically converted into map-reduce programs underneath the hood (a car analogy!).  Long live SQL.
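
As a toy illustration of that conversion (plain Python, not Hadoop or Hive code), take a SQL-like query such as SELECT page, COUNT(*) FROM hits GROUP BY page.  The table name, field name and function names below are made up for the example; the point is only that a GROUP BY with a count falls naturally into a map step, a shuffle step and a reduce step.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (page, 1) pair for every hit record."""
    for record in records:
        yield record["page"], 1


def shuffle(pairs):
    """Shuffle: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values for each key, which gives COUNT(*) per page."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_page(records):
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    hits = [{"page": "/a"}, {"page": "/b"}, {"page": "/a"}]
    print(count_by_page(hits))
```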

09/24/2009

spymemcached release candidate has 2x faster bulk loading

Memcached and Java NIO master, Dustin Sallings, has put out a new release candidate of the world's best Java memcached client library, spymemcached.  The most exciting headliner here is the 2x performance improvement for bulk data loading into memcached, by leveraging the new binary memcached protocol.  Details at: http://dustin.github.com/2009/09/23/spymemcached-optimizations.html

Way to go Dustin!

09/20/2009

power in the protocol

As I've said in other venues, I believe a lot of value in the memcached ecosystem/community stems from well defined client APIs and a well defined over-the-network protocol.  The protocol just makes sense for a large class of things you may need from (as NorthScale's Dustin Sallings calls it) a sloppy LRU distributed hashmap.  There are clients for every platform and language I can think of, and implementing the basics of the protocol for another client is so straightforward that there are a lot of partial clients out there.
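
To give a feel for how small the basics are, here is a minimal sketch of the request and reply framing for set and get in the memcached text protocol.  It only builds and parses bytes; there is no networking, and the helper names are mine.  It handles a single-key get reply only, so treat it as an illustration rather than a client.

```python
def build_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    """Frame a storage request: 'set <key> <flags> <exptime> <bytes>' then the data."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def build_get(key: str) -> bytes:
    """Frame a retrieval request for one key."""
    return f"get {key}\r\n".encode("ascii")


def parse_get_response(data: bytes):
    """Return the value from a single-key get reply, or None on a miss."""
    if data == b"END\r\n":
        return None
    header, _, rest = data.partition(b"\r\n")
    parts = header.split()
    if len(parts) < 4 or parts[0] != b"VALUE":
        raise ValueError("unexpected reply: %r" % header)
    length = int(parts[3])  # 'VALUE <key> <flags> <bytes>'
    return rest[:length]


if __name__ == "__main__":
    print(build_set("greeting", b"hello"))
    print(parse_get_response(b"VALUE greeting 0 5\r\nhello\r\nEND\r\n"))
```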

On the server side, the same is true.  Here in the last few years, we've seen quite a number of things pop up which speak either all, or a subset of, the memcached protocol.  In some cases, these things are pretty explicit about the fact they don't speak the full protocol and in other cases there may just be parts which weren't considered.  What was needed was a quick way to determine if something which speaks memcached protocol supported all of the operations and options.  It was possible to do this by grabbing spymemcached or libmemcached and running the tests, but those felt more like tests to validate spymemcached and libmemcached functionality... you'd have to know a lot about them to dig into the weeds to find out based on the failures what was due to protocol support not being there.

I had suggested something like this at the Drizzle Developer days in Seattle not so long ago.  In a later conversation between Brian Aker and me, I made the suggestion that memcached is something that pretty much any developer could implement the basics of in hours or days.  That being the case, it'd be really useful to have a simple tool to point at something to see to what degree it's protocol compliant.  He agreed, and then between him and Trond, and separately between Trond and me, we came up with memcapable.

Trond went and implemented the first version.  My contributions were the idea and some discussion with Trond, plus the name itself (which is a great name, if I do say so myself).

memcapable has just been merged into the trunk of libmemcached as one of the utilities.  This means it should be in the next release.

NorthScale's own Patrick Galbraith (a.k.a. CaptTofu) recently picked up Tokyo Tyrant to put it through its paces.  This is exactly the kind of thing that memcapable should help memcached users with.  Given a new thing implementing the memcached protocol, you can point memcapable at it and quickly find out which portion of the protocol it supports!
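
To illustrate the idea (this is not memcapable's code or its output format, just a sketch of the concept), a compliance probe can be as simple as sending a handful of text-protocol commands and checking how each reply starts.  The send callable and the toy partial_server below are hypothetical stand-ins; in real use, send would wrap a socket connected to the server under test.

```python
# (name, request bytes, expected start of the reply) for a few text-protocol commands.
PROBES = [
    ("set", b"set probe 0 0 1\r\nx\r\n", b"STORED"),
    ("get", b"get probe\r\n", b"VALUE probe 0 1"),
    ("delete", b"delete probe\r\n", b"DELETED"),
    ("version", b"version\r\n", b"VERSION"),
]


def check_support(send):
    """Run every probe through send(request) -> reply and record pass/fail per command."""
    report = {}
    for name, request, expected_prefix in PROBES:
        try:
            reply = send(request)
        except Exception:
            reply = b""  # treat any failure as "not supported"
        report[name] = reply.startswith(expected_prefix)
    return report


def partial_server():
    """Toy stand-in for a server that only implements set and get."""
    store = {}

    def send(request: bytes) -> bytes:
        line, _, body = request.partition(b"\r\n")
        words = line.split()
        if words[0] == b"set":
            store[words[1]] = body[: int(words[4])]
            return b"STORED\r\n"
        if words[0] == b"get":
            value = store.get(words[1])
            if value is None:
                return b"END\r\n"
            return b"VALUE %s 0 %d\r\n%s\r\nEND\r\n" % (words[1], len(value), value)
        return b"ERROR\r\n"  # unknown command

    return send


if __name__ == "__main__":
    for command, supported in check_support(partial_server()).items():
        print(command, "pass" if supported else "FAIL")
```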