Tag Archives: systems

Distributing biodiversity data globally

My current project at work will take me far into next year, and that’s good because I’m facing an unprecedented amount of data that will only continue to grow.  Because of this I’m finally getting to put my money where my mouth is.  For years I’ve talked about my ideas and theories about how I could network disparate systems together and have them leverage each other to keep everything in sync.  And the more I work with Open Source to push boundaries, the more ways I seem to find to do more complex things.  One basic idea that I’m working on now is that data sets are huge, and are only going to get huger (and hugerer) as time goes on.  How to handle this has been solved a few different ways.  Usually it’s someone like the Internet Archive, who have 1000s of computers networked together to share the data (they are using some parts of Hadoop for the distributed file system, and then Nutch for search indexing), but it’s still working from one central point of failure.  I started doing research to find out how this has been solved before, and whether my idea of building a BitTorrent network was sound, and I found some great information to build on.  As I’m setting up my demo BitTorrent tracker in Debian, this info keeps me thinking of the best ways to implement my ideas.  Much of my progress is due to the very helpful advice of Paul at Geograph Torrent Archive, a project that has somewhat similar goals.
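A rough sketch of why BitTorrent suits big data sets: in the original BitTorrent protocol a file is cut into fixed-size pieces, and the torrent file carries a SHA-1 hash for each piece, so a peer can check any piece it receives no matter which machine sent it.  The snippet below only illustrates that piece-hashing idea.  It is not the tracker setup described above, and the function names and the 256 KiB piece size are assumptions chosen for the example.

```python
import hashlib
from typing import List


def piece_hashes(data: bytes, piece_length: int = 256 * 1024) -> List[bytes]:
    """Split data into fixed-size pieces and return the SHA-1 digest of each.

    piece_length is an illustrative default; real torrents choose their own.
    """
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def verify_piece(piece: bytes, index: int, hashes: List[bytes]) -> bool:
    """Check a piece received from any peer against the published hash list."""
    return hashlib.sha1(piece).digest() == hashes[index]
```

Because every piece can be verified on its own, copies of a data set can live on many machines and a downloader can pull different pieces from different peers without having to trust any single one of them.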

Meeting Moore, Internet Archive, PLoS, Flickr in San Francisco

I’ve gotten my pictures online from my San Francisco trip.  The city was everything I always hoped it would be, and I really loved it there.  I had the opportunity to meet with diverse people that all intersect with various aspects of my job (now being referred to as my career): from The Moore Foundation (the most amazing workspace I’ve ever seen), which provides us grant money to do our research, to other non-profit partners like Internet Archive, The Smithsonian, California Academy of Sciences and Public Library of Science, to some of the folks that run the servers and dream up new ideas at Flickr (they use MySQL shards, Squid and memcached all over the architecture to navigate all that data, so I’m on the right path!).  The best part was meeting more people like me who are learning how to deal with and distribute all of this life data that just increases daily.  The fact that I’m using skills that I learnt by doing things like…running this blog, to do things on such a global level is an honor.  And fun, lots of fun!