Distributed scans with HBase

HBase is by design a columnar store, that is optimized for random reads. You just ask for a row using rowId as an identifier and you get your data instantaneously. Performing a scan on part or whole table is a completely different thing. First of all, it is sequential. Meaning it is rather slow, because it doesn’t use all the RegionServers at the same time. It is implemented that way to realize the contract of Scan command - which has to return results sorted by key. So, how to do this efficiently? ...

December 10, 2013 · 3 min · Marcin Cylke

Simple HBase ORM

When dealing with data stored in HBase, you are quick to come to a conclusion, that it is extremaly inconvenient to reach to it via HBase native API. It is very verbose and you always need to convert between bytes and simple types - a pain. While I was working on a project of mine, I thought, why not to easy those pains and fetch real objects from HBase. And that’s how this simplistic, hackish ORM came to life. It is no match for projects like Kundera (a JPA compliant solution), or n-orm. Nevertheless, it just suits my needs :) ...

December 8, 2013 · 2 min · Marcin Cylke

Zookeeper + Curator = Distributed sync

An application developed for one of my recent projects at TouK involved multiple servers. There was a requirement to ensure failover for the system’s components. Since I had already a few separate components I didn’t want to add more of that, and since there already was a Zookeeper ensemble running - required by one of the services, I’ve decided to go that way with my solution. What is Zookeeper?Just a crude distributed synchronization framework. However, it implements Paxos-style algorithms (http://en.wikipedia.org/wiki/Paxos_(computer_science)) to ensure no split-brain scenarios would occur. This is quite an important feature, since I don’t have to care about that kind of problems while using this app. You just need to create an ensemble of a couple of its instances - to ensure high availability. It is basically a virtual filesystem, with files, directories and stuff. One could ask why another filesystem? Well this one is a rather special one, especially for distributed systems. The reason why creating all the locking algorithms on top of Zookeeper is easy is its Ephemeral Nodes - which are just files that exist as long as connection for them exists. After it disconnects - such file disappears. ...

June 24, 2013 · 5 min · Marcin Cylke

Operational problems with Zookeeper

This post is a summary of what has been presented by Kathleen Ting on StrangeLoop conference. You can watch the original here: http://www.infoq.com/presentations/Misconfiguration-ZooKeeper I’ve decided to put this selection here for quick reference. Connection mismanagement too many connections WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@247] - Too many connections from /xx.x.xx.xxx - max is 60 running out of ZK connections? set maxClientCnxns=200 in zoo.cfg HBase client leaking connections? fixed in HBASE-3777, HBASE-4773, HBASE-5466 manually close connections connection closes prematurely ...

March 21, 2013 · 3 min · Marcin Cylke

After WHUG meeting

Here are the slides from the talk a gave yesterday. If you have any questions, please ask.

November 30, 2012 · 1 min · Marcin Cylke

WHUG 8. Beyond Hadoop - checking other options

W najbliższy czwartek - czyli 29.11.2012 - poprowadzę prezentację w ramach Warsaw Hadoop User Group. Swoją obecność można odklinąć tu http://www.meetup.com/warsaw-hug/ A o czym będę mówił? Przeklejka ze strony WHUG: Marcin skupi się na współpracy ekosystemu Hadoopa z innymi narzędziami. Pokaże jak prosto i wygodnie przetwarzać grafy i jak stosować podejście Big Data, w czasie rzeczywistym. Poruszy również temat łatwiejszego tworzenia algorytmów Map-Reduce Będzie to nieco mniej technicza (ale wciąż praktyczna) wycieczka po obrzeżach tematyki, która jest zwykle poruszana w połączeniu z Hadoop-em. ...

November 26, 2012 · 1 min · Marcin Cylke

Hadoop HA setup

With the advent of Hadoop’s 2.x version, there finally is a working High-Availability solution. Even two of those. Now it really is easy to configure and use those solutions. It no longer require external components, like DRBD. It all is just neatly packed into Cloudera Hadoop distribution - the precursor of this solution. Read on to find out how to use it. ...

October 30, 2012 · 4 min · Marcin Cylke