Hadoop’s usage as a big data processing framework gains a lot of attention lately. Now, not only big players see, that they can embrace the data their sites or products are generating and develop their businesses on it. For that to happen two things are needed: the data itself and means of processing really big amounts of it.
Gathering data is relatively easy. These are not necessarily structured data, you don’t need to plan their usage at first. Just start collecting them and than you may experiment with their potential usage. If they’ll come out as useless rubbish - deleting them won’t be hard But imagine the values it may contribute to your business:
- faster services - working on optimized data
- more clients - because of more relevant search results
- happy clients - your service can “read their minds”
There are many companies that utilize Hadoop ecosystem for their own needs. You can read about some of them here: http://wiki.apache.org/hadoop/PoweredBy But since that page lacks insight into specific applications of Hadoop I’ve tried to delve into
details of how Hadoop helped tame some companies’ big data sets.
Being a social network provider, a widely used one, they require no introduction. However if you’ve lived under a rock for last couple years just visit their website http://facebook.com
Their main usage is data warehousing. Since they require to be able to access the data fast and reliably they had a need for real-time querying of their huge, and always growing data set. Their switch from MySQL databases was required due to the increasing workloads they experienced with standard databases. What they got “out of the box” with Hadoop was all the benefits of distributed file system (HDFS features). They expanded the ideas behind that even further and implemented truly Highly Available file system without Single Point of Failure.
Facebook has 3 interesting usage scenarios in which Hadoop plays a major role:
- Titan - is Facebook’s messaging system. It processes messages exchanged between users. Ensures that it happens fast and without glitches. Here Hadoop is used mainly as a huge, unlimited storage.
- Puma - Facebook Insights - a tool providing page statistics for advanced Facebook users. Based on streams of data (clicks, likes, shares, comments and impressions) it graphs those data and makes it available near instantly.
- ODS - Operational Data Store - which stores Facebook’s internal metrics - collections of OS and cluster health metrics. And it facilitates multiple accounting solutions.
This popular micro-blogging platform, where you can register your account and follow friends and celebrities for their micro-messages does some pretty interesting things with their Hadoop cluster.
One of their motivations is to speed up their web-page’s functionality. That is why the compute users’ friendships in Twitter’s social graph with Hadoop. Using connections between users they calculate their relationship to each other and estimate groups of users.
Since this service’s users generate lots of content, the company conducts researches based on natural language processing. They probe what could be told about a user from his tweets. They use tweets’ contents for advertisement purpose, trends analysis and many more.
From tweets and user’s behaviours they characterise usage scenarios. Also, they gather usage statistics, like number of searches daily, number of tweets. Based on this seemingly irrelevant data they run comparisons of different types of users. Twitter analyzes data to determine whether mobile users, users who use third party clients or power users use Twitter differently from average users. Of course theses seem like really specific applications but nevertheless they are very original and base on the data that Twitter has been gathering for some time now.
Being the biggest auctioning site on the Internet, EBay uses Hadoop processing for increasing search relevance based on click-stream data, user data. This seems pretty obvious, considering their area of operation.
However the also have one other interesting thing - they try hard to automatically fill auctioned objects’ metadata, based on the descriptions and other data provided by users. They employ data mining approach for this tasks and judging from their constant growth it seems to work
Social network for professionals, thou a lot smaller than Facebook. Based on click-streams they discover relations between users. All the data concerning latest visits on your profile or people you may know from other places - this comes from Hadoop based analysis of those clicks people make all the time on their sites.
Also a very neat feature, called InMaps (http://inmaps.linkedinlabs.com/) analyse declared schools and companies and generates data for graph with clustered friends of yours.
This on-line radio site, praised by many for its invaluable recommendations’ system seems like a rather small and simple service. But behind the facade of simple web page there are lots of data being processed, so that their services could match a certain level of perfection.
Such large volume of their data comes from scrobbles. Each users of their service listening to a song generates a note about this fact - called scrobble. Based on that and user profiles they calculate global band popularity charts, maps of bands’ popularity and many more usage statistics and timeline charts.
They just try to detect and trace new patterns in seemingly chaotic data sets. Perhaps you could also do the same? Analyze your data and expand your business value?