Tech ramblings by Marcin

Hadoop for Enterprises

2012-06-18 09:08

Hadoop has been gaining a lot of attention lately as a big data processing framework. It is no longer only the big players who see that they can embrace the data their sites or products generate and build their businesses on it. For that to happen, two things are needed: the data itself and a means of processing really big amounts of it.

Gathering data is relatively easy. The data doesn't have to be structured, and you don't need to plan its usage up front. Just start collecting it, and then you can experiment with its potential uses. If it turns out to be useless rubbish, deleting it won't be hard. But imagine the value it may bring to your business:

  • faster services - working on optimized data
  • more clients - because of more relevant search results
  • happy clients - your service can "read their minds"
  • etc.

There are many companies that use the Hadoop ecosystem for their own needs. You can read about some of them here: http://wiki.apache.org/hadoop/PoweredBy. But since that page lacks insight into specific applications of Hadoop, I've tried to delve into the details of how Hadoop helped some companies tame their big data sets.

Facebook

Being a widely used social network provider, they require no introduction. However, if you've lived under a rock for the last couple of years, just visit their website: http://facebook.com

Their main use of Hadoop is data warehousing. Since they have to access the data fast and reliably, they needed real-time querying of their huge and ever-growing data set. The switch from MySQL databases was driven by the increasing workloads they experienced with standard databases. What they got "out of the box" with Hadoop was all the benefits of a distributed file system (HDFS). They took those ideas even further and implemented a truly highly available file system without a single point of failure.

Facebook has 3 interesting usage scenarios in which Hadoop plays a major role:

  • Titan - Facebook's messaging system. It processes messages exchanged between users and ensures that this happens fast and without glitches. Here Hadoop is used mainly as a huge, practically unlimited storage.
  • Puma - Facebook Insights - a tool providing page statistics for advanced Facebook users. Based on streams of data (clicks, likes, shares, comments and impressions), it graphs the data and makes it available almost instantly.
  • ODS - Operational Data Store - stores Facebook's internal metrics: collections of OS and cluster health metrics. It also facilitates multiple accounting solutions.

Twitter

This popular micro-blogging platform, where you can register an account and follow friends and celebrities for their micro-messages, does some pretty interesting things with its Hadoop cluster.

One of their motivations is to speed up their website's functionality. That is why they compute users' friendships in Twitter's social graph with Hadoop. Using the connections between users, they calculate users' relationships to each other and estimate groups of users.
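To make the idea of computing relationships from graph connections more concrete, here is a minimal sketch in plain Python that mimics the map and reduce steps in a single process. It is not Twitter's actual job: the function names, the sample edges, and the choice of "mutual follow" as the relationship are all assumptions made for illustration.

```python
from collections import defaultdict

def map_follow(follower, followee):
    """Map step: key each follow edge by the unordered pair of users."""
    pair = tuple(sorted((follower, followee)))
    yield pair, follower

def reduce_pair(pair, followers):
    """Reduce step: the pair is mutual when both users appear as a follower."""
    return pair, set(followers) == set(pair)

def mutual_follows(edges):
    """Run map, shuffle (group by key) and reduce in-process."""
    grouped = defaultdict(list)
    for follower, followee in edges:
        for pair, who in map_follow(follower, followee):
            grouped[pair].append(who)
    results = (reduce_pair(pair, who) for pair, who in grouped.items())
    return sorted(pair for pair, is_mutual in results if is_mutual)

# Hypothetical sample data: (follower, followee) edges.
edges = [
    ("ann", "bob"),
    ("bob", "ann"),
    ("ann", "cat"),
    ("cat", "dan"),
    ("dan", "cat"),
]
mutual = mutual_follows(edges)
print(mutual)
```

On a real cluster the grouping by key would be done by Hadoop's shuffle phase between mappers and reducers; here a dictionary stands in for it.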

Since this service's users generate lots of content, the company conducts research based on natural language processing. They probe what can be told about a user from their tweets. They use the contents of tweets for advertising, trend analysis and much more.

From tweets and users' behaviour they characterise usage scenarios. They also gather usage statistics, such as the number of searches and the number of tweets per day. Based on this seemingly irrelevant data they run comparisons of different types of users. Twitter analyzes the data to determine whether mobile users, users of third-party clients, or power users use Twitter differently from average users. Of course, these seem like really specific applications, but they are nevertheless very original and are based on the data that Twitter has been gathering for some time now.

eBay

Being the biggest auction site on the Internet, eBay uses Hadoop processing to increase search relevance based on click-stream data and user data. This seems pretty obvious, considering their area of operation.

However, they also do one other interesting thing: they try hard to automatically fill in the metadata of auctioned items, based on the descriptions and other data provided by users. They employ a data mining approach for this task, and judging from their constant growth, it seems to work.

LinkedIn

A social network for professionals, though a lot smaller than Facebook. Based on click-streams they discover relations between users. All the data about the latest visits to your profile or people you may know from other places comes from Hadoop-based analysis of the clicks people make on the site all the time.

There is also a very neat feature called InMaps (http://inmaps.linkedinlabs.com/), which analyses declared schools and companies and generates the data for a graph of your clustered friends.

Last.fm

This online radio site, praised by many for its invaluable recommendation system, seems like a rather small and simple service. But behind the facade of a simple web page, lots of data is being processed so that the service can reach a certain level of perfection.

This large volume of data comes from scrobbles. Each user listening to a song generates a note about that fact, called a scrobble. Based on scrobbles and user profiles they calculate global band popularity charts, maps of bands' popularity, and many more usage statistics and timeline charts.
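A popularity chart of this kind is a classic counting job. The sketch below simulates the map and reduce steps in plain Python on a handful of made-up scrobbles. The record layout (user, artist, track) and all function names are assumptions for illustration, not Last.fm's real schema or code.

```python
from collections import defaultdict

def map_scrobble(scrobble):
    """Map step: emit (artist, 1) for a single scrobble record."""
    yield scrobble["artist"], 1

def reduce_counts(artist, counts):
    """Reduce step: sum all the ones emitted for one artist."""
    return artist, sum(counts)

def popularity_chart(scrobbles):
    """Run map, shuffle (group by key) and reduce in-process."""
    grouped = defaultdict(list)
    for scrobble in scrobbles:
        for artist, one in map_scrobble(scrobble):
            grouped[artist].append(one)
    totals = [reduce_counts(artist, counts) for artist, counts in grouped.items()]
    # Most-played first; ties are broken alphabetically for a stable chart.
    return sorted(totals, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical sample data.
scrobbles = [
    {"user": "alice", "artist": "Radiohead", "track": "Airbag"},
    {"user": "bob", "artist": "Radiohead", "track": "Lucky"},
    {"user": "alice", "artist": "Portishead", "track": "Roads"},
    {"user": "carol", "artist": "Radiohead", "track": "Airbag"},
    {"user": "bob", "artist": "Portishead", "track": "Glory Box"},
    {"user": "carol", "artist": "Moby", "track": "Porcelain"},
]
chart = popularity_chart(scrobbles)
for artist, plays in chart:
    print(artist, plays)
```

Per-country or per-week charts would follow the same pattern with a richer key, for example (country, artist) instead of just the artist.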

Conclusion

All of these companies simply try to detect and trace new patterns in seemingly chaotic data sets. Perhaps you could do the same? Analyze your data and expand your business value?

Comments

We stumbled over here from a different web address and thought I might check things out.
I like what I see so i am just following you.
Look forward to checking out your web page yet again.

I like what you guys are up too. This type of clever work and reporting!

Keep up the awesome works guys I've added you guys to my own blogroll.

Greetings from Florida! I'm bored at work so I decided to browse your site on my iphone during lunch break. I enjoy the info you present here and can't wait to take a look when I get home. I'm surprised at how quick your blog loaded on my cell phone... I'm not even using WIFI, just 3G. Anyways, very good site!

I never imagined how much stuff there was out there
on this! Thanks for making it easy to get the picture

Howdy are using Wordpress for your site platform? I'm new to the blog world but I'm trying to
get started and create my own. Do you need any coding expertise to make your own
blog? Any help would be greatly appreciated!