Marcin bloguje

Reading research papers for fun and profit

2016-10-09T23:00:00+02:00

For a long time, research papers seemed to exist for someone else to read. They were that strange thing some people produced and no one ever read. But that changed. Gradually, as I became interested first in distributed systems and then in machine learning, they started appearing frequently in blog posts and tweets, with mentions all over the internet. This made me think about their usefulness in solving general, everyday problems in those areas. They were no longer the unapproachable academic papers I read at school. Nope, they became something else entirely.

Over time I’ve learned to gather insights, ideas and understanding from them. I’ve developed a technique and would like to share it. Is it really helpful? I don’t know. Perhaps it just works for me. Can’t say for sure.

How to read a research paper?

The most basic paper that shows how to efficiently read research papers is “How to read a paper”. It’s short and succinct, yet it pinpoints exactly the things you should do to read and comprehend such content. At least it works for me.

The whole procedure outlines reading a paper in multiple passes. It also states that reading from paper is a lot more efficient than from a screen. At least it gets all the online distractions out of the way.

I’ve decided to go with just two passes. The first pass is just to quickly scan the paper and see if it’s actually relevant. The second pass should give me sufficient understanding of the paper to decide whether I should dwell on it longer (i.e. implement it or use it in some project).

Ok, so, here’s what I get from each pass:

1st pass

read the title
read headlines
look at math formulas
read the conclusion
glance over the references

This pass should be sufficient to assign the following:

category - what type of paper is it?
context - which other papers are related
correctness - do the assumptions appear to be valid
contributions - what are the paper’s main contributions
clarity - is the paper well written?

2nd pass

This pass should be enough to:

grasp the content of the paper
be able to summarize the main theme of the paper

Bad Hackers Copy, Great Hackers Steal

Hey, it’s not me “going loco”. Check this guy out - Avi Bryant talking about his experience with reading research papers. The video is here. He shows his product for introducing mass edits in spreadsheets by generalizing editing with a group of algorithms. There are no details of the algorithms, but please watch this just for his great motivation and passion.

What have I already read and found important?

Since I started tracking my progress with reading research papers I’ve read quite a number of them. Here, have a look at some extremely interesting papers, or at least ones I find that way:

On designing and deploying internet scale services - great paper. A concise compendium of guidelines to follow when building big, distributed systems. Goes through different areas of a project - functional and non-functional requirements
Realtime data processing at Facebook - how does Facebook manage to process all that data, and how do they move fast at such scale? This paper describes an ecosystem that exists at Facebook. Great to see they actually use multiple streaming solutions, just for the sake of “right tool for the job”
Design principles behind Smalltalk - very short one, but informative and great to read. Basically lays down a couple of simple principles on building systems
Goods: Organizing Google’s datasets - not universally useful or entertaining, but if you tackle heaps of unstructured data on a daily basis, the ideas presented here are enlightening

Start small

Of course this is not for everyone. Just try and decide whether you like that kind of brain muscle stretching or not. If the things described above scare you, perhaps try something smaller. There is a delightful YouTube channel, Two Minute Papers, which goes beyond computer science and shows great scientific innovations in just two minutes. That is too short to actually get all the details, but it is just enough to get you interested in a specific subject. You can dig deeper into specific areas later.

Conclusion

Reading research papers changes your perspective. It’s great, so do it as frequently as you want, whether just to ingest new ideas or to read papers that are effectively the building blocks of specific industries. They are really good, and reading the classics is always in fashion!

IT minimalist

2016-07-24T23:06:00+02:00

This is me at this moment in time:

164 RSS channels I’m subscribed to
47 open browser tabs, in two different browsers (Chrome, Firefox), across two different machines
29 movies to watch
44 notes in Google Keep
78 books to read

This is information overload.

Disclaimer: Please treat this entry as a mind dump - the description of my approach at this moment. These are ways of dealing with information overflow, and also threads. If you’d like to discuss anything related, feel free to write about that in comments’ section.

You can always say I’m just clicking like a madman, or suffering from ADHD or FOMO. Sure, this would be the simplest explanation.

But I’d argue that my behavior isn’t that rare. People just tend to get lost in this abundance of information. That’s why I’ve tried introducing a minimalistic approach to my digital activity.

What is minimalism

There is actually a trend that says - leave only the most important things, have only a hundred of them, this should suffice. You can read about it all around the internet.

In general this is more about simplifying your life, not only its physical aspects. There is lots of advice on how to have less on your mind and let go of unimportant things. And this all really helps.

Some of it I’ve tried. And it helped. I’ve stopped blindly gathering things. I’ve thrown away, sold or swapped some things I know I won’t look at again. I’ve somewhat learned how to at least try to be a minimalist in the physical world. I’m still struggling to set my mind on a minimalist’s track.

What I’ve been lacking is following those rules in my thought processes and my digital activity. The most precise way I can describe what’s happening in my mind is an IndyCar race with new cars constantly joining and the old ones never retiring. This means that all the time I get all kinds of weird ideas: some small, like cleaning the fridge, some bigger, like checking out this new awesome hardware or software platform, and sometimes they’re just overwhelming but tempting. The bold ones - the “get rich young or die trying” kind.

How do I deal with this constant stream of “joy”?

bullet journal - one journal instead of myriads of little notes all over the place. The idea is really neat, because it assumes you spend some time reviewing your past, unfinished goals and verifying whether they’re still worth pursuing,
google keep - nice, but it is just a virtual bunch of little notes. Great for fast note-scribbling on a mobile. Also, these notes have one huge advantage over paper ones - they’re searchable,
pocket - I use this simple app a lot. It allows me to channel all the things that seem like worthwhile reads to my mobile, so they won’t pollute browser windows. And there is a chance I’ll read them later. Cluttering my Pocket account is a separate problem, but by postponing things to Pocket you gain time to reconsider whether that piece of information is still relevant after a while.

Distractions / sources of “joy”

twitter, facebook - sifting through meaningless entries on these services eats time, so I try to limit their usage to some amount per day/week
newsblur - lots of RSS feeds. I read only the most interesting ones, which means I skim through the majority of the feeds. I also try to follow a simple rule: if I want to add a new feed, I look through those already there and try to decide which ones I don’t actively read or don’t enjoy. It is not obligatory to remove feeds, but the simple act of reconsidering is sometimes the best solution.

Conclusion

Not doing anything isn’t helping. Devising at least some solutions to this data influx is the least we can do. I’ve mentioned just the ones most useful to me. If you have yours, please share them in the comments.

Should you switch to Mac?

2016-07-07T23:22:00+02:00

by Flickr user raneko

Who would have thought? Me switching to Mac? But it actually happened. I’ve been a long-time Linux user, so why make the switch now? Or ever? Have a look at the thoughts and insights of a novice Mac user.

In the beginning

My humble beginnings were with RedHat 6.0 on an AMD Duron 600 MHz PC. With a dialup connection. And setting that thing up to connect to Poland’s national provider TPSA was rather painful. After some time using Linux, going from RedHat to Slackware, I decided things weren’t tough enough and switched to the BSDs. That was really fun! I used the Free and Open flavors for a few years, which were really great. Then I switched back to Linux. I’ve used those *nix systems for all-things-computer. At home and at work.

The switch

And then came the new job and I had to choose: a regular Dell or a MacBook? Blue pill or red pill? I chose the latter.

The switch wasn’t that painful, but mainly thanks to the superb hardware Apple offers. Having to get used to a completely new OS was quite another story. It’s not that I was shocked by it. As a long-time Linux user, I know all the concepts. It’s the other way around - there are a lot of things missing that I have learned to expect from my work tools.

What I really appreciate are the familiar userland tools - it’s BSD at the heart of it. Well, of course the kernel is XNU, a somewhat wacky hybrid built around the Mach microkernel, but as for the userland I’m happy :) You can read more about the kernel and the system’s design history on Wikipedia (OS X, Darwin, XNU).

One of the biggest disappointments was the filesystem and the way it’s presented to users. Mac OS X comes with the HFS+ filesystem, which is case-insensitive by default!!! To me this seems like an abomination. Plus there are multiple shortcuts taken by Apple engineers, such as its endianness: originally Macs used PowerPC chips, which ran in big-endian mode, but since the switch to Intel processors everything is little-endian. AFAIK HFS+ still has to swap bytes when reading metadata.
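To make the byte-swapping remark a bit more concrete, here is a minimal Python sketch. It is not HFS+ code, and the 32-bit value is made up for illustration; it only shows how the same number is laid out in big-endian and little-endian byte order, and what happens when a big-endian field is read with the wrong byte order.

```python
import struct

# A made-up 32-bit value standing in for some on-disk metadata field.
value = 0x12345678

big = struct.pack(">I", value)     # big-endian layout (as on PowerPC Macs)
little = struct.pack("<I", value)  # little-endian layout (as on Intel Macs)

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# Reading a big-endian field correctly means asking for big-endian explicitly,
# which amounts to a byte swap on a little-endian machine.
correct = struct.unpack(">I", big)[0]

# Reading the same bytes as little-endian gives a different number.
wrong = struct.unpack("<I", big)[0]

print(hex(correct))  # 0x12345678
print(hex(wrong))    # 0x78563412
```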

A really neat thing I’ve found out recently is that under the hood OS X uses PF (OpenBSD’s packet filter) as its firewalling solution. I don’t know which version is in the current release or how it compares against the original implementation, but since PF has such a nice syntax and performance it’s great to have it on board. There are numerous blog posts about setting up a decent firewall on OS X with PF, so go have a look. You can also play with PF by means of a set of apps called Murus.

Useful tools

Brew - basic application provider, offers all the things I’ve become used to when on Linux,
Amethyst - allows window tiling with keyboard shortcuts and has focus-follows-mouse :D I love this feature, although with all the windows popping all over the place I must admit it sometimes gets messy.
MenuMeters - have a bunch of those geeky meters all over the place (no longer usable with El Capitan)
Alfred - seems like a nice app, Spotlight on steroids. It’s free and you can download it via iTunes
Flashlight - adds more providers to Spotlight - unfortunately, with the recent introduction of the rootless Mac partition it’s no longer possible to use this tool

Hacking Spotlight

Spotlight is that neat little thing that indexes all the things and launches installed programs. But it can also serve as a calculator and …

You can also read the contents of its cache file.

Up until version 10.10.4 of OS X it was possible to have additions to Spotlight, but now this behaviour is blocked by the system. http://mac-how-to.wonderhowto.com/how-to/customize-spotlight-search-mac-os-x-yosemite-0160786/

Script setup

Set up your Mac with this shell script - but be very careful and read through the file first! https://gist.github.com/brandonb927/3195465

by Flickr user blakespot

Neat things

It is also quite positive that Mac OS X developers care about small but extremely important things - like building OpenSSH with LibreSSL support!

DTrace

Recently I’ve experienced a huge slowdown on my Mac. IT support’s solution was to reinstall the OS. I refused; this seemed like a barbaric method, and I also wanted to use this opportunity to delve deeper into the internals of this OS.

I’ve started by analysing system behaviour with DTrace, a probing interface originating from Solaris.

Useful introductory links are here:

http://dtrace.org/blogs/brendan/2011/10/10/top-10-dtrace-scripts-for-mac-os-x/
http://dtrace.org/blogs/brendan/2012/11/14/dtracing-in-anger/

Conclusion

All in all the switch was relatively painless, and with tools and tricks described here I feel very comfortable using this system.

Scalar Conf 2016

2016-04-29T22:22:00+02:00


This small conference had its third installment at Polin - Museum of the History of Polish Jews (http://www.polin.pl/en). Organizing a Scala conference isn’t an obvious thing to do, and seeing it get that kind of attention is really great - all the foreign speakers, etc.

Here is a quick recap of presentations I’ve attended.

Having a cake … - by Paweł Szulc

A keynote about typeclasses. I’d read the slides before the presentation (they are available here), so it wasn’t new to me. But I’d strongly recommend watching this talk.

Also, there is another nice talk by Paweł: Category theory is absolute general nonsense! This talk is in Polish only.

EFF Monad - by Eric Torreborre

A different way of monadic composition, implemented in a separate library: eff-cats (there is also a scalaz alternative). Technically it was a bit over my head, but after experimenting with the concept at home it seems interesting.

I liked that Eric, a Zalando employee, developed it for their internal usage. Also, the beauty of easily constructing different technological stacks with a consistent API is very tempting.

Interesting things used/mentioned were:

non/kind-projector - for projections
Monad Transformers and Modular Interpreters

Incremental compiler - by Krzysztof Romanowski

A deep dive into the scalac incremental compiler implementation. While the presentation itself, the way it was delivered, was very nice, I didn’t particularly enjoy the topic. It seemed more appropriate for tool developers.

Akka cluster with Etcd - by Maciej Biłas

This one looked promising. Joining multiple exciting technologies often seems that way. Unfortunately it was more of a tech preview of a Kubernetes app setup than a Scala presentation.

I’m a bit disappointed, because looking through the sources of the project akka-cluster-etcd there are lots of things to describe. Kubernetes not being one of them ;-)

Shapeless? Easy! - by Valentin Kasas

Well, that was a clear introduction to shapeless. Until this presentation I had treated shapeless as just some vague thingy you use when writing libraries. Valentin proved otherwise.

He gave a clear and consistent tutorial on the basics of shapeless and its potential use in some example applications. He created a simple diff tool. Not that useful in real life application, but definitely helped make a point.

Will look into real life examples in near future.

Fun fact: The speaker gave his presentation in REPLesent, a tool for running presentations from the Scala REPL - and performed most of the slide-jumping-fu with just one hand, the other held in his pocket :D Looked quite hilarious.

Valentin also mentioned shapeless’ typeclass derivation - this actually looks nice! Have a look at this question.

Cool Toolz in the Scalaz and Cats Toolboxes - by Jan Pustelnik

Functional Design Patterns - enjoyable talk. With a bit of theoretical background, lots of useful use-cases. The talk was about cats and scalaz - both being very similar, and patterns they contain.

Key points being:

Functor pattern for abstracting Big Data computations, to achieve easier testability
composing functors - |@| scream operator - also known (officially) as Applicative Builder
Applicative - turning a list of computations into a computation of a list, i.e. list traversal
Applicative pattern for reading config
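The “list of computations into a computation of a list” point can be sketched in a few lines. This is a Python analogy, not the cats or scalaz API: a computation that may fail is modelled as a value that is either a result or None, and the function names `sequence`, `traverse` and `parse_int` are made up for this example.

```python
from typing import Callable, List, Optional, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def sequence(options: List[Optional[A]]) -> Optional[List[A]]:
    """Turn a list of optional values into an optional list.

    The result is None as soon as any element is None.
    """
    result: List[A] = []
    for item in options:
        if item is None:
            return None
        result.append(item)
    return result


def traverse(f: Callable[[A], Optional[B]], items: List[A]) -> Optional[List[B]]:
    """Apply f to every item, then sequence the results."""
    return sequence([f(item) for item in items])


def parse_int(text: str) -> Optional[int]:
    """Hypothetical config-style parser: None instead of an exception."""
    try:
        return int(text)
    except ValueError:
        return None


print(traverse(parse_int, ["80", "443"]))   # [80, 443]
print(traverse(parse_int, ["80", "oops"]))  # None
```

The same shape is what makes the pattern handy for reading config: every field is parsed independently, and the whole result is only valid if all the fields are.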

There is this nice video called “A purely functional approach to building large applications”. Go see it.

Conclusion

To conclude - the conference was really great. Both the talks and social side of the event were enjoyable. I’ve been blown away by the great architecture of the museum itself. Been there a couple times and enjoyed all of them.

One much needed change would be to add more tracks, because sometimes the topics were too specific and I’d gladly have switched to another talk.

As for the usual conference merch, there were no bags with leaflets, pens, etc. It came as a surprise to me, but I also really liked it. People in fact throw away most of the contents of such bags. Also, no t-shirt by default - great idea! If you like it, just buy one; otherwise don’t bother.

Overall - great job Softwaremill! Keep up the good work!

Lambda Days 2016

2016-02-20T23:22:00+01:00

Lambda Days 2016 - Kraków

I’ve spent the last two days at a conference in Kraków. The topics revolved around functional programming, with all the experimental stuff, popular languages, etc. Here are the talks I’ve been to, with short summaries:

Day 1:

Propositions as types - the keynote delivered by Philip Wadler - seen it already on YouTube. But the thing with Lambda t-shirt looked funny again :D
Using “Program shaping” and algorithmic skeletons to parallelise an evolutionary multi-agent system in Erlang - pretty long title, huh :) The presentation, though very academic, was also quite interesting. The whole concept isn’t particularly new, but the set of tools used for refactoring into parallel code seemed interesting.
Static analysis to identify divide-and-conquer algorithms - how to find particular class of algorithms in existing programs? Interesting but shallow, due to time limitations.
The Mysteries of Dropbox - John Hughes of QuickCheck fame showed how to test a big system using blackbox approach. Nice presentation, with all the great insight he always brings into his talks.
Muvr - Jan Machacek - that was a fun presentation about how to build an app that detects your exercise routine and counts how many repetitions you’ve done. It had it all:
- Spark pipelining,
- a Swift code demo (it seemed like the Swift IDE offers something similar to notebook-style programming, doesn’t it?),
- live coding.
Unfortunately, due to the WiFi’s proxy settings the live demo could not be finished, but the code for all this is available somewhere on GitHub, on this account.
Embracing change - how to introduce Clojure into your company’s technology stack seamlessly - by Artur Skowroński. Hilarious presentation about adopting new languages. It wasn’t that groundbreaking, at least for me, but the way it was presented, and the slides with all the Cthulhu references in them, were great!

Day 2:

Things that matter by Bruce Tate - great energetic talk, with all the ideas about learning new languages. Bruce is the author of two excellent books: “Seven languages in seven weeks” and “Seven more languages in seven weeks” :) He went through the languages mentioned in his books and summed up the principles associated with their inception. I almost decided not to come to this presentation, weighing it against sleeping a little bit longer in my hotel room.
The Zen of Akka - delivered by a resident hakker at Typesafe - Konrad Malawski. Nice talk about Akka pitfalls, there were lots of recommendations and even some previously unknown to me features of Akka toolkit. Plus I loved the Japanese elements.
Creating reactive components using ClojureScript React wrappers - by Konrad Szydlo, who introduced Rum as a wrapper for React, described part of the ecosystem and explained the mechanics behind the technology. Fine presentation, very intense on content. Konrad had 111 slides, but managed to show all of them without being hasty.
Getting started with Frege - by Lech Glowiak. Frege seems to be a Haskell on the JVM, which in itself seems like a nice thing. But I treat such languages as nice experiments rather than something useful. Its Java interop in particular looks ugly. Nevertheless, the presentation was given from the point of view of a language contributor, which Lech is.
Practical demystification of CRDT - by Dmitry Ivanov and Nami Naserazad - both of them from TomTom. The guys are working on TomTom’s NaviCloud product. Their presentation was a practical guide through the world of CRDTs. They showed their failures in implementing the system, gave advice, etc. The whole thing is even uploaded to GitHub, so everyone can check their code (http://github.com/ajantis/scala-crdt). I really enjoyed this talk due to its technical approach. There were no formal definitions, no theoretical considerations, just a clean report from the trenches.
Purely functional Web Apps - by Michał Płachta - how to write a Gitlab companion app in Haskell + Elm? Haskell for the backend and Elm for the frontend. This presentation showed the great potential of Elm for frontend development. Moderately approachable considering the Friday afternoon and my lack of Haskell knowledge. Still, I plan to come back to this presentation later.

What I haven’t seen but plan to as soon as videos are available

Modeling your domain. What have we learned? Where do we go from here? - by Norbert Wójtowicz - about modeling domain in Clojurescript apps
The F#orce Awakens - by Evelina Gabasova - F# + geekery. Twitter went nuts about that presentation, I suppose not without reason
LFE - a lisp flavour on the Erlang VM - by Robert Virding - don’t know why but it seems intriguing enough

Conclusion

As always, the things you didn’t expect to be interesting were the best ones. The highlights were the presentations by Bruce Tate, Jan Machacek and the guys from TomTom (Dmitry Ivanov and Nami Naserazad).

Thanks to the organisers for this conference. But I must say that day 2 was a lot better than the first one. The talks were better structured and presented.

One idea for the organizers - please print the schedule on the back of the conference badge. It was pretty annoying to have to take out the A4 sheets with the whole day’s schedule each time I wanted to look at it.

32c3 most interesting videos

2016-01-22T00:32:00+01:00

It’s been some time since the most recent incarnation of the Chaos Communication Congress took place (CCC site). I’ve finally managed to sift through some of the videos. Here is a selection of those I’ve found most interesting. If there are others really worth watching - drop me a line! Cheers.

Tor onion services are more useful than most people realize — super packed 32c3 talk from @torproject devs:

State of the Onion - what happened in 2015 around the Tor project. Informative to say the least.

A very interesting presentation. Especially in the light of Novena Laptop and the like:

SO YOU WANT TO BUILD A SATELLITE? - ha!

INSIDE GLORIOUS LEADER’S OPERATING SYSTEM - this one is actually great! No ridiculing, just a review of features. Go see what OS-level surveillance looks like:

3D printing on the moon - just printing something on your 3D printer seems trivial compared to the challenge of doing this under Moon conditions:

J. Alex Halderman, Nadia Heninger: Logjam: Diffie-Hellman, discrete logs, the NSA, and you

Super Confitura Man

2014-07-14T20:50:00+02:00

How Super Confitura Man came to be :)

Recently at TouK we had a one-day hackathon. There was no main theme for it; you could just post a project idea, gather people around it and hack on that idea for a whole day - drinks and pizza included.

My main idea was to create something that would be fun to build and somehow useful to others. I figured that since Confitura was just around the corner I could make a game that would be playable at TouK’s booth at the conference venue. This idea seemed good enough to attract Rafał Nowak @RNowak3 and Marcin Jasion @marcinjasion - two TouK employees who formed a team with me for the hackathon.

The initial plan was to develop a simple mario-style game, with procedurally generated levels, random collectible items and enemies. One of the ideas was to introduce Confitura Man as the main character, but due to time constraints this fell through. We decided to just choose a random available sprite for the character - hence the onion man :)

How is the game played?

Since we wanted to have a scoreboard and unique users, we printed out QR codes. A person who wanted to play the game could pick up a QR code and show it to a camera attached to the play booth. The start page scanned the QR code and launched the game with the username read from the paper code.

The rest of the game was playable with gamepad or keyboard.

Technicalities

Writing a game takes a lot of time and effort. We wanted to deliver, so we’ve decided to spend some time in the days before the hackathon just to bootstrap the technology stack of our enterprise.

We decided that the game would be written in some JavaScript-based engine, with Google Chrome as the web platform. There are a lot of HTML5 game engines - list of html5 game engines - and you could easily create a game with any of them. We decided to use Phaser, which handles a lot of difficult, game-related stuff on its own. So we didn’t have to worry about physics, loading and storing assets, animations, object collisions, or controls input/output. Go see for yourself, it is really nice and easy to use.

The scoreboard would be a rip-off of JIRA Survivor, with stats being served from some web server app. To make things harder, the backend server was written in Clojure. With no experience in that language in the team it was a bit risky, but the tasks of the server were trivial, so if all that Clojure effort failed, it could be rewritten in something we know.

Statistics

During the whole Confitura day there were 69 unique players (69 QR codes were used), and 1237 games were played. The final score looked like this:

Barister Lingerie 158 - 1450 points
Boilerdang Custardbath 386 - 1060 points
Benadryl Clarytin 306 - 870 points

And the obligatory scoreboard screenshot:

Obstacles

The game, being created in just one day, had to have problems :) It wasn’t play tested enough, there were some rough edges. During the day we had to make a few fixes:

the server did not respect the highest score of a specific user, it was just overwriting a user’s score with their latest one,
there was one feature not supported on the keyboard that was available on the gamepad - the turbo button,
the server was opening a database connection each time it got a request, so after around 5 minutes it would exhaust the open file limit for MongoDB (the backend database); this was easily fixed - though the fix is a bit hackish :)
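The first fix on that list boils down to “only overwrite when the new score is higher”. Here is a minimal Python sketch of that rule; the real server was written in Clojure on top of MongoDB, so the in-memory dict and the `record_score` name are purely illustrative assumptions.

```python
from typing import Dict


def record_score(scores: Dict[str, int], user: str, points: int) -> int:
    """Store points for user only if they beat the user's current best.

    `scores` is a plain dict standing in for the scoreboard storage.
    Returns the user's best score after the update.
    """
    best = scores.get(user)
    if best is None or points > best:
        scores[user] = points
    return scores[user]


scoreboard: Dict[str, int] = {}
record_score(scoreboard, "player-1", 1450)
record_score(scoreboard, "player-1", 300)   # a worse later game must not overwrite
print(scoreboard["player-1"])  # 1450
```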

These were easily identified and fixed. Unfortunately there were issues that we were unable to fix while the event was on:

Google Chrome kept asking for permission to use the webcam - this was very annoying, and none of the advice found on the web worked - StackOverflow thread,
it was hard to start the game with a QR code - either the codes were too small, or the lighting around that area was inappropriate - I think this issue could be fixed by printing larger codes.

Technology evaluation

All in all we were pretty happy with the chosen stack. Phaser was easy to use and left us with just the fun parts of the game creation process. Finding the right graphics with appropriate licensing was rather hard. We didn’t have enough time to polish all the visual aspects of the game before Confitura.

Writing a server in Clojure was the most challenging part, with all the new syntax and new libraries. There were tasks that are trivial in Java/Scala but hard in Clojure - at least for wimpy beginners :) Nevertheless, Clojure seems like a really handy tool and I’d like to dive deeper into its ecosystem.

Source code

All of the sources for the game can be found here TouK/confitura-man.

The repository is split into two parts:

game - HTML5 game
server - Clojure-based backend server

To run the server you need to have a local MongoDB installation. Then, in the server’s directory, run:

$ lein ring server-headless 

This will start a server on http://localhost:3000

To run the game you need to install dependencies with bower and then run

$ grunt

from game’s directory.

To launch the QR reading part of the game, you enter http://localhost:9000/start.html. After scanning the code you’ll be redirected to http://localhost:9000/index.html - and the game starts.

Conclusion

Summing up, it was a great experience creating the game. It was fun to watch people playing the game. And even with all those glitches and stupid graphics, there were people vigorously playing it, which was awesome.

Thanks to Rafał and Marcin for a great coding experience, and thanks to all the players of our stupid little game. If you’d like to ask me about anything - feel free to contact me by mail or twitter @zygm0nt

Distributed scans with HBase

2013-12-10T21:26:00+01:00

HBase is by design a columnar store that is optimized for random reads. You just ask for a row using its rowId as an identifier and you get your data almost instantaneously.

Performing a scan on part of a table or on the whole table is a completely different thing. First of all, it is sequential, which means it is rather slow, because it doesn’t use all the RegionServers at the same time. It is implemented that way to fulfil the contract of the Scan command, which has to return results sorted by key.

So, how to do this efficiently?

The usual way of getting data from HBase is with the help of its API, mainly Scan objects. To accomplish the task I’ll use just them. I’ll specify startRow and stopRow, so that each Scan request will be looking through only part of the key space.

It is crucial to note that this whole solution works because of the key sorting properties of HBase: it scans a table in ascending key order. Since my keys are strings, they are sorted lexicographically, so a key with value “1” is smaller than “2”, and a key with value “12345” is also smaller than “2”. I’ve leveraged this property by partitioning my whole key space according to the first character of the key. In my case keys contain only digits, so I have 10 ranges:

null-1
1-2
2-3
3-4
4-5
5-6
6-7
7-8
8-9
9-null
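A quick way to convince yourself of the ordering and of the ranges above is a few lines of Python. This is just a sketch: Python compares strings by code point, which for digit-only keys gives the same order as the byte-wise comparison HBase uses, and the `key_ranges` and `in_range` helpers are names made up for this example.

```python
from typing import List, Optional, Tuple

# Lexicographic order: "12345" sorts before "2".
keys = ["2", "12345", "1", "901", "25"]
print(sorted(keys))  # ['1', '12345', '2', '25', '901']


def key_ranges() -> List[Tuple[Optional[str], Optional[str]]]:
    """Build the 10 (start, stop) ranges; None means 'open ended'."""
    bounds: List[Optional[str]] = [None] + [str(d) for d in range(1, 10)] + [None]
    return list(zip(bounds[:-1], bounds[1:]))


def in_range(key: str, start: Optional[str], stop: Optional[str]) -> bool:
    """Start is inclusive and stop is exclusive, as with startRow/stopRow."""
    return (start is None or key >= start) and (stop is None or key < stop)


for start, stop in key_ranges():
    matching = [k for k in sorted(keys) if in_range(k, start, stop)]
    print(start, stop, matching)
```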

The speedup comes from the fact that each range resides in its own partition (a region, in HBase terms). That’s right, I’ve presplit the table into 10 partitions. This corresponds rather nicely with my cluster’s setup, because I have more than 10 RegionServers, so every partition should end up on a different RegionServer. This allows the code to run the requested scan operations in parallel, which is exactly where the performance boost comes from.

How I’ve created the input table:



$ create 'tariff_changes', { NAME => 'cf', SPLITS_FILE => 'splits.txt', VERSIONS => 50, MAX_FILESIZE => 1073741824 }

$ alter 'tariff_changes', { NAME => 'cf', SPLITS_FILE => 'splits.txt', VERSIONS => 50, MAX_FILESIZE => 1073741824 }

The split file is just something along these lines:

1 2 3 4 5 6 7 8 9 0

This tells HBase which row keys start and end each of the partitions.

OK, so after this rather lengthy introduction, what does the actual code do? It just spins off a few threads - one for each partition - and runs a Scan request tailored to that partition’s key space. This way, I got a 10x speedup for this particular scan: the execution time went down from 30 minutes to 3 for my use case.
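The thread-per-partition idea can be sketched without a cluster. The snippet below is not the code from the repository and does not talk to HBase at all: `TABLE` is a made-up, sorted in-memory list of row keys, and `scan_range` is a stand-in for a single Scan with startRow and stopRow. What it shows is the fan-out over ranges and the fact that concatenating the per-range results in range order still yields keys in sorted order.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Tuple

# Made-up row keys; a sorted list stands in for the table.
TABLE = sorted(["0042", "1001", "12345", "2", "25", "777", "901"])

Range = Tuple[Optional[str], Optional[str]]


def key_ranges() -> List[Range]:
    """The 10 ranges: null-1, 1-2, ..., 9-null (None means open ended)."""
    bounds: List[Optional[str]] = [None] + [str(d) for d in range(1, 10)] + [None]
    return list(zip(bounds[:-1], bounds[1:]))


def scan_range(start: Optional[str], stop: Optional[str]) -> List[str]:
    """Stand-in for one scan: keys with start <= key < stop, in sorted order."""
    lo = 0 if start is None else bisect_left(TABLE, start)
    hi = len(TABLE) if stop is None else bisect_left(TABLE, stop)
    return TABLE[lo:hi]


def parallel_scan() -> List[str]:
    """Run one scan per range in its own thread and stitch the results together."""
    ranges = key_ranges()
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        # map() returns results in the order of `ranges`, not completion order.
        chunks = list(pool.map(lambda r: scan_range(*r), ranges))
    return [key for chunk in chunks for key in chunk]


print(parallel_scan())
```

With an in-memory list there is of course no speedup to measure; the gain described above comes from each real scan hitting a different RegionServer.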

I’ve created an example implementation of this idea. You can find it on GitHub: https://github.com/zygm0nt/hbase-distributed-search.

Any ideas on how to speed things up even more with HBase?

Simple HBase ORM

2013-12-08T21:08:00+01:00

When dealing with data stored in HBase, you quickly come to the conclusion that it is extremely inconvenient to access it via the native HBase API. It is very verbose and you always need to convert between bytes and simple types - a pain.

While I was working on a project of mine, I thought: why not ease those pains and fetch real objects from HBase?

And that’s how this simplistic, hackish ORM came to life. It is no match for projects like Kundera (a JPA compliant solution), or n-orm. Nevertheless, it just suits my needs :)

Project sources are hosted on GitHub: https://github.com/zygm0nt/hbase-annotations

To make use of this, you need to have an entity class with annotations:

@Column - with an argument specifying column family and column name, i.e. @Column("cf:column-name")
@Id - will store the row key in this property,
and optionally @Value - for Spring Expression Language, in case you need to perform some extraction on the value.

Annotations should be on setter methods.

Now you have your model annotated and ready to be fetched from HBase.

The actual work is done by a service class that should extend BaseHadoopInteraction, just as SimpleHBaseClient does.

Then it is possible to just call:

Note that there are more methods you can use on BaseHadoopInteraction. You can also do:

scan
scan with key ranges
delete

What you won’t get from this simple ORM is:

automatic object updating,
nested objects,
saving to HBase - ‘cause I didn’t have a need for that,

Hope you’ll find this piece of code useful. If you see room for improvements while staying in project’s scope - please drop me a message.

And if you are searching for a full-fledged ORM solution for interacting with HBase, just head straight to Kundera project website :)

Recently at storm-users

2013-08-12T22:26:00+02:00

I’ve been reading through storm-users Google Group recently. This resolution was heavily inspired by Adam Kawa’s post “Football zero, Apache Pig hero”. Since I’ve encountered a lot of insightful and very interesting information I’ve decided to describe some of those in this post.

nimbus will work in HA mode - There’s a pull request open for it already… but some recent work (distributing topology files via Bittorrent) will greatly simplify the implementation. Once the Bittorrent work is done we’ll look at reworking the HA pull request. (storm’s pull request)
pig on storm - Pig on Trident would be a cool and welcome project. Join and groupBy have very clear semantics there, as those concepts exist directly in Trident. The extensions needed to Pig are the concept of incremental, persistent state across batches (mirroring those concepts in Trident). You can read a complete proposal.
implementing topologies in pure python with petrel looks like this:

class Bolt(storm.BasicBolt):
    def initialize(self, conf, context):
        ''' This method is executed only once '''
        storm.log('initializing bolt')

    def process(self, tup):
        ''' This method is executed every time a new tuple arrives '''
        msg = tup.values[0]
        storm.log('Got tuple %s' % msg)

if __name__ == "__main__":
    Bolt().run()

Fliptop is happy with storm - see their presentation here
topology metrics in 0.9.0: The new metrics feature allows you to collect arbitrarily custom metrics over fixed windows. Those metrics are exported to a metrics stream that you can consume by implementing IMetricsConsumer and configure with Config.java#L473. Use TopologyContext#registerMetric to register new metrics.
storm vs flume - some users’ point of view: I use Storm and Flume and find that they are better at different things - it really depends on your use case as to which one is better suited. First and foremost, they were originally designed to do different things: Flume is a reliable service for collecting, aggregating, and moving large amounts of data from source to destination (e.g. log data from many web servers to HDFS). Storm is more for real-time computation (e.g. streaming analytics) where you analyse data in flight and don’t necessarily land it anywhere. Having said that, Storm is also fault-tolerant and can write to external data stores (e.g. HBase) and you can do real-time computation in Flume (using interceptors)

That’s all for this day - however, I’ll keep on reading through storm-users, so watch this space for more info on storm development.

Zookeeper + Curator = Distributed sync

2013-06-23T22:20:00+02:00

An application developed for one of my recent projects at TouK involved multiple servers. There was a requirement to ensure failover for the system’s components. Since I already had a few separate components I didn’t want to add more, and since there was already a Zookeeper ensemble running (required by one of the services), I decided to go that way with my solution.

What is Zookeeper?

Just a crude distributed synchronization framework. However, it implements Paxos-style algorithms (http://en.wikipedia.org/wiki/Paxos_(computer_science)) to ensure that no split-brain scenario occurs. This is quite an important feature, since I don’t have to care about that kind of problem while using it. You just need to create an ensemble of a few instances to ensure high availability.

It is basically a virtual filesystem, with files, directories and stuff. One could ask: why another filesystem? Well, this one is rather special, especially for distributed systems. What makes building locking algorithms on top of Zookeeper easy is its Ephemeral Nodes, which are just files that exist only as long as the connection that created them exists. Once the client disconnects, the file disappears.

With such paradigms in place it’s fairly easy to create some high level algorithms for synchronization.

With that in place, you can safely integrate multiple services in a distributed way while keeping them loosely coupled.

Zookeeper from developer’s POV

With all the base services for Zookeeper started, it seems there is nothing left to do but connect and start implementing the necessary algorithms. Unfortunately, the API is quite basic: it offers file and directory abstractions, plus a few node types (file types) - ephemeral and sequence. It is also possible to watch a node for changes.

Using bare Zookeeper is hard!

Creating connections is tedious, and there are lots of things to take care of. Handling an established connection is hard too: when connecting to an ensemble, it’s also necessary to negotiate a session. During the whole process a number of exceptions can occur. Some of these are “recoverable” exceptions, which can be handled gracefully without breaking the connection.

So, Zookeeper API is hard.

Even if one is proficient with that API, there are still the recipes. The reason for using Zookeeper is to be able to implement more sophisticated algorithms on top of it. Unfortunately those aren’t trivial, and it is again quite hard to implement them without bugs.

And since distributed systems are hard, why would anyone want another difficult to handle tool?

Enter Curator

Happily, guys from Netflix implemented a nice abstraction for dealing with Zookeeper internals. They called it Curator and use it extensively in the company’s environment. Curator offers consistent API for Zookeeper’s functionality. It even implements a couple of recipes for distributed systems.

File read/write

The basic use of Zookeeper is as a distributed configuration repository. For this scenario I only need read/write capabilities, to be able to write and read files from the Zookeeper filesystem. This code snippet writes a sample json to a file on ZK filesystem.


EnsurePath ensurePath = new EnsurePath(markerPath);
ensurePath.ensure(client.getZookeeperClient());
String json = "...";
if (client.checkExists().forPath(statusFile(core)) != null)
     client.setData().forPath(statusFile(core), json.getBytes());
else
     client.create().forPath(statusFile(core), json.getBytes());

Distributed locking

With multiple systems there may be a need for an exclusive lock on some resource, or perhaps some big system requires its components to synchronize based on locks. This “recipe” is an ideal match for those situations.



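lock = new InterProcessSemaphoreMutex(client, lockPath);
// acquire() returns false if the lock could not be obtained within the timeout
if (lock.acquire(5, TimeUnit.MINUTES)) {
    try {
        // … do sth …
    } finally {
        lock.release();
    }
}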

(from https://github.com/zygm0nt/curator-playground/blob/master/src/main/java/pl/touk/curator/LockingRemotely.java)

Service Advertisement

This is quite an interesting use case. With many small services on different servers it is not wise to exchange IP addresses and ports between them. When some of those services may go down while others try to replace them, the task gets even harder.

That’s why, with Zookeeper in place, it can be utilised as a registry of existing services.

When a service starts, it registers in the ServiceRegistry, offering basic information such as its purpose, role, address, and port.

Services that want to use a specific kind of service request access to an instance of it. This approach decouples services from their configuration.

Basically this scenario needs two steps:

1. Service starts and registers its presence (https://github.com/zygm0nt/curator-playground/blob/master/src/main/java/pl/touk/curator/WorkerAdvertiser.java#L44):



ServiceDiscovery discovery = getDiscovery();
discovery.start();
ServiceInstance si = getInstance();
log.info(si);
discovery.registerService(si);

2. Another service - on another host or in another JVM on the same machine tries to discover who is implementing the service (https://github.com/zygm0nt/curator-playground/blob/master/src/main/java/pl/touk/curator/WorkerFinder.java#L50):


instances = discovery.queryForInstances(serviceName);

The whole concept here is ridiculously simple - the service advertising its presence just stores a file with its whereabouts. The service that is looking for service providers just looks into a specific directory and reads the stored definitions.

In my example, the structure advertised by services looks like this (+ some getters and constructor - the rest is here: https://github.com/zygm0nt/curator-playground/blob/master/src/main/java/pl/touk/model/WorkerMetadata.java):



public final class WorkerMetadata {
    private final UUID workerId;
    private final String listenAddress;
    private final int listenPort;
}
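
The register-and-discover idea can be mimicked without any Zookeeper at all. The following Python sketch is a toy, in-memory stand-in and not Curator’s API: `InMemoryRegistry`, its methods and the session ids are all made up for illustration. Only the three metadata fields come from the class above.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WorkerMetadata:
    """Python mirror of the WorkerMetadata fields shown above."""
    listen_address: str
    listen_port: int
    worker_id: uuid.UUID = field(default_factory=uuid.uuid4)


class InMemoryRegistry:
    """Toy stand-in for Zookeeper: entries vanish when their session closes."""

    def __init__(self):
        self._nodes = {}  # (service_name, session_id) -> WorkerMetadata

    def register(self, service_name, session_id, metadata):
        self._nodes[(service_name, session_id)] = metadata

    def close_session(self, session_id):
        # Mimics ephemeral nodes: everything owned by the session disappears.
        self._nodes = {k: v for k, v in self._nodes.items() if k[1] != session_id}

    def query_for_instances(self, service_name):
        return [v for (name, _), v in self._nodes.items() if name == service_name]


registry = InMemoryRegistry()
registry.register("worker", "session-a", WorkerMetadata("10.0.0.1", 8080))
registry.register("worker", "session-b", WorkerMetadata("10.0.0.2", 8080))
before = len(registry.query_for_instances("worker"))
registry.close_session("session-a")  # the first worker "dies"
after = registry.query_for_instances("worker")
```

After `close_session("session-a")` only the second worker is still discoverable, which is the behaviour ephemeral nodes give you for free.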

Source code

The above recipes are available in Curator library (http://curator.incubator.apache.org/). Recipes’ usage examples are in my github repo at https://github.com/zygm0nt/curator-playground

Conclusion

If you’re in need of a reliable platform for exchanging data and managing synchronization, and you need to do it in a distributed fashion - just choose Zookeeper. Then add Curator for the ease of using it. Enjoy!

image comes from: http://www.flickr.com/photos/jfgallery/2993361148
all source code fragments taken from this repo: https://github.com/zygm0nt/curator-playground

Operational problems with Zookeeper

2013-03-21T23:56:00+01:00

This post is a summary of what Kathleen Ting presented at the Strange Loop conference. You can watch the original here: http://www.infoq.com/presentations/Misconfiguration-ZooKeeper

I’ve decided to put this selection here for quick reference.

Connection mismanagement

too many connections

  WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@247] - Too many connections from /xx.x.xx.xxx - max is 60

running out of ZK connections?
- set maxClientCnxns=200 in zoo.cfg
HBase client leaking connections?
- fixed in HBASE-3777, HBASE-4773, HBASE-5466
- manually close connections

connection closes prematurely

  ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately.

in hbase-site.xml set hbase.zookeeper.recoverable.waittime=30000ms

pig hangs connecting to HBase
```
  WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused!
```
CAUSE: location of ZK quorum is not known to Pig
- use Pig 0.10, which includes PIG-2115
- if there is an overlap between TaskTrackers and ZK quorum nodes
  - set hbase.zookeeper.quorum to final in hbase-site.xml
  - otherwise add hbase.zookeeper.quorum=hadoophbasemaster.lan:2181 in pig.properties

Time mismanagement

client session timed out
```
  INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session , timeout of 40000ms exceeded
```
- ZK and HBase need the same session timeout values
  - zoo.cfg: maxSessionTimeout=180000
  - hbase-site.xml: zookeeper.session.timeout=180000
- don’t co-locate ZK with IO-intense DataNode or RegionServer
- specify right amount of heap and tune GC flags
  - turn on parallel/CMS/incremental GC

clients lose connections

  WARN org.apache.zookeeper.ClientCnxn - Session  for server , unexpected error, closing socket connection and attempting reconnect java.io.IOException: Broken pipe

don’t use SSD drive for ZK transaction log

Disk management

unable to load database - unable to run quorum server

  FATAL Unable to load database on disk !  java.io.IOException: Failed to process transaction type: 2 error: KeeperErrorCode = NoNode for  at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:152)!

archive and wipe /var/zookeeper/version-2 if other two ZK servers are running

unable to load database - unreasonable length exception
```
  FATAL Unable to load database on disk java.io.IOException: Unreasonable length = 1048583 at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:100)
```
- server allows a client to set data larger than the server can read from disk
- if a znode is not readable, increase jute.maxbuffer
  - look for "Packet len is out of range" in the client log
  - increase it by 20%
  - set in JVMFLAGS="-Djute.maxbuffer=yy" bin/zkCli.sh
  - fixed in ZOOKEEPER-1513
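
The 20% bump is simple arithmetic. A sketch in Python, assuming the server still runs with ZooKeeper's default jute.maxbuffer of 0xfffff bytes (check your own setting first; `bumped` is a hypothetical helper, not part of any ZooKeeper tooling):

```python
DEFAULT_JUTE_MAXBUFFER = 0xFFFFF  # 1048575 bytes, assumed to be the current limit


def bumped(current, percent=20):
    """Return current increased by percent, using integer arithmetic."""
    return current * (100 + percent) // 100


new_limit = bumped(DEFAULT_JUTE_MAXBUFFER)
print("-Djute.maxbuffer=%d" % new_limit)
```

The resulting value comfortably covers the 1048583-byte record from the log above.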
failure to follow leader
```
  WARN org.apache.zookeeper.server.quorum.Learner: Exception when following the leader java.net.SocketTimeoutException: Read timed out 
```
CAUSE:
- disk IO contention, network issues
- ZK snapshot is too large (lots of ZK nodes)
SOLVE:
- reduce IO contention by putting dataDir on dedicated spindle
- increase initLimit on all ZK servers and restart, see ZOOKEEPER-1521
- monitor network

Best Practices

DOs

separate spindles for dataDir & dataLogDir
allocate 3 or 5 ZK servers
tune garbage collection
run zkCleanup.sh script via cron

DON’Ts

don’t co-locate ZK with I/O intense DataNode or RegionServer
don’t use SSD drive for ZK transaction log

You may also run a Zookeeper server as an observer - a non-voting member:

in zoo.cfg
```
  peerType=observer
```

After WHUG meeting

2012-11-30T22:20:00+01:00

Here are the slides from the talk I gave yesterday. If you have any questions, please ask.

WHUG 8. Beyond Hadoop - checking other options

2012-11-26T09:11:00+01:00

This coming Thursday, 29.11.2012, I will give a presentation at the Warsaw Hadoop User Group. You can RSVP here: http://www.meetup.com/warsaw-hug/

And what will I talk about? Pasted from the WHUG page:

Marcin will focus on how the Hadoop ecosystem works together with other tools. He will show how to process graphs simply and conveniently, and how to apply the Big Data approach in real time. He will also touch on easier ways of writing Map-Reduce algorithms.

It will be a somewhat less technical (but still practical) tour around the fringes of the topics usually discussed in connection with Hadoop.

The presentation will cover tools such as Cascading, Storm and Titan.

You are welcome to join!

Hadoop HA setup

2012-10-30T12:40:00+01:00

With the advent of Hadoop 2.x, there is finally a working High-Availability solution. Even two of them. Now it really is easy to configure and use those solutions. They no longer require external components, like DRBD. It is all neatly packed into the Cloudera Hadoop distribution - the precursor of this solution.

Read on to find out how to use it.

The most important weakness of previous Hadoop releases was the single point of failure, which happened to be the NameNode. The NameNode, as a key component of every Hadoop cluster, is responsible for managing filesystem namespace information and block locations. Losing its data results in losing all the data stored on the DataNodes: HDFS is no longer able to locate specific files or their blocks. This renders your cluster inoperable.

So it is crucial to be able to detect and counter problems with the NameNode. The most desirable behavior is to have a hot backup that would ensure no-downtime cluster operation. To achieve this, the second NameNode needs to have up-to-date filesystem metadata and it also needs to be up and running. Starting a NameNode with an existing set of data may easily take many minutes, because it has to parse the actual filesystem state.

The previously used solution - deploying a SecondaryNameNode - was somewhat flawed. It took a long time to recover after a failure. It was not a hot-backup solution, which also added to the problem. Some other solution was required.

So, what needed to be made redundant is the contents of the edits dir, and, in an HA deployment, each DataNode has to send its block location maps to both NameNodes. This was accomplished in two steps. The first came with the release of CDH 4 beta: a solution based on sharing the edits directory. Then, with CDH 4.1, came the quorum-based solution.

Find out how to configure those on your cluster.

Shared edits directory solution

This kind of setup assumes that a shared storage directory exists in the cluster. It should be deployed using some kind of network-based filesystem. You could try NFS or GlusterFS.

dfs.nameservices=example-cluster
dfs.ha.namenodes.example-cluster=nn1,nn2
dfs.namenode.rpc-address.example-cluster.nn1=master1:8020
dfs.namenode.rpc-address.example-cluster.nn2=master2:8020
dfs.namenode.http-address.example-cluster.nn1=0.0.0.0:50070
dfs.namenode.http-address.example-cluster.nn2=0.0.0.0:50070
dfs.namenode.shared.edits.dir=file:///mnt/filer1/dfs/ha-name-dir-shared
dfs.client.failover.proxy.provider.example-cluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.ha.fencing.methods=sshfence
dfs.ha.fencing.ssh.private-key-files=/home/user/.ssh/id_dsa
dfs.ha.automatic-failover.enabled=true
ha.zookeeper.quorum=zk1:2181,zk2:2181,zk3:2181

This setup is quite OK, as long as you’re comfortable with maintaining a separate service (network storage) for handling the HA state. It seems error prone to me, because it adds another service whose high availability has to be ensured. NFS seems to be a bad choice here, because AFAIK it does not offer HA out of the box.

On the other hand, we have GlusterFS, which is a distributed filesystem that you can deploy on multiple bricks and increase the replication level.

Nevertheless, it still brings additional burden of another service to maintain.

Quorum based solution

With the release of CDH 4.1.0 we are now able to use a much better integrated solution based on JournalNodes. Now all the updates are synchronized through the JournalNodes. Each JournalNode has the same data, and all the NameNodes are able to receive filesystem state updates from those daemons.

This solution is much more consistent with Hadoop ecosystem.

Please note, that the config is almost identical to the one needed for shared edits directory solution. The only difference is the value for dfs.namenode.shared.edits.dir. This now points to all the journal nodes deployed in our cluster.

dfs.nameservices=example-cluster
dfs.ha.namenodes.example-cluster=nn1,nn2
dfs.namenode.rpc-address.example-cluster.nn1=master1:8020
dfs.namenode.rpc-address.example-cluster.nn2=master2:8020
dfs.namenode.http-address.example-cluster.nn1=0.0.0.0:50070
dfs.namenode.http-address.example-cluster.nn2=0.0.0.0:50070
dfs.namenode.shared.edits.dir=qjournal://node1:8485;node2:8485;node3:8485/example-cluster
dfs.client.failover.proxy.provider.example-cluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.ha.fencing.methods=sshfence
dfs.ha.fencing.ssh.private-key-files=/home/user/.ssh/id_dsa
dfs.ha.automatic-failover.enabled=true
ha.zookeeper.quorum=zk1:2181,zk2:2181,zk3:2181
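
Since the only value that changes between the two setups is dfs.namenode.shared.edits.dir, it is easy to derive it from the list of journal nodes. A minimal Python sketch, where `qjournal_uri` is a hypothetical helper written for this post and the host names and port are the ones from the config above:

```python
def qjournal_uri(journal_hosts, nameservice, port=8485):
    """Build the dfs.namenode.shared.edits.dir value for a set of JournalNodes."""
    if not journal_hosts:
        raise ValueError("at least one JournalNode host is required")
    authority = ";".join("%s:%d" % (host, port) for host in journal_hosts)
    return "qjournal://%s/%s" % (authority, nameservice)


# The one property that differs between the two HA setups described here.
shared_dir_setup = {
    "dfs.namenode.shared.edits.dir": "file:///mnt/filer1/dfs/ha-name-dir-shared",
}
quorum_setup = {
    "dfs.namenode.shared.edits.dir": qjournal_uri(
        ["node1", "node2", "node3"], "example-cluster"),
}
```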

Infrastructure

In both cases you need to run Zookeeper-based Failover Controller (hadoop-hdfs-zkfc). This daemon negotiates which NameNode should become active and which standby.

But that’s not all. Depending on the way you’ve chosen to deploy HA, you need to do some other things:

Shared edits dir

With the shared edits dir you need to deploy a networked filesystem and mount it on your NameNodes. After that you can run your cluster and be happy with your new HA.

Quorum based

For QJournal to operate you need to install one new package called hadoop-hdfs-journalnode. This provides startup scripts for Journal Node daemons. Choose at least three nodes that will be responsible for handling edits state and deploy journal nodes on them.
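
Why at least three? A quorum-based journal needs a majority of the journal nodes to acknowledge each edit, so the number of nodes you can lose grows only with every second node you add. A quick Python sketch of that arithmetic (`majority` and `tolerated_failures` are names made up for illustration):

```python
def majority(n):
    """Smallest number of journal nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many journal nodes can be down while a majority is still reachable."""
    return n - majority(n)


for n in (1, 2, 3, 4, 5):
    print(n, majority(n), tolerated_failures(n))
```

With one or two nodes no failure is tolerated at all, three nodes survive one failure, and a fourth node adds nothing over three, which is why odd counts are the usual choice.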

Conclusion

Thanks to the guys from Cloudera we can now use enterprise-grade High Availability features for Hadoop. Eliminating the single point of failure in your cluster is essential for easy maintainability of your infrastructure.

Given the above choices, I’d suggest using the QJournal setup, because of its relatively small impact on the overall cluster architecture. Its good performance and fairly simple setup let users easily start using Hadoop in an HA setup.

Are you using Hadoop with HA? What are your impressions?

raspberry-pi

2012-10-18T23:16:00+02:00

About a month ago I finally received my very own Raspberry Pi board! Don’t know what that is? Here, read some at their website.

For the sake of completeness, let me just describe it as a prototyping platform with an ARM processor. It is really similar in concept to Arduino, except it doesn’t have that many extensions available (none? or very few - I’ve only found some on the Adafruit pages).

So, here is the obligatory picture.

It can run a Linux distribution, so anyone familiar with that can have a go with this low-powered computer.

The board itself has been on the market for quite some time now. That’s why there are lots of interesting resources and projects that you can do with it.

Here are just a bunch of them:

UberFridge - for all the brewers in the world :)
SOHO server
a supercomputer - cluster out of R-Pi
Arcade Cabinet - custom build gaming rig
Hand-held emulation machine - R-Pi on the go
sending R-Pi to the edge of Space
coffee machine - MoccaPi
other projects on elinux - Elinux projects
other projects on Hack-a-Day - Hack-a-Day Raspberry Pi tags

Do you also own R-Pi? Share what you plan to do with it.

Hadoop for Enterprises

2012-06-18T11:08:09+02:00

Hadoop’s usage as a big data processing framework has gained a lot of attention lately. Now it is not only the big players who see that they can embrace the data their sites or products generate and build their businesses on it. For that to happen two things are needed: the data itself and the means of processing really big amounts of it.

Gathering data is relatively easy. It doesn’t have to be structured, and you don’t need to plan its usage up front. Just start collecting it, and then you may experiment with its potential uses. If it turns out to be useless rubbish, deleting it won’t be hard. But imagine the value it may contribute to your business:

faster services - working on optimized data
more clients - because of more relevant search results
happy clients - your service can “read their minds”
etc.

There are many companies that utilize the Hadoop ecosystem for their own needs. You can read about some of them here: http://wiki.apache.org/hadoop/PoweredBy But since that page lacks insight into specific applications of Hadoop, I’ve tried to delve into the details of how Hadoop helped tame some companies’ big data sets.

Facebook

Being a widely used social network provider, they require no introduction. However, if you’ve lived under a rock for the last couple of years, just visit their website: http://facebook.com

Their main usage is data warehousing. Since they need to access the data fast and reliably, they needed real-time querying of their huge and always growing data set. Their switch from MySQL databases was required due to the increasing workloads they experienced with standard databases. What they got “out of the box” with Hadoop was all the benefits of a distributed file system (HDFS features). They expanded the ideas behind that even further and implemented a truly Highly Available file system without a Single Point of Failure.

Facebook has 3 interesting usage scenarios in which Hadoop plays a major role:

Titan - is Facebook’s messaging system. It processes messages exchanged between users. Ensures that it happens fast and without glitches. Here Hadoop is used mainly as a huge, unlimited storage.
Puma - Facebook Insights - a tool providing page statistics for advanced Facebook users. Based on streams of data (clicks, likes, shares, comments and impressions) it graphs those data and makes it available near instantly.
ODS - Operational Data Store - which stores Facebook’s internal metrics - collections of OS and cluster health metrics. And it facilitates multiple accounting solutions.

Twitter

This popular micro-blogging platform, where you can register your account and follow friends and celebrities for their micro-messages does some pretty interesting things with their Hadoop cluster.

One of their motivations is to speed up their web page’s functionality. That is why they compute users’ friendships in Twitter’s social graph with Hadoop. Using connections between users they calculate their relationship to each other and estimate groups of users.

Since this service’s users generate lots of content, the company conducts research based on natural language processing. They probe what could be told about a user from their tweets. They use the tweets’ contents for advertising purposes, trend analysis and much more.

From tweets and users’ behaviours they characterise usage scenarios. They also gather usage statistics, like the number of searches or tweets per day. Based on this seemingly irrelevant data they run comparisons of different types of users. Twitter analyzes data to determine whether mobile users, users of third-party clients or power users use Twitter differently from average users. Of course these seem like really specific applications, but nevertheless they are very original and are based on the data that Twitter has been gathering for some time now.

EBay

Being the biggest auctioning site on the Internet, EBay uses Hadoop processing for increasing search relevance based on click-stream data, user data. This seems pretty obvious, considering their area of operation.

However, they also have one other interesting thing - they try hard to automatically fill in the auctioned objects’ metadata, based on the descriptions and other data provided by users. They employ a data mining approach for these tasks, and judging from their constant growth it seems to work.

LinkedIn

A social network for professionals, though a lot smaller than Facebook. Based on click-streams they discover relations between users. All the data concerning the latest visits to your profile or people you may know from other places comes from Hadoop-based analysis of the clicks people make all the time on their site.

There is also a very neat feature called InMaps (http://inmaps.linkedinlabs.com/), which analyses declared schools and companies and generates data for a graph of your friends, grouped into clusters.

Last.fm

This on-line radio site, praised by many for its invaluable recommendations’ system seems like a rather small and simple service. But behind the facade of simple web page there are lots of data being processed, so that their services could match a certain level of perfection.

Such a large volume of data comes from scrobbles. Each user of their service who listens to a song generates a note about this fact, called a scrobble. Based on that and on user profiles they calculate global band popularity charts, maps of bands’ popularity and many more usage statistics and timeline charts.

Conclusion

They just try to detect and trace new patterns in seemingly chaotic data sets. Perhaps you could also do the same? Analyze your data and expand your business value?

SoapUI ext libs and its weirdness

2011-11-02T16:32:15+01:00

Suppose you want to add some additional jars to your SoapUI installation. It all should work ok if you put them in bin/ext directory. It is scanned at startup, and jars found there are automatically added to classpath.

However if you want to add some JDBC drivers, and happen to be using SoapUI version higher than 3.5.1 it is a bit more tricky.

You may face this NoClassDefFoundError:

An error occured [oracle/jdbc/Driver], see error log for details
java.lang.NoClassDefFoundError: oracle/jdbc/Driver

If so, try registering your drivers with registerJdbcDriver function, like I did in this snippet of code:

import groovy.sql.Sql

if (context.sql == null) {
    def driver = 'oracle.jdbc.OracleDriver'
    com.eviware.soapui.support.GroovyUtils.registerJdbcDriver(driver)
    def sql = Sql.newInstance('jdbc:oracle:thin:' + dbUri, driver)
    context.setProperty('sql', sql)
}

What a crappy thing!

What is NoSQL good for?

2011-09-21T23:12:34+02:00

… or how I ended up writing a CouchDB proof-of-concept app

Once upon a time I set out on a journey to discover the NoSQL land. I decided that running simple queries wouldn’t be interesting enough. That’s why I chose to create an app based on a NoSQL database.

The main idea was to create an app that would dynamically update itself with geographic data flowing in. Since there are myriads of geo-data sets available on the internet, you can pick your favorite one and load it into your SQL database of choice.

In my case the primary source of data was a proprietary database, or more specifically - one table in it continuously updated with new data. To make that data visible on my map I needed to:

buffer the huge amount of those records - so as not to overwhelm other services with heavy traffic, and not to flood the front-end
convert them to my representation
display them - have a presentation layer in a browser - since a browser-based front-end was the easiest and fastest to develop

The idea of the front-end HTML page was to show new points on the map. From the moment the page is opened, records that appear in the database table should be shown interactively on the screen.

Toys used

For the first step I chose the RabbitMQ broker. A queue on the broker would receive messages - one message per row of the database table. Then I’d use some simple Groovy middleware to convert the data to the appropriate format and put it into another database - this time one specific to my app.

You may ask why I’d bring in another database. It is good for separating environments - assuming the original data contains some sensitive content that should be anonymised, or we just don’t feel comfortable exposing the whole database of some XYZ-system just to get access to one of its tables.

Since for my presentation layer I chose HTML+JS without any application-server-based back-end, I decided on CouchDB. This seemed like a perfect match for this scenario. Why? Ease of use and a REST API with JSON responses - just great for interacting with my simple front-end.

The flow of things was as shown in the image below:

Avro - for the beginning

As you can see, I’ve chosen JSON as my data format. I considered Apache Avro at first, but using it was a real pain in the ass. Avro itself is used in Apache Hadoop as a serialization layer, so it would seem OK, but it has virtually no documentation. Still, once you tear through the unintuitive interface and manage to handle all those unthinkable exceptions, the library has a few pros. It’s great in that it does not require code generation - I like it being done on the fly. It also offers sending data in binary format, which was not necessary, but is nevertheless a nice feature.

What I certainly didn’t like about it was its orientation toward files rather than chunks of data - so it was not obvious how I should send data over the wire.

Then I found out it can produce JSON output, which would have worked for me, except that the output could not be parsed by other JSON libraries :) (I asked about that on Stack Overflow, but with no luck).

If my whining hasn’t put you off and you’d still like to see how to use Avro, try this unit test in the project’s GitHub repo: AvroSimpleTest.groovy

Svenson

I dropped Avro in favour of a simple JSON lib called Svenson, and that was painless. The only thing I was forced to do was create my model class in Java - the rest of the project is written in Groovy. I’ve no idea why that was necessary, and didn’t want to look into it.
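To make the conversion step a bit more concrete, here is a minimal sketch of turning one table row into a JSON message. It is written in JavaScript for brevity and is not the Groovy/Svenson code from the project. The column names (ID, LATITUDE, LONGITUDE, CREATED_AT) and the shape of the resulting document are assumptions, since the real table layout isn’t shown in this post.

```javascript
// Convert one database row into the JSON document that travels through
// the queue. All column and field names here are hypothetical.
function rowToEvent(row) {
  return {
    _id: String(row.ID),
    type: 'event',
    lat: Number(row.LATITUDE),
    lon: Number(row.LONGITUDE),
    createdAt: new Date(row.CREATED_AT).toISOString()
  };
}

// Serialize the event as a message body: one message per row.
function rowToMessage(row) {
  return JSON.stringify(rowToEvent(row));
}
```

Keeping the mapping in one small function means the rest of the pipeline only ever sees the app’s own representation, not the source table.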

RabbitMQ

Next in line is RabbitMQ, which is fed with records by middleware written in Groovy. Since I use ActiveMQ on a day-to-day basis, I decided to try something new. This broker is a really nice piece of software. Being written in Erlang makes it really fast. What’s more, it has extensive capabilities and is easy to approach for anyone familiar with messaging (JMS and friends). For such a lightweight product it is really powerful - it implements AMQP!

CouchDB

From the broker’s queue the messages are fetched by another piece of middleware, just to be put into CouchDB. This database is also written in Erlang. It’s very reliable; however, the way it handles refreshing views isn’t the most pleasant one, performance-wise.

Word of advice - if you’re on a Debian derivative, be cautious with the version in the apt repository. It’s rather _ancient_. Also remember to add allow_jsonp = true to your config file /opt/couchbase/etc/couchdb/local.ini. It’s not enabled by default, and leaving it unset results in empty responses from the CouchDB server.
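The option name alone does not say much, so here is what JSONP boils down to: the server wraps the JSON response in a call to a function named by the client, so that a script tag loaded from another host can hand the data over to the page. The sketch below only illustrates that wrapping; wrapJsonp and the showPoint callback are made-up names, not CouchDB code.

```javascript
// Wrap a JSON payload in a call to the client-chosen callback,
// which is roughly what a JSONP-enabled server sends back.
function wrapJsonp(callbackName, payload) {
  // Accept only plain identifiers so the callback name cannot smuggle in code.
  if (!/^[A-Za-z_$][A-Za-z0-9_$]*$/.test(callbackName)) {
    throw new Error('invalid callback name: ' + callbackName);
  }
  return callbackName + '(' + JSON.stringify(payload) + ');';
}
```

A request carrying callback=showPoint would then receive the document wrapped as a showPoint(...) call instead of bare JSON.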

The problem here is that the browser doesn’t allow a script to query a web server whose hostname differs from the one the script originates from. More on this case here. It seems my problem could be overcome by pointing the URL in index.html and the hostname CouchDB listens on at the same address.
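As a rough illustration of the rule the browser applies: two URLs share an origin only when scheme, host and port all match. The helper below is a sketch using the standard URL class; the example addresses are assumptions, with 5984 being CouchDB’s usual port.

```javascript
// Two URLs are same-origin only if scheme, host and port are all equal.
function sameOrigin(pageUrl, requestUrl) {
  return new URL(pageUrl).origin === new URL(requestUrl).origin;
}
```

For example, a page served from http://localhost/index.html and a database at http://localhost:5984/ count as different origins, because the ports differ.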

I’ve also created a view, that would expose an event by key: view code
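For readers who haven’t seen a CouchDB view: it is defined by a JavaScript map function that calls emit(key, value) for every document it wants to index. The sketch below is not the view from the repository; the type and key fields are hypothetical, and the local emit is only a stand-in so the function can be tried outside the database.

```javascript
// Map function in the shape CouchDB expects: index every event document
// under its key. Field names are assumptions for illustration.
function mapEventByKey(doc) {
  if (doc.type === 'event' && doc.key !== undefined) {
    emit(doc.key, doc);
  }
}

// Local stand-in for CouchDB's built-in emit, used only to run the sketch.
// Inside CouchDB you would store just the map function in a design document.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}
```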

Presenting the dots

As a back-end I wrote some jQuery-based AJAX calls - nothing too fancy. Everything the presentation layer needs is in this file.
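Such a call mostly comes down to building the right view URL. A minimal sketch, assuming a database called events, a design document geo and a view by_key (all made-up names); CouchDB expects the key parameter to be a JSON value, so strings keep their quotes.

```javascript
// Build the URL for querying a CouchDB view by key.
function buildViewUrl(baseUrl, db, designDoc, view, key) {
  const path = [db, '_design', designDoc, '_view', view]
    .map(encodeURIComponent)
    .join('/');
  // The key is JSON-encoded first, then URL-encoded.
  return baseUrl.replace(/\/+$/, '') + '/' + path +
    '?key=' + encodeURIComponent(JSON.stringify(key));
}
```

With jQuery, the resulting URL would be passed to $.ajax with dataType set to 'jsonp'.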

Things to consider

Please bear in mind that this whole application is rather a playground, not a full-fledged project!! After creating all the parts I have doubts about some of the architectural decisions I made. I don’t think security has been taken into account seriously enough. Also, scalability was never an issue ;-)

If you have some thoughts about any of the aspects mentioned in this post, please feel free to comment or contact me directly :)

You can also try the application yourself - it’s on GitHub.

Comments

Marcin

@Piotrek, here is a link to JIRA ticket concerning this feature. I think it is being discussed ATM: https://issues.apache.org/jira/browse/COUCHDB-431

Piotrek Reinmar Koszuliński

About Same Origin Policy - now there’s Cross Origin Resource Sharing available in most of common browsers. It should help You if CouchDB has support for it.

Marcin

@klausa, thanks for your advice. I’ve made some changes to the post.

klausa

>The main idea was to create an app, that would dynamically update itself with geographic data flowing in.

Not to nitpick, but that doesn’t seem like an idea for app. I think you should explain what that displayed data is here. If you moved your ‘Presenting the dots’ paragraph just above ‘Toys used’, it would be clear what do you wanted to do with this app.

>Also remember to add allow_jsonp = true to you config file /opt/couchbase/etc/couchdb/local.ini.

I think you should explain what that option *really* does.

Other than that, nice post!

5 best things to do with your Kindle

2011-08-23T00:19:14+02:00

I bought a Kindle (3rd generation, Wi-Fi only) about half a year ago. I’ve read some books and done some web browsing (awful, quite unpleasant). Gradually I became more and more curious about other things you can do with this slate-looking piece of tech. These are my thoughts and ideas.

Got a Kindle? Use it every day? Feel like modding or extending your ways of usage? Great! Read on, and share your thoughts in comments!

Readability! - This web app is great! Basically it is a simple plug-in for your browser that shows a little button on the toolbar; when you click it, the page you’re reading is transformed into a nice, sleek, content-only page. Look at the screenshot below:

The plug-in can also send articles to your Kindle account. That’s the nicest way to read those loads of RSS-sourced articles :) The only limitation is that graphics won’t be included if the resulting file would exceed the allowed size of Kindle documents - that’s 2MB AFAIR.
Install some hacks! - be that serious hacks or rather some simple software modifications:
- read all book formats with Calibre - link
- play Zork on a Kindle!!! - link
- alternative Kindle keyboard - link
- custom fonts - link
Install custom screen-savers - do this to be able to use your own images… because you’ve always wanted to have something else on screen when your Kindle is in standby mode. Of course, the original screen-savers look great, but there are only a few. Installing this hack gave me the opportunity to have a multitude of new images. Now my Kindle looks even better!
Try out Chinese Kindle software - doukan.com - As a matter of fact, I haven’t installed that software yet. It doesn’t look good enough for me, and has some minor problems. However, it is great that there is actually another option - I’m not forced to use the official firmware. And this distribution has many nice features, like PDF reflow.
Enable Chinese font support on your Kindle - damn! I’d like a simple, step-by-step tutorial on how to set up Chinese fonts on a Kindle. I’d like to put a font file on my device, fire up a Chinese book and be able to see the actual characters.
Programming for Kindle - with the official Kindle SDK - well, not quite! Unfortunately this is reserved for the Chosen Ones only. I’ve applied for the SDK but they haven’t sent me my developer key yet, and it’s been ~2 months. This is not “being supportive” or “supporting the community”.

And how do you use your Kindle? Perhaps you’re doing some serious, crazy things with it? Share your thoughts!

Comments

Hoppke

I applied for the Kindle SDK almost a year ago and nooothing, silence. Apparently I’m not cool enough to be handed this toy :)

As for books, it’s true, DRM is everywhere. But DRM in ebooks works like any other (that is, poorly - you can strip the DRM used by Empik, Amazon, etc.), so a user with a bit of determination will manage.

PS. My Kindle decided to give up the ghost sometime last week, 10 days before the warranty ran out. A friend’s Kindle died a bit (a week or two?) earlier. Amazon sends replacements without a murmur, but… I can’t shake the feeling that these little devices were designed to last a year. Or at least the first batch from the preorders; the current ones are (I hope) more durable.

pecet

Thanks for the answers :)

moher

I also bought a Kindle ~6 months ago, so I’ll chime in:

Re 1. I bought it directly from Amazon and didn’t pay VAT (AFAIR there is no customs duty on electronics from the USA).
Re 2. IMHO the browser handles JS quite well, but it is damn slow and navigation is inconvenient.
Re 3. By default only WPA2-PSK, but there is a regular wpa supplicant in there, so you can edit the config yourself and go wild.
Re 4. I read quite a lot and charge it once a month, maybe slightly more often.
Re 5. You can buy from Amazon; I buy from Amazon UK, because when I was setting up my Kindle it had lower book prices. As for DRM, it supports only its own DRM (probably azw, i.e. mobi + Amazon’s DRM); for anything else you have to strip the DRM and convert to a supported format (Calibre rulez!).

Marcin

OK, so, one by one:

1. I bought it directly from Amazon - the American one, though, because only that one ships Kindles to Poland. I think this is the cheapest possible option. Amazon pays the customs duty, you don’t have to worry about anything, everything is done for you. The whole thing cost me something like 400 zł (Kindle 3, Wi-Fi only). From what I’ve seen, it is definitely more expensive on Allegro.

2. Browsing web pages on a Kindle is only for when you really need to. I don’t like it; the display is so unresponsive that casual surfing is not feasible. If you absolutely have to check something, you will, but it’s not something you do for pleasure ;-)

3. I don’t have access to WPA2 with RADIUS. I use WPA2 with PSK - and it works flawlessly. Maybe google around a bit?

4. That’s true, it lasts a month; you just have to remember to turn Wi-Fi off, because it eats the battery even in standby.

5. In Poland you can buy books from Amazon without any problem (still, that’s over Wi-Fi; I don’t know about 3G). As for Polish stores, as long as they offer formats supported by the Kindle, there shouldn’t be a problem. Personally I buy rather few books for the Kindle - I use freely available classics, plus I have PDFs bought separately, etc. Generally you won’t be able to read any books in EPUB or similar formats. There are hacks for that, though (among others the Chinese software I wrote about).

In any case I recommend buying one, because it’s really worth it - unless you’d prefer something like an iPad (colors, easy surfing), in which case the Kindle is not for you :)

pecet

Sorry for writing in Polish, but I’m getting ready to buy a Kindle and I have a few questions; forgive me if I’m cluttering your post:

1. Where did you buy it, directly from Amazon or through a middleman on Allegro? What about customs duty and other taxes?
2. The Kindle 3 browser is supposedly WebKit-based; how is it in practice, does it handle pages well, and what about JS?
3. Does the Wi-Fi support WPA2 Enterprise encryption with a RADIUS server, or only WPA2 with PSK?
4. What about the battery? I’ve heard it lasts a month; is that true?
5. Can you buy Kindle books from Amazon in Poland? Are there any Polish stores with legal Polish books that I could later load onto the Kindle without problems, or is that impossible because of DRM?