Over time I’ve learned to gather insights, ideas and understanding from research papers. I’ve developed a technique and would like to share it. Is it really helpful? I don’t know. Perhaps it just works for me. Can’t say for sure.
The most basic paper that shows how to efficiently read research papers is “How to read a paper”. It’s great, short and succinct but pinpoints exactly the things you should do to read and comprehend such content. At least it works for me.
The whole procedure outlines reading a paper in multiple passes. It also states that reading from paper is a lot more efficient than from a screen. At least it gets all the online distractions away.
I’ve decided to go with just two passes. The first pass is just to quickly scan the paper and see if it’s actually relevant. The second pass should give me sufficient understanding of the paper to decide whether I should dwell on it longer (i.e. to implement it or use it in some project).
Ok, so, here’s what I get from each pass:
This pass should be sufficient to assign the following:
This pass should be enough to:
Hey, it’s not me “going loco”. Check this guy - Avi Bryant talking about his experience with reading research papers. The video is here. He shows his product for introducing mass edits in spreadsheets by generalizing editing with a group of algorithms. There are no details of the algorithms, but please watch this just for his great motivation and passion.
Since I started tracking my progress with reading research papers, I’ve already read quite a number of them. Here, have a look at some extremely interesting papers I found this way:
Of course this is not for everyone. Just try and decide whether you like that kind of brain muscle stretching, or not. If the things described above scare you, perhaps try something smaller. There is a delightful YouTube channel, Two Minute Papers, which goes beyond computer science and shows great scientific innovations in just two minutes. This in itself is too short to actually get all the details, but it is just enough to get you interested in a specific subject. You can later dig deeper into specific areas.
Reading research papers changes your perspective. It’s great, do this as frequently as you want, whether to ingest new ideas or to read papers that are effectively the building blocks of specific industries. They are really good and reading the classics is always in fashion!
This is information overload.
Disclaimer: Please treat this entry as a mind dump - a description of my approach at this moment. These are ways of dealing with information overflow, and also threads. If you’d like to discuss anything related, feel free to write about it in the comments section.
You can always say I’m just clicking like a madman, or suffering from ADHD or FOMO. Sure, this would be the simplest explanation.
But I’d argue that my behavior isn’t that rare. People just tend to get lost in this abundance of information. That’s why I’ve tried introducing a minimalistic approach to my digital activity.
There is actually a trend that says: keep only the most important things, have only a hundred of them, and this should suffice. You can read about it all around the internet.
In general this is more about simplifying your life, not only its physical aspects. There is a lot of advice on how to have less on your mind and let go of unimportant things. And this all really helps.
Some of those I’ve tried. And it helped. I’ve stopped blindly gathering things. I’ve thrown away, sold or swapped the ones I know I won’t look into again. I’ve somewhat learned how to at least try to be a minimalist in the physical world. I’m still struggling to set my mind onto a minimalist’s tracks.
What I’ve been lacking was following those rules in my thought processes and my digital activity. The most precise way I can describe what’s happening in my mind is an IndyCar race with new cars constantly joining without the old ones retiring. This means I get all kinds of weird ideas all the time: some small, like cleaning the fridge, some bigger, like checking out this new awesome hardware or software platform, and sometimes they’re just overwhelming but tempting. The bold ones - the “get rich young or die trying” kind.
Not doing anything isn’t helping. Suggesting at least some solutions to this data influx is the least we can do. I’ve mentioned just the ones most useful to me. If you have yours, please share them in the comments.
Who would have thought? Me switching to Mac? But it actually happened. I’ve been a long-time Linux user, so why actually make the switch now? Or ever? Have a look at the thoughts and insights of a novice Mac user.
My humble beginnings were with Red Hat 6.0 on an AMD Duron 600 MHz PC. With a dial-up connection. And setting that thing up to connect to Poland’s national provider TPSA was rather painful. After some time using Linux, going from Red Hat to Slackware, I decided things weren’t tough enough and switched to the BSDs. That was really fun! I used the Free and Open flavors for a few years, which were really great. Then I switched back to Linux. I’ve used those *nix systems for all things computer. At home and at work.
And then came the new job and I had to choose: regular Dell or MacBook? Blue pill or red pill? I chose the latter.
The switch wasn’t that painful, but mainly thanks to the superb hardware Apple offers. Having to get used to a completely new OS was quite another story. It’s not that I was shocked by it. As a long-time Linux user I already knew all the concepts. It’s the other way around - there are a lot of things missing that I have learned to require from my work tool.
What I really appreciate is the familiar userland tools - it’s BSD at the heart of it. Well, of course the kernel is some wacky Mach microkernel, but as for the userland I’m happy :) You can read more about the kernel and the system’s design history on Wikipedia (OS X, Darwin, XNU).
One of the biggest disappointments was the filesystem and the way it’s presented to users. Mac OS X comes with the HFS+ (link) filesystem, which is case insensitive by default!!! To me this seems like an abomination. Plus there are multiple shortcuts taken by Apple engineers, like its endianness: originally Macs used PowerPC chips, which are big-endian by design, but after the switch to Intel processors everything is little-endian now. AFAIK HFS+ still has to swap bytes when reading metadata.
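To make the byte-order point a bit more concrete, here is a small Python sketch. It is only an illustration and has nothing to do with the actual HFS+ code: it packs the same 32-bit number in both byte orders and shows what happens when big-endian bytes are read as if they were little-endian. The value 0x12345678 is an arbitrary example.

```python
import struct

value = 0x12345678  # arbitrary example value

big = struct.pack(">I", value)     # big-endian layout, as on PowerPC
little = struct.pack("<I", value)  # little-endian layout, as on Intel

# Reading big-endian bytes with the wrong byte order gives a different number.
misread = struct.unpack("<I", big)[0]

# Reversing the bytes first (a byte swap) recovers the original value.
swapped_back = struct.unpack("<I", big[::-1])[0]

print(big.hex(), little.hex(), hex(misread), hex(swapped_back))
```

This is the kind of swap a little-endian machine has to do whenever it reads data stored in big-endian order.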
A really neat thing I’ve found out recently is that under the hood OS X uses PF (OpenBSD’s packet filter) as its firewalling solution. I don’t know which version is in the current release or how it compares against the original implementation, but since PF has such a nice syntax and performance, it’s great to have it on board. There are numerous blog posts about setting up a decent firewall on OS X with PF, so go have a look. You can also play with PF by means of a set of apps called Murus.
Spotlight is that neat little thing that seems to index all the things and launches installed programs. But it can also serve as a calculator and …
You can also read the contents of its cache file.
Up until version 10.10.4 of OS X it was possible to have additions to Spotlight, but now this behaviour is blocked by the system. http://mac-how-to.wonderhowto.com/how-to/customize-spotlight-search-mac-os-x-yosemite-0160786/
Set up your Mac with this shell script - but be very careful and read through the file first! https://gist.github.com/brandonb927/3195465
by Flickr user blakespot
And it is quite positive that Mac OS X developers care about small but extremely important things - like building OpenSSH with LibreSSL support!
Recently I’ve experienced a huge slowdown on my Mac. IT support’s solution was to reinstall the OS. I refused: this seemed like a barbaric method, and I also wanted to use this opportunity to delve deeper into the internals of this OS.
I started by analysing system behaviour with DTrace - a probing interface originating from Solaris.
Useful introductory links are here:
All in all the switch was relatively painless, and with tools and tricks described here I feel very comfortable using this system.
This small conference had its third installment in Polin - Museum of the History of Polish Jews (http://www.polin.pl/en). Organizing a Scala conference isn’t that obvious and having that kind of attention is really great - all the foreign speakers, etc.
Here is a quick recap of presentations I’ve attended.
A keynote about typeclasses. I’d read the slides before the presentation - they are available here - so it wasn’t new to me. But I’d strongly recommend watching this talk.
Also, there is another nice talk by Paweł: Category theory is absolute general nonsense! This talk is in Polish only.
A different way of monadic composition, implemented in a separate library: eff-cats (there is also a scalaz alternative). Technically it was a bit over my head, but after experimenting with the concept at home it seems interesting.
I liked that Eric, being a Zalando employee, developed it for their internal usage. Also, the beauty of easily constructing different technological stacks with a consistent API is very tempting.
Interesting things used/mentioned were:
A deep dive into the scalac incremental compiler implementation. While the presentation itself, the way it was delivered, was very nice, I didn’t particularly enjoy the topic. It seemed more appropriate for tool developers.
This one looked promising. Joining multiple exciting technologies often seems that way. Unfortunately it was a tech preview of a Kubernetes app setup rather than a Scala presentation.
I’m a bit disappointed, because looking through the sources of the project akka-cluster-etcd there are lots of things to describe. Kubernetes not being one of them ;-)
Well, that was a clear introduction to shapeless. Till this presentation I’ve treated shapeless as only some vague thingy you use when writing libraries. Valentin proved otherwise.
He gave a clear and consistent tutorial on the basics of shapeless and its potential use in some example applications. He created a simple diff tool. Not that useful in real life application, but definitely helped make a point.
Will look into real life examples in near future.
Fun fact: The speaker had his presentation in REPLesent, an sbt plugin for presentations - and performed most of the slide-jumping-fu with just one hand, the other held in his pocket :D Looked quite hilarious.
Valentin also mentioned shapeless’ typeclass derivation - this actually looks nice! Have a look at this question.
Functional Design Patterns - enjoyable talk. With a bit of theoretical background, lots of useful use-cases. The talk was about cats and scalaz - both being very similar, and patterns they contain.
Key points being:
There is this nice video called “A purely functional approach to building large applications”. Go see it.
To conclude - the conference was really great. Both the talks and social side of the event were enjoyable. I’ve been blown away by the great architecture of the museum itself. Been there a couple times and enjoyed all of them.
A much needed change would be to add more tracks, because sometimes topics were too specific and I’d have gladly switched to some other lecture.
As for the usual conference merch, there were no bags with leaflets, pens etc. It came as a surprise to me, but I also really liked it. People in fact throw away most of the contents of such bags. Also, no t-shirt by default - great idea! If you like it - just buy one, otherwise don’t bother.
Overall - great job Softwaremill! Keep up the good work!
I’ve spent the last two days at a conference in Kraków. The topics revolved around functional programming, with all the experimental stuff, popular languages, etc. Here are the talks I’ve been to and short summaries of them:
Propositions as types - the keynote delivered by Philip Wadler - I’d seen it already on YouTube. But the bit with the Lambda t-shirt looked funny again :D
Using “Program shaping” and algorithmic skeletons to parallelise an evolutionary multi-agent system in Erlang - pretty long title, huh :) The presentation, though very academic, was also quite interesting. The whole concept isn’t particularly new, but the set of tools used for refactoring into parallel code seemed interesting.
Static analysis to identify divide-and-conquer algorithms - how to find a particular class of algorithms in existing programs? Interesting but shallow, due to time limitations.
The Mysteries of Dropbox - John Hughes of QuickCheck fame showed how to test a big system using a black-box approach. A nice presentation, with all the great insight he always brings into his talks.
Muvr - Jan Machacek - that was a fun presentation. About how to build an app that detects your exercise routine and counts how many repetitions you’ve done. There’ve been all:
Embracing change - how to introduce Clojure into your company’s technology stack seamlessly - by Artur Skowroński. Hilarious presentation about adopting new languages. It wasn’t that groundbreaking, at least for me, but the way it was presented, and the slides with all the Cthulhu references in them, were great!
Things that matter by Bruce Tate - a great, energetic talk, with all the ideas about learning new languages. Bruce is the author of two excellent books: “Seven languages in seven weeks” and “Seven more languages in seven weeks” :) He went through the languages mentioned in his books and summed up the principles associated with their inception.
I almost decided not to come to this presentation, weighing it against sleeping a little bit longer in my hotel room.
The Zen of Akka - delivered by a resident hakker at Typesafe - Konrad Malawski. A nice talk about Akka pitfalls; there were lots of recommendations and even some features of the Akka toolkit previously unknown to me. Plus I loved the Japanese elements.
Creating reactive components using ClojureScript React wrappers - by Konrad Szydlo introduced Rum as a wrapper for React, described part of the ecosystem and explained mechanics behind the technology. Fine presentation, very intense on content. Konrad had 111 slides, but managed to show all of them without being hasty.
Getting started with Frege - Lech Glowiak. Frege seems to be a Haskell on the JVM, which in itself seems like a nice thing. But I treat such languages as nice experiments rather than something useful. Especially its Java interop looks ugly. Nevertheless, the presentation was given from the point of view of a language contributor, which Lech is.
Practical demystification of CRDT - by Dmitry Ivanov and Nami Naserazad - both of them from TomTom. The guys are working on TomTom’s NaviCloud product. Their presentation was a practical guide through the world of CRDTs (link to wikipedia). They showed their failures in implementing the system, gave advice, etc. The whole thing is even uploaded to github, so everyone can check their code (http://github.com/ajantis/scala-crdt). I really enjoyed this talk due to its technical approach. There were no formal definitions, no theoretical considerations, just a clean report from the trenches.
Purely functional Web Apps - by Michał Płachta - how to write a Gitlab companion app in Haskell + Elm? Haskell for the backend and Elm for the frontend. This presentation showed the great potential of Elm for frontend development. Moderately approachable, considering it was Friday afternoon and my lack of Haskell knowledge. Still, I plan to come back to this presentation later.
As always, the things you didn’t expect to be interesting were the best ones. The highlights were the presentations by Bruce Tate, Jan Machacek and the guys from TomTom (Dmitry Ivanov and Nami Naserazad).
Thanks to the organisers for this conference. But I must say that day 2 was a lot better than the first one. The talks were better structured and presented.
One idea for the organizers - please print the schedule on the back of the conference badge. It was pretty annoying to have to take out the A4 sheets with the whole day’s schedule each time I wanted to look at it.
Recently at TouK we had a one-day hackathon. There was no main theme for it; you could just post a project idea, gather people around it and hack on that idea for a whole day - drinks and pizza included.
My main idea was to create something that would be fun to build and somehow useful to others. I figured that since Confitura was just around the corner I could make a game that would be playable at TouK’s booth at the conference venue. This idea seemed good enough to attract Rafał Nowak @RNowak3 and Marcin Jasion @marcinjasion - two TouK employees who formed the hackathon team with me.
The initial plan was to develop a simple Mario-style game, with procedurally generated levels, random collectible items and enemies. One of the ideas was to introduce Confitura Man as the main character, but due to time constraints this fell through. We decided to just choose a random available sprite for the character - hence the onion man :)
Since we wanted to have a scoreboard and unique users, we printed out QR codes. A person who wanted to play the game could pick up a QR code and show it to a camera attached to the play booth. The start page scanned the QR code and launched the game with the username read from the paper code.
The rest of the game was playable with gamepad or keyboard.
Writing a game takes a lot of time and effort. We wanted to deliver, so we decided to spend some time in the days before the hackathon just to bootstrap the technology stack of our enterprise.
We decided that the game would be written in some JavaScript-based engine, with Google Chrome as the web platform. There are a lot of HTML5 game engines - list of html5 game engines - and you could easily create a game with any of them. We decided to use Phaser IO, which handles a lot of difficult, game-related stuff on its own. So we didn’t have to worry about physics, loading and storing assets, animations, object collisions or controls input/output. Go see for yourself, it is really nice and easy to use.
The scoreboard would be a rip-off of JIRA Survivor, with stats being served from some web server app. To make things harder, the backend server was written in Clojure. With no experience in that language in the team it was a bit risky, but the tasks of the server were trivial, so if all that Clojure effort failed, it could be rewritten in something we knew.
During the whole Confitura day there were 69 unique players (69 QR codes were used), and 1237 games were played. The final score looked like this:
And the obligatory scoreboard screenshot:
The game, being created in just one day, had to have problems :) It wasn’t play tested enough, there were some rough edges. During the day we had to make a few fixes:
These were easily identified and fixed. Unfortunately there were issues that we were unable to fix while the event was on:
All in all we were pretty happy with the chosen stack. Phaser was easy to use and left us with just the fun parts of the game creation process. Finding the right graphics with appropriate licensing was rather hard. We didn’t have enough time to polish all the visual aspects of the game before Confitura.
Writing a server in Clojure was the most challenging part, with all the new syntax and new libraries. There were tasks that are trivial in Java/Scala but hard in Clojure - at least for wimpy beginners :) Nevertheless, Clojure seems like a really handy tool and I’d like to dive deeper into its ecosystem.
All of the sources for the game can be found here TouK/confitura-man.
The repository is split into two parts:
To run the server you need to have a local MongoDB installation. Then in the server’s directory run:
This will start a server on http://localhost:3000
To run the game you need to install dependencies with bower and then run
from game’s directory.
To launch the QR reading part of the game, open http://localhost:9000/start.html. After scanning the code you’ll be redirected to http://localhost:9000/index.html - and the game starts.
Summing up, it was a great experience creating the game. It was fun to watch people playing the game. And even with all those glitches and stupid graphics, there were people vigorously playing it, which was awesome.
Performing a scan on part of a table or on the whole table is a completely different thing. First of all, it is sequential, which means it is rather slow, because it doesn’t use all the RegionServers at the same time. It is implemented that way to fulfil the contract of the Scan command, which has to return results sorted by key.
So, how to do this efficiently?
The usual way of getting data from HBase is with the help of its API, mainly Scan objects. To accomplish the task I’ll use just them. I’ll specify startRow and stopRow, so that each Scan request will be looking through only part of the key space.
It is crucial to note that this whole solution works because of the key sorting properties of HBase. HBase scans a table in ascending key order. Since keys are of String type, a key with value “1” is smaller than “2”, because they are sorted lexicographically. So a key with value “12345” is also smaller than “2”. I’ve leveraged this property and partitioned my whole key space according to the first character of the key. In my case keys contain only digits. So I have 10 ranges:
The speedup comes from the fact that each range resides in its own partition. That’s right, I’ve presplit the table to have 10 partitions. This corresponds rather nicely with my cluster’s setup, because I have more than 10 RegionServers. So every partition should be on a different RegionServer. This allows the code to do the requested scan operations in parallel - giving me this exact performance boost.
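The partitioning by first character can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual implementation: the helper name prefix_ranges is made up, and None stands in for "no stop row", i.e. scan to the end of the table. Each pair is a start row (inclusive) and a stop row (exclusive), which is how HBase Scan ranges behave.

```python
def prefix_ranges(prefixes="0123456789"):
    """Return one (start_row, stop_row) pair per leading key character."""
    ordered = sorted(prefixes)
    ranges = []
    for i, start in enumerate(ordered):
        # The stop row is exclusive; the last range has no upper bound.
        stop = ordered[i + 1] if i + 1 < len(ordered) else None
        ranges.append((start, stop))
    return ranges


# String keys compare character by character, not numerically.
print(sorted(["2", "12345", "1"]))
print(prefix_ranges()[:3])
```

Every key starting with "1", including "12345", falls into the ("1", "2") range, which is exactly why the first character is enough to pick a partition.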
How I’ve created the input table:
$ create 'tariff_changes', { NAME => 'cf', SPLITS_FILE => 'splits.txt', VERSIONS => 50, MAX_FILESIZE => 1073741824 }
$ alter 'tariff_changes', { NAME => 'cf', SPLITS_FILE => 'splits.txt', VERSIONS => 50, MAX_FILESIZE => 1073741824 }
The split file is just something along these lines:
1
2
3
4
5
6
7
8
9
0
This tells HBase which row keys start and end each of the partitions.
Ok, so after this rather lengthy introduction, what does the actual code do? It just spins off a few threads - one for each partition - and runs a Scan request tailored to that partition’s key space. This way, I got a 10x speedup for this particular scan. The execution time went down from 30 minutes to 3 for my use case.
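Here is a toy Python sketch of that thread-per-partition idea. It does not talk to HBase at all: the table is simulated with a sorted in-memory list, and scan, parallel_scan, ROWS and RANGES are names invented for this illustration. The real thing uses the HBase Java API, so treat this purely as a picture of the control flow.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a table: row keys kept in lexicographic order.
ROWS = sorted(["10", "12345", "2", "25", "31", "7", "99"])


def scan(start_row, stop_row):
    """Return keys in [start_row, stop_row); stop_row=None means 'to the end'."""
    lo = bisect_left(ROWS, start_row)
    hi = len(ROWS) if stop_row is None else bisect_left(ROWS, stop_row)
    return ROWS[lo:hi]


def parallel_scan(ranges):
    """Run one scan per range on its own thread, then merge in range order."""
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        chunks = pool.map(lambda r: scan(*r), ranges)
        return [key for chunk in chunks for key in chunk]


DIGITS = "0123456789"
# One (start, stop) pair per leading digit; the last range is open-ended.
RANGES = [(d, DIGITS[i + 1] if i < 9 else None) for i, d in enumerate(DIGITS)]

result = parallel_scan(RANGES)
print(result)
```

Because the ranges are processed in key order and each partial result is already sorted, concatenating them gives the same ordering as one sequential scan.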
I’ve created an example implementation of this idea. You can find it on GitHub: https://github.com/zygm0nt/hbase-distributed-search.
Any ideas on how to speed things up even more with HBase?
While I was working on a project of mine, I thought: why not ease those pains and fetch real objects from HBase?
And that’s how this simplistic, hackish ORM came to life. It is no match for projects like Kundera (a JPA compliant solution), or n-orm. Nevertheless, it just suits my needs :)
Project sources are hosted on GitHub: https://github.com/zygm0nt/hbase-annotations
To make use of this, you need to have an entity class with annotations:
Annotations should be on setter methods.
Now you have your model annotated and ready to be fetched from HBase.
The actual work is done with a service class, that should extend class BaseHadoopInteraction just as class SimpleHBaseClient does.
Then it is possible to just call:
Note that there are more methods you can use on BaseHadoopInteraction. You can also do:
What you won’t get from this simple ORM is:
Hope you’ll find this piece of code useful. If you see room for improvements while staying in project’s scope - please drop me a message.
And if you are searching for a full-fledged ORM solution for interacting with HBase, just head straight to Kundera project website :)
nimbus will work in HA mode - There’s a pull request open for it already… but some recent work (distributing topology files via Bittorrent) will greatly simplify the implementation. Once the Bittorrent work is done we’ll look at reworking the HA pull request. (storm’s pull request)
pig on storm - Pig on Trident would be a cool and welcome project. Join and groupBy have very clear semantics there, as those concepts exist directly in Trident. The extensions needed to Pig are the concept of incremental, persistent state across batches (mirroring those concepts in Trident). You can read a complete proposal.
implementing topologies in pure python with petrel looks like this:
class Bolt(storm.BasicBolt):
    def initialize(self, conf, context):
        ''' This method executed only once '''
        storm.log('initializing bolt')

    def process(self, tup):
        ''' This method executed every time a new tuple arrived '''
        msg = tup.values[0]
        storm.log('Got tuple %s' % msg)

if __name__ == "__main__":
    Bolt().run()
Fliptop is happy with storm - see their presentation here
topology metrics in 0.9.0: The new metrics feature allows you to collect arbitrarily custom metrics over fixed windows. Those metrics are exported to a metrics stream that you can consume by implementing IMetricsConsumer and configure with Config.java#L473. Use TopologyContext#registerMetric to register new metrics.
storm vs flume - some users’ point of view: I use Storm and Flume and find that they are better at different things - it really depends on your use case as to which one is better suited. First and foremost, they were originally designed to do different things: Flume is a reliable service for collecting, aggregating, and moving large amounts of data from source to destination (e.g. log data from many web servers to HDFS). Storm is more for real-time computation (e.g. streaming analytics) where you analyse data in flight and don’t necessarily land it anywhere. Having said that, Storm is also fault-tolerant and can write to external data stores (e.g. HBase) and you can do real-time computation in Flume (using interceptors)
That’s all for this day - however, I’ll keep on reading through storm-users, so watch this space for more info on storm development.
An application developed for one of my recent projects at TouK involved multiple servers. There was a requirement to ensure failover for the system’s components. Since I already had a few separate components, I didn’t want to add more, and since there was already a Zookeeper ensemble running - required by one of the services - I decided to go that way with my solution.
Just a crude distributed synchronization framework. However, it implements Paxos-style algorithms (http://en.wikipedia.org/wiki/Paxos_(computer_science)) to ensure no split-brain scenarios occur. This is quite an important feature, since I don’t have to care about those kinds of problems while using this app. You just need to create an ensemble of a couple of its instances to ensure high availability. It is basically a virtual filesystem, with files, directories and stuff. One could ask: why another filesystem? Well, this one is rather special, especially for distributed systems. The reason why creating all the locking algorithms on top of Zookeeper is easy is its Ephemeral Nodes - which are just files that exist as long as the connection that created them exists. After it disconnects, such a file disappears.
With such paradigms in place it’s fairly easy to create some high level algorithms for synchronization.
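To show why ephemeral nodes make locking simple, here is a tiny in-memory toy written in Python. It is emphatically not the Zookeeper API: the class ToyZooKeeper, its methods and the paths /config and /locks/job are all invented for this sketch. The only behaviour it models is the one described above: a node created by a session disappears when that session ends.

```python
class ToyZooKeeper:
    """In-memory toy model of a node tree; this is NOT the ZooKeeper API."""

    def __init__(self):
        self.nodes = {}  # path -> owning session id (None for persistent nodes)

    def create(self, path, session=None):
        """Create a node; return False if it already exists."""
        if path in self.nodes:
            return False
        self.nodes[path] = session
        return True

    def close_session(self, session):
        """Drop every ephemeral node owned by the closed session."""
        self.nodes = {p: s for p, s in self.nodes.items() if s != session}


zk = ToyZooKeeper()
zk.create("/config")                                # persistent node
got_a = zk.create("/locks/job", session="a")        # A takes the lock
got_b = zk.create("/locks/job", session="b")        # B is refused
zk.close_session("a")                               # A crashes or disconnects
got_b_retry = zk.create("/locks/job", session="b")  # now B succeeds
print(got_a, got_b, got_b_retry)
```

The point is that nobody has to clean up after a crashed lock holder: the lock file goes away together with the connection.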
Having that in place, it can safely integrate multiple services ensuring loose coupling in a distributed way.
With all the base services for Zookeeper started, it seems there is nothing left to do but connect to it and start implementing the necessary algorithms. Unfortunately, the API is quite basic and offers file and directory abstractions with the addition of different node types (file types) - ephemeral and sequence. It is also possible to watch a node for changes.
Creating connections is tedious - and there are lots of things to take care of. Handling an established connection is hard - when establishing a connection to an ensemble, it’s also necessary to negotiate a session. During the whole process a number of exceptions can occur - these are “recoverable” exceptions, which can be handled gracefully without breaking the connection.
So, Zookeeper API is hard.
Even if one is proficient with that API, then there come recipes. The reason for using Zookeeper is to be able to implement some more sophisticated algorithms on top of it. Unfortunately those aren’t trivial and it is again quite hard to implement them without bugs.
And since distributed systems are hard, why would anyone want another difficult to handle tool?
Happily, guys from Netflix implemented a nice abstraction for dealing with Zookeeper internals. They called it Curator and use it extensively in the company’s environment. Curator offers consistent API for Zookeeper’s functionality. It even implements a couple of recipes for distributed systems.
The basic use of Zookeeper is as a distributed configuration repository. For this scenario I only need read/write capabilities, to be able to write and read files from the Zookeeper filesystem. This code snippet writes a sample json to a file on ZK filesystem.
// make sure the parent path exists before touching the status file
EnsurePath ensurePath = new EnsurePath(markerPath);
ensurePath.ensure(client.getZookeeperClient());
String json = "...";
// update the node if it already exists, otherwise create it
if (client.checkExists().forPath(statusFile(core)) != null)
    client.setData().forPath(statusFile(core), json.getBytes());
else
    client.create().forPath(statusFile(core), json.getBytes());
With multiple systems there may be a need for an exclusive lock on some resource, or perhaps some big system requires its components to synchronize based on locks. This “recipe” is an ideal match for those situations.
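InterProcessSemaphoreMutex lock = new InterProcessSemaphoreMutex(client, lockPath);
// acquire() with a timeout returns false if the lock was not obtained in time
if (lock.acquire(5, TimeUnit.MINUTES)) {
    try {
        … do sth …
    } finally {
        lock.release(); // always release, even if the work throws
    }
}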
This is quite an interesting use case. With many small services on different servers it is not wise to exchange IP addresses and ports between them. When some of those services may go down while others try to replace them, the task gets even harder.
That’s why, with Zookeeper in place, it can be utilised as a registry of existing services.
When a service starts, it registers itself in the ServiceRegistry, offering basic information like its purpose, role, address, and port.
Services that want to use a specific kind of service request access to some instance. This way of configuring easily decouples services from their configuration.
Basically this scenario needs two steps:
1. Service starts and registers its presence (https://github.com/zygm0nt/curator-playground/blob/master/src/main/java/pl/touk/curator/WorkerAdvertiser.java#L44):
ServiceDiscovery discovery = getDiscovery();
discovery.start();
ServiceInstance si = getInstance();
log.info(si);
discovery.registerService(si);
2. Another service - on another host or in another JVM on the same machine tries to discover who is implementing the service (https://github.com/zygm0nt/curator-playground/blob/master/src/main/java/pl/touk/curator/WorkerFinder.java#L50):
instances = discovery.queryForInstances(serviceName);
The whole concept here is ridiculously simple - the service advertising its presence just stores a file with its whereabouts. The service that is looking for service providers just looks into a specific directory and reads the stored definitions.
In my example, the structure advertised by services looks like this (+ some getters and constructor - the rest is here: https://github.com/zygm0nt/curator-playground/blob/master/src/main/java/pl/touk/model/WorkerMetadata.java):
public final class WorkerMetadata {
private final UUID workerId;
private final String listenAddress;
private final int listenPort;
}
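The "store a file, then read the directory" idea is simple enough to mimic with a plain dictionary. The Python sketch below is a toy and does not use Curator or Zookeeper: the registry dict, the /services base path and the two function names are assumptions made for illustration. Only the three metadata fields (workerId, listenAddress, listenPort) come from the structure above.

```python
import json
import uuid

registry = {}  # path -> JSON payload; stand-in for the Zookeeper tree


def register(service_name, listen_address, listen_port):
    """Advertise a worker by storing its whereabouts under the service's path."""
    worker_id = str(uuid.uuid4())
    payload = {
        "workerId": worker_id,
        "listenAddress": listen_address,
        "listenPort": listen_port,
    }
    registry[f"/services/{service_name}/{worker_id}"] = json.dumps(payload)
    return worker_id


def query_for_instances(service_name):
    """Look into the service's 'directory' and read every stored definition."""
    prefix = f"/services/{service_name}/"
    return [json.loads(v) for p, v in registry.items() if p.startswith(prefix)]


register("worker", "10.0.0.1", 8080)
register("worker", "10.0.0.2", 8081)
instances = query_for_instances("worker")
print(len(instances), sorted(i["listenPort"] for i in instances))
```

In the real setup the entries would be ephemeral nodes, so a worker that dies drops out of the query results on its own.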
The above recipes are available in Curator library (http://curator.incubator.apache.org/). Recipes’ usage examples are in my github repo at https://github.com/zygm0nt/curator-playground
If you’re in need of a reliable platform for exchanging data and managing synchronization, and you need to do it in a distributed fashion - just choose Zookeeper. Then add Curator for the ease of using it. Enjoy!
I’ve decided to put this selection here for quick reference.
too many connections
WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@247] - Too many connections from /xx.x.xx.xxx - max is 60
Running out of ZK connections? Set maxClientCnxns=200 in zoo.cfg.
HBase client leaking connections?
connection closes prematurely
ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately.
In hbase-site.xml, set hbase.zookeeper.recoverable.waittime=30000ms
pig hangs connecting to HBase
WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused!
CAUSE: location of ZK quorum is not known to Pig
Set hbase.zookeeper.quorum to final in hbase-site.xml
hbase.zookeeper.quorum=hadoophbasemaster.lan:2181 in pig.properties
client session timed out
INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session <id>, timeout of 40000ms exceeded
zoo.cfg: maxSessionTimeout=180000
hbase-site.xml: zookeeper.session.timeout=180000
clients lose connections
WARN org.apache.zookeeper.ClientCnxn - Session <id> for server <name>, unexpected error, closing socket connection and attempting reconnect java.io.IOException: Broken pipe
unable to load database - unable to run quorum server
FATAL Unable to load database on disk ! java.io.IOException: Failed to process transaction type: 2 error: KeeperErrorCode = NoNode for <file> at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:152)!
/var/zookeeper/version-2 if other two ZK servers are running
unable to load database - unreasonable length exception
FATAL Unable to load database on disk java.io.IOException: Unreasonable length = 1048583 at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:100)
jute.maxbuffer
"Packet len <xx> is out of range" in the client log
JVMFLAGS="-Djute.maxbuffer=yy" bin/zkCli.sh
failure to follow leader
WARN org.apache.zookeeper.server.quorum.Learner: Exception when following the leader java.net.SocketTimeoutException: Read timed out
CAUSE:
SOLVE:
DOs
DON’Ts
You may use Zookeeper as an observer - a non-voting member:
in zoo.cfg
peerType=observer
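For context, here is roughly how the observer setting fits into a complete configuration. This is only a sketch: the hostnames and server ids are made up, and as far as I know the observer also has to be marked with the :observer suffix in the server list of every ensemble member.

```
# zoo.cfg on the observer itself
peerType=observer

# server list in zoo.cfg on every server, observer included
# (zk1..zk4.example.com are placeholder hostnames)
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
server.4=zk4.example.com:2888:3888:observer
```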
And what will I be talking about? A copy-paste from the WHUG page:
Marcin will focus on how the Hadoop ecosystem works together with other tools. He will show how to process graphs simply and conveniently, and how to apply the Big Data approach in real time. He will also touch on easier ways of writing Map-Reduce algorithms.
It will be a somewhat less technical (but still practical) tour around the fringes of the topics usually discussed together with Hadoop.
The presentation will cover tools such as Cascading, Storm, and Titan.
You’re welcome to join!
Read on to find out how to use it.
The most important weakness of previous Hadoop releases was the single point of failure, which happened to be the NameNode. The NameNode, as a key component of every Hadoop cluster, is responsible for managing filesystem namespace information and block locations. Losing its data results in losing all the data stored on the DataNodes: HDFS is no longer able to reach specific files or their blocks. This renders your cluster inoperable.
So it is crucial to be able to detect and counter problems with the NameNode. The most desirable behavior is to have a hot backup that would ensure no-downtime cluster operation. To achieve this, the second NameNode needs to have up-to-date information on filesystem metadata, and it also needs to be up and running. Starting a NameNode with an existing set of data may easily take many minutes, because it has to parse the actual filesystem state.
The previously used solution - deploying a SecondaryNameNode - was somewhat flawed. It took a long time to recover after a failure. It was not a hot-backup solution, which also added to the problem. Some other solution was required.
So two things needed to be made redundant: the contents of the edits dir, and the block location maps, which in an HA deployment each DataNode sends to both NameNodes. This was accomplished in two steps. The first came with the release of CDH 4 beta - a solution based on sharing the edits directory. Then, with CDH 4.1, came a quorum-based solution.
Find out how to configure those on your cluster.
This kind of setup assumes that a shared storage directory exists in the cluster. It should be deployed using some kind of network-based filesystem. You could try NFS or GlusterFS.
This setup is quite OK, as long as you’re comfortable with maintaining a separate service (network storage) for handling the HA state. It seems error prone to me, because it adds another service whose high availability has to be ensured. NFS seems to be a bad choice here, because AFAIK it does not offer HA out of the box.
On the other hand, we have GlusterFS, a distributed filesystem that you can deploy on multiple bricks, increasing the replication level.
Nevertheless, it still brings the additional burden of another service to maintain.
With the release of CDH 4.1.0 we are now able to use a much better integrated solution called JournalNode. Now all the updates are synchronized through JournalNodes. Each JournalNode has the same data, and all the NameNodes are able to receive filesystem state updates from those daemons.
This solution is much more consistent with Hadoop ecosystem.
Please note that the config is almost identical to the one needed for the shared edits directory solution. The only difference is the value of dfs.namenode.shared.edits.dir, which now points to all the journal nodes deployed in our cluster.
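To make the difference concrete, here is a sketch of how that single property might look in hdfs-site.xml for each variant. The mount path, the hostnames and the nameservice id (mycluster) are placeholders of mine; 8485 is, to my knowledge, the default JournalNode port.

```xml
<!-- hdfs-site.xml, shared edits directory variant (placeholder mount path) -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
</property>

<!-- hdfs-site.xml, quorum journal variant (placeholder hosts and nameservice id) -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
```

Only one of the two definitions should be present in a given cluster's configuration.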
In both cases you need to run the ZooKeeper-based Failover Controller (hadoop-hdfs-zkfc). This daemon negotiates which NameNode should become active and which standby.
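The failover controller needs to know that automatic failover is wanted and where ZooKeeper lives. A minimal sketch of the two relevant properties, assuming a three-node ZooKeeper ensemble on placeholder hostnames:

```xml
<!-- hdfs-site.xml: turn on automatic failover -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml: ZooKeeper ensemble used by the failover controllers
     (zk1..zk3.example.com are placeholder hostnames) -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```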
But that’s not all. Depending on the way you’ve chosen to deploy HA, you need to do some other things:
With a shared edits dir you need to deploy a networked filesystem and mount it on your NameNodes. After that you can run your cluster and be happy with your new HA.
For QJournal to operate you need to install one new package called hadoop-hdfs-journalnode. This provides startup scripts for the JournalNode daemons. Choose at least three nodes that will be responsible for handling the edits state and deploy journal nodes on them.
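Why at least three? Edits have to be acknowledged by a majority of the journal nodes, so the number of failures you can survive follows from simple majority arithmetic. A tiny sketch (QuorumMath is just an illustrative helper of mine, not part of Hadoop):

```java
// Majority-quorum arithmetic for an ensemble of N journal nodes.
final class QuorumMath {

    // Smallest number of nodes that forms a strict majority.
    static int majority(int nodes) {
        if (nodes < 1) {
            throw new IllegalArgumentException("nodes must be positive");
        }
        return nodes / 2 + 1;
    }

    // How many nodes can fail while a majority is still reachable.
    static int toleratedFailures(int nodes) {
        return nodes - majority(nodes);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " nodes: majority " + majority(n)
                    + ", tolerated failures " + toleratedFailures(n));
        }
    }
}
```

With one or two journal nodes no failure is tolerated at all; three is the smallest ensemble that survives the loss of a single node, and an even number of nodes buys nothing over the odd number just below it.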
Thanks to the guys from Cloudera we can now use enterprise-grade High Availability features for Hadoop. Eliminating the single point of failure in your cluster is essential for easy maintainability of your infrastructure.
Given the above choices, I’d suggest using the QJournal setup, because of its relatively small impact on the overall cluster architecture. Its good performance and fairly simple setup enable users to easily start using Hadoop in an HA setup.
Are you using Hadoop with HA? What are your impressions?
For the sake of completeness, let me just describe it as a prototyping platform with an ARM processor. It is really similar in concept to Arduino, except that it doesn’t have as many extensions available (none? or very few; I’ve only found some on the Adafruit pages).
So, here is the obligatory picture.
It can run a Linux distribution, so anyone familiar with that can have a go with this low-powered computer.
The board itself has been on the market for quite some time now. That’s why there are lots of interesting resources and projects that you can do with that stuff.
Here are just a bunch of them:
Do you also own an R-Pi? Share what you plan to do with it.
Hadoop’s usage as a big data processing framework has been gaining a lot of attention lately. Now it is not only the big players who see that they can embrace the data their sites or products are generating and develop their businesses on it. For that to happen two things are needed: the data itself and the means of processing really big amounts of it.
Gathering data is relatively easy. The data need not be structured, and you don’t need to plan its usage up front. Just start collecting it, and then you may experiment with its potential uses. If it turns out to be useless rubbish, deleting it won’t be hard. But imagine the value it may contribute to your business:
There are many companies that utilize the Hadoop ecosystem for their own needs. You can read about some of them here: http://wiki.apache.org/hadoop/PoweredBy But since that page lacks insight into specific applications of Hadoop, I’ve tried to delve into the details of how Hadoop helped tame some companies’ big data sets.
Being a widely used social network provider, they require no introduction. However, if you’ve lived under a rock for the last couple of years, just visit their website http://facebook.com
Their main usage is data warehousing. Since they need fast and reliable access to the data, they had a need for real-time querying of their huge and always growing data set. The switch from MySQL databases was forced by the increasing workloads they experienced with standard databases. What they got “out of the box” with Hadoop was all the benefits of a distributed file system (HDFS features). They expanded the ideas behind that even further and implemented a truly Highly Available file system without a Single Point of Failure.
Facebook has 3 interesting usage scenarios in which Hadoop plays a major role:
This popular micro-blogging platform, where you can register an account and follow the micro-messages of friends and celebrities, does some pretty interesting things with its Hadoop cluster.
One of their motivations is to speed up their web page’s functionality. That is why they compute users’ friendships in Twitter’s social graph with Hadoop. Using connections between users, they calculate their relationships to each other and estimate groups of users.
Since this service’s users generate lots of content, the company conducts research based on natural language processing. They probe what can be told about a user from their tweets. They use tweets’ contents for advertising, trend analysis and much more.
From tweets and users’ behaviour they characterise usage scenarios. They also gather usage statistics, like the number of searches per day or the number of tweets. Based on this seemingly irrelevant data they run comparisons of different types of users. Twitter analyzes data to determine whether mobile users, users of third-party clients, or power users use Twitter differently from average users. Of course these seem like really specific applications, but nevertheless they are very original and are based on data that Twitter has been gathering for some time now.
Being the biggest auction site on the Internet, eBay uses Hadoop processing to increase search relevance based on click-stream data and user data. This seems pretty obvious, considering their area of operation.
However, they also have one other interesting thing: they try hard to automatically fill in auctioned objects’ metadata, based on the descriptions and other data provided by users. They employ a data mining approach for this task, and judging from their constant growth it seems to work.
A social network for professionals, though a lot smaller than Facebook. Based on click-streams they discover relations between users. All the data concerning the latest visits to your profile, or people you may know from other places, comes from Hadoop-based analysis of the clicks people make all the time on their site.
Also, a very neat feature called InMaps (http://inmaps.linkedinlabs.com/) analyses declared schools and companies and generates data for a graph of your clustered friends.
This on-line radio site, praised by many for its invaluable recommendation system, seems like a rather small and simple service. But behind the facade of a simple web page lots of data is being processed, so that the service can reach a certain level of perfection.
This large volume of data comes from scrobbles. Each user of the service who listens to a song generates a note about this fact, called a scrobble. Based on that and on user profiles they calculate global band popularity charts, maps of bands’ popularity, and many more usage statistics and timeline charts.
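To give a feel for the kind of aggregation involved, here is a toy, single-machine sketch of turning scrobbles into a popularity chart. It is not how the service does it (they run this at Hadoop scale); the class and field names are invented for illustration, and the grouping step is merely the same group-by-key-and-count shape that a Map-Reduce job would have.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy popularity chart built from scrobbles, on a single machine.
final class ScrobbleChart {

    // One "user listened to a song by this artist" event (hypothetical shape).
    static final class Scrobble {
        final String user;
        final String artist;

        Scrobble(String user, String artist) {
            this.user = user;
            this.artist = artist;
        }
    }

    // Group by artist and count plays, like a map step keyed by artist
    // followed by a counting reduce step.
    static Map<String, Long> playsPerArtist(List<Scrobble> scrobbles) {
        return scrobbles.stream()
                .collect(Collectors.groupingBy(s -> s.artist, Collectors.counting()));
    }

    // Artists ordered by play count, most played first; ties broken by name.
    static List<String> topArtists(List<Scrobble> scrobbles, int limit) {
        return playsPerArtist(scrobbles).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```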
They just try to detect and trace new patterns in seemingly chaotic data sets. Perhaps you could also do the same? Analyze your data and expand your business value?
Suppose you want to add some additional jars to your SoapUI installation. It should all work OK if you put them in the bin/ext directory. It is scanned at startup, and jars found there are automatically added to the classpath.
However, if you want to add some JDBC drivers and happen to be using a SoapUI version higher than 3.5.1, it is a bit more tricky.
You may face this NoClassDefFoundError:
An error occured [oracle/jdbc/Driver], see error log for details java.lang.NoClassDefFoundError: oracle/jdbc/Driver
If so, try registering your drivers with registerJdbcDriver function, like I did in this snippet of code:
What a crappy thing!
Once upon a time I set out on a journey to discover the NoSQL land. I decided that doing simple queries wouldn’t be interesting enough. That’s why I chose to create an app based on some NoSQL database.
The main idea was to create an app that would dynamically update itself with geographic data flowing in. Since there are myriads of geo-data sets available on the internet, you can pick your favorite one and load it into your SQL database of choice.
In my case the primary source of data was a proprietary database, or more specifically - one table in it continuously updated with new data. To make that data visible on my map I needed to:
The idea of the front-end HTML page was to show new points on the map. From the moment the page is opened, records that appear in the database table should be shown interactively on the screen.
For the first step I chose the RabbitMQ broker. A queue on the broker would receive messages - one message per database table row. Then I’d use some simple Groovy middleware to convert the data to the appropriate format and put it into another database, this time one specific to my app.
You may ask why incorporate another database. It is good for separating environments: the original data may contain sensitive content that should be anonymised, or we may just not feel comfortable exposing the whole database of some XYZ-system just to get access to one of its tables.
Since for my presentation layer I chose HTML+JS without any application-server back-end, I decided on CouchDB. This seemed like a perfect match for this scenario. Why? Ease of use and a REST API with JSON responses - just great for interacting with my simple front-end.
The flow of things was as shown on the image below:
As you can see, I’ve chosen JSON as my data format. I had been considering Apache Avro at first, but using it was a real pain in the ass. Avro itself is used in Apache Hadoop as a serialization layer, so it would seem OK, but it has virtually no documentation. Still, once you tear through the unintuitive interface and manage to handle all those unthinkable exceptions, the library has a few pros. It’s great in that it does not require code generation - I like that everything is done on the fly. It also offers sending data in binary format, which was not necessary, but is nevertheless a nice feature.
What I certainly didn’t like about it was its orientation towards files rather than chunks of data, so it was not obvious how I should send data over the wire.
Then I found out it can produce JSON output, which would work for me, except that the output could not be parsed by other JSON libraries :) (I asked about that on Stack Overflow, but with no luck).
If my whining hasn’t put you off and you would still like to see how to use Avro, try this unit test in the project’s GitHub repo: AvroSimpleTest.groovy
I dropped Avro in favour of a simple JSON lib called Svenson, and that was painless. The only thing I was forced to do was create my model class in Java - the rest of the project is written in Groovy. I’ve no idea why that was necessary, and didn’t want to look into it.
Further along the way is RabbitMQ, which is fed with records by middleware written in Groovy. Since I use ActiveMQ on a day-to-day basis, I decided to try something new. This broker is a really nice piece of software. Being written in Erlang makes it really fast. What’s more, it has some extensive capabilities and is easy to approach for anyone familiar with messaging (JMS and friends). For such a lightweight product it is really powerful - it implements AMQP!
From the broker’s queue, messages are again fetched by middleware, just to be put into CouchDB. This database is also written in Erlang. It’s very reliable; however, the way it handles refreshing views isn’t the most pleasant one, performance-wise.
A word of advice: if you’re on a Debian derivative, be cautious with the version in the apt repository. It’s rather _ancient_. Also remember to add allow_jsonp = true to your config file /opt/couchbase/etc/couchdb/local.ini. It’s not enabled by default, and without it you get empty responses from the CouchDB server.
The problem here is that the browser doesn’t allow querying a web server with a hostname other than the one the script originates from. More on this case here. It seems my problem could be overcome by changing the URL in index.html and the hostname CouchDB listens on to the same address.
I’ve also created a view, that would expose an event by key: view code
For talking to the back-end I’ve used some jQuery-based AJAX calls - nothing too fancy. Everything necessary for the presentation layer is in this file.
Please bear in mind that this whole application is rather a playground, not a full-fledged project!! After creating all the parts I have some doubts about some of the architectural decisions I made. I don’t think security has been taken into account seriously enough. Also, scalability was never an issue ;-)
If you have some thoughts about any of the aspects mentioned in this post, please feel free to comment or contact me directly :)
And you may also try the application yourself - it’s on GitHub.
@Piotrek, here is a link to JIRA ticket concerning this feature. I think it is being discussed ATM: https://issues.apache.org/jira/browse/COUCHDB-431
About the Same Origin Policy - there’s now Cross-Origin Resource Sharing available in most common browsers. It should help you if CouchDB has support for it.
@klausa, thanks for your advice. I’ve made some changes to the post.
>The main idea was to create an app, that would dynamically update itself with geographic data flowing in.
Not to nitpick, but that doesn’t seem like an idea for an app. I think you should explain what the displayed data is here. If you moved your ‘Presenting the dots’ paragraph just above ‘Toys used’, it would be clear what you wanted to do with this app.
>Also remember to add allow_jsonp = true to you config file /opt/couchbase/etc/couchdb/local.ini.
I think you should explain what that option *really* does.
Other than that, nice post!
I bought a Kindle (3rd generation, Wi-Fi only) some time ago - about half a year ago. I read some books and did some web browsing (awful, quite unpleasant). Gradually I became more and more curious about other things that could be achieved with this slate-looking piece of tech. These are my thoughts and ideas.
Got a Kindle? Use it every day? Feel like modding or extending your ways of usage? Great! Read on, and share your thoughts in comments!
And how do you use your Kindle? Perhaps you’re doing some serious, crazy things with it? Share your thoughts!
I applied for the Kindle SDK almost a year ago and nooothing, silence. Apparently I’m not cool enough to be handed this toy :)
As for books, it’s true, DRM is everywhere. But DRM on ebooks works like any other (that is, poorly: the DRM from Empik, Amazon, etc. can be removed), so a user with a bit of determination will manage.
PS. My Kindle decided to give up the ghost sometime last week, 10 days before the warranty expired. A friend’s Kindle died a bit (a week or two?) earlier. Amazon sends replacements without a murmur, but... I can’t shake the feeling that these devices were designed for a one-year lifespan. Or at least the first batch from the preorders; the current ones are (I hope) more durable.
Thanks for the answers :)
I also bought a Kindle about 6 months ago, so I’ll chime in:
Re 1. I bought directly from Amazon and did not pay VAT (as far as I remember, there is no customs duty on electronics from the USA).
Re 2. IMHO the browser copes with JS quite well, but it is damn slow and navigation is awkward.
Re 3. By default only WPA2-PSK, but there is a regular wpa supplicant in there, so you can edit the config and go wild.
Re 4. I read quite a lot and charge it once a month, maybe slightly more often.
Re 5. You can buy from Amazon; I buy from Amazon UK, because when I was setting up the Kindle it had lower book prices. As for DRM, it only supports its own DRM (probably azw, i.e. mobi + Amazon’s DRM); for others you have to strip the DRM and convert to a supported format (Calibre rulez!).
OK, so, one by one:
1. I bought directly from Amazon, the American one, because only that one ships Kindles to Poland. I think it’s the cheapest possible option. Amazon pays the customs duty, you don’t worry about anything, everything is done for you. The whole thing cost me around 400 zł (Kindle 3 Wi-Fi only). From what I’ve seen, it’s definitely more expensive on Allegro.
2. Browsing websites on the Kindle is only for when you really need to. I don’t like it; the display is so unresponsive that casual web surfing is not feasible. If you absolutely have to check something, you can, but you don’t do it this way for pleasure ;-)
3. I don’t have access to WPA2 with RADIUS. I use WPA2 with PSK and it works flawlessly. Maybe google around?
4. That’s true, it lasts a month, but you have to remember to turn Wi-Fi off, because it eats the battery even in standby.
5. In Poland you can buy books from Amazon without problems (still over Wi-Fi; I don’t know about 3G). As for Polish stores, as long as they offer formats supported by the Kindle, there should be no problem. Personally I buy rather few books for the Kindle: I use publicly available classics and have separately purchased PDFs etc. Generally you won’t read any books in EPUB or similar formats. There are hacks for that, though (among others the Chinese software I wrote about).
In any case I recommend buying one, because it’s really worth it, unless you’d prefer something like an iPad (colors, easy surfing), in which case the Kindle is not for you :)
Sorry for writing in Polish, but I’m getting ready to buy a Kindle and have a few questions; forgive me if I’m cluttering your post:
1. Where did you buy it, directly from Amazon or through a reseller on Allegro, and what about customs duty and other taxes?
2. The browser in the Kindle 3 is supposedly WebKit-based; how is it in practice, does it handle pages well, and what about JS?
3. Does the Wi-Fi support WPA2 Enterprise encryption with a RADIUS server, or only WPA2 with PSK?
4. What about the battery? I’ve heard it lasts a month; is that true?
5. Can you buy Kindle books from Amazon in Poland? Are there any Polish stores with legal Polish books that I can later load onto the Kindle without problems, or does DRM make that impossible?