Writing an ORC File Using Java

So, I saw this question some time ago on Stack Overflow and I’ve been meaning to answer it for some time.  The question: How to convert a .txt / .csv file to ORC format.

Below is a code sample of how to do it using Apache Crunch.  I originally couldn’t find any documentation on how to do this.  Most of what I found was how to use a Hive query to create ORC files from an existing database table.  So I started reading the unit tests of various projects until I found something that made sense.  And that’s how I came across Apache Crunch.

The valuable lesson I’ve learned when working with a new open-source project is to not neglect taking a look at the unit tests. Many of the answers you’re looking for can be discovered there.

One note about the code below: it’s a variation of what can already be found in Apache Crunch’s unit tests.  Also, when choosing the stripe size and buffer size to use, you’ll need to experiment with your HDFS cluster and see what gives you the best performance for running Hive queries on the data you created.  I haven’t found any universal rule of thumb for these settings that I’ve actually seen work for myself.

I’m currently doing something similar: taking tab-delimited data stored in a Kafka topic and landing it on HDFS as ORC files.

import org.apache.crunch.types.orc.OrcUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.hive.serde2.io.ShortWritable;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.hive.ql.io.orc.Writer;

import java.io.IOException;

public class Test {

	public static void main(String[] args) throws IOException {
		Test test = new Test();
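		String tsvString = "text_string\t1\t2\t3\t123.4\t123.45";
		test.createOrcFile(tsvString);
	}

	public void createOrcFile(String input) throws IOException {
		// Hive struct type describing one row. The field names are placeholders;
		// the six types must line up with the Writables passed to createOrcStruct below.
		String typeStr = "struct<a:string,b:smallint,c:int,d:bigint,e:double,f:float>";
		TypeInfo typeInfo = TypeInfoUtils.getTypeInfoFromTypeString(typeStr);
		ObjectInspector inspector = OrcStruct.createObjectInspector(typeInfo);

		// Split the tab-delimited line into one token per column.
		String[] inputTokens = input.split("\\t");

		// Wrap each token in the Writable that matches its column type.
		OrcStruct orcLine = OrcUtils.createOrcStruct(
				typeInfo,
				new Text(inputTokens[0]),
				new ShortWritable(Short.valueOf(inputTokens[1])),
				new IntWritable(Integer.valueOf(inputTokens[2])),
				new LongWritable(Long.valueOf(inputTokens[3])),
				new DoubleWritable(Double.valueOf(inputTokens[4])),
				new FloatWritable(Float.valueOf(inputTokens[5])));
		Configuration conf = new Configuration();
		Path tempPath = new Path("/tmp/test.orc");

		Writer writer = OrcFile.createWriter(tempPath, OrcFile.writerOptions(conf).inspector(inspector).stripeSize(100000).bufferSize(10000));

		// Add the row, then close the writer so the ORC file is finalized.
		writer.addRow(orcLine);
		writer.close();
	}
}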
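The tab-splitting step in createOrcFile assumes every line has exactly six well-formed fields. With real data, such as lines pulled from a Kafka topic, it may be worth validating a line before wrapping its values in Writables. Here is a minimal sketch of that idea using only the JDK. The TsvRow class and its field names are hypothetical, made up for this illustration, and are not part of Crunch or Hive.

```java
/** Hypothetical helper: validates and converts one tab-delimited line. */
final class TsvRow {
    static final int EXPECTED_FIELDS = 6;

    final String text;
    final short shortValue;
    final int intValue;
    final long longValue;
    final double doubleValue;
    final float floatValue;

    private TsvRow(String text, short shortValue, int intValue,
                   long longValue, double doubleValue, float floatValue) {
        this.text = text;
        this.shortValue = shortValue;
        this.intValue = intValue;
        this.longValue = longValue;
        this.doubleValue = doubleValue;
        this.floatValue = floatValue;
    }

    /** Parses a line, rejecting a wrong field count or a malformed number. */
    static TsvRow parse(String line) {
        // A limit of -1 keeps trailing empty fields, so the count check is accurate.
        String[] tokens = line.split("\t", -1);
        if (tokens.length != EXPECTED_FIELDS) {
            throw new IllegalArgumentException(
                    "Expected " + EXPECTED_FIELDS + " fields but found " + tokens.length);
        }
        try {
            return new TsvRow(
                    tokens[0],
                    Short.parseShort(tokens[1]),
                    Integer.parseInt(tokens[2]),
                    Long.parseLong(tokens[3]),
                    Double.parseDouble(tokens[4]),
                    Float.parseFloat(tokens[5]));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Malformed numeric field in line: " + line, e);
        }
    }

    public static void main(String[] args) {
        TsvRow row = TsvRow.parse("text_string\t1\t2\t3\t123.4\t123.45");
        System.out.println(row.text + " " + row.shortValue + " " + row.intValue + " " + row.longValue);
    }
}
```

A bad line then fails with a clear message instead of an ArrayIndexOutOfBoundsException or a bare NumberFormatException in the middle of building the OrcStruct, and the parsed values can be passed straight into the Writable constructors.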

Update: In case wordpress.com is filtering out code from my blog, I’ve recreated the example as a gist on GitHub below.

Update 2017-03-17: The dependencies were asked for in the comments. These jars are quite old, since I first wrote this a while ago. I would try getting it working with these dependencies first, and then try updating them.

Posted in code example, coding, github, hadoop, java, programming

Embedded List of Records Using Avro Generic Record

I’m posting this here, because I know someone out there will be Googling how to do this like I was.

I didn’t find much and ended up spending a lot of time reading through the Avro API and using my debugger. Hopefully this will help someone.

The code below is an example of how to construct an Avro object that has an embedded (nested) list of Avro objects using the GenericRecord class. In the example, there is a User Avro record that contains a list of PhoneNumber records.

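import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

String schemaDescription = "[ { \n"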
	+ "\"type\" : \"record\", \n"
	+ "\"name\" : \"PhoneNumber\", \n"
	+ "\"namespace\" : \"com.test.namespace.avro\", \n"
	+ "\"fields\" : [ \n"
	+ "   {\"name\" : \"id\", \"type\": \"long\", \"doc\": \"Attribute id\" }, \n"
	+ "   {\"name\" : \"updateDate\",  \"type\": \"long\", \"doc\": \"Update date of the attribute\" }, \n"
	+ "   {\"name\" : \"value\", \"type\" : \"string\", \"doc\": \"Value of the attribute\" } \n"
	+ " ] \n"
	+ "}, \n"
	+ "{ \n"
	+ "\"type\" : \"record\", \n"
	+ "\"name\" : \"User\", \n"
	+ "\"namespace\" : \"com.test.namespace.avro\", \n"
	+ "\"fields\" : [ \n"
	+ "   {\"name\": \"timestamp\", \"type\": \"long\"},\n"
	+ "   {\"name\": \"username\", \"type\": \"string\" },\n"
	+ "   {\"name\" : \"phoneNumbers\", \"type\" : [\"null\", {\"type\" : \"array\", \"items\" : \"PhoneNumber\"}] } \n"
	+ "] \n"
	+ "}] \n";
Schema schema = new Schema.Parser().parse(schemaDescription);
Schema phoneNumberSchema = schema.getTypes().get(0);

// Collect each PhoneNumber record so the list can be attached to the User record.
List<GenericRecord> list = new ArrayList<>();

Date d = new Date();
GenericRecord genericPhoneNumber = new GenericData.Record(phoneNumberSchema);
genericPhoneNumber.put("id", Long.valueOf(1L));
genericPhoneNumber.put("updateDate", d.getTime());
genericPhoneNumber.put("value", "111-555-5555");
list.add(genericPhoneNumber);

genericPhoneNumber = new GenericData.Record(phoneNumberSchema);
genericPhoneNumber.put("id", Long.valueOf(2L));
genericPhoneNumber.put("updateDate", d.getTime());
genericPhoneNumber.put("value", "222-666-6666");
list.add(genericPhoneNumber);

Schema userSchema = schema.getTypes().get(1);
GenericRecord genericUser = new GenericData.Record(userSchema);
genericUser.put("phoneNumbers", list);
genericUser.put("username", "test_username");
genericUser.put("timestamp", d.getTime());

Posted in code example, coding, java

Blogging Confidential

A small goal of mine for 2014 was to write on my blog more.  And I was really enjoying it at the beginning of the year, regularly posting every Thursday morning.

But now, it’s March and my regular posting has slowed down.  Here are some of the challenges I’ve been struggling with since I started blogging again.

1) Relevant Content — Originally, this was supposed to be my tech blog.  I wanted a place to share my developer stories, the problems I’ve solved, and my learning.

However, with all the non-disclosure statements I’ve signed with my employer, I don’t know what’s safe to share outside of work anymore. Last month, I wrote this amazing story about a technical challenge my team and I were facing.  The post was pretty good and because it was written in story form, even non-technical readers would have enjoyed it.

The day after writing it, I decided not to post it.  I thought it might expose too much of my company’s infrastructure and development practices.  Since I had doubts about whether to post it or not, I thought it was safer not to. I could have gone to the legal department to have my post reviewed, but that would have been pointless. They would have said no anyway.

I started writing about my family and parenting.  Given the number of views I get, these posts seem to be popular.  Cute stories of toddlers and babies will always be loved.
But as a husband and parent, I also want to protect the ones I love.  So, I’m cautious about how much I post about my family on a public forum such as a blog.

So, I limit what I write about work.  And I limit what I write about my family.  These are the two biggest elements of who I am.  What else can I write about that’s relevant?
I’m still trying to answer that.

2) Finding the time to write — Needless to say, being a father of two little girls is fun, but it has put an even greater restriction on my time.  Each girl needs their own time with Daddy when I’m home.  And I love giving my wife a break from the girls as well.
Many times, my blogging has fallen down my list of priorities.

Anyway, enough whining.  Let’s post this.


Dream Again

I’ve discussed previously how I listen to my dreams and try to discern what they are telling me.

However, I’ve noticed a recurring theme that comes up at least once a week these days.  Each dream is different, but they all have me performing the same action: playing guitar.  I could be playing my own guitar.  I could be playing a friend’s guitar.  I could have gone out and bought a new guitar.  I could be a younger or older version of myself playing guitar.  And in the dream, the guitar is always electric.

I’ve typically ignored these dreams; I’m a pianist, remember?

Some time ago, I decided to concentrate on one instrument and let go of the others, and this included the guitar.  I wasn’t a college kid anymore; I needed to focus on one instrument.  I couldn’t manage dedicated practice time for all my musical interests.  So for the past four years, I’ve been studying jazz piano exclusively, taking private jazz piano lessons.  And I haven’t looked back since.

But these dreams still keep happening.  It’s obvious a part of me still hasn’t let go.

Therefore, this weekend I’m going to set up a small practice space in my home office away from the piano in the living room.  I’m not going to schedule a set practice time or have an agenda on what I play.  I’m just going to have my electric guitar connected to some headphones in my office and if I feel like practicing, it’s there for me.

We’ll see how this goes.

Posted in dreams, music

Hadoop Streaming Confidential

Warning: This is a short tech article.  To my usual readers that come here to read my cute daddy-daughter stories, please feel free to skip this one.

If you are working with Hadoop Streaming and Python and see the following error in your logs:

Caused by: java.io.IOException: Cannot run program "my_mapper.py": error=2, 
    No such file or directory

Check the shebang at the top of your Python script, and make sure that the Python executable it names is in the same location on EVERY node of your cluster.

So, if your shebang is:

#!/opt/python/bin/python

Verify that /opt/python/bin/python exists on every node. Otherwise, when you run your Hadoop Streaming job, you’ll get the error above as soon as one of the nodes tries to run the Python script.
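If you would rather script the check than log in and look by hand, here is a rough sketch in plain Java that reads a script’s first line, pulls out the interpreter path from the shebang, and checks whether that path is executable on the machine where it runs. The ShebangCheck class is a hypothetical helper, not part of Hadoop, and it only inspects the local node, so you would still need to run it on each node of the cluster yourself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper, not part of Hadoop: checks a script's shebang interpreter on this node. */
final class ShebangCheck {

    /** Returns the interpreter path from a shebang line, or empty if the line is not a shebang. */
    static Optional<String> interpreterFromFirstLine(String firstLine) {
        if (firstLine == null || !firstLine.startsWith("#!")) {
            return Optional.empty();
        }
        // The interpreter is the first whitespace-separated token after "#!".
        String[] parts = firstLine.substring(2).trim().split("\\s+");
        return parts[0].isEmpty() ? Optional.empty() : Optional.of(parts[0]);
    }

    /** True if the script has a shebang and the interpreter it names is executable on this node. */
    static boolean interpreterExists(Path script) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(script, StandardCharsets.UTF_8)) {
            Optional<String> interpreter = interpreterFromFirstLine(reader.readLine());
            return interpreter.isPresent() && Files.isExecutable(Paths.get(interpreter.get()));
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the script name from the error message above.
        Path script = Paths.get(args.length > 0 ? args[0] : "my_mapper.py");
        System.out.println(script + ": interpreter "
                + (interpreterExists(script) ? "found" : "missing or no shebang"));
    }
}
```

Note that for a shebang like "#!/usr/bin/env python" this only confirms that /usr/bin/env exists. It does not confirm that python is on the PATH of the user running the job.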

Let me know in the comments below if this helped you. Thanks.

Posted in coding, hadoop

Piano Confidential


The family piano was the piano I grew up with.  When I was a kid around middle school age, my piano playing progressed so much that it was obvious that playing on a consumer keyboard wasn’t enough for me anymore.  My mother and I went to a piano store one day and I picked out the piano that I wanted.  It was a simple black console.  My mother slowly paid it off over the years.

My piano followed me through life.  It was around through college, my grad school years, my first programming job, and then marriage.  I always had to get a ground-floor apartment, because there was no way I was getting the piano up all those stairs.

When my wife and I bought a house, I was finally relieved.  I could finally have my own piano room.  And most importantly, I didn’t have to worry about upsetting my neighbors anymore.

That all changed when we started having kids.

My eldest daughter, who’s now two, has always been around my piano playing.  But then one morning, when I start my usual practice routine, my daughter stops playing with her toys and comes up to my side.  She starts tapping on my arm to get my attention.

I look down at her and she says, “My Nano!”
“Your piano?” I say.
“Yeah! My Nano!”

I guess this is what I get for having my piano room double as her play room.

Posted in Parenting, piano, random

What Dreams May Come

One of the things my wife has always found strange about me is that I always try to listen to my dreams.  If I strongly remember a dream after waking up, I try to glean any useful information from it.  Often, I find that these dreams are my subconscious screaming to tell me something.

We’re all familiar with stories of others who do this.  Like how movie director James Cameron got the idea for Terminator from a dream he had.  The dream was of a man who had ripped his skin off to reveal a robot skeleton underneath.  Or the classic story of Elias Howe, who was trying to invent the first ever sewing machine and couldn’t get past a problem he had been having.  Then one night he had a nightmare of being cooked by island natives.  The next day he remembered the unique spears the natives had, and this was the inspiration for solving the problem of threading the needle for his machine.

So here are a few of my dreams.
Once, I had a strange dream of going out and buying a record player.  I don’t know why, but I did.  And in the dream, once the record player was in my hands, I woke up.  I couldn’t go back to sleep.

I spent the rest of the night reading up on record players.  I knew I wanted a new one, not a used one.  I bookmarked the one I wanted, and since it was close to my birthday, it became my gift from my wife.

That first night I played the record player, I couldn’t believe the experience.  I still kept my childhood records from the 80s and began to play them one by one.  I was immediately transported back to the 80s.  These were songs I hadn’t heard in 25+ years.  These were songs that don’t get played on radio, or used in tv or movies, or put into commercials.  And I knew all the lyrics.

It reminded me of my youth.

Another dream I had took place a few years into the future and in the dream I was browsing around a shopping mall with my wife.  Going from store to store, I noticed that each store had some sort of Disney merchandise.  It was everywhere.  In the dream, I thought to myself, “I should buy some stock in Disney.”

When I woke up the next day, I decided to listen to that and bought shares in DIS for no reason other than I had a dream.  Didn’t bother with any research.

The stock is up 25% from where I bought it.

So, I have had a few more dreams that I haven’t acted on yet.  They seem to be more complex and need some time to make happen.  Some require learning a new skill.  Some require me to save some money to purchase.

But so far, it’s been a pretty good ride.

Posted in general, random