Building a Shaded Uber Jar with Jersey-Client and Getting ClientHandlerException

I’ve been meaning to write about this on my blog to help others out there who run into this problem when using jersey-client in their code.

So I was working with jersey-client for the first time and everything was going fine. My code hits a RESTful API somewhere, retrieves some useful metrics, and then acts on them.

However, when I compiled my code and ran it on our team’s dev server, I got the com.sun.jersey.api.client.ClientHandlerException shown below.

Fun, right?

Anyway, it turns out the reason for this error was that I was using the maven-shade-plugin to build an uber jar of the project (a requirement for this project), and the transformer I was using was overwriting the metadata in META-INF for each Jersey service instead of appending to it like it should.

So the fix:
I updated my pom file. Note that this fix only applies when building a shaded uber jar with Maven while using jersey-client. Hope this helps someone else out there.
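Here’s a minimal sketch of the kind of maven-shade-plugin configuration that fixes it (the plugin version and surrounding build section are illustrative, not necessarily exactly what I used). The key piece is the ServicesResourceTransformer, which merges the META-INF/services entries from all the dependency jars into the uber jar instead of letting them overwrite each other, so Jersey can still discover its providers:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Appends META-INF/services entries from all jars instead of
               overwriting them, so Jersey's providers stay registered. -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>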

For search engines…

Exception in thread "main" com.sun.jersey.api.client.ClientHandlerException: com.sun.jersey.api.client.ClientHandlerException: A message body writer for Java type, class java.lang.String, and MIME media type, application/json, was not found
com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:570)
Caused by: com.sun.jersey.api.client.ClientHandlerException: A message body writer for Java type, class java.lang.String, and MIME media type, application/json, was not found


Reading and Writing ORC Files Using Vectorized Row Batch in Java

So looking at the analytics of my blog, it appears my article on how to write ORC files using core Java is really popular.

In this post, I’ll be going over how to write and read ORC files using Apache ORC’s Vectorized Row Batch. Using this approach improves performance: “The focus is on speed and access the data fields directly.”

Unlike my previous post, this time I wised up and created a project for you to clone and work with on GitHub. If you want to skip reading and just dig into the code, you can find the project here. Also, all future posts I write that have code examples will be in this project.

First, you’ll need to add this dependency to your project’s pom file if you haven’t already. For this post, I’m using the latest release, org.apache.orc:orc-mapreduce:1.4.0.
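That’s just the coordinates above in pom form:

<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-mapreduce</artifactId>
  <version>1.4.0</version>
</dependency>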

First Example: Simple Write
This first example is lifted directly from Apache ORC’s documentation. The only thing I changed here was the filename of the ORC file that is being generated.
Link to Code


	public static void main(String [ ] args) throws java.io.IOException
	{
		Configuration conf = new Configuration();
		// Two integer columns, x and y.
		TypeDescription schema = TypeDescription.fromString("struct<x:int,y:int>");
		Writer writer = OrcFile.createWriter(new Path("/tmp/simple-my-file.orc"),
				OrcFile.writerOptions(conf)
						.setSchema(schema));


		VectorizedRowBatch batch = schema.createRowBatch();
		LongColumnVector x = (LongColumnVector) batch.cols[0];
		LongColumnVector y = (LongColumnVector) batch.cols[1];
		for(int r=0; r < 10000; ++r) {
			int row = batch.size++;
			x.vector[row] = r;
			y.vector[row] = r * 3;
			// If the batch is full, write it out and start over.
			if (batch.size == batch.getMaxSize()) {
				writer.addRowBatch(batch);
				batch.reset();
			}
		}
		if (batch.size != 0) {
			writer.addRowBatch(batch);
			batch.reset();
		}
		writer.close();
	}

Second Example: Write ORC File With Different Datatypes
My next example is a little more interesting. In it, I’m writing other datatypes, not just integers. Note how a String must be converted to a byte array and written using the BytesColumnVector.
Link to Code


	public static void main(String [ ] args) throws java.io.IOException
	{
		Configuration conf = new Configuration();
		Random rand = new Random(); // java.util.Random; source of the sample values written below
		TypeDescription schema = TypeDescription.createStruct()
				.addField("int_value", TypeDescription.createInt())
				.addField("long_value", TypeDescription.createLong())
				.addField("double_value", TypeDescription.createDouble())
				.addField("float_value", TypeDescription.createFloat())
				.addField("boolean_value", TypeDescription.createBoolean())
				.addField("string_value", TypeDescription.createString());

		Writer writer = OrcFile.createWriter(new Path("/tmp/my-file.orc"),
				OrcFile.writerOptions(conf)
						.setSchema(schema));


		VectorizedRowBatch batch = schema.createRowBatch();
		LongColumnVector intVector = (LongColumnVector) batch.cols[0];
		LongColumnVector longVector = (LongColumnVector) batch.cols[1];
		DoubleColumnVector doubleVector = (DoubleColumnVector) batch.cols[2];
		DoubleColumnVector floatColumnVector = (DoubleColumnVector) batch.cols[3];
		LongColumnVector booleanVector = (LongColumnVector) batch.cols[4];
		BytesColumnVector stringVector = (BytesColumnVector) batch.cols[5];


		for(int r=0; r < 100000; ++r) {
			int row = batch.size++;

			intVector.vector[row] = rand.nextInt();
			longVector.vector[row] = rand.nextLong();
			doubleVector.vector[row] = rand.nextDouble();
			floatColumnVector.vector[row] = rand.nextFloat();
			booleanVector.vector[row] =  rand.nextBoolean() ? 1 : 0;
			stringVector.setVal(row, UUID.randomUUID().toString().getBytes());

			if (batch.size == batch.getMaxSize()) {
				writer.addRowBatch(batch);
				batch.reset();
			}
		}
		if (batch.size != 0) {
			writer.addRowBatch(batch);
			batch.reset();
		}
		writer.close();
	}

Third Example: Read ORC File With Different Datatypes
Finally, in this example, you can see how to read the ORC file generated in the example above.
Link to Code


	public static void main(String [ ] args) throws java.io.IOException
	{
		Configuration conf = new Configuration();
		Reader reader = OrcFile.createReader(new Path("/tmp/my-file.orc"),
				OrcFile.readerOptions(conf));

		RecordReader rows = reader.rows();
		VectorizedRowBatch batch = reader.getSchema().createRowBatch();

		while (rows.nextBatch(batch)) {
			LongColumnVector intVector = (LongColumnVector) batch.cols[0];
			LongColumnVector longVector = (LongColumnVector) batch.cols[1];
			DoubleColumnVector doubleVector  = (DoubleColumnVector) batch.cols[2];
			DoubleColumnVector floatVector = (DoubleColumnVector) batch.cols[3];
			LongColumnVector booleanVector = (LongColumnVector) batch.cols[4];
			BytesColumnVector stringVector = (BytesColumnVector)  batch.cols[5];


			for(int r=0; r < batch.size; r++) {
				int intValue = (int) intVector.vector[r];
				long longValue = longVector.vector[r];
				double doubleValue = doubleVector.vector[r];
				float floatValue = (float) floatVector.vector[r];
				boolean boolValue = booleanVector.vector[r] != 0;
				String stringValue = new String(stringVector.vector[r], stringVector.start[r], stringVector.length[r]);

				System.out.println(intValue + ", " + longValue + ", " + doubleValue + ", " + floatValue + ", " + boolValue + ", " + stringValue);

			}
		}
		rows.close();
	}

Again, note that to read a string correctly, you also need to use the BytesColumnVector’s start and length arrays. Otherwise, you get the whole byte array stored in the column as one giant continuous string. I discovered how to do this by looking at the source code for BytesColumnVector. I noticed there was a method named “stringifyValue” and knew the solution I needed was in that method. So I took a look, and there it was. This is an important point that I’m always drilling into our apprentices at work: always check the source code of the open-source project you’re working with. You’ve got to learn to dig in there; not all the answers are on Google and StackOverflow.
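If you’d rather not slice the byte array yourself, here’s a small sketch of leaning on that same method directly (using the stringVector and r variables from the read example above). stringifyValue is meant for debugging-style output, but it handles the start/length bookkeeping and nulls for you:

	// Let the column vector render the row's value itself.
	StringBuilder sb = new StringBuilder();
	stringVector.stringifyValue(sb, r);
	System.out.println(sb);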


A Last Minute Lightning Talk

As I stated in one of my previous posts, I’ve been wanting to give a tech talk one day, but needed to work up to doing something like that. And then, out of nowhere, I found myself being asked to give a lightning talk at work about what I learned at the Strata+Hadoop Conference I attended. My audience was the other software engineers and some of the data scientists at the company. Below are some of my initial thoughts on the talk I gave.

  1. I’ve been working on my pace. I tried to consciously slow down and not talk so fast. I think this is working. But in the moments I thought to slow down, I had a tendency to look up into the space above the listeners’ heads. I don’t know if that’s a good thing or a bad one. I think I need to film myself when I practice or give another talk to find out.

  2. Even though this talk was short and unplanned, I still need to at least appear prepared. So: work on confidence, but also make it feel like a conversation and not a lecture.

  3. Because of time constraints, I wasn’t the one who prepared my slides, so my bullet points were not in the order I had practiced in my mind. I was able to recover. This just reminds me to be prepared for anything.

Happy Trails!

So, imagine spending the last few months minimizing your life: deciding what to keep and what to sell off. Everything you’ve decided to keep must fit in a 5×10 storage unit.

The next step is to leave your job and move out of your apartment. And for the next 6 months, you and your significant other will be hiking the Pacific Crest Trail.

My brother-in-law and his wife are doing just that. Today is the first day of their long hike. They are starting their journey at the Mexican border and plan to hike the trail north into Canada. They estimate it will take approximately 6 months. Thankfully, they will be journaling about their journey here.

I’m so excited for them. I never had the opportunity in life to do something this adventurous. So, I’m going to live vicariously through the stories they post for the next 6 months.

I’m also proud of them. The amount of planning and preparation it takes to embark on such a trip is astounding. And then there’s the courage to leave safe and secure jobs; I know I wouldn’t be able to do that. Everything you need to survive must fit in your hiking pack.

Proud of you guys.


Talking the Talk

Me Thinking About Talking


Not too long ago, I was on the train talking with one of my coworkers, Jayesh, and I saw him working on something. So, I asked him about it.

He told me he submitted three different proposals for talks at the next Apache Big Data Conference. I was wowed. I was even more wowed when he said all three were accepted. So from now and until the conference, when he’s not working at his job or when his family is sleeping, he’s preparing his talks and their slides.

This made the wheels in my head start turning, especially after coming home from the Strata Conf. Why don’t I give a talk at a conference?

I would, of course, need to build up to something like a conference. I would first give a lunch-n-learn at work; that’s where I would start this journey, by giving a presentation to my peers. Next, I would see if I could give a talk at a local meetup or user group. From there, I would slowly work up to a national conference.

I don’t see this all happening overnight. This would take careful planning and work over a year or so, and a lot of practice. I normally do not give presentations or talks to big groups as part of my everyday life, so I see myself doing a lot of rehearsing and dry runs until it feels natural.

So what would I talk about?

One thing I learned at Strata Conf is that there were two types of talks I really enjoyed. The first were debugging talks: talks that discussed how to debug your big data jobs. The second type I enjoyed were tutorials. These walk the audience through building a small working project, starting from nothing. A tutorial talk helps people bootstrap the process of getting started with a new technology. It’s more hands-on and immediate.

I love these talks the most because they affect my everyday life now. I walk out with action items that I can use in my day-to-day development. And if I feel like this, I know others feel the same and would appreciate people giving talks like these.

So, my first attempt at creating a tech talk will be a tutorial, and maybe I’ll attempt a debugging talk down the road. As the wheels turn in my mind, I’m picturing a simple kafka-spark-streaming tutorial. I’ll just have to get down to it and create it now.


Writing an ORC File Using Java

2017/06/13: Check out my updated article on reading and writing ORC files using Vectorized Row Batch in Java.
Linked here: https://codecheese.wordpress.com/2017/06/13/reading-and-writing-orc-files-using-vectorized-row-batch-in-java/

So, I saw this question some time ago on Stack Overflow and I’ve been meaning to answer it for a while. The question: how to convert a .txt / .csv file to ORC format.

Below is a code sample of how to do it using Apache Crunch. I originally couldn’t find any documentation on how to do this. Most of what I found showed how to use a Hive query to create ORC files from an existing database table. So I started reading the unit tests of various projects until I found something that made sense, and that’s how I came across Apache Crunch.

The valuable lesson I’ve learned when working with a new open-source project is to not neglect taking a look at the unit tests. Many of the answers you’re looking for can be discovered there.

Also, one note about the code below: it’s a variation of what can already be found in Apache Crunch’s unit tests. And second, when choosing the stripe size and buffer size to use, you’ll need to experiment with your HDFS cluster and see what gives you the best performance for running Hive queries on the data you created. I haven’t found any universal rule of thumb for what to set these to that I’ve actually seen work myself.

I’m currently doing something similar. Taking tab delimited data stored in a Kafka topic and landing it on HDFS as ORC files.


import org.apache.crunch.types.orc.OrcUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.hive.serde2.io.ShortWritable;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.hive.ql.io.orc.Writer;

import java.io.IOException;

public class Test {

	public static void main(String[] args) throws IOException {
		Test test = new Test();
		String tsvString = "text_string\t1\t2\t3\t123.4\t123.45";
		test.createOrcFile(tsvString);
	}

	public void createOrcFile(String input) throws IOException {
		// Note: the column names here are just illustrative; the types line up with
		// the writables passed to createOrcStruct below.
		String typeStr = "struct<a:string,b:smallint,c:int,d:bigint,e:double,f:float>";
		TypeInfo typeInfo = TypeInfoUtils.getTypeInfoFromTypeString(typeStr);
		ObjectInspector inspector = OrcStruct.createObjectInspector(typeInfo);

		String[] inputTokens = input.split("\\t");

		OrcStruct orcLine = OrcUtils.createOrcStruct(
				typeInfo,
				new Text(inputTokens[0]),
				new ShortWritable(Short.valueOf(inputTokens[1])),
				new IntWritable(Integer.valueOf(inputTokens[2])),
				new LongWritable(Long.valueOf(inputTokens[3])),
				new DoubleWritable(Double.valueOf(inputTokens[4])),
				new FloatWritable(Float.valueOf(inputTokens[5])));
		Configuration conf = new Configuration();
		Path tempPath = new Path("/tmp/test.orc");

		Writer writer = OrcFile.createWriter(tempPath, OrcFile.writerOptions(conf).inspector(inspector).stripeSize(100000).bufferSize(10000));
		writer.addRow(orcLine);
		writer.close();
	}
}

Update: In case wordpress.com is filtering out code from my blog, I’ve recreated the example as a gist on GitHub below.

Update 2017-03-17: The dependencies were asked for in the comments. These jars are quite old, since it was a while ago that I first wrote this. I would try getting it working with these dependencies first and see if it works, and then try updating them.


Embedded List of Records Using Avro Generic Record

I’m posting this here, because I know someone out there will be Googling how to do this like I was.

I didn’t find much and ended up spending a lot of time reading through the Avro API and using my debugger. Hopefully this will help someone.

The code below is an example of how to construct an Avro object that has an embedded (nested) list of Avro objects using the GenericRecord class. In the example, there is a User Avro record that contains a list of phoneNumber records.


String schemaDescription = "[ { \n"
	+ "\"type\" : \"record\", \n"
	+ "\"name\" : \"PhoneNumber\", \n"
	+ "\"namespace\" : \"com.test.namespace.avro\", \n"
	+ "\"fields\" : [ \n"
	+ "   {\"name\" : \"id\", \"type\": \"long\", \"doc\": \"Attribute id\" }, \n"
	+ "   {\"name\" : \"updateDate\",  \"type\": \"long\", \"doc\": \"Update date of the attribute\" }, \n"
	+ "   {\"name\" : \"value\", \"type\" : \"string\", \"doc\": \"Value of the attribute\" } \n"
	+ " ] \n"
	+ "}, \n"
	+ "{ \n"
	+ "\"type\" : \"record\", \n"
	+ "\"name\" : \"User\", \n"
	+ "\"namespace\" : \"com.test.namespace.avro\", \n"
	+ "\"fields\" : [ \n"
	+ "   {\"name\": \"timestamp\", \"type\": \"long\"},\n"
	+ "   {\"name\": \"username\", \"type\": \"string\" },\n"
	+ "   {\"name\" : \"phoneNumbers\", \"type\" : [\"null\", {\"type\" : \"array\", \"items\" : \"PhoneNumber\"}] } \n"
	+ "] \n"
	+ "}] \n";
Schema schema = new Schema.Parser().parse(schemaDescription);
Schema phoneNumberSchema = schema.getTypes().get(0);

List<GenericRecord> list = new ArrayList<>();

Date d = new Date();
GenericRecord genericPhoneNumber = new GenericData.Record(phoneNumberSchema);
genericPhoneNumber.put("id", Long.valueOf(1L));
genericPhoneNumber.put("updateDate", d.getTime());
genericPhoneNumber.put("value", "111-555-5555");
list.add(genericPhoneNumber);

genericPhoneNumber = new GenericData.Record(phoneNumberSchema);
genericPhoneNumber.put("id", Long.valueOf(2L));
genericPhoneNumber.put("updateDate", d.getTime());
genericPhoneNumber.put("value", "222-666-6666");
list.add(genericPhoneNumber);

Schema userSchema = schema.getTypes().get(1);
GenericRecord genericUser = new GenericData.Record(userSchema);
genericUser.put("phoneNumbers", list);
genericUser.put("username", "test_username");
genericUser.put("timestamp", d.getTime());