Reading and Writing ORC Files Using Vectorized Row Batch in Java

Looking at my blog's analytics, it appears my article on how to write ORC files using core Java is really popular.

In this post, I’ll be going over how to write and read ORC files using Apache ORC’s VectorizedRowBatch. Using this approach improves your performance times; as the ORC documentation puts it, “the focus is on speed and accessing the data fields directly.”

Unlike my previous post, this time I wised up and created a project on GitHub for you to clone and work with. If you want to skip the reading and just dig into the code, you can find the project here. Also, all future posts I write that have code examples will be in this project.

First, you’ll need to add this dependency to your project’s pom file if you haven’t already. For this post, I’m using the latest release, org.apache.orc:orc-mapreduce:1.4.0.
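If you're using Maven, the dependency block looks like this (adjust the version if a newer release is out):

```xml
<dependency>
    <groupId>org.apache.orc</groupId>
    <artifactId>orc-mapreduce</artifactId>
    <version>1.4.0</version>
</dependency>
```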

First Example: Simple Write
This first example is lifted directly from Apache ORC’s documentation. The only thing I changed here was the filename of the ORC file being generated.
Link to Code


	public static void main(String[] args) throws java.io.IOException
	{
		Configuration conf = new Configuration();
		TypeDescription schema = TypeDescription.fromString("struct<x:int,y:int>");
		Writer writer = OrcFile.createWriter(new Path("/tmp/simple-my-file.orc"),
				OrcFile.writerOptions(conf)
						.setSchema(schema));


		VectorizedRowBatch batch = schema.createRowBatch();
		LongColumnVector x = (LongColumnVector) batch.cols[0];
		LongColumnVector y = (LongColumnVector) batch.cols[1];
		for(int r=0; r < 10000; ++r) {
			int row = batch.size++;
			x.vector[row] = r;
			y.vector[row] = r * 3;
			// If the batch is full, write it out and start over.
			if (batch.size == batch.getMaxSize()) {
				writer.addRowBatch(batch);
				batch.reset();
			}
		}
		if (batch.size != 0) {
			writer.addRowBatch(batch);
			batch.reset();
		}
		writer.close();
	}

Second Example: Write ORC File With Different Datatypes
Now my next example is a little more interesting. In it, I’m writing other datatypes, not just integers. Note how a String must be converted to a byte array and written with a BytesColumnVector.
Link to Code


	public static void main(String[] args) throws java.io.IOException
	{
		Configuration conf = new Configuration();
		Random rand = new Random();
		TypeDescription schema = TypeDescription.createStruct()
				.addField("int_value", TypeDescription.createInt())
				.addField("long_value", TypeDescription.createLong())
				.addField("double_value", TypeDescription.createDouble())
				.addField("float_value", TypeDescription.createFloat())
				.addField("boolean_value", TypeDescription.createBoolean())
				.addField("string_value", TypeDescription.createString());

		Writer writer = OrcFile.createWriter(new Path("/tmp/my-file.orc"),
				OrcFile.writerOptions(conf)
						.setSchema(schema));


		VectorizedRowBatch batch = schema.createRowBatch();
		LongColumnVector intVector = (LongColumnVector) batch.cols[0];
		LongColumnVector longVector = (LongColumnVector) batch.cols[1];
		DoubleColumnVector doubleVector = (DoubleColumnVector) batch.cols[2];
		DoubleColumnVector floatColumnVector = (DoubleColumnVector) batch.cols[3];
		LongColumnVector booleanVector = (LongColumnVector) batch.cols[4];
		BytesColumnVector stringVector = (BytesColumnVector) batch.cols[5];


		for(int r=0; r < 100000; ++r) {
			int row = batch.size++;

			intVector.vector[row] = rand.nextInt();
			longVector.vector[row] = rand.nextLong();
			doubleVector.vector[row] = rand.nextDouble();
			floatColumnVector.vector[row] = rand.nextFloat();
			booleanVector.vector[row] =  rand.nextBoolean() ? 1 : 0;
			stringVector.setVal(row, UUID.randomUUID().toString().getBytes());

			if (batch.size == batch.getMaxSize()) {
				writer.addRowBatch(batch);
				batch.reset();
			}
		}
		if (batch.size != 0) {
			writer.addRowBatch(batch);
			batch.reset();
		}
		writer.close();
	}

Third Example: Read ORC File With Different Datatypes
Finally, this example shows how to read the ORC file generated in the previous example.
Link to Code


	public static void main(String[] args) throws java.io.IOException
	{
		Configuration conf = new Configuration();
		Reader reader = OrcFile.createReader(new Path("/tmp/my-file.orc"),
				OrcFile.readerOptions(conf));

		RecordReader rows = reader.rows();
		VectorizedRowBatch batch = reader.getSchema().createRowBatch();

		while (rows.nextBatch(batch)) {
			LongColumnVector intVector = (LongColumnVector) batch.cols[0];
			LongColumnVector longVector = (LongColumnVector) batch.cols[1];
			DoubleColumnVector doubleVector  = (DoubleColumnVector) batch.cols[2];
			DoubleColumnVector floatVector = (DoubleColumnVector) batch.cols[3];
			LongColumnVector booleanVector = (LongColumnVector) batch.cols[4];
			BytesColumnVector stringVector = (BytesColumnVector)  batch.cols[5];


			for(int r=0; r < batch.size; r++) {
				int intValue = (int) intVector.vector[r];
				long longValue = longVector.vector[r];
				double doubleValue = doubleVector.vector[r];
				float floatValue = (float) floatVector.vector[r];
				boolean boolValue = booleanVector.vector[r] != 0;
				String stringValue = new String(stringVector.vector[r], stringVector.start[r], stringVector.length[r]);

				System.out.println(intValue + ", " + longValue + ", " + doubleValue + ", " + floatValue + ", " + boolValue + ", " + stringValue);

			}
		}
		rows.close();
	}

Again, note that to read a string correctly, you also need to use the BytesColumnVector’s start and length arrays. Otherwise, you get the whole byte array stored in the column as one giant continuous string. I discovered how to do this by looking at the source code for BytesColumnVector: I noticed a method named “stringifyValue” and figured the solution I needed was in there. So I took a look, and there it was. This is an important point that I’m always drilling into our apprentices at work: always check the source code of the open-source project you’re working with. You’ve got to learn to dig in there; not all answers are on Google and StackOverflow.
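To see why start and length matter, here is a minimal sketch using only the plain JDK (no ORC dependency). The class and field names are mine, chosen to mirror BytesColumnVector's layout: one shared byte buffer plus per-row start/length arrays.

```java
// Illustrative only: mimics how BytesColumnVector flattens many string
// values into a single shared byte buffer, indexed by start/length.
public class FlattenedStrings {
    // Three values, "foo", "bar", "baz", packed back-to-back.
    static byte[] buffer = "foobarbaz".getBytes();
    static int[] start  = {0, 3, 6};
    static int[] length = {3, 3, 3};

    // The right way: slice the buffer per row, like stringifyValue does.
    static String valueAt(int row) {
        return new String(buffer, start[row], length[row]);
    }

    public static void main(String[] args) {
        System.out.println(valueAt(1));         // just the second value
        System.out.println(new String(buffer)); // the "one giant string" mistake
    }
}
```

Ignore start and length and you decode the entire buffer, which is exactly the bug described above.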
