Writing an ORC File Using Java

2017/06/13: Check out my updated article on reading and writing ORC files using Vectorized Row Batch in Java.
Linked here: https://codecheese.wordpress.com/2017/06/13/reading-and-writing-orc-files-using-vectorized-row-batch-in-java/

So, I saw this question some time ago on Stack Overflow and I’ve been meaning to answer it for a while. The question: how to convert a .txt / .csv file to ORC format.

Below is a code sample of how to do it using Apache Crunch. I originally couldn’t find any documentation on how to do this. Most of what I found showed how to use a Hive query to create ORC files from an existing database table. So I started reading the unit tests of various projects until I found something that made sense. And that’s how I came across Apache Crunch.

A valuable lesson I’ve learned when working with a new open-source project is not to neglect the unit tests. Many of the answers you’re looking for can be discovered there.

Two notes about the code below. First, it’s a variation of what can already be found in Apache Crunch’s unit tests. Second, when choosing the stripe size and buffer size (the stripeSize and bufferSize writer options), you’ll need to experiment with your HDFS cluster and see what gives you the best performance for running Hive queries on the data you created. I haven’t found any universal rule of thumb for these settings that I’ve actually seen work for myself.

I’m currently doing something similar: taking tab-delimited data stored in a Kafka topic and landing it on HDFS as ORC files.

import org.apache.crunch.types.orc.OrcUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.hive.serde2.io.ShortWritable;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.hive.ql.io.orc.Writer;

import java.io.IOException;

public class Test {

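	public static void main(String[] args) throws IOException {
		Test test = new Test();
		String tsvString = "text_string\t1\t2\t3\t123.4\t123.45";
		test.createOrcFile(tsvString);
	}

	public void createOrcFile(String input) throws IOException {
		// The struct definition was stripped from this post by wordpress.com.
		// The column names below are placeholders; the types match the writables used further down.
		String typeStr = "struct<a:string,b:smallint,c:int,d:bigint,e:double,f:float>";
		TypeInfo typeInfo = TypeInfoUtils.getTypeInfoFromTypeString(typeStr);
		ObjectInspector inspector = OrcStruct.createObjectInspector(typeInfo);

		String[] inputTokens = input.split("\\t");

		// Build one ORC row. The writables must be in the same order as the struct fields.
		OrcStruct orcLine = OrcUtils.createOrcStruct(
				typeInfo,
				new Text(inputTokens[0]),
				new ShortWritable(Short.valueOf(inputTokens[1])),
				new IntWritable(Integer.valueOf(inputTokens[2])),
				new LongWritable(Long.valueOf(inputTokens[3])),
				new DoubleWritable(Double.valueOf(inputTokens[4])),
				new FloatWritable(Float.valueOf(inputTokens[5])));
		Configuration conf = new Configuration();
		Path tempPath = new Path("/tmp/test.orc");

		// Stripe size and buffer size are in bytes; tune them for your cluster.
		Writer writer = OrcFile.createWriter(tempPath, OrcFile.writerOptions(conf).inspector(inspector).stripeSize(100000).bufferSize(10000));

		// Write the row, then close the writer so the file footer gets flushed.
		writer.addRow(orcLine);
		writer.close();
	}
}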
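
Since the type string has to line up with the writables passed to OrcUtils.createOrcStruct, one option is to build it from a list of columns instead of typing it by hand, and to check the number of tab-separated fields before indexing into them. What follows is only a rough sketch in plain Java with no Hive or Crunch dependencies. The class name StructTypeString, the helpers build and splitRow, and the column names a through f are all made up for illustration; none of them come from Hive or Crunch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Hive or Crunch.
final class StructTypeString {

	// Builds a Hive-style struct type string, e.g. "struct<a:string,b:int>",
	// from an ordered map of column name -> Hive type name.
	static String build(Map<String, String> columns) {
		if (columns.isEmpty()) {
			throw new IllegalArgumentException("a struct needs at least one column");
		}
		StringJoiner joiner = new StringJoiner(",", "struct<", ">");
		for (Map.Entry<String, String> column : columns.entrySet()) {
			joiner.add(column.getKey() + ":" + column.getValue());
		}
		return joiner.toString();
	}

	// Splits a tab-delimited row and fails early if the field count is off.
	// The -1 limit keeps trailing empty fields, which a plain split would drop.
	static String[] splitRow(String line, int expectedFields) {
		String[] tokens = line.split("\t", -1);
		if (tokens.length != expectedFields) {
			throw new IllegalArgumentException(
					"expected " + expectedFields + " fields but found " + tokens.length);
		}
		return tokens;
	}

	public static void main(String[] args) {
		// Placeholder column names; the types mirror the writables in the example above.
		Map<String, String> columns = new LinkedHashMap<>();
		columns.put("a", "string");
		columns.put("b", "smallint");
		columns.put("c", "int");
		columns.put("d", "bigint");
		columns.put("e", "double");
		columns.put("f", "float");

		// prints struct<a:string,b:smallint,c:int,d:bigint,e:double,f:float>
		System.out.println(build(columns));

		String[] tokens = splitRow("text_string\t1\t2\t3\t123.4\t123.45", columns.size());
		System.out.println(tokens.length + " fields");
	}
}
```

The resulting string could be handed to TypeInfoUtils.getTypeInfoFromTypeString in place of a hard-coded literal. A LinkedHashMap is used because the column order matters: it has to match the order of the writables in the row.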

Update: In case wordpress.com is filtering out code from my blog, I’ve recreated the example as a gist on GitHub below.

Update 2017-03-17: The dependencies were asked for in the comments. These jars are quite old, since I first wrote this a while ago. I would try getting it working with these dependencies first, and then try updating them.


11 Responses to Writing an ORC File Using Java

  1. This code is not working. It gives an error in typeinfo. Error: < expected at the end of 'struct'

    • Thanks for letting me know.
      I’ve updated the example above and tested that it works. It seems I forgot to put in the typeinfo for the example.

      Just in case wordpress.com cache hasn’t updated by the time you see this.
      You’ll need to update line 28.
      String typeStr = “struct”;
      String typeStr = “struct”;

      Thanks again.

      • Ha! It isn’t me. Something in wordpress.com is removing the struct typeinfo. It was removed in my comment.
        I’m going to look into making a gist or public code sample repo in github.

  2. K Abram says:

    I had to build a Java-to-ORC utility and ended up making an annotation-driven open-source library around it. It is available at https://github.com/eclecticlogic/eclectic-orc and features a very simple mechanism to define your schema and uses runtime code generation to efficiently take care of the boilerplate for creating the writer.

    • Hi,
      I took a quick look at your repo. You should really look into supporting sub structs. At my job, we heavily use them in our ORC files. And if we use them and leverage them, it’s most likely other companies use them as well.

  3. Sham says:

    Hi Melanio,
    I am reading an ORC file in Java and then splitting it based on size. For instance, if the file size is 5 GB, I need to create 5 files of 1 GB each. I am able to do this using Java. The only problem is that the stripe size of the split files differs from that of the original file, and I want all split files to use the original file’s stripe size. How can I retrieve the stripe size of a file using the ORC reader in Java? Please reply.

    • I’m trying to understand. You just want to get the stripe size of your input file before splitting it up, correct?
      You can take a look at Hive’s orcfiledump: run the command “hive --orcfiledump filename” and it will give you all the metadata pertaining to your ORC file. If you need to do it programmatically, I’d just take a look at the orcfiledump source code and see how the developers did it. That’s what I did when I wanted to know how they got it to read out embedded structs.

  4. Can you give links to the jar files you used?

  5. Pingback: Reading and Writing ORC Files Using Vectorized Row Batch in Java | Code Cheese
