Character Sets and Converting WINDOWS-1252 to ISO-8859-1

At work, I have the task of migrating some third-party data over to our system. I was given some test XML files from the third party, formatted to the XML Schema we had given them. I began work on them immediately.

Everything went fine. With some minor tweaks, our system digested them without error. I could have moved on to another task at that point; however, I decided to view the content on our test site. That’s when I saw them: bad characters everywhere. The character encoding listed in the XML files was ISO-8859-1. I had a suspicion that the XML files weren’t really using this encoding.

Of course, I thought there must be a way to prove this, to determine which character set was used to encode a file. I was thinking, “if I knew what it was, I could probably convert it.” After a few Google searches and reading articles such as Joel’s, I discovered there really isn’t a way to accurately determine a file’s character set. I also read plenty of rants saying that respecting the character set is the job of the creator: “That’s why we have standards.”

Well, that doesn’t help. I asked one of our senior developers about this, and with some assumptions, he was able to determine that the files were most likely encoded using windows-1252 or something close to it. Really, it takes someone looking at the text to make an educated guess at what the character set could be. There is no way to do this reliably in a programmatic way.
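That said, one of the clues a person looks for can be counted mechanically, as a hint rather than proof. The bytes 0x80 through 0x9F are invisible control codes in ISO-8859-1 but printable characters (curly quotes, dashes, the euro sign) in windows-1252, so a file full of them probably isn't true ISO-8859-1. Here is a rough Python sketch; the function name `count_c1_range_bytes` is made up for illustration.

```python
def count_c1_range_bytes(data: bytes) -> int:
    """Count bytes in 0x80-0x9F.

    These are control codes in ISO-8859-1 but printable characters
    in windows-1252, so many of them in "text" is a warning sign.
    """
    return sum(1 for b in data if 0x80 <= b <= 0x9F)


# Curly double quotes encoded as windows-1252 become bytes 0x93 and 0x94.
sample = "\u201cquoted\u201d".encode("cp1252")
print(count_c1_range_bytes(sample))
```

A nonzero count doesn't prove the file is windows-1252 (UTF-8 continuation bytes can land in the same range). It only says the ISO-8859-1 label is doubtful, and a person still has to look at the text.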

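So anyway, using iconv, below is how to convert windows-1252 to ISO-8859-1. The //TRANSLIT suffix asks iconv to substitute a similar-looking character when the original has no ISO-8859-1 equivalent (curly quotes, for example), rather than stopping with an error.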

iconv -f windows-1252 -t ISO-8859-1//TRANSLIT input.xml -o output.xml
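If iconv isn't handy, roughly the same conversion can be sketched in Python. Python's codecs have no //TRANSLIT equivalent, so this sketch uses a small hand-written replacement table and falls back to `?` for anything else ISO-8859-1 cannot represent. The table and the function name `cp1252_to_latin1` are my own assumptions for illustration, not what iconv does internally, so the output will not match iconv byte for byte.

```python
# Hand-picked approximations for common windows-1252 characters that
# have no ISO-8859-1 equivalent. Illustrative only, not a complete table.
APPROXIMATIONS = {
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # horizontal ellipsis
    "\u20ac": "EUR",                # euro sign
}
TABLE = str.maketrans(APPROXIMATIONS)


def cp1252_to_latin1(data: bytes) -> bytes:
    """Decode windows-1252 bytes and re-encode them as ISO-8859-1."""
    # Raises UnicodeDecodeError on the few bytes cp1252 leaves undefined.
    text = data.decode("cp1252")
    # Approximate what we can; anything still unencodable becomes "?".
    return text.translate(TABLE).encode("latin-1", errors="replace")


converted = cp1252_to_latin1(b"\x93caf\xe9\x94")
print(converted)
```

Characters that exist in both encodings, such as é, pass through unchanged; only the windows-1252 extras get approximated.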

One Response to Character Sets and Converting WINDOWS-1252 to ISO-8859-1

  1. Greg Sandell says:

    Hi Mel, I wrote a Perl script once to diagnose a file by counting its extended-ASCII characters and UTF-8 byte sequences, to try to come to a judgement of whether the file was plain ASCII, Latin-1, WinLatin-1 or UTF-8. I was using it at iCrossing when we would get gigabyte-sized nightly feeds from customers like Sears. Let me know if you’d like it. I also have a blog article on character sets here: http://bit.ly/GMgP8Y . Congrats on your huge number of hits today! And the blog looks great in WordPress.
