At work, I have the task of migrating some third-party data over to our system. I was given some test XML files from the third-party which were formatted to the XML Schema we had given them. I began work on them immediately.
Everything went fine. With some minor tweaks, our system digested them without error. I could have then gone on my way to another task; however, I decided to view the content on our test site. That's when I saw them: bad characters everywhere. The character encoding declared in the XML files was ISO-8859-1. I had a suspicion that the XML files weren't really using this encoding.
Of course, I thought there must be a way to prove this, to determine what character set was used to encode a file. I was thinking, "if I knew what it was, I could probably convert it." After a few Google searches and reading articles such as Joel's, I discovered there really isn't a way to determine a character set with certainty. I also read plenty of rants that respecting the character set is the job of the creator: "That's why we have standards."
Well, that doesn't help. I asked one of our senior developers about this, and with some assumptions he was able to determine that it was most likely encoded in windows-1252 or something close to it. Really, it takes someone looking at the text to make an educated guess at what the character set could be. There is no way to do this reliably in a programmatic way.
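A small Python sketch (my own illustration, not part of the original investigation) shows why the guess matters: the exact same bytes are valid in both encodings, so nothing errors out, but they mean different things. Bytes 0x93 and 0x94 are "smart quotes" in windows-1252, while in ISO-8859-1 they are invisible C1 control characters, which is exactly the kind of garbage that shows up as bad characters on a web page.

```python
# The same raw bytes decoded two ways. Neither decode raises an error,
# so a program can't tell which interpretation is the right one.
raw = b"\x93hello\x94"

as_latin1 = raw.decode("iso-8859-1")    # 0x93/0x94 become C1 control chars
as_cp1252 = raw.decode("windows-1252")  # 0x93/0x94 become curly quotes

print(repr(as_latin1))  # '\x93hello\x94'
print(repr(as_cp1252))  # '\u201chello\u201d', i.e. "hello" in curly quotes
```

Since both decodes succeed, only a human (or a statistical guesser) looking at the result can say which one was intended.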
So anyway, here is how to convert from windows-1252 to ISO-8859-1 using iconv. The //TRANSLIT suffix tells iconv to approximate any character that has no ISO-8859-1 equivalent (a curly quote becomes a straight one, for example) instead of failing on it.
iconv -f windows-1252 -t ISO-8859-1//TRANSLIT input.xml -o output.xml