Mojibake

en.wikipedia.org


rpigab

The funniest story about mojibake is the one about a letter sent from France to a Russian address, with the mojibake itself written on the envelope as the address. The Russian postal service actually worked out what each character meant and decoded it in the right charset.

https://unicodebook.readthedocs.io/definitions.html#mojibake

zacharynewton

One of the great parts of the Python ecosystem for data processing is https://ftfy.readthedocs.io/en/latest/ which can handle mojibake and many other unicode-related translation problems.

But seriously, I'm always a little upset when data vendors/customers/etc don't specify the encoding they are using. You'd be surprised how many official or unique sources still use weird encodings in the name of compatibility.
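
To make the kind of damage ftfy repairs concrete, here is a minimal standard-library sketch of the most common case: UTF-8 bytes that were decoded as Windows-1252. The helper name `fix_utf8_read_as` is hypothetical, and this is not how ftfy works internally; ftfy applies many more heuristics than this single round trip.

```python
def fix_utf8_read_as(text, wrong_encoding="cp1252"):
    """Undo one layer of 'UTF-8 bytes decoded with the wrong codec'.

    Hypothetical helper for illustration only. If the text does not
    look like that kind of mojibake, it is returned unchanged.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Simulate the damage: UTF-8 bytes read as Windows-1252.
garbled = "café".encode("utf-8").decode("cp1252")
repaired = fix_utf8_read_as(garbled)
print(garbled, "->", repaired)
```

This only works when you already know (or correctly guess) which wrong encoding was applied, which is exactly why unspecified encodings are such a nuisance.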

xk3

If you want to test ftfy online it's available here:

https://ftfy.vercel.app/

wiredfool

FTFY is amazing. Really useful for processing Excel-generated CSVs.

nness

Funny, I have used it for the same use case (and it's a sad reminder of how horrific Excel's handling of UTF-8 in CSV files can be...)

Dalewyn

Mojibake is annoying, but it's also an interesting peek into the inner workings of how computers store and handle data (in this case, characters). It's one of the more mundane examples of "computers store data in 1s and 0s" in action.

Incidentally, a lot of Japanese software is still written for Shift-JIS, so it's still fairly tedious trying to run it in an environment that's not set to Shift-JIS. I wonder if there's an AppLocale equivalent for Windows 11...

tkgally

I’ve been using Japanese on computers on a daily basis since the mid-1990s. Mojibaké used to be a regular headache, but fortunately I rarely encounter it now.

Most of the mojibaké I do see appears when staff at the Japanese university where I teach send around zipped folders of files with Japanese names. When unzipped, the files often have mojibakéd names. I haven’t yet found a way to repair them. (The contents of the files are fine.)

Almost all of the staff use Windows computers, while I and most of the rest of the faculty use Macs.

lifthrasiir

ZIP famously doesn't specify any character encoding for its file names (a later version of APPNOTE introduced an additional bit to signal that UTF-8 is in use, which to my knowledge has not really taken off). Many archivers therefore assumed that the names are in the active code page, which meant you could experience mojibake even on Windows, and in fact it was a major pain when dealing with ZIP files originating from other East Asian countries. Later archivers generally have an option to set or guess the character encoding. If you still have those files, try them.
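
As a rough sketch of the same idea in Python: the standard `zipfile` module decodes a name as CP437 when the UTF-8 flag (general-purpose bit 11, `0x800`) is not set, so a Shift-JIS name can often be recovered by re-encoding it to CP437 and decoding it with the intended codec. The helper names below are hypothetical, and Shift-JIS as the default is an assumption about where the archive came from.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: name is stored as UTF-8


def repair_name(name, flag_bits, encoding="shift_jis"):
    """Re-decode a ZIP member name that zipfile read as CP437.

    Hypothetical helper: `encoding` is a guess about the creator's
    code page, not something the archive records.
    """
    if flag_bits & UTF8_FLAG:
        return name  # already decoded as UTF-8, nothing to repair
    try:
        return name.encode("cp437").decode(encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name  # the guess was wrong; leave the name alone


def list_repaired_names(path, encoding="shift_jis"):
    """Return the member names of a ZIP file with the repair applied."""
    with zipfile.ZipFile(path) as zf:
        return [repair_name(info.filename, info.flag_bits, encoding)
                for info in zf.infolist()]


# Simulate what zipfile reports for a Shift-JIS name without the UTF-8 flag.
garbled = "日本語.txt".encode("shift_jis").decode("cp437")
fixed = repair_name(garbled, flag_bits=0)
```

Newer Python versions (3.11 and later) also accept a `metadata_encoding` argument to `zipfile.ZipFile` for the same purpose when reading.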

throwaway197164

WinRAR (GUI): "Options -> Name encoding -> Japanese Shift-JIS".

7-zip (CLI only): 7za.exe x -mcp=932 file.zip

There are also online tools [1] that handle this.

[1] https://ianharmon.github.io/mojibake-fixer/

jicksaw

The Unarchiver tries to guess the correct encoding. GUI is for Mac only, but the CLI is cross-platform.

https://theunarchiver.com/

nephrite

In Russian this phenomenon is called "бНОПНЯ" (read "b-nop-nya") and was caused by taking the word "Вопрос" (meaning: "question") in win-1251 encoding and reading it as if it was in KOI-8 encoding.
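
The round trip described above can be reproduced with Python's built-in codecs. This is just a sketch; the function name `misread` is made up for illustration, and `koi8_r` is assumed to be the KOI-8 variant meant here.

```python
def misread(text, written_as, read_as):
    """Encode text with one codec and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


# "Вопрос" stored as Windows-1251 but displayed as KOI8-R.
garbled = misread("Вопрос", "cp1251", "koi8_r")

# Because both are single-byte encodings, the damage is reversible.
restored = misread(garbled, "koi8_r", "cp1251")

print(garbled, "->", restored)
```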

Also this is called "крокозябры" (read: kro-ko-zya-bry, nonsense word, no translation) especially when reading a binary file in a text viewer.

kaoD

> Also this is called "крокозябры" (read: kro-ko-zya-bry

In Esperanto there's krokodili[0] (literally "to crocodile") which is used to describe speaking non-Esperanto among esperantists.

This was further adapted into Toki Pona as "kokosila".

I found it funny how similar they are.

[0] https://en.m.wiktionary.org/wiki/krokodili

ezoe

It's really difficult to explain the concept of mojibake to a software developer who still believes that ASCII is fine.

It's also difficult to explain that they are using the wrong font to render kanji because of Han unification.

elcamino44

The font issue is so challenging. Even Chinese developers (who obviously don’t think ASCII is fine) sometimes won’t understand why using a Chinese font to render kanji is an issue.

yegle

There's an inside joke among programmers in Mainland China where the GBK encoding is used.

锟斤拷 (which doesn't mean anything) is the result of interpreting UTF-8's replacement character [1] in GBK.

烫 (hot, scorching) is what you get from interpreting a pair of 0xcc bytes in GBK. In debug mode, Visual Studio will initialize unused memory with 0xcc.

The inside joke is: 手持两把锟斤拷，口中疾呼烫烫烫 Holding two 锟斤拷 in hands and screaming "hot hot hot"

[1]: https://www.fileformat.info/info/unicode/char/fffd/index.htm
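
Both halves of the joke can be checked with Python's built-in `gbk` codec. A small sketch (the variable and function names are only for illustration): two U+FFFD replacement characters encoded as UTF-8 give six bytes, which GBK reads as three two-byte characters, and each pair of 0xcc bytes decodes to one 烫.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def utf8_read_as_gbk(text):
    """Encode text as UTF-8, then misinterpret the bytes as GBK."""
    return text.encode("utf-8").decode("gbk")


# Two replacement characters are EF BF BD EF BF BD in UTF-8,
# which GBK splits into the pairs EFBF, BDEF and BFBD.
garbled_replacement = utf8_read_as_gbk(REPLACEMENT * 2)

# Memory filled with 0xcc, read as GBK text two bytes at a time.
garbled_fill = (b"\xcc" * 6).decode("gbk")

print(garbled_replacement, garbled_fill)
```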

innocentoldguy

As text processing has moved away from encodings like Windows 1252 and Shift-JIS to UTF-8, mojibake has become much less of an issue. It was a frequent mess in the 1990s though.

monkpit

I remember frequently having to manually select the correct encoding while browsing in the 90s and early 00s. I’m glad it’s gone.

lifthrasiir

I still regularly see mojibake from Japanese ZIP files. The worst has indeed passed, but it will remain a lingering problem for decades.

nidnogg

This is a legitimate issue when solving merge conflicts via Azure's built-in conflict manager: it will muck you up no matter what if you have any funny punctuation going on.

My previous gig used to have an obscenely contrived scheme of multiple dummy "conflict" branches to solve issues locally whenever a conflict would arise due to that.

Really glad to be off the Microsoft stack today.

leeter

So back at one of my first jobs I worked a lot with XML. As a dev you often forget to test some of the odder corner cases, but this one had come up somewhere, so I decided to test it... and lo and behold, we failed horribly. Ever since then, any time I'm using any sort of serialization format, I add mojibake to my tests. My usual sequence these days is either Japanese/Chinese or <string of emoji that hacker news removes> or both. The amount of software that claims to respect encodings but doesn't is quite amazing. Many times they'll include things like the XML declaration and then completely ignore it. Ditto HTML and encoding headers and tags, and also byte order marks.
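
A minimal sketch of what such a test might look like in Python, using only the standard library. The sample strings and helper names are illustrative assumptions, not the original test suite; the point is simply that non-ASCII text should survive a serialize/parse round trip unchanged.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative samples: Japanese, Chinese, accented Latin, emoji.
SAMPLES = ["日本語のテキスト", "中文文本", "naïve café", "😀🎉"]


def json_round_trip(text):
    """Serialize to UTF-8 JSON bytes and parse them back."""
    data = json.dumps({"value": text}, ensure_ascii=False).encode("utf-8")
    return json.loads(data.decode("utf-8"))["value"]


def xml_round_trip(text, encoding="utf-8"):
    """Serialize to XML bytes with a declaration and parse them back."""
    root = ET.Element("value")
    root.text = text
    data = ET.tostring(root, encoding=encoding, xml_declaration=True)
    return ET.fromstring(data).text


def all_samples_survive():
    """True if every sample comes back unchanged from both formats."""
    return all(
        json_round_trip(s) == s and xml_round_trip(s) == s
        for s in SAMPLES
    )
```

In a real suite the same samples would be pushed through whatever reader and writer the application actually uses, since that is where a declared encoding tends to get ignored.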

franciscop

Note that "bake*" in Japanese also means "monster/ghost". I am not sure if it's intended or not, but I can def see this being a magnificent pun in the language, since that alt translation would be "character monster".

* Note: not sure if this is actually an official alt meaning in Japanese, an intended pun, or none of the above; these are just my notes thinking this could be a magnificent pun. It uses a different kanji, so it would only be possible if this was regularly written in katakana or only spoken.

rippercushions

It's the same word: 化け(る) bake(ru) means "change, transform, alter, corrupt". So a monster is ''o-bake'', "something which has been changed [into a monster]".

That said, no, most Japanese would not associate ''mojibake'' with "character monsters", it's just "altered characters".

franciscop

Ah nice, thanks for the clarification! Does that "altered" have the connotation of "corrupt" here? Or could it be altered in any generic way?

bitwize

Bakemono means something more like "changeling", e.g., a tanuki who is assuming human form. The bake in mojibake has more to do with this concept of changing than with ghosts or monsters.

0xFEE1DEAD

Just yesterday I watched a talk from NDC Copenhagen by Dylan Beattie about this exact topic. The story which stood out the most was this https://www.youtube.com/watch?v=gd5uJ7Nlvvo&t=22m09s

The whole talk was an interesting watch tho.
