# The Magic 0xC2
*Written 2017-04-09*
I built a web application with file upload functionality: some Vue.js in the front and a CouchDB in the back. Everything should have been pretty simple and straightforward.
But…
<!-- more -->
When I uploaded image files, they somehow got mangled. The uploaded file was bigger than the original, and the new "file format" was not readable by any means. I got intrigued. What was happening to the files? The changes seemed random but were reproducible, so I created a few test files to see what exactly changed and when.
My first file looked like this:
```
0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
```
To my surprise, the file stayed the same! My curiosity grew. In the meantime I had found a very intriguing pattern in the hexdump of the uploads: `C3 BF C3`. It was everywhere. In another file, I found similar patterns with `C2`. So I wrote my next test file, this time a binary one:
```
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 |................|
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |.... !"#$%&'()01|
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 |23456789@ABCDEFG|
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 |HIPQRSTUVWXY`abc|
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 |defghipqrstuvwxy|
80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 |................|
96 97 98 99 a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab |................|
ac ad ae af b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb |................|
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
```
**EDIT**: As you have probably already noticed, I counted up as if in base 10, but the values are actually base 16. So I skipped A-F until reaching A0. This might look weird, but it didn't affect the test.
The result after uploading was:
```
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 |................|
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |.... !"#$%&'()01|
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 |23456789@ABCDEFG|
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 |HIPQRSTUVWXY`abc|
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 |defghipqrstuvwxy|
c2 80 c2 81 c2 82 c2 83 c2 84 c2 85 c2 86 c2 87 |................|
c2 88 c2 89 c2 90 c2 91 c2 92 c2 93 c2 94 c2 95 |................|
c2 96 c2 97 c2 98 c2 99 c2 a0 c2 a1 c2 a2 c2 a3 |................|
c2 a4 c2 a5 c2 a6 c2 a7 c2 a8 c2 a9 c2 aa c2 ab |................|
c2 ac c2 ad c2 ae c2 af c2 b0 c2 b1 c2 b2 c2 b3 |................|
c2 b4 c2 b5 c2 b6 c2 b7 c2 b8 c2 b9 c2 ba c2 bb |................|
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
```
There it was again: The magic **0xC2**!
So every byte with a value higher than *0x79* was followed by a *0xC2* (*0x79* is the ASCII code for *y*). At least that is what I thought. It is actually the other way around: every byte with a value of *0x80* or higher got prefixed with a *0xC2*! Then the scales fell from my eyes: **UTF-8 encoding**!
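A minimal sketch of how this can be reproduced, in Python. It assumes that every byte was read as a character of its own (the Latin-1 interpretation) and then written out as UTF-8; where exactly in the stack that would happen is not covered here. The names `build_test_file` and `mangle` are made up for illustration.

```python
def build_test_file() -> bytes:
    """Rebuild the binary test file from above.

    Hex pairs 00..99 written with decimal digits only (so A-F are skipped),
    then a0..bb, then one row of 16 zero bytes.
    """
    decimal_looking = [int(f"{n:02d}", 16) for n in range(100)]
    tail = list(range(0xA0, 0xBC))  # 0xa0 up to and including 0xbb
    return bytes(decimal_looking + tail) + bytes(16)


def mangle(data: bytes) -> bytes:
    """Assumed failure mode: treat each byte as one Latin-1 character,
    then encode the resulting text as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


original = build_test_file()
uploaded = mangle(original)

print(len(original), len(uploaded))  # 144 192
print(" ".join(f"{b:02x}" for b in uploaded[80:88]))  # c2 80 c2 81 c2 82 c2 83
```

The first 80 bytes (*0x00* to *0x79*) come out unchanged, the 48 bytes from *0x80* upwards each grow to two bytes, and the zero row stays as it is. That matches the twelve rows of the uploaded dump.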
In *UTF-8*, all characters after *0x7F* are at least two bytes long. The characters from *0x80* to *0xBF* get prefixed with *0xC2*, ending at *0xC2BF* (which is the inverted question mark `¿`); the next character, *0xC0*, becomes *0xC380*. So what apparently happened is that, somewhere on the way to the server, every byte of the file was treated as a character and the file got encoded to UTF-8 ¯\\\_(ツ)\_/¯
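To see where the prefix comes from, the two-byte form of UTF-8 can be worked out by hand. The sketch below follows the `110xxxxx 10xxxxxx` layout for code points from U+0080 to U+07FF; the helper name `utf8_two_byte` is made up for this example.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF using the 110xxxxx 10xxxxxx layout."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range is handled here")
    lead = 0xC0 | (code_point >> 6)            # 110 + the upper 5 bits
    continuation = 0x80 | (code_point & 0x3F)  # 10 + the lower 6 bits
    return bytes([lead, continuation])


for cp in (0x80, 0xBF, 0xC0, 0xFF):
    encoded = utf8_two_byte(cp)
    # Cross-check against Python's built-in encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encoded.hex()}")
# U+0080 -> c280
# U+00BF -> c2bf
# U+00C0 -> c380
# U+00FF -> c3bf
```

The last line would also fit the `C3 BF C3` pattern from the image uploads: a *0xFF* byte turns into `C3 BF`, and any following byte from *0xC0* to *0xFF* starts with `C3` again.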
**EDIT:** Corrected some mistakes after some comments on [Hacker News](https://news.ycombinator.com/item?id=14089827).