Utf8 byte order mark is great. We're just gonna put invisible binary data in this plain text file. Nothing could ever go wrong with that!

*concatenates two text files*

Oh........ oh no 😨

The workflow for the Kitsune Tails script was that I would download it from Google Docs as text files, and then run it through our script tool to generate code from that

But the script quickly became large enough that it got split over multiple docs. So I just used "type" (that's like "cat" for you Linux folks) in the batch file to concatenate the text files first

Guess who then found out that Google Docs inserts a utf8 bom when downloading as plaintext? So then I had to add additional code to look for the byte order mark at the start of lines to explicitly remove it or it would mess things up. Very cool and fun and a good use of my limited time
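A minimal sketch of that kind of cleanup, assuming Python rather than the batch file and script tool actually used. It strips the BOM from each file's bytes before joining them, which is a different approach from scanning for it at the start of lines afterwards. The function name and sample data are hypothetical.

```python
import codecs

def concat_without_boms(chunks):
    """Join the raw bytes of several text files, dropping a leading UTF-8 BOM from each."""
    out = bytearray()
    for data in chunks:
        # codecs.BOM_UTF8 is the three bytes EF BB BF
        if data.startswith(codecs.BOM_UTF8):
            data = data[len(codecs.BOM_UTF8):]
        out += data
    return bytes(out)

# Two made-up exports, each starting with a UTF-8 BOM.
part1 = codecs.BOM_UTF8 + "Scene 1\n".encode("utf-8")
part2 = codecs.BOM_UTF8 + "Scene 2\n".encode("utf-8")

merged = concat_without_boms([part1, part2])
print(merged.decode("utf-8"))
```

When reading files one at a time in Python, opening them with `encoding="utf-8-sig"` has a similar effect, since that codec skips a leading BOM if one is present.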


tbh given the Unicode standard says that, while UTF-8 BOMs are technically allowed, their use is discouraged (see hachyderm.io/@danderson/113290 for why it's allowed), we should really have some kind of wall of shame for applications that add a UTF-8 BOM *on purpose*


@eniko but this really hampers my esoteric programming language composed entirely of byte order marks 😢

@typeswitch @eniko This could work: require UTF-16 encoding, encode a BOM using little-endian UTF-16 to represent a 0, a BOM using big-endian UTF-16 to represent a 1. The fundamental building block of the language is a sequence of 3 little- or big-endian BOMs. This gives you 3 bits or 8 symbols, enough to encode the symbols of brainfuck. Adopt brainfuck semantics.
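A rough sketch of that encoding, in Python. The post does not say which 3-bit value maps to which brainfuck symbol, so the ordering in `SYMBOLS` is an assumption, as are the function names.

```python
import codecs

SYMBOLS = "><+-.,[]"  # assumed ordering: index 0..7 becomes the 3-bit value
BIT_TO_BOM = {
    "0": codecs.BOM_UTF16_LE,  # bytes FF FE
    "1": codecs.BOM_UTF16_BE,  # bytes FE FF
}

def encode(program):
    """Turn brainfuck source into a run of UTF-16 BOMs, three per symbol."""
    out = bytearray()
    for ch in program:
        if ch not in SYMBOLS:
            continue  # brainfuck treats every other character as a comment
        for bit in format(SYMBOLS.index(ch), "03b"):
            out += BIT_TO_BOM[bit]
    return bytes(out)

def decode(data):
    """Inverse of encode: each 6-byte group (3 BOMs) yields one symbol."""
    symbols = []
    for i in range(0, len(data), 6):
        bits = ""
        for j in range(i, i + 6, 2):
            # Anything that is not the little-endian BOM is read as a 1.
            bits += "0" if data[j:j + 2] == codecs.BOM_UTF16_LE else "1"
        symbols.append(SYMBOLS[int(bits, 2)])
    return "".join(symbols)

encoded = encode("+[-]")
print(encoded.hex(" "))
print(decode(encoded))
```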

@typeswitch @eniko Tho this would not really be an appropriate use of the BOM I feel. The BOM represents having to deal with endianness differences, which my idea doesn't properly reflect. I think 0 should be encoded as the machine's native endianness UTF-16 BOM, while a 1 should be encoded as the non-native endianness UTF-16 BOM. A program may optionally start with a byte order mark to indicate native endianness.


@mort @typeswitch @eniko You only need three symbols to encode Whitespace semantics. Use native BOM for one symbol, and non-native BOM as a prefix to either of them for two more symbols.
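One way to read that as a prefix code, sketched in Python. Which of Whitespace's three characters (space, tab, linefeed) gets which code is an assumption here, and so are the names; native endianness is taken from `sys.byteorder`.

```python
import codecs
import sys

# Pick native and non-native UTF-16 BOMs for the machine running this.
if sys.byteorder == "little":
    NATIVE, FOREIGN = codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE
else:
    NATIVE, FOREIGN = codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE

# Assumed assignment: a lone native BOM, or a non-native BOM prefixing either BOM.
CODES = {
    " ": NATIVE,
    "\t": FOREIGN + NATIVE,
    "\n": FOREIGN + FOREIGN,
}

def encode_ws(program):
    """Encode Whitespace source as BOMs, ignoring every other character."""
    return b"".join(CODES[ch] for ch in program if ch in CODES)

def decode_ws(data):
    """Decode by reading one BOM, and a second one only after a non-native prefix."""
    out = []
    i = 0
    while i < len(data):
        if data[i:i + 2] == NATIVE:
            out.append(" ")
            i += 2
        else:
            out.append("\t" if data[i + 2:i + 4] == NATIVE else "\n")
            i += 4
    return "".join(out)

sample = "  \t\n"
packed = encode_ws(sample)
print(len(packed), decode_ws(packed) == sample)
```

Because no code is a prefix of another, the decoder never needs a separator between symbols.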
