mrbluecoat
ognyankulev
The fact that the long article fails to make the historical/continuation link to MessagePack is by itself a red flag signalling a CBOR ad.
Edit: OK, actually there is a separate page for alternatives: https://cborbook.com/introduction/cbor_vs_the_other_guys.htm...
mikepurvis
Notably missing is a comparison to Cap'n Proto, which to me feels like the best set of tradeoffs for most binary interchange needs.
I honestly wonder sometimes if it's held back by the name— I love the campiness of it, but I feel like it could be a barrier to being taken seriously in some environments.
aidenn0
Doesn't Cap'n Proto require the receiver to know the types for proper decoding? This wouldn't entirely disqualify it from comparison, since e.g. protobufs are that way as well, but they make it less interesting for comparing to CBOR, which is type-tagged.
kentonv
> I feel like it could be a barrier to being taken seriously in some environments.
Working as intended. ;)
abrookewood
Have to agree. I've heard of every format you mentioned, but never heard of CBOR.
pelagicAustral
I first heard of it while developing a QR code travel passport during the Covid era... the technical specification included CBOR as part of the implementation requirement. Past this, I have not crossed paths with it again...
darthrupert
CBOR is just a standard data format. Why would it need an ad? What are they selling here?
Retr0id
A lot of people (myself included) are working on tools and protocols that interoperate via CBOR. Nobody is selling CBOR itself, but I for one have a vested interest in promoting CBOR adoption (which makes it sound nefarious but in reality I just think it's a neat format, when you add canonicalization).
CBOR isn't special here, similar incentives could apply to just about any format - but JSON for example is already so ubiquitous that nobody needs to promote it.
8n4vidtmkvmk
If I adopt a technology, I probably don't want it to die out. Widespread support is generally good for all that use it.
f_devd
I would agree their claim is a bit early, but I think a key difference between those you mentioned and CBOR is the stability expectation. Protobuf/Parquet/etc are usually single-source libraries/frameworks, which can be changed quite quickly, while CBOR seems to be going for a spec-first approach.
_the_inflator
Love or hate JSON, the beauty and utility stem from the fact that you have only the fundamental datatypes as a requirement, and that's it.
Structured data that, by nesting, pleases the human eye, reduced to the max in a key-value fashion, pure minimalism.
And while you have to write type converters all the time for datetime, BLOBs etc., these converters are the real reasons why JSON is so useful: every OS or framework provides the heavy lifting for it.
So any elaborate new silver bullet would require solving the converter/mapper problem, which it can't.
And you can complain or explain with JSON: "Comments not a feature?! WTF!" - Add a field with the key "comment"
Some smart guys went the extra mile and nevertheless demanded more, because wouldn't it be nice to have some sort of "strict JSON"? JSON schema was born.
And here you can visibly experience the inner conflict of "on the one hand" vs "on the other hand". Applying schemas to JSON is a good cause and reasonable, but guess what happens to JSON? It starts to look like unreadable bloat, which is to say: XML.
Extensibility is fine, basic operations appeal to both demands, simple and sophisticated, and don't impose the sophistication on you just for a simple 3-field exchange about dog food preferences.
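The converters mentioned above are easy to sketch with Python's standard json module; the Converter class and payload here are purely illustrative, not from any particular framework:

```python
import base64
import json
from datetime import datetime, timezone

class Converter(json.JSONEncoder):
    """Illustrative encoder: map datetimes to ISO 8601 and bytes to base64."""
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, bytes):
            return base64.b64encode(obj).decode("ascii")
        return super().default(obj)

payload = {"at": datetime(2024, 1, 1, tzinfo=timezone.utc), "blob": b"\x00\x01"}
print(json.dumps(payload, cls=Converter))
# {"at": "2024-01-01T00:00:00+00:00", "blob": "AAE="}
```

The decode side needs a matching step (e.g. parsing the ISO string back into a datetime), which is exactly the per-type boilerplate the comment describes.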
sevensor
My complaint about JSON is that it’s not minimal enough. The receiver always has to validate anyway, so what has syntax typing done for us? Different implementations of JSON disagree about what constitutes a valid value. For instance, is
{"x": NaN}
valid JSON? How about 9007199254740993? Or -.053? If so, will that text round trip through your JSON library without loss of precision? Is that desirable if it does?
Basically I think formats with syntax-typed primitives always run into this problem: even if the encoder and decoder are consistent with each other about what the values are, the receiver still has to decide whether it can use the result. This, after all, is the main benefit of a library like Pydantic. But if we’re doing all this work to make sure the object is correct, we know what the value types are supposed to be on the receiving end, so why are we making a needlessly complex decoder guess for us?
aidenn0
NaN is not a valid value in JSON. Neither are 0123 or .123 (there must always be at least one digit before the decimal marker, but extraneous leading zeroes are disallowed).
JSON was originally parsed in javascript with eval() which allowed many things that aren't JSON through, but that doesn't make JSON more complex.
sevensor
That’s my point, though! I’ve run into popular JSON libraries that will emit all of those! 9007199254740993 is problematic because it’s not representable as a 64 bit float. Python’s JSON library is happy to write it, even though you need an int to represent it, and JSON doesn’t have ints.
Edit: I didn’t see my thought all the way through here. Syntax typing invites this kind of nonconformity, because different programming languages mean different things by “number,” “string,” “date,” or even “null.” They will bend the format to match their own semantics, resulting in incompatibility.
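Both failure modes are easy to demonstrate with Python's standard json module (shown as an illustration; other languages' libraries behave differently, which is the point):

```python
import json

# The default encoder happily emits NaN, which strict JSON (RFC 8259) forbids;
# json.dumps(float("nan"), allow_nan=False) would raise ValueError instead.
print(json.dumps(float("nan")))  # NaN

# 2**53 + 1 serializes fine from Python's arbitrary-precision ints, but no
# IEEE 754 double can hold it, so a float-based decoder (e.g. JavaScript's
# JSON.parse) silently rounds it.
big = 9007199254740993
assert json.loads(json.dumps(big)) == big   # Python round-trips it via int
assert float(big) == 9007199254740992.0     # as a double it rounds down
```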
conartist6
Yeah I would emit NaN and just hope the receiver handles it.
What's the point of lying about the data?
The format offers you no data type that would not be an outright lie when applied to this data, so you may as well not lie and break the format
zzo38computer
> you have only the fundamental datatypes as a requirement
Not really; the set of datatypes has problems. Strings must be Unicode, with no type for binary data or for non-Unicode text. Numbers are usually interpreted as floating point rather than integers, which can also be a problem. Keys can only be strings. And there are other problems. So, the data types are not very good.
And, since it is a text format, it means that escaping is required.
> And while you have to write type converters all the time for datetime, BLOBs etc.
Not having a proper data type for binary means that you will need to encode it using a different type, which negates the benefit of JSON anyway. So, I think JSON is not as helpful.
I think DER is better (you do not have to use all of the types; only the types you actually use need to be implemented, because DER's format makes it possible to skip anything you do not care about), and I made up TER, a text-based format which can be converted to DER (so, even though binary data is represented as text, it still represents the binary data type, rather than needing to use the wrong data type as JSON does).
> And you can complain or explain with JSON: "Comments not a feature?! WTF!" - Add a field with the key "comment"
But then it is a part of the data, which you might not want.
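The skip-anything property of DER mentioned above comes from its tag-length-value layout: every element carries an explicit length, so a reader can jump over types it does not implement. A toy reader of just the TLV framing (not a full DER parser):

```python
def read_tlv(der: bytes, pos: int = 0):
    """Read one DER tag-length-value; returns (tag, value, next_pos).

    To skip an element you do not care about, continue at next_pos
    without ever decoding value.
    """
    tag = der[pos]
    length = der[pos + 1]
    pos += 2
    if length & 0x80:                 # long form: low 7 bits = length-of-length
        n = length & 0x7F
        length = int.from_bytes(der[pos:pos + n], "big")
        pos += n
    return tag, der[pos:pos + length], pos + length

# INTEGER 65537 encodes as 02 03 01 00 01
assert read_tlv(bytes.fromhex("0203010001")) == (0x02, b"\x01\x00\x01", 5)
```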
elcritch
CBOR (and MsgPack) still embraces that simplicity. It provides the same types of key-value, lists, and basic values.
However, the types are more precise, allowing you to differentiate between int32s and int64s, or between strings and bytes.
Essentially you can replace JSON with it and gain performance and less ambiguity, with the same flexibility. You do need a step to print CBOR in human-readable form, but it has a standardized human-readable (diagnostic) notation, similar to a typed JSON.
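The encoding itself is simple: each item starts with one byte holding a 3-bit major type and 5 bits of length/value. A toy encoder (illustrative only: it handles just a few types with lengths/values under 24, nowhere near a conformant implementation):

```python
def cbor_encode(obj) -> bytes:
    """Toy CBOR encoder: major type in the top 3 bits, small length/value
    in the low 5 bits (restricted to values < 24 to keep the sketch short)."""
    if isinstance(obj, int) and 0 <= obj < 24:
        return bytes([0x00 | obj])                 # major type 0: unsigned int
    if isinstance(obj, bytes) and len(obj) < 24:
        return bytes([0x40 | len(obj)]) + obj      # major type 2: byte string
    if isinstance(obj, str) and len(obj.encode()) < 24:
        enc = obj.encode()
        return bytes([0x60 | len(enc)]) + enc      # major type 3: text string
    if isinstance(obj, list) and len(obj) < 24:
        return bytes([0x80 | len(obj)]) + b"".join(map(cbor_encode, obj))  # 4: array
    if isinstance(obj, dict) and len(obj) < 24:
        out = bytearray([0xA0 | len(obj)])         # major type 5: map
        for k, v in obj.items():
            out += cbor_encode(k) + cbor_encode(v)
        return bytes(out)
    raise ValueError("out of scope for this sketch")

print(cbor_encode({"a": 1}).hex())  # a1616101: one-pair map, text "a", int 1
```

Note that byte strings (major type 2) and text strings (major type 3) are distinct, which is exactly the strings-vs-bytes differentiation mentioned above.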
dang
Related. Others?
Begrudgingly Choosing CBOR over MessagePack - https://news.ycombinator.com/item?id=43229259 - March 2025 (78 comments)
MessagePack vs. CBOR (RFC7049) - https://news.ycombinator.com/item?id=23838565 - July 2020 (2 comments)
CBOR – Concise Binary Object Representation - https://news.ycombinator.com/item?id=20603378 - Aug 2019 (71 comments)
CBOR – Concise Binary Object Representation - https://news.ycombinator.com/item?id=10995726 - Jan 2016 (36 comments)
Libcbor – CBOR implementation for C and others - https://news.ycombinator.com/item?id=9597198 - May 2015 (5 comments)
CBOR – A new object encoding format - https://news.ycombinator.com/item?id=6932089 - Dec 2013 (9 comments)
RFC 7049 - Concise Binary Object Representation (CBOR) - https://news.ycombinator.com/item?id=6632576 - Oct 2013 (52 comments)
brookst
Odd that the XML and JSON sections show examples of the format, but CBOR doesn’t. I’m left with no idea what it looks like, other than “building on JSON’s key/value format”.
cbm-vic-20
There's an example in the "Putting it Together" section, showing JSON, a "human readable" representation of CBOR, and the hexadecimal bytes of CBOR.
https://cborbook.com/part_1/practical_introduction_to_cbor.h...
account-5
I'm assuming, since it's a binary encoding, the textual output would not be something you'd like to look at.
brookst
Why? I’m comfortable reading 0x48 0x65 0x78 0x61 0x64 0x65 0x63 0x69 0x6D 0x61 0x6C
8n4vidtmkvmk
With a table explaining what the byte codes mean? Absolutely I want to see that.
sam_lowry_
People look at TCP packets all the time.
account-5
In which format? As a list of 1s and 0s, or in hex? With TCP or IP, if I just pasted the textual version of any binary data I'd captured, without some form of conversion it's not good to look at. Especially if it's not accompanied by the encoding schema that lets you actually make sense of it.
makapuf
ASN.1, while complex, really seems to be a step up from those (even if older) in terms of terseness (as a binary encoding) and generality.
eadmund
Would you rather write a parser for this:
SEQUENCE {
SEQUENCE {
OBJECT IDENTIFIER '1 2 840 113549 1 1 1'
NULL
}
BIT STRING 0 unused bits, encapsulates {
SEQUENCE {
INTEGER
00 EB 11 E7 B4 46 2E 09 BB 3F 90 7E 25 98 BA 2F
C4 F5 41 92 5D AB BF D8 FF 0B 8E 74 C3 F1 5E 14
9E 7F B6 14 06 55 18 4D E4 2F 6D DB CD EA 14 2D
8B F8 3D E9 5E 07 78 1F 98 98 83 24 E2 94 DC DB
39 2F 82 89 01 45 07 8C 5C 03 79 BB 74 34 FF AC
04 AD 15 29 E4 C0 4C BD 98 AF F4 B7 6D 3F F1 87
2F B5 C6 D8 F8 46 47 55 ED F5 71 4E 7E 7A 2D BE
2E 75 49 F0 BB 12 B8 57 96 F9 3D D3 8A 8F FF 97
73
INTEGER 65537
}
}
}
or this: (public-key
(rsa
(e 65537)
(n
165071726774300746220448927123206364028774814791758998398858897954156302007761692873754545479643969345816518330759318956949640997453881810518810470402537189804357876129675511237354284731082047260695951082386841026898616038200651610616199959087780217655249147161066729973643243611871694748249209548180369151859)))
I know that I’d prefer the latter. Yes, we could debate whether the big integer should be a Base64-encoded binary integer or not, but regardless, writing a parser for the former is significantly more work.
And let’s not even get started with DER/BER/PEM and all that insanity. Just give me text!
flowerthoughts
The ASN.1 notation wasn't meant for parsing. Then people started writing parser generators for it, so they adapted. However, you're abusing a text format meant for human reading and pretending it's a serialization format.
BER and PER are binary formats, and great where binary formats are needed. You also have XER (XML) and JER (JSON) if you want text. You could create an s-expr encoding if you wanted.
Separate ASN.1 the data model, ASN.1 the abstract syntax notation (what you wrote), and ASN.1's encoding formats.
[1] https://www.itu.int/en/ITU-T/asn1/Pages/asn1_project.aspx
eadmund
> However, you're abusing a text format for human reading and pretending it's a serialization format.
They should be the same, in order to facilitate human debugging. And we were discussing ASN.1, not its serialisations. Frankly, I thought that it was fairer to compare the S-expression to ASN.1, because both are human-readable, rather than to an opaque blob like:
MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDrEee0Ri4Juz+QfiWYui/9UGSXau/2P8LjnTD8V4Unn+2FAZVGE3kL23bzeoULYv4PeleB3gfm
Sure, that blob is far more space-efficient, but it’s also completely opaque without tooling. Think how many XPKI errors over the years have been due to folks being unable to know at a glance what certificates and keys actually say.
zzo38computer
That is a text format, although DER is a binary format that encodes the data represented there as text. I think they should not have made a bit string (or octet string) encapsulate other ASN.1 data; it would be better to put it in directly, but nevertheless it can work. The actual data to be parsed will be binary, not a text format like that.
DER is a more restricted variant of BER, and I think DER is better than BER. PEM is the same DER data, but encoded as base64 rather than stored directly, with a header to indicate what type of data is being stored.
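The PEM-to-DER relationship is mechanical: strip the armor lines and base64-decode the rest. A sketch (the EXAMPLE label and payload are made up):

```python
import base64

def pem_to_der(pem: str) -> bytes:
    """Drop the -----BEGIN/END----- armor lines and base64-decode the body."""
    body = [line for line in pem.strip().splitlines()
            if not line.startswith("-----")]
    return base64.b64decode("".join(body))

pem = "-----BEGIN EXAMPLE-----\nAgEF\n-----END EXAMPLE-----"
assert pem_to_der(pem) == bytes.fromhex("020105")  # DER INTEGER 5
```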
jabl
Yes, but that comes from the telecom world. Hence thanks to NIH, that wheel must be reinvented.
nly
The FOSS tooling for it sucks balls. That's why
zzo38computer
Then, work to make a better one. (I had written a C library to read/write DER format, although it does not deal with the schema.)
JoelJacobson
Fun fact: CBOR is used within the WebAuthn (Passkey) protocol.
To do Passkey-verification server-side, I had to implement a pure-SQL/PLpgSQL CBOR parser, out of fear that a C-implementation could crash the PostgreSQL server: https://github.com/truthly/pg-cbor
teatro
That’s why I’m wondering if there is an actual CBOR encoder in the browsers? I mean, there must be one, or am I wrong?
esbranson
And .Net 5 circa 2020 added support for CBOR. ASP.NET ended up being a good choice for an experimental WebAuthn server for FedCM and DID experiments.
nabla9
CBOR is for when you need the option of very small code size. If you can always use compression, CBOR provides no significant data-size improvement over JSON.
On small code size it also beats BSON, EBML and the others.
surajrmal
Or compute. Compression isn't free, especially on power constrained devices. At scale power and compute also have real cost implications. Most data centers have long been using binary encoding formats such as protobuf to save on compute and network bandwidth. cbor is nice because it's self describing so you can still understand it without a schema, which is a nice property people like about json.
8n4vidtmkvmk
Doesn't capn proto win hands down on compute?
I haven't used it, but I thought that was the big claim.
kentonv
Not necessarily.
Cap'n Proto serialization can be a huge win in terms of compute if you are communicating using shared memory or reading huge mmaped files, especially if the reader only cares to read some random subset of the message but not the whole thing.
But in the common use case of sending messages over a network, Cap'n Proto probably isn't a huge difference. Pushing the message through a socket is still O(n), and the benefits of compression might outweigh the CPU cost. (Though at least with Cap'n Proto, you have the option to skip compression. Most formats have some amount of compression baked into the serialization itself.)
Note that benchmarks vary wildly depending on the use case and the type of data being sent, so it's not really possible to say "Well it's N% faster"... it really depends. Sometimes Protobuf wins! You have to test your use case. But most people don't have time to build their code both ways to compare.
I actually think Cap'n Proto's biggest wins are in the RPC system, not the serialization. But these wins are much harder to explain, because it's not about speed, but instead expressiveness. It's really hard to understand the benefits of using a more expressive language until you've really tried it.
(I'm the author of Cap'n Proto.)
Zardoz84
gzip, deflate, brotli ?
fjfaase
This is a link to just one section of a larger book. The next section compares CBOR with a number of other binary storage formats, such as protobuf.
aidenn0
I admit I got nerd-sniped here, but the table for floats[1] suggests that 10000.0 be represented as a float32. However, isn't it exactly representable as 0x70e2 in float16[2]? There are only 10 significant bits to the mantissa (including the implicit 1), while float16 has 11 so there's even an extra bit to spare.
1: https://cborbook.com/part_1/practical_introduction_to_cbor.h...
2: i.e. 1.220703125×2¹³
aidenn0
Looks like it's a typo; they state:
> 0x47c35000 encodes 10000.0
But by my math that encodes 100000.0 (note the extra zero).
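Both claims check out with Python's struct module, where format 'e' is IEEE 754 binary16 and 'f' is binary32:

```python
import struct

# 10000.0 fits exactly in float16, as 0x70e2
assert struct.pack(">e", 10000.0).hex() == "70e2"
assert struct.unpack(">e", bytes.fromhex("70e2"))[0] == 10000.0

# 0x47c35000 as float32 decodes to 100000.0, not 10000.0...
assert struct.unpack(">f", bytes.fromhex("47c35000"))[0] == 100000.0
# ...while 10000.0 as float32 is 0x461c4000
assert struct.pack(">f", 10000.0).hex() == "461c4000"
```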
gethly
I wish browsers would support CBOR natively so I could just return CBOR instead of JSON(++speed --size ==win) and not have to be concerned with decoding it or not being able to debug requests in dev console.
dylan604
JSON + compression (++speed --size ==win)
your server can do this natively for live data. your browser can decompress natively. and ++human-readable. if you're one of those that doesn't want the user to read the data, then maybe CBOR is attractive??? but why would you send data down the wire that you don't want the user to see? isn't the point of sending the data to the client so the client can display that data?
gethly
That is true. Basic content encoding works very well with JSON, but that still means there is a compression step, which would not be necessary with CBOR as it is already a binary payload. It would allow faster response and delivery times natively. Of course, we are talking a few ms, but I say: why leave those ms on the floor?
I guess I'm just shouting at the clouds :D
8n4vidtmkvmk
It's still not attractive to hide data from the user. Unless it's encrypted, the user can read it.
dylan604
i think i'm using a different meaning of "seeing". to the user, it won't be plain text that is human readable. unencrypted CBOR byte data might as well be encrypted to the end user.
glenjamin
The only mention I can see in this document of compression is
> Significantly smaller than JSON without complex compression
Although compression of JSON could be considered complex, it's also extremely simple in that it's widely used and usually performed in a distinct step - often transparently to a user. Gzip, and increasingly zstd are widely used.
I'd be interested to see a comparison between compressed JSON and CBOR, I'm quite surprised that this hasn't been included.
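The JSON half of that comparison is a few lines of stdlib; a sketch (the payload is invented, and real ratios depend heavily on the data and compressor):

```python
import gzip
import json

# Toy payload with repetitive keys -- the case where compression shines.
records = [{"id": i, "name": f"user{i}", "active": i % 2 == 0} for i in range(100)]
raw = json.dumps(records).encode()
packed = gzip.compress(raw)
print(len(raw), len(packed))  # the gzipped form is typically several times smaller
```

A fair comparison would also gzip the CBOR encoding of the same records, since compression and binary encoding are not mutually exclusive.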
dylan604
> I'm quite surprised that this hasn't been included.
Why? That goes against the narrative of promoting one over the other. Nissan doesn't advertise that a Toyota has something they don't. They just pretend it doesn't exist.
JimDabell
Previously:
CBOR – Concise Binary Object Representation - https://news.ycombinator.com/item?id=20603378 - Aug 2019 (71 comments)
Begrudgingly Choosing CBOR over MessagePack - https://news.ycombinator.com/item?id=43229259 - Mar 2025 (78 comments)
Feels like a CBOR ad to me. I agree that most techs are familiar with XML and JSON, but calling CBOR a "pivotal data format" is a stretch compared to Protobuf, Parquet, Avro, Cap'n Proto, and many others: https://en.m.wikipedia.org/wiki/Comparison_of_data-serializa...