The sum of all knowledge and the sorry state of the web


It has never been easier to share knowledge, and thus there has never been a greater time to be curious.

Encyclopaedia Britannica is still there, and it's great. But there's also an army of content creators there to teach you just about anything. If one source doesn't do it, you have more to choose from.

Yesterday, I read an article about North Africa that mentioned in passing Qaddafi's underground river. A few minutes later I was watching a documentary about it, then a variety of videos about that man. My information binge extended to other African dictators, and will probably last a few days longer.

If I'm curious about something, I can go really deep. I'm not constrained to a few paragraphs in whatever book my local library carries.

You could argue that the web is in a sorry state, but if that's the cost of giving everyone, everywhere access to all this knowledge, then it's a deal worth making. This might be more obvious to someone who did not have access to a well-stocked library.

The problem is not availability, but curation. The sum of all knowledge back then was a well-curated book. Now it's literally all of it, unfiltered.


I’m not sure you and the author are talking about the same thing. He mentions the fact you can’t even read the news from 10 years ago, the content has simply disappeared. No amount of YouTube videos can replace that.

The problem is not “everything, everywhere” or a lack of filters but the extreme commercialization of all content available, closed networks, the short life of URLs…


> No amount of YouTube videos can replace that.

The irony here is that Youtube videos from ten years ago are still alive and well, as Youtube makes a much better place for publishing and archiving content than the rest of the Web. With Youtube you don't have to worry about URLs changing or domain names expiring or anything like that. You just publish your video once, get a unique video-id and don't have to worry about anything else. Google's monopoly worked here in our favor for once and they have been reasonably good in not breaking old content (not perfectly, as video annotations got crippled pretty badly).

I think that's where the rest of the Web fell short. The Web has no concept of "publishing". There is no ISBN when you write a blog, no library where you could look up that ISBN. It's all just a file on a server or an entry in a database, that will get mangled and lost in the coming years. Worse yet, the article itself isn't even accessible from the Web: it's mixed together with a user interface, ads and other stuff, or spread across multiple pages. All this makes it quite tricky to keep old content readable and archived for the future.

This also leads to a weird situation that a lot of publications are still avoiding the Web after 20+ years. They publish as ePub or PDFs instead, which you can somewhat access from the Web, but really aren't well integrated into it. But it's by far the easiest way to ensure that a text document published today will remain readable a decade down the line.


Unlisted youtube videos from 10 years ago are all gone... Google made the decision to delete every video 'shared by URL' because of the possibility that the URL generation algorithm had leaked. It was legally less risky to delete all the content than to risk leaking all the content to the open web.

IMO, they made the wrong call - it would have been better for the internet as a whole to notify all users that "We have a new 'share by link' option, and no longer consider the old links private. Please update all old links and then click this button to disable the old URL. If you don't click the button, videos shared by URL may be discovered by others in the future."


Unfortunately when I look through my old Favourites playlist, which at some point reached the limit of 5000 videos, I can see how many of those videos are now private, deleted, or blocked in my country. The worst part is that in many cases I can't even recover the video title, so I have no idea what has been lost. A possible solution would be to store the titles separately, but I didn't think about this while I added stuff to my collection. I do agree though that the unique, unchanging URL is a huge boon, when I look at the situation in my browser's bookmarks for comparison.
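A minimal sketch of the "store the titles separately" idea, assuming you can already get the playlist as a list of (video ID, title) pairs by some export route that is not shown here. The function names, file name and IDs are made up for illustration.

```python
import json
from pathlib import Path


def save_snapshot(entries, path):
    """Merge (video_id, title) pairs into a JSON file without ever dropping old IDs."""
    path = Path(path)
    known = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    for video_id, title in entries:
        if title:  # a blank title means the video is already gone; keep the old one
            known[video_id] = title
    path.write_text(json.dumps(known, ensure_ascii=False, indent=2), encoding="utf-8")
    return known


def lost_titles(known, current_ids):
    """Return the recorded titles of videos that no longer appear in the playlist."""
    current = set(current_ids)
    return {vid: title for vid, title in known.items() if vid not in current}
```

Run `save_snapshot` every so often while the videos are still up. When something later vanishes, `lost_titles` at least tells you what it was called.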


There's plenty of videos on youtube that have been removed. Every few years when I look through my list of liked videos, a couple more are gone forever. Granted, this is likely by the creator themselves, but that doesn't matter when the purpose is archival.


> The irony here is that Youtube videos from ten years ago are still alive and well.

They're not, though.

Because Youtube re-encodes videos every couple of years, with new "better" lossy compression algorithms. And each time the videos get successively worse.

Watching a 2008 Youtube video will not only look grainy because it's 360p, but it'll look actively *worse* than it did in 2008 because of all the lossy compression that was applied to it over the decades.


Well, I once had a comment exchange with Sean Young (of Blade Runner fame) on YouTube, but that is gone now. So much for its archival qualities.


>The irony here is that Youtube videos from ten years ago are still alive and well.

Roughly 1/3 of my YouTube bookmarks are dead and most of them are much more recent than 10 years. They purge videos at an alarming rate.


> The irony here is that Youtube videos from ten years ago are still alive and well.

Tons of them aren't. I've run into many linked from wikipedia footnotes that no longer exist, particularly digitized film from the 20th century. A ton of wikipedia pages still cite old films, documentary footage, linking to videos which were uploaded by Jeff Quitney, who was banned from youtube a few years ago because some of the old films he uploaded contained material that ran counter to modern values (I think the one that eventually got him banned was an old christian film warning children about homosexual predators.) When they banned him, they took down a ton of completely innocuous videos because a tiny minority were offensive.


> There is no ISBN when you write a blog, no library where you could look up that ISBN.

An ISBN is a string of characters, much like a URI/URL, and offers no more "protection" for long-term access than the latter. Books get mangled and lost too; their only benefit is that it is harder to mangle and lose them, and there is likely to be more than one of them.


You absolutely can read news from 10, even 20, even 100 years ago.

One example:

I think you forget or may not have grown up with the microfiche.

The reality of the internet is that everyone has a voice and things will only be archived if someone gives a damn to archive them. And that's fine. Some information deserves to be transient. Hell, we've survived millennia without this level of information storage. Does every single YouTube video, Reddit post, and Flickr photo really need to live forever? No. Would it be nice? Sure.


That's fine for stuff that was printed in newspapers, but you (like the topmost commenter) seem to be replying to something different from what the article is talking about.

Paper is great, because it doesn't just evaporate when you look away. Although it does degrade, it's a slow process, slow enough that you can notice that it's happening and think to yourself, "Gee, I maybe ought to do something about this." It's not the same for transient digital media. There are bona fide news items and other digital content that are no longer accessible because they were digital-first, but the business incentives were so misaligned and/or their legacy has been so mismanaged that, perversely, it's easier to access the content of a 50-year-old news article than that of others that are 5–15 years old. People can always trawl through their parents' and grandparents' belongings and come across the only known surviving copy of something and donate it to a library or sell the collection in a yard sale or on eBay. (Whether they know it's the only one known to exist or not isn't a precondition.) That's not just less likely with the Web, it's drastically less likely. No one's gleaning much from the unevicted entries in someone's browser cache.




This is very interesting. Do you know how these are licensed?


Depends where you read. Also, how do you read the news of 50 or 100 years ago? Isn't that more difficult?

I use UK news sites. BBC, Guardian and Telegraph all have their old articles online.


One of the functions of public libraries is to archive the news. One could browse weekly papers going back decades, either as hard copies or microfilm. There's a good chance major city libraries still do that, or have digital scans.

For example:


BBC used to be an extensive and complicated site with non-news articles, study guides and curated collections. All of that was destroyed with a revamp. You can still read the news articles, but that's all that's left.


Not really. There are vast archives of newspaper articles accessible through a web interface. You just need library access. The free, open web is mostly just a spam ocean, but if you make an effort to access the services that catalog useful information, it’s still very useful. The Google web is shit, though.


Yes, old paper newspapers, which are very thin now, and vanishing.

For web based newspapers, good luck trusting longevity there, just because the library has a free interface to an api...


> You just need library access.

Assuming you live in a place where public libraries are well-funded and can provide such amenities.


> you can’t even read the news from 10 years ago, the content has simply disappeared

Isn't most of it on the wayback machine?


if you are researching something you need 100%, what I wrote 10 years ago with access to 80% has now turned into bullshit with < 20% available.


From August 30, 1856. Goes back even further but point made. Needs a payment to read but it's there.


The "All content available" statement is a bit much. I'd go so far as to say there are a good portion of sites that have commercialized information. But I can easily access other forms of information not commercialized by avoiding mainstream views.




It's really the curation that needs to be taught to everyone. A big dose of critical thinking skills is what we all need, because in earlier times you could tell the crappy ideas by the way they were packaged: crazy guy at Hyde Park Corner, dude with a megaphone shouting out passages of the bible, crappy home-made flyers. You could work backwards from "nutter is probably wrong" to why he was wrong. Part of why this worked was because it cost something to publish stuff, and so publishers would have a think about what they wanted to spend their resources on.

Nowadays everyone has figured out how to package the message, and it's super cheap to do so and get it out there. (Incidentally, actual packaging is the same now, crappy products used to come in crappy packages, but no more.)

So now to pick apart an argument you have to be a bit more aware of the actual content, and it's a bit harder to get to the bottom of BS.


Curation is the first step, but we also need more organizations who take on the job of keeping some bits of the internet from going dark. Especially for things that were originally done for the common good.

There's a local group here that basically specializes in devops for little public projects that have run their course. They even do a little bit of work trying to provide templates for new projects (eg, for a local hackathon), but I'd like to see them go farther.

Bigger national or international groups that publish recipes for projects, saying "If you build your project on this structure, then we will be more likely to run it for you," are, I think, a reasonable next step for the Internet. Given the task of running 10 projects that 'need' 4 servers each, it would be very good if I could do it all with 20 servers, not 44 (40 + orchestration machines).

We don't have a "PBS for the internet" but then PBS didn't always exist either. You need a beach head of some sort to even propose such a thing to government.


It may never have been easier to share knowledge but it also hasn't ever been easier to share misinformation.

It also hasn't gotten easier to find information. Search results are worse now than 10 years ago because of Search Engine "Optimization". Google search needs uBlocklist[1] with a steadily growing list of manually curated domains just to poorly approximate usability.

Additionally, sites load slower every day. Why does every partial refresh of a site require 2 seconds, even though I'm on a decent computer? On phones sites are practically unusable.

The benefits you're touting for the web could be accomplished with Web 1.0 level of tech or even simpler protocols such as Gopher[2] or Gemini[3]. Everything else is an overall decrease in accessibility, usability, user experience, and the ease of finding or sharing knowledge.





I find that you used to find “silos” of information when searching - if you were looking for information on toilets say, you might end up on a love site dedicated to plumbing, which would have a whole cornucopia of information and knowledgeable people.

Now you’re less likely to find that kind of thing and more likely to find a video or an SEO-optimized site - which can be much more difficult to parse for verifiability.



> It has never been easier to share knowledge, and thus there has never been a greater time to be curious.

It's also never been easier to share disinformation, and pollute the vast sea of actual knowledge that exists on the internet.

Sources of information are silo'd into proprietary closed off gardens run by large corporations who only serve their shareholders. Searching for information has been so corrupted by advertisers and the sheer amount of misleading content that finding reliable sources often feels like searching for a needle in a haystack.

The one exception is Wikipedia, though it also struggles with keeping factual information, and has its own set of issues.


Add in the rising tide of ML generated content that is able to get itself well placed in search results, diluting information with real value. It's really irritating to start reading something that starts out seeming legit and then you get in a few sentences or paragraphs before it becomes incoherent nonsense.


Wikipedia can be shown to be biased or unbalanced about political issues. Since Wikipedia views newspapers as reliable sources, its articles are often skewed by whatever the conventions of the day are.


I second the call that WP is web done right. Yes, of course there's bias; it's not possible to produce bias-free content, and WP's particularly bad in the fields of politics and history, and really any field where facts aren't settled and feelings are strong.

Enter critical thinking. If you dig just a little (e.g. read the talk pages and the edit histories), you can soon learn that some topic has been taken over by POV-pushers and is unreliable. Anything to do with Israel/Palestine/West Bank is unreliable; the boss is a zionist, and so are a lot of the senior staff, so it's not surprising. But WP is a million times better than the web that search engines expose.

Incidentally, I usually search using DDG. But DDG seems to hate Wikipedia; WP results usually don't show up on DDG until page 2. Google surfaces WP results on page 1, if not at the top of the resultset.


> Sources of information are silo'd into proprietary closed off gardens run by large corporations who only serve their shareholders. Searching for information has been so corrupted by advertisers and the sheer amount of misleading content that finding reliable sources often feels like searching for a needle in a haystack.

I'm not particularly worried about disinformation or misinformation as long as good information is out there to improve the signal to noise ratio. But you've definitely hit a nail on the head here, good information has value so there are a lot of incentives to wall it off and charge for it. This is the real danger.


> good information has value so there are a lot of incentives to wall it off and charge for it.

There's even an incentive there to pollute free sources of good information in order to devalue them relative to proprietary knowledgebases.


Curation won't solve the issues of the Knowledge Gap Hypothesis or the Information Deficit Model or the Digital Divide or Info Asymmetry.

If we test people on where they went "deep" there is a good chance most will fail the test.

We don't even know what we are trying to do with these networks.


Let's face it - early libraries probably sucked too. We will improve.


IMO the worst side effect of current web of knowledge is what I'd call the illusion of knowledge. When it was more difficult to access and publish information, that imposed a much higher bar on what was being consumed. These days, people watch a 10-minute YouTube video or read a reddit comment or twitter thread and believe (perhaps unconsciously) that makes them knowledgeable in said topic. They will then, in an absolutely confident tone, display their expertise by answering questions and stating their opinion as if it was a fact. More people read this, and the cycle begins.

You see it all the time on HN and other forums. If you're an actual expert in a specific (usually scientific) subfield and you read comments about an article in that field, you find that a large percentage are not just factually wrong, but also written in an extremely confident tone by people who have probably studied the topic for about 10 minutes.

By having easy access to all this information people have stopped being humble about what they don't know.


People also used to be confidently wrong all the time before the internet was ubiquitous, except then it was hard to quickly verify they were.


True, but a difference is that they could not spread their confidently wrong opinions globally, only locally, and that opinions were tied to their identity.

Take my mom for example. She's 80 years old and doesn't use the internet that much. She is confidently wrong about a lot of things she sees on TV or hears on the radio. A recent example is COVID misinformation.

The difference is that my mom can't easily influence millions of others because she doesn't have the reach, but also because people are unlikely to take the word of an 80-year old person without any medical credentials or training seriously. It's much easier to look "legitimate" when you are hiding behind an online persona. If my mom wrote a blog or posted on HN/reddit, she could certainly come off as a doctor, or even lie about being one, and many would believe her. Doing this locally, in person, is much harder and riskier.


> True, but a difference is that they could not spread their confidently wrong opinions globally, only locally, and that opinions were tied to their identity.

> If my mom wrote a blog or posted on HN/reddit, she could certainly come off as a doctor, or even lie about being one, and many would believe her.

I don't think your mom (likeable as she no doubt is) could get an audience of millions just by posting her opinion on HN/Reddit/Social media.

I think the situation is pretty much back to what it was: most of the population have a limited reach of influence. Some people have a much greater reach.

The difference is that the people with greater reach used to be trained journalists who held to a code of conduct and were given that reach by institutions. Now there is no such code of conduct, and the assignment of audience reach is more random, and totally uncontrolled by any institution.


Pre-internet, people could not spread confidently wrong opinions globally, but connected and well-connected people could. Take a look at the wave after wave of popular and misleading non-fiction books printed last century, or the "scientific" food pyramid, or the testimony of nurse Nayirah. On and on it goes, as it has, only now, anyone can play.


> True, but a difference is that they could not spread their confidently wrong opinions globally, only locally, and that opinions were tied to their identity.

The content producer-to-audience ratio was just different; a dumb line from a journalist would definitely have an outsized reach that went well beyond what the writer might have expected.

As you note, even today people with a global reach aren’t that many: I could be shouting a lot of things on top of my soapbox, I’d probably not be actually reaching more than a few dozen people, and we have enough content for the effect of a single person to be vastly diluted.


Many true things about COVID were labelled "misinformation" at one point or another.

The publishing gatekeepers of a few decades ago projected an image of confidence, but I'm not sure they were actually any more accurate than random youtube videos. The media establishment of today certainly doesn't seem to be.


Given you are such an expert on viral disease, doesn't it make more sense for you to author a book, movie, ... to explain to your mum why she's wrong?

PS: I hope you're not one of these Wikipedia editors censoring physicists on Wikipedia because you "know" better, or perhaps censoring virologists on Wikipedia for the same reason.

PS 2: Some doctors were wrong about COVID. Some governments were very wrong. COVID lockdowns hurt the world economy worse than the GFC.


This is true.

Long before the web, when I was a child, my father would chide me for my bold claims unsupported by facts with the phrase "Confident; but wrong."


It's good to remind myself of this. More than once I thought I knew something after watching a summary video. When I tried to explain the topic to somebody else, I struggled. If you can’t explain you know nothing. Simple as that.


> If you can’t explain you know nothing.

Strongly agree. And I'll raise you: if you can't explain in simple, plain language that a 12-year-old could understand, then you are not enlightened.

I concede that some subjects are intrinsically complex; e.g. the cosmological history of the Universe. But a large part of the reason that topic is complex is because it's not settled; we haven't got to the bottom of it, so there are loads of unanswered questions. How can you explain cosmological inflation in language a 12-year-old could understand? Well, I can't explain it to myself, so I sure as hell can't explain it to a 12-year-old.


>If you can’t explain you know nothing. Simple as that.

This is still a pretty good bar, but if I just repeat what the video said, it seems like I'm explaining it, and then I can give the impression I do know something. I shouldn't get credibility for just repeating a video.

The US has devolved into somewhat of a reputation-driven expertise market, and there are plenty of ways to gain reputation without the expertise. There's still plenty of real experts, but I don't blame them if they want to focus on their work instead of fighting the endless tide of easily produced misinformation.


I agree. There's a healthy factor from entry cost. There's also a healthy inertia in asking more than short-term investment from your mind.

Map is not territory and I say this after believing it far too long.

It's a big fallacy behind the information highway roots of internet.

And that's half of it.


I also believe it is a trend not to go over the head of consumers. Can be seen in older documentaries and political discussion. Perhaps this is just a result of everyone already feeling like an expert who should not be insulted by complex language etc.


I wonder if this is just an inevitable outcome of the majority of society being online. The beginnings of the internet are rooted in academia and hobbyists. Early adopters were experts in their respective fields with analytical minds. Now that everyone is online, perhaps the average user just better reflects the average member of society. In other words, maybe it's not the content of the internet, it's the users.


What is the most bizarre thing to me about the online zeitgeist is how so many people will allow comments of anonymous or pseudonymous strangers on places like Reddit or Twitter to shape their world view. Including journalists. An extreme or inaccurate view may start on social media and be normalised through repetition on social media and subsequent validation by mainstream journalists.

In the case of Reddit in particular what is it that gets people to trust anonymous strangers? It is bizarre and seems like a mind virus. If an anonymous stranger tells you what you want to hear then you are apt to ingest it uncritically.

For example on social media there is a notion that nuclear brinksmanship with Russia over Crimea is acceptable.


> What is the most bizarre thing to me about the online zeitgeist is how so many people will allow comments of anonymous or pseudonymous strangers on places like Reddit or Twitter to shape their world view. Including journalists.

I totally second this, and have a recent concrete example. At the beginning of this year, Sweden's (probably) most serious newspaper (what I'd consider a 'journal of record' whose articles should be a point of historical reference) published a long-form retrospective article comparing Sweden's handling of Covid with the way other countries had handled the epidemic. It included several internet 'myths' that had been bandied around on social media, citing them as facts, and even included one 'interview' with what purported to be an eyewitness of one event, which turned out to be taken from a Facebook post.

I wrote and complained to the responsible editor with citations showing how and where the article was wrong, and a few very grudging emendations were made (effectively saying that even though the reports were still probably true, they couldn't be 'verified').

Totally horrified me that, in wanting something to fit their facts, journalists simply accepted fiction they read on Facebook and regurgitated it in their articles.

I used to hold them in greater regard than that.


> I wrote and complained to the responsible editor with citations showing how and where the article was wrong, and a few very grudging emendations were made (effectively saying that even though the reports were still probably true, they couldn't be 'verified').

That's interesting, because it possibly led to a sort of well-meaning destruction. Was the content surgically edited, or did it grow a warning that some of the info originally published was incorrect, or (more perniciously) both? Consider the effect this has on researchers trying to do studies on misinformation: they certainly want to be able to access the originals themselves.


> what is it that gets people to trust anonymous strangers? It is bizarre and seems like a mind virus. If an anonymous stranger tells you what you want to hear then you are apt to ingest it uncritically.

I guess it has something to do with lack of trust in Established Sources of Information: TV, newspapers, experts and other various figures of authority. The biases of the Established Sources have become more apparent over the years, to the point where scepticism, doubt, and even defensive cynicism are fairly common default attitudes when dealing with the information they provide.

The trust in anonymous strangers is, at least in part, the result of them not being Established Sources. It's not a lying politician, or a deceitful news anchor or journalist, it's just another well-meaning regular person like you. That alone makes you more receptive to their message. If their message happens to align with your existing beliefs, even better. Of course, it can get into cultish/conspiracy territory if any of those beliefs are directly opposed to the mainstream narrative.


Swap out HN for Reddit and your comment still stands. This place isn’t really all that different.


Lack of solid real world social networks. Religion used to be a solid defense but people are now less religious.


I'd say it's gone the other way too. Expert opinion echo chambers are rampant on the internet, preventing paradigm shifts in science.


Isn't "expert opinion" a fallacy?

If the expert is right it's hardly an "opinion"; it's more a fact. But we still talk in terms of "expert opinion" because there are other experts with, sometimes, diametrically opposite views. Do we get to hear all these experts? Not a chance.


The problem is our egos get in the way of real facts. Experts are humans and they are just as vulnerable to being wrong as the rest of us. It's great that there are people who dedicate their lives to a discipline, but sometimes experts use their positions of being right to distort larger truths, especially in areas where we don't have all the facts but think we do. And when careers are on the line, experts can band together to protect themselves. I'd rather find my own way to the truth than rely on an expert who may or may not have an interest in protecting themselves from career-breaking situations.

Expert positions can create a false sense of security that we know all there is to know about a subject, and that is just as damaging to society as not knowing the truth.


>When it was more difficult to access and publish information, that imposed a much higher bar on what was being consumed.

And yet fringe theories, from quack medicine to religions, weren’t any less widespread than now. There were just fewer of them.

EDIT: Now I think about it, they were more prevalent than now. Look at homophobia - it got global, affected everyone, even non-Christian cultures like China, for centuries.


> religion ... homophobia

Do you really consider that knowledge?

> homophobia - it got global, affected everyone, even non-Christian cultures like China

Do you think that aversion to homosexuality starts and ends with Christianity?


Homophobia generally spread throughout the world with Christianity, carried by colonialism. Again, China is a good example. And fixing homophobia in western societies strongly corresponds to decreasing importance of religion.

It’s not limited to homophobia of course - pretty much every single Catholic claim about human sexuality is antiscientific bull - but I think homophobia is a good enough example of a harmful, false belief that got more popular than anything post-internet.


Blogs are still an underrated goldmine of knowledge, especially in tech. I find academic papers often too abstract or opaque, textbooks are good but generalized, and documentation is reference-like. Stumbling across a tech blog where someone explains some fairly specific and difficult problem they had, and an interesting solution they found, can be exactly what you needed to solve a problem.

The web has its problems for sure, but don't forget there are still gems out there which the web has made possible. I'd love to see a comeback of blogging culture, but I guess a lot of that has been sucked in to social networks now.


A great resource to find new blogs is the Thinking About Things newsletter. Been getting it for a while and it's a great way to find new blogs to read.


Blogs are wonderful and still there - they’re just much harder to find because there is so much OTHER content now. The web used to be almost nothing but blogs.

For example, this blog is pertinent to what I’m doing and I didn’t find it for weeks: and only found it by a link to a YouTube video from another one from another one.




I definitely worry about the future of scholarship as our print media becomes more and more fragmented and hidden. It's not necessarily about ensuring essential knowledge is carried forward, but rather a "sense of the past". In researching history, the further back you go the more fragmented and unreliable your sources usually become. So it becomes harder and harder to figure out the broad sweep of these past cultures, what life was like, the things they believed, and so on. And I think we're doing that to our present historical moment by deleting the recent past and creating a continuous present.

We have the Internet Archive, we have Wikipedia page histories, but everything else is ephemeral. If ever goes the way of the Library of Alexandria, we'll have lost irreplaceable knowledge of the web itself, and the cultures that existed on it. It will live on only in living memory, but this is also transient, and soon nobody will live who remembers what this time period was like. Wikipedia will not reflect it, you'd have to dig through page histories to find fragments, like historians and archaeologists sifting through ancient manuscripts and ruins for clues.

A potential dark age is forming, and avoiding it right now hinges entirely on the continued efforts of two donation-funded organizations, one of which makes it increasingly harder to view the past, the other facing legal disputes that could see it shut down. I think we need to make archiving the present important, so it does not become a mysterious, inscrutable past for our descendants.


while I have the same general concerns, both wikipedia and have offline backup and running options for free. and, they are surprisingly small. all one has to do is write a simple script to auto download these backups daily/weekly/whatever and you can access all that info off your solar powered raspberry pi.

while I think there could be a short period where this info is largely unavailable, I don't believe it will be lost forever. if the Internet goes away, once some new Internet like technology comes around to replace it, these data repos will likely get put back up very quickly. just a matter of how long that blip is
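A rough sketch of the kind of backup script described above, in Python. The dump URL, the weekly interval, and the function names are assumptions for illustration; check the current dump index before relying on any of them. It only re-downloads when the local copy is missing or older than the chosen age, and it writes to a temporary name first so a failed download does not clobber the previous backup.

```python
import time
import urllib.request
from pathlib import Path
from typing import Optional

# Assumed dump location (illustrative); verify against the dump index you use.
DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
MAX_AGE_SECONDS = 7 * 24 * 3600  # assumed refresh interval: weekly


def is_stale(path: Path, max_age: float = MAX_AGE_SECONDS,
             now: Optional[float] = None) -> bool:
    """Return True if the file is missing or older than max_age seconds."""
    if not path.exists():
        return True
    current = time.time() if now is None else now
    return current - path.stat().st_mtime > max_age


def refresh(url: str, dest: Path, max_age: float = MAX_AGE_SECONDS,
            fetch=urllib.request.urlretrieve) -> bool:
    """Download url to dest if the local copy is stale.

    Returns True if a download happened, False if the copy was fresh enough.
    """
    if not is_stale(dest, max_age):
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".part")
    fetch(url, str(tmp))  # download under a temporary name first
    tmp.replace(dest)     # then swap it in over the old backup
    return True
```

Calling `refresh(DUMP_URL, Path("backups/enwiki-latest-pages-articles.xml.bz2"))` from cron or a systemd timer would give the daily/weekly behaviour; the `fetch` parameter exists only so the download step can be swapped out or tested offline.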


Technologists and those in an adjacent professional class (which should account for a lot here on HN) should also do their part to help make sure that the present is easily archivable.


>Ever tried to look up some news from 12 years ago?

I have a better one for you. Ever wondered why it's so hard? Why web protocols have nothing related to archiving? Why web browsers are a hellscape for aggregating information over time in a meaningful way? Why this continues to be true, despite countless Microsoft and Google engineers writing all these heartfelt posts about knowledge?

If your answer is "because it's hard to implement" then you understand nothing.


From my admittedly limited understanding, the failure of the semantic web is one of mankind's biggest missed opportunities. Now the knowledge graph is just locked behind Google's neural network layers and only being used for ads.


Maybe something like a 'WikiInfo' (or another better name :) ), that contains a hierarchy of (potentially all) known pages and topics? I think the only way to tackle this problem is collaboratively and distributedly.

You could add for example a 'Newspapers' topic, and then say 'The Springfield Times' and then have 'Articles by date', 'Articles by topic', etc. like a huge database (browsed hierarchically like "WHERE dates BETWEEN '20121211' and '20121213'", etc.). The primary data structure could be a database, and users can add hierarchies as queries to the underlying database -- a collaborative index (in the literal sense, like a Homepage of the internet) is shown. Any unique 'object' (like a specific newspaper) gets a UUID and a row in the db. I don't know how modern dbs handle sparse data, but that'd definitely be a requirement (i.e. each object can have a handful of millions of possible properties, like publication date, location, author, colour, etc.).
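To make the idea a bit more concrete, here is a minimal sketch using Python's built-in sqlite3. One common way to handle sparse data is an entity-attribute-value layout: one row per object, plus one row per property the object actually has. The table names, property names, and example.org URLs below are all made up for illustration, and this is just one possible layout, not a claim about how such an index should be built.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE objects (
    id   TEXT PRIMARY KEY,   -- UUID for every unique object
    kind TEXT NOT NULL,      -- e.g. 'newspaper', 'article'
    url  TEXT                -- link to the actual web page
);
-- Sparse properties: an object only gets rows for properties it has.
CREATE TABLE properties (
    object_id TEXT NOT NULL REFERENCES objects(id),
    name      TEXT NOT NULL,
    value     TEXT NOT NULL,
    PRIMARY KEY (object_id, name)
);
CREATE INDEX idx_properties_name_value ON properties(name, value);
""")


def add_object(kind, url=None, **props):
    """Insert an object with a fresh UUID and whatever properties it has."""
    object_id = str(uuid.uuid4())
    conn.execute("INSERT INTO objects VALUES (?, ?, ?)", (object_id, kind, url))
    conn.executemany(
        "INSERT INTO properties VALUES (?, ?, ?)",
        [(object_id, name, str(value)) for name, value in props.items()],
    )
    return object_id


def articles_between(newspaper, start, end):
    """'Articles by date' as a query: dates are YYYYMMDD strings, which sort correctly as text."""
    rows = conn.execute(
        """
        SELECT o.url
        FROM objects o
        JOIN properties paper ON paper.object_id = o.id AND paper.name = 'newspaper'
        JOIN properties d     ON d.object_id = o.id AND d.name = 'date'
        WHERE o.kind = 'article'
          AND paper.value = ?
          AND d.value BETWEEN ? AND ?
        ORDER BY d.value
        """,
        (newspaper, start, end),
    )
    return [row[0] for row in rows]


# Hypothetical sample data.
add_object("newspaper", name="The Springfield Times")
add_object("article", "https://example.org/a", newspaper="The Springfield Times", date="20121210")
add_object("article", "https://example.org/b", newspaper="The Springfield Times", date="20121212", author="J. Doe")
add_object("article", "https://example.org/c", newspaper="The Springfield Times", date="20121213")
add_object("article", "https://example.org/d", newspaper="Another Paper", date="20121212")

print(articles_between("The Springfield Times", "20121211", "20121213"))
```

A user-contributed hierarchy would then be a stored query like `articles_between`, with the index page listing those queries by topic.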


I've looked online and there's WikiData[1], which I didn't know and looks very nice. Although it seems to be more of a plain database, not concerned with Indexing. It also doesn't seem to contain objects such as all newspaper articles (without the text body of course), I wonder if that would be accepted data. Maybe we could build upon WikiData as a backend and present a hierarchical index.

As a humble suggestion, I'd divide all information into: (1) News (all news articles), (2) Publications (books, blogs, magazines, etc.), (3) Ideas and things (countries, planets, people, theories). Any object can belong to multiple categories/classes. I think it's not important that categories be perfectly devised, only that they contain all objects, and objects can be found reasonably well within them. The main point of objects would of course be a link to an actual web page that contains what you're looking for.

Please, someone do it! (I have way too many projects right now)



EdgeDB is half of what you describe on the database front. You can add nested queries with filters and calculated query types. Every item in the database is given a unique UUID and there's support for complex custom types and constraints, including calculated constraints I believe.

You could have an Article type with a link to a Person as the author, and many more types of Works besides. You could find any Work by a Person traversing backlinks to find any object linked to that person. Any work where they contributed. Any social media link they posted.

Then query that by a duration of time starting from a specific date. A specific place, if it has it, a group of sites and so on.

If my understanding of it from my time playing with it is correct. I haven't experimented with too broad a dataset yet.
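To illustrate the shape of that idea without claiming anything about EdgeDB's actual API, here is a small plain-Python sketch. The `Person`, `Work`, and `Index` names and the sample data are hypothetical. The index keeps a backlink from each person to everything they authored, and the query filters by a time window starting at a given date and, optionally, by a group of sites.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Person:
    name: str


@dataclass(frozen=True)
class Work:
    kind: str        # 'article', 'post', ...
    title: str
    author: Person   # forward link from the work to its author
    published: date
    site: str


class Index:
    def __init__(self):
        # Backlinks: for each person, every work that links to them as author.
        self._by_author = defaultdict(list)

    def add(self, work):
        self._by_author[work.author].append(work)

    def works_by(self, person, start=None, days=None, sites=None):
        """Follow backlinks from a person, optionally filtered by time window and sites."""
        end = None
        if start is not None and days is not None:
            end = start + timedelta(days=days)
        results = []
        for work in self._by_author.get(person, []):
            if start is not None and work.published < start:
                continue
            if end is not None and work.published >= end:
                continue
            if sites is not None and work.site not in sites:
                continue
            results.append(work)
        return sorted(results, key=lambda w: w.published)


# Hypothetical sample data.
ada = Person("Ada Example")
bob = Person("Bob Example")
index = Index()
index.add(Work("article", "On Archives", ada, date(2012, 12, 12), "example.org"))
index.add(Work("post", "Short note", ada, date(2012, 12, 20), "social.example"))
index.add(Work("article", "Later piece", ada, date(2013, 3, 1), "example.org"))
index.add(Work("article", "Unrelated", bob, date(2012, 12, 12), "example.org"))

recent = index.works_by(ada, start=date(2012, 12, 11), days=14)
print([w.title for w in recent])
```

In a real graph-style database the backlink lookup and the filters would be part of the query language rather than hand-written loops; the sketch only shows what is being asked for.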


The idea behind semantic web was inspiring and great, however it required considerable work on the part of people creating stuff for the web and that was never going to happen. Maybe it could have happened in some things like academia based or knowledge based websites, but on the larger scale it was doomed.


(Warning: Personal plug incoming)

I fully agree, especially when it comes to the "semantic" part of the semantic web. Reusing and publishing ontologies that define those semantics always seemed like an afterthought of the semantic web, when it should be part of the foundation that things on the semantic web are built on.

In most other parts that make up a website (JS and HTML) we figured out how to make reuse (mostly) work by replacing flimsy web references with package management. Ontologies never had something like that, and thus were stuck in an early 00s era of software/ontology development.

Where I work, we are building Plow, a package manager for ontologies, as part of our tech stack to improve that situation and allow people to build applications with large-scale stable semantics at the core.

As part of building Plow we are aiming to make the process of creating and sharing ontologies easier and with that also lowering the barrier of entry to that domain.


Your father's books are real treasures. Were it my library, they would have a place of honor.

I want to put your post in the context of yesterday's HN front-page post lambasting the EU for trying to build a better search engine. The bulk of the comments suggested that a government effort could never be as good as a commercial effort.

Your post is a strong counter-argument. All of the points you mention on the massive decline in quality of web content are due to the web being driven by commercial efforts.

Likewise, comments on this current HN front-page post "Despite faster broadband every year, web pages don't load any faster" also seem to explain the poor state of the web as being due to commercialization of everything (even the comments about the need for cookie banners-- the logic behind the cookie consent is to regulate the commercial collection of user data).

Google started as a DARPA project, and was a great engine while it stayed true to that ethos. It was the need to commercialize it, thus setting perverse business incentives, which has destroyed it.

Your post praises libraries. Which are seldom commercial ventures.

The economics are simple. I don't know why this even continues to be a debate on HN.

- A socially created entity (corporation, government, "charitable" organization, ...) needs money to function.

- Money comes from capturing a portion of the value created by the organization.

- The most efficient organizational structure depends on the relationship between value creation and value capture.

Thus: if the product/service generates immediate and focused value, the value capture can be directly linked to the product and a business is optimal. Think: a hamburger.

If the product/service generates long-term and diffuse value, the value capture also needs to be diffused, i.e. taxes. Thus a government. Think: the road network which allows the raw materials and the customers to get to the hamburger store.

I leave the case for charitable orgs as an exercise for the reader :)

disclaimer: strongly pro-business, have founded 2 personally, assisted several others.


>Google started as a DARPA project, and was a great engine while it stayed true to that ethos. It was the need to commercialize it, thus setting perverse business incentives, which has destroyed it.

Google started as a what now? This is an interesting thing to say during a discussion of access to information. I'd be happy to read your explanation of Google's "DARPA roots", and your citations to sources explaining how that came about and how they were "destroyed" when they "strayed" from DARPA... that should be a fascinating read.


It is indeed a fascinating read. Here you go!

>A second grant—the DARPA-NSF grant most closely associated with Google’s origin—was part of a coordinated effort to build a massive digital library using the internet as its backbone. Both grants funded research by two graduate students who were making rapid advances in web-page ranking, as well as tracking (and making sense of) user queries: future Google cofounders Sergey Brin and Larry Page.

>The research by Brin and Page under these grants became the heart of Google: people using search functions to find precisely what they wanted inside a very large data set.


I always wondered how their initial python web scraper was fast enough to index the entire internet on that old hardware, even given the small size of the internet at the time. I guess the answer is that they had a local backup at Stanford. Thanks for sharing!


You also asked 'how they were "destroyed"'

I refer you to some interesting discussion here:

Google Search is Dying 1561 points

Every Google result now looks like an ad 972 comments

Google no longer producing high quality search results in significant categories 1275 comments


Uh, DARPA had nothing to do with the founding of Google. Brin and Page had NSF funding through Brin's graduate fellowship and the Digital Library Initiative, and that was before the official founding of the company.



You may (or may not) be right about DARPA, but you assert they were government funded.

And my main point is that government (funding) is the economically optimal approach for services which produce diffuse value.

See also the comment from marginella_nu in the post I linked to. Marginella is building a fantastic alternative search engine. Their view: "Arguably the biggest most unsolved problem in search is how to make a profit"

i.e., capturing the value produced.

kagi attempts to do this with a paid tier. I hope it works for them, great product and really responsive team.

Bing tries to capture value by collecting the Microsoft tax; not exactly government-level, but on those lines.


I don't get how the author can value all the things in the article and yet work at Microsoft on bloatware like Edge, which is impossible to remove, is pushed hard by Windows against other browsers, and is itself used as part of a giant ads engine.


I don't see the unique selling point of Edge. It's just a re-skin of Chromium with Microsoft tracking added. It's largely a data grab by M$.


Part of why the web is shit is the low barrier to entry. More often than not, classics and older books are information dense. Every sentence was constructed with some economy baked in.

Now everyone and their mom, as well as bots and marketers, can spam right on over.

We either need a search portal with aligned incentives, or perhaps a new internet with none of this crap.


Check out Gopher/ Gemini protocol


Just going to leave this gem of a video[0] here. Whilst I agree the web has been walled gardened into various silos like social media, and people think Facebook and Twitter are the Internet, it's still read/write. The blogosphere is still ticking along nicely and last time I checked it's thriving.

Yes, people have gamified Google and search to get traffic and the blogosphere of old has largely been co-opted by profiteering gluttons, but there's still hope. Surf Hackernews enough and you'll find little gem posts that don't have an ulterior motive and are not 'monetizing' their content and sprinkling it with affiliate links and ADs. They just want to vent, exchange techne, and share knowledge.

Then there's Wikipedia which has remained AD-free for as long as I can remember (apart from their donation banners which I don't mind). Wikipedia is the coolest thing ever and my IQ has probably gone up a few notches over the years because of it. It is the closest thing to getting home-schooled without going through formal education, and you can verify all the claims made in its entries by going to the footer section and reading various citations usually written by esteemed scholars.

The web is in a sorry state due to the commercialization and walled garden silos, and also because of the proliferation of smartphones, which are mere consumption devices IMHO and not designed for producing any meaningful or substantial content, apart from maybe uploading photos/videos to Instagram or writing tweets, etc. I can't write a blogpost on a phone because I have bad dexterity, and I typically have to have 100+ tabs open to verify claims, provide sources, do cross-referencing, find relevant images, etc., all something done best on a workstation PC or laptop.

Some context: I have professionally blogged for more than five years but due to reasons I won't go into here, I have stopped. I'm thinking of jumping back in, only this time armed with the wisdom of my previous blogging shenanigans. Failure is an opportunity to start again more intelligently!



Good point about smartphones being hooked into the web; trying to use them to create something well thought out is really difficult. And as far as the web goes, it's really up to us how that plays out. So yeah, there will always be people who choose the quick way around the net using mobile, but I think that is a byproduct of living out of touch with the world anyway.


i think to change this, you have to start with a basic, fundamental respect for the liberal arts, the humanities, and for human beings in general. and that means you have to pay people in money for the emotional and mental labor of organizing information. and stop trying to replace them with algorithms to maximize clicks and watchtime.


> and I now work on the browser that comes out of the box with any Windows machine (working on a Mac most of the time).

Is this why Microsoft products are getting progressively worse every year? How can you work on a product and then never use it natively and expect other people to enjoy using it? Or worse, like with Windows 11, it leads to the product morphing into something the users never wanted, because developers want to conform it to what they're used to using. I don't know, it kind of baffles me the way most developers view the products they make.


If we all had the Memex that Vannevar Bush proposed[1,1*], many of the losses we all discuss in these threads may have been avoided. We now have massive local storage, and should be able to freely share data by hosting our own stuff on our own machines. We could have done what the editor of the magazine implored us to do:

"As Director of the Office of Scientific Research and Development, Dr. Vannevar Bush has coordinated the activities of some six thousand leading American scientists in the application of science to warfare. In this significant article he holds up an incentive for scientists when the fighting has ceased. He urges that men of science should then turn to the massive task of making more accessible our bewildering store of knowledge. For years inventions have extended man's physical powers rather than the powers of his mind. Trip hammers that multiply the fists, microscopes that sharpen the eye, and engines of destruction and detection are new results, but not the end results, of modern science. Now, says Dr. Bush, instruments are at hand which, if properly developed, will give man access to and command over the inherited knowledge of the ages. The perfection of these pacific instruments should be the first objective of our scientists as they emerge from their war work. Like Emerson's famous address of 1837 on "The American Scholar," this paper by Dr. Bush calls for a new relationship between thinking man and the sum of our knowledge. — THE EDITOR"

We don't have that, and it makes me sad. We should fix this inadequacy, but there are now so many interests in the entrenched model that I think they would squash something that freely allows copying like a bug.



[edits - quoted editor's introduction, revised/extended]


Is it ironic that I can't read the article you posted because it's behind a paywall?


I was lazy... thanks for prodding me to do better.