Brian Lovin / Hacker News / Daily Digest email


jandrewrogers

This article has a large gap in the story: it ignores sensor data sources, which are both the highest-velocity and highest-volume data models by multiple orders of magnitude. They have become ubiquitous across diverse, medium-sized industrial enterprises, and the data intensity has turned those enterprises into some of the largest customers of cloud providers. Organizations routinely spend $100M/year to deal with this data, and the workloads are literally growing exponentially. Almost no one provides tooling and platforms that address it. (This is not idle speculation; I’ve run just about every platform you can name through lab tests in anger. They are uniformly inadequate for these data models, and everyone relies on bespoke platforms designed by specialists if they can afford the tariff.)

If you add real-time sensor data sources to the mix, the rest of the architecture model kind of falls apart. Requirements upstream have cascading effects on architecture downstream. The deficiencies are both technical and economic.

First, you need a single ordinary server (like EC2) to be able to ingest, transform, and store about 10M events per second continuously, while making that data fully online for basic queries. You can’t afford the latency overhead and systems cost of these being separate systems. You need this efficiency because the raw source may be 1B events per second; even at that rate, you’ll need a fantastic cluster architecture. Most of the open source platforms tap out at 100k events per second per server for these kinds of mixed workloads and no one can afford to run 20k+ servers because the software architecture is throughput limited (never mind the cluster management aspects at that scale).
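
The arithmetic behind these throughput claims can be sketched directly. A minimal back-of-envelope, using the round figures from the paragraph above (illustrative numbers, not benchmarks):

```python
# Back-of-envelope cluster sizing for the rates discussed above.
# All numbers are illustrative round figures, not benchmarks.

def servers_needed(source_rate, per_server_rate):
    """Minimum servers needed to absorb source_rate events/s."""
    return -(-source_rate // per_server_rate)  # ceiling division

SOURCE = 1_000_000_000  # a 1B events/s raw source

# At 10M events/s per ordinary server:
print(servers_needed(SOURCE, 10_000_000))  # 100

# At the ~100k events/s where many open source stacks tap out:
base = servers_needed(SOURCE, 100_000)
print(base)            # 10000
print(2 * base)        # with 2x replication: the "20k+ servers" regime
```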

Second, storage cost and data motion are the primary culprits that make these data models uneconomical. Open source tends to be profligate in these dimensions, and when you routinely operate on endless petabytes of data, it makes the entire enterprise problematic. To be fair, this is not to blame open source platforms per se, they were never designed for workloads where storage and latency costs were critical for viability. It can be done, but it was never a priority and you would design the software very differently if it was.

I will make a prediction. When software that can address sensor data models becomes a platform instead of bespoke, it will eat the lunch of a lot of adjacent data platforms that aren’t targeted at sensor data for a simple reason: the extreme operational efficiency of data infrastructure required to handle sensor data models applies just as much to any other data model, there simply hasn’t been an existential economic incentive to build it for those other data models. I've seen this happen several times; someone pays for bespoke sensor data infrastructure and realizes they can adapt it to run their large-scale web analytics (or whatever) many times faster and at a fraction of the infrastructure cost, even though it wasn't designed for it. And it works.

dima_vm

> 10M events per second

Disclaimer: I work at VictoriaMetrics open source.

VictoriaMetrics' ingest rate is around 300k events per second PER CORE. So theoretically you should be fine with just a single n1-standard-32 or *.8xlarge node. Though I would recommend the cluster version for reliability, of course, and to scale storage/ingestion/querying independently.

Here are the benchmarks with charts: https://medium.com/@valyala/measuring-vertical-scalability-f...

defen

I don't doubt you, but it's surprising that there's really that much value/inefficiency lying around that many "medium-sized" industrial enterprises can justify spending tens or hundreds of millions of dollars a year just to collect an insane amount of sensor data (and presumably take some action based on that). How big is medium-sized, and what kinds of industries?

jandrewrogers

There is a long and fascinating discussion to be had about how the economics of many industrial sectors are evolving. The short version is that hardware differentiation is no longer economically viable, margin is going to zero, and pivoting to sensor analytics and exploitation is broadly seen as the primary means of generating margin going forward. I've had this same conversation across several industrial sectors. Anything tangentially related to transportation (automotive, logistics, telematics, aviation, and all related supply chains) is a good example.

As to what these companies want to do with sensor data, it is often considerably more interesting than what people imagine. Many of the applications have an operational real-time or low-latency tempo. (I can't be too specific here.)

For my purposes, I put "medium-sized" on the order of $1B annual revenue. As to why a company would literally spend 10+% of its revenue on sensor data infrastructure, it is difficult to overstate the extent to which getting this right is viewed as near- to medium-term existential for these companies. The CFO has run the models and this is their best chance at survival.

Here is the interesting thing: to the extent they've been able to put this sensor data infrastructure in place, it has been successful at generating margin. If they could bend the infrastructure cost curve down a bit, most would spend even more on it. I've seen the financial models at several companies; there is a tremendous amount of money to be made in this transformation.

Aeolun

I just fail to see how sensor data gives you an edge or (ultimately) a higher margin.

Is this about optimizing things to a precision that is impossible when humans try to decide whether something is too much or too little of something?

Siira

Your comment has no examples of how these margins will be generated.

bsder

Medium-sized can be in the $10 million per year range.

Let's take plastic injection molding since it's such a good example of a really broken industry (there are a small number of excessively competent injection molders and a vast legion of incompetent ones).

You're shooting a part every couple of seconds (or faster), and that injection molding machine has lots of knobs to dial in. Temperature of incoming plastic pellets, water content of incoming plastic pellets, dye feed rate, plastic feed rate, mixing chamber temperature, feed screw motor load, initial injection pressure, plateau injection pressure, release injection pressure, actual pressure inside the mold, time spent cooling--I can go on and on and on.

Most injection molding problems generally get solved one way: increase injection time. It's fairly straightforward to adjust, isn't likely to make things go wrong, and the people on the line don't get paid to experiment. They've got 100K parts to shoot in 72 hours, and an hour lost is a thousand or so parts they're going to get yelled at for. Better to dial the time up 10% and take 79 hours rather than experimenting for 7 hours and not shooting or waste a bunch of plastic.

Of course, if this is your only hammer, you can see where this is going. Every single time something goes wrong, that mold gets another 10% added to its cycle time. And it never goes the other way without "A Pronouncement From God, Himself(tm)". Eventually, your entire business is running at 50% productivity because all the molds are shooting so slow, and you think you need to build another factory when what you need to do is fix your molding times.

Now, back to sensors--the problem is that nobody with incentive has a way to identify AT THE TIME IT CAN BE DEBUGGED that "something is going wrong". Someone on the line dialing up the injection time should cause an immediate dump of ALL the data on that machine (probably a week or more) up to an engineer who can go through it looking for anomalies. Even better would be for the machine to flag to an engineer any "likely anomalies" (an increase in incoming plastic pellet water content should get flagged, for example) so that they can be corrected before they affect the injection process and cause failures/wastage.

This is, of course, all predicated on logging an enormous amount of data and being able to run an analysis against it in almost real-time.

Handling this data is non-trivial.
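
As a minimal sketch of that "flag likely anomalies to an engineer" idea: a rolling z-score over a single hypothetical pellet-moisture channel. Real systems would run something like this per channel, per machine, at far higher rates; the window and threshold here are illustrative assumptions.

```python
from collections import deque
from statistics import mean, stdev

def moisture_alerts(readings, window=20, threshold=3.0):
    """Yield (index, value) for readings deviating more than
    `threshold` standard deviations from the trailing window."""
    history = deque(maxlen=window)
    for i, x in enumerate(readings):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                yield i, x
        history.append(x)

# Stable baseline around 2.0% moisture, then a sudden jump.
stream = [2.0 + 0.01 * (i % 3) for i in range(40)] + [2.5]
print(list(moisture_alerts(stream)))  # [(40, 2.5)] -- flags the jump
```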

wpietri

Thanks for this. A long-ago summer job was running those machines to make things like mirror rims and taillights. I still have a couple of the fantastic blobs produced when changing over from one job to another.

One of the things that struck me then is how much the line workers were treated like furniture, when many of them were quite sharp. They took such pride in getting things done well and at speed, in continually improving. I really wish I could put that kind of data in the hands of a couple of the people who trained me. Just an app on their phones. Spending 40+ hours/week on a machine means you really get to know it. I'd love to see how many of them would get great first-pass analysis and remediation.

mrloba

I don't understand why such a system needs to look at every data point. Are the failures so rare that you can't get away with sampling?

tfigment

I'm at a small energy company, and we log 50k events/s across several million points (temperature, pressure, voltage, current, power, ...). Most of what was said is true for us, but we don't pay millions per year to our vendor. It's not cheap, though, and will easily cost millions over the years. Horizontal scaling is not what these databases do, as mentioned (thinking of Wonderware, IP.21, Honeywell, PI, ...). I have some hope for AWS Timestream in the cloud, but I still think the price will be high, and they only ingest near-live data, so nothing older than what fits in memory. Most of the open source options, like InfluxDB, TimescaleDB, and Prometheus, lack features I expect, but they are getting closer.

deepsun

Check out VictoriaMetrics; with a 50k/s ingest rate you can just use a single one-core machine (300k/s per core).

sradman

> it ignores sensor data sources, which are both the highest velocity and highest volume data models by multiple orders of magnitude.

This has long been the main marketing message used to promote Complex Event Processing (CEP) [1] systems. There is no shortage of enterprise and Open Source solutions for this space; what is missing is strong demand/adoption which in itself undermines the next-big-thing claim.

One can argue that sensor data is included in the ETL category.

[1] https://en.m.wikipedia.org/wiki/Complex_event_processing

jandrewrogers

For many sensor data models, CEP is a core element of the data flow but the constraint/query data model is much larger and more dynamic than is typically supported in classic off-the-shelf CEP systems.

This isn't necessarily an issue; complex constraint matching is typically a fundamental part of the ingest path anyway, given the algorithms used. Making it support more generalized CEP is a fairly straightforward extension of the same computer science mechanics that make polygon search scale efficiently.

ianeliot

Interesting. This may be a naive question — this is very far from my area of expertise — but is there a reason sensor data can't be sampled? It seems gratuitous to store that many events.

jandrewrogers

You don't know what you need until you need it. The signal you need to dig out of the data often isn't known until some other event provides the context. Also, for some industries and some applications, there are regulatory reasons you retain the data. In some cases these are sampled data feeds, even at the extreme data rates seen, because the available raw feed would break everything (starting with the upstream network).

In virtually all real systems, data is aged off after some number of months, either truncated or moved to cold storage. Most applications are about analyzing recent history. Everyone says they want to store the data online forever but then they calculate how much it will cost to keep exabytes of data online and financial reality sets in. Several tens of petabytes is a more typical data model given current platform capabilities. Expensive but manageable.
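
The financial reality is easy to reproduce with rough numbers. The event size and storage price below are illustrative assumptions, not quotes:

```python
# Why "keep everything online forever" collides with financial reality.
# Event size and storage price are illustrative assumptions, not quotes.

EVENTS_PER_SEC = 10_000_000    # one ingest server, per the earlier comment
BYTES_PER_EVENT = 100          # assumed average encoded event size
SECONDS_PER_MONTH = 30 * 24 * 3600

monthly_bytes = EVENTS_PER_SEC * BYTES_PER_EVENT * SECONDS_PER_MONTH
monthly_pb = monthly_bytes / 1e15
print(f"{monthly_pb:.2f} PB of new data per month")  # 2.59 PB

# At an assumed ~$20 per TB-month of warm storage, and with volume
# accumulating, the bill for month N covers N months of retained data:
per_retained_month = monthly_bytes / 1e12 * 20
print(f"${per_retained_month:,.0f} per retained month")  # $51,840
```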

may4m

Interestingly, I worked on a project as a data scientist with a client in high-precision manufacturing. Their signals (sensors) and actuators were stored in a historian that couldn't handle samples finer than 100ms, even though the data was collected at a 10ms rate. One of the problems required us to look at a process that took just 85ms. The historian was showing signals down to 20ms; it took a while to realise that it was extrapolating when you tried getting finer resolution. The company had been using this historian for more than 20 years, and they had to commission another project to change it. So you're right, you don't know what you need until you need it.

fatbird

Sometimes tens of scalars per second is the sampled data. It depends upon your requirements for accuracy and responsiveness for alarms, threshold checks, etc. I work with paper making machines that only give us a profile every 30 seconds--but that profile is a thousand floats, and we need to be constantly resampling it both spatially and temporally, and we're doing that for tens or hundreds of profiles for a single system--and we're supposed to handle hundreds of systems.

The more fundamental point that the GP is making is that the realm of industrial sensor data scales in ways that people haven't really grasped yet. It's much less about brute storage than it is about the interplay between bandwidth, storage, and concurrent processing power.
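
The spatial resampling mentioned above can be sketched with plain linear interpolation. A production system would vectorize this, but the shape of the operation is the same:

```python
# Resample a cross-direction profile (e.g. 1,000 floats every 30s)
# onto a coarser or finer grid via linear interpolation.

def resample(profile, n_out):
    """Linearly interpolate `profile` (a list of floats) to n_out points."""
    n_in = len(profile)
    out = []
    for j in range(n_out):
        # position of output sample j on the input index axis
        pos = j * (n_in - 1) / (n_out - 1)
        i = min(int(pos), n_in - 2)
        frac = pos - i
        out.append(profile[i] * (1 - frac) + profile[i + 1] * frac)
    return out

profile = [float(i) for i in range(1000)]  # a dummy 1,000-float profile
coarse = resample(profile, 100)
print(len(coarse), coarse[0], coarse[-1])  # 100 0.0 999.0
```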

bsder

The problem is that you are generally looking for "Something's different" rather than "Smooth ALL The Points".

So, the problem is that you threw away 90% of your data, and that's where the problem was. Oops. Now you have to switch on "Save all the data" and hope it repeats. So, given that you have to have a "Save all the data" switch anyhow, you might as well turn it on from the start.

In addition, changepoint analysis is an entire field of research in and of itself.

Look at how many articles there are about analyzing "Did something break in my web service or am I really doing 10% more real traffic?"
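
For a flavor of that "did something break or is traffic really up?" analysis, here is a toy one-sided CUSUM changepoint detector. The slack and threshold values are illustrative, not tuned:

```python
# CUSUM: accumulate deviations above an expected mean and flag when the
# cumulative drift clears a threshold, separating noise from real shifts.

def cusum(samples, target, slack=0.5, limit=5.0):
    """Return the index where sustained upward drift from `target`
    is detected, or None if no changepoint is found."""
    s = 0.0
    for i, x in enumerate(samples):
        s = max(0.0, s + (x - target - slack))
        if s > limit:
            return i
    return None

baseline = [100.0, 101.0, 99.0, 100.0] * 5  # ~100 req/s with noise
shifted = baseline + [112.0] * 10           # sustained ~12% jump
print(cusum(shifted, target=100.0))         # 20 (the first shifted sample)
```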

undefined

[deleted]

acadien

Depends on the application. Often down-sampled data is useful for drawing trends but not so useful for better understanding failure events.

deepsun

Server monitoring data (mostly counters) is usually saved at 10-to-15-second intervals. It's rarely queried at full resolution; it's almost always sampled, yes.
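
A minimal sketch of that kind of downsampling, collapsing 10-second counter samples into 1-minute averages:

```python
# Downsample by bucket-averaging: the usual trade-off when full
# resolution is rarely queried.

def downsample(samples, factor):
    """Average consecutive groups of `factor` samples (drops a ragged tail)."""
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, len(samples) - factor + 1, factor)]

ten_second = [float(i % 6) for i in range(60)]  # 10 minutes of 10s samples
minute = downsample(ten_second, 6)              # 6 samples -> 1 per minute
print(minute)  # ten values, each 2.5
```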

moab

Thanks for this great comment. What kinds of workloads are people trying to run on sensor data that arrives at such a high velocity? Time series analysis? Anomaly detection? I wish I had a better idea of the specific problems the users you've run into are trying to solve that fail on the existing software stack.

stingraycharles

Not OP, but I work for QuasarDB and we deal with a lot of customers in this sector.

It’s typically a mix of everything, but predictive maintenance, anomaly detection and failure analysis are the most common. For example, there is one process that does trend analysis and tries to “predict” acceptable boundaries of a certain sensor’s measurements, and this is then compared in real-time with the actual sensor readings. If things fail for some reason, a technical engineer will dive into the data with dashboards (think: Grafana), zoom in, compare the readings with other sensors, etc.

The sheer volume of the data makes it fairly painful. Downsampling does happen, but only after a few weeks. This means that you still need enough storage capacity to deal with the full stream of data in real-time.
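
A minimal sketch of that "predict acceptable boundaries, compare in real time" loop: fit a least-squares trend to recent readings, project the next expected value, and flag readings outside a tolerance band. The window and tolerance are illustrative assumptions.

```python
def fit_trend(ys):
    """Least-squares slope and intercept over equally spaced samples."""
    n = len(ys)
    mx, my = (n - 1) / 2, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(ys))
    den = sum((x - mx) ** 2 for x in range(n))
    slope = num / den
    return slope, my - slope * mx

def out_of_bounds(history, reading, tol=5.0):
    """Compare a live reading against the trend projected one step ahead."""
    slope, intercept = fit_trend(history)
    expected = intercept + slope * len(history)
    return abs(reading - expected) > tol

history = [50.0 + 0.1 * i for i in range(100)]  # slow upward drift
print(out_of_bounds(history, 60.1))  # False: matches the trend
print(out_of_bounds(history, 80.0))  # True: sensor off its predicted path
```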

jandrewrogers

The data models for any non-trivial sensor analysis are intrinsically spatiotemporal -- every measurement or event happens at a place and time. Spatial relationships are central to the proposition of analytically reconstructing the dynamics of the physical world from disparate entities and sensors. The objective is to sample enough pixels and their relationships to sketch an accurate picture of reality as it happens. For example, a car is trying to understand its relationship to every relevant static and dynamic entity in its environment that affects its ability to operate safely. There is no business that does not benefit from having a model of reality that converges on ground truth in real-time, if you can take advantage of it.

Most of the analysis that is done usually falls under one of two categories. First, inferring (you can rarely measure it directly) when something has changed in the real world that is relevant to your business so that you can adapt to it immediately -- this applies to everything from autonomous driving to agricultural supply chains. Second, detecting anomalies -- the unknown unknowns -- so that risks can be managed when the real world appears not to conform to the models upon which you base decisions. A third category is support of industrial automation, which benefits immensely from high-resolution multimodal sensor data models, though this is largely a cost reduction measure. These categories are hand-wavy, but in practice boring industrial companies have concrete metrics they are trying to achieve or risks they are trying to manage in the most efficient way possible.

maximilianburke

That's one of the big challenges we've been running into at UrbanLogiq. We've built bespoke storage and processing pipelines for this data because the existing options in this space didn't fit our needs and would have bankrupted our company while we tried to sort it out.

Having "cost" on the board as a factor we were actively trying to optimize for during design pulled us in a direction that is quite foreign compared to off-the-shelf solutions.

That last paragraph rings true -- one of our big challenges specifically was ingesting and indexing data that needs to be queried across multiple dimensions, things like aircraft or drone position telemetry. But once we found a workable solution for that, it specializes to simpler workloads very well.

StreamBright

>> Almost no one provides tooling and platforms that address it

I think this is because such companies are not (yet?) too common. There are tools and systems you can use, especially from high-frequency trading, which has somewhat similar challenges. KDB+ and co. would be my first stop to check if there is something usable. The question is the financial structure and scaling of the problem, which determines whether these tools are in play. There are other interesting projects in the space:

- https://github.com/real-logic/aeron

- https://lmax-exchange.github.io/disruptor/

Of course these are not exactly what you need; long-term storage and querying (like KDB) is largely unsolved.

The other tools you might be referring to by "most of the open source platforms" are indeed not capable of doing this. I spent the last 10 years optimizing such platforms, and they are not even remotely close to what you need; you (or anybody who thinks they could be optimized) are wasting your time.

ransom1538

"You can’t afford the latency overhead and systems cost of these being separate systems. You need this efficiency because the raw source may be 1B events per second;"

We do this. We have a load balancer in front of a fleet of nginx machines inserting into BigQuery. Inserts scale well, and the large queries work since it is columnar. The issue is price: it's terribly expensive.

ethanwillis

While this is an article about data infrastructure I feel like we're missing the forest for the trees.

What is most important here, in my opinion, is that the underlying data is useful. If your underlying data wasn't collected, wasn't collected properly, or, even worse, the wrong data was collected, then setting up data infrastructure will be a boondoggle that will cause your organization to be data hostile.

Just as much effort, if not more, needs to go into collecting the right data in the right way to fill your data infrastructure. Most of the projects I've seen or heard of are just people taking the same old data that Ted in accounting, Jill in BI, etc. are already pretty proficient at using. So the gains you get by moving that into a modern infrastructure are marginal. How many more questions can you really ask of the same data that people have decades of experience with and an intuitive sense for?

teej

The biggest shift has been towards data lake (store everything) away from data cubes (store aggregates). This makes it orders of magnitude easier to diagnose, debug, and assert the correctness of data.

So these trends aren’t in a vacuum, they directly support the issues you discuss.

> Most of the projects I've seen or heard of are just people taking the same old data ...

I don’t disagree with you here. But in my experience it’s about getting Frank in marketing to use the same numbers as everyone else.

When you have 5 different ads platforms that all take revenue credit for a single conversion and have conflicting attribution models, and none of them add up to what accounting says is in the bank account. That’s a hairy problem.

There are different flavors of that class of problem at lots of companies.

altdatathrow

> The biggest shift has been towards data lake (store everything) away from data cubes (store aggregates).

I don't think this is any shift. The "store everything" has always existed in my experience, that's how the aggregates were built in the first place. The aggregates were for speed and convenience, and you drill-down as necessary, including to the individual record level.

Maybe the shift is people thinking that it's cheaper to just analyze the entire corpus on-demand because we can throw a spark cluster at it?

hestefisk

I agree: the data warehouse was what the data lake is today. The data cube is the aggregation of data in the warehouse, which you can then drill down and roll up. The difference between warehouse and lake is the emphasis on correctness (one canonical data model) and deduplication of data. When warehouses were invented, storage was expensive, so one tried to normalise everything into a star schema with as little duplication as possible; with the emergence of cheap storage, this is less important, and we can spend less time developing fancy ETL processes to make everything fit into one conformant data model.

dima_vm

You miss the critical difference -- nowadays people don't store aggregates, they just scan sharded data very fast. That simplifies a lot of things, because you don't need to keep two databases in sync (raw and aggregated).

gfodor

I was working in data analytics + data science a decade ago, and we stored everything, not aggregates, and pushed it through Hadoop. I have been "out of the game" since then. What has changed that makes people treat "store everything" as a new phenomenon? (Genuine question, because I am clearly missing something.)

teej

It’s not a new phenomenon so much as it has emerged as an important shift from the status quo 20 years ago.

What’s changed in the last 10 years are the access patterns. There’s increased demand to have arbitrary query access over the raw data. The most impactful technology changes have been about pushing the access layer (queries, stream & batch processing, dashboards, BI tools, etc) down as close to the raw data as possible and making that performant. What’s fallen out of that are better MPP OLAP databases (snowflake), new columnar formats (parquet), SQL as the transform layer (dbt).

goBackwards00

The problem of data confusion you describe is resolved by replacing management. It's not an engineering issue that requires new technology (consider the author's source of social power: selling technology).

That’s an engineering issue that needs new engineering management who don’t enable wasting company resources making incompatible APIs in the first place.

We already did the monolithic DB design, I used to name those hosts “ocean”. And we already know the math. “Data lake” is just more jargon by a salesman to obfuscate peddling the same old abstraction, and wow fresh grads with new words for hyping the same old habits.

While not the author of this piece, Bezos is quoted as pointing out how circular social behavior is.

What do you think the odds of this author being on a similar page?

Have humans evolved much in 100 years? Or does the con simply get rewritten for the next generation to hide a simple truth?

What’s keeping people going in this circle isn’t logistical necessity. It’s us.

huy

I think you have a point, but there are more nuances than that.

There are typically 2 types of data to collect: Transactional data and behavioural data.

Most transactional data, due to its important nature, is already generated and captured by the production applications. Since the logic is coded by application engineers, it's usually hard to get this data wrong. This data is then ETL-ed (or EL-ed) over to a DW, as described by the article.

Behavioural data is where your statement most applies. This is where tools like Snowplow, Posthog, Segment, etc. come in to set up proper event data collection. This is also where it's important to "collect data properly", as these kinds of event data change structure fast and are hard to keep track of over time. I'd admit this space (data collection management) is still nascent, with only tools like iterative.ly on the market.

NightMKoder

I completely agree - there are only so many ways to slice the data. The caveat is that the type of data matters quite a bit for the data architecture. There's another thread that mentions sensor data as a source of complexity, since the data has a theoretical delay between events (i.e. period) of 0 - something few systems are built to handle, even if you sample approximations at some fixed frequency. Algorithmic trading is a similar domain that still has a huge bar for entry - a sign that _this isn't easy_.

The fidelity of the data is of course important, but I would claim it's not a blocker. Yes, you need to trust the data you collect. That's table stakes - if you can't collect data correctly at all, even without worrying about the past, you're in for a world of hurt. It's P0. That said, a lot of people assume you also need to do this historically - and that's not the case - at least for ML.

Reinforcement learning has been making great strides in recent years. If you're in this situation - you have a flow where you want to use a model without having any past data to train with - use something like VW's contextual bandits [1]. You don't need historical data to build your model, just real-time decision point & reward signals. Once deployed, the model converges over time to the optimal model using real-time feedback.

All that said - baby steps are important. If you're in this situation, start by getting fidelity and then expand scope slowly without sacrificing fidelity. It's a lot easier to backfill than to "fix" data - get that right and it gets easier from there. You'll need fixups regardless - mistakes happen and requirements change - but you have to start with something you trust, at least in the moment it's deployed.

[1] https://vowpalwabbit.org/tutorials/contextual_bandits.html
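
For the flavor of that decision/reward loop, here is a toy epsilon-greedy contextual bandit in plain Python. This is not VW's algorithm or API, just the shape of learning from real-time feedback with no historical data; the contexts, actions, and reward rates are hypothetical.

```python
import random

class EpsilonGreedyBandit:
    """Toy contextual bandit: per-(context, action) running reward means."""
    def __init__(self, actions, epsilon=0.1):
        self.actions = actions
        self.epsilon = epsilon
        self.value = {}   # (context, action) -> estimated mean reward
        self.count = {}   # (context, action) -> observation count

    def choose(self, context):
        if random.random() < self.epsilon:
            return random.choice(self.actions)  # explore
        # exploit: best estimated action for this context
        return max(self.actions,
                   key=lambda a: self.value.get((context, a), 0.0))

    def learn(self, context, action, reward):
        key = (context, action)
        n = self.count.get(key, 0) + 1
        v = self.value.get(key, 0.0)
        self.count[key] = n
        self.value[key] = v + (reward - v) / n  # incremental mean

random.seed(0)
bandit = EpsilonGreedyBandit(actions=["banner_a", "banner_b"])
for _ in range(2000):
    ctx = random.choice(["mobile", "desktop"])
    action = bandit.choose(ctx)
    # hypothetical environment: banner_b converts better on mobile,
    # banner_a better on desktop, at a 30% rate
    good = (ctx == "mobile") == (action == "banner_b")
    reward = 1.0 if good and random.random() < 0.3 else 0.0
    bandit.learn(ctx, action, reward)

print(bandit.choose("mobile"))  # usually "banner_b"; 10% explore chance remains
```

The model starts with no history and converges toward the better action per context purely from live reward signals, which is the property the comment above is pointing at.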

smt1

Well, if you remember, say, for example, hypothetically speaking arbitrary bits of entropy if you were to imagine it just as an information theory space generalization thing we commonly understand as a bit. But due to lost knowledge for example in communicantions theory say like broadband/wideband spectrum singnal broadcasting and general optimization with bounded linear optimization. Can know enough about say, reversible computing and many paradoxes for example about lost knowledge about codecs and such, still don't let the "decades of experience" exchange . Remember, according in modern understand of a general physical flip-flops synthetic clocks in "more special" general relativity and we can infuse modern ways of collaboration to reduce regularity, we can get better at sensing same patterns in cognitive processes to change the clock rate of terms of service over with power saving mega-log discrete hz. If anyone wants to collobrate further let me know at sashmit@gmail.com. We likely have some shared understanding, but not completely, but have some "free" knowledge about y-combinatoric that can excite digital logarithms but still have to not error-code all singularities away. I can understand it clearly because of the "ancient wisdom" of the polymorphic dual gods of say dharma (allow everything) or the transjuctions of the buddha (allow 1 thing). Most of whatever paradox can be solved via l-r modules to be placed to be postmarked for the future, but to allow that I just will meditate and focus on my own well being. :)

an_opabinia

Is there any evidence that the vast amounts of clicks and user interactions companies have been collecting are worth anything at all?

Let’s say I deleted every time series whose Y axis isn’t measuring US dollars in every tech company’s database everywhere. Maybe for all those time series you just store the most recent value. Describe to me what would be lost.

You’re onto something but you’re not going far enough! Most, if not all, historic metadata, analytics and behavioral data collection - when it is not measuring literal dollar amounts - is completely worthless.

mgraczyk

This is completely wrong, and nobody who works with data at any scale could possibly believe anything like this.

We literally run long term A/B tests with thousands of variations of what you're describing. The purpose of these tests is to measure the effect of losing some data. The tests show (to nobody's surprise) that each piece of data is useful. These tests tell us exactly how useful each piece of data is.

Honestly when I read comments like this I have to wonder, do you really believe that thousands of companies spend trillions of dollars a year for something that doesn't work? Maybe talk to somebody who works on this stuff a bit?

abernard1

> Honestly when I read comments like this I have to wonder, do you really believe that thousands of companies spend trillions of dollars a year for something that doesn't work?

Joking/not-joking. Have you ever been to the Bay Area?

Yes. Emphatically yes it is the case companies spend trillions of dollars unnecessarily.

We've seen this with people who didn't know how to build microservices and farcical "LMNOP" [1] type services that might as well be a joke. We've seen it with gigantically-valued unicorns that over-engineered tons of crap and hired too many people and still can't make a profit. We've seen it with CMOs and massively overpriced marketing technology because budgets and statuses are related. We'll see it with tons more iterations of this exact same affluenza.

The history of our industry is that the margins on software are so good that people can afford to do crazy nonsense.

[1] https://www.youtube.com/watch?reload=9&v=y8OnoxKotPQ

an_opabinia

I’m not trying to get into a flame war with you, there’s no reason to be hostile. It sounds like you “work on this stuff a bit,” you’re welcome to share concrete examples, it would be really interesting!

I think it’s an intriguing thought exercise. For example, does one need the entire history of interactions with e.g., an Instagram post, or just aggregated measurements? I’m not like, against measuring. Just against warehousing of non financial timeseries.

gfodor

yes consider RPC:

CORBA

EJBs

SOAP

Microservices <--- You are here

Supermancho

> Is there any evidence that the vast amounts of clicks and user interactions companies have been collecting are worth anything at all?

Yes. Every advertising platform ever uses this information. In Europe, you have to have regulation that makes account costing (what the US might call forensic accounting) possible. The presentations on A/B tests by FANG companies might also interest you. They are on Youtube.

malisper

For a post detailing the modern data infrastructure I'm surprised they intentionally leave out SaaS analytics tools. I find this especially surprising given a16z has invested >$65M into Mixpanel.

Based on my experience working at an analytics company and running one myself, what this post misses out is that an increasing number of people working with data today are not engineers. These people can range from product managers who are trying to figure out what features the company should focus on building, marketers to figure out how to drive more traffic to their website, or even the CEO trying to understand how their business as a whole is doing.

For that reason, you'll still see many companies pay for full stack analytics tools (Mixpanel, Amplitude, Heap) in addition to building out their own data stack internally. It's becoming more and more important that the data is accessible to everyone at your company including the non-technical users. If you try to get everyone to use your own in-house built system, that's not going to happen.

whoisjuan

I don't think Mixpanel fits here. Mixpanel is just one end-to-end suite, mostly for behavioral data captured from user sessions or user-derived events/sub-events. Basically web analytics.

The whole point of data infrastructure is that sometimes you're collecting data from the most random places. Much of that data is not necessarily user behavior. Sometimes it's things like temperatures, latencies, CPU usage, or instrument tallies. Sometimes it's a stream of minute-to-minute weather data or timings or anything, really. Besides, many companies have been collecting data for decades, but it all lives in silos where it can't be used for anything.

Mixpanel can't capture all that data, or query it, or analyze it. Mixpanel captures just a super small subset of web event data, and it happens to provide an analysis suite on top of the data it collects.

That's why Segment shows up in this list instead. They help move a lot of siloed data into a common system. Mixpanel is just another source of data. You need something like Snowflake to put everything together and be able to run queries across multiple datasets.

soumyadeb

That's a great point. In a similar vein, marketing teams too are increasingly data-driven and use tools like Braze, CustomerIO, etc. to run personalized, data-driven campaigns. Support teams are using tools like GainSight.

All these tools need to be fed data about user behavior - from apps, server backends, other tools etc. It's a messy data connection problem, not just one way from SaaS to warehouse. Mobile App->SaaS; SaaS->SaaS; Warehouse->SaaS; SaaS->Warehouse and so on.

shostack

Don't forget the challenges around identity resolution and privacy compliance to properly join it all up effectively and accurately.

soumyadeb

Indeed. Even when you have the full identity graph in the warehouse and just want to assign a canonical ID (by doing a transitive closure), it is not easy in SQL. We wrote a blog on it (sorry for the shameless plug) https://rudderstack.com/blog/identity-graph-and-identity-res...

Creating the ID graph is a next-level problem altogether! How do you know a record in Salesforce is the same as the anonymous visitor on your website? It requires joining across at least 3 (possibly more) IDs: anonymousID, userID/email (if the user signs up), and the Salesforce record email.

Should the data pipe do this automatically? If not, what API abstraction should be exposed to the user?
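The transitive-closure step described above can be sketched with a union-find over observed ID pairs. This is a minimal illustration in Python, not RudderStack's actual implementation; all IDs and field names below are hypothetical:

```python
def canonical_ids(id_pairs):
    """Assign each identifier a canonical ID by computing the
    transitive closure over observed ID pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest ID becomes the root

    for a, b in id_pairs:
        union(a, b)
    return {x: find(x) for x in parent}

# Each pair links two identifiers observed together (hypothetical data):
# an anonymous cookie, a userID assigned at signup, and a CRM email.
pairs = [
    ("anon:123", "user:42"),
    ("user:42", "email:a@example.com"),
    ("anon:999", "user:7"),
]
groups = canonical_ids(pairs)
# anon:123, user:42 and email:a@example.com all map to one canonical ID
```

Doing the same iteration-to-fixpoint in pure SQL typically requires a recursive CTE, which is exactly why the comment calls it "not easy."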

huy

For those who're interested in learning more about the history and evolution of data infrastructure/BI - basically why and how it has come to this stage - check out this short guidebook [1] that my colleagues and I put together a few months back.

It goes into detail on how much relevance the practices of the past (OLAP, Kimball's modeling) still have amid the changes brought in by the cloud era (MPP, cheap storage/compute, etc.). Chapter 4 will be most interesting for the HN audience: it walks through the different waves of data adoption since BI was invented in the 60s-70s.

https://holistics.io/books/setup-analytics/

sradman

This sounds like an in-depth discussion of what the a16z document calls Blueprint 1: Modern Business Intelligence. I don’t know if the other two blueprints for Multimodal and AI are explored.

tuckerconnelly

The ELT (rather than ETL) insight was really cool, hadn't heard of that before.

Unless you're at a massive, massive scale, though, Just Use Postgres, and write your ETL (ELT now?) queries normally. Keep It Simple, Stupid.
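The ELT pattern at "Just Use Postgres" scale is simple enough to sketch: load raw events untransformed into a landing table, then transform in-database with plain SQL. This minimal example uses an in-memory SQLite as a stand-in for Postgres (and assumes a SQLite build with the JSON1 functions, the default in modern builds); table and field names are made up:

```python
import sqlite3
import json

# In-memory SQLite stands in for Postgres here; the pattern is the same.
db = sqlite3.connect(":memory:")

# L: load raw events as-is into a landing table (JSON blobs, no schema yet).
db.execute("CREATE TABLE raw_events (payload TEXT)")
events = [
    {"user": "a", "action": "click"},
    {"user": "a", "action": "view"},
    {"user": "b", "action": "click"},
]
db.executemany("INSERT INTO raw_events VALUES (?)",
               [(json.dumps(e),) for e in events])

# T: transform after loading, entirely in SQL.
rows = db.execute("""
    SELECT json_extract(payload, '$.user') AS user,
           COUNT(*)                        AS n_events
    FROM raw_events
    GROUP BY 1
    ORDER BY 1
""").fetchall()
print(rows)
```

The point of ELT is that the transform lives in the database as a query (or a dbt model) instead of in a pre-load pipeline, so it can be rerun and revised against the raw data at any time.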

cageface

While I think data science is a very interesting field with a lot of beneficial applications it also seems to be the one that's right at the heart of a lot of the negative impact some tech is having on society right now. I seriously considered specializing in it for a while but ultimately decided it was too likely I'd be asked to work on things that make me uncomfortable.

malux85

Power(ful tools) can be wielded for good or evil, the courageous thing to do is to learn it AND act ethically, not shy away from it.

Otherwise the spoils of war go to the unethical evil because they are now unchallenged.

jlokier

I disagree; I think that approach does not work.

Building powerful tools and then using them ethically doesn't reduce the amount of "unethical evil" done by others. Quite the contrary. And it doesn't deny them "spoils", as though there's a zero-sum prize, because there isn't one.

If you're really good at building tools, it will result in the creation of new, powerful tools which may be wielded for good or evil. If most other actors out there will wield those tools you're building for more evil than good, the mere act of building those tools will lead to more evil than good.

So I'm with cageface on this.

Deciding which tools to build does have consequences, and it's other people who primarily decide how those tools will be used, not the toolmaker. Sometimes you can already see what choices others look likely to make.

Some would argue this doesn't place an ethical burden on the toolmaker, because you can't and shouldn't control other people. That's a different argument though. Ethical or not, there are undeniably consequences from building tools when you can see how they are likely to be used.

cageface

Maybe so but I'm not really in a position to martyr myself professionally right now so I'm just avoiding it instead.

malux85

False dichotomy

dm03514

I'm really excited about the state of data infrastructure and the emergence of the data lake. I feel like the technical aspects of data engineering is reduced to getting data into some cloud storage (s3) as parquet. Transforms are "solved" using ELT from the data lake, or streaming using kafka/spark.

I think executing this in orgs with legacy data technologies is hard but it is much more a people problem than a tech problem. In orgs that have achieved this foundation it's really cool to see the business and analytic impact to the company.
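The "data into cloud storage as parquet" foundation mentioned above mostly comes down to writing partitioned files under a common prefix. A dependency-free sketch, with a local directory standing in for s3 and JSON-lines files standing in for Parquet (in practice you would use pyarrow and an s3 client); the paths and event fields are invented for illustration:

```python
import json
import tempfile
from pathlib import Path
from collections import defaultdict

# Local directory stands in for s3://bucket/lake/events/.
lake = Path(tempfile.mkdtemp()) / "events"

events = [
    {"ts": "2020-10-15T09:00:00", "user": "a", "action": "click"},
    {"ts": "2020-10-15T10:30:00", "user": "b", "action": "view"},
    {"ts": "2020-10-16T08:15:00", "user": "a", "action": "view"},
]

# Hive-style partitioning: one dt=YYYY-MM-DD directory per day, so
# downstream engines (Spark, external tables, etc.) can prune partitions.
by_date = defaultdict(list)
for e in events:
    by_date[e["ts"][:10]].append(e)

for date, rows in by_date.items():
    part = lake / f"dt={date}"
    part.mkdir(parents=True, exist_ok=True)
    with open(part / "part-0000.jsonl", "w") as f:
        for r in rows:
            f.write(json.dumps(r) + "\n")

print(sorted(p.name for p in lake.iterdir()))
```

Once data lands in this layout, the "T" is just queries over the lake, which is the ELT simplification the comment is excited about.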

chrisweekly

"it is much more a people problem than a tech problem"

^ This holds true for nearly every aspect of nearly every company.

spullara

Snowflake (and others) will let you either pull that in and query it, or run an external query that queries it in place. You can, if it makes sense for your use case, now just T from the data lake.

msolujic

Good start for this vast and complex topic. One thing that pops out as missing here is Data Mesh [1], an emerging pattern for complex data management and data exchange between multiple products and product components/services.

[1] https://martinfowler.com/articles/data-monolith-to-mesh.html

m3kw9

I wonder how many of those companies in the proposed architecture have A16z as investors?

huy

I counted 6.

Fivetran, dbt, Preset (Superset/Airflow), Sisu, Imply and Databricks.

Though, as someone who's been in this space a while, I think they did a decently fair job of articulating the 'modern' data infrastructure landscape.

fouc

The recent HN threads about excel made me think there's definitely room for a new kind of excel that works well for big data.

manigandham

That's just SQL.

And there are dozens of charting/visualization/business-intelligence vendors to do whatever you want beyond or on top of that SQL structure.

webmaven

> The recent HN threads about excel made me think there's definitely room for a new kind of excel that works well for big data.

Check out Google Connected Sheets: https://cloudblog.withgoogle.com/products/g-suite/connected-...

plaidfuji

This, to me, is now the rate limiting step in this architecture; there are probably 1000x as many people who can operate in Excel than people who can operate on a “data stack”. Yes, the fundamental goal of these data stacks is to enable insight and decisions “at scale”.. but beyond that you have probably hundreds or thousands of employees who just need to do quick analyses for one-off decisions that can be handled by Excel. But there’s usually a benefit to those analyses being “operationalized” and integrated into the broader architecture.. having a live connection to the central database, and having results piped back... so many Excel spreadsheets get emailed back and forth, completely out of the stack’s purview.

Will MS modify Excel 365 fast enough to meet this need? Will another spreadsheet program disrupt Excel’s dominance? Will another player come in with the ability to “ingest” arbitrary Excel files? Another major issue is Excel’s massive failure when it comes to handling uncertainty in data. I’ll be curious to see how it all plays out.

pm

I remember reading about Looker, before Google bought them out. I never used it myself, but it may have fit the description.

fluffy87

Citation needed?

We connect all our sensors to an edge AI Server that handles sensor data, and only uploads to the cloud what’s actually relevant.

It works quite well, and there are many OEMs that offer such systems, with accelerators for inference, sensor data compression, 5G, etc.

nicholast

I read this piece as a sort of loose validation that the Automunge library is filling an unmet need for data scientists: tabular data preprocessing in the steps immediately preceding the application of machine learning.


Emerging Architectures for Modern Data Infrastructure - Hacker News