
joeman1000

The thing that keeps me coming back to Julia is the ability to pipe (or whatever you want to call it). It makes DataFrame operations a lot cleaner since I don't need to modify in place or create new DFs at intermediate steps in a process. Here's a video showing this sort of workflow in R:

https://youtu.be/W3e8qMBypSE

ChrisRackauckas

Julia has a pipe syntax (|>). But I think the bigger part here is the APIs built around it more generally, and people are working on porting tidy syntax to it (https://github.com/TidierOrg/Tidier.jl).

joeman1000

This is very similar to DataFramesMeta:

https://github.com/JuliaData/DataFramesMeta.jl

time_to_smile

I'm still not entirely convinced that pipes aren't an anti-pattern. Absolutely an improvement over nested function calls:

a(b(c(d))) vs d |> c |> b |> a

but I'm not convinced pipes are better than more verbose code that explains each step:

step1 = c(d)

step2 = b(step1)

result = a(step2)

I've written a lot of tidy R and do understand the specific use cases where it really doesn't make sense to use the more verbose format, but generally find when I'm building complex mathematical models the verbose method is much easier to understand.
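For what it's worth, the three styles can be compared concretely. A hedged Python sketch (the `pipe` helper and the step functions are made up for illustration):

```python
# Toy comparison: nested calls vs. named intermediates vs. piping.
# All names here are hypothetical, purely for illustration.

def drop_negatives(xs):
    return [x for x in xs if x >= 0]

def square(xs):
    return [x * x for x in xs]

def total(xs):
    return sum(xs)

def pipe(value, *funcs):
    """Thread `value` through each function in turn, like d |> c |> b |> a."""
    for f in funcs:
        value = f(value)
    return value

data = [-2, -1, 0, 1, 2, 3]

# Nested: reads inside-out.
nested = total(square(drop_negatives(data)))

# Verbose: every step named.
kept = drop_negatives(data)
squared = square(kept)
verbose = total(squared)

# Piped: reads top-to-bottom; a step can be dropped by deleting one argument.
piped = pipe(data, drop_negatives, square, total)

assert nested == verbose == piped == 14
```

The verbose form documents each step with a name; the piped form makes adding or removing a step a one-token change, which is the trade-off both commenters are describing.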

joeman1000

I think having intermediate variables is sort of 'littering', and it requires extra naming work which might not be necessary. Also, with pipes, you can take out any intermediate step just by commenting out a line or deleting it. You cannot do this with your method above without then going and rewriting many different arguments. I also like piping because you can quickly iterate and build up a solution, quicker than naming intermediate steps anyway.

_0w8t

Naming intermediate steps requires some non-trivial effort. It can even distract from the main task of getting the results.

In programming, code will be read multiple times, and good names help future readers. But in data science the calculation will most likely never be reused, so the effort spent naming things is wasted.

geokon

I suggest trying to lean into it more: strictly bind output to a symbol only if it will be used in multiple places.

Then when I read code and see some "intermediary" value bound, it tells me immediately "this thing will be used in several spots". Bindings thereby actually start to convey extra information.

Anyway, it's just something that's worked for me. In all other scenarios I use threading/pipelines (maybe that's Clojure-specific). If steps are confusing/complex, you make a local named lambda or, in the extreme case, add comments.

Max-Limelihood

If nothing else, you can just pipe the code and then write comments explaining what's left after each step. And the verbose code can be substantially slower, since piping can allow all these operations to be performed lazily.
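The laziness point can be sketched in Python with generators standing in for lazy pipe stages (the stage names are made up):

```python
# Sketch: a lazy pipeline via generators. Each stage yields items on
# demand, so no intermediate list is ever materialized.
def evens(xs):
    return (x for x in xs if x % 2 == 0)

def doubled(xs):
    return (2 * x for x in xs)

# Composing the stages builds the pipeline without running it...
stage = doubled(evens(range(10**9)))

# ...and only the items actually requested are ever computed,
# even though the source is a billion elements long.
first_three = [next(stage) for _ in range(3)]
assert first_three == [0, 4, 8]
```

With named intermediate variables holding materialized lists, every step would have to process (and store) the full billion-element input.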

dillydogg

> The thing that keeps me coming back to Julia is the ability to pipe

> Provides link to R.

Is there an example of this in Julia? I use R now, and every time I give Julia a shot I go back to R because of the insane TTFP. I don't use anything remotely close to big data, and the 90-120s compile times just to replot my small data (using AlgebraOfGraphics.jl in a Pluto notebook) just kill me.

ChrisRackauckas

Did you try v1.9 or v1.10 yet? From others I'm hearing that the code caching changed Makie from about 70 seconds down to 10 in v1.9, and then the loading time improvements brought it to like 5 (unreleased of course, though v1.10 should be branching in a few weeks). Makie load times were of course one of the ones highlighted in the release notes of v1.9: https://julialang.org/blog/2023/04/julia-1.9-highlights/. So while Makie won't be "instant" by v1.10 (<1 second), it was one of the worst offenders before and has gone from "wtf" to "bad but manageable".

dillydogg

I haven't! I didn't realize that code caching was part of 1.9. Looks like I'll have to check it out. Thanks

tpoacher

I just use a simple chaining function for python like so https://sr.ht/~tpapastylianou/chain-ops-python/

joeman1000

A neat solution, but you can’t alter the position of the argument per function.

tpoacher

Of course you can. In fact I'm doing just that in two places in the example.

(Yes I know what you mean, but yes you know what I mean!)

In the end, chains are about readability and logical flow; even if you don't like pre-wrapping in more meaningfully named functionals like the example, and accept the slight readability cost of using the occasional in-spot lambda or partial, I feel that this still becomes a lot more readable than "treat this symbol unconventionally in this context as a positional placeholder" hacky syntax stuff.

joeman1000

Yes, but try using this and then try Julia's way. I tried this pandas implementation once and never touched it again.

chaxor

In pandas you can chain commands by wrapping the whole expression in (). IMO it looks far 'cleaner' than all of the ugly %>% everywhere.
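For anyone who hasn't seen the style, a minimal sketch (column names and steps are made up):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 10]})

# Wrapping the whole expression in ( ) lets each method call sit on its
# own line with no backslash continuations, so steps can be added,
# reordered, or commented out freely.
result = (
    df
    .query("value > 1")
    .groupby("group")
    .sum()
    .reset_index()
)
```

This is the parenthesized method-chaining style; it works because Python allows free line breaks inside any open parenthesis.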

civilized

The flip side is that in pandas, chaining is less uniform because it is based on methods.

In R you can pipe a data frame into any function from any package or one you just wrote, so you use %>% for any piping that happens. In pandas, you have special pandas methods that don't need the pipe, but to pipe with any other function, you have to write .pipe.

The comparison is not really between %>% and ., it's between "you just use %>% for everything" and "you use . for a bloated, somewhat arbitrary collection of special pandas methods, and .pipe for everything else".
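A minimal sketch of that asymmetry (the `add_offset` function is made up):

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]})

# A plain function "from any package or one you just wrote":
def add_offset(frame, offset):
    return frame.assign(value=frame["value"] + offset)

# Built-in pandas methods chain with a plain dot, but the outside
# function has to be routed through .pipe:
result = (
    df
    .pipe(add_offset, offset=10)   # outside function: .pipe required
    .sort_values("value")          # pandas method: plain dot works
)
```

In R's %>% world both calls would look identical, which is the uniformity the parent comment is pointing at.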

civilized

The sad thing about the conventional object-oriented programming paradigm is how it put the really cool syntactic idea of piping/chaining in the straitjacket of classes and objects.

The ability to pipe shouldn't be tied to whether a function is a method of a class.

joeman1000

What do you mean by wrapping the command in ()? I haven't seen this before. Do you have a link to where they mention this in the docs?

SilverBirch

Yeah, this is basically why I keep trying and bouncing off Julia. I understand the real performance reasons why you'd choose Julia, but the syntax is the perfect distance from Python to make it extremely difficult for me. It's just close enough to get constantly confused. So if I really wanted to do much work in it I'd have to swear off Python, and I can't do that because for trivial stuff Python is more convenient.

cookieperson

Pretty sure DataFrames.jl isn't the fastest dataframes library out there. I think it's Polars, which is written in Rust and usable from both Rust and Python. If I remember correctly the runner-up is data.table. Similarly, SQL/SQLite can often beat all of these. So switching to Julia for speed in this context may not even make sense anyway...

martinsmit

I agree with your conclusion but want to add that switching from Julia may not make sense either.

According to these benchmarks: https://h2oai.github.io/db-benchmark/, DF.jl is the fastest library for some things, data.table for others, polars for others. Which is fastest depends on the query and whether it takes advantage of the features/properties of each.

For what it's worth, data.table is my favourite to use and I believe it has the nicest ergonomics of the three I spoke about.

ChrisRackauckas

Indeed DataFrames.jl isn't and won't be the fastest way to do many things. It makes a lot of trade offs in performance for flexibility. The columns of the dataframe can be any indexable array, so while most examples use 64-bit floating point numbers, strings, and categorical arrays, the nice thing about DataFrames.jl is that using arbitrary precision floats, pointers to binaries, etc. are all fine inside of a DataFrame without any modification. This is compared to things like the Pandas allowed datatypes (https://pbpython.com/pandas_dtypes.html). I'm quite impressed by the DataFrames.jl developers given how they've kept it dynamic yet seem to have achieved pretty good performance. Most of it is smart use of function barriers to avoid the dynamism in the core algorithms. But from that knowledge it's very clear that systems should be able to exist that outperform it even with the same algorithms, in some cases just by tens of nanoseconds but in theory that bump is always there.

In the Julia world the one which optimizes to be fully non-dynamic is TypedTables (https://github.com/JuliaData/TypedTables.jl) where all column types are known at compile time, removing the dynamic dispatch overhead. But in Julia the minor performance gain of using TypedTables vs the major flexibility loss is the reason why you pretty much never hear about it. Probably not even worth mentioning but it's a fun tidbit.

> For what it's worth, data.table is my favourite to use and I believe it has the nicest ergonomics of the three I spoke about.

I would be interested to hear what about the ergonomics of data.table you find useful. If there are some ideas that would help DataFrames.jl learn from data.table directly, I'd be happy to share them with the devs. Generally when I hear from R people they talk about the tidyverse. Tidier (https://github.com/TidierOrg/Tidier.jl) is making some big strides in bringing a tidy syntax to Julia, and I hear it has had some rapid adoption and happy users, so there are ongoing efforts to use the learnings of R APIs, but I'm not sure if someone is looking directly at the data.table parts.

freilanzer

I have done both complex and trivial stuff in both languages and Julia isn't more inconvenient for trivial things.

cookieperson

Just make sure you find the appropriate documentation, because the package has changed its syntax an awful lot over the past four years or so, and there are lots of tutorials, videos, and blogs that don't apply anymore.

Similarly, make sure you research the ecosystem, because everything in Julia is very fragmented; e.g. pandas.read_csv will require two or more packages in its Julia equivalent.

Nilshg

To be clear on this: DataFrames, like most of the Julia ecosystem, follows SemVer. DataFrames 1.0 was released over two years ago (March 2021), and the API has been stable ever since.

Furthermore, Bogumil Kaminski, one of the main developers behind DataFrames, makes sure that the DataFrames tutorials he has created here (https://github.com/bkamins/Julia-DataFrames-Tutorial) are updated on every new release.

affinepplan

I notice you coming into every single thread about Julia to criticize the language and the community. Do you have a vendetta or something?

chaxor

This is clouded by personal preference far too much. Are you acting as if Python is void of issues with change over time? We all know the incredible pain of trying to get some ML package written 3 months ago (let alone 3 years ago) running, and how much time is spent remaking some conda env inside a docker inside qemu inside... just to get the stupid thing to load. So don't act like Python doesn't have its problems with change over time.

hpcjoe

> Julia is very fragmented

No, it isn't. I'm using Pandas, DF.jl, and even Polars at work. DF.jl is by far the best/easiest/quickest to use, as its syntax is consistent. Polars is a bit more annoying, as its syntax is further along the learning curve than I've gotten yet.

Pandas ... what to say about a library that will happily return a pd.Series one moment, and a pd.DataFrame another, for the same function call. This means you need extra code like

    if isinstance(ret_thing, pd.Series):
        # then do something to coax it back to a df
        ...

lest your actual code break.
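A concrete instance of that shape instability (not necessarily the exact call the parent comment hit): indexing with a scalar label returns a Series, while a one-element list of labels returns a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

row_as_series = df.loc[0]    # scalar label   -> pd.Series
row_as_frame = df.loc[[0]]   # list of labels -> pd.DataFrame

assert isinstance(row_as_series, pd.Series)
assert isinstance(row_as_frame, pd.DataFrame)

# The defensive coercion described above, sketched out:
ret_thing = df.loc[0]
if isinstance(ret_thing, pd.Series):
    ret_thing = ret_thing.to_frame().T   # back to a one-row DataFrame
```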

This is of course the same language that has API differences that make no sense in, say, re.match vs re.findall vs re.search. I've been burned by all of those.

So, look, we get you hate Julia. That's fine. Go live your python life to its best. But really, stop with the misinformation/FUD. This speaks volumes about you, and tends to make the case precisely the opposite of what you think.

And yes, I use Python, Julia, C++, and many other languages in the $day_job.

Nanana909

Any examples? I've found Julia far easier for simple things than Python. Most modern problems are mathematical in nature, and I think it's pretty objective that Julia looks closer to the mathematics. Below are some tests of very simple tasks in both languages.

Even at the most basic level, the differences are obvious.

Let's try to get a very simple object, a 3x3 matrix of random booleans, in both languages. Julia:

    A = rand(Bool, 3, 3)

Python: no standard support for matrices. I could really do it a disservice and compare the "core language", but that's obviously stupid, so we'll bring in an external library to make it easier. Of the many ways to skin the cat, here's one. Python:

    import numpy as np
    gen = np.random.default_rng()
    B = gen.choice([True, False], (3, 3))

BTW, Julia has this choice behavior built into rand as well, so rand(["Which", "Word", "Will", "I", "get?"]) produces exactly what you'd expect.

----

Actually, I can't think of any cases at all off the top of my head. Sorry, I mean off the top of my np.somenamespace.another.namespace.sparse head :) I mean, just going down the list of things that make code easier in Julia...

* Python requires a third-party library for any kind of linear algebra, even matrix multiplication. In Julia:

    A = [1 4; 6 7]; B = [2; 3]; A * B

In base Python this doesn't work, i.e. you literally can't even multiply a matrix! This is madness; you'll need, yet again, a third-party library.

* Python doesn't have broadcasting. Let's apply sin(x) to an array the Pythonic way (plus the required third-party library):

    import numpy as np
    x = np.array([1, 2, 3, 4, 5])
    y = np.sin(x)

Now in Julia (notice the . after sin):

    x = 1:5
    sin.(x)

or more explicitly we could write broadcast(sin, x).

Even basic string interpolation in Julia is a much nicer "trivial task" than in Python alone: no special brackets, just a clean "$myvar".

* Reading files is easier.

Julia:

    readlines("my_test.txt")

Python:

    open("my_test.txt").readlines()

What a strange design for option 2: if I call readlines on a filename, 99% of people, 99% of the time, want to read the lines of the file at that path. Why require two function calls?
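To be fair, the Python standard library does offer a one-expression spelling via pathlib. A self-contained sketch (the file contents here are made up):

```python
from pathlib import Path
import os
import tempfile

# Create a throwaway file so the example is self-contained.
fd, name = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as f:
    f.write("line one\nline two\n")

# One expression, roughly analogous to Julia's readlines("my_test.txt").
# splitlines() strips the trailing newlines, which Julia's readlines does
# by default but open(...).readlines() does not.
lines = Path(name).read_text().splitlines()
assert lines == ["line one", "line two"]

os.remove(name)
```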

mrtranscendence

It’s funny that people used to get on Python’s case for not being object oriented enough, and now we’ve come around to folks thinking Python should just throw a function for everything into the default namespace …

patrick451

> Sorry, I mean my np.somenamespace.another.namespace.sparce head :)

I really don't understand what you are trying to complain about. Namespaces are nice. Dumping everything into the global namespace sucks.

tombert

I have not done anything even remotely significant in Julia, but the little I played with didn't seem to indicate to me that it would be bad for trivial stuff...what trivial stuff is hard in Julia but easy in Python?

Nanana909

I responded in detail to them above, but honestly I can't think of many examples that answer your question. It's almost always the opposite. I like to present this as a stereotypical example of the kinds of differences you find between the two, IMO: generating a 3x3 matrix of random booleans.

>Julia

    A = rand(Bool, 3, 3)

Python: no standard support for matrices. I could really do it a disservice and compare the "core language", but that's obviously stupid, so we'll bring in an external library to make it easier. Of the many ways to skin the cat, here's one.

>Python

    import numpy as np
    gen = np.random.default_rng()
    B = gen.choice([True, False], (3, 3))

I find myself just typing `rand(["Msg1",.....,"MsgN"])` to get a random string often. And small things like this are why you see people say Julia is so nice to write, and why they become so protective of it.

undefined

[deleted]

sundarurfriend

I think they just mean that with trivial projects, it's not worth trying them in a new language since the performance benefits are probably going to be minimal, and wouldn't really show off Julia's strengths.

tombert

That's fair; I have done my fair share of scripting in Node.js just because I'm familiar with it and it's fast enough to do most anything.

henlab

DataFrames.jl was for me one of the primary reasons to use Julia. I personally find the syntax way easier than pandas, especially for more complex operations. I find this cheat sheet doesn't do it justice.

lukego

Tip: If you want these capabilities and your favorite language doesn't have them natively then you can consider embedding DuckDB for something comparable.

jstx1

This seems very poor: the comparison is between pandas and DataFrames.jl, not Julia; the syntax comparison is very surface-level; cheat sheets are low resolution; the learning curve section says nothing about the learning curve; and the conclusion is "do whatever you like".

misja111

Well, Pandas is a framework, not a language, so it only makes sense to compare it to DataFrames.jl and not to Julia as a language.

But I agree this should have been reflected in the title of the article.

ChrisRackauckas

I both agree and disagree. It does look weird as library vs language. However at the same time, in Pandas everything tends to be in the Pandas library, whereas when using DataFrames.jl you tend to mix it with a lot of features that are external. Most of the calls just use overloaded functions from Julia's Base library (mean, first, last, findall). The Pandas model is to look at the Pandas docs and find the Pandas dataframe function that does your job. The DataFrames.jl model is to do whatever you would have done normally in Julia, like use the sort function, but now just use it on a DataFrame. The idea of DataFrames.jl is that you know the language and so it extends/adds as few functions as possible (joins, groupby, split-apply-combine I think are it?). This plus many other calls use functions from the more general Julia data science ecosystem (CSV.jl, JSON.jl, ...). So the title ends up being a bit apples and oranges, but the usage is also quite apples and oranges and the cheat sheet does accurately reflect that.

xgdgsc

But there's CSV.jl, so I didn't change the title. I don't see anything low resolution; maybe a font choice issue? I'd say the conclusion is the right thing to say for such a short comparison.

jstx1

You have many errors in the tables, for example the pandas indexing is obviously wrong.

xgdgsc

It's not my website.

rgavuliak

The cheat sheet goes wrong already in the first example of declaring a df:

- you could use a range in Python (range(11, 14))
- the columns are called col_1 & col_2 in one and a & b in the other (both sets are horrible names)
- pandas defines an index of 0, 1, 3, while Julia would most likely have 0, 1, 2?

jstx1

Also, df.loc[1:3, :] doesn't get the first N rows: first because of 0-indexing, and second because when your index isn't ordered integers, you'll get completely unexpected results with .loc.
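A small sketch of the trap (made-up data): .loc slices by label and includes both endpoints, so once the index isn't 0-based consecutive integers, `df.loc[1:3]` stops meaning "first three rows":

```python
import pandas as pd

# An index that is monotonic but not 0-based:
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[2, 3, 4, 5])

# .loc is label-based: this selects rows whose *labels* fall in [1, 3],
# i.e. labels 2 and 3, not the first three rows.
assert list(df.loc[1:3]["x"]) == [10, 20]

# "First N rows" is positional, which is what .iloc and head() do:
assert list(df.iloc[0:3]["x"]) == [10, 20, 30]
assert list(df.head(3)["x"]) == [10, 20, 30]
```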

joeman1000

What do you mean? Julia is 1-indexed.

rgavuliak

Ok, in that case Python would be 0, 1, 3 and Julia 1, 2, 3. My point is that the example explicitly skips an index in the definition of a data frame for Python, but it doesn't for Julia.

markkitti

Surely Python would not skip the 2.

kloch

I've just started getting into Julia for one of its best use cases: it's super easy to do arbitrary-precision math. But you have to be very careful when using literals with BigInt or BigFloat:

  julia> setprecision(1024)
  julia> a=BigFloat(1.0E-300)
  1.000000000000000025059091835208759685696146807703705249925342319900466043184051484676302812181950100894962306270278254148910311464998804130812246091606190182719426627934584275510414782787015070222639260603793613924359775094030143866141479125513590882591017341692222921220404918621822029155619541859418525883262e-300
Notice that without quotes on the literal you only get ~15 decimal digits of precision, because the parser treats the literal as a double and then passes that to the BigFloat constructor.

  julia> a=BigFloat("1.0E-300")
  9.999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999988e-301
With quotes we get the full ~308 decimal digits of precision for the configured 1024-bit binary precision.

Now we can add it to 1.0 to validate the precision of a calculation and use the @printf macro for C-style formatting to round the output to 308 decimal digits:

  julia> b=BigFloat("1.0")
  julia> using Printf
  julia> @printf("%.308f\n", (a+b))
  1.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000
I'm not sure why this is the default behavior; it seems like a really easy way for people to screw up their calculations, especially scientists who don't do a lot of programming.
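For what it's worth, Python's decimal module has exactly the same trap for exactly the same reason, so this isn't a Julia quirk. A quick sketch:

```python
from decimal import Decimal

# The literal 0.1 is parsed as a binary double *before* Decimal ever
# sees it, so the Decimal faithfully records the double's rounding error:
from_float = Decimal(0.1)
assert str(from_float) == (
    "0.1000000000000000055511151231257827021181583404541015625"
)

# Passing a string lets Decimal parse the text itself:
from_string = Decimal("0.1")
assert str(from_string) == "0.1"
```

In both languages the constructor receives an already-rounded double; only a string (or a literal macro, in Julia's case) carries the full decimal text through.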

DNF2

I actually find the string macro syntax even more convenient: `big"1e-300"`.

wnoise

> I'm not sure why this is the default behavior,

Um. You said the answer earlier:

> because the parser treats the literal as a double

BigFloats aren't built in to the syntax of the language (and probably shouldn't be), so you need to escape the parser somehow: either pass a string to the constructor, or use the @big_str macro to get a non-standard string literal parsed into a BigFloat.

kloch

This makes perfect sense from the perspective of a language designer/computer scientist who is trying to keep their design clean and consistent.

It makes no sense to an end user who expects an argument passed to BigFloat to be treated as a BigFloat. As an end user I would rather have a warning or even an error than have my argument silently treated as a double.

sundarurfriend

It's a tricky case because they do provide the `big""` macros for literals, and mention in BigFloat's docs that:

      BigFloat(x::AbstractString) is identical to parse. This is provided for convenience since decimal literals are converted to
      Float64 when parsed, so BigFloat(2.1) may not yield what you expect.

      ...
      Examples
      ≡≡≡≡≡≡≡≡≡≡

      julia> BigFloat(2.1) # 2.1 here is a Float64
      2.100000000000000088817841970012523233890533447265625
      
However, saying RTFM is not a solution, especially for not-too-frequent parts of the language like BigFloats. It's still a trap many people are going to fall for.

The solution here is a good linter though, not adding more work to the already overstressed compiler. It comes back to the issue of Julia needing more mature, easy-to-work-with tooling, that could say "hey, this is technically allowed, but you probably didn't mean this".

DNF2

The argument isn't silently treated as a double, it is explicitly and loudly treated as a double, because it is a literal double.

And this is not an advantage to the designer exclusively, it is very much an advantage for the end user that the treatment is explicit, consistent and predictable, instead of 'magically' reinterpreting the meaning of literals based on guessing the intent of the user.

Basically, you seem to be saying that when passing x to BigFloat, x should not be treated as the value x, but as some nearby value that might be the one the caller intended (based on some rounding logic perhaps?) Or are you perhaps saying that

    x = 1e-300
    y = BigFloat(x)
should be different from

    y = BigFloat(1e-300)
? In other words, completely discarding referential transparency?

xigoi

The site's cookie banner doesn't give an option to reject tracking cookies. Isn't that illegal?

nologic01

Anxiously waiting for pandas to get its mojo. If there is any Python library that needs it, this is it.

make3

I wish Python would catch up to Julia in performance. There's no sense rewriting a trillion lines of code away from what is already a really pleasant syntax & ecosystem.

But this is a language flamewar thing, probably not a constructive comment, sorry.

wdroz

You can just use Polars [0] instead of Pandas and easily beat both Pandas and DataFrames.jl.

Pure Julia is faster than pure Python, but the Python ecosystem has non-pure-Python tools for a ton of things.

[0] -- https://www.pola.rs/

affinepplan

Polars definitely doesn't "easily" beat DF.jl on all tasks.

Yes, I agree, on average Polars is a bit faster for many of the simple workflows, but I certainly don't think that's unconditionally true. It's especially less true when you want to do something out of the ordinary with your series: in Julia it's trivial to just extract it as a vector and loop over it (fast!). In Polars, one would have to make sure their function can be appropriately vectorized.

adammarples

It's strange, I would think that Julia's multiple dispatch would make something more like this desirable

```
df = DataFrame(CSV(File("name.csv")))
data = JSON(File("name.json"))
```

instead of the usual hodgepodge of methods:

```
df = CSV.read("file.csv", DataFrame)
data = JSON.parsefile("file.json")
```

affinepplan

Actually, that works too :) I think `DataFrame(CSV.File("name.csv"))` is what you're looking for

retrochameleon

Unreadable on mobile

marginalia_nu

You do a lot of software development on mobile?

newswasboring

Look up the termux community. There are actually people in third world countries who are learning programming using their phones. I've seen awesome builds which use Android phones as their primary CPU unit and cobble on scavenged monitors and keyboards and mice. It's honestly fucking awesome

Alifatisk

How’s that related to the article not being mobile friendly?

marginalia_nu

It's a reference sheet for a programming language. Like it's clearly designed to be read as you're writing code. I don't understand in which circumstance you'd optimize such a document for mobile use.

EuAndreh

We should nurture more accessibility, in this case, mobile compatibility.

For instance, consider someone who has limited access to desktop computers and has to get by with a mobile device. These individuals do exist, and their access is as legitimate as any other's.

marginalia_nu

Mobile compatible websites are strictly worse though. It's why almost all desktop websites are just a hideous jumble of boxes these days.

It's virtually impossible to make a website that is well designed on both desktop and mobile. As long as the affordances of mouse+keyboard and touchscreen are as different as they are, one of the user groups' needs will suffer a detrimental compromise.

undefined

[deleted]

danuker

I guess you could find the cheat sheet useful on a phone/tablet while you are learning the correspondences.

smohare

[dead]

_aaed

Nah, I'll do it with SQL

rectang

One nice thing about solutions like pandas or Julia is that they’re much easier to write tests for or otherwise validate. I can’t tell you how many times I’ve been handed a big ball of SQL which doesn’t behave like its author thinks it does, diverging in subtle or not-so-subtle ways.

ivirshup

ibis in Python is a really nice middle ground: a nice API in a real programming language, but it executes on database backends (which could be Polars, or DuckDB on in-memory Arrow tables).

akdor1154

I agree, and interestingly so does the author of the article so it's a bit weird that you're receiving downvotes.

Julia's DataFrames library is more consistent than Pandas by a mile, but it's still a bit weird.

JuliaDB's IndexedTable and NDTable was a really awesome API design, it's quite a pity that JuliaDB is now unmaintained. :(

agacera

Same here. But have you tried DuckDB? You can run SQL on pandas dfs and it is fast af.

https://duckdb.org/2021/05/14/sql-on-pandas.html

avnigo

  mydf = pd.DataFrame({'a' : [1, 2, 3]})
  print(duckdb.query("SELECT SUM(a) FROM mydf").to_df())
I can see the appeal, but if you're working in Python, something doesn't sit right with me when having to write out variable names as strings. E.g., if I want to refactor the code, my LSP or parser won't pick up those references.

> The SQL table name mydf is interpreted as the local Python variable mydf [...] Not only is this process painless, it is highly efficient.

It might be painless and convenient at first, but I feel like this could get you in trouble down the line. Is there a way to avoid this?

cookieperson

Duckdb is sick. You can also do queries on parquet, etc.

cookieperson

SQLite is often much faster than DataFrames.jl and pandas.

ayhanfuat

I highly doubt that SQLite is faster than pandas, let alone dataframes.jl, for analytical workloads.

cookieperson

Might surprise you how many people are using pandas or dataframes for OLTP on a daily basis because they don't know better.

cjalmeida

SQLite is good for a bunch of stuff, but it's terrible for analytic workloads. Not even in the same ballpark as pandas, let alone Julia.

pjmlp

Same here; I don't get the point of this other than "don't want to learn SQL".

cookieperson

A lot of data scientists don't know SQL and don't understand why people use it. That said, there are cases where in memory workloads and certain manipulations are less encumbered by dataframes APIs. But in a lot of cases... They get abused by people who really do need to push themselves a little bit to learn something new.

cjalmeida

You use both. Once your data fits comfortably in memory it's naive to try to build histograms, pivots and charts using pure SQL.

smabie

You're saying all Pandas usage (an incredibly popular library) is because people don't want to use SQL?

cookieperson

It's a broken argument with some truth to it. I.e. you can run a SQL query and put the result in a dataframe to dump it to an interchange format. But in the same breath: if you learn to use SQL, a great deal of the workloads often handled via dataframe APIs kind of disappear, and in that case learning a new language to use a new dataframes API isn't really worth it.

pjmlp

As far as I am aware, plenty of use cases can also be done via OLAP.

laratied

[dead]


Pandas vs. Julia – cheat sheet and comparison - Hacker News