
zX41ZdbW

I recommend using clickhouse-local[1] for these tasks.

It does SQL; it supports all imaginable data formats, streaming processing, and connecting to external data sources. It also outperforms every other tool[2].

[1] https://clickhouse.com/blog/extracting-converting-querying-l...

[2] https://colab.research.google.com/github/dcmoura/spyql/blob/...

krick

> curl https://clickhouse.com/ | sh

Jesus, this is disgusting. I'm not that picky and don't usually complain about "... | sh", but I at least took it for granted that I could always look at the script in the browser and assume that it has no actual evil intentions and doesn't rely on some fucking client-header magic to be modified on the fly.

maximilianroos

Why be so cantankerous?

    curl https://clickhouse.com/ > install.sh
    cat install.sh

zamnos

User agent "sniffing" (I mean, it's right there. It's not exactly subtle.) has been going on since before the IE6 days. That it now extends to making things easier for us as command line users is... kinda convenient? Another site where I've seen it done to good effect is http://ifconfig.me . Hit that with a web browser and get a page of accompanying information. Hit it with curl, and get back your IP in ASCII - not even an extra newline character is returned!
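For the curious, here is a minimal, purely hypothetical sketch of what such User-Agent dispatch might look like on the server side: CLI clients get a bare payload, browsers get HTML. This is not clickhouse.com's or ifconfig.me's actual code; the tool list and function names are invented.

```python
# Hypothetical User-Agent dispatch: plain text for CLI tools, HTML for browsers.
CLI_AGENTS = ("curl", "wget", "httpie")

def respond(user_agent: str, ip: str) -> str:
    """Return a bare payload for CLI clients, an HTML page otherwise."""
    ua = user_agent.lower()
    if any(tool in ua for tool in CLI_AGENTS):
        return ip  # no markup, not even a trailing newline
    return f"<html><body>Your IP is {ip}</body></html>"
```

So `respond("curl/8.4.0", "203.0.113.7")` yields just the bare IP string, while a browser UA gets a full page.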

The underlying question is: do you trust clickhouse.com or not? You don't have to; I've never met the team or talked to them, and I can't make that decision for you. But whether you go to the site, laboriously find the download page, right click, download a binary, install the deb/rpm, ask your package manager what files just got installed, then find and run the clickhouse binary, or just let your computer do it for you via a shell script, the end result is the same. Code from clickhouse (and we're sure it was from clickhouse because of TLS) was downloaded to the target machine, and then got run. Do things have to be difficult and annoying in order for you to like them? (Psychology studies say yes, actually.)

zX41ZdbW

We provide .deb, .rpm, .tgz, Docker, or single-binary, for x86-64, AArch64, for Linux, Mac and FreeBSD.

cerved

pipe it through less

jsyolo

or sqlite for CSV/TSV; haven't tried it for JSON.

ydant

duckdb and being able to write:

   select a,b,c from '*.jsonl.gz'
has been a huge improvement to my workflows.
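That one-liner replaces a fair amount of plumbing. A rough stdlib sketch of the same projection over gzipped JSONL (the file layout and field names here are invented, and this is of course not duckdb's actual implementation):

```python
import glob
import gzip
import json
import os
import tempfile

def select_fields(pattern, fields):
    """Yield only the requested fields from every record in gzipped JSONL files."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt") as f:
            for line in f:
                record = json.loads(line)
                yield {k: record.get(k) for k in fields}

# Tiny demo: one shard with an extra field that gets projected away.
tmp = tempfile.mkdtemp()
with gzip.open(os.path.join(tmp, "part-0.jsonl.gz"), "wt") as f:
    f.write(json.dumps({"a": 1, "b": 2, "c": 3, "d": 4}) + "\n")

rows = list(select_fields(os.path.join(tmp, "*.jsonl.gz"), ["a", "b", "c"]))
```

duckdb additionally gives you full SQL (filters, joins, aggregation) over the same glob, which is where the real win is.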

zX41ZdbW

How does it look from command-line for streaming processing of CSV/TSV?

undefined

[deleted]

jsyolo

you can pipe to it
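For instance, with Python's stdlib you can stream CSV into an in-memory SQLite table and query it; something like `.import /dev/stdin t` in the `sqlite3` shell gets a similar effect from an actual pipe. The table name `t` and sample data below are made up.

```python
import csv
import io
import sqlite3

def query_csv(stream, sql):
    """Load a CSV stream into an in-memory SQLite table `t` and run sql."""
    reader = csv.reader(stream)
    header = next(reader)
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{c}"' for c in header)
    conn.execute(f"CREATE TABLE t ({cols})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO t VALUES ({placeholders})", reader)
    return conn.execute(sql).fetchall()

# Stand-in for data arriving on stdin.
data = io.StringIO("color,shape\nred,square\nred,circle\nblue,square\n")
rows = query_csv(data, "SELECT COUNT(*) FROM t WHERE color = 'red'")
```

Note that all columns land as untyped text this way; casting is up to the query.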

password4321

Discussion of a similar tool last month:

yq: command-line YAML, JSON, XML, CSV and properties processor

https://news.ycombinator.com/item?id=34656022

Also mentions gojq, Benthos, xsv, Damsel, a 2nd yq, htmlq, cfn-flip, csvq, zq, and zsv.

networked

I have made a repository cataloguing tools like this: https://github.com/dbohdan/structured-text-tools.

account-5

Thanks for this, good job. I always mean to do something similar but just end up bookmarking HN threads I then never look at.

samstave

Guilty as charged... I have so many bookmarks I forget to look at...

App idea: one that picks a random bookmark from your saves and sends you a reminder to click it and read it, on a schedule you choose - like every day at 7 AM, send me a link from my bookmarks. Set a 15-minute alarm to get me back on schedule.

damascus

So chatgpt and I made this: https://github.com/erasei/randombookmark Load in developer mode and when you click the icon it will open a new tab with a random bookmark.

jefftk

There's also GoAWK, which supports CSV natively ("-i csv"): https://github.com/benhoyt/goawk

horseRad

You should include qsv (https://github.com/jqnatividad/qsv) and goawk (https://github.com/benhoyt/goawk) in the CSV section! GoAWK has a CSV/TSV mode, and qsv is loaded with features and under active development.

glomgril

Big shouts to visidata -- very underrated tool imo.

elesiuta

Thank you! That list is perfect since I don't use these tools often enough to remember them and some are hard to find again without knowing exactly what to search for.

I also discovered pawk on it, which looks interesting. I made a somewhat related tool [1] which is probably out of scope for your list, but it can be used in some of the same ways [2] and may be of interest nonetheless.

[1] https://github.com/elesiuta/pyxargs

[2] cat /etc/hosts | pyxargs -d \n --im json --pre "d={}" --post "print(dumps(d))" --py "d['{}'.split()[0]] = '{}'.split()[1]"

donio

Benthos is a really cool tool when you want to take this idea to the next level. It can do the usual text/csv/json/yaml mangling you'd do with awk or jq (since it includes a version of those in addition to its own processing language), but it also has decoders and encoders for a bunch of different binary formats like protobuf or Avro. And in addition to stdin/out and files it can talk to Kafka, MQTT, HTTP, ES and a bunch of other stuff. I was able to put together a log processor that consumes from Kafka, does a bunch of field mangling and then ingests into Elasticsearch in a couple dozen lines of YAML.

dmix

Anyone else find jq really hard to use? I found the syntax/interface bewildering at times. Although not as bad as awk.

password4321

I'm pretty sure that's one big reason for all these other tools!

dima55

There's also https://github.com/dkogan/vnlog/ which is a wrapper around the existing coreutils, so all the options work, and there's nothing to learn

asicsp

Seems like you meant https://github.com/TomWright/dasel (instead of damsel)

password4321

Correct; thanks!

nextaccountic

the obvious missing reference is jq https://github.com/stedolan/jq

mdaniel

I've been getting a lot of mileage out of https://github.com/itchyny/gojq#readme recently due to two things: its vastly superior error messages and the (regrettably verbose) `--yaml-input` option

I also have https://github.com/01mf02/jaq#readme installed but just haven't needed it

donio

-y and -Y are unused, so I patch those in as short flags for --yaml-input/--yaml-output in my copy.

password4321

True, for me that one's a given!

elesiuta

Another similar discussion last year where Miller was mentioned:

New(ish) command line tools

https://news.ycombinator.com/item?id=31009313

rustyf

Even without aliases, I still prefer PowerShell on Windows. Once you've turned the text into an object, you can pipeline it to hell.

Get-Content .\example.csv | ConvertFrom-Csv | Where-Object -Property color -eq red

wankle

No thank you. I appreciate the power, speed, simplicity and flexibility of UNIX/GNU style text tools. I-Also-Don't-Want-To-Be-Locked-Into-This-Ridiculous-Syntax-Nightmare.

shortrounddev

It's funny to me that people hate how verbose posh is when bash syntax looks like my cat walked over my keyboard

stinos

When I read I-Also-Don't-Want-To-Be-Locked-Into-This-Ridiculous-Syntax-Nightmare, bash and sed and cut and the likes are indeed the first things which come to mind. I really feel kind of bad sometimes for once having spent time learning them, only to later find out there are many alternatives and nearly all of them have a gentler learning curve. Many of these tools also have zero discoverability, meaning you effectively get locked into learning their syntax. And then another one for the next tool. Whereas at least in PS you can try common words and tab completion and sometimes get there before having to reach for 'how do I x in y'.

chongli

It's a matter of recognizing your use case. If you're going to write a program that you expect to maintain for years, sure, go ahead and make it as verbose as possible. Unix tools support this with long-form flags (usually prefixed with -- rather than -). On the other hand, if you're doing exploration and iterating interactively on the fly (which bash is best at) then you want very terse syntax to keep lines short.

password4321

Thank goodness for tab completion!

I appreciate that it auto-corrects capitalization and slash direction.

gabrielsroka

Here it is with aliases. `?` means Where-Object:

  ipcsv example.csv | ? color -eq red
 
  color shape  flag index
  ----- -----  ---- -----
  red   square 1    15
  red   circle 1    16
  red   square 0    48
  red   square 0    7

7thaccount

Ever try this on large files? A lot of PowerShell commands I make like this can take minutes to run when a combination of Linux commands and Awk might take a couple of seconds.

unxdfa

I much prefer this approach as you only have to parse stuff once during the processing pipeline rather than at every step.

Unfortunately after working with it for a few years I utterly despise PowerShell for many other reasons.

cbolton

Can you share some of the reasons you don't like it?

unxdfa

Yes: memory ceiling is huge, WinRM is buggy and unreliable, performance is "variable", totally inconsistent handling of dates and times, being .Net it's UTF-16 internally so streams are painful to work with, escape characters argh, variable scoping is horrible, most of the API you have to hit to get stuff done is wrapping ancient bits of weird win32 and DCOM and WMI thus has inconsistent error handling and ability to actually do what it is told.

I've probably forgotten a few things.

hobs

While PowerShell has gotten a lot faster in the most recent versions, it's still pretty slow for anything involving computation; most of the time my equivalent Python code beats the pants off of it.

The expressiveness is nice, but oftentimes modules won't support it or require weird ways of using the data to get the performance you want (mostly by dropping out of the pipe.)

The choices around Format- vs Out- vs Convert- are Very Confusing for new people, and the "object in a shell but also text sometimes" way of displaying things is weird; until recently, things like -NoTypeInformation or managing encoding in files were just pointlessly weird.

The module support and package management is still entirely in the stone ages and I regularly see people patching in C# in strings to get the behavior they want.

"Larger" modules tend to get Really Slow - the azure modules especially are just an example of how not to do it.

The way it automatically unwraps collections is cool, but gets weird when you have a 1 vs many option for your output, and you might find yourself defensively casting things to lists or w/e.

The typing system in general is nice to get started on but declaring a type does not fix it, so assignment can just break your entire world.

There's still a lot to love about the language when you are getting things done in a windows environment its great to glue together the various pieces of the system but I find the code "breaks" more often than equivalent python code.

MayeulC

I agree that this is something missing on classic UNIX shells: typed output streams. I had this discussion a while back on HN, though I can't find it ATM (I wish there was a comment history search function). I am far from the first one who thought of that, and there are a few implementations of this idea.

Searching for that comment, I came across relevant stories:

https://news.ycombinator.com/item?id=27535357

https://news.ycombinator.com/item?id=25225113

mdaniel

Algolia supports searching comments, and while I don't know the "facet" syntax to pin it to your username, thankfully it's a pretty distinct word and https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... coughed up https://news.ycombinator.com/item?id=26797583 from 2 years ago. Is that the one?

MayeulC

Ah, interesting feature I didn't know about, thank you.

It's one of the parents in that comment chain (top comment on the story): https://news.ycombinator.com/item?id=26791597

gregjor

Typed output streams aren’t missing from Unix. The authors of Unix clearly describe, repeatedly, that Unix tools use plain text as the common format, and that everything should look like a file.

You’re right that standard Unix tools don’t have a concept of types in streams, but that decision got made deliberately. Types and formats got left as output details.

Analogously my Macbook Air doesn’t have a fan, by design, not by accidental omission.

MayeulC

Right, that was an unfortunate choice of words. It's not missing, but leads to (IMO) a proliferation of under-specified text-based data and stream exchange formats.

Having human-readable text as the lowest common denominator is a laudable goal. Shell scripting would, however, probably be improved if most tools offered alternative typed streams, or something similar. I am not convinced PowerShell's approach is the best, but it is at least interesting.
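For concreteness, here's a rough Python sketch (field names invented) of what a typed pipeline buys you: the first stage parses text into records exactly once, and every later stage filters structured data instead of re-splitting strings - roughly what PowerShell's object pipeline does.

```python
def parse(lines):
    """First stage: turn raw text lines into typed records, once."""
    for line in lines:
        color, shape, flag = line.strip().split(",")
        yield {"color": color, "shape": shape, "flag": int(flag)}

def where(records, key, value):
    """Later stages receive structured records; nothing gets re-parsed."""
    return (r for r in records if r[key] == value)

# A three-line "stream" standing in for another process's output.
raw = ["red,square,1", "blue,circle,0", "red,circle,1"]
result = list(where(parse(raw), "color", "red"))
```

With text pipes, by contrast, each downstream tool (`cut`, `awk`, `sort -t,`) has to re-discover the field structure for itself.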

snidane

Great tool.

BUT, leaks memory like crazy.

Despite the documentation stating the verbs are fully streaming:

> Fully streaming verbs

> These don't retain any state from one record to the next. They are memory-friendly, and they don't wait for end of input to produce their output.

https://miller.readthedocs.io/en/6.7.0/streaming-and-memory/...

coldtea

>BUT, leaks memory like crazy.

Huh?

a) Isn't it written in Go, which has a GC? Does it do custom buffer-based management?

b) Isn't it supposed to be run on a file and produce some output - as opposed to an interactive session? Why would it matter if it leaks, then, and how could it leak, as the memory is returned to the OS when it ends?

john_kerl

It's indeed written in Go now, which indeed has GC, and ultimately all memory is freed ... but there are indeed some issues around intermediate retention of memory, taking more memory than one would have expected.

Some gains were made on https://github.com/johnkerl/miller/pull/1133 and https://github.com/johnkerl/miller/pull/1132

asicsp

Previous discussion: https://news.ycombinator.com/item?id=28298729 (273 points | Aug 25, 2021 | 66 comments)

dang

Thanks! Macroexpanded:

Miller – tool for querying, shaping, reformatting data in CSV, TSV, and JSON - https://news.ycombinator.com/item?id=29651871 - Dec 2021 (33 comments)

Miller CLI – Like Awk, sed, cut, join, and sort for CSV, TSV and JSON - https://news.ycombinator.com/item?id=28298729 - Aug 2021 (66 comments)

Miller v5.0.0: Autodetected line-endings, in-place mode, user-defined functions - https://news.ycombinator.com/item?id=13751389 - Feb 2017 (20 comments)

Miller is like sed, awk, cut, join, and sort for name-indexed data such as CSV - https://news.ycombinator.com/item?id=10066742 - Aug 2015 (76 comments)

carb

This is nice. I use `column` for pretty-printing CSV/TSV, but this fixes two tiny gaps: in `sort` (no way to skip header lines) and in `jq` (no CSV input parsing; `jq` supports `@csv` for output conversion but not input).

  $ cat example.csv 
  color,shape,flag,index
  yellow,triangle,1,11
  red,square,1,15
  red,circle,1,16
  red,square,0,48
  purple,triangle,0,51
  red,square,0,77
  
  # pretty printing
  $ column -ts',' example.csv 
  color   shape     flag  index
  yellow  triangle  1     11
  red     square    1     15
  red     circle    1     16
  red     square    0     48
  purple  triangle  0     51
  red     square    0     77
  
  # sorting with skipped headers is a mess. 
  $ (head -n 1 example.csv && tail -n +2 example.csv | sort -r -k4 -t',') | column -ts','
  color   shape     flag  index
  red     square    0     77
  purple  triangle  0     51
  red     square    0     48
  red     circle    1     16
  red     square    1     15
  yellow  triangle  1     11

darrenf

> `jq` supports `@csv` for output conversion but not input

Actually, `jq` can cope with trivial CSV input like your example: `jq -R 'split(",")'` will turn a CSV into an array of arrays. To then sort it in reverse order by the index column and retain the header, the following fell out of my fingers (I'm beyond certain that a more skilled `jq` user than me could improve it):

     jq -R 'split(",")' example.csv | jq -sr '[[.[0]],.[1:]|sort_by(.[3])|reverse|.[]]|.[]|@csv'

    "color","shape","flag","index"
    "red","square","0","77"
    "purple","triangle","0","51"
    "red","square","0","48"
    "red","circle","1","16"
    "red","square","1","15"
    "yellow","triangle","1","11"
NB. there is also an entry in the `jq` cookbook for parsing CSVs into arrays of objects (and keeping numbers as numbers, dealing with nulls, etc) https://github.com/stedolan/jq/wiki/Cookbook#convert-a-csv-f...

asicsp

Another option (based on https://unix.stackexchange.com/q/11856/109046):

    (sed -u '1q' ; sort -r -k4 -t',') <example.csv | column -ts','

2h

> sorting with skipped headers is a mess

I like these command line tools, but I think they can cripple someone actually learning programming language. For example, here is a short program that does your last example:

https://go.dev/play/p/9bASZ97lLWv

cosmojg

It's a matter of perspective.

I like programming languages, but I think they can cripple someone actually learning Unix!

At the end of the day, you should just use whatever tools make you the most productive most quickly.

coldtea

The whole point of UNIX userland is to not have to write a custom program for every simple case that just needs recombining some existing basic programs in a pipeline...

2h

> simple case

that's just it though: the last example is not a simple case, which is why it's awkward by the commenter's own admission. Command line tools are fine, but you need to know when to set the hammer down and pick up the chainsaw.

gabrielsroka

I thought the Go code looked way too complex and Python would be simpler. Yes and no.

  import csv
  
  filename = 'example.csv'
  sort_by = 'index'
  reverse = True
  
  with open(filename) as f:
      lines = [d for d in csv.DictReader(f)]
  
  for line in lines:
      line['index'] = int(line['index'])
  lines.sort(key=lambda line: line[sort_by], reverse=reverse)
  print(','.join(lines[0].keys()))
  for line in lines:
      print(','.join(str(v) for v in line.values()))

cmdlineluser

Perhaps a `DictWriter` would simplify things:

    import csv
    import sys
    
    filename = "example.csv"
    sort_by = "index"
    reverse = True
    
    with open(filename, newline="") as f:
        reader = csv.DictReader(f)
        writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
        writer.writeheader()
        writer.writerows(sorted(reader, key=lambda row: int(row[sort_by]), reverse=reverse))

undefined

[deleted]

melx

+1 for easy install (`dnf install miller` on my Fedora).

But it seems like it cannot handle a simple use case: CSV without a header.

$ mlr --csv head -n 20 pp-2002.csv

mlr: unacceptable empty CSV key at file "pp-2002.csv" line 1.

You have to pass it explicitly (FYI, `implicit-csv-header` is a terrible arg name):

$ mlr --csv --implicit-csv-header head -n 20 pp-2002.csv

While `head` obliges without complaint:

$ head -n 20 pp-2002.csv

coldtea

Well, head just reads lines. A CSV row can span multiple lines.

Also head doesn't do anything with the data or the format, aside from printing line by line, so doesn't need to know any column names.
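To make the line/row distinction concrete, here's a minimal Python illustration (sample data invented): a quoted field containing a newline is one CSV row but multiple physical lines, so a line-oriented `head` can cut a row in half.

```python
import csv
import io

# One header row plus one data row whose quoted field spans two lines.
text = 'color,note\nred,"first line\nsecond line"\n'

physical_lines = text.splitlines()          # what head / wc -l see: 3 lines
rows = list(csv.reader(io.StringIO(text)))  # what a CSV-aware tool sees: 2 rows
```

Here `head -n 2` would emit `red,"first line` with an unterminated quote, while a CSV-aware tool keeps the row intact.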

melx

head reads a CSV row that has multiple lines.

Miller is being offered as "like awk, sed, head.." (emphasis on head - mine) and yes it offers more, but it does not behave "like" the *nix tools it refers to.

0cf8612b2e1e

Miller is designed around the idea of structured data. This is a higher bar than naively manipulating text, and the user has to be more deliberate to extract fields. Doing cli manipulation of a csv that contains quoted commas is challenging with the standard tools.
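A two-line illustration of the quoted-comma point (sample record invented): split on commas and the quoted field shatters; parse it as CSV and it stays whole.

```python
import csv
import io

line = 'Widget,"Portland, OR",19.99\n'

# Naive text-tool-style split: the quoted comma breaks the field count.
naive = line.strip().split(",")

# CSV-aware parsing keeps "Portland, OR" as one field.
parsed = next(csv.reader(io.StringIO(line)))
```

The naive split yields four fields where there are really three, which is exactly why `cut -d,` and friends fall over on real-world CSV.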

coldtea

>head reads CSV row that has multiple lines.

It just reads lines, doesn't know anything about rows.

So if you want, say, the first 100 rows of a csv that has multiline rows, miller can give it, while head can't.

Head will just print the "N" first lines - whether those are 1, 22, 36 or N actual csv rows.

>Miller is being offered as "like awk, sed, head.." (emphasis on head - mine) and yes it offers more, but it does not behave "like" the *nix tools it refers to.

That's the whole idea.

That it behaves in a way more suited to the csv format, and more coherent than 5-6 different text-focused tools.

The claim is not "this is awk, sed, head, sort remade for csv with identical interfaces and behavior" but "this is a tool to work with csv files and do what you'd normally have to jump through hoops to do with awk, sed, head, sort which don't understand csv structure".

kbouck

you could shorten to:

$ mlr -N --csv head -n 20 pp-2002.csv

-N is a shortcut for --implicit-csv-header and --headerless-csv-output

WinstonSmith84

This is a cool tool. And it works well with GPT4. I asked ChatGPT4 the following: ``` I'd like to use Miller from John Kerl on the CSV called test.csv. It has 3 columns, Name, City and State. I'd like to filter City where it's either Chicago or Boston. I want to remove the State column and add a new column called NewColumn. This new column shall be filled with the value of the Name column appended with the letter `_new`. Finally the output shall be saved into the file called myfile.csv. Write the command to do that ```

And it replied with

  mlr --csv filter '$City == "Chicago" || $City == "Boston"' then cut -x -f State then put '$NewColumn = $Name . "_new"' test.csv > myfile.csv

I'm sure that someone familiar with Miller would write this command faster than I wrote the prompt, but as a newbie I'd have spent much longer. Also, each step is described perfectly.

0xbadcafebee

Csvkit is the usual way to handle working with CSVs (or Excel, or JSON, or.... Postgres): https://csvkit.readthedocs.io/en/latest/index.html#why-csvki...

`brew install csvkit` and enjoy

pcthrowaway

Miller is a welcome addition to my toolbox. I just wish it was as 'easy' to use as jq... part of that might be how infrequently I work with CSVs

coldtea

Huh? jq has far more difficult syntax...

mywittyname

The syntax is pretty crazy, so there's a learning curve. But once you're over the hump, jq is easy to use for transformations or "querying" of large volumes of JSON data.

hesdeadjim

Yea agreed. The moment I need to do anything complicated I start getting this gut feeling I’d be better off just writing a quick Go cli tool.

ar9av

Looks cool! It'd help to have some more full examples (including output) of what some common use-cases do in the README. After looking through this, I'm still scratching my head trying to think of what problem this could solve in my own practice, and I work with a lot of csv files daily

mdaniel

In some sense, there's no answer to your question because everyone's tool workflow is their own.

That said, the most recent invocation for me was `mlr --icsv --ojson cat < a515b308-9a0e-4e4e-99a2-eafaa6159762.csv` to fix up CloudTrail csv and after that I happen to be more muscle-memory with jq but conceptually next I could have `filter '$eventname == "CreateNodegroup"'`


Miller: Like Awk, sed, cut, join, and sort for CSV, TSV, and tabular JSON - Hacker News