
graphe

Here's a thread on performance vs. rg (ripgrep): https://github.com/BurntSushi/ripgrep/discussions/2597 (I didn't know about hypergrep either).

ashvardanian

Haven't benchmarked *grep implementations, but assuming those are just CLI wrappers around RegEx libraries, I'd expect the RegEx benchmarks to be broader and more representative.

There, Hyperscan is generally the king, which means the hypergrep numbers are likely accurate: https://github.com/p-ranav/hypergrep?tab=readme-ov-file#dire...

Disclaimer: I rarely use any *grep utilities, but often implement string libraries.

burntsushi

I'm the author of ripgrep and its regex engine.

Your claim is true to a first approximation. But greps are line oriented, and that means there are optimizations that can be done that are hard to do in a general regex library. You can read more about that here: https://blog.burntsushi.net/ripgrep/#anatomy-of-a-grep (greps are more than simple CLI wrappers around a regex engine).

If you read my commentary in the ripgrep discussion above, you'll note that it isn't just about the benchmarks themselves being accurate, but the model they represent. Nevertheless, I linked the hypergrep benchmarks not because of Hyperscan, but because they were done by someone who isn't the author of either ripgrep or ugrep.

As for regex benchmarks, you'll want to check out rebar: https://github.com/BurntSushi/rebar

You can see my full thoughts around benchmark design and philosophy if you read the rebar documentation. Be warned though, you'll need some time.

There is a fork of ripgrep with Hyperscan support: https://sr.ht/~pierrenn/ripgrep/

Hyperscan also has some peculiarities in how it reports matches. You won't notice it in basic usage, but it will appear when using something like the -o/--only-matching flag. For example, Hyperscan will report matches of a, b and c for the regex \w+, whereas a normal grep will just report a match of abc. (And this makes sense given the design and motivation for Hyperscan.) Hypergrep goes to some pains to paper over this, but IIRC the logic is not fully correct. I'm on mobile, otherwise I would link to the reddit thread where I had a convo about this with the hypergrep author.

haberman

> I'm on mobile, otherwise I would link to the reddit thread where I had a convo about this with the hypergrep author.

From some searching I think you might mean this: https://www.reddit.com/r/cpp/comments/143d148/hypergrep_a_ne...

burntsushi

OK, now that I have hands on a keyboard, this is what I meant by Hyperscan's match semantics being "peculiar":

    $ echo 'foobar' | hg -o '\w{3}'
    1:foobar
    $ echo 'foobar' | grep -E -n -o '\w{3}'
    1:foo
    1:bar
Here's the aforementioned reddit thread: https://old.reddit.com/r/cpp/comments/143d148/hypergrep_a_ne...

I want to be clear that these are intended semantics as part of Hyperscan. It's not a bug with Hyperscan. But it is something you'll need to figure out how to deal with (whether that's papering over it somehow, although I'm not sure that's possible, or documenting it as a difference) if you're building a grep around Hyperscan.

frankjr

It might be the intended behavior of Hyperscan but it really feels like a bug in Hypergrep to report the matches like this - you cannot report a match which doesn't fully match the regex...

I also wonder if there's a performance issue when matching a really long line, because Hyperscan is not greedy and will call back into Hypergrep for every sub-match. I'm guessing this is the reason for those shenanigans in the callback [0].

  $ python -c 'print("foo" + "bar" * 3000)' | hg -o 'foo.*bar'
[0] https://github.com/p-ranav/hypergrep/blob/ee85b713aa84e0050a...

kazinator

How about: use Hyperscan to round up all the lines that contain matches, and process those again with regex for the "-o" semantics.
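A minimal sketch of that two-pass idea, with rg standing in for a hypothetical Hyperscan-backed line filter and plain grep -E -o supplying the usual per-match semantics on the surviving lines:

    #!/bin/sh
    # two-pass.sh PATTERN FILE
    # Pass 1: cheaply keep only the lines that contain at least one match.
    # Pass 2: re-run a conventional regex engine on those lines for -o output.
    # (Assumes the pattern is valid syntax for both engines.)
    pat="$1"; file="$2"
    rg -N --color=never "$pat" "$file" | grep -E -o "$pat"
The second pass only touches matching lines, so its extra cost scales with the number of hits rather than with the size of the haystack.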

cozzyd

is that an alias, or does hypergrep really use the same command name as mercurial?

infocollector

I think you should try it before you read these conflicting benchmarks from the authors: https://github.com/Genivia/ugrep-benchmarks

1vuio0pswjnm7

rg uses a lot of memory in the OpenSubtitles test. 903M vs 29M for ugrep. Unlike the previous test, we are not told the size of the file being searched.

Would be interesting to see comparisons where memory is limited, i.e., where the file being searched will not fit entirely into memory.
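One rough way to approximate that on Linux (a sketch only: systemd-run with a MemoryMax property caps the process's cgroup but not the kernel page cache, and the pattern/file names here are placeholders):

    systemd-run --user --scope -p MemoryMax=64M rg -c 'Sherlock' big.txt
    systemd-run --user --scope -p MemoryMax=64M ugrep -c 'Sherlock' big.txt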

Personally I'm interested in "grep -o" alternatives. The files I'm searching are text but may have few newlines. For example I use ired instead of grep -o. ired will give the offsets of all matches, e.g.,

      echo /\"something\"|ired -n 1.htm
Quick and dirty script, not perfect:

      #!/bin/sh
      test $# -gt 0||echo "usage: echo string|${0##*/} file [blocksize] [seek] [match-no]"
      {
      read x;
      x=$(echo /\""$x"\"|ired -n $1|sed -n ${4-1}p); 
      test "$x"||exit 1;
      echo
      printf s"$x"'\n's-${3-0}'\n'x$2'\n'|ired -n $1;
      echo;
      printf s"$x"'\n's-${3-0}'\n'X$2'\n'|ired -n $1;
      echo;
      echo w$(printf s"$x"'\n's-${3-0}'\n'X$2'\n'|ired -n $1)|ired -n /dev/stdout;
      echo;
      }
Another script I use loops through all the matches.

burntsushi

> rg uses a lot of memory in the OpenSubtitles test. 903M vs 29M for ugrep. Unlike the previous test, we are not told the size of the file being searched.

Which test exactly? That's likely just memory maps futzing with the RSS data, not actually more heap memory. Try with --no-mmap.

I'm not sure I understand the rest of your comment about grep -o. Grep tools usually have a flag to print the offset of each match.
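For example, with the -o and -b (byte offset) flags (a sketch from memory; the exact formatting may differ slightly between tools):

    $ printf 'foo bar foo\n' | grep -ob foo
    0:foo
    8:foo
    $ printf 'foo bar foo\n' | rg -ob foo
    0:foo
    8:foo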

EDIT: Now that I have hands on a keyboard, I'll demonstrate the mmap thing. First, ugrep:

    $ time ugrep-4.4.1 -c '\w+\s+Sherlock\s+Holmes\s+\w+' sixteenth.txt
    72

    real    22.115
    user    22.015
    sys     0.093
    maxmem  30 MB
    faults  0
    $ time ugrep-4.4.1 -c '\w+\s+Sherlock\s+Holmes\s+\w+' sixteenth.txt --mmap
    72

    real    21.776
    user    21.749
    sys     0.020
    maxmem  802 MB
    faults  0
And now for ripgrep:

    $ time rg-14.0.3 -c '\w+\s+Sherlock\s+Holmes\s+\w+' opensubtitles/2018/en/sixteenth.txt
    72

    real    0.076
    user    0.046
    sys     0.030
    maxmem  779 MB
    faults  0
    $ time rg-14.0.3 -c '\w+\s+Sherlock\s+Holmes\s+\w+' opensubtitles/2018/en/sixteenth.txt --no-mmap
    72

    real    0.087
    user    0.033
    sys     0.053
    maxmem  15 MB
    faults  0
It looks like the difference here is that ripgrep chooses to use a memory map by default. I don't think it makes much of a difference either way.

If the file were bigger than available memory, then the OS would automatically handle paging.

undefined

[deleted]

1vuio0pswjnm7

ripgrep is not for me.

burntsushi

I never argued otherwise. Especially since you clearly don't mind false negatives. ;-)

1vuio0pswjnm7

task: printing non-repeating patterns in relatively small files to the screen, optionally with some context

context should be printed exactly as it appears in the file, i.e., newlines should be printed

ired vs ripgrep, which one is better suited for this task

one uses regular expressions, the other does not

one is a 76k static binary that fits in 2MB L2 cache, the other is a 5.7MB dynamically-linked binary

2 shell scripts to demonstrate differences

usage: echo pattern|1.sh [num chars before] [num chars after]

1. "1.sh" using 5.7MB binary, PCRE2

      #!/bin/sh
      read x;
      case $# in :)
      ;;0)exec echo "usage: ${0##*/} file [num chars before] [num chars after]"
      ;;1)exec rg -uuu --no-unicode --block-buffered --color=never -NUo "$x" $1
      esac
      case $# in 2|3)printf "((.)|(\\\\n)|(\\\\r)){"$2"}$x((.)|(\\\\n)|(\\\\r)){"${3-0}"}";esac \
      |rg -f/dev/stdin -uuu --no-unicode --block-buffered --color=never -NUo $1
2. "2.sh" using 76k static binary, no regular expressions

      #!/bin/sh
      read x;
      len=${#x};
      b=$(($2+$len));
      case $# in 0)exec echo "usage ${0##*/} file [num chars before] [num chars after]"
      ;;2)b=$(($2+$len))
      ;;3)b=$(($3+$2+$len))
      esac
      echo "$x" > .x
      { printf /;ired -n -c X1 .x;} \
      |ired -n $1 \
      |sed  "s/.*/s&@s-${2-0}@b$b@X/;" \
      |tr @ '\12' \
      |ired -n $1 \
      |sed 's/.*/w&0a/' \
      |ired -n /dev/stdout \
      |sed -e '/^Invalid hexpair/d' 
    
generate test data:

      curl -4si0 -A "" https://www.google.com > test.html   
 
usage example: find the pattern "(" in test.html, display results to screen

      echo \(|1.sh test.html
 
      regex parse error:
          (?:()
          ^
      error: unclosed group
 
      echo '[(]'|1.sh test.html
 
      echo \(|2.sh test.html
 
      cat .x
observation: the task is simple but 1.sh may require more typing and knowledge of regular expressions

observation: 2.sh does not require knowledge of PCRE; the pattern requires no extra chars, e.g., brackets

usage example: find the pattern "(a" in test.html, display results to screen with 0 chars before and 3 chars after

     echo '[(]a'|1.sh test.html 0 3|sed -n l|less -N

     echo \(a|2.sh test.html 0 3|sed -n l|less -N
observation: 1.sh does not include the newline after match #187; some workaround is required for 1.sh

conclusion: for me, ripgrep is too large and complicated for this simple task involving relatively small files; it's overkill. it does not feel any faster than ired at the command line. in fact, it feels slower. like python or java, or other large rust/go binaries, there is a small initial delay, a jank. whereas ired feels very smooth.

burntsushi

I love how you continue to ignore the fact that ired produces incorrect results.

Also:

You can use -F to make the argument to ripgrep be interpreted as a literal. No knowledge of regex is needed. It's a standard grep flag.
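For example (a quick sketch; exact output formatting may vary):

    $ echo 'foo(bar' | rg -F '('
    foo(bar
    $ echo 'foo(bar' | rg -Fo '('
    (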

You also aren't using PCRE. You're using ripgrep's default engine, which is the regex crate. You need to pass -P to use PCRE2. Although I don't see the point in doing so.

I find your overall comparison here to be disingenuous, personally. You can't even be arsed to acknowledge that ired returns incorrect results. And every benchmark I've run has shown ripgrep to be faster or just as fast. There's no jank.

I already acknowledged that the rg binary is beefy. It is actually statically linked by default (although it may dynamically link C libraries). I don't care if rg is 5MB. If you do, then rg isn't for you. You can keep using broken software instead.

1vuio0pswjnm7

   xbps-query -RS ripgrep |sed -n 11,21p

      pkgname: ripgrep
      pkgver: ripgrep-14.0.3_1
      repository: https://repo-default.voidlinux.org/current/musl
      run_depends:
      libgcc>=4.4.0_1
      libpcre2>=10.22_1
      musl>=1.1.24_7
      shlib-requires:
      libc.so
      libgcc_s.so.1
      libpcre2-8.so.0
It would be nice to have a ripgrep without libpcre2.

It also would be nice to use BRE by default and make ERE optional, similar to grep.

What would compiling ripgrep from source entail. Would it be as easy as compiling ired.

ired compiles in seconds and compiling requires less than 1MB of disk space. No connection to any server is required to compile the program.

Let's edit the 1.sh script to add the -F option and try our example search again to see what happens.

       #!/bin/sh
       read x;
       case $# in :)
       ;;0)exec echo "usage: ${0##*/} file [chars before] [chars after]"
       ;;1)exec rg -F --no-unicode --block-buffered --color=never -NUo "$x" $1
       esac
       case $# in 2|3)printf "((.)|(\\\\n)|(\\\\r)){"$2"}$x((.)|(\\\\n)|(\\\\r)){"${3-0}"}";esac \
       |rg -f/dev/stdin -F --no-unicode --block-buffered --color=never -NUo $1

       echo \(|1.sh test.html

       echo \(a|1.sh test.html 0 3
As expected, this produces no output.

We cannot add the surrounding context characters as literals because we do not know the identity of these characters. That is what we are attempting to find out.

Would I ever search for a repeating pattern such as \(a\(a using ired. The answer is no; I am looking for context. I would search for \(a and then add a request for context, a number of characters before and/or after, as in the examples. Again, I do not know what those characters will be; that is what I am searching for. If the pattern repeats, this would be visible from viewing the context.

For line-delimited files where data is presented in a regular format, grep -A, -B and -C work great for printing context. But for files that can be idiosyncratic in how they present data and/or files that lack consistent newline delimiters, for me, grep -o is inadequate for printing context.

undefined

[deleted]

1vuio0pswjnm7

The failure of grep/ripgrep to display the newline character contained in the context in match #178 could be characterised as a "false negative".

1vuio0pswjnm7

1. Retrieve test.json

     curl -i40A "" "https://api.crossref.org/works?query=unix&rows=1000" > test.json
2. Create shell script

      #!/bin/sh
      # usage: echo string|1.sh file [blocksize] [seek]"
      read x;
      x=$(echo -n $x|od -An -tx1|tr -d '\40');
      echo /$x \
      |ired -n $1 \
      |sed "s/.*/s&@s-${3-0}@X$2/" \
      |tr @ '\12' \
      |ired -q -i /dev/stdin $1 \
      |sed 's/.*/w&0a/' \
      |ired -n /dev/stdout
We can make the script slightly faster by using busybox

      #!/bin/sh
      # usage: echo string|1.sh file [blocksize] [seek]"
      read x;
      x=$(echo -n $x|busybox od -An -tx1|busybox tr -d '\40');
      echo /$x \
      |ired -n $1 \
      |busybox sed "s/.*/s&@s-${3-0}@X$2/" \
      |busybox tr @ '\12' \
      |ired -q -i /dev/stdin $1 \
      |busybox sed 's/.*/w&0a/' \
      |ired -n /dev/stdout
NB. If redirecting output to a file, replace /dev/stdout with the file name.

ired is available on Void Linux

https://ftp.lysator.liu.se/pub/voidlinux/static/

      xbps-query.static -Rs ired-0
      xbps-install.static ired
3. Test grep v3.6, ripgrep v14.0.3 and shell script; busybox is v1.34.1

     busybox time grep -Eo .{35}https:.{4} test.json;

     busybox time rg -o .{35}https:.{4} test.json;

     busybox time sh -c "echo https:|1.sh 45 35 test.json"
We can make the script slower by using bash

     busybox time bash -c "echo https:|1.sh 45 35 test.json"
Program size

     du -h /usr/bin/grep
     216K    /usr/bin/grep

     du -h /usr/bin/rg
     5.7M    /usr/bin/rg

     du -hc /usr/bin/ired /bin/dash /usr/bin/tr /usr/bin/sed /usr/bin/od
     456K    /bin/dash
     40K     /usr/bin/ired
     56K     /usr/bin/tr
     68K     /usr/bin/od
     104K    /usr/bin/sed
     724K    total

     du -h /usr/bin/busybox /usr/bin/ired
     772K    /usr/bin/busybox
     40K     /usr/bin/ired
     812K    total

     readelf -d /bin/dash /usr/bin/busybox

     File: /bin/dash

     There is no dynamic section in this file.

     File: /usr/bin/busybox

     There is no dynamic section in this file.

burntsushi

OK, so I'll try your commands:

    $ busybox time grep -Eo .{35}https:.{4} test.json
    real    0m 0.15s
    user    0m 0.15s
    sys     0m 0.00s

    $ busybox time rg-14.0.3 -o .{35}https:.{4} test.json
    real    0m 0.00s
    user    0m 0.00s
    sys     0m 0.00s

    $ busybox time dash -c "echo https:|./1.sh test.json 45 35"
    real    0m 0.01s
    user    0m 0.01s
    sys     0m 0.00s

    $ busybox time bash -c "echo https:|./1.sh test.json 45 35"
    real    0m 0.00s
    user    0m 0.00s
    sys     0m 0.00s

    $ busybox time dash -c "echo https:|./busy-1.sh test.json 45 35"
    real    0m 0.00s
    user    0m 0.01s
    sys     0m 0.00s

    $ busybox time bash -c "echo https:|./busy-1.sh test.json 45 35"
    real    0m 0.01s
    user    0m 0.01s
    sys     0m 0.00s
So grep -o takes 150ms, but both ripgrep and ired are seemingly instant. But if I use zsh's builtin `time` command with my own TIMEFMT[1], it gives me numbers greater than 0:

    $ time grep -Eo .{35}https:.{4} test.json
    real    0.324
    user    0.317
    sys     0.007
    maxmem  16 MB
    faults  0

    $ time rg-14.0.3 -o .{35}https:.{4} test.json
    real    0.008
    user    0.003
    sys     0.003
    maxmem  16 MB
    faults  0

    $ time dash -c "echo https:|./1.sh test.json 45 35"
    real    0.010
    user    0.011
    sys     0.007
    maxmem  16 MB
    faults  0

    $ time bash -c "echo https:|./1.sh test.json 45 35"
    real    0.011
    user    0.014
    sys     0.004
    maxmem  16 MB
    faults  0
Would you look at that. ripgrep is faster! By a whole 2 milliseconds! WOW!

OK, since I'm a software developer and thus apparently cannot understand the lowly needs of an "ordinary user," I'll hop over to my machine with an i5-7600, which was released 6 years ago. Is that ordinary enough, or still too supercharged to do any meaningful comparison whatsoever?

    $ time grep -Eo .{35}https:.{4} test.json
    real    0.641
    user    0.620
    sys     0.017
    maxmem  6 MB
    faults  0

    $ time rg-14.0.3 -o .{35}https:.{4} test.json
    real    0.010
    user    0.008
    sys     0.000
    maxmem  8 MB
    faults  0

    $ time dash -c "echo https:|./1.sh test.json 45 35"
    real    0.011
    user    0.009
    sys     0.011
    maxmem  6 MB
    faults  0

    $ time bash -c "echo https:|./1.sh test.json 45 35"
    real    0.013
    user    0.021
    sys     0.003
    maxmem  6 MB
    faults  0
(I ran the commands above each several times and took the minimum.)

OK, so ripgrep is still 1ms faster even on "ordinary user" hardware.

All right, so your other comment also shared another benchmark:

    $ time grep -Eo .{100}https:.{50} test.json
    real    1.777
    user    1.772
    sys     0.003
    maxmem  6 MB
    faults  0

    $ time rg-14.0.3 -o .{100}https:.{50} test.json
    real    0.013
    user    0.006
    sys     0.000
    maxmem  8 MB
    faults  0

    $ time rg-14.0.3 --color never -o .{100}https:.{50} test.json
    real    0.006
    user    0.006
    sys     0.000
    maxmem  8 MB
    faults  0

    $ time dash -c "echo https:|./1.sh test.json 156 100"
    real    0.015
    user    0.024
    sys     0.004
    maxmem  7 MB
    faults  0

    $ time bash -c "echo https:|./1.sh test.json 156 100"
    real    0.016
    user    0.028
    sys     0.000
    maxmem  7 MB
    faults  0
(Notice that disabling color and line numbers for ripgrep improves its speed a fair bit. ired isn't doing either of those things, so it's only fair. GNU grep doesn't count line numbers by default and disabling color doesn't improve its perf here.)

This one is more interesting because it exposes the fact that many regex engines have trouble dealing with bounded repeats. Something like `.{100}` for example is not executed particularly efficiently in most regex engines. And indeed, in ripgrep by default, `.` actually matches the UTF-8 encoding of any Unicode scalar value (so between 1 and 4 bytes) and not any arbitrary byte. You'd need to pass the `--no-unicode` flag or prefix your pattern with `(?-u)` to match any arbitrary byte. And indeed, even then, `.` doesn't match `\n`. So you might even want `(?s-u)`. But since this is a grep and *greps are line oriented*, you'd need to enable multi-line mode in ripgrep (GNU grep doesn't have this):

    $ time rg-14.0.3 -Uo '(?s-u).{100}https:.{50}' test.json
    real    0.057
    user    0.041
    sys     0.006
    maxmem  8 MB
    faults  0

    $ time rg-14.0.3 --color never -N -Uo '(?s-u).{100}https:.{50}' test.json
    real    0.042
    user    0.041
    sys     0.000
    maxmem  8 MB
    faults  0
This actually runs slower, I believe, because it disables the line oriented optimizations that ripgrep uses. In this case, it isn't as good at detecting the `https:` literal and looking for that first. That's where `ired` can do (a lot) better, because it isn't line oriented and doesn't need to support arbitrary regex patterns, whereas greps are and do.

To complete this analysis, I'm going to do something that I realize is blasphemous to you and increase the input size by ten-fold. This will help us understand where time is being spent:

    $ time grep --color=never -Eo .{100}https:.{50} test.10x.json
    real    17.931
    user    17.906
    sys     0.017
    maxmem  7 MB
    faults  0

    $ time rg-14.0.3 --color never -N -o '.{100}https:.{50}' test.10x.json
    real    0.032
    user    0.017
    sys     0.010
    maxmem  23 MB
    faults  0

    $ time rg-14.0.3 --color always -N -o '.{100}https:.{50}' test.10x.json
    real    0.137
    user    0.034
    sys     0.019
    maxmem  23 MB
    faults  0

    $ time dash -c "echo https:|./1.sh test.10x.json 156 100"
    real    0.067
    user    0.089
    sys     0.069
    maxmem  7 MB
    faults  0
I compared the profiles of `rg --color=never` and `rg --color=always`, and they look about the same to me. This suggests to me that color is slower simply because rendering it in my terminal emulator is slower.

For grins, I also tried ugrep:

    $ time ugrep-4.4.1 --color=never -o '.{100}https:.{50}' test.10x.json
    real    6.003
    user    5.977
    sys     0.007
    maxmem  6 MB
    faults  0
Ouch. But not as bad as GNU grep.

So with a bigger input, we can see that `rg -o` is about twice as fast as ired, even on "ordinary" hardware.

And IMO, for inputs of the size you've provided, the difference is not meaningful.

Going back to your original prompt:

> Personally I'm interested in "grep -o" alternatives.

It seems to me like `rg -o` is quite serviceable in that regard, and at the very least, substantially better than GNU grep.

At this point, I wondered what ired did for substring search[2]. That immediately stuck out to me as something that looked wrong. Indeed:

    $ cat haystack
    ABAABAB
    $ echo -n BAB | od -An -tx1 | sed 's>^>/>;s/ //g' | ired -n haystack
    0x4
    $ echo -n ABAB | od -An -tx1 | sed 's>^>/>;s/ //g' | ired -n haystack
    $ rg -o ABAB haystack
    1:ABAB
So ired is a toy. One wonders how many search results you've missed over the years because of ired's feature "it's so minimal that it's wrong!" I mean sometimes tools have bugs. ripgrep has had bugs too. But this one has been in ired since 2009.

What is it that you said? YIKES. Yeah. Seems appropriate.

[1]: https://github.com/BurntSushi/dotfiles/blob/eace294fd80bfde1...

[2]: https://github.com/radare/ired/blob/a1fa7904e6ad239dde950de5...

1vuio0pswjnm7

About grep -o.

    # stat -c %s file
    6297285

    # file file
    file: ASCII text, with very long lines (1545), with CRLF, LF line terminators
Imagine file as a wall of text.

1. Printing byte offsets.

    # time grep -ob string file

    0.03user 0.08system 0:00.22elapsed 52%CPU (0avgtext+0avgdata 1104maxresident)k
    0inputs+0outputs (0major+86minor)pagefaults 0swaps

    # rg -V
    ripgrep 13.0.0
   
    # time rg -ob string file

    0.10user 0.17system 0:01.11elapsed 25%CPU (0avgtext+0avgdata 7804maxresident)k
    0inputs+0outputs (0major+559minor)pagefaults 0swaps

    # time sh -c "echo -n string|od -An -tx1|sed 's>^>/>;s/ //g'|ired -n file"

    0.03user 0.09system 0:00.18elapsed 67%CPU (0avgtext+0avgdata 720maxresident)k
    0inputs+0outputs (0major+189minor)pagefaults 0swaps
2. Printing some "context" around the matched string. For example, add characters immediately preceding string.

Baseline.

     # time grep -o string file

     0.02user 0.07system 0:00.15elapsed 65%CPU (0avgtext+0avgdata 1068maxresident)k
     0inputs+0outputs (0major+84minor)pagefaults 0swaps
Add one character.

     # time grep -o .string file

     0.21user 0.08system 0:00.36elapsed 83%CPU (0avgtext+0avgdata 1088maxresident)k
     0inputs+0outputs (0major+87minor)pagefaults 0swaps
Add another character.

     # time grep -o ..string file

     0.29user 0.09system 0:00.46elapsed 82%CPU (0avgtext+0avgdata 1064maxresident)k
     0inputs+0outputs (0major+88minor)pagefaults 0swaps

     # time rg -o ..string file

     0.13user 0.13system 0:00.90elapsed 28%CPU (0avgtext+0avgdata 9012maxresident)k
     0inputs+0outputs (0major+574minor)pagefaults 0swaps
Yikes.

Now let's try ired. Another shell script. This one will print all occurrences of string.

     cat > 1.sh << eof
     #!/bin/sh
     # usage: echo string|1.sh file [blocksize] [seek]"
     read x;
     x=$(echo -n $x|xxd -p);
     echo /$x \
     |ired -n $1 \
     |sed "s/.*/s&@s-${3-0}@X$2/" \
     |tr @ '\12' \
     |ired -q -i /dev/stdin $1 \
     |sed 's/.*/w&0a/' \
     |ired -n /dev/stdout
     eof
Baseline.

     # echo string|time sh 1.sh 6

     0.11user 0.10system 0:00.17elapsed 127%CPU (0avgtext+0avgdata 772maxresident)k
     0inputs+0outputs (0major+466minor)pagefaults 0swaps
Add one character before string.

     # echo string|time sh 1.sh 7 1

     0.12user 0.09system 0:00.16elapsed 131%CPU (0avgtext+0avgdata 740maxresident)k
     0inputs+0outputs (0major+473minor)pagefaults 0swaps
Add another.

     # echo string|time sh 1.sh 8 2

     0.12user 0.11system 0:00.20elapsed 112%CPU (0avgtext+0avgdata 744maxresident)k
     0inputs+0outputs (0major+461minor)pagefaults 0swaps
Perhaps grep or ripgrep might be slightly faster at printing byte offsets.

But ired is faster at printing matches with context. (NB. Context here means characters, not lines.)

Try using ripgrep to print offsets for ired.

    #!/bin/sh
    read x; 
    rg --no-mmap -ob $x $1 \
    |cut -d: -f1 \
    |sed "s/.*/s&@s-${3-0}@X$2/" \
    |tr @ '\12' \
    |ired -q -i /dev/stdin $1 \
    |sed 's/.*/w&0a/' \
    |ired -n /dev/stdout

    # time sh -c "echo string|1.sh file 8 2"

    0.11user 0.06system 0:00.18elapsed 101%CPU (0avgtext+0avgdata 5972maxresident)k
    0inputs+0outputs (0major+905minor)pagefaults 0swaps

    # stat -c %s /usr/bin/ired /usr/bin/grep /usr/bin/rg

    37544
    219248
    5074800

burntsushi

OK, so first of all, let's get one thing cleared up. What the heck is ired? It isn't in the Archlinux package repos. I found this[1], but it looks like an incomplete and abandoned project. It doesn't even have proper docs:

    $ ired -h
    ired [-qhnv] [-c cmd] [-i script] [-|file ..]
    $ ired --help
    $
So like, I don't even know what `ired -n` is doing. From what I can tell from your commands, it's searching for `string`, but you first need to convert it to a hexadecimal representation.

But okay, let's also check the output between the commands and make sure they're the same. I used my own file:

    $ time grep -ob string 1-2048.txt
    333305:string
    333380:string
    920494:string
    5166701:string
    5210094:string
    6775219:string

    real    0.006
    user    0.006
    sys     0.000
    maxmem  15 MB
    faults  0

    $ time rg -ob string 1-2048.txt
    13123:333305:string
    13124:333380:string
    33382:920494:string
    159885:5166701:string
    161059:5210094:string
    211466:6775219:string

    real    0.003
    user    0.000
    sys     0.003
    maxmem  15 MB
    faults  0

    $ time sh -c "echo -n string|od -An -tx1|sed 's>^>/>;s/ //g'|ired -n 1-2048.txt"

    0x515f9
    0x51644
    0xe0bae
    0x4ed66d
    0x4f7fee
    0x6761b3

    real    0.013
    user    0.010
    sys     0.004
    maxmem  15 MB
    faults  0
Indeed, the hexadecimal offsets printed by ired line up with the offsets printed by grep and ripgrep. Notice also the timing. ired is slower here for me.

OK, now let's do context:

    $ time grep -ob string 1-2048.txt
    [..snip..]
    real    0.006
    user    0.006
    sys     0.000
    maxmem  16 MB
    faults  0

    $ time grep -ob .string 1-2048.txt
    [..snip..]
    real    0.005
    user    0.003
    sys     0.003
    maxmem  16 MB
    faults  0

    $ time grep -ob ..string 1-2048.txt
    [..snip..]
    real    0.006
    user    0.003
    sys     0.003
    maxmem  16 MB
    faults  0

    $ time rg -ob string 1-2048.txt
    [..snip..]
    real    0.004
    user    0.003
    sys     0.000
    maxmem  16 MB
    faults  0
    $ time rg -ob .string 1-2048.txt
    [..snip..]
    real    0.004
    user    0.000
    sys     0.003
    maxmem  16 MB
    faults  0
    $ time rg -ob ..string 1-2048.txt
    [..snip..]
    real    0.004
    user    0.004
    sys     0.000
    maxmem  16 MB
    faults  0
I don't see anything worth saying "yikes" about here.

One possible explanation for the timing differences is that your search has a lot of search results. The match count is a crucial part of benchmarking, and you've made the same mistake as the ugrep author by omitting it. But okay, let me try a search with more hits.

    $ time rg -ob the 1-2048.txt | wc -l
    60509

    real    0.011
    user    0.006
    sys     0.006
    maxmem  16 MB
    faults  0

    $ time rg -ob .the 1-2048.txt | wc -l
    60477

    real    0.014
    user    0.014
    sys     0.000
    maxmem  16 MB
    faults  0

    $ time rg -ob ..the 1-2048.txt | wc -l
    60359

    real    0.014
    user    0.014
    sys     0.000
    maxmem  16 MB
    faults  0
A little slower, but that's what you'd expect with the higher match frequency. Now let's try your script for 1.sh:

    $ echo the | time sh 1.sh 1-2048.txt 6 | wc -l
    63304

    real    0.048
    user    0.072
    sys     0.052
    maxmem  16 MB
    faults  0

    $ echo the | time sh 1.sh 1-2048.txt 7 1 | wc -l
    63336

    real    0.056
    user    0.096
    sys     0.042
    maxmem  16 MB
    faults  0

    $ echo the | time sh 1.sh 1-2048.txt 8 2 | wc -l
    63419

    real    0.053
    user    0.079
    sys     0.049
    maxmem  16 MB
    faults  0
(The counts are a little different because `..the` matches fewer things than `the` when given to grep, but presumably `ired` doesn't care about that.)

But in any case, ired is quite a bit slower here.

OK, let's pop up a level. Your benchmark is somewhat flawed, for three reasons. First, the timings are so short that the differences here are generally irrelevant to human perception. It reminds me of the time when ripgrep came out, and someone would respond with a "gotcha" that `ag` was faster because it ran a search on a tiny repository in 10ms whereas ripgrep took 12ms. That's not quite exactly the same as what's happening here, but it's close. Second, the haystack is so short that overhead is likely playing a role here. Third, the timings are too short to be reliable indicators of performance as the haystack size scales. See my commentary on ugrep's benchmarks[2].

Let's try a bigger file:

    $ stat -c %s eigth.txt
    1621035918

    $ file eigth.txt
    eigth.txt: ASCII text

    $ time rg -ob Sherlock eigth.txt | wc -l
    1068

    real    0.154
    user    0.103
    sys     0.050
    maxmem  1551 MB
    faults  0

    $ time rg -ob .Sherlock eigth.txt | wc -l
    935

    real    0.156
    user    0.096
    sys     0.060
    maxmem  1551 MB
    faults  0

    $ time rg -ob ..Sherlock eigth.txt | wc -l
    932

    real    0.154
    user    0.107
    sys     0.047
    maxmem  1551 MB
    faults  0
And now ired:

    $ echo Sherlock | time sh 1.sh eigth.txt 6 | wc -l
    1068

    real    1.393
    user    0.671
    sys     0.729
    maxmem  16 MB
    faults  0

    $ echo Sherlock | time sh 1.sh eigth.txt 7 1 | wc -l
    1201

    real    1.391
    user    0.604
    sys     0.793
    maxmem  16 MB
    faults  0

    $ echo Sherlock | time sh 1.sh eigth.txt 8 2 | wc -l
    1204

    real    1.395
    user    0.578
    sys     0.823
    maxmem  16 MB
    faults  0
Yikes. Over an order of magnitude slower.

Note that the memory usage reported for ripgrep is high just because it's using file-backed memory maps. It's not actual heap usage. You can check this by disabling memory maps:

    $ time rg -ob ..Sherlock eigth.txt --no-mmap | wc -l
    932

    real    0.179
    user    0.063
    sys     0.116
    maxmem  16 MB
    faults  0
And if we increase the match frequency on the same large haystack, the gap closes a little, but ired is still about 4x slower:

    $ time rg -ob ..the eigth.txt | wc -l
    13141187

    real    2.470
    user    2.418
    sys     0.050
    maxmem  1551 MB
    faults  0

    $ echo the | time sh 1.sh eigth.txt 8 2 | wc -l
    13894916

    real    10.027
    user    16.293
    sys     8.122
    maxmem  402 MB
    faults  0
I'm not clear on why you're seeing the results you are. It could be because your haystack is so small that you're mostly just measuring noise. ripgrep 14 did introduce some optimizations in workloads like this by reducing match overhead, but I don't think it's anything huge in this case. (And I just tried ripgrep 13 on the same commands above and the timings are similar if a tiny bit slower.)

[1]: https://github.com/radare/ired

[2]: https://github.com/BurntSushi/ripgrep/discussions/2597

comex

Interesting, it supports an n-gram indexer. ripgrep has had this planned for a few years now [1] but hasn't implemented it yet. For large codebases I've been using csearch, but it has a lot of limitations.

Unfortunately... I just tried the indexer and it's extremely slow on my machine. It took 86 seconds to index a Linux kernel tree, while csearch's cindex tool took 8 seconds.

[1] https://github.com/BurntSushi/ripgrep/issues/1497
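For reference, the csearch workflow is roughly this (a sketch from memory; cindex keeps its index at ~/.csearchindex unless CSEARCHINDEX points elsewhere, and the path/pattern are just examples):

    $ cindex ~/src/linux          # build or refresh the trigram index
    $ csearch -n 'netif_rx\('     # then search the indexed tree, grep-style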

dtgriscom

That's close to a gig of disk reads; I trust you didn't try ugrep first and then cindex second, without taking into account caching.

comex

I ran both multiple times, alternating (and making sure to clean out the indexes in between). Results were reasonably consistent across runs.

jgalt212

If you're gonna go the csearch route, you should also consider hound. I use it many times per day.

https://github.com/hound-search/hound

bishfish

It creates per-directory index files on its first run. ugrep-indexer is also labeled as beta. A couple of relevant quotes from its GitHub site:

“Indexing adds a hidden index file ._UG#_Store to each directory indexed.”

“Re-indexing is incremental, so it will not take as much time as the initial indexing process.”

o11c

Important note: not actually compatible. It took me seconds to find an option that does something completely different than the GNU version.

burntsushi

Indeed. And here are some concrete examples around locale:

    $ grep -V | head -n1
    grep (GNU grep) 3.11
    $ alias ugrep-grep="ugrep-4.4.1 -G -U -Y -. --sort -Dread -dread"
    $ echo 'pokémon' | LC_ALL=en_US.UTF-8 grep 'pok[[=e=]]mon'
    pokémon
    $ echo 'pokémon' | LC_ALL=en_US.UTF-8 ugrep-grep 'pok[[=e=]]mon'
    $ echo 'γ' | LC_ALL=en_US.UTF-8 grep -i 'Γ'
    γ
    $ echo 'γ' | LC_ALL=en_US.UTF-8 ugrep-grep -i 'Γ'
BSD grep works like GNU grep too:

    $ grep -V
    grep (BSD grep, GNU compatible) 2.6.0-FreeBSD
    $ echo 'pokémon' | LC_ALL=en_US.UTF-8 grep 'pok[[=e=]]mon'
    pokémon
    $ echo 'γ' | LC_ALL=en_US.UTF-8 grep -i 'Γ'
    γ

fwip

Which option is that? I'm scanning the ugrep page, but nothing is popping out to me.

e12e

I would assume compatible meant POSIX/BSD, unless explicitly advertised as "GNU grep compatible"?

burntsushi

From the OP: "Ugrep is compatible with GNU grep and supports GNU grep command-line options."

zaidhaan

A little off-topic, but I'd love to see a tool similar to this that provides real-time previews for an entire shell pipeline which, most importantly, integrates into the shell. This allows for leveraging the completion system to complete command-line flags and using the line editor to navigate the pipeline.

In zsh, the closest thing I've gotten to this was to bind Ctrl-\ to the `accept-and-hold` zle widget, which executes what is in the current buffer while still retaining it and the cursor position. That gets me close (no more ^P^B^B^B^B for editing), but I'd much rather see the result of the pipeline in real-time rather than having to manually hit a key whenever I want to see the result.
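For anyone who wants the same setup, a sketch of that binding for ~/.zshrc (key syntax from memory; adjust to taste):

    # Execute the current buffer but keep it (and the cursor position) for further editing.
    bindkey '^\\' accept-and-hold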

wazzaps

Sounds similar to this: https://github.com/akavel/up

tacone

I guess Alt+a is the default zsh shortcut for that.

ijustlovemath

Any particular reason why newer tools don't follow the well-established XDG standard for config files? Those folder structures probably already exist on end-user machines, and it keeps your home directory from getting cluttered with tens of config files.
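For reference, the usual fallback idiom from the XDG spec looks like this in shell (the tool name is just a placeholder):

    config_dir="${XDG_CONFIG_HOME:-$HOME/.config}/sometool"
    cache_dir="${XDG_CACHE_HOME:-$HOME/.cache}/sometool"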

xcdzvyn

Slight rant/aside but Firefox is bad for this. You can point it to a custom profile path (e.g. .config/mozilla) but ~/.mozilla/profile.ini MUST exist. Only that one file - you can move everything else.

ijustlovemath

In my mind, this is fine, as Firefox predates the standard by a long time. But newer tools specifically should know better.

tedunangst

XDG isn't recognized as an authority outside of XDG.

burntsushi

For ripgrep at least, you set an environment variable telling it where to look for a config file. You can put it anywhere, so you don't need to put it in $HOME.
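Concretely, that looks like this (the variable is RIPGREP_CONFIG_PATH; the config file lists one flag per line, and the path here is just an example):

    $ export RIPGREP_CONFIG_PATH="$HOME/.config/ripgrep/rc"
    $ cat "$RIPGREP_CONFIG_PATH"
    --smart-case
    --hidden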

I didn't do XDG because this route seemed simpler, and XDG isn't something that is used everywhere.

smaudet

The standard should be: the tool tells you where it's configured, how to change the config, and it picks a 'standard' default location, such as XDG.

Assuming you aren't doing weird things with paths, I can work around 'dumb lazy' developers releasing half-assed tools with symlinks/junctions, but I really don't want to spend a ton of time configuring your tool or fighting its presumptions.

burntsushi

Oh okay, I guess you've got it figured out. Now specify it in enough detail for others to implement it, get all stakeholders to agree and get everyone to implement it exactly to the spec.

Good luck. You're already off to a rough start with XDG, since that isn't what is used on Windows. And it's unclear whether it ought to be used on macOS.

Hendrikto

> I didn't do XDG because this route seemed simpler

Simpler how? This requires custom config, instead of following what I set system-wide.

> and XDG isn't something that is used everywhere.

Yeah, that's why it defines defaults to fall back on.

burntsushi

It's far simpler to implement.

No, you don't understand. I'm not saying the XDG variables might not be defined. Give me a little credit here lol. I have more than a passing familiarity with XDG. I've implemented it before. I'm saying the XDG convention itself may not apply. For example, Windows. And it's controversial whether to use them on macOS, when I last looked into it.

I don't see any significant problem with defining an environment variable. You likely already have dozens defined. I know I do.

I'm not trying to convince you of anything. Someone asked why. This is why for ripgrep at least.

Joel_Mckay

Someone please just standardize the grep flags across all platforms.

Specifically -P / --perl-regexp support on MacOS and FreeBSD

It really would reduce the WTF moments for the students.

Insert jokes about standards below... =)

burntsushi

That's what POSIX was supposed to be.

It's easier IMO to just use the same tool on all platforms. Which you can of course do.

Joel_Mckay

Not sure if brew's grep is as NERF'ed, but the POSIX standard is often just a minimal subset of the features in the GNU version.

Cheers, =)

burntsushi

Yes, that's the problem. You need to pay close attention to know which things are POSIX. And in the case of GNU grep, you actually need to set POSIXLY_CORRECT=1. Otherwise its behavior is not a subset.

POSIX also forbids greps from searching UTF-16 because it mandates that certain characters always use a single byte. ripgrep, for example, doesn't have this constraint and thus can transparently search UTF-16 correctly via BOM sniffing.
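A quick way to see that difference (a sketch; it assumes iconv emits a BOM for UTF-16, and exact diagnostics vary by grep version):

    $ printf 'Sherlock Holmes\n' | iconv -f UTF-8 -t UTF-16 > utf16.txt
    $ rg Sherlock utf16.txt       # BOM sniffing + transcoding finds the match
    1:Sherlock Holmes
    $ grep Sherlock utf16.txt     # no match: the raw bytes never contain the ASCII literal
    $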

karakanb

Slightly off topic, but how does one publish so many installable versions of a binary across all the package managers? I figured out how to do it for Brew, but the rest seems like a billion different steps that need to be done and I feel like I am missing something.

wint3rmute

You only have to set up CI/CD once for each package type; afterwards, all the packaging work is done for you automatically.

Ripgrep is also quite a large project (judging by both star count and contribution count), so people probably volunteer to support their platform/package manager of choice.

mathverse

Also look at https://github.com/stealth/grab from Sebastian Krahmer.

meowface

ripgrep, grab, ugrep, hypergrep... Any of the four is probably fast enough for any of my use cases, but I suddenly feel tempted to micro-optimize and spend ages comparing them all.

infamia

Ugrep is also available in Debian-based repos, which is super nice.

louwrentius

I will never learn this tool.

I will not even contemplate using this tool.

The reason is very simple: I can trust 'grep' to be on any system I ever touch. Learning ugrep doesn't make any sense as I can't trust it to be available.

I could still use it on my own systems, but I work on customer systems which won't have this tool installed.

And I'm proficient enough with grep that it's 'good enough'; I'm not focusing on a better grep. I'm focusing on fixing a problem, or trying something new.

I'd rather invest my time into something that will benefit me across all environments I work with.

Just because a tool may be 'better' (whatever that means) doesn't mean it will see adoption.

This is not about being close-minded; it's about focusing on what's really important.

jftuga

I really like the fuzzy match feature. Useful for typos or for being off by 1-2 characters.

https://github.com/Genivia/ugrep#fuzzy
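A rough sketch of what that looks like (from memory: -Z enables fuzzy matching and -Z2 allows up to two edits; see the linked docs for the exact syntax, and the file name is a placeholder):

    $ ugrep -Z2 'seperate' notes.txt    # still finds lines containing "separate" despite the typo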
