ashvardanian
Haven't benchmarked *grep implementations, but assuming those are just CLI wrappers around RegEx libraries, I'd expect the RegEx benchmarks to be broader and more representative.
There, hyperscan is generally the king, which means hypergrep numbers are likely accurate: https://github.com/p-ranav/hypergrep?tab=readme-ov-file#dire...
Disclaimer: I rarely use any *grep utilities, but often implement string libraries.
burntsushi
I'm the author of ripgrep and its regex engine.
Your claim is true to a first approximation. But greps are line oriented, and that means there are optimizations that can be done that are hard to do in a general regex library. You can read more about that here: https://blog.burntsushi.net/ripgrep/#anatomy-of-a-grep (greps are more than simple CLI wrappers around a regex engine).
If you read my commentary in the ripgrep discussion above, you'll note that it isn't just about the benchmarks themselves being accurate, but the model they represent. Nevertheless, I linked the hypergrep benchmarks not because of Hyperscan, but because they were done by someone who isn't the author of either ripgrep or ugrep.
As for regex benchmarks, you'll want to check out rebar: https://github.com/BurntSushi/rebar
You can see my full thoughts around benchmark design and philosophy if you read the rebar documentation. Be warned though, you'll need some time.
There is a fork of ripgrep with Hyperscan support: https://sr.ht/~pierrenn/ripgrep/
Hyperscan also has some peculiarities in how it reports matches. You won't notice it in basic usage, but it will appear when using something like the -o/--only-matching flag. For example, Hyperscan will report matches of a, b and c for the regex \w+, whereas a normal grep will just report a match of abc. (And this makes sense given the design and motivation for Hyperscan.) Hypergrep goes to some pain to paper over this, but IIRC the logic is not fully correct. I'm on mobile, otherwise I would link to the reddit thread where I had a convo about this with the hypergrep author.
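To make the semantic difference concrete, here is a small Python sketch. Python's re stands in for the "normal grep" engine; the Hyperscan side is simulated, since Hyperscan (without start-of-match tracking) reports an event at every offset where some match of the pattern ends:

```python
import re

pat = re.compile(r"\w+")
hay = "abc"

# Standard grep-style semantics: leftmost, non-overlapping matches.
standard = [m.group() for m in pat.finditer(hay)]

# Hyperscan-style semantics (simulated): an event fires at every end
# offset where *some* match of the pattern ends, so \w+ on "abc"
# reports ends at offsets 1, 2 and 3 rather than one match "abc".
ends = sorted({e
               for s in range(len(hay))
               for e in range(s + 1, len(hay) + 1)
               if pat.fullmatch(hay, s, e)})

print(standard)  # ['abc']
print(ends)      # [1, 2, 3]
```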
haberman
> I'm on mobile, otherwise I would link to the reddit thread where I had a convo about this with the hypergrep author.
From some searching I think you might mean this: https://www.reddit.com/r/cpp/comments/143d148/hypergrep_a_ne...
burntsushi
OK, now that I have hands on a keyboard, this is what I meant by Hyperscan's match semantics being "peculiar":
$ echo 'foobar' | hg -o '\w{3}'
1:foobar
$ echo 'foobar' | grep -E -n -o '\w{3}'
1:foo
1:bar
Here's the aforementioned reddit thread: https://old.reddit.com/r/cpp/comments/143d148/hypergrep_a_ne...
I want to be clear that these are intended semantics as part of Hyperscan. It's not a bug with Hyperscan. But it is something you'll need to figure out how to deal with (whether that's papering over it somehow, although I'm not sure that's possible, or documenting it as a difference) if you're building a grep around Hyperscan.
frankjr
It might be the intended behavior of Hyperscan but it really feels like a bug in Hypergrep to report the matches like this - you cannot report a match which doesn't fully match the regex...
I also wonder if there's a performance issue when matching a really long line, because Hyperscan is not greedy and will ping back to Hypergrep for every sub-match. I'm guessing this is the reason for those shenanigans in the callback [0].
$ python -c 'print("foo" + "bar" * 3000)' | hg -o 'foo.*bar'
[0] https://github.com/p-ranav/hypergrep/blob/ee85b713aa84e0050a...
kazinator
How about: use Hyperscan to round up all the lines that contain matches, and process those again with regex for the "-o" semantics.
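A sketch of that two-pass idea in Python. re stands in for both engines here; the "fast filter" is a hypothetical stand-in for Hyperscan's cheap does-this-line-match-at-all answer:

```python
import re

def grep_o_two_pass(pattern, lines, fast_filter):
    # Pass 1: a cheap containment check (stand-in for Hyperscan) selects
    # candidate lines. Pass 2: a conventional regex engine re-scans only
    # those lines to recover proper -o match extents.
    rx = re.compile(pattern)
    out = []
    for lineno, line in enumerate(lines, 1):
        if not fast_filter(line):
            continue
        for m in rx.finditer(line):
            out.append((lineno, m.group()))
    return out

lines = ["foobar", "no digits here", "x12 y345"]
hits = grep_o_two_pass(r"\d+", lines, lambda l: any(c.isdigit() for c in l))
print(hits)  # [(3, '12'), (3, '345')]
```

The second pass only pays the cost of full match extraction on lines that actually contain a hit, which is the point of the suggestion.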
cozzyd
is that an alias, or does hypergrep really use the same command name as mercurial?
infocollector
I think you should try it before you read these conflicting benchmarks from the authors: https://github.com/Genivia/ugrep-benchmarks
1vuio0pswjnm7
rg uses a lot of memory in the OpenSubtitles test. 903M vs 29M for ugrep. Unlike the previous test, we are not told the size of the file being searched.
Would be interesting to see comparisons where memory is limited, i.e., where the file being searched will not fit entirely into memory.
Personally I'm interested in "grep -o" alternatives. The files I'm searching are text but may have few newlines. For example I use ired instead of grep -o. ired will give the offsets of all matches, e.g.,
echo /\"something\"|ired -n 1.htm
Quick and dirty script, not perfect:
#!/bin/sh
test $# -gt 0||echo "usage: echo string|${0##*/} file [blocksize] [seek] [match-no]"
{
read x;
x=$(echo /\""$x"\"|ired -n $1|sed -n ${4-1}p);
test "$x"||exit 1;
echo
printf s"$x"'\n's-${3-0}'\n'x$2'\n'|ired -n $1;
echo;
printf s"$x"'\n's-${3-0}'\n'X$2'\n'|ired -n $1;
echo;
echo w$(printf s"$x"'\n's-${3-0}'\n'X$2'\n'|ired -n $1)|ired -n /dev/stdout;
echo;
}
Another script I use loops through all the matches.
burntsushi
> rg uses a lot of memory in the OpenSubtitles test. 903M vs 29M for ugrep. Unlike the previous test, we are not told the size of the file being searched.
Which test exactly? That's just likely because of memory maps futzing with the RSS data. Not actually more heap memory. Try with --no-mmap.
I'm not sure I understand the rest of your comment about grep -o. Grep tools usually have a flag to print the offset of each match.
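For instance, a minimal illustration: GNU grep's -b combined with -o prints offset:match per line, and ripgrep's -ob is in the same spirit (when searching files, rg also prefixes a line number column):

```shell
# Byte offset of each match, one per line (offset:match):
printf 'foo bar foo\n' | grep -bo foo
# 0:foo
# 8:foo

# ripgrep equivalent:
# printf 'foo bar foo\n' | rg -ob foo
```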
EDIT: Now that I have hands on a keyboard, I'll demonstrate the mmap thing. First, ugrep:
$ time ugrep-4.4.1 -c '\w+\s+Sherlock\s+Holmes\s+\w+' sixteenth.txt
72
real 22.115
user 22.015
sys 0.093
maxmem 30 MB
faults 0
$ time ugrep-4.4.1 -c '\w+\s+Sherlock\s+Holmes\s+\w+' sixteenth.txt --mmap
72
real 21.776
user 21.749
sys 0.020
maxmem 802 MB
faults 0
And now for ripgrep: $ time rg-14.0.3 -c '\w+\s+Sherlock\s+Holmes\s+\w+' opensubtitles/2018/en/sixteenth.txt
72
real 0.076
user 0.046
sys 0.030
maxmem 779 MB
faults 0
$ time rg-14.0.3 -c '\w+\s+Sherlock\s+Holmes\s+\w+' opensubtitles/2018/en/sixteenth.txt --no-mmap
72
real 0.087
user 0.033
sys 0.053
maxmem 15 MB
faults 0
It looks like the difference here is that ripgrep chooses to use a memory map by default. I don't think it makes much of a difference here. If the file were bigger than available memory, then the OS would automatically handle paging.
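The RSS effect can be illustrated with a sketch using Python's mmap module as a stand-in for what a mmap-based search does: pages of a file-backed map that get touched during a scan are charged to the process RSS even though nothing is heap-allocated, which is why `maxmem` looks inflated.

```python
import mmap
import os
import tempfile

# Write a small stand-in haystack to disk.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"foo bar foo baz foo\n")

# File-backed memory map: the OS pages data in on demand. Every page
# touched during the scan counts toward RSS, but it is page cache, not
# heap, and the OS can reclaim it under memory pressure.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    offsets = []
    pos = m.find(b"foo")
    while pos != -1:
        offsets.append(pos)
        pos = m.find(b"foo", pos + 1)

os.unlink(path)
print(offsets)  # byte offsets of each match
```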
1vuio0pswjnm7
ripgrep is not for me.
burntsushi
I never argued otherwise. Especially since you clearly don't mind false negatives. ;-)
1vuio0pswjnm7
task: printing non-repeating patterns in relatively small files to the screen, optionally with some context
context should be printed exactly as it appears in the file, i.e., newlines should be printed
ired vs ripgrep, which one is better suited for this task
one uses regular expressions, the other does not
one is a 76k static binary that fits in 2MB L2 cache, the other is a 5.7MB dynamically-linked binary
2 shell scripts to demonstrate differences
usage: echo pattern|1.sh [num chars before] [num chars after]
1. "1.sh" using 5.7MB binary, PCRE2
#!/bin/sh
read x;
case $# in :)
;;0)exec echo "usage: ${0##*/} file [num chars before] [num chars after]"
;;1)exec rg -uuu --no-unicode --block-buffered --color=never -NUo "$x" $1
esac
case $# in 2|3)printf "((.)|(\\\\n)|(\\\\r)){"$2"}$x((.)|(\\\\n)|(\\\\r)){"${3-0}"}";esac \
|rg -f/dev/stdin -uuu --no-unicode --block-buffered --color=never -NUo $1
2. "2.sh" using 76k static binary, no regular expressions
#!/bin/sh
read x;
len=${#x};
b=$(($2+$len));
case $# in 0)exec echo "usage ${0##*/} file [num chars before] [num chars after]"
;;2)b=$(($2+$len))
;;3)b=$(($3+$2+$len))
esac
echo "$x" > .x
{ printf /;ired -n -c X1 .x;} \
|ired -n $1 \
|sed "s/.*/s&@s-${2-0}@b$b@X/;" \
|tr @ '\12' \
|ired -n $1 \
|sed 's/.*/w&0a/' \
|ired -n /dev/stdout \
|sed -e '/^Invalid hexpair/d'
generate test data:
curl -4si0 -A "" https://www.google.com > test.html
usage example: find the pattern "(" in test.html, display results to screen
echo \(|1.sh test.html
regex parse error:
(?:()
^
error: unclosed group
echo '[(]'|1.sh test.html
echo \(|2.sh test.html
cat .x
observation: the task is simple but 1.sh may require more typing and knowledge of regular expressions
observation: 2.sh does not require knowledge of PCRE; the pattern requires no extra chars, e.g., brackets
usage example: find the pattern "(a" in test.html, display results to screen with 0 chars before and 3 chars after
echo '[(]a'|1.sh test.html 0 3|sed -n l|less -N
echo \(a|2.sh test.html 0 3|sed -n l|less -N
observation: 1.sh does not include the newline after match #187; some workaround is required for 1.sh
conclusion: for me, ripgrep is too large and complicated for this simple task involving relatively small files; it's overkill. It does not feel any faster than ired at the command line. In fact, it feels slower. Like python or java, or other large rust/go binaries, there is a small initial delay, a jank, whereas ired feels very smooth.
burntsushi
I love how you continue to ignore the fact that ired produces incorrect results.
Also:
You can use -F to make the argument to ripgrep be interpreted as a literal. No knowledge of regex is needed. It's a standard grep flag.
You also aren't using PCRE. You're using ripgrep's default engine, which is the regex crate. You need to pass -P to use PCRE2. Although I don't see the point in doing so.
I find your overall comparison here to be disingenuous personally. You can't even be arsed to acknowledge that ired returns incorrect results. And every benchmark I've run has shown ripgrep to be faster or just as fast. There's no jank.
I already acknowledged that the rg binary is beefy. It is actually statically linked by default (although it may dynamically link C libraries). I don't care if rg is 5MB. If you do, then rg isn't for you. You can keep using broken software instead.
1vuio0pswjnm7
xbps-query -RS ripgrep |sed -n 11,21p
pkgname: ripgrep
pkgver: ripgrep-14.0.3_1
repository: https://repo-default.voidlinux.org/current/musl
run_depends:
libgcc>=4.4.0_1
libpcre2>=10.22_1
musl>=1.1.24_7
shlib-requires:
libc.so
libgcc_s.so.1
libpcre2-8.so.0
It would be nice to have a ripgrep without libpcre2.
It also would be nice to use BRE by default and make ERE optional, similar to grep.
What would compiling ripgrep from source entail. Would it be as easy as compiling ired.
ired compiles in seconds and compiling requires less than 1MB of disk space. No connection to any server is required to compile the program.
Let's edit the 1.sh script to add the -F option and try our example search again to see what happens.
#!/bin/sh
read x;
case $# in :)
;;0)exec echo "usage: ${0##*/} file [chars before] [chars after]"
;;1)exec rg -F --no-unicode --block-buffered --color=never -NUo "$x" $1
esac
case $# in 2|3)printf "((.)|(\\\\n)|(\\\\r)){"$2"}$x((.)|(\\\\n)|(\\\\r)){"${3-0}"}";esac \
|rg -f/dev/stdin -F --no-unicode --block-buffered --color=never -NUo $1
echo \(|1.sh test.html
echo \(a|1.sh test.html 0 3
As expected, this produces no output.
We cannot add the surrounding context characters as literals because we do not know the identity of these characters. That is what we are attempting to find out.
Would I ever search for a repeating pattern such as \(a\(a using ired? The answer is no; I am looking for context. I would search for \(a and then add a request for context, a number of characters before and/or after, as in the examples. Again, I do not know what those characters will be; that is what I am searching for. If the pattern repeats, this would be visible from viewing the context.
For line-delimited files where data is presented in a regular format, grep -A, -B and -C work great for printing context. But for files that can be idiosyncratic in how they present data and/or files that lack consistent newline delimiters, for me, grep -o is inadequate for printing context.
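The character-context idea (as opposed to line context) can be sketched in a few lines of Python. This is an illustration of the approach, not the ired pipeline itself:

```python
import re

def matches_with_context(data: bytes, needle: bytes, before: int, after: int):
    # For each occurrence of needle, emit `before` bytes preceding and
    # `after` bytes following it -- newlines included -- rather than
    # whole lines, which is what line-oriented -A/-B/-C cannot give you.
    out = []
    for m in re.finditer(re.escape(needle), data):
        s = max(0, m.start() - before)
        e = min(len(data), m.end() + after)
        out.append(data[s:e])
    return out

data = b"aa(a\nbb(a cc"
print(matches_with_context(data, b"(a", 1, 3))  # [b'a(a\nbb', b'b(a cc']
```

Note the first result carries the embedded newline verbatim, which is the behavior the comment above is asking for.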
1vuio0pswjnm7
The failure of grep/ripgrep to display the newline character contained in the context in match #178 could be characterised as a "false negative".
1vuio0pswjnm7
1. Retrieve test.json
curl -i40A "" "https://api.crossref.org/works?query=unix&rows=1000" > test.json
2. Create shell script
#!/bin/sh
# usage: echo string|1.sh file [blocksize] [seek]"
read x;
x=$(echo -n $x|od -An -tx1|tr -d '\40');
echo /$x \
|ired -n $1 \
|sed "s/.*/s&@s-${3-0}@X$2/" \
|tr @ '\12' \
|ired -q -i /dev/stdin $1 \
|sed 's/.*/w&0a/' \
|ired -n /dev/stdout
We can make the script slightly faster by using busybox
#!/bin/sh
# usage: echo string|1.sh file [blocksize] [seek]"
read x;
x=$(echo -n $x|busybox od -An -tx1|busybox tr -d '\40');
echo /$x \
|ired -n $1 \
|busybox sed "s/.*/s&@s-${3-0}@X$2/" \
|busybox tr @ '\12' \
|ired -q -i /dev/stdin $1 \
|busybox sed 's/.*/w&0a/' \
|ired -n /dev/stdout
NB. If redirecting output to a file, replace /dev/stdout with the file name.
ired is available on Void Linux
https://ftp.lysator.liu.se/pub/voidlinux/static/
xbps-query.static -Rs ired-0
xbps-install.static ired
3. Test grep v3.6, ripgrep v14.0.3 and shell script; busybox is v1.34.1
busybox time grep -Eo .{35}https:.{4} test.json;
busybox time rg -o .{35}https:.{4} test.json;
busybox time sh -c "echo https:|1.sh 45 35 test.json"
We can make the script slower by using bash
busybox time bash -c "echo https:|1.sh 45 35 test.json"
Program size
du -h /usr/bin/grep
216K/usr/bin/grep
du -h /usr/bin/rg
5.7M/usr/bin/rg
du -hc /usr/bin/ired /bin/dash /usr/bin/tr /usr/bin/sed /usr/bin/od
456K/bin/dash
40K/usr/bin/ired
56K/usr/bin/tr
68K/usr/bin/od
104K/usr/bin/sed
724Ktotal
du -h /usr/bin/busybox /usr/bin/ired
772K/usr/bin/busybox
40K/usr/bin/ired
812Ktotal
readelf -d /bin/dash /usr/bin/busybox
File: /bin/dash
There is no dynamic section in this file.
File: /usr/bin/busybox
There is no dynamic section in this file.
burntsushi
OK, so I'll try your commands:
$ busybox time grep -Eo .{35}https:.{4} test.json
real 0m 0.15s
user 0m 0.15s
sys 0m 0.00s
$ busybox time rg-14.0.3 -o .{35}https:.{4} test.json
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
$ busybox time dash -c "echo https:|./1.sh test.json 45 35"
real 0m 0.01s
user 0m 0.01s
sys 0m 0.00s
$ busybox time bash -c "echo https:|./1.sh test.json 45 35"
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
$ busybox time dash -c "echo https:|./busy-1.sh test.json 45 35"
real 0m 0.00s
user 0m 0.01s
sys 0m 0.00s
$ busybox time bash -c "echo https:|./busy-1.sh test.json 45 35"
real 0m 0.01s
user 0m 0.01s
sys 0m 0.00s
So grep -o takes 150ms, but both ripgrep and ired are seemingly instant. But if I use zsh's builtin `time` command with my own TIMEFMT[1], it gives me numbers greater than 0:
$ time grep -Eo .{35}https:.{4} test.json
real 0.324
user 0.317
sys 0.007
maxmem 16 MB
faults 0
$ time rg-14.0.3 -o .{35}https:.{4} test.json
real 0.008
user 0.003
sys 0.003
maxmem 16 MB
faults 0
$ time dash -c "echo https:|./1.sh test.json 45 35"
real 0.010
user 0.011
sys 0.007
maxmem 16 MB
faults 0
$ time bash -c "echo https:|./1.sh test.json 45 35"
real 0.011
user 0.014
sys 0.004
maxmem 16 MB
faults 0
Would you look at that. ripgrep is faster! By a whole 2 milliseconds! WOW!
OK, since I'm a software developer and thus apparently cannot understand the lowly needs of an "ordinary user," I'll hop over to my machine with a i5-7600, which was released 6 years ago. Is that ordinary enough, or still too super charged to do any meaningful comparison whatsoever?
$ time grep -Eo .{35}https:.{4} test.json
real 0.641
user 0.620
sys 0.017
maxmem 6 MB
faults 0
$ time rg-14.0.3 -o .{35}https:.{4} test.json
real 0.010
user 0.008
sys 0.000
maxmem 8 MB
faults 0
$ time dash -c "echo https:|./1.sh test.json 45 35"
real 0.011
user 0.009
sys 0.011
maxmem 6 MB
faults 0
$ time bash -c "echo https:|./1.sh test.json 45 35"
real 0.013
user 0.021
sys 0.003
maxmem 6 MB
faults 0
(I ran each of the commands above several times and took the minimum.)
OK, so ripgrep is still 1ms faster even on "ordinary user" hardware.
All right, so your other comment also shared another benchmark:
$ time grep -Eo .{100}https:.{50} test.json
real 1.777
user 1.772
sys 0.003
maxmem 6 MB
faults 0
$ time rg-14.0.3 -o .{100}https:.{50} test.json
real 0.013
user 0.006
sys 0.000
maxmem 8 MB
faults 0
$ time rg-14.0.3 --color never -o .{100}https:.{50} test.json
real 0.006
user 0.006
sys 0.000
maxmem 8 MB
faults 0
$ time dash -c "echo https:|./1.sh test.json 156 100"
real 0.015
user 0.024
sys 0.004
maxmem 7 MB
faults 0
$ time bash -c "echo https:|./1.sh test.json 156 100"
real 0.016
user 0.028
sys 0.000
maxmem 7 MB
faults 0
(Notice that disabling color and line numbers for ripgrep improves its speed a fair bit. ired isn't doing either of those things, so it's only fair. GNU grep doesn't count line numbers by default and disabling color doesn't improve its perf here.)
This one is more interesting because it exposes the fact that many regex engines have trouble dealing with bounded repeats. Something like `.{100}` for example is not executed particularly efficiently in most regex engines. And indeed, in ripgrep by default, `.` actually matches the UTF-8 encoding of any Unicode scalar value (so between 1 and 4 bytes) and not any arbitrary byte. You'd need to pass the `--no-unicode` flag or prefix your pattern with `(?-u)` to match any arbitrary byte. And indeed, even then, `.` doesn't match `\n`. So you might even want `(?s-u)`. But since this is a grep and *greps are line oriented*, you'd need to enable multi-line mode in ripgrep (GNU grep doesn't have this):
$ time rg-14.0.3 -Uo '(?s-u).{100}https:.{50}' test.json
real 0.057
user 0.041
sys 0.006
maxmem 8 MB
faults 0
$ time rg-14.0.3 --color never -N -Uo '(?s-u).{100}https:.{50}' test.json
real 0.042
user 0.041
sys 0.000
maxmem 8 MB
faults 0
This actually runs slower, I believe, because it disables the line oriented optimizations that ripgrep uses. In this case, it isn't as good at detecting the `https:` literal and looking for that first. That's where `ired` can do (a lot) better, because it isn't line oriented and doesn't need to support arbitrary regex patterns; greps do.
To complete this analysis, I'm going to do something that I realize is blasphemous to you and increase the input size by ten-fold. This will help us understand where time is being spent:
$ time grep --color=never -Eo .{100}https:.{50} test.10x.json
real 17.931
user 17.906
sys 0.017
maxmem 7 MB
faults 0
$ time rg-14.0.3 --color never -N -o '.{100}https:.{50}' test.10x.json
real 0.032
user 0.017
sys 0.010
maxmem 23 MB
faults 0
$ time rg-14.0.3 --color always -N -o '.{100}https:.{50}' test.10x.json
real 0.137
user 0.034
sys 0.019
maxmem 23 MB
faults 0
$ time dash -c "echo https:|./1.sh test.10x.json 156 100"
real 0.067
user 0.089
sys 0.069
maxmem 7 MB
faults 0
I compared the profiles of `rg --color=never` and `rg --color=always`, and they look about the same to me. This suggests to me that color is slower simply because rendering it in my terminal emulator is slower.
For grins, I also tried ugrep:
$ time ugrep-4.4.1 --color=never -o '.{100}https:.{50}' test.10x.json
real 6.003
user 5.977
sys 0.007
maxmem 6 MB
faults 0
Ouch. But not as bad as GNU grep.
So with a bigger input, we can see that `rg -o` is about twice as fast as ired, even on "ordinary" hardware.
And IMO, for inputs of the size you've provided, the difference is not meaningful.
Going back to your original prompt:
> Personally I'm interested in "grep -o" alternatives.
It seems to me like `rg -o` is quite serviceable in that regard, and at the very least, substantially better than GNU grep.
At this point, I wondered what ired did for substring search[2]. That immediately stuck out to me as something that looked wrong. Indeed:
$ cat haystack
ABAABAB
$ echo -n BAB | od -An -tx1 | sed 's>^>/>;s/ //g' | ired -n haystack
0x4
$ echo -n ABAB | od -An -tx1 | sed 's>^>/>;s/ //g' | ired -n haystack
$ rg -o ABAB haystack
1:ABAB
So ired is a toy. One wonders how many search results you've missed over the years because of ired's feature "it's so minimal that it's wrong!" I mean sometimes tools have bugs. ripgrep has had bugs too. But this one has been in ired since 2009.
What is it that you said? YIKES. Yeah. Seems appropriate.
[1]: https://github.com/BurntSushi/dotfiles/blob/eace294fd80bfde1...
[2]: https://github.com/radare/ired/blob/a1fa7904e6ad239dde950de5...
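For the curious, the class of bug demonstrated above is a naive substring scanner that, on a mismatch, resumes from the current position instead of rewinding to one past the start of the partial match. This sketch illustrates that bug class only; it is not ired's actual code:

```python
def buggy_find(haystack: str, needle: str) -> int:
    # Bug: after a partial match fails, scanning resumes at the current
    # haystack position instead of rewinding to start-of-attempt + 1,
    # so any match overlapping the failed attempt is skipped.
    i = j = 0
    while i < len(haystack):
        if haystack[i] == needle[j]:
            i += 1
            j += 1
            if j == len(needle):
                return i - j
        else:
            i += 1
            j = 0
    return -1

print(buggy_find("ABAABAB", "ABAB"))  # -1: the match at offset 3 is missed
print("ABAABAB".find("ABAB"))         # 3: a correct search finds it
```

Searching for ABAB in ABAABAB, the buggy scanner consumes ABAA, fails, and resumes past the A that begins the real match, reproducing the false negative shown in the transcript above.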
1vuio0pswjnm7
About grep -o.
# stat -c %s file
6297285
# file file
file: ASCII text, with very long lines (1545), with CRLF, LF line terminators
Imagine file as a wall of text.
1. Printing byte offsets.
# time grep -ob string file
0.03user 0.08system 0:00.22elapsed 52%CPU (0avgtext+0avgdata 1104maxresident)k
0inputs+0outputs (0major+86minor)pagefaults 0swaps
# rg -V
ripgrep 13.0.0
# time rg -ob string file
0.10user 0.17system 0:01.11elapsed 25%CPU (0avgtext+0avgdata 7804maxresident)k
0inputs+0outputs (0major+559minor)pagefaults 0swaps
# time sh -c "echo -n string|od -An -tx1|sed 's>^>/>;s/ //g'|ired -n file"
0.03user 0.09system 0:00.18elapsed 67%CPU (0avgtext+0avgdata 720maxresident)k
0inputs+0outputs (0major+189minor)pagefaults 0swaps
2. Printing some "context" around the matched string. For example, add characters immediately preceding string.
Baseline.
# time grep -o string file
0.02user 0.07system 0:00.15elapsed 65%CPU (0avgtext+0avgdata 1068maxresident)k
0inputs+0outputs (0major+84minor)pagefaults 0swaps
Add one character.
# time grep -o .string file
0.21user 0.08system 0:00.36elapsed 83%CPU (0avgtext+0avgdata 1088maxresident)k
0inputs+0outputs (0major+87minor)pagefaults 0swaps
Add another character.
# time grep -o ..string file
0.29user 0.09system 0:00.46elapsed 82%CPU (0avgtext+0avgdata 1064maxresident)k
0inputs+0outputs (0major+88minor)pagefaults 0swaps
# time rg -o ..string file
0.13user 0.13system 0:00.90elapsed 28%CPU (0avgtext+0avgdata 9012maxresident)k
0inputs+0outputs (0major+574minor)pagefaults 0swaps
Yikes.
Now let's try ired. Another shell script. This one will print all occurrences of string.
cat > 1.sh << eof
#!/bin/sh
# usage: echo string|1.sh file [blocksize] [seek]"
read x;
x=$(echo -n $x|xxd -p);
echo /$x \
|ired -n $1 \
|sed "s/.*/s&@s-${3-0}@X$2/" \
|tr @ '\12' \
|ired -q -i /dev/stdin $1 \
|sed 's/.*/w&0a/' \
|ired -n /dev/stdout
eof
Baseline.
# echo string|time sh 1.sh 6
0.11user 0.10system 0:00.17elapsed 127%CPU (0avgtext+0avgdata 772maxresident)k
0inputs+0outputs (0major+466minor)pagefaults 0swaps
Add one character before string.
# echo string|time sh 1.sh 7 1
0.12user 0.09system 0:00.16elapsed 131%CPU (0avgtext+0avgdata 740maxresident)k
0inputs+0outputs (0major+473minor)pagefaults 0swaps
Add another.
# echo string|time sh 1.sh 8 2
0.12user 0.11system 0:00.20elapsed 112%CPU (0avgtext+0avgdata 744maxresident)k
0inputs+0outputs (0major+461minor)pagefaults 0swaps
Perhaps grep or ripgrep might be slightly faster at printing byte offsets.
But ired is faster at printing matches with context. (NB. Context here means characters, not lines.)
Try using ripgrep to print offsets for ired.
#!/bin/sh
read x;
rg --no-mmap -ob $x $1 \
|cut -d: -f1 \
|sed "s/.*/s&@s-${3-0}@X$2/" \
|tr @ '\12' \
|ired -q -i /dev/stdin $1 \
|sed 's/.*/w&0a/' \
|ired -n /dev/stdout
# time sh -c "echo string|1.sh file 8 2"
0.11user 0.06system 0:00.18elapsed 101%CPU (0avgtext+0avgdata 5972maxresident)k
0inputs+0outputs (0major+905minor)pagefaults 0swaps
# stat -c %s /usr/bin/ired /usr/bin/grep /usr/bin/rg
37544
219248
5074800burntsushi
OK, so first of all, let's get one thing cleared up. What the heck is ired? It isn't in the Archlinux package repos. I found this[1], but it looks like an incomplete and abandoned project. It doesn't even have proper docs:
$ ired -h
ired [-qhnv] [-c cmd] [-i script] [-|file ..]
$ ired --help
$
So like, I don't even know what `ired -n` is doing. From what I can tell from your commands, it's searching for `string`, but you first need to convert it to a hexadecimal representation.
But okay, let's also check the output between the commands and make sure they're the same. I used my own file:
$ time grep -ob string 1-2048.txt
333305:string
333380:string
920494:string
5166701:string
5210094:string
6775219:string
real 0.006
user 0.006
sys 0.000
maxmem 15 MB
faults 0
$ time rg -ob string 1-2048.txt
13123:333305:string
13124:333380:string
33382:920494:string
159885:5166701:string
161059:5210094:string
211466:6775219:string
real 0.003
user 0.000
sys 0.003
maxmem 15 MB
faults 0
$ time sh -c "echo -n string|od -An -tx1|sed 's>^>/>;s/ //g'|ired -n 1-2048.txt"
0x515f9
0x51644
0xe0bae
0x4ed66d
0x4f7fee
0x6761b3
real 0.013
user 0.010
sys 0.004
maxmem 15 MB
faults 0
Indeed, the hexadecimal offsets printed by ired line up with the offsets printed by grep and ripgrep. Notice also the timing. ired is slower here for me.
OK, now let's do context:
$ time grep -ob string 1-2048.txt
[..snip..]
real 0.006
user 0.006
sys 0.000
maxmem 16 MB
faults 0
$ time grep -ob .string 1-2048.txt
[..snip..]
real 0.005
user 0.003
sys 0.003
maxmem 16 MB
faults 0
$ time grep -ob ..string 1-2048.txt
[..snip..]
real 0.006
user 0.003
sys 0.003
maxmem 16 MB
faults 0
$ time rg -ob string 1-2048.txt
[..snip..]
real 0.004
user 0.003
sys 0.000
maxmem 16 MB
faults 0
$ time rg -ob .string 1-2048.txt
[..snip..]
real 0.004
user 0.000
sys 0.003
maxmem 16 MB
faults 0
$ time rg -ob ..string 1-2048.txt
[..snip..]
real 0.004
user 0.004
sys 0.000
maxmem 16 MB
faults 0
I don't see anything worth saying "yikes" about here.
One possible explanation for the timing differences is that your search has a lot of search results. The match count is a crucial part of benchmarking, and you've made the same mistake as the ugrep author by omitting them. But okay, let me try a search with more hits.
$ time rg -ob the 1-2048.txt | wc -l
60509
real 0.011
user 0.006
sys 0.006
maxmem 16 MB
faults 0
$ time rg -ob .the 1-2048.txt | wc -l
60477
real 0.014
user 0.014
sys 0.000
maxmem 16 MB
faults 0
$ time rg -ob ..the 1-2048.txt | wc -l
60359
real 0.014
user 0.014
sys 0.000
maxmem 16 MB
faults 0
A little slower, but that's what you'd expect with the higher match frequency. Now let's try your script for 1.sh:
$ echo the | time sh 1.sh 1-2048.txt 6 | wc -l
63304
real 0.048
user 0.072
sys 0.052
maxmem 16 MB
faults 0
$ echo the | time sh 1.sh 1-2048.txt 7 1 | wc -l
63336
real 0.056
user 0.096
sys 0.042
maxmem 16 MB
faults 0
$ echo the | time sh 1.sh 1-2048.txt 8 2 | wc -l
63419
real 0.053
user 0.079
sys 0.049
maxmem 16 MB
faults 0
(The counts are a little different because `..the` matches fewer things than `the` when given to grep, but presumably `ired` doesn't care about that.)
But in any case, ired is quite a bit slower here.
OK, let's pop up a level. Your benchmark is somewhat flawed, for two reasons. First, the timings are so short that the differences here are generally irrelevant to human perception. It reminds me of the time when ripgrep came out, and someone would respond with a "gotcha" that `ag` was faster because it ran a search on a tiny repository in 10ms whereas ripgrep took 12ms. That's not quite exactly the same as what's happening here, but it's close. Second, the haystack is so short that overhead is likely playing a role here. The timings are just too short to be reliable indicators of performance as the haystack size scales. See my commentary on ugrep's benchmarks[2].
Let's try a bigger file:
$ stat -c %s eigth.txt
1621035918
$ file eigth.txt
eigth.txt: ASCII text
$ time rg -ob Sherlock eigth.txt | wc -l
1068
real 0.154
user 0.103
sys 0.050
maxmem 1551 MB
faults 0
$ time rg -ob .Sherlock eigth.txt | wc -l
935
real 0.156
user 0.096
sys 0.060
maxmem 1551 MB
faults 0
$ time rg -ob ..Sherlock eigth.txt | wc -l
932
real 0.154
user 0.107
sys 0.047
maxmem 1551 MB
faults 0
And now ired:
$ echo Sherlock | time sh 1.sh eigth.txt 6 | wc -l
1068
real 1.393
user 0.671
sys 0.729
maxmem 16 MB
faults 0
$ echo Sherlock | time sh 1.sh eigth.txt 7 1 | wc -l
1201
real 1.391
user 0.604
sys 0.793
maxmem 16 MB
faults 0
$ echo Sherlock | time sh 1.sh eigth.txt 8 2 | wc -l
1204
real 1.395
user 0.578
sys 0.823
maxmem 16 MB
faults 0
Yikes. Over an order of magnitude slower.
Note that the memory usage reported for ripgrep is high just because it's using file-backed memory maps. It's not actual heap usage. You can check this by disabling memory maps:
$ time rg -ob ..Sherlock eigth.txt --no-mmap | wc -l
932
real 0.179
user 0.063
sys 0.116
maxmem 16 MB
faults 0
And if we increase the match frequency on the same large haystack, the gap closes a little, but ired is still about 4x slower:
$ time rg -ob ..the eigth.txt | wc -l
13141187
real 2.470
user 2.418
sys 0.050
maxmem 1551 MB
faults 0
$ echo the | time sh 1.sh eigth.txt 8 2 | wc -l
13894916
real 10.027
user 16.293
sys 8.122
maxmem 402 MB
faults 0
I'm not clear on why you're seeing the results you are. It could be because your haystack is so small that you're mostly just measuring noise. ripgrep 14 did introduce some optimizations in workloads like this by reducing match overhead, but I don't think it's anything huge in this case. (And I just tried ripgrep 13 on the same commands above and the timings are similar if a tiny bit slower.)
joshka
There are a few ripgrep-based TUIs:
- https://github.com/acheronfail/repgrep
- https://github.com/konradsz/igrep
nsagent
You can also use fzf with ripgrep to great effect:
[1]: https://github.com/junegunn/fzf/blob/master/ADVANCED.md#usin...
comex
Interesting, it supports an n-gram indexer. ripgrep has had this planned for a few years now [1] but hasn't implemented it yet. For large codebases I've been using csearch, but it has a lot of limitations.
Unfortunately... I just tried the indexer and it's extremely slow on my machine. It took 86 seconds to index a Linux kernel tree, while csearch's cindex tool took 8 seconds.
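The n-gram indexing idea (as in Russ Cox's codesearch, which cindex/csearch implement) can be sketched briefly. This shows the concept only, not either tool's on-disk format:

```python
from collections import defaultdict

def trigrams(s: str):
    return {s[i:i + 3] for i in range(len(s) - 2)}

# Build: map each trigram to the set of documents containing it.
docs = {0: "the quick brown fox", 1: "pack my box", 2: "quick silver"}
index = defaultdict(set)
for doc_id, text in docs.items():
    for t in trigrams(text):
        index[t].add(doc_id)

def search(literal: str):
    # A document can contain "quick" only if it contains all of
    # "qui", "uic", "ick": intersect the posting sets, then verify the
    # surviving candidates with a real substring search.
    candidate_sets = [index.get(t, set()) for t in trigrams(literal)]
    candidates = set.intersection(*candidate_sets) if candidate_sets else set()
    return sorted(d for d in candidates if literal in docs[d])

print(search("quick"))  # [0, 2]
```

The index prunes the search space; the final verification pass is what keeps the results exact (and, for regexes, the query is first compiled into a trigram set expression).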
dtgriscom
That's close to a gig of disk reads; I trust you didn't try ugrep first and then cindex second, without taking into account caching.
comex
I ran both multiple times, alternating (and making sure to clean out the indexes in between). Results were reasonably consistent across runs.
jgalt212
If you're gonna go the csearch route, you should also consider hound. I use it many times per day.
bishfish
It creates per-directory index files on its first run. ugrep-indexer is also labeled as beta. A couple of relevant quotes from its GitHub site:
“Indexing adds a hidden index file ._UG#_Store to each directory indexed.”
“Re-indexing is incremental, so it will not take as much time as the initial indexing process.”
o11c
Important note: not actually compatible. It took me seconds to find an option that does something completely different than the GNU version.
burntsushi
Indeed. And here are some concrete examples around locale:
$ grep -V | head -n1
grep (GNU grep) 3.11
$ alias ugrep-grep="ugrep-4.4.1 -G -U -Y -. --sort -Dread -dread"
$ echo 'pokémon' | LC_ALL=en_US.UTF-8 grep 'pok[[=e=]]mon'
pokémon
$ echo 'pokémon' | LC_ALL=en_US.UTF-8 ugrep-grep 'pok[[=e=]]mon'
$ echo 'γ' | LC_ALL=en_US.UTF-8 grep -i 'Γ'
γ
$ echo 'γ' | LC_ALL=en_US.UTF-8 ugrep-grep -i 'Γ'
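Python's re engine, for comparison, applies Unicode case folding under re.IGNORECASE, mirroring GNU grep's -i behavior in a UTF-8 locale (Python has no POSIX [[=e=]] equivalence classes, so only the case-folding half of the comparison is reproducible):

```python
import re

# Unicode-aware case-insensitive matching: Γ (U+0393) folds to γ (U+03B3).
print(bool(re.fullmatch("Γ", "γ", re.IGNORECASE)))  # True
```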
BSD grep works like GNU grep too:
$ grep -V
grep (BSD grep, GNU compatible) 2.6.0-FreeBSD
$ echo 'pokémon' | LC_ALL=en_US.UTF-8 grep 'pok[[=e=]]mon'
pokémon
$ echo 'γ' | LC_ALL=en_US.UTF-8 grep -i 'Γ'
γ
fwip
Which option is that? I'm scanning the ugrep page, but nothing is popping out to me.
e12e
I would assume compatible meant posix/bsd - unless explicitly advertised AS "GNU grep compatible"?
burntsushi
From the OP: "Ugrep is compatible with GNU grep and supports GNU grep command-line options."
zaidhaan
A little off-topic, but I'd love to see a tool similar to this that provides real-time previews for an entire shell pipeline which, most importantly, integrates into the shell. This allows for leveraging the completion system to complete command-line flags and using the line editor to navigate the pipeline.
In zsh, the closest thing I've gotten to this was to bind Ctrl-\ to the `accept-and-hold` zle widget, which executes what is in the current buffer while still retaining it and the cursor position. That gets me close (no more ^P^B^B^B^B for editing), but I'd much rather see the result of the pipeline in real-time rather than having to manually hit a key whenever I want to see the result.
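For reference, the binding described above is a one-liner in ~/.zshrc (assuming the default emacs keymap; accept-and-hold is a built-in zle widget):

```shell
# Ctrl-\ executes the current buffer while keeping it (and the cursor) in place
bindkey '^\' accept-and-hold
```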
wazzaps
Sounds similar to this: https://github.com/akavel/up
tacone
I guess Alt+a is the default zsh shortcut for that.
ijustlovemath
Any particular reason why newer tools don't follow the well-established XDG standard for config files? Those folder structures probably already exist on end-user machines, and it keeps your home directory from getting cluttered with tens of config files.
xcdzvyn
Slight rant/aside, but Firefox is bad for this. You can point it to a custom profile path (e.g. .config/mozilla), but ~/.mozilla/profiles.ini MUST exist. Only that one file - you can move everything else.
ijustlovemath
In my mind, this is fine, as Firefox predates the standard by a long time. But newer tools specifically should know better.
tedunangst
XDG isn't recognized as an authority outside of XDG.
burntsushi
For ripgrep at least, you set an environment variable telling it where to look for a config file. You can put it anywhere, so you don't need to put it in $HOME.
I didn't do XDG because this route seemed simpler, and XDG isn't something that is used everywhere.
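Concretely, a sketch of that setup; RIPGREP_CONFIG_PATH is the variable ripgrep checks, and the path below is just an example location:

```shell
# ripgrep reads a config file from wherever RIPGREP_CONFIG_PATH points;
# the location is entirely up to you (example path below).
export RIPGREP_CONFIG_PATH="$HOME/.config/ripgrep/rc"

# The config file itself holds one flag per line, e.g.:
#   --smart-case
#   --hidden
```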
smaudet
Standard should be - tool tells you where it's configured, how to change the config, and choose a 'standard' default config, such as XDG.
Assuming you aren't doing weird things with paths, I can work around 'dumb lazy' developers releasing half-assed tools with symlinks/junctions, but I really don't want to spend a ton of time configuring your tool or fighting its presumptions.
burntsushi
Oh okay, I guess you've got it figured out. Now specify it in enough detail for others to implement it, get all stakeholders to agree and get everyone to implement it exactly to the spec.
Good luck. You're already off to a rough start with XDG, since that isn't what is used on Windows. And it's unclear whether it ought to be used on macOS.
Hendrikto
> I didn't do XDG because this route seemed simpler
Simpler how? This requires custom config, instead of following what I set system-wide.
> and XDG isn't something that is used everywhere.
Yeah, that‘s why it defines defaults to fall back on.
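The fallback rule from the XDG Base Directory spec is indeed tiny; a shell sketch:

```shell
# XDG Base Directory fallback: use $XDG_CONFIG_HOME if it is set and
# non-empty, otherwise default to ~/.config.
config_home="${XDG_CONFIG_HOME:-$HOME/.config}"
echo "$config_home"
```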
burntsushi
It's far simpler to implement.
No, you don't understand. I'm not saying the XDG variables might not be defined. Give me a little credit here lol. I have more than a passing familiarity with XDG. I've implemented it before. I'm saying the XDG convention itself may not apply. For example, Windows. And it's controversial whether to use them on macOS, when I last looked into it.
I don't see any significant problem with defining an environment variable. You likely already have dozens defined. I know I do.
I'm not trying to convince you of anything. Someone asked why. This is why for ripgrep at least.
Joel_Mckay
Someone please just standardize the grep flags across all platforms.
Specifically -P / --perl-regexp support on macOS and FreeBSD.
It really would reduce the WTF moments for the students.
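A small illustration of the gap, assuming a GNU grep built with PCRE support; the -E form is the portable POSIX rewrite:

```shell
# -P enables Perl-compatible regex syntax (GNU grep only, and only
# when built with PCRE support); \d is not valid in POSIX ERE.
echo 'order 42' | grep -P '\d+'

# Portable equivalent using a POSIX ERE character class:
echo 'order 42' | grep -E '[0-9]+'
```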
Insert jokes about standards below... =)
burntsushi
That's what POSIX was supposed to be.
It's easier IMO to just use the same tool on all platforms. Which you can of course do.
Joel_Mckay
Not sure if brew's grep is as nerfed, but the POSIX standard is often just a minimal subset of the GNU version's features.
Cheers, =)
burntsushi
Yes, that's the problem. You need to pay close attention to know which things are POSIX. And in the case of GNU grep, you actually need to set POSIXLY_CORRECT=1. Otherwise its behavior is not a subset.
POSIX also forbids greps from searching UTF-16 because it mandates that certain characters always use a single byte. ripgrep, for example, doesn't have this constraint and thus can transparently search UTF-16 correctly via BOM sniffing.
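To see that difference in practice, here's a sketch that builds a small UTF-16LE file (BOM first, written as octal escapes) and searches it. Whether a byte-oriented grep reports anything useful depends on the implementation, and the final line assumes ripgrep (rg) is installed:

```shell
# Build a UTF-16LE file: BOM bytes FF FE, then UTF-16LE-encoded text.
printf '\377\376' > /tmp/utf16.txt
printf 'hello world\n' | iconv -f UTF-8 -t UTF-16LE >> /tmp/utf16.txt

# A byte-oriented grep sees NUL bytes and typically reports at best
# "binary file matches"; ripgrep sniffs the BOM, transcodes, and
# finds the line (assumes rg is installed):
rg hello /tmp/utf16.txt
```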
karakanb
Slightly off topic, but how does one publish so many installable versions of a binary across all the package managers? I figured out how to do it for Brew, but the rest seems like a billion different steps that need to be done and I feel like I am missing something.
wint3rmute
You only have to set up CI/CD once for each package type, afterwards all the packaging work is done for you automatically.
Ripgrep is also quite a large project (judging by both star count and contributor count), so people probably volunteer to support their platform/package manager of choice.
mathverse
Also look at https://github.com/stealth/grab from Sebastian Krahmer.
meowface
ripgrep, grab, ugrep, hypergrep... Any of the four are probably fast enough for any of my use cases but I suddenly feel tempted to micro-optimize and spend ages comparing them all.
infamia
Ugrep is also available in Debian based repos, which is super nice.
louwrentius
I will never learn this tool
I will not even contemplate using this tool.
The reason is very simple: I can trust 'grep' to be on any system I ever touch. Learning ugrep doesn't make any sense as I can't trust it to be available.
I could still use it on my own systems, but I work on customer systems which won't have this tool installed.
And I'm proficient enough with grep that it's 'good enough', I'm not focussing on a better grep. I'm focussing on fixing a problem, or trying something new.
I'd rather invest my time into something that will benefit me across all environments I work with.
Because a tool may be 'better' (whatever that means) doesn't mean it will see adoption.
This is not about being close-minded; it's about focusing on what's really important.
jftuga
I really like the fuzzy match feature. Useful for typos or off by 1-2 characters.
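For reference, ugrep exposes this via its -Z flag; a sketch, assuming ugrep is installed (-Z2 allows up to two character edits):

```shell
# ugrep fuzzy matching: -Z2 tolerates up to 2 character insertions,
# deletions, or substitutions, so the transposed 'ie' typo still matches.
echo 'recieve' | ugrep -Z2 'receive'
```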
Here's a thread on performance vs rg (ripgrep): https://github.com/BurntSushi/ripgrep/discussions/2597. I didn't know about hypergrep either.