Get the top HN stories in your inbox every day.
bastien2
You don't. You use a full-text indexer and normal search tools. A chatbot is only going to decrease the integrity of query results.
andai
I found that grep actually outperformed vector search for many queries. The only thing I was missing was when I didn't know how exactly to phrase something (the exact keyword to use).
Do keyword search systems have workarounds for this? My own idea was for each keyword to generate a list of neighbor keywords in semantic space. I figured with such a dataset, I'd get something approximating vector search for free.
I made some attempts at that (found neighbors by their proximity in text), but I ended up with a lot of noise (words that often go together without having the same meaning). So I'd probably have to use actual embeddings instead.
More generally, any suggestions for full-text indexing? Elasticsearch seems like overkill. I built my own keyword search in Python (simple tf-idf) which was surprisingly easy. (Long-term project is to have an offline copy of a useful/interesting subset of the internet. Acquiring the datasets is also an open question. Common Crawl is mostly random blogs and forum arguments...)
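For reference, a minimal sketch of the kind of simple tf-idf keyword search described above. This is not the commenter's actual code; the names `tokenize`, `build_index`, and `search`, the smoothed idf formula, and the sample documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """docs maps doc_id -> text. Returns (term counts per doc, idf per term)."""
    tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each document counts once per term
    n = len(docs)
    # Smoothed idf (an assumption; other variants work too).
    idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}
    return tf, idf


def search(query, tf, idf, top_k=5):
    """Score documents by summed (length-normalized tf) * idf over query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        if term not in idf:
            continue  # term never seen in the corpus
        for doc_id, counts in tf.items():
            if counts[term]:
                scores[doc_id] += (counts[term] / sum(counts.values())) * idf[term]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical sample corpus.
docs = {
    "bill": "electric bill for march with amount due",
    "tax": "tax return documents for august 2023",
    "title": "home title deed and property records",
}
tf, idf = build_index(docs)
results = search("tax documents", tf, idf)
```

A linear scan over every document is fine for a few thousand files; an inverted index (term -> list of documents) would be the next step for anything larger.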
skydhash
> The only thing I was missing was when I didn't know how exactly to phrase something (the exact keyword to use).
I think that's the only thing GUI (or TUI) directories have over CLI. I remember having Wikipedia locally (English texts, back in 2010) and the portals were surprisingly useful. They act like the semantic space in case you can't find an article for your exact word. So Literature > Fiction > Fantasy > Epic Fantasy will probably land you somewhere close to "The Lord of The Rings".
ravetcofx
Do you know of any way to build a fast index you can run grep against? Would love to have something as instantaneous as "Everything" on windows for full text on Linux so I can just dump everything in a directory
semi-extrinsic
Have you tried the more modern solutions like ripgrep, ack, etc.?
Or for something more comprehensive (to also search PDF, docx, etc.) there is ripgrep-all:
everforward
As others have said, ripgrep et al are faster than regular grep. You would probably also get much faster results with an alias that excludes directories you don't expect results in (I.e. I don't normally grep in /var at all).
I have seen some recommendations for recoll, but I haven't used it so can't comment. Anecdotally, I normally just use ripgrep in my home directory (it's almost always in ~ if I don't remember where it is). It's fast enough as long as my homedir is local (I.e. not on NFS).
jononor
Tracker is an open source project for that. It has been around for some 10+ years now. https://tracker.gnome.org/overview/
haiku2077
Try ripgrep.
j0hnyl
The point of vector search is to support semantic search. It makes sense that grep will outperform if you're just looking for verbatim occurrences of a string.
3abiton
A combination of both could help!
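One common way to combine the two is reciprocal rank fusion (RRF), which merges ranked lists without needing comparable scores. The sketch below is only an illustration, not something proposed in the thread: the function name, the constant `k=60`, and the file names are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into one list.

    Each document gets 1 / (k + rank) from every list it appears in;
    documents ranked well by both searches float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["a.pdf", "b.pdf", "c.pdf"]
vector_hits = ["c.pdf", "a.pdf", "d.pdf"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents found by only one of the two searches still show up, just lower down, which suits a recall-over-precision goal.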
SkyPuncher
Most developers are going to outperform vector search. We “get” how computers do lookups so we build our queries appropriately.
Vector search is amazing for using layman concepts.
yreg
> decrease the integrity of query results
What does that even mean? When you know the exact keywords then you use full-text.
When you don't know them then other tools can be helpful.
eviks
It means you'd use the same tool since it's more convenient and get worse results in one tool vs. the other
Capricorn2481
Because they're two different tools for two different tasks. If you expect to always know the exact phrase then, yes, grep will be better. But if you search for a semantically similar phrase you will get nothing.
vikramkr
You wouldn't use a chatbot for the same query you'd use normal search tools for (and on a side note your answer would be much more useful with an example of what those tools would be, it's not really actionable). A vague natural language question over data whose structure you haven't fully understood using terms that might be inexact is not as likely to provide good results with normal search tools as with an llm based tool.
skydhash
> your answer would be much more useful with an example of what those tools would be
Paperless, DevonThink, even Calibre (the ebook manager) can do it.
You only need a day or two to categorize the documents. No need for huge amounts of RAM, or privacy concerns, or hallucinated answers.
dotancohen
> You only need a day or two
For some of us, for some types of data, huge amounts of RAM, or even privacy concerns, or even the occasional hallucinated answer, is an easier pill to swallow. A recent example, maybe not the best example but recent, was the query "What do the three headed dog from the Harry Potter books and the cat from Alien have in common"
xeromal
I never want to categorize stuff. I want it done for me.
ajsnigrutin
Another (ugly but works nice): https://www.recoll.org/pics/index.html
opensource, local, yada yada, almost zero configuration (just add folders, run indexer, wait).
rahimnathwani
Paperless-ngx set up using docker compose is good for this use case.
barrenko
Hi bastien,
Could you expand on the answer? Thanks!
pierre
The RAG CLI from LlamaIndex allows you to do it 100% locally when used with Ollama or llama.cpp instead of OpenAI.
https://docs.llamaindex.ai/en/stable/getting_started/starter...
homarp
and at some point (https://github.com/ggerganov/llama.cpp/issues/7444) you will be able to use Phi-3-vision https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
but for now you will have to use python.
You can try it here https://ai.azure.com/explore/models/Phi-3-vision-128k-instru... to get an idea of its OCR + QA abilities
nl
Does the llamaindex PDF indexer correctly deal with multi-column PDFs? Most I've seen don't, and you get very odd results because of this.
rspoerri
i've made quite good conversions from pdf to markdown with https://github.com/VikParuchuri/marker . it's slow but worth a shot. Markdown should be easily parseable by a rag.
i'm trying to get a similar system setup on my computer.
nl
This looks worth exploring, so thanks. The author has done a bunch of work beyond what PyMuPDF does on multicolumn layouts.
jd3
basically, still the same answer(s) from
tspann
https://milvus.io/docs/integrate_with_llamaindex.md
Pretty easy to run local and lightweight with Milvus Lite with LlamaIndex
ekianjo
llamaindex has a horrible API, very poor docs and is constantly changing. I do not recommend it.
m0shen
Paperless supports OCR + full text indexing: https://docs.paperless-ngx.com/
As far as AI goes, not sure.
whynotmaybe
You can use Gpt4all with localdocs to analyze the folder where you store the output of paperless-ngx
Ey7NFZ3P0nzAe
I am a medical student with thousands and thousands of PDFs and was unsatisfied with RAG tools, so I made my own. It can consume basically any type of content (pdf, epub, youtube playlist, anki database, mp3, you name it) and does a multi-step RAG: first using embeddings, then filtering using a smaller LLM, then answering by feeding each remaining document to the strong LLM, and finally combining those answers.
It supports virtually all LLMs and embeddings, including local LLMs and local embeddings. It scales surprisingly well and I have tons of improvements to come, when I have some free time or procrastinate. Don't hesitate to ask for features!
Here's the link: https://github.com/thiswillbeyourgithub/DocToolsLLM/
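The multi-step flow described above (embed, filter with a small LLM, answer per document with a strong LLM, combine) can be sketched roughly as follows. This is not DocToolsLLM's code or API: every function name and parameter here is hypothetical, and the "models" are toy stand-ins so the example runs without any LLM.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def multi_step_rag(question, docs, embed, is_relevant, answer_one, combine, top_k=10):
    """All callables are hypothetical hooks supplied by the caller."""
    # Step 1: embedding retrieval, keep the top_k most similar documents.
    q_vec = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)[:top_k]
    # Step 2: a cheaper model discards documents it judges irrelevant.
    kept = [d for d in ranked if is_relevant(question, d)]
    # Step 3: the stronger model answers once per remaining document.
    partial_answers = [answer_one(question, d) for d in kept]
    # Step 4: merge the per-document answers into one response.
    return combine(question, partial_answers)


# Toy stand-ins (assumptions): a tiny bag-of-words "embedding" and string-based "LLMs".
VOCAB = ["tax", "bill", "title", "home", "august"]


def toy_embed(text):
    return [text.lower().count(word) for word in VOCAB]


docs = ["Tax return for August 2023", "Electric bill for March", "Home title deed"]
answer = multi_step_rag(
    "tax documents august",
    docs,
    embed=toy_embed,
    is_relevant=lambda q, d: any(w in d.lower() for w in q.lower().split()),
    answer_one=lambda q, d: f"Found: {d}",
    combine=lambda q, parts: " | ".join(parts),
    top_k=2,
)
```

In a real setup the four callables would wrap an embedding model, a small LLM, and a large LLM; the control flow stays the same.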
samspenc
Nvidia's 'Chat with RTX' can do this as well https://www.nvidia.com/en-us/ai-on-rtx/chatrtx/
You do need a beefy GPU to run the local LLM, but I think it's a similar requirement for running any LLM on your machine.
Ey7NFZ3P0nzAe
I am deeply unsatisfied with how most RAG systems handle questions, chunking, embeddings, storage, and even those used for summaries are usually rubbish. That's why I created my own tool. Check it out, I updated it a lot! It supports ollama too for private use.
constantinum
The primary challenge is not just about harnessing AI for search; it's about preparing complex documents of various formats, structures, designs, scans, multi-layout tables, and even poorly captured images for LLM consumption. This is a crucial issue.
There is a 20 min read on why parsing PDFs is hell: https://unstract.com/blog/pdf-hell-and-practical-rag-applica...
To parse PDFs for RAG applications, you'll need tools like LLMwhisperer[1] or unstructured.io[2].
Now back to your problem:
This solution might be overkill for your requirement, but you can try the following:
To set things up quickly, try Unstract[3], an open-source document processing tool. You can set this up and bring your own LLM models; it also supports local models. It has a GUI to write prompts to get insights from your documents.[4]
[1] https://unstract.com/llmwhisperer/ [2] https://unstructured.io/ [3] https://github.com/Zipstack/unstract [4] https://github.com/Zipstack/unstract/blob/main/docs/assets/p...
jszymborski
Apache Tika could help extract the relevant bits of PDFs, couldn't it?
fooker
Modern LLMs are good enough at treating PDFs as images and grokking the context.
Well, Claude and GPT-4 seem to be.
elrostelperien
For macOS, there's this: https://pdfsearch.app/
Without AI, but searching the PDF content, I use Recoll (https://www.recoll.org/) or ripgrep-all (https://github.com/phiresky/ripgrep-all)
gumboshoes
The best indexer for macOS, bar none, is Foxtrot Professional. https://foxtrot-search.com/foxtrot-professional.html Very sophisticated searching, including regex, its own query language, and proximity searches - x within z words of y - which for me is the biggest win. I have 2TB of files indexed with this.
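To make the "x within z words of y" idea concrete, here is a minimal sketch of a proximity check over plain text. It says nothing about how Foxtrot implements it; the function name `within` and the sample sentence are assumptions.

```python
import re


def within(text, x, y, z):
    """Return True if word x occurs within z words of word y in text."""
    words = re.findall(r"\w+", text.lower())
    x_positions = [i for i, w in enumerate(words) if w == x.lower()]
    y_positions = [i for i, w in enumerate(words) if w == y.lower()]
    # Compare every pair of positions; fine for single documents.
    return any(abs(i - j) <= z for i in x_positions for j in y_positions)


# Hypothetical sample text.
sample = "The invoice from the electric company is due in March"
near = within(sample, "invoice", "electric", 3)
far = within(sample, "invoice", "March", 3)
```

Proximity matching helps when both words are common on their own but only meaningful together, which is typical for bills and statements.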
hm-nah
You have to find a good OCR tool that you can run locally on your hardware. RAG depends on your doc processing pipeline.
It’s not local, but the Azure Document Intelligence OCR service has a number of prebuilt models. The “prebuilt-read” model is $1.50/1k pages. Once you OCR your docs, you’ll have a JSON of all the text AND you get breakdowns by page/word/paragraph/tables/figures/all with bounding boxes.
Forget the Lang/Llama/Chain-theory. You can do it all in vanilla Python.
Kikawala
Quivr: https://github.com/QuivrHQ/quivr
SecureAI-Tools: https://github.com/SecureAI-Tools/SecureAI-Tools
pixelmonkey
rga, aka ripgrep-all, is my go-to for this. I suppose grep is a form of AI -- or, at least, an advanced intelligence that's wiser than it looks. ;)
gyrovagueGeist
+1 for this. I use rga all the time. it's a "simple" solution but often enough for what I actually needed.
SoftTalker
If you haven’t given some serious thought to getting rid of most of the documents then consider it. There is very little need to keep most routine documents for more than a few years. If you think you need your electric bill for March 2006 at your fingertips, why?
datpiff
I was hoping someone would make this point. A lot of digital archiving is just delaying tossing things - a hard drive is easier to deal with than boxes of paper. The contents can still be useless.
When it comes to a search solution - what kind of searches have you done in the past? What kind of problems did you come across? If the answer to either is "none" you are planning on building a useless system.
temp3000
You never know when you will need a 10 year old doc. Audits and disputes, for example. In addition, I suspect keeping all the docs uses 1% of the space of the photos people back up anyway.
I agree that search is overkill - just drudge manually or use grep when the time comes to dig.
Kikobeats
You can use Microlink to turn a PDF into HTML, and combine it with another service for processing the text data.
Here's an example turning an arXiv paper into real text:
https://api.microlink.io/?data.html.selector=html&embed=html...
It looks like a PDF, but if you open devtools you can see it's actually a very precise HTML representation.
theolivenbaum
If you're looking for something local, we develop an app for macOS and Windows that lets you search and talk to local files and data from cloud apps: https://curiosity.ai For the AI features, you can use OpenAI or local models (the app uses llama.cpp in the background, it ships with llama3 and a few other models, and we're soon going to let you use any .gguf model)
As the title says, I have many PDFs - mostly scans via Scansnap - but also non-scans. These are sensitive in nature, e.g. bills, documents, etc. I would like a local-first AI solution that allows me to say things like: "show me all tax documents for August 2023" or "show my home title". Ideally it is Mac software that can access iCloud too, since that's where I store it all. I would prefer to not do any tagging. I would like to optimize for recall over precision, so false positives in the search results are OK. What are modern approaches to do this, without hacking one up on my own?