alsodumb
Great work! I'm a bit confused by the comparison with nougat throughout the repo. Nougat was trained specifically for academic documents, and I don't think anyone ever claimed Nougat was the best OCR model out there. That's kinda clear in your benchmark too, where you mention that nougat has higher accuracy on arXiv documents. You also mention that marker will convert fewer equations than nougat, and yet compare with nougat in terms of speed? (Again, only complaining because it's a model designed for academic documents.)
For anyone trying to do OCR on any pdf with math in it, definitely do try nougat. It's very easy to install (just a Python package), and it extracts the math, text, tables, and beyond (into a .mmd file) with a single command. It also runs reasonably fast for personal use: it takes about 30 seconds to convert a 6-page document using CPU only on my 4-year-old i5 laptop.
fshr
> I don't think anyone ever claimed Nougat was the best OCR model out there
Comparing two things doesn't inherently imply the previous thing was touted with superlatives. It's just a way to juxtapose the new thing with something that may be familiar. As you said, nougat is easy to install/run, so it makes sense they'd compare against it. Would it be better if they could add more libraries to the comparison? Absolutely; that'd be helpful.
defsectec
How do you think nougat would handle RPG rulebook PDFs?
I'm looking for a good OCR model to help me transcribe sections of RPG books to markdown. Ideally, I'd like emphasis such as bold or italics to be transcribed too.
The combo of text, numbers, and math symbols seems similar to technical and academic writing, but often has weird formatting, text boxes in the margins, and many diagrams.
alsodumb
I'm not completely sure, to be honest, but you should try it yourself with a sample page! I believe Hugging Face hosts it online on their demo pages, so you don't even have to install the package to test a single page.
vikp
Author here: for my use case (converting scientific PDFs in bulk), nougat was the best solution, so I compared to it as the default. I also compare to naive text extraction further down.
Nougat is a great model, and converts a lot of PDFs very well. I just wanted something faster, and more generalizable.
Ldorigo
Reading your comment and the parent's, I think there may be a mistake in the comparison chart on GitHub. It says nougat takes around 700 seconds per page and yours around 90, which doesn't match the parent's claim that it took 30 seconds to run nougat on 6 pages.
civilitty
Great work! I just tried it on Linux for System Administrators and it did a great job properly picking up on code and config text.
I noticed marker downloaded a PyTorch checkpoint called `nougat-0.1.0-small`, do you use nougat under the hood too or is that just a coincidence?
vikp
Yes, nougat is used as part of the pipeline to convert the equations (basically marker detects the equations then passes those regions to nougat). It's a great model for this.
sumedh
> extracts the math, text, tables
I want to extract financial statements from pdfs which are in tables, would Nougat be suitable for that use case?
undefined
mannycalavera42
Let's not underestimate the impact of such tool: we are talking about freeing up tons of knowledge from a "good for consumption/bad for distribution" format.
I'm very excited about it.
Let's build a pipeline: all the pdfs -> markdown them all -> archive.org them all
kevincox
> Let's build a pipeline
I don't think that is the right approach for archiving. The preferred pipeline would be
all the pdfs -> archive them all -> markdown them
This way you can always re-run the conversion as bugs are fixed and improvements are made. Archivists generally prefer to save something as close to the source material as possible, because every transformation from there can only lose data.
crotchfire
Yeah if you get down into the weeds these models are significantly corrupting the source data.
I opened the first example to a random chapter (1.4 Formal and natural languages); within the first three paragraphs it:
- Hallucinated spurious paragraph breaks
- Ignored all the boldfacing
- Hallucinated a blockquote into a new section
This is not a tool to produce something for humans to read.
It might be useful as part of some pipeline that needs to feed markdown into some other machine process, but I would not waste my time reading the crud that came out of this thing.
It's a stunt.
Alex3917
> we are talking about freeing up tons of knowledge from a "good for consumption/bad for distribution" format.
FWIW PDF is actually great for distribution. It allows you to invisibly embed all the raw data used to generate the document that the end user is seeing, in whatever format you want. So if you are generating your PDFs by using PrinceXML to render HTML, you can embed the raw JSON used to generate all of the text, graphs, charts, etc. Now most people don't actually do this of course, but that isn't the fault of the spec.
HeavyStorm
The problem with PDF is not distribution, it's consumption. It has a fixed layout that's so 1990s it makes me itch.
sertbdfgbnfgsd
PDFs don't play well with e-readers.
Alex3917
Are the standards for building accessible PDFs worse than the standards for building accessible websites, or are they just not as commonly implemented?
undefined
chaxor
Yeah, totally. PDFs are wonderful for archiving.*
They can hold so many different types of data that they're extremely difficult to parse.
Because of this, you can put several malicious programs into them for RCE.
That way, if someone archives many PDFs, there can be a plethora of different RCE vulnerabilities just waiting for the user to discover.
It's a wonderful dream for any malicious actor.
* /s
vikp
Author here - this is one of the reasons I made this. Also see https://github.com/VikParuchuri/libgen_to_txt , although I haven't integrated marker with it yet (it uses naive text extraction).
samuell
Yes, there is enormous interest in this kind of thing, not least in larger organizations with tons of PDF documents in various forms.
Even though this would only cover a small part of the needs or use cases, it will still be hugely useful if it works well.
mannycalavera42
cough L cough L cough M cough anyone? :)
samuell
Yeah, I know, but a lot of this content can be pretty sensitive, and might not be allowed to upload outside organization networks sometimes (hospitals, governments etc).
miki123211
This also has tons of use cases for accessibility: getting PDF accessibility right is tons of work, and even if you manage it, it's highly likely that the PDF viewers your users use don't support the necessary standards anyway.
Gabrys1
Finally, a good use case for AI/ML/LLMs.
undefined
defsectec
This looks amazing, I'll have to play around with this over the weekend.
I regularly hand-transcribe RPG PDF scans from dubious sources that haven't always been run through OCR to get selectable text. When they have, it wasn't always done very well.
It's literally faster to type it all myself than fix all the errors from copy-pasting (or after using OCR to turn it into text).
Even if the file was an official PDF the formatting would often get screwed up with lots of double or triple spaces and even tabs included between words.
This would save so much time if I can get it to work. Thanks for sharing!
milep
I had this use case also in mind. Already tried with one book, but the results were not that good. Many of the tables and text boxes were messed up. I had pretty good results converting tables to markdown with ChatGPT by taking a screenshot of a table and pasting it to chat. It was able to handle some "irregular" tables with a bit of prompting. Like "Read the table row by row. Column headers are X, Y, Z. X is text, Y is number, Z is word" as a simplified example.
crooked-v
> I regularly hand transcribe RPG PDFs scans from dubious sources
Heh, that was my immediate thought too. There's a ton of RPG stuff that never had any kind of physical release and is totally orphaned as IP.
iamflimflam1
How good is tesseract for OCR nowadays? I tried using it a while back and it was nowhere near as good as the online offerings from AWS, Azure and GCP.
rereasonable
The last update was pretty recent, and the repo mentions Tesseract 5 as a dependency, so it's likely moved on a bit from when you last tried it:
https://github.com/tesseract-ocr/tesseract/releases
I suppose it depends on your use-case. For personal tasks like this it should be more than sufficient, and won't need user details/cc or whatever to use it.
habosa
I found it surprisingly good, and I was very impressed with the in-browser performance. It is very sensitive to resolution, though: once my images got down to a certain size, Tesseract produced garbage even though they were still very human-readable.
is_true
It requires quite a bit of preprocessing. I've only tried GCP's solution, which is better in my experience.
Geee
I tried it quite recently and it failed on a very basic image. I also tried the iOS Vision API, which also failed. My test case was a clear photo of a book page.
sertbdfgbnfgsd
Question for the author: why markdown? It seems to me the hard part of this tool is parsing PDFs with high accuracy, not what you do with them afterwards. As such, I would love it if this tool allowed the user to choose the output format. I know I would use a high-accuracy PDF parser to render into EPUB.
Finnucane
You would want some kind of markup that preserves structural information as much as possible. I manage ebooks for a university press, and we have a deep backlist waiting for conversion, a lot of which only exists as page scans of old print volumes. I want to be able to offer them as EPUBs, which means I need to know where the chapter breaks, heads, tables, charts, math, blockquotes, and so on are. I have vendors that can do this for me, but it costs more than we'd get for some of these books in sales. I'd love to be able to do some of this myself.
carschno
I agree, the intermediate format should be close to plain text and optionally convertible to any other format. I suppose that's why Markdown is used as the intermediate format here: it's close to plain text while still preserving simple layout information.
In practice, I would use the Markdown output and plug it into any tool that converts that into the desired final output format.
sertbdfgbnfgsd
That sounds reasonable. I might explore pdf -> markdown -> epub.
I wonder if this could somehow be used directly by calibre. I think calibre's pdf->epub conversion isn't amazing. In particular, tables often end up broken.
vikp
I chose markdown because I wanted to preserve equations (fenced by $/$$), tables, bold/italic information, and headers. I haven't looked into epub output, but this ruled out plain text.
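To make that concrete, here's a small hypothetical fragment (invented content, not marker's actual output for any document) showing the kinds of structure plain text would drop:

```markdown
## 1.4 Formal and natural languages

**Natural languages** are the languages people speak, such as English.

Inline math like $E = mc^2$ survives, as do display equations:

$$
\int_0^1 x^2 \, dx = \frac{1}{3}
$$

| Language | Ambiguous | Designed |
|----------|-----------|----------|
| English  | yes       | no       |
| Python   | no        | yes      |
```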
da39a3ee
Why not choose an unambiguously parseable output format such as JSON, and then convert JSON to markdown/ html / etc when needed?
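As a sketch of that idea, here's a minimal renderer over an invented JSON block schema (the field names are assumptions, not any tool's real output format). HTML or EPUB renderers could walk the same structure:

```python
import json


def json_to_markdown(doc: str) -> str:
    """Render a JSON list of typed blocks into Markdown on demand.

    The schema here (type/level/text/latex) is hypothetical; the point
    is that an unambiguous structured format can be lowered to any
    presentation format after the fact.
    """
    out = []
    for block in json.loads(doc):
        kind = block["type"]
        if kind == "heading":
            out.append("#" * block["level"] + " " + block["text"])
        elif kind == "equation":
            # Fence display math with $$, as markdown converters often do.
            out.append("$$\n" + block["latex"] + "\n$$")
        else:
            # Paragraphs (and anything unrecognized) fall back to text.
            out.append(block.get("text", ""))
    return "\n\n".join(out)
```

A JSON-to-HTML or JSON-to-EPUB function would just be another branch-per-block-type walk over the same data.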
danofsteel32
I have an odd use case that I've yet to find a good solution for: reading construction documents (blueprints are always PDFs). I've had much better luck parsing DXF (AutoCAD) files, but it's not always easy to get an architect to send them to me, even when I'm the GC on the job.
scary-size
Nice work. I tend to do most of my longer reading on an e-reader. PDFs, especially multi-column layouts, are a nightmare with the out-of-the-box offerings from Amazon Kindle or Pocketbook. This looks like something that'll improve my experience quite a lot.
KeplerBoy
Great stuff!
I have a question regarding the output of Nougat: Where do the "hallucinations" come from (just scroll through the Nougat output of the Think Python example to see what I mean)?
Never mind, I just read that it runs things through an LLM, so hallucinations are par for the course.
thfuran
I think these sorts of tools are dangerous, at least until the hallucination rate (in text or formatting) is below that of a careful reader repeatedly re-reading a document, which is almost but not quite zero, and, depending on the application, potentially until it's actually zero. I guess they're mostly fine for cases where the exact document content isn't important, but it's probably not common to have a lot of documents that nobody considers or ever will consider important, yet which must be more accessible than PDFs.
potatoman22
This seems like a great tool to help migrate my notes out of OneNote
smusamashah
How can it help with OneNote?
airstrike
Really interesting stuff... it might be worth adding some before-and-after examples to the repo.
What kind of PDF are you tweaking it for? How does it handle handwritten annotations?
ramoz
Kosmos-2.5 seems promising, and I hope we see it in open source (otherwise I assume it will just make Azure's cloud OCR better).
mlhpdx
Nice. This would have been very helpful when I was building an e-discovery document processing engine. Back then we could get text out (OCR, so kind of) but it was a bear to present. Markdown would have been a whole lot easier.