Get the top HN stories in your inbox every day.
senecaso
I hope you have better luck than I did!
A few years ago, my partner and I built vendazzo.com (now defunct). It was an e-commerce search engine on products listed on Shopify shops (sound familiar? :)). At the time, we had > 100m products listed, and I don't remember how many shops we were indexing.. over 100k I think, but we had access to over a million. Overall, I think your approach is very similar to ours, but we managed to keep our costs lower. At the time, we were spending ~$550/mo, and our search times were under 300ms. We had established partnerships with a number of shops, and we had a few users, but not nearly enough. That's where the wheels came off. The site operated for over a year, but the monthly costs wore us down until we finally decided to pull the plug.
I still maintain that this is a good idea, and I constantly have to fight off the urge to "try again". To do it properly, though, I think funding would be necessary, or finding some way to organically gain a lot of users.
Looking back, there are things I could have done to reduce my opex further, but in the end, it still wouldn't have mattered if I couldn't figure out how to acquire users.
DeathArrow
>but in the end, it still wouldn't have mattered if I couldn't figure out how to acquire users
In the EU there are many price comparison engines with millions or billions of products. I don't know how popular they are. Some monetize through ads, some have partnerships with stores and you can buy directly from the search results.
I generally search first on the local Amazon equivalent; if I don't like what I see, I search on a smaller store. If I still can't find the product or dislike the products or prices, I search Google. If I am still not content with the results, I will go search on comparison engines.
And I also have a browser extension called Pricy that polls the comparison engines, so once I land on a product page I know which store has the better price and what the price history was over the last year.
Probably many people have similar patterns. I expect people in the US to search Amazon first, if it's not a very niche product they are after.
I think you can have a better monetization proposal if, instead of just search, you build a sales platform, so people can buy directly after searching, without hopping to various websites.
berkes
Unfortunately many of these "comparison" websites have a business model built on affiliate fees.
It doesn't take much imagination to predict which products show up as "best" or "cheapest".
And the fairer ones have to keep playing cat and mouse with shops lowering pricing when they detect a scraper coming by. Or employ tricks to make their shipping seem free, lowering their overall price on the comparison platform.
Semaphor
> It doesn't take much imagination to predict which products show up as "best" or "cheapest".
Never seen a "best" outside of amazon, which does weird shit even without any affiliate fees. And "cheapest" is not really up to the site, unless they want to go under quite quickly.
zo1
Many if not all are like that. It's like everyone wants to take advantage of the lack of perfect information in the marketplace, as opposed to actually being helpful for consumers.
senecaso
We were intentionally limiting the number of products and shops we were indexing due to opex. We needed to keep it low enough to provide ourselves with enough runway to keep things floating for longer.
pricerunner is another site which operates in a similar space. We had plans to build out the price tracking and a number of other features, so that we would appeal more to users who had your use cases. Sadly, we weren't getting enough traction. We did have regular users from the EU, but we simply couldn't seem to get in front of enough eyeballs for it to matter. At least at first, I expect that a large amount of the traffic to a new site like this has to be driven by Google, and we failed on that front as well. I'm not an SEO expert, so there were likely many things we did wrong or didn't even do which led to this situation.
re: a sales platform, that's a pretty big challenge to take on, which would require massive investment up front. Not sure that's a viable route for most. We did have plans to address the "without hopping to various websites" problem, as we identified that as problematic for users very early on. The solution was relatively simple, but required more money to build out. We simply ran out of funds before we could get there.
SantiagoRuberto
<< We did have plans to address the "without hopping to various websites" problem, as we identified that as problematic for users very early on. The solution was relatively simple, but required more money to build out. We simply ran out of funds before we could get there. >>
What were your plans to solve this problem?
wingerlang
> In EU there are many price comparison engines with millions or billions of products. I don't know how popular they are.
Anecdotally, I guess, I'd say extremely popular. I never search for products anywhere else.
timthelion
Yeah, here in Czechia I always look at https://www.heureka.cz/ first.
olivermuty
What do you consider the local Amazon variant? And which country?
pbmonster
Amazon has no direct presence in Switzerland, but you can order a fraction of its products from neighboring countries. Many products are not available, mainly because nobody wants to deal with customs once the product crosses the EU border.
Amazon itself never moved into Switzerland in the first place for many reasons (small market, unusual customs situation, relatively high salary for warehouse workers), and in the meantime the largest Swiss supermarket chain created an Amazon clone which became hugely popular pretty much immediately: Galaxus.ch
VPenkov
There are alternatives throughout Europe. The Balkans have Emag, Benelux has bol.com. I think in both regions Amazon is less popular. I'm sure there are other examples.
gsa
The Netherlands has plenty of them. Tweakers.net is a price tracker for electronics and such (eg: computer parts, phones, laptops etc) and usually it's easier to find a shop cheaper than Amazon. I have some go to stores for my needs because their content is organised way better than Amazon. I also find some alternatives better than Amazon because they have free next day shipping, something that's not free on Amazon.
DeathArrow
Emag in Romania. I hate it, they bought most of the competition, they did a lot of anticompetitive things, but it's really easy to buy from them.
julesvr
hagglezon.com to compare Amazon variant prices
vargr616
bol.com in the Netherlands
bruce511
I'm curious why you consider lack of users to be the problem. I would have described it as lack of revenue.
What plans did you have for generating revenue from the site? (Serious question - given your low costs it would seem like a tiny amount of revenue would have been enough.)
senecaso
Our business model revolved around referrals, so lack of users directly translated to lack of revenue. While it's true that we would have had a revenue problem if we had millions of users and none of them were buying sponsored items, that wasn't the problem we were facing, as the few users we did have were in fact purchasing sponsored items.
DeathArrow
Then the problem seems to be the lack of users.
Have you tried having a YouTube channel, TikTok, Facebook, Twitter, or a blog, and explaining daily how you built the website and how your platform is going to help users?
pencildiver
Thanks for sharing this! If you're up for it, I'd love to talk more about your experience, especially the technical tooling. Working as fast as I can to understand the right way to approach the tech, as there are tradeoffs with performance and price. I'm at support @ searchagora .com
bytearray
What strategies did you consider or implement to attract more users, and what would you do differently now to ensure better user acquisition?
senecaso
We had no capital, so advertising or solutions that basically involved "throwing money at the problem" were off the table for us.
We spent time posting in forums helping people find items they were looking for, and we had a few posts here on HN that generated short-lived, explosive traffic bursts. I remember those days we had posts get picked up on HN, it was always an exciting night!
We were looking at influencers and at getting our name out by getting bloggers to talk about us, but, again, without capital, our options were very limited here. I'm sure someone with more of a marketing background would have found a bunch of ways we could have generated organic user growth, but neither I nor my business partner had that skill set.
If I were to do it again, I think I would try to get someone with a marketing background involved to help gain traction. Without that, even the best product in the world will die of starvation if no one finds it.
slim
looks like symptoms of no market. maybe you were solving a problem already solved by amazon? most shops on shopify also use amazon
MuffinFlavored
> We spent time posting in forums helping people find items they were looking for,
Did you run any analytics on how much overlap there was across Shopify sites on "similar items" (Alibaba resellers/dropshippers)?
grumpyviscacha
Wow, it's cool to see this idea trending on HN! Full disclosure, I'm one of the co-founders at https://www.marmalade.co. Speaking from personal experience, it’s been a long road getting from the universe of all Shopify products to a curated inventory that’s easy for people to shop on. While ChatGPT isn't going to replace human curation anytime soon, the AI tailwind has made it much easier to build search and recommendation systems. On our end, we've definitely caught the semantic search bug. Watch out for it - you’ll wake up one day with a cross-modal hybrid search index on pinecone and any number of models on huggingface :). However, as you rightly point out, user growth is still the key. We're working toward launching a community aspect of the platform in the coming months as a solution.
senecaso
You site looks good, and your results are fantastic! Job well done. I did hit a server error though, so obviously still some issues to work out, but overall, really well done. Moving to semantic search was one of my top priorities before we went under, but I struggled to justify the costs of it as we were operating on a shoestring budget.
Best of luck to you and your team on user acquisition!
screye
What was the process for scraping 25M products?
I have always used standard python tools like selenium, bs4 and the like. But I'm guessing none of these work at scale.
Could you talk a little bit about your process and key bottlenecks at that scale? Also, how much did it cost?
______________
A recommendation for how to improve search.
Your base captions will be pretty bad. You can use spot instances on a smaller GPU machine to run a dense captioning model (https://portal.vision.cognitive.azure.com/demo/dense-caption...) and generate captions for all your images.
Then for search, a simple vector store index would be a great retrieval solution here. Search works better when it runs over the generated captions as well.
Both are pretty cheap and can be done reliably within 20-30 lines of code each in python. 3rd party tools for these are pretty stable.
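To make the retrieval half of this suggestion concrete, here is a minimal sketch. It assumes the captions have already been turned into embedding vectors by some model; the three-dimensional vectors and product names are invented for illustration, and the brute-force cosine lookup stands in for a real vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorIndex:
    """Brute-force in-memory index: fine for a demo, not for 25M items."""

    def __init__(self):
        self._items = []  # list of (product_id, vector)

    def add(self, product_id, vector):
        self._items.append((product_id, list(vector)))

    def search(self, query_vector, k=3):
        # Score every stored vector against the query, best match first.
        scored = [(cosine(query_vector, vec), pid) for pid, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [pid for _, pid in scored[:k]]


# Toy vectors standing in for caption embeddings (hypothetical values).
index = TinyVectorIndex()
index.add("red-sneaker", [0.9, 0.1, 0.0])
index.add("black-boot", [0.1, 0.9, 0.0])
index.add("dala-horse", [0.0, 0.1, 0.9])

top = index.search([1.0, 0.0, 0.0], k=2)
```

At real scale the linear scan would be replaced by an approximate nearest-neighbour index, but the interface (add vectors, query by vector, get ids back) stays about the same.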
pencildiver
Great suggestions, looking into this right now. First time building something like this so definitely new to some of these tools.
For scraping: Found that every Shopify store has a public JSON file that is available in the same route. The JSON file appears on the [Base URL]/products.json. For example, the store for Wild Fox has their JSON file available here: https://www.wildfox.com/products.json.
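A rough sketch of reading that endpoint, in Python rather than the JavaScript the crawler was written in. The `limit` and `page` query parameters and the field names (`products`, `variants`, `price`, `vendor`) are assumptions about the response shape and should be checked against a real store; the sample payload is hand-written, not real store data.

```python
import json
from urllib.parse import urlencode, urlsplit


def products_url(base_url, page=1, limit=250):
    """Build [Base URL]/products.json with assumed paging parameters."""
    parts = urlsplit(base_url if "://" in base_url else "https://" + base_url)
    query = urlencode({"limit": limit, "page": page})
    return f"{parts.scheme}://{parts.netloc}/products.json?{query}"


def parse_products(payload):
    """Pull a few fields out of a products.json response body."""
    data = json.loads(payload)
    rows = []
    for product in data.get("products", []):
        variants = product.get("variants", [])
        rows.append({
            "id": product.get("id"),
            "title": product.get("title"),
            "vendor": product.get("vendor"),
            # Use the first variant's price as the listing price.
            "price": variants[0].get("price") if variants else None,
        })
    return rows


# Hand-written sample in the assumed response shape.
sample = (
    '{"products": [{"id": 1, "title": "Red Shoes", "vendor": "Acme", '
    '"variants": [{"id": 10, "price": "49.00"}]}]}'
)
rows = parse_products(sample)
url = products_url("https://www.wildfox.com")
```

Fetching `url` with any HTTP client and passing the body to `parse_products` would complete the loop; the network call is left out so the sketch runs offline.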
Built a crawler in plain JavaScript that runs through a list of stores I bought from a site called "Built With", fetches each store's JSON file with the product listing data, and extracts just the fields we want for Agora. The data is then stored in Mongo and, currently, searched with Mongo Atlas Search (I saw they released Vector Search but haven't looked at it yet). It has been a process of trial and error to pick the data fields the front-end experience requires without drastically increasing the size of the data set. After initially using React, I switched to NextJS to make it easier to structure the URLs of each product listing page.
Mongo will run me about $1,500 / month at the current CPU level. AWS all in will be about $700. I'm currently not storing the image files, so that reduces the cost as well.
A few improvements that have helped so far:
- Having 2 separate Search Indexes, one for the 'brand' and one for the 'product'. There's a second public JSON file that is available on all Shopify stores with relevant store data at [Base URL]/meta.json. For example: https://wildfox.com/meta.json
- Removing the "tags" that are provided by store owners on Shopify. I believe these are placed for SEO reasons. These were 1 - 50 words / product so removing these reduced the data size we're dealing with. The tradeoff is that they can't be used to improve the search experience now.
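The two improvements above could look something like this. It is only a sketch: the kept field names, the `name`/`url` keys read from meta.json, and the sample record are assumptions for illustration, not Agora's actual schema.

```python
def slim_product(product, keep=("id", "title", "handle", "vendor", "product_type")):
    """Keep only the fields the front end needs; 'tags' is dropped by omission."""
    doc = {key: product[key] for key in keep if key in product}
    variants = product.get("variants") or []
    if variants:
        doc["price"] = variants[0].get("price")
    images = product.get("images") or []
    if images:
        doc["image"] = images[0].get("src")  # store the URL, not the file
    return doc


def split_indexes(store_meta, products):
    """Return (brand_doc, product_docs) destined for two separate indexes."""
    brand_doc = {"name": store_meta.get("name"), "url": store_meta.get("url")}
    product_docs = [slim_product(p) for p in products]
    return brand_doc, product_docs


# Hypothetical raw record with the bulky fields that get dropped.
raw = {
    "id": 1,
    "title": "Red Shoes",
    "vendor": "Acme",
    "tags": ["red", "shoes", "gift", "christmas"],
    "body_html": "<p>Long description</p>",
    "variants": [{"price": "49.00"}],
    "images": [{"src": "https://example.com/a.jpg"}],
}
brand, docs = split_indexes({"name": "Acme", "url": "https://example.com"}, [raw])
```

Using an allow-list of fields rather than deleting `tags` explicitly means any new bulky field a store adds is excluded by default.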
Hope this helps. Still wrapping my head around all of this.
ljm
2.2k/mo right off the bat is pretty steep, especially if you're paying that while the search response reliably takes over 10 seconds.
Why would you shovel 1.5k into MongoDB's pockets right off the bat? Especially when ElasticSearch is much better suited to what you're trying to do?
altdataseller
Sounds like someone drank the Mongo kool-aid. You absolutely do not need Mongo, let alone Mongo Atlas. 25 million documents of e-commerce products is measly and should fit in a single 600 GB server
dinobones
You could run this entire stack (yes, even for 25 million products) using Kubernetes in a $40/month Linode + Elasticsearch + Cloudflare free plan.
mfrye0
If you're already on AWS, I recommend switching to postgres for now. For context, I have 3 RDS instances, each multi zone, with the biggest instance storing several billion records. My total bill for all 3 last month was $661.
Postgres has full text search, vector search, and jsonb. With jsonb you can store and index json documents like you would in Mongo.
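As a hedged sketch of what the jsonb plus full text search combination might look like: the table name, column names, and indexed fields below are made up for illustration, and the SQL is held in Python strings so it can be handed to any Postgres driver that accepts named `%(name)s` placeholders (psycopg is one). It assumes a Postgres version with generated columns and `websearch_to_tsquery`; vector search is not covered here.

```python
# DDL: one jsonb column for the raw document plus a generated tsvector column.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS products (
    id  bigint PRIMARY KEY,
    doc jsonb NOT NULL,
    tsv tsvector GENERATED ALWAYS AS (
        to_tsvector('english',
                    coalesce(doc->>'title', '') || ' ' || coalesce(doc->>'vendor', ''))
    ) STORED
);
CREATE INDEX IF NOT EXISTS products_tsv_idx ON products USING GIN (tsv);
"""

# Ranked full text query over the generated column.
SEARCH_SQL = """
SELECT id, doc->>'title' AS title
FROM products
WHERE tsv @@ websearch_to_tsquery('english', %(q)s)
ORDER BY ts_rank(tsv, websearch_to_tsquery('english', %(q)s)) DESC
LIMIT %(limit)s;
"""


def search_params(query, limit=20):
    """Parameters for SEARCH_SQL, using named (pyformat) placeholders."""
    return {"q": query.strip(), "limit": int(limit)}


params = search_params("  red shoes ", limit=5)
```

With a driver connection in hand, usage would be along the lines of `cursor.execute(SEARCH_SQL, params)`; the user's query is always passed as a parameter, never interpolated into the SQL string.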
- https://www.postgresql.org/docs/current/textsearch.html
- https://aws.amazon.com/about-aws/whats-new/2023/05/amazon-rd...
philippemnoel
You can even do Elastic-level full text search in Postgres with pg_bm25 (disclaimer: I am one of the makers of pg_bm25). Postgres truly rules, agree on the rec :)
LunaSea
I have trouble seeing how this is possible.
$220 per instance gets you 8 GB of RAM, which is way, way below the index size if you are indexing billions of vectors.
neeleshs
how big is the disk for the biggest instance?
tehlike
Disclaimer: I am building https://pricetracker.wtf
You may want to look at Hetzner, and cut your costs by about 90%.
Feel free to reach me, email in profile.
katella
In your footer you have a lot of links like "kitchenaid price tracker" and "best buy price tracker". Have these links helped?
adentranter
hey! this is cool, I take it you are based in the US?
How long have you been working on this?
wolfgang42
I’ll second the comments that $2k/month is alarmingly high, especially for the performance that you seem to be getting. When I shoved ~40M webpages into a stock ElasticSearch instance running on a 2013-era server I bought for $200 (on eBay), it handled the load when I hit the HN front page just fine. Either you’re being drastically overcharged or there’s something horribly inefficient in your setup that could probably be tweaked fairly easily to bring your prices down.
jabo
I'm biased, but I'd recommend exploring Typesense for search.
It's an open source alternative to Algolia + Pinecone, optimized for speed (since it's in-memory) and an out-of-the-box dev experience. E-commerce is also a very common use-case I see among our users.
Here's a live demo with 32M songs: https://songs-search.typesense.org/
Disclaimer: I work on Typesense.
keybits
I can also highly recommend TypeSense and have no affiliation. You'll save a lot of money and get much faster results.
hipadev23
You’re spending $2k/mo to run this?? Holy hell.
k12sosse
> I'm currently not storing the image files, so that reduces the cost as well.
I wonder if someone catches on and replaces all your image URLs to the fuzzy testicle egg cup[0], will that negatively impact reputation?
leobg
I index 40M paragraphs of legal text, bm25 and vector similarity search, at < 200ms query time, on a single $80/month Hetzner server. Email in profile if you’d like to talk.
Ninjinka
As someone who has scraped millions of items myself, I had success using Geziyor (https://github.com/geziyor/geziyor) built in Go. Shopify sites are especially easy to scrape because they tend to share the same product data formatting and don't hide it behind JS rendering.
helsinki
25 million products is really not much at all to scrape.
DeathArrow
>I have always used standard python tools like selenium, bs4 and the like
There's nothing to scrape. You just download a JSON file that the site owners kindly put at your disposal.
Scraping is a more complex process, where you have to work around rate limiting and captchas. For the tool I built I wrote tens of thousands of lines of code and I still find daily issues I have to deal with if I want to scrape a particular web page, issues I don't always have the time to solve.
joshuamcginnis
I love your approach; you found a problem and developed a solution for it. And then you got the courage to share with the larger technical community. Good on you.
There's obviously some rough edges (multiple duplicate products, issues with product links linking to empty pages, and no results for broad terms), but don't let that stop you. I'm certain they can all be fixed.
Keep going! At the least, you'll come out of this with an excellent project in your portfolio.
pencildiver
Thank you, that means a lot. It has definitely been a whirlwind of emotions since posting on HN but glad I did. It's definitely an MVP so going to work fast to improve it.
pitched
Shopify has tried a few times to build a tool like this but hasn’t ever managed to get any traction. I think that missing any curation at all could be what eventually kills it. Their current attempt is https://shop.app and a query for red shoes is mostly red shoes.
senecaso
Ya, curation is sadly required in the Shopify ecosystem. There are millions of shops, there is a tonne of garbage. It's also difficult (but not impossible) to properly classify items so that you can better target results for a given query. One of the first problems that anyone attempting this will run into is the amount of mature content available on Shopify shops. Innocent queries turn up many NSFW images that may offend some users, so you have to be able to get on top of that one pretty quick.
I remember in one case, I found what appeared to be an escort service listing "models" on Shopify. It was super creepy. I needed to get in front of that one pretty quick as well, as it was turning up in results.
hackideiomat
> a query for red shoes is mostly red shoes
well I get mostly black shoes lol
Edit: ah no, they just use half a page for shoe shops first with black shoes as logo??
dangoodmanUT
ads baby
callmeed
I built this a couple years ago (now defunct) for the same reason :) The public JSON endpoints on shopify stores make it pretty easy to get the data. You mentioned using Mongo but it sounds expensive. I honestly think you could do this with just elastic or even postgres full text search and save money.
Here's a pro tip + feature you should implement: Shopify has a semi-hidden hack where you can link directly to checkout of a product if you know the variant ID. You could add a BUY NOW button to your site without forcing the user to navigate the original site or checkout flow. Example: https://hapaboardshop.com/cart/42165521907955 (it also supports quantities and coupon codes)
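A small sketch of building such a link. The `:quantity` suffix and the `discount` query parameter follow the commonly described Shopify cart permalink shape, but treat both as assumptions and test against a live store, since a reply further down reports the example link returning a 404.

```python
from urllib.parse import urlencode


def cart_permalink(shop_domain, variant_id, quantity=1, discount=None):
    """Build a direct-to-checkout link: /cart/{variant_id}:{quantity}."""
    if quantity < 1:
        raise ValueError("quantity must be at least 1")
    url = f"https://{shop_domain}/cart/{variant_id}:{quantity}"
    if discount:
        # Coupon code is passed as a query parameter (assumed name).
        url += "?" + urlencode({"discount": discount})
    return url


buy_now = cart_permalink("hapaboardshop.com", 42165521907955)
buy_two = cart_permalink("hapaboardshop.com", 42165521907955, quantity=2, discount="SAVE10")
```

The variant id is the `id` of an entry in a product's `variants` list, not the product id itself, which is an easy thing to mix up when wiring a BUY NOW button.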
A word of caution: more products isn't necessarily better. I definitely found there to be a long tail of really bad shopify stores and products. IMO it's better to curate or audit the stores you index; otherwise you risk your site being littered with kitschy t-shirts or drop-shipping garbage.
pencildiver
Thanks for the heads up! I spent some time trying to get the cart route to work. Doesn't seem to be supported anymore (link you sent leads to a 404 page). Tried it with every combination of Product ID, Variant ID, etc. Let me know if you have any ideas on how to get this to work. It would be a great feature to add to Agora.
And I agree on quality over quantity. Writing a script to remove all stores that are shutdown, products that are sold out, and a few other characteristics. Heavily focusing on the search algorithm and data quality now.
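A cleanup pass like that might be sketched as follows. The `stores` structure (a crawl status per domain) is hypothetical, and the per-variant `available` flag is an assumption about the product JSON that should be verified before relying on it.

```python
def in_stock(product):
    """True if at least one variant is marked available (assumed field)."""
    return any(v.get("available") for v in product.get("variants", []))


def prune_catalog(stores):
    """Drop dead stores and sold-out products.

    `stores` maps a store domain to {"status": int, "products": [...]},
    where status is the HTTP status of the last crawl (hypothetical shape).
    """
    kept = {}
    for domain, record in stores.items():
        if record.get("status") != 200:
            continue  # store is shut down or unreachable
        live = [p for p in record.get("products", []) if in_stock(p)]
        if live:
            kept[domain] = live
    return kept


# Made-up crawl results for illustration.
crawl = {
    "open.example": {"status": 200, "products": [
        {"id": 1, "variants": [{"available": True}]},
        {"id": 2, "variants": [{"available": False}]},
    ]},
    "closed.example": {"status": 404, "products": [
        {"id": 3, "variants": [{"available": True}]},
    ]},
    "soldout.example": {"status": 200, "products": [
        {"id": 4, "variants": [{"available": False}]},
    ]},
}
cleaned = prune_catalog(crawl)
```

A store whose products are all sold out is dropped entirely here; keeping it with an empty list would be the alternative if stores need to stay visible in the brand index.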
senecaso
I didn't know about the link to checkout. That's a slightly nicer user experience for sure. Still, it's confusing for users who want to do more shopping at the same time. I had users who clicked on a number of items, clicked "add to cart" in each one (all different shops), and then couldn't figure out how to check out on the main site afterwards! Obviously people were looking for a more complete one-stop-shopping experience than I was providing at the time.
callmeed
I mean a single checkout from multiple shopify stores isn't really possible (at least by 3rd parties)
My hypothesis is that, if you could drive traffic to your site and offer a fast checkout experience, there's probably multiple ways to monetize that. Driving the traffic is the hard part.
DeathArrow
>otherwise you risk your site being littered with kitschy t-shirts or drop-shipping garbage.
You mean like Amazon?
konschubert
Hey, I have a Shopify store that sells e-paper calendars / smart screens. I tried to search for it but I could not find it. What should I do so your crawler can find me?
pencildiver
You’re live on Agora:
https://www.searchagora.com/products/invisible-calendar-6266...
Thinking that we should have a page where store owners can submit their URL to be crawled.
konschubert
Cool, thanks!
pencildiver
Super cool product! I'm currently using a list of Shopify stores, so it's still limited (i.e. wanted to start with a relatively small list to focus on the search experience). I'll submit your URL to the crawler now. If you want to reach out to support @ searchagora.com , I'd love to get your feedback as a Shopify store owner.
shubham_sinha
Hi, you could drop an email to onboard@peppyhop.com and we will be happy to onboard you. Please mention your target geography, for example whether you would like to target the Indian market or the US market.
crakhamster01
always be closing lol
jillesvangurp
There are a few conferences dedicated to ecommerce search. Mices is pretty good. I did not go there this year but I know some of the people behind it. Good community and lots of stuff happening.
Two points here.
- 25 million is really not a lot for most search engines. Something like Elasticsearch can easily deal with that if you deal with it properly. And there are plenty of equally capable solutions. I have worked with logging clusters that process log entries by those numbers on a daily basis. A modestly sized cluster goes a long way for that. Bare metal is cheaper than cloud for this. But a couple of simple servers with decent CPUs and memory and SSDs should go a long way here. Start worrying once you hit a few hundred GB of storage used. Anything below that is easy to deal with.
- The key challenge with this volume is not performance but search quality. Building a competitive search engine is hard. You might have thousands of potential matches out of millions for any given query and your job is to pick the best 3, 5, 10 (whatever fits on your screen) ones. This is hard.
So, what makes for a good answer is the key question to answer. All the naive solutions for this problem put you at the bottom of the market in terms of competitiveness. If you can't do better, you are just another low quality search engine not quite solving the problem. The bar is high these days for a good search engine and most of the better ecommerce companies have highly skilled search teams working on this.
quaxar
Great site. Having built a search engine that needed to handle product data on a similar scale, it's not an easy thing to manage.
Some observations:
- Don't use infinite scrolling, it's an outdated UI practice that leads to bad user experience. It also makes the footer entirely unviewable.
- Clicking on a product card image does not reliably open up the product. I have to randomly click on it a few times (Chrome, Brave)
- Clicking on product card image and title leads to different actions, this is a bit unexpected, should show some hint of the difference.
- The product page pop up will reset the search list when closed, this messes up my search navigation, breaks the flow of browsing.
twothamendment
Searching is slow (kinda expected that right now), but after clicking a product and then hitting back, I have to wait for the search again.
Not at computer so I didn't check the headers, but maybe allow the client to cache the response for a short time so it doesn't need to load search results again.
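One way to express that suggestion is a short-lived `Cache-Control` header on search responses. The 60-second lifetime and the helper name are illustrative choices, not anything the site is known to use.

```python
def search_cache_headers(max_age_seconds=60):
    """Headers that let the browser reuse a search response briefly."""
    if max_age_seconds < 0:
        raise ValueError("max_age_seconds must be non-negative")
    return {
        # 'private' keeps shared proxies from caching per-user results.
        "Cache-Control": f"private, max-age={max_age_seconds}",
        # Cache separately per encoding so compressed variants don't mix.
        "Vary": "Accept-Encoding",
    }


headers = search_cache_headers()
```

With headers like these on the search endpoint, hitting back within the window lets the browser serve the stored response instead of re-running the query.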
pencildiver
Just upgraded the storage and put in a few fixes so it's working a bit faster now. Working on caching some responses locally as we speak. Great idea.
Redster
Have Swedish family. Searched dala because family wants traditional Christmas ornaments. Sure enough, there were several results that were 10x cheaper than what I could find on the first page of Big Search Company. Great job!
pencildiver
Amazing, glad you were able to find it. I also just learned about what a "Dala" is :)
TekMol
The Terms page goes to "Jaggi Enterprises", "A Modern Investment Fund. We buy, build, and invest in software companies with recurring revenue.".
So maybe this is not really something a guy built for his wife, but some anonymous startup that googled "Which terms rank best on Hacker News" and then wrote the "I did ... my wife .." story?
ltbarcly3
Jaggi is a fake it until you make it fake portfolio. Most of the companies it runs are just lorem ipsum fake sites. I think it is likely true that this is a solo dev.
SwedishExpat
Yeah you're right: https://www.linkedin.com/company/jaggi/
TekMol
You mean the site is not owned by Jaggi?
Then why would the terms and privacy links go to Jaggi?
pencildiver
OP here. Yup, I am in the process of starting a holding company LLC for my software products and small investments. Just went ahead and deleted 2 from the Investments page that are not launched yet but still in-development (just had landing pages up for those). Wasn't planning on releasing the Jaggi site yet, as I'm still wrapping my head around the holding company structure / it's new to me.
Agora has been a side project of mine. TBH in retrospect, I wish I had given this post more thought, as the servers and search performance weren't prepared for any significant traffic. So definitely didn't game HN.
muratsu
Agora also doesn't return red shoes for the search query "red shoes". Seems like you haven't fully solved the problem yet :)
From a technical perspective, crawling 25M products is impressive but the search itself doesn't provide much value to me. I already use large e-commerce sites (Amazon, Walmart, ...) and targeted ones (Nordstrom, SSENSE, ...). Sure I may not be searching through all the Shopify and Wix stores, but I need to know why that's valuable to me to begin with. Perhaps understanding the value prop of SMBs and educating me about it would be a better positioning for Agora than simply being a search engine.
pencildiver
Definitely have not solved the problem yet! The search algorithm prioritizes the brand called "Red Wing Shoe" so still figuring out ways to show real 'red shoes'. Have been thinking about passing the images through a detection tool and tagging them to enhance the search experience.
Re: Value Proposition. Absolutely, I think focusing on the SMB-angle and 'local shopping' will help direct users better. I'll definitely take this into account.
paulddraper
Best of luck on your marriage
charliejuggler
You could try the method we used for our vector search demo for e-commerce (all open source, natch) - use CLIP to get vector embeddings for product pictures and then use these for boosting or matching. https://opensourceconnections.com/blog/2023/03/22/building-v... Our demo works pretty well for searches like 'blue network cable' when the colour isn't always explicitly mentioned in the product data.
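The boosting idea can be sketched without any model at all. Assume each candidate already has a text relevance score and an image embedding (from CLIP or anything similar), and the query has been embedded into the same space; the two-dimensional vectors, scores, and the 0.5 weight below are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(candidates, query_vec, weight=0.5):
    """Order candidates by text score plus a weighted image-similarity boost.

    candidates: list of dicts with 'id', 'text_score', 'image_vec'.
    """
    scored = []
    for c in candidates:
        boost = weight * cosine(query_vec, c["image_vec"])
        scored.append((c["text_score"] + boost, c["id"]))
    scored.sort(reverse=True)
    return [pid for _, pid in scored]


# A brand-name match with a brown boot photo vs. an actual red shoe.
candidates = [
    {"id": "brand-match-brown-boot", "text_score": 1.0, "image_vec": [0.1, 0.9]},
    {"id": "red-heels", "text_score": 0.8, "image_vec": [0.95, 0.05]},
]
query_vec = [1.0, 0.0]  # stand-in for the embedded query "red shoes"
ordered = rerank(candidates, query_vec)
```

On text score alone the brand match would rank first; with the image boost the item that actually looks red moves ahead, which is the behaviour the "red shoes" query is missing.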
Hi HN! I built Agora as a side-project leading up to the holiday season. I wanted to find an easier way to find Christmas gifts, without needing to go store-by-store.
My wife asked me for a pair of red shoes for Christmas. I quickly typed it into Google and found a combination of ads from large retailers and links to a 1948 movie called 'Red Shoes'. I decided to build Agora to solve my own problem (and stay happily married). The product is a search engine that automatically crawls thousands of Shopify stores and makes them easily accessible with a search interface. There are a few additional features to enhance the buying experience, including saving products, filters, reviews, and popular products.
I've started with exclusively Shopify stores and plan to expand the crawler to other e-commerce platforms like BigCommerce, WooCommerce, Wix, etc. The technical challenge I've found is keeping the search speed and performance strong as the data set becomes larger. There's about 25 million products on Agora right now. I'll ramp this up carefully to make sure we don't compromise the search speed and user experience.
I'd love any feedback!