Brian Lovin
/
Hacker News
Daily Digest email

Get the top HN stories in your inbox every day.

tutfbhuf

High-availability durable filesystem is a difficult problem to solve. It usually starts with NFS, which is a big huge single point of failure. Depending on the nature of the application this might be good enough.

But if it's not, you'll typically want cross-datacenter replication so if one rack goes down you don't lose all your data. So then you're looking at something like Glusterfs/MooseFS/Ceph. But the latencies involved with synchronously replicating to multiple datacenters can really kill your performance. For example, try git cloning a large project onto a Glusterfs mount with >20ms ping between nodes. It's brutal.

Other products try to do asynchronous replication, EdgeFS is one I was looking at recently. This follows the general industry trend, like it or not, of "eventually consistent is consistent enough". Not much better than a cron job + rsync, in my opinion, but for some workloads it's good enough. If there's a partition you'll lose data.

Eventually you just give up and realize that a perfectly synchronized geographically distributed POSIX filesystem is a pipe dream, you bite the bullet and re-write your app against S3 and call it a day.

geertj

> It usually starts with NFS, which is a big huge single point of failure.

NFS is just the protocol. Whether it's a single point of failure depends on the server-side implementation. In Amazon EFS it is not.

(disclaimer: I'm a PM-T on the EFS team)

ryanmarsh

Hi, my use case for EFS Lambda is burts of small writes. Could you add a perf mode that favors bursts of small writes? We did some perf tests and just couldn’t make it fast enough.

jamesblonde

What's a small write in your use case? How many KBs is the write op, and how many ops/sec?

We can store small files on NVMe disks in metadata, so that you can read them with a latency of just a couple of ms, and can scale to 10s of thousands of concurrent reads/writes per second with small files.

https://www.logicalclocks.com/blog/millions-and-millions-of-...

tw04

https://www.netapp.com/cloud-services/cloud-volumes-service-...

You could try NetApp - it's significantly faster than EFS in my experience.

geertj

Hi, feel free to reach out with some details on your use case at gerardu at amazon.

redis_mlc

Everything is throttled in AWS because of people like you, so the answer is no.

the8472

> For example, try git cloning a large project onto a Glusterfs mount with >20ms ping between nodes. It's brutal.

That may be true but also is due to applications often having very sequential IO patterns even when they don't need to be.

I hope we'll get some convenience wrappers around io_uring that make batching of many small IO calls in a synchronous manner simple and easy for cases where you don't want to deal with async runtimes. E.g. bulk_statx() or fsync_many() would be prime candidates for batching.

fxtentacle

> write your app against S3 and call it a day.

... and then later you notice that you need a special SLA and Amazon Redshift to guarantee that a read from S3 will return the same value as the last write.

Even S3 is only eventually consistent and especially if a US user uploads images into your US bucket and then you serve the URLs to EU users, you might have loading problems. The correct solution, according to our support contact, is that we wait a second after uploading to S3 to make sure at least the metadata has replicated to the EU before we post the image url. That, or pay for Redshift to tell us how long to wait more precisely.

Because contrary to what everyone expects, S3 requests to US buckets might be served by delayed replicas in other countries if the request comes from outside the US.

Galanwe

> The correct solution, according to our support contact, is that we wait a second after uploading to S3

I'm shocked that you got this answer. This is definitely not how you are supposed to operate.

If you need to ensure the sequantiality of a write followed by a read on S3, the idiomatic way is to enable versioning on your bucket, issue a write, and provide the version ID to whoever need to read after that write.

Not only will that transparently ensure that you will not read deprecated data, but it will even ensure that you actually read the result of that particular write, and not any consecutive write that could have happened in between.

This pattern is very easy to implement for sequential operations in a single function, like:

    version = s3.put_object(...)
    data = s3.get_object(version_id=version, ...)
I agree that when you deal with various processes it can become messy.

In practice it never bothered me too much though. I prefer having explicit synchronization through versions, rather than having a blocking write waiting for all caches to be synchronized.

Also, this should only be necessary if you write to an already existing object. New keys will always trigger a cache invalidation from my understanding.

fxtentacle

Depending on your write throughput, versioning can be quite expensive.

Also, we were already at a throughput where the s3 metadata nodes would sometimes go offline, so I'm not sure putting more work on them would have improved our overall situation.

dividuum

> S3 requests to US buckets might be served by delayed replicas in other countries if the request comes from outside the US

What? That makes no sense. Do you have a source for that? I thought the explicit choice of region when creating a bucket limits where data is located. Why would the give you geo-replication for free? Also: "Amazon S3 creates buckets in a Region you specify. To optimize latency, minimize costs, or address regulatory requirements, choose any AWS Region that is geographically close to you." - https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket....

Also the consistency when creating objects is described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.... For new objects, you get read-after-write consistency and the example you gave contradicts that.

Dunedan

> For new objects, you get read-after-write consistency and the example you gave contradicts that.

Mind the documented caveat for this case:

> The caveat is that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency.

fxtentacle

They also say "A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list" which implies that you cannot read immediately after write.

slaymaker1907

You can minimize the effects of eventual consistency by distinguishing the distant replicas from the same-datacenter ones. Cross datacenter/region might be eventually consistent, but you will only see it rarely since the same-datacenter replicas are fully consistent with each other.

The main downside is that you have to pick one datacenter as the main datacenter at any given time. This can change if the old one goes down or due to business needs, but you can't have two of them as the main datacenter at once for a given dataset.

jmpman

Know of any commercial offerings with that characteristic?

Cixelyn

> EdgeFS is one I was looking at recently

Do you have any additional info? EdgeFS's github[1] doesn't work; does repo access require a Nexenta sales call?

We're also looking into asynchronously replicated FSs, I think built-in caching + tiering is slightly nicer than cron + rsync; would love to know what other solutions you looked into.

[1] https://github.com/Nexenta/edgefs

rsync

Don't give up too easily on (something like cron+rsync).

Things like cron+rsync fail in boring ways.

Fancy things fail in fascinating ways.

asyncscrum

This guy rsyncs

objectivefs

ObjectiveFS takes a different approach to high-availability durable file system by using S3 as the storage backend and building a log structured file system on top. This leverages the durability and scalability of S3 while allowing e.g. git cloning/diffing of a large project (even cross region) to remain fast. To compare we have some small file benchmark results at https://objectivefs.com/howto/performance-amazon-efs-vs-obje... .

harshaw

EFS attempts to solve this problem, although the speed of light between data centers is still a minor challenge.

hanikesn

Quobyte claim to have solved the synchronized distributed POSIX filesystem (mod geographically ).

catlifeonmars

S3 isn’t a file system. It’s a key-value store. If you’re trying to use it as a file system, you’re doing it wrong :)

bogomipz

That's exactly what a filesystem is.

Unix directories map directory and file names(keys) to inode numbers(values). And a file's inode(key) is a map to the data blocks(values) on disk that make up a file's content.

S3 might not have Posix semantics but to say it's not a filesystem is incorrect.

Nitramp

A Unix file system is not only a map from paths to inodes to bytes. Unix file systems support a host of additional apis, some of which are hard to implement on top of something like s3 (eg atomically renaming a parent path).

Even if you drop the Unix (posix) part, most practically used file systems have features that are hard to guarantee in a distributed setting, and in either case simply don't exist in S3.

bogomipz

The point in my post was to point out that storage implemented using keys and values is fundamental to all filesystems. To say that the very model that is fundamental to all filesystem is the thing that somehow precludes S3 from being considered a filesystem is kind of bizarre.

Also Nowhere did I say or even imply that a key/value store is all a Unix file system is. Different filesystems have different features. Object storage is a type of filesystem with a very specific feature set.

noidesto

Except its actually an object store and is used as a "file system" for hadoop/emr clusters for big data. See EMRFS.

valenterry

This.

It is an eventually consistent key-value store with some built-in optimization (indexes) for filtering/searching object timestamps and keys (e.g. listing keys by prefix), which allows for presenting them in a UI similar to how it is possible with files in a directory. That's about it.

gct

What do you think a file system is?

zelly

Eventually someone will find a coincidental or unintended aspect of your API and then depend on it. This article is the ultimate exemplar of this rule.

mobilemidget

but isn't a filename a key, and its content the value?

elhawtaky

I'm the author. Let me know if you have any questions.

hilbertseries

Reading through your article, this solution is built on top of s3. So, moving and listing files is faster, presumably due to a new metadata system you've built for tracking files. The trade off here, is that writes must be strictly slower now than they were previously because you've added a network hop. All read and write data now flows through these workers. Which adds a point of failure, if you steam too much data through these workers, you could potentially OOM them. Reads are potentially faster, but that also depends on the hit rate of the block cache. Would be nice to see a more transparent post listing the pros and cons, rather than what reads as a technical advertisement.

jamesblonde

I'm one of the co-authors. The numbers for writes are in the paper, so it is very unfair to call it an advertisement. And it is a global cache - if the block is cached somewhere, it will be used in reads.

xyzzy_plugh

The parent does make a good point about centralization of requests being a problem. S3 load balances under the hood, so different key prefixes within a bucket are usually serviced in isolation -- a DoS to one prefix will usually not affect other prefixes.

It seems like you'd be limiting yourself for concurrent access -- if everything is flowing through the MySQL cluster -- not a bad thing! Just perhaps warrants a caveat note. I'd expect S3 to smoke HopFS on concurrency.

dividuum

The linked post says "has the same cost as S3", yet on the linked pricing page there's only a "Contact Us" Enterprise plan besides a free one. Am I missing something?

jamesblonde

There is no extra charge for HopsFS as part of the Hopsworks platform - so you only pay for what you read/write/store in S3. There is SaaS pricing public on hopsworks.ai.

runako

This was also not clear to me when looking at the pricing page. I assumed there was Free and Enterprise, and that to use this beyond 250 working days, I would have to go to Enterprise.

fh973

> ... but has 100X the performance of S3 for file move/rename operations

Isn't rename in S3 effectively a copy-delete operation?

booi

That’s my understanding too. Also rename / copy turned out not to be very useful at the end of the day. Nearly all my implementations just boil down to randomized characters as ids

yandie

Yup. Use a system like Redis/DynamoDB or even a traditional database to store the metadata and use random UUID for actual file storage.

And tag the files for expiration/clean up. S3 is not a file system and people should stop treating it like one - only to get bitten by these assumptions around it being a FS.

ajb

You say it's "posix-like" - so what from posix had to be left out?

jamesblonde

Random writes are not supported in the HDFS API.

fwip

What tradeoffs did you make? In what situations does S3 have better characteristics than HopsFS?

undefined

[deleted]

de_Selby

How does this differ from Objectivefs?

jamesblonde

Funnily enough, i wasn't aware of ObjectiveFS - i guess it's because i can't find a research paper for it. HopsFS on S3 is similar to ADLS on Azure (built on Azure block storage). Internally, ADLS and HopsFS are different - HopsFS has a scaleout consistent metadata layer with a CDC API, while ADLS doesn't. But ADLS-v2 is also very good.

rsync

I searched your page, and then this HN discussion, for the string 'ssh' and got nothing ...

What is the access protocol ? What tools am I using to access the POSIX presentation ?

jeffbee

Diagram seems to imply an active/passive namenode setup like HDFS. Doesn't that limit it to tiny filesystems, and curse it with the same availability problems that plague HDFS?

chordysson

No because HopsFS uses a stateless namenode architecture. From https://github.com/hopshadoop/hops

"HopsFS is a new implementation of the Hadoop Filesystem (HDFS), that supports multiple stateless NameNodes, where the metadata is stored in MySQL Cluster, an in-memory distributed database. HopsFS enables more scalable clusters than Apache HDFS (up to ten times larger clusters), and enables NameNode metadata to be both customized and analyzed, because it can now be easily accessed via a SQL API."

SirOibaf

No, HopsFS namenodes are stateless as the metadata is stored on an in-memory distributed database (NDB)

There are other papers that describe HopsFS architecture in more details if you are interested: https://www.usenix.org/system/files/conference/fast17/fast17...

daviesliu

We see similar results using JuiceFS [1], are glad to see that another player has moved on the same direction with us, since JuiceFS was released to public in 2017, and is available on most of public cloud vendor.

The architect of JuiceFS is simper than HopsFS, it does not have worker nodes (the client access S3 and manage cache directly), and the metadata is stored in highly optimized server (similar to NN in HDFS, can be scaled out by adding more nodes).

JuiceFS provide POSIX client (using FUSE), and Hadoop SDK (in Java), and a S3 gateway (also a WebDAV gateway).

[1]: https://juicefs.com/docs/en/metadata_performance_comparison....

Disclaimer: Founder of JuiceFS here

hartator

> 100X the performance of S3 for file move/rename operations

I don’t see how it can be useful. Moving or renaming files in S3 seems more like maintenance than something you want to do on a regular basis.

colinwilson

Systems like Hadoop and Spark will run multiple versions of the same task writing out to a FS at once but to a temporary directory and when the first finishes the output data is just moved to the final place. It's not uncommon for a job to "complete" writing data to S3 and just sit there and hang as the FS move commands run copying the data and deleting the old version. It is just assumed a rename/move is a no-op in some systems.

randallsquared

Also, I presume that "move" means "update the index, held in another object or objects" (no slight! If I understand correctly, that's what most on-disk filesystems do as well). That's the only way I can imagine getting performance that much better.

stingraycharles

We do this in our ETL jobs on several hundreds of thousands of files a day. Not a reason to switch to a different system, but there are definitely non-maintenance use cases for this.

otterley

Any particular reason you don't tag the objects instead? That's a significantly lighter-weight operation, since S3 doesn't have native renaming capability.

jamesblonde

In HopsFS, you can also tag objects with arbitrary metadata (the xattr API), and the JSON objects you tag with are automatically (eventually consistent) replicated to Elastic search. So, you can search the FS namespace with free-text search on elastic. However, the lag is up to 100ms from the MySQL Cluster metadata. See the epipe paper in the blog post for details.

bfrydl

You can't list objects by tag.

haodemon

If someone is affected by the "copy-delete on move" problem on S3: there is a way to mitigate this https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoo...

Aperocky

S3 isn't a file system? There's a reason it's called buckets.

There's no 'renaming' of any file in S3. I don't think AWS had S3 as file system in mind. There's EBS for that.

dragonwriter

> I don't think AWS had S3 as file system in mind. There's EBS for that.

EBS isn't a file system, it's lower level than that (it's. block store, hence the name; you bring your own filesystem.) EFS and FSx are filesystems.

Aperocky

Thanks for correcting.

optimalsolver

>There's EBS for that

Don't you have to spin up an EC2 instance to use that?

Aperocky

You can keep it standalone or use it across multiple EC2 instances per my understanding.

stevekemp

Lambdas can use EBS these days, for persistent storage.

Dunedan

You might be confusing EBS (Elastic Block Storage) with EFS (Elastic File System): https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aw...

ComputerGuru

To be fair, EBS rolled out many years after.

untech

From the title, I though that this is a technology which gives you 100x faster static sites compared to S3, which sounded really cool. Discovering that it was about move operations was really underwhelming. The title is on the edge of being misleading.

smarx007

I think the right title may have been "HopsFS: speeding up HDFS over S3 by up to 100x"

perrohunter

Dude, there’s literally no rename/move in the official S3 specification, you are going to need to do a PUT object operation which is going to be as fast as a PUT operation is. If you judge a fish for its ability to fly it will feel stupid.

threeseed

Looks no different to Alluxio or Minio S3 Gateway or the dozens of other S3 caches around.

Would've been more interesting had it taken advantage of newer technologies such as io_uring, NVME over Fabric, RDMA etc.

emmanueloga_

I was gonna ask how it compares with Minio. Minio looks incredibly easy to "deploy" (single executable!) [1]. Looks nice to start hosting files on a single machine if there's no immediate need for the industrial strength features of AWS S3, or even to run locally for development.

1: https://min.io/download

rwdim

What’s the latency with 100,000 and 1,000,000 files?

elhawtaky

We haven't run mv/list experiments on 100,000 and 1,000,000 files for the blog. However, We expect the gap in latency between HopsFS/S3 and EMRFS would increase even further with larger directories. In our original HopsFS paper, we showed that the latency for mv/rename for a directory with 1 million files was around 5.8 seconds.

undefined

[deleted]

spicyramen

I have VM for my data scientists already in GCP, my datasets live in Google Cloud Storage. Can I take advantage of HopsFS for a shared file system across my VMs. Google Filestore Is ridículous expensive and at least they give u 1TB. Multi writer only supports 2VMs

boulos

Disclosure: I work on Google Cloud.

If Filestore (our managed NFS product) is too large for you, I'd suggest having gcsfuse on each box (or just use the GCS Connector for Hadoop). You won't get the kind of NFS-style locking semantics that a real distributed filesystem would support, but it sounds like you mostly need your data scientists to be able to read and write from buckets as if they're a local filesystem (and you wouldn't expect them to actively be overwriting each other or something where caching gets in the way).

Edit: We used gcsfuse for our Supercomputing 2016 run with the folks at Fermilab. There wasn't time to rewrite anything (we went from idea => conference in a few weeks) and since we mostly just cared about throughput it worked great.

spicyramen

Thanks for replying, is there any performance numbers for gcsfuse. We use small files and large files. We write small files at high rate. (Jupyter Notebooks) and can read and write large files (models) I also started looking into the multi writer and wondering if there are new developments (nodes >2)

boulos

More than anything, gcsfuse and similar projects that don’t put a separate metadata / data cache in between, reflect GCS’s latency and throughput (with a bit of an extra burden for being done through fuse).

GCS has pretty high latency. So even if you write a single byte, expect it to be like 50+ms. This is slower than a spinning disk in an old computer. If you’re just updating a single person’s notebook, they’ll feel it a bit on save (but obviously each person and file is independent).

But you can also do about 1 Gbps per network stream (and putting several of those together, 32+ Gbps per VM) such that even a 1 MB file is also probably done in about that much time. I think for streaming writes (save model) gcsfuse may still do a local copy until some threshold before writing out.

I’d probably put your models directly on GCS though. That’s what we do in colab and with TensorFlow (sadly, it seems from a quick search that PyTorch doesn’t have this out of the box).

Filestore and multi-writer PD will naturally improve over time. But I’m guessing you need something “today”. Feel free to send me a note (contact in profile) if you want to share specifics.

daviesliu

JuiceFS supports GCS and can be used in GCP, we have customers use it for data scientists. JuiceFS is free if you don't have data more than 1TB.

Discloser: Founder of JuiceFS here

Daily Digest email

Get the top HN stories in your inbox every day.

HopsFS: 100x Times Faster Than AWS S3 - Hacker News