a1o
> The very first 16 KB enabled Android system will be made available on select devices as a developer option. This is so you can use the developer option to test and fix
> once an application is fixed to be page size agnostic, the same application binary can run on both 4 KB and 16 KB devices
I am curious about this. When could an app NOT be agnostic to this? Like what an app must be doing to cause this to be noticeable?
o11c
The fundamental problem is that system headers don't provide enough information. In particular, many programs need both "min runtime page size" and "max runtime page size" (and by this I mean non-huge pages).
If you call `mmap` without constraint, you need to assume the result will be aligned to at least "min runtime page size". In practice it is probably safe to assume 4K for this for "normal" systems, but I've seen it down to 128 bytes on some embedded systems, and I don't have much breadth there (this will break many programs though, since there are more errno values than that). I don't know enough about SPARC binary compatibility to know if it's safe to push this up to 8K for certain targets.
But if you want to call `mmap` (etc.) with full constraint, you must work in terms of "max runtime page size". This is known to be up to at least 64K in the wild (aarch64), but some architectures have "huge" pages not much beyond that so I'm not sure (256K, 512K, and 1M; beyond that is almost certainly going to be considered huge pages).
Besides a C macro, these values also need to be baked into the object file, and the linker needs to prevent incompatible assumptions (just in case a new microarchitecture changes them).
lanigone
you can also do 2M and 1G huge pages on x86, it gets kind of silly fast.
ShroudedNight
1G huge pages had (have?) performance benefits on managed runtimes for certain scenarios (Both the JIT code cache and the GC space saw uplift on the SpecJ benchmarks if I recall correctly)
If using relatively large quantities of memory 2M should enable much higher TLB hit rates assuming the CPU doesn't do something silly like only having 4 slots for pages larger than 4k ¬.¬
ignoramous
What? Any pointers on how 1G speeds things up? I'd have expected a bigger page size to wreak havoc on process scheduling and the filesystem.
dotancohen
Yes, but the context here is Java or Kotlin running on Android, not embedded C.
Or do some Android applications run embedded C with only a Java UI? I'm not an Android dev.
david_allison
The Android Native Development Kit (NDK) allows building native code libraries for Android (typically C/C++, but this can include Rust). These can then be loaded and accessed by JNI on the Java/Kotlin side
* Brief overview of the NDK: https://developer.android.com/ndk/guides
* Guide to supporting 16KB page sizes with the NDK https://developer.android.com/guide/practices/page-sizes
orf
Yes, Android apps can and do have native libraries. Sometimes this can be part of an SDK, or otherwise out of the developer's control.
fpoling
Chrome browser on Android uses the same code base as Chrome on desktop, including the multi-process architecture. But its UI is in Java, communicating with C++ using JNI.
saagarjha
Android apps can call into native code via JNI, which the platform supports.
warkdarrior
Apps written in Flutter/Dart and React Native/Javascript both compile to native code with only shims to interface with the Java UI framework.
mlmandude
If you use mmap/munmap directly within your application you could probably get into trouble by hardcoding the page size.
growse
If you use a database library that does mmap to create a db file with SC_PAGE_SIZE (4KB) pages, and then upgrade your device to a 16KB one and backup/restore the app, now your data isn't readable.
phh
Which is the reason you need to format your data to experiment with 16k
saagarjha
Page sizes are often important to code that relies on low-level details of the environment it's running in, like language runtimes. They might do things like mark some sections of code as writable or executable and thus would need to know what the granularity of those requests can be. It's also of importance to things like allocators that hand out memory backed by mmap pages. If they have, say, a bit field for each 16-byte region of a page that has been used, that will change in size in ways they can detect.
edflsafoiewq
jemalloc bakes in page size assumptions, see eg https://github.com/jemalloc/jemalloc/issues/467.
nox101
I don't know if this fits but I've seen code that allocated say 32 bytes from a function that allocated 1meg under the hood. Not knowing that's what was happening the app quickly ran out of memory. It arguably was not the app's fault. The API it was calling into was poorly designed and poorly named, such that the fact that you might need to know the block size to use the function was in no way indicated by the name of the function nor the names of any of its parameters.
dataflow
> When could an app NOT be agnostic to this
When the app has a custom memory allocator, the allocator might have hardcoded the page size for performance. Otherwise you have to load a static variable (knocks out a cache line you could've used for something else) and then do a multiplication (or bit shift, if you assume power of 2) by a runtime value instead of a shift by a constant, which can be slower.
No idea if Android apps are ever this performance sensitive, though.
devit
Seems pretty dubious to do this without adding support to the Linux kernel for having both 4KB and 16KB processes at once, since it means all old binaries break, and emulators which emulate normal systems with 4KB pages (Wine, console emulators, etc.) might lose performance dramatically if they need to emulate the MMU.
Hopefully they don't actually ship a 16KB default before supporting 4KB pages as well in the same kernel.
Also it would probably be reasonable, along with making the Linux kernel change, to design CPUs where you can configure a 16KB pagetable entry to map at 4KB granularity and pagefault after the first 4KB or 8KB (requires 3 extra bits per PTE or 2 if coalesced with the invalid bit), so that memory can be saved by allocating 4KB/8KB pages when 16KB would have wasted padding.
Veserv
Having both 4KB and 16KB simultaneously is either easy or hard depending on which hardware feature they are using for 16KB pages.
If they are using the configurable granule size, then that is a system-wide hardware configuration option. You literally can not map at smaller granularity while that bit is set.
You might be able to design a CPU that allows your idea of partial pages, but there be dragons.
If they are not configuring the granule size, instead opting for software enforcement in conjunction with always using the contiguous hint bit, then it might be possible.
However, I am pretty sure they are talking about hardware granule size, since the contiguous hint is most commonly used to support 16 contiguous entries (though the CPU designer is technically allowed to do whatever grouping they want), which would be 64KB.
stingraycharles
I’m a total idiot, how exactly is page size a CPU issue rather than a kernel issue? Is it about memory channel protocols / communication?
Disks have been slowly migrating away from the 4kb sector size; is this the same thing going on? That you need the actual drive to support it, because of internal structuring (i.e. how exactly the CPU aligns things in RAM), and on some super low level 4kb / 16kb being the smallest unit of memory you can allocate?
And does that then mean that there's less overhead in all kinds of memory (pre)fetchers in the CPU, because more can be achieved in fewer clock cycles?
s_tec
Each OS process has its own virtual address space, which is why one process cannot read another's memory. The CPU implements these address spaces in hardware, since literally every memory read or write needs to have its address translated from virtual to physical.
The CPU's address translation process relies on tables that the OS sets up. For instance, one table entry might say that the 4K memory chunk with virtual address 0x21000-0x21fff maps to physical address 0xf56e3000, and is both executable and read-only. So yes, the OS sets up the tables, but the hardware implements the protection.
Since memory protection is a hardware feature, the hardware needs to decide how fine-grained the pages are. It's possible to build a CPU with byte-level protection, but this would be crazy-inefficient. Bigger pages mean less translation work, but they can also create more wasted space. Sizes in the 4K-64K range seem to offer good tradeoffs for everyday workloads.
IshKebab
The CPU has hardware that does a page table walk automatically when you access an address for which the translation is not cached in the TLB. Otherwise virtual memory would be really slow.
Since the CPU hardware itself is doing the page table walk it needs to understand page tables and page table entries etc. including how big pages are.
Also you need to know how big pages are for the TLB itself.
The value of 4kB itself is pretty much arbitrary. It has to be a small enough number that you don't waste a load of memory by mapping memory that isn't used (e.g. if you ask for 4.01kB you're actually going to get 8kB), but a large enough number that you aren't spending all your time managing tiny pages.
That's why increasing the page size makes things faster but wastes more memory.
4kB arguably isn't optimal anymore since we have way more memory now than when it was de facto standardised so it doesn't matter as much if we waste a bit. Maybe.
pwg
> I’m a total idiot, how exactly is page size a CPU issue rather than a kernel issue?
Because the size of a page is a hardware defined size for Intel and ARM CPU's (well, more modern Intel and ARM CPU's give the OS a choice of sizes from a small set of options).
It (page size) is baked into the CPU hardware.
> And does that then mean that there’s less overhead in all kinds of memory (pre)fetchers in the CPU, because more can be achieved in less clock cycles?
For the same size TLB (Translation Look-aside Buffer -- the CPU hardware that stores the "referencing info" for the currently active set of pages being used by the code running on the CPU), a larger page size allows more total memory to be accessible before taking a TLB miss and having to replace one or more of the entries in the TLB. So yes, it means less overhead, because CPU cycles are not used up replacing as many TLB entries as often.
fpoling
Samsung SSDs still report to the system that their logical sector size is 512 bytes. In fact, one of the recent models even removed the option to reconfigure the disk to use 4K logical sectors. Presumably, since the physical sector is much larger and they need a complex mapping of logical sectors in any case, Samsung decided not to support the 4K option and stuck with 512 bytes.
sweetjuly
Hmm, I'm not sure that's quite right. ARMv8 supports per TTBR translation granules [1] and so you can have 4K and 16K user processes coexisting under an arbitrary page size kernel by just context switching TCR.TG0 at the same time as TTBR0. There is no such thing as a global granule size.
[1]: https://arm.jonpalmisc.com/2023_09_sysreg/AArch64-tcr_el2#fi...
Veserv
Well, if you want to run headfirst into the magical land of hardware errata, I guess you could go around creating heterogeneous, switched mappings.
I doubt the TCRs were ever intended to support rapid runtime switching, or that the TLBs were ever intended to support heterogeneous entries even with ASID tagging.
jonpalmisc
cool site you linked there :)
Zefiroj
The support for mTHP exists in upstream Linux, but the swap story is not quite there yet. THP availability also needs work and there are a few competing directions.
Supporting multiple page sizes well transparently is non-trivial.
For a recent summary on one of the approaches, TAO (THP Allocation Optimization), see this lwn article: https://lwn.net/Articles/974636/
phh
Google/Android doesn't care much about backward compatibility, and in the Pixel 7 broke programs released in the Pixel 3 era. (The Play Store ban on 32-bit-only apps dates to 2019; the Pixel 7 is the first 64-bit-only device, while Google still released 32-bit-only devices in 2023...) They quite regularly break apps in new Android versions (despite their infrastructure to handle backward compatibility), and app developers are used to bracing themselves around Android & Pixel releases.
reissbaker
Generally I've found Google to care much more about not breaking old apps compared to Apple, which often expects developers to rebuild apps for OS updates or else the apps stop working entirely (or buy entirely new machines to get OS updates at all, e.g. the Intel/Apple Silicon transition). Google isn't on the level of Windows "we will watch for specific binaries and re-introduce bugs in the kernel specifically for those binaries that they depend on" in terms of backwards compatibility, but I wouldn't go so far as to say they don't care. I'm not sure whether that's better or worse: there's definitely merit to Apple's approach, since it keeps them able to iterate quickly on UX and performance by dropping support for the old stuff.
username81
Shouldn't there be some kind of setting to change the page size per program? AFAIK AMD64 CPUs can do this.
saagarjha
Yes, ARM CPUs can do it too.
lxgr
> all old binaries break and emulators which emulate normal systems with 4KB pages
Would it actually affect the kind of emulators present on Android, i.e. largely software-only ones, as opposed to hardware virtualizers making use of a CPU's vTLB?
Wine is famously not an emulator and as such doesn't really exist/make sense on (non-x86) Android (as it would only be able to execute ARM binaries, not x86 ones).
For the downvote: Genuinely curious here on which type of emulator this could affect.
fouronnes3
Could they upstream that or would that require a fork?
mgaunard
why does it break userland? if you need to know the page size, you should query sysconf SC_PAGESIZE.
fweimer
It should not break userland. GNU/Linux (not necessarily Android though) has supported 64K pages pretty much from the start, because that was the page size originally chosen for server-focused kernels and distributions. But there are some things that need to be worked around.
Certain build processes determine the page size at compile time and assume it's the same at run time, and fail if it is not: https://github.com/jemalloc/jemalloc/issues/467
Some memory-mapped files formats have assumptions about page granularity: https://bugzilla.redhat.com/show_bug.cgi?id=1979804
The file format issue applies to ELF as well. Some people patch their toolchains (or use suitable linker options) to produce slightly smaller binaries that can only be loaded if the page size is 4K, even though the ABI is pretty clear in that you should link for compatibility with up to 64K pages.
akdev1l
Assumptions in the software.
Jemalloc is infamous for this: https://github.com/sigp/lighthouse/issues/5244
ndesaulniers
Ossification.
If the page size has been 4k for decades on most OSes and architectures, people get sloppy and hardcode that literal value rather than query for it.
PhilipRoman
Replacing compile time constants with function calls will always bring some trouble, suddenly you need to rearrange your structures, optimizations get missed (in extreme cases you can accidentally introduce a DIV instruction), etc. So it is not surprising that code assumes 4k pages.
mgaunard
Any code that does divisions and modulos with non-constants that are known to be powers of 2 should do the optimization manually.
Even for non-powers-of-two there are also techniques to speed up divisions if the same divisor is used repeatedly.
Dwedit
Emulating a processor with 4K pages becomes much faster if you can use real addresses directly.
aaron695
[dead]
twoodfin
A little additional background: iOS has used 16KB pages since the 64-bit transition, and ARM Macs have inherited that design.
arghwhat
A more relevant bit of background is that 4KB pages lead to quite a lot of overhead due to the sheer number of mappings needing to be configured and cached. Using larger pages reduce overhead, in particular TLB misses as fewer entries are needed to describe the same memory range.
While x86 chips mainly support 4K, 2M, and 1G pages, ARM chips tend to support the more practical 16K page size - a nice balance between performance and memory wasted due to coarser allocation granularity.
Nothing in particular to do with Apple and iOS.
jsheard
Makes me wonder how much performance Windows is leaving on the table with its primitive support for large pages. It does support them, but it doesn't coalesce pages transparently like Linux does, and explicitly allocating them requires special permissions and is very likely to fail due to fragmentation if the system has been running for a while. In practice it's scarcely used outside of server software which immediately grabs a big chunk of large pages at boot and holds onto them forever.
andai
A lot of low level stuff is a lot slower on Windows, let alone the GUI. There's also entire blogs cataloging an abundance of pathological performance issues.
The one I notice the most is the filesystem. Running Linux in VirtualBox, I got 7x the host speed for many small file operations. (On top of that Explorer itself has its own random lag.)
I think a better question is how much performance are they leaving on the table by bloating the OS so much. Like they could have just not touched Explorer for 20 years and it would be 10x snappier now.
I think the number is closer to 100x actually. Explorer on XP opens (fully rendered) after a single video frame... also while running virtualized inside Win10.
Meanwhile Win10 Explorer opens after a noticeable delay, and then spends the next several hundred milliseconds painting the UI elements one by one...
arghwhat
Quite a bit, but 2M is an annoying size and the transparent handling is suboptimal. Without userspace cooperating, the kernel might end up having to split the pages at random due to an unfortunate unaligned munmap/madvise from an application not realizing it was being served 2M pages.
Having Intel/AMD add 16-128K page support, or making it common for userspace to explicitly ask for 2M pages for their heap arenas is likely better than the page merging logic. Less fragile.
1G pages are practically useless outside specialized server software as it is very difficult to find 1G contiguous memory to back it on a “normal” system that has been running for a while.
tedunangst
I've lost count of how many blog posts about poor performance ended with the punchline "so then we turned off page coalescing".
daghamm
IIRC, 64-bit ARM can do 4K, 16K, 64K and 2M pages. But there are some special rules for the last one.
https://documentation-service.arm.com/static/64d5f38f4a92140...
sweetjuly
It's a little weirder. At least one translation granule is required but it is up to the implementation to choose which one(s) they want. Many older Arm cores only support 4KB and 64KB but newer ones support all three.
The size of the translation granule determines the size of the block entries at each level. So 4K granules have super pages of 2MB and 1GB, 16K granules have 32MB super pages, and 64K granules have 512MB super pages.
CalChris
Armv8-A also supports 4K pages: FEAT_TGran4K. So Apple did indeed make a choice to instead use 16K, FEAT_TGran16K. Microsoft uses 4K for AArch64 Windows.
HumblyTossed
How is this "additional background"? This was a post by Google regarding Android.
Kwpolska
That this isn't the only 4K→16K transition in recent history? Some programs that assumed 4K had to be fixed as part of the transition, this can provide insights for the work required for Android.
carstenhag
As an Android dev: the work hours I spend talking with iOS colleagues are about the same as with Android ones. Usually you want to be sort of up to date with the other platform as well.
eyalitki
RHEL tried that in the past with 64KB pages on AArch64; it led to MANY bugs all across the software stack, and they eventually reverted it - https://news.ycombinator.com/item?id=27513209.
I'm impressed by the effort on Google's side, yet I'll be surprised if this effort will pay off.
nektro
apple's m-series chips use a 16kb page size by default so the state of things has improved significantly with software wanting to support asahi and other related endeavors
rincebrain
I didn't realize they had reverted it, I used to run RHEL builds on Pi systems to test for 64k page bugs because it's not like there's a POWER SBC I could buy for this.
kcb
Nvidia is pushing 64KB pages on their Grace-Hopper system.
monocasa
I wonder how much help they had by asahi doing a lot of the kernel and ecosystem work anablibg 16k pages.
RISC-V being fixed to 4k pages seems to be a bit of an oversight as well.
IshKebab
Probably wouldn't be too hard to add a 16 kB page size extension. But I think the Svnapot extension is their solution to this problem. If you're not familiar it lets you mark a set of pages as being part of a contiguously mapped 64 kB region. No idea how the performance characteristics vary. It relieves TLB pressure, but you still have to create 16 4kB page table entries.
monocasa
Svnapot is a poor solution to the problem.
On one hand it means that each page table entry takes up half a cache line in the 16KB case, and two whole cache lines in the 64KB case. This really cuts down on the page walker hardware's ability to effectively prefetch TLB entries, leading to basically the same issues as this classic discussion about why tree based page tables are generally more effective than hash based page tables (shifted forward in time to today's gate counts). https://yarchive.net/comp/linux/page_tables.html This is why ARM shifted from a Svnapot like solution to the "translation granule queryable and partially selectable at runtime" solution.
Another issue is the fact that a big reason to switch to 16KB or even 64KB pages is to allow for more address range for VIPT caches. You want to allow high performance implementations to be able to look up the cache line while performing the TLB lookup in parallel, then compare the tag with the result of the TLB lookup. This means that practically only the untranslated bits of the address can be used by the set selection portion of the cache lookup. When you have 12 untranslated bits in an address, combined with 64 byte cachelines, that gives you 64 sets; multiply that by 8 ways and you get the 32KB L1 caches very common in systems with 4KB page sizes (sometimes with some heroic effort to throw a ton of transistors/power at the problem to make a 64KB cache by essentially duplicating large parts of the cache lookup hardware for that extra bit of address). What you really want is for the arch to be able to disallow 4KB pages like on apple silicon, which is the main piece that allows their giant 128KB and 192KB L1 caches.
aseipp
> What you really want is for the arch to be able to disallow 4KB pages like on apple silicon which is the main piece that allows their giant 128LB and 192KB L1 caches.
Minor nit but they allow 4k pages. Linux doesn't support 16k and 4k pages at the same time; macOS does but is just very particular about 4k pages being used for scenarios like Rosetta processes or virtual machines e.g. Parallels uses it for Windows-on-ARM, I think. Windows will probably never support non-4k pages I'd guess.
But otherwise, you're totally right. I wish RISC-V had gone with the configurable granule approach like ARM did. Major missed opportunity but maybe a fix will get ratified at some point...
wren6991
> This means that practically only the untranslated bits of the address can be used by the set selection portion of the cache lookup
It's true that this makes things difficult, but Arm have been shipping D caches with way size > page size for decades. The problem you get is that virtual synonyms of the same physical cache block can become incoherent with one another. You solve this by extending your coherence protocol to cover the potential synonyms of each index in the set (so for example with 16 kB/way and 4 kB pages, there are four potential indices for each physical cache block, and you need to maintain their coherence). It has some cost and the cost scales with the ratio of way size : page size, so it's still desirable to stay under the limit, e.g. by just increasing the number of cache ways.
ashkankiani
It's pretty cool that I can read "anablibg" and know that means "enabling." The brain is pretty neat. I wonder if LLMs would get it too. They probably would.
evilduck
Question I wrote:
> I encountered the typo "anablibg" in the sentence "I wonder how much help they had by asahi doing a lot of the kernel and ecosystem work anablibg 16k pages." What did they actually mean?
GPT-4o and Sonnet 3.5 understood it perfectly. This isn't really a problem for the large models.
For local small models:
* Gemma2 9b did not get it and thought it meant "analyzing".
* Codestral (22b) did not get it and thought it meant "allocating".
* Phi3 Mini failed spectacularly.
* Phi3 14b and Qwen2 did not get it and thought it was "annotating".
* Mistral-nemo thought it was a portmanteau "anabling" as a combination of "an" and "enabling". Partial credit for being close and some creativity?
* Llama3.1 got it perfectly.
treyd
I wonder if they'd do better if there was the context that it's in a thread titled "Adding 16 kb page size to Android"? The "analyzing" interpretation is plausible if you don't know what 16k pages, kernels, Asahi, etc are.
jandrese
Seems like there is a bit of a roll of the dice there. The ones that got it right may have just been lucky.
slaymaker1907
I wonder how much of a test this is for the LLM vs whatever tokenizer/preprocessing they're doing.
Alifatisk
Is there any task Gemma is better at compared to others?
Retr0id
fwiw I failed to figure it out as a human, I had to check the replies.
im3w1l
I asked chatgpt and it did get it.
Personally, when I read the comment my brain kinda skipped over the word. Since it contained the part "lib", I assumed it was some obscure library that I didn't care about. It doesn't fit grammatically, but I didn't give it enough thought to notice.
mrbuttons454
Until I read your comment I didn't even notice...
mrob
LLMs are at a great disadvantage here because they operate on tokens, not letters.
platelminto
I remember reading somewhere that LLMs are actually fantastic at reading heavily mistyped sentences! Mistyped to a level where humans actually struggle.
(I will update this comment if I find a source)
saagarjha
Probably very little, since the Android ecosystem is quite divorced from the Linux one.
wren6991
RV64 has some reserved encoding space in satp.mode so there's an obvious path to expanding the number of page table formats at a later time. Just requires everyone to agree on the direction (common issue with RISC-V).
For RV32 I think we are probably stuck with Sv32 4k pages forever.
CalChris
iOS has had 16K pages since forever.
OSX switched to 16K pages in 2020 with the M1.
Windows is stuck on 4K pages, even for AArch64.
Linux has various page sizes. Asahi is 16K.
nullindividual
Windows has 4K, 2M, and 1G page sizes on x86-64.
CalChris
Normal, large and huge. But default normal pages (which is what Android is changing) are 4K. FWIW, Itanium and Alpha had 8K default pages.
https://devblogs.microsoft.com/oldnewthing/20210510-00/?p=10...
I wonder why Microsoft stayed with 4K for AArch64.
Kwpolska
Microsoft wanted to make x86 compatibility as painless as possible. They adopted an ABI in which registers can be generally mapped 1:1 between the two architectures.
nullindividual
I was confused as to why you were posting incorrect information when this thread already contained the correct information.
baby_souffle
> and 1G page sizes on x86-64.
I wonder who requested the 1G page size be implemented and what they use it for...
Kwpolska
Another thread says virtual machines.
lxgr
Now I wonder: Does increased page size have any negative impacts on I/O performance or flash lifetime, e.g. for writebacks of dirty pages of memory-mapped files where only a small part was changed?
Or is the write granularity of modern managed flash devices (such as eMMCs as used in Android smartphones) much larger than either 4 or 16 kB anyway?
tadfisher
Flash controllers expose blocks of 512 B or 4096 B, but the actual NAND chips operate in terms of "erase blocks" which range from 1MB to 8MB (or really anything); in these blocks, an individual bit can be flipped from "1" to "0" once, and flipping any bit back to "1" requires erasing the entire block (which resets every bit to "1") and re-programming the desired bits back to "0" [0].
All of this is hidden from the host by the NAND controller, and SSDs employ many strategies (including DRAM caching, heterogeneous NAND dies, wear-leveling and garbage-collection algorithms) to avoid wearing the storage NAND. Effectively you must treat flash storage devices as block devices of their advertised block size because you have no idea where your data ends up physically on the device, so any host-side algorithm is fairly worthless.
lxgr
Writes on NAND happen at the block, not the page level, though. I believe the ratio between the two is usually something like 1:8 or so.
Even blocks might still be larger than 4KB, but if they’re not, presumably a NAND controller could allow such smaller writes to avoid write amplification?
The mapping between physical and logical block address is complex anyway because of wear leveling and bad block management, so I don’t think there’s a need for write granularity to be the erase block/page or even write block size.
to11mtm
> Even blocks might still be larger than 4KB, but if they’re not, presumably a NAND controller could allow such smaller writes to avoid write amplification?
Look at what SandForce was doing a decade+ ago. They had hardware compression to lower write amp and some sort of 'battery backup' to ensure operations completed. Various bits of this sort of tech is in most decent drives now.
> The mapping between physical and logical block address is complex anyway because of wear leveling and bad block management, so I don’t think there’s a need for write granularity to be the erase block/page or even write block size.
The controller needs to know what blocks can get a clean write vs what needs an erase; that's part of the trim/gc process they do in background.
Assuming you have sufficient space, it works kinda like this:
- Writes are done to 'free-free' area, i.e. parts of the flash it can treat like SLC for faster access and less wear. If you have less than 25%-ish of drive free this becomes a problem. Controller is tracking all of this state.
- When it's got nothing better to do for a bit, controller will work to determine which old blocks to 'rewrite' with data from the SLC-treated flash into 'longer lived' but whatever-Level-cell storage. I'm guessing (hoping?) there's a lot of fanciness going on there, i.e. frequently touched files take longer to get a full rewrite.
TBH sounds like a fun thing to research more
lostmsu
Not entirely related (except the block size), but I am considering making and standardizing a system-wide content-based cache with default block size 16KB.
The idea is that you'd have a system-wide (or not) service that can do two or three things:
- read 16KB block by its SHA256 (also return length that can be <16KB), if cached
- write a block to cache
- maybe pin a block (e.g. make it non-evictable)
It would be like a block-level file content dedup + eviction to keep the size limited.
Should reduce storage used by various things due to dedup functionality, but may require internet for corresponding apps to work properly.
With a peer-to-peer sharing system on top of it may significantly reduce storage requirements.
The only disadvantage is the same as with shared website caches before cache isolation was introduced: apps can probe what you have in your cache and deduce some information about you from it.
monocasa
I'd probably pick a size greater than 16KB for that. Windows doesn't expose mappings at a granularity finer than 64KB in its version of mmap, and internally its file cache works in increments of 256KB. And those are numbers they picked back in the 90s.
treyd
I would go for higher than 16K. I believe BitTorrent's default minimum chunk size is 64K, for example. It really depends on the use case in question though, if you're doing random writes then larger chunk sizes quickly waste a ton of bandwidth, especially if you're doing recursive rewrites of a tree structure.
Would a variable chunk size be acceptable for whatever it is you're building?
lostmsu
I could feasibly do 2x partitioning, e.g. have caches for 16KB, 32KB, etc., provided there's some mechanism to automatically combine/split pieces.
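The combine/split mechanism could work buddy-style on power-of-two tiers. This is a hypothetical sketch of just the splitting half; the tier list and function name are invented:

```python
# Hypothetical power-of-two chunking: cover a piece of content with
# tier-sized chunks, largest tier first. Tiers are invented for the sketch.
TIERS = [16 * 1024 * (2 ** i) for i in range(4)]  # 16K, 32K, 64K, 128K

def split_into_tiers(length: int) -> list[int]:
    """Greedily cover `length` bytes with tier-sized chunks, largest first."""
    chunks = []
    remaining = length
    for size in reversed(TIERS):
        while remaining >= size:
            chunks.append(size)
            remaining -= size
    if remaining:
        chunks.append(TIERS[0])  # the tail goes in a smallest-tier chunk
    return chunks
```

For example, a 160KB piece would split into one 128KB chunk plus one 32KB chunk; combining would run the same logic in reverse when two sibling chunks of one tier are both present.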
taeric
I see they have measured improvements in the performance of some things. In particular, the camera app starts faster. Small percentage, but still real.
Curious if there are any other changes you could make based on some of those learnings? The camera app, in particular, seems like a good one to optimize to start instantly. Especially so with the "double power key" shortcut that many phones/people have set up.
Specifically, I would expect you could do something like the Lisp norm of "dumping an image"? Startup should then largely be loading the image, not executing much if any initialization code. (Honestly, I mostly assume this already happens?)
saagarjha
A big part of the challenge for launching the camera app is getting the hardware ready and quickly freeing up RAM for image processing.
taeric
Makes sense. That it is so much faster on repeat wakeups does seem to hint that it could be caching something. I'm assuming you're saying that most of what gets computed is related to paging memory in/out? That would track with how it could be better with larger pages.
daghamm
Can someone explain those numbers to me?
A 5-10% performance boost sounds huge. Wouldn't we have much larger TLBs if page walks were really this expensive?
On the other hand, a 9% increase in memory usage also sounds huge. Why does this affect memory usage that much?
scottlamb
> A 5-10% performance boost sounds huge. Wouldn't we have much larger TLBs if page walks were really this expensive?
It's pretty typical for large programs to spend 15+% of their "CPU time" waiting for the TLB. [1] So larger pages really help, including changing the base 4 KiB -> 16 KiB (4x reduction in TLB pressure) and using 2 MiB huge pages (512x reduction where it works out).
I've also wondered why the TLB isn't larger.
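The pressure argument is easy to make concrete with back-of-the-envelope TLB reach (entries × page size = memory mappable without a page walk). The 1536-entry figure below is a rough order of magnitude for an L2 dTLB, not any specific CPU's spec:

```python
# TLB "reach" = entries * page size: how much memory the TLB can map
# without triggering a page walk. 1536 entries is an assumed rough
# L2 dTLB size, not a specific CPU's figure.
ENTRIES = 1536

for name, page in [("4 KiB", 4096), ("16 KiB", 16384), ("2 MiB", 2 << 20)]:
    reach = ENTRIES * page
    print(f"{name:>6} pages: reach = {reach / (1 << 20):.0f} MiB")
```

With those assumptions, reach goes from 6 MiB at 4 KiB pages to 24 MiB at 16 KiB, and to 3 GiB with 2 MiB huge pages, which is why working sets that blow past the 4 KiB reach see double-digit stalls.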
> On the other hand 9% increase in memory usage also sounds huge. How did this affect memory usage that much?
This is the granularity at which physical memory is assigned, and there are a lot of reasons most of a page might be wasted:
* The heap allocator will typically cram many things together in a page, but it might, say, only use a given page for allocations in a certain size range, so not all allocations will snuggle in next to each other.
* Program stacks each use at least one distinct page of physical RAM because they're placed in distinct virtual address ranges with guard pages between. So if you have 1,024 threads, they use at least 4 MiB of RAM with 4 KiB pages, or 16 MiB with 16 KiB pages.
* Anything from the filesystem that is cached in RAM ends up in the page cache, and true to the name, it has page granularity. So caching a 1-byte file would take 4 KiB before, 16 KiB after.
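The arithmetic behind the last two bullets is just page-granularity round-up:

```python
# Worked examples of page-granularity waste from the list above.
def pages_needed(nbytes: int, page: int) -> int:
    return -(-nbytes // page)  # ceiling division

# Caching a 1-byte file costs one full page: 4 KiB before, 16 KiB after.
one_byte_cost_4k = pages_needed(1, 4096) * 4096
one_byte_cost_16k = pages_needed(1, 16384) * 16384

# 1,024 thread stacks each touch at least one distinct physical page.
stacks_4k = 1024 * 4096      # 4 MiB
stacks_16k = 1024 * 16384    # 16 MiB
```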
[1] If you have an Intel CPU, toplev is particularly nice for pointing this kind of thing out. https://github.com/andikleen/pmu-tools
95014_refugee
> I've also wondered why the TLB isn't larger.
Fast CAMs are (relatively) expensive, is the excuse I always hear.
pflanze
I would expect this to widen the gap between new and old phones and make old phones unusable sooner: new phones will typically have enough RAM to live with the 9% less efficient memory use, and will see the 5-10% speedup. Old phones are typically bottlenecked on RAM, will now hit that bottleneck 9% earlier, and reloading pages from disk (or swapping, if enabled) carries a much higher overhead than 5-10%.
In the Debian kernel we've very recently enabled building an ARM64 kernel flavour with a 16KiB page size, and we've discussed adding a 64KiB flavour at some point, as is already the case for PowerPC64.
This will likely reveal bugs that need fixing in some of the 70,000+ packages in the Debian archive.
That ARM64 16KiB page size is interesting with respect to the Apple M1, where Asahi [0] identified that the DART IOMMU has a minimum page size of 16KiB, so using that page size as a minimum for everything is going to be more efficient.
[0] https://asahilinux.org/2021/10/progress-report-september-202...