August 2, 2013
Participants: Ric Wheeler, Daniel Phillips, Chris Mason, Dave Jones, Miklos Szeredi, Tony Luck, Mel Gorman, Ben Hutchings, Christoph Lameter.
People tagged: Matthew Wilcox, Jens Axboe, Jeff Moyer, Ingo Molnar, Zach Brown.
Ric Wheeler proposes a break-out session covering persistent memory and storage. Ric notes that with persistent RAM, you cannot expect a simple reboot to fully clean up state, which might be a bit of a surprise for some parts of the boot path. Ric also calls out devices that support atomic write and shingled disk drives.
Daniel Phillips asks what we are going to do with these devices in real kernel code, arguing that we should not leave this topic solely to the applications. Daniel also fears a proliferation of standards, and would like the kernel community to experiment with this new hardware so that we can drive these standards in a sane direction.
Chris Mason has been working on patches to support atomic I/O (presumably including the atomic writes that Ric called out), and is also working on kernel interfaces that would atomically create files pre-filled with data, or atomically execute multiple writes to multiple files.
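
[ Ed.: Neither interface exists yet; the closest userspace approximation today is the familiar write-to-a-temporary-file-then-rename() pattern, which makes one file's new contents appear atomically but cannot tie updates to several files together. A minimal sketch, for illustration only: ]

    /*
     * Write the new contents to a temporary file, fsync() it, then rename()
     * it over the target.  The rename is atomic for this one file, but there
     * is no way to make updates to several files appear atomically together.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int atomic_replace(const char *path, const void *buf, size_t len)
    {
        char tmp[4096];
        int fd;

        snprintf(tmp, sizeof(tmp), "%s.XXXXXX", path);
        fd = mkstemp(tmp);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);
        if (rename(tmp, path) != 0) {           /* atomic swap of this one file */
            unlink(tmp);
            return -1;
        }
        /* Full durability also requires an fsync() of the containing directory. */
        return 0;
    }

    int main(void)
    {
        const char msg[] = "new contents\n";
        return atomic_replace("example.dat", msg, sizeof(msg) - 1) ? 1 : 0;
    }
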
Dave Jones seemed excited by this prospect from the viewpoint of more bugs for Trinity to find, but also asked after Ingo Molnar's and Zach Brown's syslet work.

Miklos Szeredi suggested use of the at*() variants of the syscalls, generalizing the “dirfd” argument to carry the transaction.
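
[ Ed.: Purely by way of illustration, a hypothetical sketch of Miklos's suggestion. The calls transaction_begin() and transaction_commit() are invented here; they stand in for whatever primitive would actually create and commit the transaction handle passed in the dirfd slot: ]

    /*
     * HYPOTHETICAL sketch only: transaction_begin() and transaction_commit()
     * do not exist; they stand in for whatever primitive would create and
     * commit a transaction handle passed via the dirfd slot of the at*()
     * syscalls.
     */
    #include <fcntl.h>
    #include <unistd.h>

    int update_two_files_atomically(const void *a, size_t alen,
                                    const void *b, size_t blen)
    {
        int txn = transaction_begin();          /* hypothetical */
        int fd1 = openat(txn, "journal", O_WRONLY | O_CREAT, 0644);
        int fd2 = openat(txn, "index",   O_WRONLY | O_CREAT, 0644);

        write(fd1, a, alen);                    /* neither write becomes visible... */
        write(fd2, b, blen);                    /* ...until the commit below */
        close(fd1);
        close(fd2);

        return transaction_commit(txn);         /* both files change, or neither */
    }
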
Matthew Wilcox argued that the advent of low-latency storage devices means that the storage stack must pay more attention to NUMA issues.

Tony Luck expressed interest, but noted that there were some tradeoffs. Do you move to a CPU close to the device, giving up memory and cache locality (and taking a migration-induced hit to performance and latency) in favor of storage locality? How would the application decide whether or not the improved storage latency was worth the cost? Tony also noted that this decision depends on platform- and generation-specific information, and was suspicious of rules of thumb such as “use this if you are going to access >32MB of data.” [ Ed.: Suppose you need several storage devices, each of which happens to be associated with a different NUMA node? ]

Matthew agreed that migrating from one socket to another mid-stream was usually unwise, but noted that some applications can determine where the storage is before building up local state, giving database queries and git compression/checksumming as examples. Longer term, Matthew would like to get the scheduler involved in this decision process.
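
[ Ed.: For PCIe-attached devices, an application can already approximate this by reading the device's NUMA node from sysfs and running itself there with libnuma. A rough sketch, with the caveat that the exact sysfs path varies by transport and kernel version: ]

    /*
     * Rough sketch: look up a block device's NUMA node via sysfs and move the
     * calling thread onto that node with libnuma (link with -lnuma).  The
     * sysfs path shown is typical for PCIe-attached devices such as NVMe, but
     * the exact location varies with transport and kernel version;
     * /sys/bus/pci/devices/<addr>/numa_node is another common place to look.
     */
    #include <numa.h>
    #include <stdio.h>

    static int run_near_device(const char *blkdev)
    {
        char path[256];
        FILE *f;
        int node = -1;

        if (numa_available() < 0)
            return -1;                          /* no NUMA support on this box */

        snprintf(path, sizeof(path), "/sys/block/%s/device/numa_node", blkdev);
        f = fopen(path, "r");
        if (!f)
            return -1;
        if (fscanf(f, "%d", &node) != 1)
            node = -1;
        fclose(f);
        if (node < 0)
            return -1;                          /* device reports no affinity */

        /* Bind execution (not memory) of this thread to the device's node. */
        return numa_run_on_node(node);
    }

    int main(void)
    {
        return run_near_device("nvme0n1") ? 1 : 0;      /* example device name */
    }
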
Mel Gorman expressed support for this longer-term plan, but would prefer to see it happen sooner rather than later. Mel also wants to see hints passed back up to the application, and raised the issue of NUMA awareness of accesses to mmap()ed pages.
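
[ Ed.: No such hint mechanism exists yet, but an application that already knows the right node can express placement of its mappings explicitly with the existing mbind() call. A minimal sketch using an anonymous mapping (file-backed mappings bring additional page-cache caveats): ]

    /*
     * Minimal sketch using existing interfaces: bind the pages of an anonymous
     * mapping to a single NUMA node with mbind() before they are faulted in.
     * Assumes the caller already knows which node it wants; file-backed
     * mappings bring additional page-cache caveats.  Link with -lnuma.
     */
    #include <numaif.h>
    #include <sys/mman.h>
    #include <stddef.h>

    static void *alloc_on_node(size_t len, int node)
    {
        unsigned long nodemask = 1UL << node;   /* handles nodes 0..63 only */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return NULL;
        if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
            munmap(p, len);
            return NULL;
        }
        return p;                               /* pages land on the chosen node */
    }

    int main(void)
    {
        return alloc_on_node(1 << 20, 0) ? 0 : 1;       /* 1MB on node 0 */
    }
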
Christoph Lameter called out the potential performance benefits of binding acquisitions of a given lock to a given CPU, binding threads to storage devices, and storage controllers that write directly into CPU cache (thus saving the presumed destination CPU the overhead of a cache miss, as in Intel DDIO and PCIe TLP Processing Hints). Christoph also favors RDMA capabilities for storage, getting the kernel out of the storage data path in a manner similar to InfiniBand.

Some discussion of the merits and limitations ensued. Chris Mason liked the idea of binding processes close to specific devices, including NUMA-aware swap-device selection. Matthew Wilcox called out multipath I/O and RAID configurations as possible confounding factors. [ Ed.: Back in the day when SCSI disks over FibreChannel were considered “low latency,” the dinosaurs then roaming the earth simply ensured that all NUMA nodes had a FibreChannel path to each device, which eliminated disk-device affinity from consideration. Alas, this is not practical for today's high-speed solid-state storage devices, and FibreChannel can no longer be considered particularly low latency by comparison. ]

Chris Mason suggested an API that takes a file descriptor and returns one node mask of CPUs for reads and another for writes, but noted that this would not necessarily be a small change. Christoph Lameter asked that processes be guaranteed to wake up on the socket that has the relevant device attached, similar to what can be done with the networking stack. Matthew expressed some dissatisfaction with Christoph's suggestions regarding RDMA and wakeup locality, which indicates that a face-to-face discussion between the two of them would be entertaining, whatever the potential for enlightenment might be. Ben Hutchings noted that networking chooses where to place incoming packets based on where the corresponding user thread has been running.
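
[ Ed.: No such interface exists; the following is purely a hypothetical illustration of the shape of Chris's suggestion, with the invented ioctl FIOC_GET_IO_CPUS and struct io_cpu_masks standing in for whatever mechanism might actually be chosen: ]

    /*
     * HYPOTHETICAL sketch only: FIOC_GET_IO_CPUS and struct io_cpu_masks are
     * invented here to illustrate the shape of the suggested interface; no
     * such kernel API exists.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/ioctl.h>
    #include <linux/ioctl.h>

    struct io_cpu_masks {
        cpu_set_t read_cpus;                    /* CPUs well placed for reads */
        cpu_set_t write_cpus;                   /* CPUs well placed for writes */
    };

    #define FIOC_GET_IO_CPUS _IOR('f', 0x7f, struct io_cpu_masks)  /* invented */

    static int bind_for_reads(int fd)
    {
        struct io_cpu_masks masks;

        if (ioctl(fd, FIOC_GET_IO_CPUS, &masks))
            return -1;
        /* Run this thread where the kernel says the device backing "fd" is
         * close, at least for read traffic. */
        return sched_setaffinity(0, sizeof(masks.read_cpus), &masks.read_cpus);
    }
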