Sunday, April 26, 2015

The Lies We Tell Our Code

We tell our code lies from development to deployment, and the most common of them start with the simple act of launching a virtual machine. These lies are critical to our applications: some protect applications from themselves and each other, and some even improve performance. Others, however, hurt performance or create barriers to simply getting things done.

We lie about the systems, networks, storage, RAM, CPU, and other resources our applications use, but how we tell those lies is critical to how the applications that depend on them perform. Joyent's Casey Bisson will explore the lies we tell our code and demonstrate how they sometimes help and sometimes hurt us.

Slides as presented at http://www.meetup.com/Seattle-Scalability-Meetup/events/219709036/



Transcript

  • 1. “the lies we tell our code” @misterbisson
  • 2. Powering modern applications Your favorite code Container optimized infrastructure Your favorite tools
  • 3. Our data center or yours. Joyent Public Cloud: the Joyent Container Service runs our customers’ mission-critical applications on container-native infrastructure. Private Data Center: SmartDataCenter is an on-premises container run-time environment used by some of the world’s most recognizable companies.
  • 4. Our data center or yours …and open source too! Fork me, pull me: https://github.com/joyent/sdc
  • 5. Node.js enterprise support As the corporate steward of Node.js and one of the largest-scale production users, Joyent is uniquely equipped to deliver the highest level of enterprise support for this dynamic runtime. • Best practices guidance • Performance analysis • Core file analysis • Debugging support • Critical incident support
  • 6. Docker container hosting The original container infrastructure company loves the new container packaging standard … • Portability From laptop to any public or private cloud • Productivity Faster code, test, and deploy • Devops for everyone Large community building tools for management, deployment, and scale
  • 7. Proprietary & Confidential Information © 2015 Joyent, Inc ‹#›. The best place to run Docker containers, 
 making Ops simple and scalable. Triton Triton SecurityManagement Networking IntrospectionPerformance Utilization
  • 8. breathe for a moment
  • 9. lying to our code is a practical choice
  • 10. without moral consequence
  • 11. …but not without all consequence
  • 12. most importantly
  • 13. most importantly never lie to yourself
  • 14. The earliest common lie Virtual memory from http://www.webopedia.com/TERM/V/virtual_memory.html
  • 15. Virtual memory according to Poul-Henning Kamp: Take Squid for instance, a 1975 program if I ever saw one: You tell it how much RAM it can use and how much disk it can use. It will then spend inordinate amounts of time keeping track of what HTTP objects are in RAM and which are on disk and it will move them forth and back depending on traffic patterns. Squid’s elaborate memory management…gets into fights with the kernel’s elaborate memory management, and like any civil war, that never gets anything done. from http://web.archive.org/web/20080323141758/http://varnish.projects.linpro.no/wiki/ArchitectNotes
  • 16. Virtual memory according to Poul-Henning Kamp: Varnish knows it is not running on the bare metal but under an operating system that provides a virtual-memory-based abstract machine. For example, Varnish does not ignore the fact that memory is virtual; it actively exploits it. A 300-GB backing store, memory mapped on a machine with no more than 16 GB of RAM, is quite typical. The user paid for 64 bits of address space, and I am not afraid to use it. from http://queue.acm.org/detail.cfm?id=1814327
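Varnish's approach is straightforward to sketch: map the backing store into virtual address space and let the kernel's paging decide what stays in RAM. A minimal C sketch, assuming an existing backing file (the filename and object layout are illustrative, not Varnish's actual storage code):

    /* Map a backing store far larger than RAM and let the kernel's
       virtual memory system page it in and out -- the Varnish approach. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("cache.bin", O_RDWR);          /* illustrative path */
        if (fd == -1) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

        /* The mapping may exceed physical RAM by an order of magnitude;
           we just use the 64-bit address space we paid for. */
        char *store = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        if (store == MAP_FAILED) { perror("mmap"); return 1; }

        memcpy(store, "hot object", 10);             /* write through the mapping */
        munmap(store, st.st_size);
        close(fd);
        return 0;
    }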
  • 17. vm.swappiness = 0
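On Linux, that hint tells the kernel to avoid swapping process pages in favor of dropping file-system cache. A sketch of applying it (whether 0 is the right value depends on the workload):

    # apply immediately
    sysctl -w vm.swappiness=0
    # persist across reboots
    echo 'vm.swappiness = 0' >> /etc/sysctl.conf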
  • 18. The harmless lie Hyperthreading from http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/36016.htm
  • 19. Hyperthreading One physical core appears as two logical processors to the operating system, which can schedule two threads on it at once. It takes advantage of superscalar architecture, in which multiple instructions operate on separate data in parallel. Hyper-threading can be properly utilized only with an OS specifically optimized for it. from http://en.wikipedia.org/wiki/Hyper-threading
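You can see the lie directly on a Linux host (a quick check, not from the original deck):

    # with hyperthreading on, "Thread(s) per core" reads 2 and the kernel
    # reports twice as many logical CPUs as there are physical cores
    lscpu | egrep '^CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket'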
  • 20. Faster, but not double the performance Hyperthreading from https://capacitas.wordpress.com/2013/03/07/hyper-threading-on-vs-off-case-study/
  • 21. The lie that built the cloud Hardware virtual machines from http://virtualizationtutor.com/what-is-hosted-virtualization-and-dedicated-virtualization/
  • 22. HVM: call translation Say a virtual machine guest OS makes the call to flush the TLB (translation look-aside buffer) which is a physical component of a physical CPU. If the guest OS was allowed to clear the entire TLB on a physical processor, that would have negative performance effects for all the other VMs that were also sharing that same physical TLB. [Instead, the hypervisor must translate that call] so that only the section of the TLB that is relevant to that virtual machine is flushed. from http://serverfault.com/a/455554
  • 23. The lie that made VMware huge HVM: type 1 vs. type 2 from https://microkerneldude.wordpress.com/2009/03/23/virtualization-some-get-it-some-dont/
  • 24. Lies upon lies Paravirtualization from http://www.cubrid.org/blog/dev-platform/x86-server-virtualization-technology/
  • 25. HVM vs. clocksource… EC2 User: the kernel time will jump from 0 to thousands of seconds. Kernel dev: for some reason it looks like the vcpu time info misses…without implementation details of the host code it is hard to say anything more. AWS: Ubuntu…uses the underlying hardware as a timesource, rather than sources native to the instance, leading to timestamps that are out of sync with the local instance time. from https://forums.aws.amazon.com/thread.jspa?messageID=560443
  • 26. HVM vs. CPU oversubscription An operating system requires synchronous progress on all its CPUs, and it might malfunction when it detects this requirement is not being met. For example, a watchdog timer might expect a response from its sibling vCPU within the specified time and would crash otherwise. When running these operating systems as a guest, ESXi must therefore maintain synchronous progress on the virtual CPUs. from http://www.vmware.com/files/pdf/techpaper/VMware-vSphere-CPU-Sched-Perf.pdf
  • 27. HVMs vs. network I/O Reality: interrupts are challenging in HVM with oversubscribed CPU. Consider these AWS network tuning recommendations: • Turn off tcp_slow_start_after_idle • Increased netdev_max_backlog from 1000 to 5000 • Maximize window size (rwnd, swnd, and cwnd) from http://www.slideshare.net/AmazonWebServices/your-linux-ami-optimization-and-performance-cpn302-aws-reinvent-2013
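Those recommendations translate roughly to the sysctl settings below (values are illustrative; rwnd and swnd are bounded by the buffer caps, while cwnd is governed by the congestion-control algorithm):

    net.ipv4.tcp_slow_start_after_idle = 0   # keep cwnd after a connection goes idle
    net.core.netdev_max_backlog = 5000       # deeper input queue for interrupt bursts
    net.core.rmem_max = 16777216             # cap for the receive window (rwnd)
    net.core.wmem_max = 16777216             # cap for the send window (swnd)
    net.ipv4.tcp_rmem = 4096 87380 16777216  # min / default / max receive buffer
    net.ipv4.tcp_wmem = 4096 65536 16777216  # min / default / max send buffer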
  • 28. HVMs vs. memory oversubscription [P]age sharing, ballooning, and compression are opportunistic techniques. They do not guarantee memory reclamation from VMs. For example, a VM may not have sharable content, the balloon driver may not be installed, or its memory pages may not yield good compression. Reclamation by swapping is a guaranteed method for reclaiming memory from VMs. from https://labs.vmware.com/vmtj/memory-overcommitment-in-the-esx-server
  • 29. HVM vs. performance Most successful AWS cluster deployments use more EC2 instances than they would the same number of physical nodes to compensate for the performance variability caused by shared, virtualized resources. Plan to have more EC2 instance based nodes than physical server nodes when estimating cluster size with respect to node count. from http://docs.basho.com/riak/latest/ops/tuning/aws/
  • 30. Because lying about software is easier than lying about hardware OS-based virtualization from http://www.slideshare.net/ydn/july-2014-hug-managing-hadoop-cluster-with-apache-ambari
  • 31. OS-based virtualization Simple idea • The kernel is there to manage the relationship with hardware and isolate processes from each other • We’ve depended on secure memory protection, process isolation, privilege management in unix for a long time • Let’s leverage that and expand on it OS virt adds new requirements • Namespace lies (pid, uid, ipc, uts, net, mnt) • Polyinstantiation of resources • Virtualized network interfaces, etc Learn more about Linux, SmartOS
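A minimal C sketch of two of those namespace lies on Linux (requires root or CAP_SYS_ADMIN): a child cloned into fresh PID and UTS namespaces sees itself as PID 1 on a host of its own.

    /* Clone a child into new PID and UTS namespaces: the kernel's
       namespace lies in miniature. Build: cc -o ns ns.c; run as root. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char stack[1024 * 1024];

    static int child(void *arg) {
        (void)arg;
        sethostname("container", 9);                  /* private UTS namespace */
        printf("child sees pid %d\n", (int)getpid()); /* prints 1 */
        return 0;
    }

    int main(void) {
        pid_t pid = clone(child, stack + sizeof(stack),
                          CLONE_NEWPID | CLONE_NEWUTS | SIGCHLD, NULL);
        if (pid == -1) { perror("clone"); return 1; }
        printf("parent sees pid %d\n", (int)pid);     /* the "real" pid */
        waitpid(pid, NULL, 0);
        return 0;
    }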
  • 32. OS-based virtualization • Significantly reduced RAM requirements • Makes microservices possible • Shorter I/O chains • Kernel visibility across all processes • Coscheduled I/O and CPU tasks • Elastic use of memory and CPU across all containers • Allowing explicit resizing of containers (raising RAM, CPU, I/O limits) • Allowing bursting of containers (unused CPU cycles can be claimed by whatever container wants them) • Allowing the kernel to use unused RAM as an FS cache across all containers • Greater tolerance of CPU oversubscription • Significantly higher workload density
  • 33. Go container-native for Earth Day. Rendering by Bruce Irving.
  • 34. OS-based virtualization: Linux Linux kernel support for namespaces is still very new. This note accompanying their introduction has proved prescient: “[T]he changes wrought by this work are subtle and wide ranging. Thus, it may happen that user namespaces have some as-yet unknown security issues that remain to be found and fixed in the future.” from http://lwn.net/Articles/531114/
  • 35. from https://twitter.com/swardley/status/587747997334765568
  • 36. OS-based virtualization: Joyent • Kernel and facilities built for zones from the start • Process encapsulation separates processes, their data and the namespace • Processes cannot escape from zones. • Processes cannot observe other zones. • Processes cannot signal other zones. • Naming (such as user IDs or opening a port on an IP address) does not conflict with other zones • Zone processes have a privilege limit and no process in a zone ever has as much privilege as the global zone • Mature and tested: almost ten years in production at Joyent without incident • Coming up: filesystem and network virtualization contributions to container security
  • 37. Playing charades: two syllables, sounds like… Syscall virtualization. The stack: the internet → native Linux binaries → Linux syscall translation → SmartOS kernel
  • 38. Syscall virtualization • Branded zones provide a set of interposition points in the kernel that are only applied to processes executing in a branded zone. • These points are found in such paths as the syscall path, the process loading path, and the thread creation path. • At each of these points, a brand can choose to supplement or replace the standard behavior. from http://docs.oracle.com/cd/E19044-01/sol.containers/817-1592/gepea/index.html
  • 39. The lie on which our massive media libraries were built Virtual block storage: RAID from http://www.seagate.com/manuals/network-storage/business-storage-nas-os/raid-modes/
  • 40. The lie that puts data in a separate cloud from compute Virtual block storage: SAN from ...wordpress.com/.../private-cloud-principles... and aws.amazon.com/message/680342/
  • 41. SAN vs. app performance Riak's primary bottleneck will be disk and network I/O. [S]tandard EBS will incur too much latency and iowait. Riak's I/O pattern tends to operate on small blobs from many places on the disk, whereas EBS is best at bulk reads and writes. from http://docs.basho.com/riak/latest/ops/tuning/aws/
  • 42. SAN vs. disaster [Some common solutions] force non-obvious single points of failure. [They are] a nice transition away from traditional storage, but at the end of the day it is just a different implementation of the same thing. SAN and Software Defined Storage are all single points of failure when used for virtual machine storage. from https://ops.faithlife.com/?p=6
  • 43. More lies about where your data is. Filesystem virtualization: links. from http://www.cs.ucla.edu/classes/spring13/cs111/scribe/11c/; see also Busybox’s use of links, http://www.busybox.net/FAQ.html#getting_started
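The link lie is easy to see from a shell (paths illustrative; Busybox installs one binary under many hard-linked names in exactly this way):

    ln busybox ls        # hard link: two directory entries, one inode, one copy of the data
    ln -s busybox sh     # symlink: a name that merely points at another name
    ls -li busybox ls sh # busybox and ls report the same inode number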
  • 44. The lie on which Docker containers are built Filesystem virtualization: copy-on-write from https://docs.docker.com/terms/layer/
  • 45. Filesystem virtualization: AUFS ★ Works on top of other filesystems ★ File-based copy-on-write ★ Each layer is just a directory in the host filesystem; no user namespace mapping is applied ★ Original underlying filesystem for Docker containers ★ Read/write performance degrades with number of layers ★ Write performance degrades with file size ★ In practice, dotCloud avoided these performance problems by adding secondary volumes to containers to store data separately from container layers See also http://jpetazzo.github.io/assets/2015-03-03-not-so-deep-dive-into-docker-storage-drivers.html and https://github.com/docker-library/mysql/blob/master/5.6/Dockerfile#L35
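The cited MySQL Dockerfile uses exactly that escape hatch: declaring a volume keeps hot data out of the copy-on-write layer stack. A trimmed sketch (base image illustrative):

    FROM debian:jessie
    # writes here bypass the AUFS layers entirely
    VOLUME /var/lib/mysql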
  • 46. True lies about filesystems and blockstores Filesystem virtualization: ZFS from http://www.slideshare.net/relling/zfs-tutorial-lisa-2011
  • 47. More lies for better performance Filesystem virtualization: ZFS hybrid pools from http://na-abb.marketo.com/rs/nexenta/images/tech_brief_nexenta_performance.pdf and http://agnosticcomputing.com/2014/05/01/labworks-14-7-the-last-word-in-zfs-labworks/
  • 48. Filesystem virtualization: ZFS ★ Native block-based copy on write ★ No performance hit for CoW ★ Default thin provisioned filesystems backed by hybrid pools of real devices ★ Low provisioning cost ★ Native snapshots map to Docker layers ★ Native checksum validation used to detect device errors before the device reports them ★ Convenient, fast, and reliable by default ★ Native support for write-through SSD and big read caches to further improve performance
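Those features map onto a handful of commands (device and dataset names illustrative):

    zpool create tank mirror disk0 disk1 cache ssd0  # hybrid pool: mirrored disks plus SSD read cache
    zpool add tank log ssd1                          # write-through SSD log device for sync writes
    zfs create tank/web                              # thin-provisioned filesystem, near-zero cost
    zfs snapshot tank/web@layer1                     # native snapshot, maps to a Docker layer
    zfs clone tank/web@layer1 tank/web2              # copy-on-write clone with no performance hit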
  • 49. from https://twitter.com/swardley/status/587747997334765568
  • 51. The lie that completes the cloud Network virtualization from https://blogs.oracle.com/sunay/entry/crossbow_virtualized_switching_and_performance (Wayback Machine)
  • 52. Network virtualization: Weave A weave router captures Ethernet packets from its bridge-connected interface in promiscuous mode, using ‘pcap’. This typically excludes traffic between local containers, and between the host and local containers, all of which is routed straight over the bridge by the kernel. Captured packets are forwarded over UDP to weave router peers running on other hosts. On receipt of such a packet, a router injects the packet on its bridge interface using ‘pcap’ and/or forwards the packet to peers. from http://weaveworks.github.io/weave/how-it-works.html
  • 53. The lie that completes the cloud Network virtualization: Crossbow from https://blogs.oracle.com/sunay/entry/crossbow_virtualized_switching_and_performance (Wayback Machine)
  • 54. Network virtualization: Triton SDN • Extends Crossbow to add user-defined networks. • Every user gets a private layer 2 network with a unique IP. • All my containers have working interconnectivity, regardless of what physical hardware they’re on • …but your containers can’t see my containers. • When requested, containers also get a unique, publicly routable IP.
  • 55. An exquisite collection of lies Docker from https://blog.docker.com/2014/12/announcing-docker-machine-swarm-and-compose...
  • 56. Docker: Swarm • Aggregates any number of Docker Remote API endpoints and presents them as a single endpoint • Automatically distributes container workload among available APIs • Works in combination with Docker Compose to deploy and scale applications composed of multiple containers • Offers a direct path from building and testing on our laptops to deploying across a number of hosts • Downside: you pay for VMs, not containers
  • 57. Docker: Triton • Exposes the entire data center as a single Docker Remote API endpoint • Automatically distributes container workload among available APIs • Works in combination with Docker Compose to deploy and scale applications composed of multiple containers (awaiting DOCKER-335 or compose/1317) • Offers a direct path from building and testing on our laptops to deploying across a number of hosts • You pay for containers, not VMs
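A Compose file of that era makes the laptop-to-cloud path concrete (a minimal sketch; service names and ports are illustrative):

    # docker-compose.yml -- the same file targets a laptop, a Swarm, or Triton
    web:
      build: .
      links:
        - redis
      ports:
        - "80:8000"
    redis:
      image: redis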
  • 58. breathe for a moment
  • 59. Lie about all the things • Containerize for better performance and workload density • Don't run containers in VMs, that's sad • Watch out for security issues, including at the filesystem level • Virtualize the network too: give every container its own NICs and IPs • Don't stop lying at the edge of the compute node
  • 60. Missy Elliott’s philosophy • Is it worth it? • Let me work it • I put my thing down, flip it, and reverse it • Get that cash • Ain't no shame • Do your thing • Just make sure you’re ahead of the game
  • 61. Thank you
  • 62. Remember Joyent for… • Proven container security Run containers securely on bare metal in multi-tenant environments • Bare metal container performance Eliminate the hardware hypervisor tax • Simplified container networking Each container has its own IP(s) in a user-defined network (SDN) • Simplified host management Eliminates Docker host proliferation • Hybrid: your data center or ours Private cloud, public cloud, hybrid cloud, and open source