
Keio standup notes

Week 1: What is Unix?

Unix is an idea.

Why did it win?

What is an operating system?

What is a computer?
But what is an operating system?
  • An OS kernel provides a coordinating layer between the hardware and programs running on that hardware.
  • The kernel runs in ring 0, kernel mode or kernel space, and has full access to everything.
  • User programs run in ring 3, user space.
  • Usually we say that the standard tools provided to run in userspace are the userspace tools.
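One way to see the user-space/kernel boundary in action is strace, which logs the system calls a user-space program asks the kernel to perform (Linux-specific; ls here is just an arbitrary program):

# Every time ls needs the kernel (to open a directory, read it, write output), it makes a system call.
strace -c ls /        # -c prints a summary count of system calls instead of a full trace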

What was computing like in the early 1960s?

Expensive!

  • "Hail to the IBM"
  • Desk Set
  • The first microprocessor wasn't until about 1970s, so the processors in these systems were HUGE
  • Multics was developed starting 1964 on 36-bit GE mainframes, development continued until the 1980s.
  • "overdesigned and overbuilt and over everything. It was close to unusable. They [Massachusetts Institute of Technology] still claim it's a monstrous success, but it just clearly wasn't". -- Ken Thompson
  • "Multics couldn't even accommodate three users efficiently," writes Unix historian Peter Salus in his engaging book A Quarter Century of Unix. "Computer research at BTL was in the doldrums." In https://arstechnica.com/tech-policy/2011/07/should-we-thank-for-feds-for-the-success-of-unix/
  • "Program development generally occurred out-of-hours," a programmer subsequently explained to Salus. "The terminals on the development machine were in a common room and when several people were at work, one would call out 'dangerous program!' before executing a new a.out file (the default output file from the linking editor). This allowed others to save their editor-files quickly (and often)." - Same Ars Technica article
  • Originally MIT + GE + Bell Labs

It's always about video games

  • Ken Thompson at Bell Labs had written a game called Space Travel for Multics.
  • Gameplay video
  • Then Bell Labs internal accountants started tracking computer use time, and each gaming session cost $50-$75... not to mention that Bell was leaving the Multics project.
  • So he ported it to an unused, underpowered PDP-7 in another department... it needed an operating system, so he started building something minimalistic in assembly language and called it Unics...

Enter the PDP-11

The DEC PDP-11 was hot stuff! 16 bits, up to 4 MB of RAM by the mid-1970s, and a starting price of only about $10,000 (roughly ¥12 million today).

More than 20 operating systems eventually ran on it!

Folks at Bell Labs wanted one, and by promising another department that they would build a word processor for them, were able to get the budget to do so.

It needs an OS

Ken Thompson and Dennis Ritchie are considered the two people most closely associated with Unix's design.

  • https://en.wikipedia.org/wiki/History_of_Unix

What we wanted to preserve was not just a good environment in which to do programming, but a system around which a fellowship could form. We knew from experience that the essence of communal computing, as supplied by remote-access, time-shared machines, is not just to type programs into a terminal instead of a keypunch, but to encourage close communication.

  • By 1971, many modern commands were already present: https://en.wikipedia.org/wiki/Research_Unix#Versions

We need a programming language to build an OS

Enter C, by Dennis Ritchie.

C was an improvement on B, which Ken Thompson also worked on. B was a simplification of BCPL. There was no A.

By 1973, Unix had been rewritten in C, making it one of the first practical operating systems written in a high-level language rather than assembly.

  • The K&R C book, by Brian Kernighan and Dennis Ritchie

Fork it

Many of the early Unix systems are usable online here. ssh unix50@unix50.org, password unix50

  • AT&T logo or Death Star?
  • After a 1956 consent decree with the United States government, the Bell System was barred from selling products outside the telephone business and was required to license all of its patents to anyone who wanted them for a reasonable fee.
  • Ken Thompson started giving away copies of Unix and Unix source code for the cost of media (e.g. tape reels) and shipping.
  • Within a few years, AT&T was licensing Unix itself, source code included, to schools for a nominal fee (around $200).
  • This is arguably the accident that allowed Unix to dominate.

  • The map of Unix derivatives

  • Tim Berners-Lee's NeXTcube

POSIX and SUS

Proto-open-source: BSD and the GNU Project

We'll cover this next time.

What comes after Unix?

Many of the same people worked on Plan 9, which remained a niche operating system.

There are many hobby operating system projects, but for general-purpose computing, Unix is the dominant paradigm, with no signs of change.

By the way

You can follow Rob Pike, one of the early Bell Labs Unix pioneers and Golang co-creator alongside Ken Thompson, on Mastodon.

Long ago, I was an angry young man who railed against all the injustices of the world, but now time has passed, I have matured and learned so much more about things, and now I am an angrier old man.

What makes Unix Unix? (part 1)

Decoupling

Consider Windows. Can you run Windows without windows? In other words, can Windows exist without a GUI?

What shell do you use? zsh or bash? Something else?

What window manager do you use? MacOS Finder? Wayland? X Windows? Some combination of these?

Who wrote your printer manager? Does the fact that Apple wrote it prevent it from working on Linux?

What standard libraries and command-line tools do you use? Are there other implementations of them with similar APIs?

Week 2: What are Unix's building blocks?

A little more history

  • Unix was initially created by Ken Thompson and Dennis Ritchie (classic photo of the pair) with several others at Bell Labs. You can still follow Rob Pike on Mastodon.
  • C was closely related to Unix from the beginning. (By the way, where did the name "C" come from?)
  • Thanks to a lot of ambiguity over intellectual property rights, and the ability for universities to obtain the source code for cheap, the late-1970s versions of Unix became the basis of many forked versions, including BSD, OpenBSD, NetBSD, FreeBSD, NeXTStep, MacOS, Irix, Xenix, AIX, etc... but not Linux. Linux came from a different path.
The GNU Project
  • By 1983, Richard Stallman (rms), a programmer at the MIT Artificial Intelligence Lab, was bothered by the increasingly closed nature of software. This led him to found the GNU Project and the Free Software Foundation NPO by 1985. This predated the idea of "open source" by at least ten years.

    • Stallman had never used Unix! For many years, he had used the PDP-10-based ITS "Incompatible Timesharing System" and had written the first version of emacs for ITS.
    • rms's recollection of events
  • rms is a problematic person -- REALLY problematic -- and we should talk about this! I have withheld donations from the FSF for years now because he is still on the board. He has consistently defended indefensible conduct and has himself behaved in ways that were at best inappropriate.

    • The Stallman Report was just published on 2024-10-14 and summarizes many years of issues.
    • Even so, we can't talk about computing today without talking about rms.
  • The GNU Project's goal was to build a free, Unix-like operating system. rms started with a new version of emacs and a C compiler (gcc), then started adding things like core Unix command-line utilities.

    • The goal was to create an OS called GNU Hurd. It has been in development since at least 1990 and is still not usable for everyday work in 2024, largely because of persistent problems with its Mach-based microkernel design.
    • Incidentally, MacOS also based their hybrid kernel on Mach and managed to make it work.
Enter Linux
  • Linus Torvalds first started developing the Linux kernel in 1991, and in 1992 licensed it under the GNU GPL. People quickly started pairing it with the GNU Project tools to deliver a full operating system.
  • A few quotes from the beginning

  • Sigh... Linus is also a problematic person, although not in the same way as rms. He has a reputation for sending extremely abusive emails to other Linux kernel developers on the kernel mailing list, though he has made repeated attempts to improve in recent years. This became such a big deal that the New Yorker covered it in 2018.

  • The combination of the Linux kernel with the GNU userspace tools essentially created the world we live in today. Had this not happened, we would probably all be using OpenBSD and discussing Theo de Raadt's communication style instead.

  • This is one of the open-source/free software world's enduring memes.
Rediscovering accidental and intentional patterns

After Linux's success and the rise of personal computers, many more programmers were exposed to Unix ideas. There's a lot to cover in esr's book The Art of Unix Programming.

Less history, more hands-on

Let's get our terminals running. More photos

Make yourself a user account
sudo /sbin/adduser USERNAME
sudo /sbin/usermod -a -G cdrom,floppy,sudo,audio,dip,video,plugdev,users,netdev USERNAME

And then exit by typing exit or pressing Ctrl+D, sending the end of transmission character.

About users

How many user accounts are on this system? Where can I look to find this out? What are these other users?

What does this file mean, anyway? Where are the passwords?

What is a group?
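A hedged sketch of where to look (the paths are standard on Linux; the exact account list will differ per system):

less /etc/passwd                 # one line per account: name, UID, GID, home directory, login shell
getent passwd | wc -l            # how many accounts the system knows about
sudo less /etc/shadow            # the (hashed) passwords live here, readable only by root
less /etc/group                  # group names, GIDs, and their members
id                               # your UID, primary group, and supplementary groups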

Focus on the filesystem

The Unix filesystem isn't just a place to store files. It's a hierarchical namespace containing several different APIs and a lot of file-like objects.

  • A few weeks ago, we talked about accessing serial ports through a file: /dev/ttyUSB0
  • What happens if you cat /dev/nvme0 ? (Note: don't do this).
  • /proc contains information about running processes and the system. Check less /proc/cpuinfo, for instance.
  • /sys is used on Linux to understand and control the kernel and the devices it manages. Here's an example of how to control a GPIO device by writing to files in /sys.
    • You can change your screen's brightness by writing to the backlight file: echo 5 > /sys/class/backlight/acpi_video0/brightness

These APIs are naturally secured by the same permissions system that protects your files. And how are those protected?

Chmod calculator
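You don't really need a calculator once you can read the three octal digits (owner, group, other); a small hedged example using a throwaway file:

touch /tmp/perm-demo
ls -l /tmp/perm-demo             # something like -rw-r--r--: owner can read/write, group and others can read
chmod 640 /tmp/perm-demo         # 6 = rw- for the owner, 4 = r-- for the group, 0 = --- for everyone else
chmod u+x,g-r /tmp/perm-demo     # the symbolic form does the same job
stat -c '%A %a %U:%G %n' /tmp/perm-demo   # GNU stat shows the symbolic and octal forms side by side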

What's a mount? What's mounted?
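A couple of hedged commands for poking at mounts on a Linux box:

findmnt                          # the mount tree: which device or filesystem is attached at which directory
df -h                            # mounted filesystems and how full they are
mount | grep ' / '               # the classic way to see what's mounted at the root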

What's up with dotfiles? Rob Pike explains that they were a mistake.

What's a hard link?
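A minimal sketch: two names, one inode, so the data survives until the last name is removed:

echo 'hello' > /tmp/original
ln /tmp/original /tmp/hardlink       # no -s: this is a hard link, not a symlink
ls -li /tmp/original /tmp/hardlink   # same inode number, link count of 2
rm /tmp/original
cat /tmp/hardlink                    # the data is still there; only the other name is gone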

Types of files:

  • regular (could be text or binary, but the OS doesn't care)
  • directory
  • symbolic link
  • FIFO special (pipe)
  • block special (hard drives, etc. that need random access)
  • character special (like serial ports, that send and receive streams of data)
  • socket (Unix sockets, not network sockets, but they work similarly)
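The first character of every ls -l line tells you which of these you're looking at (the exact paths below are examples and vary by machine):

# '-' regular, 'd' directory, 'l' symlink, 'p' FIFO, 'b' block device, 'c' character device, 's' socket
ls -ld /etc/passwd /etc /etc/mtab /dev/tty
mkfifo /tmp/demo-fifo && ls -l /tmp/demo-fifo   # make a FIFO ("named pipe") of your own: type 'p'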

There are some interesting consequences of these abstractions:

# Output your microphone to a remote computer's speaker. This only works on old Linux systems and maybe BSD.
# These days, pipewire would probably be a better way of doing this.
dd if=/dev/dsp | ssh username@host dd of=/dev/dsp

Week 3: Virtual machines

By the way, xkcd has some thoughts on the Unix permissions model.

Are we developers yet?

Digression: installing Debian

BIOS/UEFI, bootloader, disk partitions, swap partition or file, package management and trust, port forwarding.

Week 4: Processing

Processes

With a little knowledge of users and files, we can get a rough idea of what a process is.

Wikipedia summarizes what makes up a process:

  • An image of the executable machine code associated with a program.
  • Memory (typically some region of virtual memory); which includes the executable code, process-specific data (input and output), a call stack (to keep track of active subroutines and/or other events), and a heap to hold intermediate computation data generated during run time.
  • Operating system descriptors of resources that are allocated to the process, such as file descriptors (Unix terminology) or handles (Windows), and data sources and sinks.
  • Security attributes, such as the process owner and the process's set of permissions (allowable operations). On Linux these come from user and group IDs plus capabilities; namespaces and cgroups add further isolation on top (that's what containers are built from).
  • Processor state (context), such as the content of registers and physical memory addressing. The state is typically stored in computer registers when the process is executing, and in memory otherwise.

If you want to learn more about things like virtual memory, take an operating systems class. And maybe read Can you get root with a cigarette lighter?

On Linux, every process can be identified by a PID, a process ID.

Every process has a parent, except for process zero, the "idle process." It spawns process 1, the init process. Almost all other processes are children of init. On modern Linux, init is usually systemd, but this is a VERY modern and Linux-specific development.

Parents can create child processes, and are generally responsible for managing the lifecycle of their children. Children are usually created by forking, which copies the parent process and its metadata; an exec call then replaces the copy's program with the new one. Parents are supposed to wait for their children to exit, and (usually) terminate them when the parent exits. See also fork bombs.

See also: orphan processes and zombie processes. Orphaning a process is not always a bad thing: this is how services/daemons are started.

Processes receive information on startup from several places: the arguments provided (argv), the set of environment variables provided to them, and their standard input. They might also be configured to read a file on startup.
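A hedged sketch (the script path and variable name are made up for the demo) showing that the three channels really are separate:

cat > /tmp/show-inputs.sh <<'EOF'
#!/bin/sh
echo "arguments:   $@"
echo "environment: GREETING=$GREETING"
echo "stdin:       $(cat)"
EOF
chmod +x /tmp/show-inputs.sh
echo 'hello from a pipe' | GREETING=hi /tmp/show-inputs.sh one two three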

You can send signals to processes to ask them to do things or shut down. One way of doing this is the misnamed kill command.

For instance, sending SIGUSR1 (kill -s SIGUSR1 <pid>) to a running tarsnap process will make it report what it's up to.
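The mechanics, hedged (the PID is whatever pgrep reports on your machine):

pgrep tarsnap                    # find the process ID
kill -s SIGUSR1 <PID>            # ask tarsnap to report its progress
kill -s SIGTERM <PID>            # ask a process to shut down cleanly (kill's default signal)
kill -s SIGKILL <PID>            # the kernel ends the process immediately; it gets no chance to clean up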

init and systemd

  • There has to be a process 0 and process 1. Process 0 is used by the kernel during startup. Process 1 is the init process, the parent of all other non-kernel processes.
  • Historically there was SysV init, introduced in UNIX System V. Most Linux distributions used it up until about 2015. The BSDs tend to use a somewhat similar system ("BSD init").
  • Now most Linux distributions use systemd. The transition was long and often loudly protested, but systemd is probably an improvement. How systemd looks to longtime Unix users. systemd architect Lennart Poettering is a lightning rod for controversy.
  • systemd is organized around the concept of "units."
  • Common systemd commands you might use include systemctl, journalctl, timedatectl, resolvectl, and the three power management commands (poweroff, reboot, and halt); a few examples follow this list.
  • MacOS uses launchd, controlled by launchctl
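A few hedged examples of day-to-day systemd usage on Linux (unit names like ssh are placeholders; yours may differ):

systemctl status ssh                             # one unit's state and its most recent log lines
systemctl list-units --type=service --state=running
journalctl -u ssh --since today                  # logs for one unit
journalctl -b                                    # everything since the last boot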

Week 5: Shell stuff

Bash and friends

  • Bash and zsh are the most popular modern derivatives of the late-1970s Bourne shell (bash stands for "Bourne-again shell")
  • Interactive shells are a REPL for the shell programming language
    • Many interpreted languages have REPLs, for instance the Python or Ruby shells
  • It's also useful to think of a shell as a process-control language. In most programming languages, it's hard to start a new process, and most work is done inside one process. Shell languages are the opposite.
  • What interprets the * in ls *? (See the globbing example after this list.)
  • Some helpful resources:
  • less is used everywhere, so it's good to be familiar with it.
  • sed, the stream editor.
    • find . -type f -name providers.tf -exec sed -i 's/required_version = "1.8.0"/required_version = "1.8.5"/g' {} \;
  • Use ack or ripgrep or similar tools instead of grep to search through directories of source code.
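Back to the question above about ls *: the shell, not ls, expands the glob. A hedged demo in a throwaway directory:

mkdir /tmp/glob-demo && cd /tmp/glob-demo && touch a.txt b.txt notes.md
echo *                     # the shell expands * to: a.txt b.txt notes.md
ls *.txt                   # ls receives two already-expanded arguments, a.txt and b.txt
ls '*.txt'                 # quoted: ls gets the literal string *.txt and reports "No such file or directory"
set -x; ls *.txt; set +x   # xtrace shows the command line ls actually received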

Week 6: Making a hash of things

A disclaimer

  • Now that we're talking about cryptography, it's important to remember: Don't roll your own crypto!
  • "Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can't break." - Bruce Schneier

What's a hash?

Hash browns, corned beef hash, salmon hash, "to make a hash of things..."

Similarly, a hash function chops up and mixes its input in a way that makes the original unrecognizable.

Unlike the food, though, hash functions usually take an input of arbitrary length and shrink it down (or size it up) to a fixed length.

Simple example: credit card numbers and the Luhn algorithm

This is often called a checksum rather than a hash, but they are conceptually very similar.

All inputs resolve to only ten possible check values, so collisions are frequent.
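A hedged sketch of the Luhn check in bash (the card number is the standard 4111… test number, not a real card):

luhn_check() {
  local num=$1 sum=0 alt=0 i d
  for (( i=${#num}-1; i>=0; i-- )); do   # walk the digits right to left
    d=${num:i:1}
    if (( alt )); then                   # double every second digit...
      (( d *= 2 ))
      (( d > 9 )) && (( d -= 9 ))        # ...and subtract 9 if that overflows a single digit
    fi
    (( sum += d ))
    alt=$(( !alt ))
  done
  (( sum % 10 == 0 ))                    # valid if the total is a multiple of 10
}
luhn_check 4111111111111111 && echo valid || echo invalid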

How many unique files are there in the universe?

Technically, infinite. But let's think about something more concrete.

Project Gutenberg's text-only copy of War and Peace is 3.3 megabytes, or 3,359,652 bytes, or 26,877,216 bits. How many other files of this exact length exist?

About 2^26,877,216. Wolfram Alpha gives a decimal approximation of about 1.64 * 10^8,090,848. Wikipedia tells us there are only about 10^80 atoms in the universe.

And that's only for files the exact length as War and Peace.

On the other hand, hashes are commonly 128, 256, or 512 bits in length. But even 2^128 is in the range of 10^38; 2^512 is around 10^154. With a truly random hash function and a realistic number of inputs, the odds of a hash collision are fairly low.

(Even outside of hashing, 128-bit UUIDs are commonly used as universally unique identifiers.)

Hash tables (or hash maps) (or dictionaries) (or associative arrays) (or key-value stores)

There are lots of different names for this very common data structure. There are dictionaries, maps, and associative arrays that don't use hashes, but hash-based implementations give average-case O(1) lookups and insertions.

Python uses them to implement dict, Java's most frequently used Map implementation is HashMap, etc.

Wikipedia

These usually use very fast hash algorithms where collisions are not as much of a problem.
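Bash itself ships a hash table as associative arrays (bash 4+, so the stock macOS bash 3.2 won't run this); a minimal sketch:

declare -A capital               # declare an associative array (a hash table)
capital[Japan]=Tokyo
capital[France]=Paris
echo "${capital[Japan]}"         # constant-time (on average) lookup by key
for k in "${!capital[@]}"; do echo "$k -> ${capital[$k]}"; done   # iterate keys; order is not guaranteed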

File checksumming

Historically, files often came with a CRC checksum value you could use to confirm the file's integrity, but more recently cryptographic hash functions such as SHA2 or SHA3 have been common.

Depending on your platform, you might have sha256sum, sha512sum, etc. available. These come with a widely-used file format to let you easily check a file against a known checksum. You might also have shasum on MacOS. On pretty much any platform openssl dgst is available.
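A hedged example of the usual workflow (the filenames are placeholders):

sha256sum debian.iso > debian.iso.sha256    # record the digest in the standard "<hash>  <filename>" format
sha256sum -c debian.iso.sha256              # later, or on another machine: recompute and compare
openssl dgst -sha256 debian.iso             # the same digest via openssl, for platforms without sha256sum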

This really does help detect changes! I have an example from testbed development earlier this year. And the codecov incident is a great example of how a shasum exposed a security problem.

Merkle trees and Git

Wikipedia

Pro Git book on Git objects
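A hedged sketch of the Merkle structure inside any Git repository (paths and messages are examples; assumes git user.name and user.email are configured):

git init /tmp/merkle-demo && cd /tmp/merkle-demo
echo 'hello' > greeting.txt
git add greeting.txt && git commit -m 'first commit'
git cat-file -p HEAD             # the commit object: author, message, and the hash of a tree
git cat-file -p HEAD^{tree}      # the tree object: hashes of the blobs (and subtrees) it contains
git hash-object greeting.txt     # a blob's ID is just the SHA-1 of its content plus a short header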

Cryptographic hash functions

Wikipedia

Output should appear to be random, with a uniform distribution over the full range of possible hash values.
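You can see the "appears random" property from the shell: flip one letter of the input and every part of the digest changes (the avalanche effect).

echo -n 'hello' | sha256sum
echo -n 'hellp' | sha256sum      # one character different, completely unrelated-looking digest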

It should be hard to find collisions: if you know one input resolves to a particular hash, it should be very difficult to find another input that also resolves to the same hash. It is now feasible to generate collisions for the MD5 algorithm; this was exploited to generate a fake Microsoft signing certificate in the early 2010s as part of the Flame malware!

These days, SHA-2 is the workhorse and SHA-3 is the newest standard. SHA-1 is considered broken for collision resistance and in need of replacement, and MD5 should not be used at all.

But a cryptographic hash function alone is still not enough to store a password safely...

Storing passwords

Hashes allow us to build services that authenticate passwords without actually storing the password.

But rainbow tables illustrate the problem with simply hashing a password before storing it.

  • Specialized hashing algorithms like bcrypt, scrypt

    • These are intentionally designed to be computationally expensive.
    • All of them also add a per-password salt, which defeats precomputed rainbow tables (see the sketch after this list).
  • Have I Been Pwned tends to get copies of password database dumps. You can get notifications if your email address is found in the dump. They even have a way to search for pwned passwords without exposing the password.
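A hedged way to see salting from the command line with a recent OpenSSL (this produces crypt-style SHA-512 hashes, not bcrypt or scrypt, but the salt behaves the same way):

openssl passwd -6 'correct horse battery staple'   # output looks like $6$<random salt>$<hash>
openssl passwd -6 'correct horse battery staple'   # same password, new random salt, totally different hash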

What makes for a good password?

Depends on your threat model.

For an encrypted hard drive, where an attacker can make as many copies as they want and use as much computational power as they want, the easy-to-understand estimate in the cryptsetup FAQ says about 18 truly random characters.
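The back-of-the-envelope math, assuming 18 characters drawn uniformly from the ~95 printable ASCII characters:

echo 'l(95^18)/l(2)' | bc -l     # ≈ 118 bits of entropy; l() is the natural log in bc -l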

Major websites should ideally use rate limiting to slow down attempts to break passwords. Even without rate limiting, attempts will be much slower, limited by network and server capacity.

No matter how good your password is, password spraying is still very much a threat if you re-use your password in more than one place.

Other applications

  • I once needed to look for identical rows in a database, so I added a new database column with a hash of the other columns' values concatenated together. Most likely creating a massive search index over all columns would work similarly.
  • Companies often use an HMAC with a fixed secret key to hide sensitive personal information while still enabling searching (sketched after this list).
    • For example, hiding users' birthdays by sending the birthday into a hash function, mixed with a salt that is kept secret, and then storing the hashed output.
    • But there are very few birthdays, so if you know one user's birthday, you can identify all other users with the same birthday...
  • There are schemes to prevent sending passwords in plain text over a network (that are very rarely used in practice).
  • Distributed blockchains use hashing for several purposes, but since distributed blockchains are generally not the best solution to any problem they don't really belong in this section.
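A hedged sketch of the birthday example with openssl (the key here is a placeholder; in reality it would live in a secrets manager, not in a script):

echo -n '1990-01-15' | openssl dgst -sha256 -hmac 'server-side-secret-key'
# Equal birthdays still produce equal outputs, which is exactly the weakness described above.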

Encryption

The CIA triad

Confidentiality, Integrity, Availability

Symmetric encryption

You have a secret key. Everyone with the key can encrypt messages using that key. Everyone with the key can decrypt messages using that key.

You will probably almost exclusively use block ciphers, and almost always AES. Blowfish sometimes makes an appearance. AES-128 means the key is 128 bits long; 192 and 256 are also supported.

Just as important is the way that you chain these blocks together; the block cipher mode of operation.
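A hedged example with openssl (1.1.1 or newer for -pbkdf2; the filenames are placeholders) where the cipher and the mode are chosen explicitly; gpg below makes similar choices for you behind the scenes:

openssl enc -aes-256-cbc -pbkdf2 -in message.txt -out message.enc   # prompts for a passphrase, derives the key with PBKDF2
openssl enc -d -aes-256-cbc -pbkdf2 -in message.enc                 # the same passphrase decrypts it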

gpg --symmetric --armor <<EOF
<some message>
EOF

Asymmetric (public-key) encryption

Public-key encryption is magic. Friendship is also magic, but sometimes you need a good way to protect your secrets.

PGP and GPG

GPG ("GnuPG", the GNU Privacy Guard) is a free implementation of the OpenPGP standard.

The thing is, PGP is not good and needs replacement, as Matthew Green pointed out a decade ago.

gpg --full-generate-key
gpg --list-secret-keys
gpg --armor --export <key id>

Share this with everyone. Then import others' keys, for instance:

gpg --import <file containing key text>

And sign the key, indicating that you trust it:

gpg --sign-key <key id>

Now, you can send messages to other people:

gpg --encrypt --sign --armor --recipient <key id> <file containing message text>
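On the other end, the recipient decrypts with their private key and verifies your signature in one step (the filename is an example):

gpg --decrypt message.asc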

We could go farther and create a web of trust.

TLS and HTTPS

And now, a musical interlude.

TLS is considered a "layer 6" or "presentation layer" protocol, while HTTP is layer 7 (application). That is to say, TLS can be used to secure many different application-layer protocols; when it's used with HTTP, the result is called "HTTPS." The "layers" come from the OSI model, which doesn't exactly match how the Internet actually works.
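You can watch a TLS handshake and inspect the server's certificate with openssl (example.com is a placeholder for any HTTPS host):

openssl s_client -connect example.com:443 -servername example.com </dev/null \
  | openssl x509 -noout -subject -issuer -dates   # who the certificate is for, who signed it, and when it expires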

Software licensing, intellectual property, and copyright

From an American perspective. IANAL (I am not a lawyer).

From 1776, the Declaration of Independence's "life, liberty, and the pursuit of happiness" is well-known, but it was inspired by earlier sources that often said something more like "life, liberty, and property."

The American constitution (1789) specifically gives Congress the power:

To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries;

showing that the idea of intellectual property, not just real property, was already present at the beginning of modern democratic government.

There are, very broadly, four different kinds of intellectual property recognized by law:

  • Trade secrets
  • Trademarks
  • Patents
  • Copyright

How would each of these apply to Coca-Cola, for example?

Intellectual property can be licensed to other people. In other words, the owner of the property agrees to give you a license to do certain things with that property: for example, if you have a Netflix subscription, you have a license to watch shows on Netflix, but you do not have a license to set up a movie theater and sell tickets to watch Netflix.

Copyright and software

Anything creative that you write or produce is automatically copyrighted, giving you exclusive rights for a period of time. After that period of time, works enter the public domain.

What is more important: the ideas or the implementation?

Is writing software more like proving a theorem, or writing a novel?

We use copyright as the primary means of protecting and collaborating on software, but patents also play a significant role.

Open source and Free/libre software

Open Source Initiative

Open source ideas outside of software

Recent challenges

Patents and patent thickets

Patents are intended to protect inventions. It's not necessary that you actually produce a working example of the invention, though, so often it's more like patents protect ideas -- sometimes very obvious ones.

Digital Rights Management

DRM is possibly the third-biggest waste of software engineering time and computing power, after cryptocurrency and generative AI.