Chinese Diceware

I’ve been trying to come up with strong passwords since forever, and have failed to find a magic alternative to entropy. Recently, I took Diceware for a roll, but wasn’t entirely happy with passwords like “wn rare swung strop situs slept”—wn isn’t even a word, is it? I also tried the Swedish dictionary but wasn’t much happier.

How about Mandarin Chinese, written using pinyin? There are only around 400 pinyin syllables, but thousands of characters with different meanings, so I guessed that for a random sequence of syllables it should often be possible to come up with a somewhat meaningful phrase.

The kHanyuPinlu property from the Unihan database turned out to be an excellent source for the character-to-syllable mapping, so I wrote a script to reverse that mapping. The output is a list of 392 pinyin syllables with example characters in traditional and simplified Chinese.
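Here is a minimal sketch of how such a reversal could look; it is not the script mentioned above. It assumes the Unihan readings file consists of tab-separated lines of the form `U+XXXX`, property name, then one or more `reading(frequency)` entries. The function names and the frequency numbers in the sample are made up for illustration. Tone marks are stripped so that readings differing only in tone collapse into a single syllable.

```python
import re
import unicodedata
from collections import defaultdict

# Combining marks for the four pinyin tones (grave, acute, macron, caron).
# The diaeresis (U+0308) is deliberately kept so "lü" stays distinct from "lu".
TONE_MARKS = {"\u0300", "\u0301", "\u0304", "\u030c"}


def strip_tone(syllable):
    """Return the syllable without its tone mark, e.g. 'yǎn' -> 'yan'."""
    decomposed = unicodedata.normalize("NFD", syllable)
    toneless = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", toneless)


def syllables_to_characters(lines):
    """Map each toneless syllable to the characters that can be read that way."""
    mapping = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3 or fields[1] != "kHanyuPinlu":
            continue  # skip comments, blank lines and other properties
        character = chr(int(fields[0][2:], 16))  # 'U+773C' -> the character itself
        for reading in fields[2].split():
            # Drop the trailing "(frequency)" part, then the tone mark.
            syllable = strip_tone(re.sub(r"\(\d+\)$", "", reading))
            if character not in mapping[syllable]:
                mapping[syllable].append(character)
    return dict(mapping)


# Hypothetical sample lines in the assumed format (frequencies are invented).
sample = [
    "# comment line\n",
    "U+773C\tkHanyuPinlu\tyǎn(1183)\n",
    "U+5F00\tkHanyuPinlu\tkāi(3211)\n",
    "U+7EFF\tkHanyuPinlu\tlǜ(283) lù(11)\n",
]
print(syllables_to_characters(sample))
```

A character with several readings simply shows up under several syllables, which is what you want when hunting for a character to fit a randomly drawn syllable.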

Unfortunately, 392 is not a power of 6, so using real dice to generate the numbers is a bit complicated, albeit possible. Instead I wrote a script that uses as few bytes as possible from /dev/random to roll a die with an arbitrary number of sides.
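The usual way to get an unbiased roll from random bytes is rejection sampling: draw just enough bits to cover the number of sides and try again whenever the value is out of range. The sketch below illustrates that idea only. It reads from `os.urandom` rather than /dev/random, the `roll` function and its `read_bytes` parameter are names I made up, and it reads whole bytes and discards the surplus bits, so it is less frugal than the script described above.

```python
import os


def roll(sides, read_bytes=os.urandom):
    """Return a uniformly distributed integer in the range 1..sides."""
    if sides < 1:
        raise ValueError("a die needs at least one side")
    bits = (sides - 1).bit_length()  # bits needed to represent 0..sides-1
    nbytes = (bits + 7) // 8         # whole bytes needed to hold those bits
    while True:
        value = int.from_bytes(read_bytes(nbytes), "big")
        value >>= nbytes * 8 - bits  # discard the surplus low bits
        if value < sides:            # otherwise reject and draw again
            return value + 1


# Six rolls of a virtual D-392, one per syllable.
print(" ".join(str(roll(392)) for _ in range(6)))
```

For 392 sides this draws 9 bits per attempt and accepts 392 of the 512 possible values, so roughly three attempts in four succeed. Taking the value modulo 392 instead would avoid the retries but make some syllables more likely than others.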

Using my list and my virtual D-392, here are the first 6-syllable pinyin phrases I generated, each with a memorable (?) Chinese phrase and a rough English translation.

  • yan kai bo ren se dai—眼開撥任色帶—eyes open, poke any ribbon
  • zui ku ba ge mei xu—最酷八個沒序—the coolest eight have no order
  • ban zhai dian die keng bao—搬宅殿爹吭抱—moving villa/palace, dad says hold (this)
  • you mu sa kang xu su—有母萨扛需速—(things) carried by mother Bodhisattva need speed

A native speaker would probably be able to come up with better phrases, but I think that I could remember any of these, with 最酷八個沒序 being the easiest. If this is a representative sample, I think the scheme works.

How about the entropy? With 392 syllables, each syllable contributes log2(392) ≈ 8.6 bits, so these 6-syllable phrases have about 51.7 bits of entropy, slightly better than a completely random 8-character alphanumeric password (about 47.6 bits with 62 possible characters). English Diceware has 12.9 bits of entropy per word, so to get as much entropy as with a 6-word English phrase, a 9-syllable Chinese phrase is needed. The average word and syllable lengths are 4.2 and 3.2 characters respectively, so the average phrase lengths (including spaces) would be 30.2 for English and 36.8 for Chinese. (Removing spaces blindly will lose some entropy if the pinyin becomes ambiguous.)
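The arithmetic is easy to redo. This small sketch assumes the standard Diceware list of 6^5 = 7776 words and a 62-character alphabet (a-z, A-Z, 0-9) for the alphanumeric password; the variable and function names are only illustrative.

```python
import math

SYLLABLES = 392          # size of the pinyin list
DICEWARE_WORDS = 6 ** 5  # 7776 words, five dice rolls per word
ALPHANUMERICS = 62       # assumed alphabet: a-z, A-Z, 0-9

bits_per_syllable = math.log2(SYLLABLES)
bits_per_word = math.log2(DICEWARE_WORDS)

print(f"per syllable:      {bits_per_syllable:.2f} bits")
print(f"6 syllables:       {6 * bits_per_syllable:.1f} bits")
print(f"8 alphanumerics:   {8 * math.log2(ALPHANUMERICS):.1f} bits")
print(f"6 English words:   {6 * bits_per_word:.1f} bits")

# How many syllables carry as much entropy as six Diceware words?
syllables_needed = 6 * bits_per_word / bits_per_syllable
print(f"syllables needed:  {syllables_needed:.2f}")


def phrase_length(units, average_length):
    """Average phrase length in characters, with single spaces between units."""
    return units * average_length + (units - 1)


print(f"English phrase:    {phrase_length(6, 4.2):.1f} characters")
print(f"Chinese phrase:    {phrase_length(9, 3.2):.1f} characters")
```

Nine syllables land a hair under six English words, which is close enough to call the two equivalent.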

Feel free to use/improve my lists and scripts, and never forget: the coolest eight have no order!

Free will

My thoughts on Free Will by Sam Harris, cross-posted from Goodreads.

I had already enjoyed the 2012 talk and was a bit worried that a “book” this short couldn’t add much to it. It doesn’t, in fact, add much, but it was still worth my while to revisit the argument in a different medium.

The first of Harris’ arguments concerns experiments where the test subjects are asked to make a decision and record the time of the decision. Apparently, the decision can be predicted from brain activity before the test subject is aware of having made it, which Harris argues shows that our decisions are made for us by deeper processes. I know nothing about psychology or neurology, so I don’t know if the conclusion is sound, but I wish that Harris had spent a little more time exploring this. It makes no evolutionary sense for our consciousness to act only as a narrator for decisions already made, because it would be superfluous. What kinds of choices need to involve our consciousness? When the decision is made elsewhere, why does our consciousness pretend to be in charge? Is it possible, with self-control, to force certain decisions out of the dark, into the light of our conscious thought?

Second is the problem of regress. To quote:

My choices matter—and there are paths towards making wiser ones—but I cannot choose what I choose. And if it ever appears that I do—for instance, after going back between two options—I do not choose to choose what I choose. There is a regress here that always ends in darkness.

Or more succinctly:

We are not self-caused little gods.

I think this is compelling, but it is a little bit like the children’s game of “why why why.” Colloquially, we can account for why it snows without asking “why” all the way back to the origin of the universe. Perhaps a similar line can be drawn for inquiries into volition, that ends somewhere inside our heads?

Third, Harris says that introspection will reveal that the sources of our thoughts and decisions are mysterious even to ourselves. Ever since I saw his talk I have tried to think about this, but cannot say I find it as obviously true as Harris does. I don’t know where my ideas and impulses come from, but if pressed I think I could attribute many of them to known external and internal sources, which are obviously not of my choosing, but still not mysterious. Some preferences, like tea or coffee, are mysterious, but it’s not mysterious why I prefer an iced tea over hot chocolate on a warm summer day.

Finally, Harris untangles free will from determinism. We don’t yet know for certain which kind of universe we inhabit, but there’s nothing about an indeterminate universe that would grant us free will. Conversely, compatibilism is the view that we can have free will even in a deterministic universe, though Harris is rather dismissive of it. I should probably read Elbow Room: The Varieties of Free Will Worth Wanting to get a fair treatment of the subject.

In the end, the notion of free will is rather like the notion of god—ill-defined and with no supporting evidence. For now, I have no choice but to withhold belief.

Space: everything except the Earth

When I was growing up I wasn’t particularly interested in real-world space exploration. I liked Star Trek and The Hitchhiker’s Guide to the Galaxy, but don’t remember ever being excited about a Space Shuttle launch, Hubble, Mir or the ISS. However, the precision engineering of last year’s Curiosity landing on Mars really caught my attention, and since then I’ve been learning about everything that I’ve missed out on. Here are some awesome things, other than the Earth, in rough chronological order.

The Universe is big. Really big. The amount of knowledge and ignorance we have about the Universe is exhilarating, and it’s changing all the time. In the absence of a quantum theory of gravity, Lawrence Krauss’s A Universe from Nothing is an interesting bit of speculation about where it all came from. Our knowledge of the age just got better, with ESA announcing 13.82 billion years as the new best bet. At the other end of time, how the Universe will end seems to be unknown, but most of the hypotheses point to a Universe that doesn’t care about our feelings.

Robert Goddard’s A Method of Reaching Extreme Altitudes (PDF) from 1919 is surprisingly comprehensible to a software engineer from 1984. The illustrations and photos are wonderful; in particular, the Coston ship rocket bundle (p. 48, fig. 7) reminded me of a certain xkcd what if? Also, I was intrigued to find, on the topic of “recovery of apparatus on return,” that Goddard suggested a limited form of powered descent (p. 53):

If it is considered desirable, for any reason, to dispense with a sufficiently large parachute, the retarding of the apparatus may be accomplished to any degree by having the rocket consist, at its highest point of flight, not merely of instruments plus parachute, but of instruments together with a chamber, and considerable propellant material. Then, after the rocket has descended to some lower level, […] this propellant material can be ejected, so that the velocity is considerably checked before the apparatus reaches as low an altitude as, say, 5,000 ft.

Goddard also discussed briefly the issue of reaching the Moon; the mocking New York Times editorial and the 49-year-late “correction” are at the same time amusing and tragic.

Reach the Moon we did. If Up Goer Five makes it seem simple, then Stages to Saturn: A Technological History of the Apollo/Saturn Launch Vehicles (PDF) uncovers the amazing engineering of the many engines and stages of the Saturn rockets. (I’m currently editing an EPUB version of this book.) Just days ago, Jeff Bezos recovered two F-1 engines from the bottom of the Atlantic Ocean, with the intent of turning them into museum exhibits. While I’d love to see them, it saddens me that humanity hasn’t had the capability for lunar exploration in my lifetime, and that it’s now a subject for history books and museums.

In the shadow of the Saturn V stands the Soviet N1. Achievements include 30 (!) engines on the first stage and one of the largest artificial non-nuclear explosions in history, but not reaching space, orbit, or the Moon. Information on the Soviet space program is hard to come by, so Rockets and People, Volume 4: The Moon Race seems like a very valuable account of these events, which I hope to find the time to read soon.

The Shuttle era began before I was born and only recently came to an end. Even though the destination (LEO) is mundane compared to the Saturn V’s, I still thoroughly enjoyed 45 minutes of HD-quality, high-speed footage of the launch sequences of STS-114, STS-117, and STS-124, with commentary from NASA rocket engineers (via reddit).

SpaceX does plenty of things to be excited about. It was only a few weeks ago that they completed a 24-story test flight with their Grasshopper reusable rocket: powered descent all the way down, and on a far greater scale than what Goddard contemplated. If full reusability can be made to work, it should lower launch costs dramatically and thus expand access to space to a whole new level. In the category of awesome power, the Falcon Heavy will fire 27 (!) Merlin 1D engines at liftoff, making it the most powerful rocket in my lifetime. I can’t wait to see it roar.

Mars is what got me interested in non-Earth matters, and it must be the next destination for humanity. I very much enjoyed Robert Zubrin’s The Case for Mars, in which he argues for a simple mission structure and for producing the methane for the return journey on Mars using atmospheric carbon dioxide. Elon Musk clearly wants to go, calling it “planetary redundancy” and “life insurance.” Perhaps not incidentally, SpaceX is working on a methane-powered engine. Of the publicly announced efforts to actually go, neither Inspiration Mars (human flyby in 2018) nor Mars One (one-way colonization in 2023) seems completely impossible. To see humanity take this step in my lifetime is the most exciting prospect of all.


Before the ultrasound you were still fairly abstract to me; I knew you were in there, but nothing else. Seeing you flicker around on the screen and getting a single piece of information about you (boy) immediately made you much more concrete. That evening I started thinking about names.

Alva is a lovely name, which led my thoughts to Thomas Alva Edison. Most of my suggestions were unusual names and were quickly voted down, so when you get older you can thank Mom that your name isn’t Newton or Kepler. Mom’s favorite name was Liam. It took a few months, but in the end you got to be named Edison after all.

Reality became somehow distorted inside the delivery room. A clock hung on the wall, slowly, slowly counting up the seconds. Sometimes the slow ticking was the only thing I could hear, but sometimes I didn’t hear it at all. We wanted so badly to meet you, but you seemed completely unmoved by the passage of time.

When you finally arrived, time started moving faster again. You cried a little feebly and it was impossible to hold back the tears. A midwife congratulated me and I managed to open my mouth, but I don’t know whether my “thank you” was heard at all. Suddenly it was just you and me. You lay there watching while I sang Mormors lilla kråka (“Grandma’s Little Crow”), but without driving into the ditch.

A moment later you have already turned eight months. You have started saying “ba ba ba” when you play, and sometimes it sounds exactly as if you are calling for me. You crawl backwards and dance when Aunt My dances to Korean pop. Soon you will walk and talk and move away from home, and Dad will miss you. Then perhaps I will take out these pictures and look at them, one picture for every month you have been with us.

30 June 2012: 5 days old

18 July 2012: With Grandma and Grandpa (Dad’s parents) in Stora Hultrum

5 August 2012: Freshly bathed

23 September 2012: Loved

27 October 2012: Kungsparken in Gothenburg

22 November 2012: Troll with a fresh haircut

26 December 2012: Christmas in Stora Hultrum

24 January 2013: With Grandma and Grandpa (Mom’s parents) in Hà Nội

25 February 2013: On a beach holiday in Đà Nẵng

Dad loves you, Edison.

What do the Sweden Democrats want with us?

After the dismal election result it is tempting to shout “Hitler is coming!” and sneer at how stupid the citizens are, but that is rather unobjective and tiresome. Since my wonderful wife is a recent immigrant, I have instead “amused” myself by reading the Sweden Democrats’ immigration policy program (cache) to see what they have to offer us.

The person residing here shall assume responsibility for supporting the relative for a five-year period and shall in addition pay a one-time sum, amounting to one price base amount, as a contribution to the state’s expenses for the relative’s Swedish tuition and other adjustment costs.

So I would have to pay 42400 kr for my wife’s residence permit, as “a contribution” to SFI (Swedish for Immigrants). In other situations where we temporarily burden society (for example when having children or studying at university), that cost is spread out over the tax and VAT we pay, so why not here as well? It is hard to see it as anything other than a penalty fee because I happened to find love outside Sweden.

All residence permits granted to newly arrived foreigners shall be temporary. The possibility of issuing permanent residence permits as an intermediate step between temporary residence permit and citizenship shall thereby be abolished. The temporary residence permit is valid for one year at a time […]

No PUT (permanent residence permit), ever. On top of that, we are supposed to run to Migrationsverket (the Migration Agency) once a year. Fun! And how long is this supposed to go on?

The requirements for obtaining Swedish citizenship shall be tightened considerably. Basic requirements shall be that one has resided in the country for at least ten years and that during this time one has displayed irreproachable conduct.

Ten years! In a few years, when we have children, it would be very practical to have Swedish passports for the whole family when traveling abroad, since visa rules are often different for Vietnamese and Swedish citizens. SD instead offers more hassle and more running to embassies!

In addition, one shall also sign a declaration confirming one’s loyalty to Sweden and committing oneself to respect Swedish laws and other rules of society.

Being loyal to Sweden is about as absurd as being loyal to Småland. It is at best a ridiculous symbolic act and at worst a promise to always put Sweden’s interests ahead of other countries’, that is, the opposite of international solidarity. Nobody should have to sign such nonsense.

Anyone who is a Swedish citizen shall not be able to hold any citizenship other than the Swedish one.

This means that my wife and my future children will not be able to have Vietnamese passports and will thus be forced to apply for visas when they go to visit Grandma and Grandpa at Tết (New Year). SD has, however, cleverly solved that by clarifying that some New Years are better than others:

Days off in connection with religious holidays shall only cover traditional Swedish and Christian holidays.

(Admittedly Tết is not a religious holiday, but I doubt that makes any difference.)

All in all, it is clear that you Sweden Democrats want to mess with me and my family a great deal. You “distance yourselves from multiculturalism” and do not want my wife’s culture to be “put on an equal footing with, or valued more highly than,” the Swedish one. I then ask you to go to hell, because we intend to keep celebrating un-Christian holidays and speaking different languages, all jumbled together!

When the actual political programs are this pathetic, nobody should be worried about “taking the debate” with the Sweden Democrats: simply quote their own opinions and most people will realize what drivel it actually is.

Objective comments are welcome, others are not.

SRT research

Discussions in the WHATWG and W3C over several months have led up to the announcement of a new <track> element and the WebSRT format. WebSRT is intended to be mostly compatible with existing SRT content and software, in order to hitch a free ride on the popularity of SRT.

Unfortunately, there was never a proper SRT parsing specification, so all media players implement their own parsers and error handling, much as was the case with HTML before HTML5. If these media players are going to support any of the new features in WebSRT, they will have to do so by modifying existing SRT parsers, as there’s nothing to differentiate SRT and WebSRT. Interoperability would be helped if they were able to converge on the same parsing algorithm, but they can only do that if WebSRT handles existing content as well as or better than current algorithms do. If we cannot achieve that, it might be better to invent a format that has no legacy compatibility constraints.

There’s been some testing of existing media players, but not much analysis of existing content. I asked OpenSubtitles if they could help out, and they very kindly provided me with the latest 10000 uploaded SRT* files. I wrote a Python script to analyze them, and I think the results are interesting.

First a note on character encoding. Only 666 files were valid UTF-8 and out of those 472 were pure 7-bit ASCII, so deliberate use of UTF-8 doesn’t even reach 2%. Since WebSRT assumes UTF-8, little existing content can be reused as-is.
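Sorting files into those three buckets takes only a couple of decode attempts. This is a sketch of how such a check could be written, not the script used for the numbers above; `classify_encoding` is a made-up name.

```python
def classify_encoding(data):
    """Classify raw file bytes as 'ascii', 'utf-8' or 'other'."""
    try:
        data.decode("ascii")
        return "ascii"   # pure 7-bit ASCII is also valid UTF-8
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"   # valid UTF-8 with at least one non-ASCII character
    except UnicodeDecodeError:
        return "other"   # some legacy encoding, which this check cannot identify


print(classify_encoding("General".encode("ascii")))
print(classify_encoding("Général".encode("utf-8")))
print(classify_encoding("Général".encode("latin-1")))
```

With the figures above, 666 − 472 = 194 files fall into the `utf-8` bucket, which is 1.94% of the 10000 files.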

This is the typical structure of SRT (source):

1
00:00:10,000 --> 00:00:16,000
The Conceited General

2
00:01:08,520 --> 00:01:10,240
The general returns victorious

I’ll use WebSRT terminology: above are 2 cues, each with 3 lines for identifier, timings and the cue text followed by a blank line. Unfortunately, assuming that a blank line separates cues turns out to be unreliable, as 241 files at some point omitted that blank line. In my code, I let a timing line start a new cue even if not preceded by a blank line. I’m not sure what the best general approach is.

The identifier line is mostly useless and has been made optional in WebSRT. I defined any line preceding a timing line as being the identifier. Under this assumption, 571 files had identifiers that didn’t increase by 1 per cue and 55 files had identifiers which weren’t numbers at all. This doesn’t seem to matter to existing players.
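The two rules above (a timing line always starts a new cue, and any line directly before a timing line is the identifier) can be sketched as follows. This is my own illustration rather than the analysis script: the names `TIMING`, `to_ms` and `parse_cues` are made up, the sample deliberately omits one blank line and one identifier, and blank lines inside cue text are simply dropped.

```python
import re

# Timing line: start --> end, with either a comma or a period before the milliseconds.
TIMING = re.compile(
    r"^(\d\d):(\d\d):(\d\d)[,.](\d\d\d)\s*-->\s*"
    r"(\d\d):(\d\d):(\d\d)[,.](\d\d\d)"
)


def to_ms(hours, minutes, seconds, millis):
    """Convert the four captured timestamp fields to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_cues(text):
    """Split SRT text into cues; a timing line always starts a new cue."""
    lines = [line.strip() for line in text.splitlines()]
    timing_rows = [i for i, line in enumerate(lines) if TIMING.match(line)]
    cues = []
    for n, row in enumerate(timing_rows):
        fields = TIMING.match(lines[row]).groups()
        has_next = n + 1 < len(timing_rows)
        next_row = timing_rows[n + 1] if has_next else len(lines)
        text_rows = lines[row + 1:next_row]
        # A non-blank line right before the next timing line is that cue's
        # identifier, not part of this cue's text.
        if has_next and text_rows and text_rows[-1]:
            text_rows.pop()
        previous = lines[row - 1] if row > 0 else ""
        if TIMING.match(previous):
            previous = ""  # two timing lines in a row: no identifier
        cues.append({
            "id": previous or None,
            "start": to_ms(*fields[:4]),
            "end": to_ms(*fields[4:]),
            "text": [line for line in text_rows if line],
        })
    return cues


sample = """1
00:00:10,000 --> 00:00:16,000
The Conceited General
2
00:01:08,520 --> 00:01:10,240
The general returns victorious

00:01:11.000 --> 00:01:12.000
No identifier here
"""
for cue in parse_cues(sample):
    print(cue)
```

The weak spot is a file that omits both the identifier and the blank line: the last line of cue text then sits directly before a timing line and gets mistaken for an identifier.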

The timings are a bit more interesting. No fewer than 1707 files had overlapping cues. Most existing players handle this by simply showing (only) the next cue when it begins, so such overlap goes unnoticed. However, the WebSRT parser makes no such adjustments, intending that overlapping cues be shown simultaneously. This will quite certainly be a problem if existing content is reused. Also worth noting is that only 4 files consistently used a period (.) to separate seconds and milliseconds, 2 files mixed the two (apparent typos) and all the rest used only commas (,). Only 1 file used the SubRip X1: … syntax and 38 files had something else trailing the timings. This was mostly trailing punctuation (.,?), a missing newline before the cue text, or random typos.
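To make the difference concrete, here is a small sketch that counts overlapping cues and then mimics the “show only the next cue” behaviour by cutting each cue short when the following one begins. The definition of overlap used here (a cue starting before an earlier one has ended) is my assumption, and `count_overlaps` and `clip_to_next` are hypothetical names. Cues are plain `(start, end)` pairs in milliseconds.

```python
def count_overlaps(cues):
    """Count cues that start while an earlier cue is still on screen."""
    overlaps = 0
    latest_end = None
    for start, end in sorted(cues):
        if latest_end is not None and start < latest_end:
            overlaps += 1
        latest_end = end if latest_end is None else max(latest_end, end)
    return overlaps


def clip_to_next(cues):
    """Mimic players that show only the newest cue: end each cue when the next begins."""
    ordered = sorted(cues)
    clipped = []
    for (start, end), following in zip(ordered, ordered[1:] + [None]):
        if following is not None and following[0] < end:
            end = following[0]
        clipped.append((start, end))
    return clipped


cues = [(0, 2000), (1500, 3000), (3000, 4000)]
print(count_overlaps(cues))                # the second cue starts before the first ends
print(clip_to_next(cues))
print(count_overlaps(clip_to_next(cues)))  # no overlap left after clipping
```

A parser that wants to stay compatible with existing content would need something like the second function, or an explicit decision to show overlapping cues together.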

What remains is the cue text itself. Markup, which I defined as anything matching the regular expression ‘<(\w+)>’ or the string ‘<font’, was surprisingly common, occurring in 5525 files. The most common are <i> (5273), <b> (937), <font …> (346) and <u> (71). The WebSRT parser handles italic, bold and ruby markup, ignoring the rest. The fact that markup is so common means that any robust SRT (not just WebSRT) parser must handle it in some way, even if only by ignoring it.
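The markup definition above translates directly into code. The sketch below counts, per tag, how many files use it at least once. The function names are made up, and lowercasing the tag names is my own choice rather than something stated above.

```python
import re
from collections import Counter

TAG = re.compile(r"<(\w+)>")  # simple opening tags such as <i> or <b>


def tags_in(text):
    """Return the set of markup tag names used in one file's text."""
    tags = {name.lower() for name in TAG.findall(text)}
    if "<font" in text:       # <font ...> carries attributes, so match it as a string
        tags.add("font")
    return tags


def count_files_per_tag(texts):
    """Count, for each tag, the number of files in which it occurs."""
    counts = Counter()
    for text in texts:
        counts.update(tags_in(text))
    return counts


files = [
    "<i>Hello</i>",
    "<b>Hi</b> <i>there</i>",
    '<font color="red">x</font>',
    "plain text",
]
print(count_files_per_tag(files))
print(sum(1 for text in files if tags_in(text)), "files contain markup")
```

Closing tags such as `</i>` are not matched by the expression, so each file is counted once per tag regardless of how often the tag appears.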

That’s what I could gather from the data I have. If there’s something you want me to check, just leave a comment. Many thanks to OpenSubtitles for providing the data.

*They noted that this regular expression was used to identify SRT files: /^\d\d:\d\d:\d\d[,.]\d\d\d\s*-->\s*\d\d:\d\d:\d\d[,.]\d\d\d\s*(X1:\d+\s+X2:\d+\s+Y1:\d+\s+Y2:\d+)?\s*$/m This means that very broken files won’t have been included.