Project · Break a Vigenère cipher

project

hard

module project

Ship something real. Submit your work when you're done.

Brief

You are given a ciphertext encrypted with a Vigenère cipher using an English-language key of length 6. Build the full cryptanalytic pipeline in Python: (1) estimate the key length using the index of coincidence (IC) by trying candidate lengths 1..20 and picking the smallest length whose average split-stream IC is close to that of English (~0.067), since multiples of the true length (here 12 and 18) score similarly, (2) for each of the 6 streams, use chi-squared frequency analysis to recover the Caesar shift that aligns the stream with English letter frequencies, (3) reconstruct the key from those shifts and decrypt the full ciphertext. Your script must run without human intervention on any sufficiently long ciphertext (>= 500 characters) of the same form.

Deliverables

vigenere_break.py — a single Python file implementing index_of_coincidence(text), find_key_length(ct, max_len=20), recover_key(ct, key_length), decrypt(ct, key) and a __main__ block that runs end-to-end on the provided ciphertext.
key_length_table.txt — printed output showing the IC score for each candidate length 1..20, with length 6 visibly winning (closest to ~0.067).
recovered_key.txt — the recovered 6-letter key (printed by your script).
plaintext.txt — the decrypted plaintext, readable English with no manual fixups.
README.md — a short writeup explaining (a) why the index of coincidence reveals key length, (b) how splitting into streams reduces the problem to 6 Caesar ciphers, (c) what would change about your attack if the key length were 60 instead of 6.
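As a rough sketch of the simplest deliverable, decrypt could look like the following. It assumes the ciphertext has already been reduced to uppercase A-Z and that the key is uppercase; the encrypt helper is hypothetical (not one of the required functions) and is included only so the round trip can be checked.

```python
A = ord("A")


def encrypt(pt: str, key: str) -> str:
    """Hypothetical helper: Vigenère-encrypt uppercase A-Z text."""
    return "".join(
        chr((ord(p) - A + ord(key[i % len(key)]) - A) % 26 + A)
        for i, p in enumerate(pt)
    )


def decrypt(ct: str, key: str) -> str:
    """Subtract the repeating key from uppercase A-Z ciphertext."""
    return "".join(
        chr((ord(c) - ord(key[i % len(key)])) % 26 + A)
        for i, c in enumerate(ct)
    )


if __name__ == "__main__":
    ct = encrypt("ATTACKATDAWN", "LEMON")
    print(ct)                    # LXFOPVEFRNHR
    print(decrypt(ct, "LEMON"))  # ATTACKATDAWN
```

Because both letters carry the same ord("A") offset, the subtraction in decrypt cancels it and only the final result needs shifting back into A-Z.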

How we grade it

find_key_length returns 6 for the provided ciphertext, with the IC score for length 6 within 0.005 of 0.067.
recover_key returns the correct 6-letter key (verifiable by decrypting and reading English).
decrypt produces plaintext matching the reference plaintext exactly (case-insensitive, ignoring spaces/punctuation).
The pipeline runs in under 5 seconds on any 500+ character Vigenère ciphertext with key length <= 20 — no manual tuning per input.
README correctly explains why the IC of English text (~0.067) differs from the IC of random text (~0.0385) and why splitting at the true key length restores English-like IC in each stream.
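As a starting point for that README explanation, the two reference values can be written out directly. This is a sketch in the large-sample limit; p_i (a symbol introduced here, not in the brief) denotes the relative frequency of letter i in English.

```latex
\mathrm{IC}_{\text{random}} = \sum_{i=1}^{26} \left(\tfrac{1}{26}\right)^{2} = \tfrac{1}{26} \approx 0.0385,
\qquad
\mathrm{IC}_{\text{English}} = \sum_{i=1}^{26} p_i^{2} \approx 0.067
```

A Caesar shift only permutes the p_i, so the sum of squares is unchanged. Each stream at the true key length is a Caesar-shifted sample of English and therefore keeps an English-like IC, while a stream that mixes several shifts has a flatter distribution and a lower IC.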

Hints

The index of coincidence of a text is the probability that two letters drawn uniformly at random from the text, without replacement, are equal: IC = sum(n_i * (n_i - 1)) / (N * (N - 1)) where n_i is the count of letter i and N is the total number of letters. For English it is ~0.067; for uniformly random letters it is ~0.0385 (= 1/26).
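The formula translates almost line for line into Python. This is a minimal sketch, assuming the input has already been reduced to the letters you want to count; returning 0.0 for texts shorter than two characters is a choice made here to avoid dividing by zero.

```python
from collections import Counter


def index_of_coincidence(text: str) -> float:
    """IC = sum(n_i * (n_i - 1)) / (N * (N - 1)) over the letter counts."""
    n = len(text)
    if n < 2:
        return 0.0  # undefined for fewer than two letters; 0.0 is an assumption
    counts = Counter(text)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


if __name__ == "__main__":
    print(index_of_coincidence("AAAA"))  # 1.0: every pair matches
    print(index_of_coincidence("ABCD"))  # 0.0: no pair matches
```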

To find the key length, for each candidate length L split the ciphertext into L streams (positions i, i+L, i+2L, ... for each i in 0..L-1), compute the IC of each stream, and average. The true L produces an average IC near 0.067; wrong lengths produce noticeably lower averages, closer to 0.0385. Multiples of the true L also score near 0.067, so prefer the smallest length that does.
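One way to turn this hint into code is sketched below. Two things here are assumptions rather than part of the brief: the helper name average_stream_ic, and the selection rule, which takes the smallest length whose average IC reaches 90% of the best score instead of measuring distance to 0.067 directly. The 0.9 ratio is an arbitrary illustration value that you may need to tune.

```python
from collections import Counter


def index_of_coincidence(text):
    n = len(text)
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in Counter(text).values()) / (n * (n - 1))


def average_stream_ic(ct, length):
    """Hypothetical helper: mean IC over the `length` interleaved streams."""
    streams = [ct[i::length] for i in range(length)]
    return sum(index_of_coincidence(s) for s in streams) / length


def find_key_length(ct, max_len=20, ratio=0.9):
    """Smallest candidate whose average IC is within `ratio` of the best.

    Taking the smallest qualifying length avoids returning a multiple
    (12, 18, ...) of the true key length. `ratio` is an assumed parameter.
    """
    scores = {L: average_stream_ic(ct, L) for L in range(1, max_len + 1)}
    best = max(scores.values())
    return min(L for L, s in scores.items() if s >= ratio * best)


if __name__ == "__main__":
    # Degenerate demo: plaintext "AAAA..." under key CRYPTO is the key repeated.
    print(find_key_length("CRYPTO" * 100))  # 6
```

The demo input is deliberately artificial so the result is easy to verify by hand: lengths 6, 12 and 18 all give streams of a single repeated letter, and the rule picks 6.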

For each stream, finding the Caesar shift is just chi-squared minimisation against the English distribution; reuse the chi_squared function from the frequency-analysis task. The shift s that minimises chi-squared on stream i gives the i-th key letter: the letter at alphabet index s (A = 0, so shift 2 is 'C').
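The per-stream search might be sketched as follows. The chi_squared function from the frequency-analysis task is not shown in this brief, so the version below is a stand-in with an assumed signature, and ENGLISH_FREQ is a commonly quoted approximate frequency table rather than a required one. The helper names shifted_counts and best_shift are also hypothetical.

```python
# Approximate English letter frequencies for A..Z (assumed table).
ENGLISH_FREQ = [
    0.08167, 0.01492, 0.02782, 0.04253, 0.12702, 0.02228, 0.02015,
    0.06094, 0.06966, 0.00153, 0.00772, 0.04025, 0.02406, 0.06749,
    0.07507, 0.01929, 0.00095, 0.05987, 0.06327, 0.09056, 0.02758,
    0.00978, 0.02360, 0.00150, 0.01974, 0.00074,
]


def chi_squared(counts):
    """Stand-in for the earlier task's function: sum((O - E)^2 / E)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(
        (counts[i] - total * ENGLISH_FREQ[i]) ** 2 / (total * ENGLISH_FREQ[i])
        for i in range(26)
    )


def shifted_counts(stream, shift):
    """Letter counts of the stream after undoing a Caesar shift."""
    counts = [0] * 26
    for ch in stream:
        counts[(ord(ch) - 65 - shift) % 26] += 1
    return counts


def best_shift(stream):
    """Shift in 0..25 whose decryption looks most like English."""
    return min(range(26), key=lambda s: chi_squared(shifted_counts(stream, s)))


def recover_key(ct, key_length):
    """Stream i yields key letter i: the letter at index best_shift (A = 0)."""
    return "".join(
        chr(65 + best_shift(ct[i::key_length])) for i in range(key_length)
    )
```

Note that the key letter comes from the stream index i, while the shift value only selects which letter of the alphabet it is.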

Be careful with iteration boundaries: stream i of a ciphertext of length N with key length L has either N // L or N // L + 1 characters (the first N % L streams get the extra one). Strip non-alphabetic characters and convert to uppercase before any analysis.
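A small sketch of the clean-up and split steps, which lets you check the stream lengths directly. The function names normalise and split_streams are illustrative, not required by the deliverables, and the filter keeps only ASCII letters, which is an assumption about the input.

```python
import string


def normalise(raw: str) -> str:
    """Keep ASCII letters only and convert them to uppercase."""
    return "".join(ch.upper() for ch in raw if ch in string.ascii_letters)


def split_streams(ct: str, length: int) -> list[str]:
    """Stream i holds positions i, i+length, i+2*length, ..."""
    return [ct[i::length] for i in range(length)]


if __name__ == "__main__":
    print(normalise("Hello, World! 123"))  # HELLOWORLD
    print(split_streams("ABCDEFGHIJ", 3))  # ['ADGJ', 'BEH', 'CFI']
```

With N = 10 and L = 3, the first N % L = 1 stream has N // L + 1 = 4 characters and the others have 3. Slicing with a step avoids the manual index arithmetic where off-by-one errors usually creep in.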

Sanity check: if your recovered key looks like 'KZJXQB' (a high-entropy gibberish key), you have a bug — likely an off-by-one in the stream split. A correct recovery on an English-key ciphertext will yield a real English word like 'CRYPTO' or 'SECRET'.

Expected output

$ python vigenere_break.py
Trying key lengths 1..20:
L= 1 avg IC = 0.0421
L= 2 avg IC = 0.0428
L= 3 avg IC = 0.0419
L= 4 avg IC = 0.0431
L= 5 avg IC = 0.0425
L= 6 avg IC = 0.0668 <-- best
L= 7 avg IC = 0.0419
...
Best key length: 6
Recovering key by chi-squared on each stream:
stream 0: shift 2 -> 'C'
stream 1: shift 17 -> 'R'
stream 2: shift 24 -> 'Y'
stream 3: shift 15 -> 'P'
stream 4: shift 19 -> 'T'
stream 5: shift 14 -> 'O'
Recovered key: CRYPTO
Decrypted plaintext (first 200 chars):
THEINDEXOFCOINCIDENCEISASTATISTICALMEASUREDEVELOPED
BYWILLIAMFRIEDMANINTHE1920SFORATTACKINGPOLYALPHABET
ICCIPHERSITEXPLOITSTHEFACTTHATNATURALLANGUAGEHASA...

Stretch goals

Replace the chi-squared scoring with a bigram or trigram log-likelihood model trained on a large English corpus (e.g. Project Gutenberg). Compare accuracy on short ciphertexts (< 200 chars) where chi-squared starts to fail.

Extend the pipeline to handle key lengths up to 100. Add a Kasiski examination as a second key-length estimator (find repeated trigrams, take GCDs of the gaps) and combine its votes with the IC method.
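A possible starting point for the Kasiski estimator is sketched below. Instead of taking a single GCD over all gaps, which one coincidental repeat can ruin, this variant counts how many gaps each candidate length divides. That voting scheme, the function name kasiski_votes and its parameters are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def kasiski_votes(ct, max_len=100, n=3):
    """Votes per candidate key length from repeated n-gram spacings."""
    positions = defaultdict(list)
    for i in range(len(ct) - n + 1):
        positions[ct[i:i + n]].append(i)

    votes = Counter()
    for pos in positions.values():
        # Gaps between consecutive occurrences of the same n-gram.
        for a, b in zip(pos, pos[1:]):
            gap = b - a
            for length in range(2, max_len + 1):
                if gap % length == 0:
                    votes[length] += 1
    return votes


if __name__ == "__main__":
    # "ABC" repeats at distance 9, so lengths 3 and 9 each get one vote.
    print(kasiski_votes("ABCDEFGHIABC", max_len=20))
```

Small divisors collect votes from almost every gap, so some weighting is needed when combining these counts with the IC scores.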

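Make the attack adversarial: write a generator that produces Vigenère ciphertexts whose plaintext is itself uniformly random over the alphabet. Confirm your attack fails on these, and explain why: the ciphertext is then uniformly random too, so no statistic can separate the right key from a wrong one. Relate this to the one-time pad (OTP), which achieves the same uniformity through a random, non-repeating key rather than through the plaintext.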
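Such a generator can be very small. The sketch below uses the standard random module with an optional seed so that experiments are repeatable; the function name and the seed parameter are illustrative choices, and the key is assumed to be uppercase A-Z.

```python
import random
import string


def random_plaintext_ciphertext(n, key, seed=None):
    """Vigenère-encrypt n uniformly random uppercase letters under `key`."""
    rng = random.Random(seed)
    plaintext = rng.choices(string.ascii_uppercase, k=n)
    return "".join(
        chr((ord(p) - 65 + ord(key[i % len(key)]) - 65) % 26 + 65)
        for i, p in enumerate(plaintext)
    )


if __name__ == "__main__":
    print(random_plaintext_ciphertext(60, "CRYPTO", seed=1))
```

Feeding these ciphertexts to your key-length table should show every candidate length scoring near 0.0385, with no length standing out.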

Measure the attack as a function of ciphertext length: record running time and success rate over 100 random English plaintexts at each length 100, 200, ..., 2000, and plot success rate against ciphertext length.

Implement a complete known-plaintext attack: given any 6 consecutive plaintext characters and their ciphertext positions, recover the key without using IC at all.
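The core of this attack is one subtraction per known character. The sketch below assumes uppercase A-Z input and a known key length; the function name, its parameters and the error handling are illustrative.

```python
def key_from_known_plaintext(ct, known, start, key_length=6):
    """Recover the key from `known` plaintext located at ct[start:].

    Each known character at position pos fixes key slot pos % key_length,
    because key letter = (ciphertext letter - plaintext letter) mod 26.
    """
    key = [None] * key_length
    for j, p in enumerate(known):
        pos = start + j
        key[pos % key_length] = chr((ord(ct[pos]) - ord(p)) % 26 + 65)
    if None in key:
        raise ValueError("known plaintext does not cover every key position")
    return "".join(key)
```

Six consecutive characters always cover all six key slots, but they start at slot start % 6, which is why indexing by position matters more than the order in which the letters are recovered.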