Japanese pronunciation is one of the more learnable parts of the language for English speakers because the sound inventory is smaller than English and the rules are consistent. Five vowels, a handful of tricky consonants, and two rhythm rules (mora timing and pitch accent) cover almost everything. This guide names each of those pieces, shows minimal-pair examples so you can hear the difference, and gives you a 15-minute daily drill that works alongside any JLPT preparation you are already doing.
The Japanese sound system in one page
Five vowels, about 15 consonants, mora rhythm, and pitch accent.
Compared with English (roughly 20 vowels and diphthongs plus 24 consonants, depending on the dialect), Japanese is almost minimalist. That small inventory is why spelling is predictable: nearly every hiragana character maps to a single sound (the particles は and へ are the main exceptions), which is also why transliteration systems like romaji work smoothly. The same compactness is why vowel length and small っ matter so much: with fewer sounds available, the language uses timing differences to separate words.
Vowels, long vowels, and why length matters
All five Japanese vowels are short. Long vowels are exactly double length.
Japanese has exactly five vowel qualities: a (like father but shorter), i (like machine), u (like put but with unrounded lips), e (like pet but tenser), and o (like forty). Every vowel is short by default. A written long vowel is exactly twice the length of a short vowel — not "a longer a" but literally two beats of the same sound glued together.
English speakers often shorten long vowels unconsciously because English uses vowel quality, not vowel length, to distinguish most words. Train your ear with minimal pairs: say each word slowly and clap once per mora, and you will feel that おばあさん (obaasan, grandmother) takes five claps while おばさん (obasan, aunt) takes four. The Kana to Romaji Converter tool shows the romaji with doubled letters (obaasan, not obasan) so you can see the length visually.
Seven consonants English speakers get wrong
Most Japanese consonants are easy. These seven need active practice.
The seven to drill specifically
- R (ら り る れ ろ) — a single tap with the tongue, closer to a Spanish R than English R/L. Never rhotic, never with English L.
- F (ふ only) — soft bilabial fricative: the lips come close together and the upper teeth do not touch the lower lip. ふ is the only F sound in native words; fa, fi, fe, fo appear only in loanwords, written ファ, フィ, フェ, フォ.
- V — does not exist in native Japanese; loanwords use B (バイオリン for violin). Modern katakana ヴ appears but is still often read as B.
- TH — does not exist; loanwords substitute S/Z (サンキュー for thank you, ザ for the). English speakers must consciously drop the English TH.
- SH/CH/TS — these are single sounds (one mora), not two. し is always shi, ち is chi, つ is tsu. Do not try to insert extra consonants.
- H in ひ and ふ — ひ is pronounced with a light palatal friction (close to German ich), not English H. ふ is the F sound above, not English H.
- N (ん) — word-final or mid-word ん assimilates to what follows: しんぶん (newspaper) uses M before B, and in ほんや (bookstore) the ん softens into a nasalized glide before the ya row. Let this happen naturally rather than forcing a clean N.
The L → R substitution is the most famous problem but the vowel-length rule above breaks more words than the R sound does. If you have only 5 minutes to practise a single consonant, pick Japanese R and drill the ら-り-る-れ-ろ row with a native audio source until you stop producing either English L or English R.
Mora timing: the rhythm backbone
Each mora gets roughly equal time. English stress-timing breaks this.
Japanese is a mora-timed language, which means each basic sound unit (mora) takes roughly the same length of time, regardless of whether it feels "important" to you as a speaker. English is stress-timed: we compress unstressed syllables and stretch stressed ones, so the three syllables of "elephant" take unequal time, with the unstressed ones clipped short. If you bring that habit into Japanese you will unconsciously shorten unstressed vowels, which is the single biggest reason a grammatically correct sentence still sounds "off" to a Japanese listener.
The mora, not the syllable, is the fundamental timing unit of Japanese. Haiku is counted in moras, not syllables. When you look up a word's pitch accent, the marks apply to moras, not syllables. Internalising this one idea fixes more pronunciation problems than any individual-sound drill.
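If you like to check your clap counts mechanically, mora counting can be sketched in a few lines of Python. This is a minimal illustration, not a full tokenizer: the function name `split_moras` is made up for this example, and it assumes the input is written entirely in kana (kanji would be miscounted as one mora per character). The rule it applies is that a small ゃ/ゅ/ょ or small vowel merges with the kana before it, while っ, ん, and the long-vowel mark ー each take a beat of their own.

```python
# Small kana that merge with the preceding kana into one mora (e.g. きょ, ファ).
SMALL_COMBINING = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def split_moras(kana: str) -> list[str]:
    """Split a kana-only string into moras.

    っ, ん and the long-vowel mark ー are ordinary characters here,
    so each one becomes its own mora.
    """
    moras: list[str] = []
    for ch in kana:
        if ch in SMALL_COMBINING and moras:
            moras[-1] += ch  # small kana joins the previous kana
        else:
            moras.append(ch)
    return moras


# Words taken from this guide; print the mora count and the split.
for word in ["おばさん", "おばあさん", "きて", "きって", "しんぶん"]:
    moras = split_moras(word)
    print(word, len(moras), "・".join(moras))
```

Each printed count is the number of claps the word should take, so you can compare it with what your hands actually did.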
Small っ and ん — the two invisible moras
Both take a full mora of time but produce no vowel sound.
The small tsu (っ) and the syllabic n (ん) are the two moras that trip up learners because they have no vowel to anchor them. Both must still be held for a full mora of time.
Small っ geminates (doubles) the following consonant. きて (kite, come) is two moras — ki, te. きって (kitte, stamp) is three moras — ki, っ (silent held beat), te. Hold the silent beat. English speakers often squish the っ into the consonant and produce something like "kittay" instead of "kit·te".
ん is also a full mora. しんぶん (shinbun, newspaper) is four moras: shi, n, bu, n. The ん also assimilates to the next consonant: it sounds like [m] before B/P/M, like [ŋ] before K/G, and like [n] before T/D/N. English speakers tend to glide through ん as if it were a consonant attached to the next vowel. Give it its full beat and the word sounds correct immediately.
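The three assimilation cases above can be written down as a small lookup, which is handy for predicting what you should hear before you check a recording. This is a rough sketch under stated assumptions: `n_allophones` is a hypothetical helper name, it handles hiragana only, and it covers only the three cases listed in the paragraph above. A word-final ん, or ん before any other sound, is simply left out of the result.

```python
# Kana rows grouped by their opening consonant (hiragana only, for brevity).
LABIAL = set("ばびぶべぼぱぴぷぺぽまみむめも")    # B / P / M rows
VELAR = set("かきくけこがぎぐげご")              # K / G rows
ALVEOLAR = set("たちつてとだぢづでどなにぬねの")  # T / D / N rows


def n_allophones(word: str) -> list[tuple[int, str]]:
    """Return (index, predicted sound) for each ん followed by a listed row."""
    results = []
    for i, ch in enumerate(word):
        if ch != "ん" or i + 1 >= len(word):
            continue  # not ん, or word-final ん (outside the three cases)
        nxt = word[i + 1]
        if nxt in LABIAL:
            results.append((i, "m"))
        elif nxt in VELAR:
            results.append((i, "ŋ"))
        elif nxt in ALVEOLAR:
            results.append((i, "n"))
    return results


print(n_allophones("しんぶん"))  # only the first ん is followed by a listed row
```

Whatever sound the lookup predicts, the ん still takes its own full beat.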
Pitch accent without the pain
Every Japanese word follows one of four basic pitch patterns.
Pitch accent in Japanese is simpler than stress in English because there are only a handful of patterns, and every word falls into one of them. In standard Tokyo Japanese, every word of two moras or longer is one of these four types:
The four Tokyo pitch patterns
- Heiban (平板, flat): low-high-high-high... and stays high. No drop. Example: さくら (LHH), 日本語 (LHHH).
- Atamadaka (頭高, head-high): high-low-low-low... Drops after the first mora. Example: 箸 (HL) chopsticks, でんき (HLL) electricity.
- Nakadaka (中高, middle-high): low-high-...-drop-low. Rises, holds, then falls. Example: たまご (LHL) egg, おかし (LHL) sweets.
- Odaka (尾高, tail-high): low-high-high-high, drop on the following particle. Example: 橋 (LH) bridge sounds the same as 端 (edge, heiban, also LH) in isolation, but with a particle 橋が is LHL while 端が stays LHH.
You do not need perfect pitch production from day one, but you should start recognising these patterns as early as possible because the minimal pairs are all high-frequency words. The free OJAD online dictionary marks pitch for any Japanese word; the Forvo pronunciation site lets you hear native speakers say individual words for free.
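The four patterns reduce to one rule once you know where the drop is. Pitch dictionaries usually give an accent number: 0 means no drop (heiban), and n means the pitch falls after the n-th mora. The Python sketch below turns that number into an H/L string. The function name `pitch_pattern` and its parameters are made up for this illustration, and the output is the simplified textbook H/L model used in this guide, not a phonetic measurement.

```python
def pitch_pattern(n_moras: int, accent: int, with_particle: bool = False) -> str:
    """Build a Tokyo-style H/L string from a mora count and an accent number.

    accent == 0       -> heiban (no drop)
    accent == 1       -> atamadaka (drop after the first mora)
    1 < accent < n    -> nakadaka (drop inside the word)
    accent == n_moras -> odaka (drop lands on a following particle)
    """
    if n_moras < 1 or not 0 <= accent <= n_moras:
        raise ValueError("accent must be between 0 and the number of moras")
    pattern = []
    for i in range(1, n_moras + 1):
        if accent == 0:
            high = i > 1           # low start, then high to the end
        elif accent == 1:
            high = i == 1          # high start, low afterwards
        else:
            high = 1 < i <= accent  # low start, high up to the drop
        pattern.append("H" if high else "L")
    if with_particle:
        # A particle stays high only after a heiban word.
        pattern.append("H" if accent == 0 else "L")
    return "".join(pattern)


print(pitch_pattern(3, 0))        # さくら, heiban
print(pitch_pattern(2, 1))        # 箸, atamadaka
print(pitch_pattern(3, 2))        # たまご, nakadaka
print(pitch_pattern(2, 2, True))  # 橋が, odaka: the drop shows up on が
print(pitch_pattern(2, 0, True))  # a two-mora heiban word plus が
```

The last two lines show why odaka and heiban words sound alike in isolation and only separate once a particle follows.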
Tokyo vs Osaka pitch — which to learn first
Learn Tokyo (標準語) first, then optionally add Osaka/Kansai later.
Standard Japanese (標準語) is based on Tokyo speech and is what the JLPT, NHK News, and most textbook audio use. Kansai-ben (Osaka/Kyoto) uses a different pitch system, and many common words come out reversed: a word that is HL in Tokyo is often LH in Osaka, and vice versa. Learning both simultaneously confuses most learners; learn Tokyo fluently, then pick up Kansai by exposure if you live or work in the region.
If you live in Osaka or interact heavily with Kansai speakers, the practical move is still to learn Tokyo patterns for exam and media purposes, then let Osaka patterns settle in naturally through daily exposure. Don't try to consciously switch between the two pitch systems, because you will end up producing a mix that sounds like neither.
Six mistakes English speakers make
These habits cost intelligibility faster than any single-sound error.
Fix these before chasing advanced pitch
- Shortening long vowels. おばあさん becomes おばさん. This changes the word, not just the sound.
- Over-aspirating p, t, k at word starts. Japanese stops carry far less aspiration than English ones, so say パ like the P in spark, not the P in park.
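- Sweeping English-style rising intonation on yes-no questions. In Japanese, the particle か already marks the question and any rise is confined to the final mora; an English-style rise that sweeps across the last word overrides its pitch accent and sounds performative or foreign.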
- Inserting English R into ら-り-る-れ-ろ. Japanese R is a tap, closer to the tapped T in American English butter than to any English R or L.
- Stressing one syllable like in English. Japanese has no syllable stress; applying English stress patterns compresses vowel length elsewhere and breaks mora timing.
- Dropping the small っ or treating ん as a consonant. Both are full moras; not holding them cuts the word short by one beat.
A 15-minute daily pronunciation drill
Same block every day. Adjust the words to your JLPT level.
Efficiency matters more than total time. Fifteen focused minutes daily beats an hour once a week. Use this block:
15-minute daily pronunciation block
- Minutes 0–3: Five kana rows read aloud with clean short vowels. Pick any five rows and say each character slowly, then in full rows, then in mixed order.
- Minutes 3–7: Shadow one minute of native audio (NHK Easy News or a JLPT listening sample at your level). Shadow means you repeat simultaneously with the speaker, not after. Start with 30-second chunks.
- Minutes 7–10: Pitch-accent minimal pair drill. Pick three pairs (雨/飴, 橋/箸, 神/紙) and produce both pitches. Use OJAD or Forvo to check.
- Minutes 10–13: Record yourself saying five sentences from your current JLPT level and compare with native audio. Focus on mora timing first, vowel length second, pitch third.
- Minutes 13–15: Free speak (monologue) for two minutes on any topic. This converts recognition into production.
The shadowing step is the highest-value part of this block. Native-speed audio contains rhythm and pitch information at the speed you will actually need to understand, and shadowing forces you to produce the same pattern. Most learners can expect audible improvement after about twenty consecutive days of this routine.
How to measure real pronunciation progress
Use objective measurements, not feelings.
Pronunciation improvement is gradual and easy to miss without tracking. Record yourself reading the same short passage monthly. After two months compare the recordings back to back. You will almost always hear improvements you forgot you made. Use these as the monthly checkpoints:
Monthly progress checks
- Minimal-pair recognition: can you hear the difference in 雨/飴, 橋/箸, 神/紙 at normal speaking speed? Test with Forvo or a native friend.
- Mora-timing consistency: clap through a 10-mora sentence. Do the claps land evenly, or do you rush the end?
- Long-vowel maintenance: read five long-vowel words and check the length matches the native recording.
- Shadowing lag: how far behind the speaker are you when you shadow? Aim for a lag of half a beat or less by month three.
- Free-speech fluidity: record a two-minute monologue monthly. Count pauses and restarts. Both should drop over time.
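The mora-timing check can be turned into a number if you log when each clap lands. The Python sketch below is one possible way to do it, not a standard metric: the name `clap_evenness` and the timestamps are invented for illustration, and it assumes you already have clap times in seconds (for example, from tapping a key while you speak). It reports the spread of the gaps between claps relative to their average, so 0.0 means perfectly even and larger values mean more rushing or dragging.

```python
from statistics import mean, pstdev


def clap_evenness(timestamps: list[float]) -> float:
    """Coefficient of variation of the gaps between claps (0.0 = perfectly even)."""
    if len(timestamps) < 3:
        raise ValueError("need at least three claps to compare gaps")
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return pstdev(gaps) / mean(gaps)


# Hypothetical clap times in seconds for a five-mora phrase.
even_claps = [0.0, 0.25, 0.5, 0.75, 1.0]
rushed_claps = [0.0, 0.25, 0.5, 0.65, 0.75]  # the last two moras are hurried

print(clap_evenness(even_claps))
print(clap_evenness(rushed_claps))
```

Tracking this value on the same sentence each month gives you a trend line to set beside your recordings.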
Frequently Asked Questions
The sounds that most often trip up English speakers are the Japanese R (a tap that is neither English R nor L), the F sound in ふ (a soft bilabial fricative, not the English F made with teeth and lip), long vs short vowels (おばさん vs おばあさん are different words), and the small っ that geminates the next consonant (きて vs きって). Pitch accent also causes trouble because English marks stress with a mix of loudness, length, and pitch, while Japanese marks accent with pitch alone; the two feel similar but have to be produced differently.
Make pronunciation practice part of your normal Japanese routine
Fifteen focused minutes a day alongside your JLPT study builds clearer sound, rhythm, and confidence together. Start with the level check below, then pick up the daily drill above.