Using JavaScript to split a text string into word tokens, taking into account punctuation, whitespace and the UTF-8 charset
I got an interesting problem today. I had to check an HTML form before submission to see whether the text the user entered in a textarea contains certain words. Googling around, I found a lot of stuff like "how to split text separated by commas" and such, but I simply wanted to extract words from a paragraph like this one.
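My instinct was to use the String.split() function, but with a plain string separator it splits on just that one delimiter, so I would have to write a recursive or iterative function to split on all non-word characters. (It also accepts a regular expression, but then text that starts or ends with punctuation leaves empty strings in the result.) Since I cannot predict all the crap users can enter, this did not look like the right choice.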
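To see why split() is awkward here, the small sketch below compares a plain string separator with a regular-expression separator. The sample sentence and the variable names are made up for illustration.

```javascript
// Hypothetical sample input, not taken from the real form.
var sample = 'Hello, world! How are you?';

// A string separator splits on that one delimiter only,
// so punctuation stays glued to the words.
var bySpace = sample.split(' ');
// ["Hello,", "world!", "How", "are", "you?"]

// split() also accepts a regular expression, but text that ends
// (or starts) with a separator yields an empty string in the result.
var byNonWord = sample.split(/\W+/);
// ["Hello", "world", "How", "are", "you", ""]
```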
Luckily, I discovered String.match(), which takes a regular expression and, with the g flag, returns all the matches as an array of words, using something like this:
var arr = inputString.match(/\w+/g);
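Here is a minimal sketch of that call on a made-up sentence. One thing to watch: with the g flag, match() returns null rather than an empty array when nothing matches, so the result should be checked before looping over it.

```javascript
// Hypothetical input for illustration.
var inputString = 'Hello, world! It is 2011.';

// With the g flag, match() returns every run of word characters
// (letters, digits, underscore) as an array of strings.
var arr = inputString.match(/\w+/g);
// ["Hello", "world", "It", "is", "2011"]

// When there is no match at all, the result is null, not [].
var none = '?!...'.match(/\w+/g);

// Falling back to an empty array keeps later loops simple.
var words = '?!...'.match(/\w+/g) || [];
```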
Cool, eh? Now, this all went fine for ASCII English text. But I need to work with UTF-8 text, or more specifically, the Serbian language, and \w only matches ASCII letters, digits and the underscore. The Serbian Latin script used by my users has only 5 letters outside the ASCII set (Č, Ć, Đ, Š, Ž), so I wrote a small function that replaces those 5 with their closest ASCII matches. The final code looks like this:
var s = srb2lat(inputString.toUpperCase());
var a = s.match(/\w+/g); // null when there are no words at all
for (var i = 0; a && i < a.length; i++) {
    if (a[i] == 'SPECIAL')
        alert('Special word found!');
}

// Replace Serbian Latin letters with their closest ASCII matches.
// Expects upper-case input, since the rules only list capital letters.
function srb2lat(str) {
    var len = str.length;
    var res = '';
    var rules = { 'Đ':'DJ', 'Ž':'Z', 'Ć':'C', 'Č':'C', 'Š':'S' };
    for (var i = 0; i < len; i++) {
        var ch = str.substring(i, i+1);
        if (rules[ch])
            res += rules[ch];
        else
            res += ch;
    }
    return res;
}
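If you use some other language, just replace the contents of the rules object with different transliteration rules. Keep the keys in upper case, because the input is converted to upper case before the lookup.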
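On newer JavaScript engines there is a different approach (not the one used above): match Unicode letters directly instead of transliterating first. The sketch below assumes an engine that supports ES2018 Unicode property escapes (\p{L} and friends, with the u flag), which were not available when this was written. The sample sentence and the variable names are made up.

```javascript
// Hypothetical Serbian Latin sample text.
var text = 'Čeka đak šumu žutu, ćuti!';

// \w only covers A-Z, a-z, 0-9 and _, so words get cut at č, đ, š, ž, ć.
var asciiOnly = text.match(/\w+/g) || [];
// ["eka", "ak", "umu", "utu", "uti"]

// \p{L} matches any Unicode letter, \p{N} any digit, and \p{M} keeps
// combining marks attached to their base letter. Requires the u flag.
var unicodeWords = text.match(/[\p{L}\p{M}\p{N}_]+/gu) || [];
// ["Čeka", "đak", "šumu", "žutu", "ćuti"]

// Case-insensitive check for a specific word, comparing in upper case.
var found = unicodeWords.some(function (w) {
    return w.toUpperCase() === 'ŠUMU';
});
```

On engines without property escapes, the transliteration function above remains the simpler option.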
Milan Babuškov, 2011-12-01