I'm using a Chinese database in text form that stores records in this format:
Traditional Simplified [pin1 yin1] /English equivalent 1/equivalent 2/
I have attempted parsing it using delimiters (in Java).
This is what I have so far:
String delims = "[\\[\\]/]+";
String[] tokens = str.split(delims);
However, the English equivalent can itself contain delimiter characters, for example:
⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/
How would someone parse this string?
I am looking to get the following information from the string:
English Equivalent: variant of 龜|龟[gui1]
Use a regex to clean up the entire string.
String text = "⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/";
String pattern = "(\\S+)\\s*(\\S+)\\s*\\[(.+?)\\]\\s*/(.+?)/";
// Join the four captured groups with ';' so they can be split apart later.
text = text.replaceAll(pattern, "$1;$2;$3;$4");
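(\\S+) finds a continuous run of non-whitespace characters and captures it as a group.
\\s* finds a continuous run of whitespace.
\\[(.+?)\\] finds everything inside [ bla bla bla ].
The '?' makes the match lazy, so it takes the shortest possible match,
e.g. [ bla bla ] instead of [ bla bla] [ble ble ]
/(.+?)/ works just like the above, but finds everything inside / bla bla /,
e.g. variant of 龜|龟[gui1]
Here too the '?' makes the match take the shortest possible span.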
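To make the effect of the '?' concrete, here is a minimal sketch comparing the lazy and the greedy form on the bracket example. The class name LazyVsGreedy and the helper firstGroup are hypothetical names made up for this illustration; only java.util.regex from the standard library is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class LazyVsGreedy {
    // Returns the first capture group of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "[ bla bla ] [ble ble ]";
        // Lazy (.+?): stops at the first closing bracket.
        System.out.println("lazy:   '" + firstGroup("\\[(.+?)\\]", input) + "'");
        // Greedy (.+): runs on to the last closing bracket.
        System.out.println("greedy: '" + firstGroup("\\[(.+)\\]", input) + "'");
    }
}
```

The lazy form captures only the contents of the first bracket pair, while the greedy form swallows everything up to the last closing bracket.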
After the replacement, text contains:
⿔;⿔;gui1;variant of 龜|龟[gui1]
You can then use ; as the delimiter to separate the fields:
String[] tokens = text.split(";");
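If you would rather skip the intermediate ;-joined string, the capture groups can also be read directly with Pattern and Matcher. The following is only a sketch under a few assumptions: the class EntryParser and its methods are hypothetical names, the pattern is a variation on the one above (the last group is greedy and the whole line must match, so that entries with several equivalents such as /equivalent 1/equivalent 2/ are captured in full), and it assumes an individual equivalent never contains a / itself.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; all names here are illustrative only.
class EntryParser {
    // Traditional, Simplified, [pinyin], then everything between the first and last '/'.
    private static final Pattern ENTRY =
            Pattern.compile("(\\S+)\\s+(\\S+)\\s+\\[(.+?)\\]\\s+/(.+)/");

    // Returns {traditional, simplified, pinyin, equivalents}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = ENTRY.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    // Splits the equivalents field on '/' (assumes no equivalent contains a slash).
    static List<String> equivalents(String line) {
        String[] fields = parse(line);
        if (fields == null) {
            return Collections.emptyList();
        }
        return Arrays.asList(fields[3].split("/"));
    }

    public static void main(String[] args) {
        String[] fields = parse("⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/");
        System.out.println("Traditional: " + fields[0]);
        System.out.println("Simplified: " + fields[1]);
        System.out.println("Pinyin: " + fields[2]);
        System.out.println("English Equivalent: " + fields[3]);
    }
}
```

Reading the groups directly also avoids a problem with the replaceAll plus split(";") route: if an equivalent happened to contain a semicolon, the split would cut it in two.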