I'm using a Chinese text database that stores records in this format:

Traditional Simplified [pin1 yin1] /English equivalent 1/equivalent 2/

I have attempted parsing it using delimiters (in Java).

This is what I have so far:

String delims = "[\\[\\]/]+";
String tokens[] = str.split(delims);

However, the English equivalent can itself contain delimiter characters.

For example:

⿔ ⿔ [gui1] /variant of 龜龟[gui1]/

How would someone parse this String?

I want to extract the following information from the String:

Simplified: ⿔

Traditional: ⿔

Pinyin: gui1

English Equivalent: variant of 龜龟[gui1]

Use a regex to clean up the entire string.

String text = "⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/";

String pattern =    "(\\S+)\\s*(\\S+)\\s*\\[(.+?)\\]\\s*/(.+?)/";

text = text.replaceAll(pattern, "$1;$2;$3;$4");

(\\S+) --->
matches a run of non-whitespace characters

\\s* --->
matches a run of whitespace (possibly empty)

\\[(.+?)\\] ---> gui1
captures everything inside [ bla bla bla ].
The '?' makes the match lazy, so it captures as little as possible,
e.g. [ bla bla ] instead of [ bla bla ] [ ble ble ]

/(.+?)/ ---> variant of 龜|龟[gui1]
same as above, but captures everything inside / bla bla /.
Again, the '?' makes the match lazy.
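
To see the difference the '?' makes, here is a small sketch that runs the lazy and the greedy form of the bracket pattern on the same input. The class name LazyVsGreedy and the helper firstGroup are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedy {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "[bla bla] [ble ble]";
        // Lazy: stops at the first closing bracket.
        System.out.println(firstGroup("\\[(.+?)\\]", input)); // bla bla
        // Greedy: runs on to the last closing bracket.
        System.out.println(firstGroup("\\[(.+)\\]", input));  // bla bla] [ble ble
    }
}
```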


Now text becomes:
⿔;⿔;gui1;variant of 龜|龟[gui1]

You can then split on ; to separate the fields:

String tokens[] = text.split(";");
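
One caveat: the replace-then-split approach breaks if a definition itself contains a ;, and the lazy /(.+?)/ stops at the first closing slash, so only the first equivalent of a record like /equivalent 1/equivalent 2/ is captured. A possible alternative, sketched below, is to read the capture groups directly from a Matcher. This is only a sketch under some assumptions: the class name DictLine and the method parse are invented here, the fields are assumed to be separated by at least one whitespace character (\\s+ instead of \\s*), and the last group is made greedy so that it keeps everything between the outer slashes.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DictLine {
    // Same structure as the pattern above, but matched against the whole line and
    // with a greedy last group, so all definitions between the outer slashes are kept.
    private static final Pattern LINE =
            Pattern.compile("(\\S+)\\s+(\\S+)\\s+\\[(.+?)\\]\\s+/(.+)/");

    // Returns {traditional, simplified, pinyin, definitions}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] fields = parse("⿔ ⿔ [gui1] /variant of 龜|龟[gui1]/");
        System.out.println(Arrays.toString(fields));

        // Several definitions stay together in the last field; split on "/" to separate them.
        String[] defs = parse("Traditional Simplified [pin1 yin1] /equivalent 1/equivalent 2/")[3].split("/");
        System.out.println(Arrays.toString(defs));
    }
}
```

Because no separator character is inserted into the text, a ; or [ inside a definition does not interfere with the result.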