When storing data in MySQL with the UTF-8 charset, does it make sense to escape entity characters when the data is being inserted, or is it better to store it in raw form and convert it when pulling it out?

For example, let's say someone enters a bullet (•) character into a text box. When saving that data, should it be converted to &bull; before being inserted? Or would it make more sense to store it as a bullet, then convert it when pulling it out?

I guess I'm just not sure about the guidelines for storing non-ASCII data. Any ideas would be greatly appreciated.

If you use the UTF-8 charset for the whole application (i.e. MySQL, but also the encoding of the HTML pages, your scripts, your code, and so on), you don't need to transform "special characters" into entities: just send your text data as UTF-8 too ;-)

Keep data as-is. Perform any conversions needed for display at run time.
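
As a rough illustration of converting only at display time, here is a minimal Python sketch (the thread itself names no language). The function name render_comment and the sample text are made up for the example; html.escape is from the Python standard library. The stored value stays raw, and only the markup-significant characters are escaped on output, so the bullet passes through as UTF-8.

```python
import html


def render_comment(stored_text: str) -> str:
    """Wrap raw stored text in a paragraph, escaping it for HTML at output time.

    render_comment is a hypothetical helper, not part of any framework.
    """
    # html.escape converts &, <, > (and quotes) but leaves "•" untouched,
    # which is fine as long as the page itself is served as UTF-8.
    return "<p>" + html.escape(stored_text) + "</p>"


# The value as it would sit in the database: raw, unescaped UTF-8 text.
raw = "Tom & Jerry • <b>bold</b>"
rendered = render_comment(raw)
```

The same stored value could be sent unchanged to a non-HTML consumer (a CSV export, a JSON API), which is the point of not escaping before storage.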

If you store it as HTML (with entities), you create several problems:

  • You tie your data to the HTML format instead of plain "text content"
  • It distorts data sizes (e.g. the limit of a varchar(255) column, or the results of SQL string functions like substring() or reverse())
  • Searching for those characters becomes impossible without also converting the search input
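
The problems listed above are easy to reproduce outside the database. The Python sketch below is only an illustration, using Python string operations as a stand-in for the SQL functions; the variable names are invented for the example.

```python
import html

raw = "• item"                       # the text the user actually typed
stored_as_entities = "&bull; item"   # the same text if escaped before storage

# Size: one character becomes six, eating into any column length limit.
raw_length = len(raw)
entity_length = len(stored_as_entities)

# String functions: a prefix cut lands in the middle of the entity.
broken_prefix = stored_as_entities[:3]

# Searching: the literal bullet is no longer found in the stored value.
found_in_entities = "•" in stored_as_entities
found_in_raw = "•" in raw

# The original text is only recoverable by decoding the entities again.
round_trip = html.unescape(stored_as_entities)
```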

The purpose of escaping is to transmit data over a channel that doesn't allow certain characters. Since a UTF-8 database can handle UTF-8 characters just fine, you have no reason to escape anything for storage. In fact, since escaped text is harder to manipulate (string functions won't work correctly, for example), it is nearly always recommended not to perform unnecessary escaping.

Keep in mind that the database may host data for multiple applications.

In that environment, the definition of a string in the database is determined by the database, not by the application. Make your application conform to the data standards, and make the conversions explicit in your data layer.

For example, if the database is a newer schema and the DBA has specified that strings will be stored in UTF-8, then all strings passed from your application should be UTF-8.

If, on the other hand, the database is a legacy system and the target for the data is an 8-bit character set, then do the conversion in your application to the appropriate code page and/or fail when you encounter a non-conforming value.
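
For the legacy case, an explicit conversion in the data layer might look like the Python sketch below. The helper name to_legacy and the choice of Latin-1 as the target code page are assumptions for illustration; the real code page depends on the database in question.

```python
def to_legacy(text: str, codepage: str = "latin-1") -> bytes:
    """Encode text for an 8-bit column, failing loudly on non-conforming values.

    to_legacy is a hypothetical data-layer helper; "latin-1" is an assumed default.
    """
    try:
        # str.encode uses strict error handling by default, so any character
        # outside the code page raises instead of being silently replaced.
        return text.encode(codepage)
    except UnicodeEncodeError as exc:
        raise ValueError(f"value cannot be stored as {codepage}: {text!r}") from exc


ok = to_legacy("café")   # every character exists in Latin-1

try:
    to_legacy("• item")  # the bullet (U+2022) has no Latin-1 equivalent
    rejected = False
except ValueError:
    rejected = True
```

Failing here rather than substituting a placeholder keeps the decision about lossy data visible to the caller instead of hiding it in the storage layer.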

Most newer database schemas that interact with the web should standardise on UTF-8 or UTF-16. If you're building the database, start by localising it first and then, once you have decided on the internal string representations, force all the applications that talk to it to conform to your standards.