javascript - Character Encoding: â? - Stack Overflow

I am trying to piece together the mysterious string of characters â?? that I am seeing quite a bit of in our database. I am fairly sure this is the result of a conversion between character encodings, but I am not completely positive.

The users are able to enter text (or cut and paste) into an Ext-Js rich text editor. The data is posted to a servlet which persists it to the database, and when I view it in the database I see those strange characters...

  1. Is there any way to decode these back to their original meaning if I can discover the correct encoding, or have bits or bytes been lost through the conversion process?

  2. Users are cutting and pasting from multiple versions of MS Word and PDF. Does the encoding depend on where the user copied from?

Thank you


The website is UTF-8. We are using MS SQL Server 2005:

SELECT serverproperty('Collation') -- Server default collation. Latin1_General_CI_AS

SELECT databasepropertyex('xxxx', 'Collation') -- Database default SQL_Latin1_General_CP1_CI_AS

and the column:

Column_name Type    Computed    Length  Prec    Scale   Nullable    TrimTrailingBlanks  FixedLenNullInSource    Collation
text    varchar no  -1                  yes no  yes SQL_Latin1_General_CP1_CI_AS

The non-Unicode equivalents of the nchar, nvarchar, and ntext data types in SQL Server 2000 are char, varchar, and text. When Unicode data is inserted into one of these non-Unicode columns through a command string (otherwise known as a "language event"), SQL Server converts the data to the column's type using the code page associated with the column's collation. When a character cannot be represented on that code page, it is replaced by a question mark (?), indicating the data has been lost. The appearance of unexpected characters or question marks in your data indicates that it was converted from Unicode to non-Unicode at some layer, and the conversion lost characters.
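A minimal sketch of the conversion described above, using Python's codecs to stand in for SQL Server: windows-1252 (the code page behind SQL_Latin1_General_CP1 collations) covers many "smart" characters, but anything outside it is silently replaced.

```python
# Simulate inserting Unicode text into a windows-1252 varchar column:
# characters with no code-page equivalent become '?' and are lost.
original = "€ – … →"  # euro, en dash, ellipsis, rightwards arrow

# windows-1252 covers the first three characters, but not U+2192
stored = original.encode("cp1252", errors="replace").decode("cp1252")
print(stored)  # € – … ?   <- the arrow is now a literal question mark
```

Once a character has become `?`, the original is gone; no re-decoding can bring it back.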

So this may be the root cause of the problem... and not an easy one to solve on our end.


asked Dec 28, 2010 at 15:33 by akaphenom; edited Dec 28, 2010 at 17:26
  • Missing info that can be pretty relevant: DBMS, DB charset, web site charset, language of the information (English, French, Japanese...). – Álvaro González Commented Dec 28, 2010 at 16:01
  • One more test you can do: type –—‘’‚“”„†‡•…‰‹›€™ in Microsoft Word and try to find out at which point of the process it becomes corrupt. – Álvaro González Commented Dec 30, 2010 at 8:12

4 Answers


â is encoded as 0xE2 in ISO-8859-1 and windows-1252. 0xE2 is also a lead byte for a three-byte sequence in UTF-8. (Specifically, for the range U+2000 to U+2FFF, which includes the windows-1252 characters –—‘’‚“”„†‡•…‰‹›€™).

So it looks like you have text encoded in UTF-8 that's getting misinterpreted as being in windows-1252, and displays as a â followed by two unprintable characters.
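This misinterpretation is easy to reproduce. The sketch below encodes a right single quote (U+2019) as UTF-8 and reads the bytes back as windows-1252, producing the familiar â mojibake. Crucially for question 1: as long as no byte was replaced by ?, the damage is reversible.

```python
# U+2019 (right single quote) is E2 80 99 in UTF-8; read those bytes
# as windows-1252 and you get one visible 'â' plus two more characters.
smart_quote = "\u2019"                      # ’
utf8_bytes = smart_quote.encode("utf-8")    # b'\xe2\x80\x99'
garbled = utf8_bytes.decode("cp1252")       # 'â€™'
print(garbled)

# If all bytes survived, simply reverse the two steps to recover:
recovered = garbled.encode("cp1252").decode("utf-8")
assert recovered == smart_quote
```

If the column stored a literal `?` in place of one of the bytes, however, this round trip fails: the information is genuinely lost, not just mislabeled.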

This is something of an educated guess, but you're probably just seeing a naive conversion of Word/PDF documents to HTML (windows-1252 to UTF-8, most likely). If that's the case, probably two-thirds of the mysterious characters from Word documents are "smart quotes", and most of the rest come from Word's other "smart" editing features: ellipses, em dashes, etc. PDFs probably have similar features.

I would also guess that if the formatting after pasting into the ExtJS editor looks OK, then the encoding is getting passed along. Depending on the resulting use of the text, you may not need to convert.

If I'm on the right track, and we're not talking about internationalization issues, then I can add that there are Word-to-HTML converters out there, but I don't know the details of how they operate, and I had mixed success when evaluating them. There is almost certainly some small information loss or error involved with such converters, since they need to guess at the original source of the "smart" characters. In my case it was easier to just go back to the users and have them turn off the "smart" features.

The issue is clear: if the browser is good enough, a form in a web page can accept any Unicode character you can type or paste. If the character belongs to the HTML charset, it will be sent as is. If it doesn't, it'll get converted to an HTML entity. SQL Server will perform the appropriate conversion and silently corrupt your data when a character does not have an equivalent.

There's not much you can do to fully fix it, but you can make a workaround: let your servlet perform the conversion. This way you have full control over it. You can, for instance, compile a list of the most common non-Latin1 characters users paste (smart quotes, Unicode spaces...), which should be fairly easy to identify from context, and replace them with something better than ?. Or use a library that does this for you.
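A hypothetical servlet-side cleanup table along the lines suggested above (the character list is illustrative, not exhaustive): map the most common "smart" characters to plain Latin1-safe equivalents before the text reaches the database.

```python
# Map common "smart" characters to ASCII-safe replacements.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",  "\u2019": "'",   # smart single quotes
    "\u201c": '"',  "\u201d": '"',   # smart double quotes
    "\u2013": "-",  "\u2014": "--",  # en dash / em dash
    "\u2026": "...",                 # ellipsis
    "\u00a0": " ",                   # non-breaking space
})

def sanitize(text: str) -> str:
    """Replace smart characters so nothing degrades to '?' in storage."""
    return text.translate(SMART_TO_PLAIN)

print(sanitize("\u201cdon\u2019t\u201d \u2013 ok\u2026"))  # "don't" - ok...
```

This trades typographic fidelity for safety: the stored text is plainer, but nothing is silently destroyed by the code-page conversion.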

Or you can switch your DB to Unicode :)

You're storing Unicode data, which uses 2 bytes per character, in varchar columns, which use 1 byte per character. Any character that needs 2 bytes loses information when stored in the DB.

All you need to do is change the varchar column to nvarchar,
and then, of course, change the SQL parameters you're using in code.
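A sketch of that change, assuming a table named dbo.Content holding the varchar(max) column shown earlier (the table name is a placeholder; adjust to your schema). Length -1 in the column listing means varchar(max), so the Unicode equivalent is nvarchar(max).

```sql
-- Widen the column to Unicode; existing data is preserved as-is.
ALTER TABLE dbo.Content ALTER COLUMN [text] nvarchar(max);

-- Unicode literals need the N prefix, and parameters must be sent as
-- nvarchar from the application side, or the round trip stays lossy.
INSERT INTO dbo.Content ([text]) VALUES (N'“smart quotes” – dashes…');
```

Note this only prevents future loss; characters already flattened to ? cannot be recovered by the schema change.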
