Gnus development mailing list
* numeric entities
@ 2010-12-06  7:33 Katsumi Yamaoka
  2010-12-06 11:49 ` Julien Danjou
  2010-12-06 14:44 ` Lars Magne Ingebrigtsen
  0 siblings, 2 replies; 8+ messages in thread
From: Katsumi Yamaoka @ 2010-12-06  7:33 UTC
  To: ding

When reading html articles, I sometimes see numeric entities like
"&#155;".  Currently `shr' and `gnus-w3m' render it as "\233", but
it should be "›", i.e. U+203A (decimal 8250).  Here is a conversion table stolen
from emacs-w3m (#155 is there as #x9B):

(defvar mm-url-extra-numeric-entities
  (mapcar
   (lambda (item)
     (cons (car item) (mm-ucs-to-char (cdr item))))
   '((#x80 . #x20AC) (#x82 . #x201A) (#x83 . #x0192) (#x84 . #x201E)
     (#x85 . #x2026) (#x86 . #x2020) (#x87 . #x2021) (#x88 . #x02C6)
     (#x89 . #x2030) (#x8A . #x0160) (#x8B . #x2039) (#x8C . #x0152)
     (#x8E . #x017D) (#x91 . #x2018) (#x92 . #x2019) (#x93 . #x201C)
     (#x94 . #x201D) (#x95 . #x2022) (#x96 . #x2013) (#x97 . #x2014)
     (#x98 . #x02DC) (#x99 . #x2122) (#x9A . #x0161) (#x9B . #x203A)
     (#x9C . #x0153) (#x9E . #x017E) (#x9F . #x0178)))
  "*Alist of extra numeric entities and characters other than ISO 10646.")

I can implement it in mm-url.el, which would take effect for `gnus-w3m',
but I hesitate to use it in `mm-shr' before calling
`libxml-parse-html-region'.  WDYT? (IOW, isn't it better to make
`libxml-parse-html-region' do it by itself?  It's too much for me
though.)
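
One way such a table might be applied to a buffer before rendering is sketched below. This is only an illustration, not the code that was installed: `my-extra-numeric-entities' and `my-decode-extra-numeric-entities' are hypothetical names, only three entries of the table are reproduced, and an Emacs with Unicode-based characters (23 or later) is assumed, so `char-to-string' stands in for `mm-ucs-to-char'.

```elisp
;; Hypothetical names; a three-entry excerpt of the table above.
(defvar my-extra-numeric-entities
  '((#x80 . #x20AC) (#x8C . #x0152) (#x9B . #x203A))
  "Alist mapping numeric references in 128-159 to Unicode code points.")

(defun my-decode-extra-numeric-entities ()
  "Replace decimal references listed in `my-extra-numeric-entities'.
Operates on the current buffer; other references are left alone."
  (save-excursion
    (goto-char (point-min))
    (while (re-search-forward "&#\\([0-9]+\\);" nil t)
      (let ((ucs (cdr (assq (string-to-number (match-string 1))
                            my-extra-numeric-entities))))
        (when ucs
          ;; FIXEDCASE and LITERAL are both t: insert the text verbatim.
          (replace-match (char-to-string ucs) t t))))))

;; Usage: the reference 155 is replaced, 65 is not in the table.
(with-temp-buffer
  (insert "a &#155; b &#65;")
  (my-decode-extra-numeric-entities)
  (buffer-string))
```

The sketch only handles decimal references; hexadecimal forms such as "&#x9B;" would need a second pattern.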




* Re: numeric entities
  2010-12-06  7:33 numeric entities Katsumi Yamaoka
@ 2010-12-06 11:49 ` Julien Danjou
  2010-12-06 18:35   ` Andreas Schwab
  2010-12-06 14:44 ` Lars Magne Ingebrigtsen
  1 sibling, 1 reply; 8+ messages in thread
From: Julien Danjou @ 2010-12-06 11:49 UTC
  To: Katsumi Yamaoka; +Cc: ding


On Mon, Dec 06 2010, Katsumi Yamaoka wrote:

> When reading html articles, I sometimes see numeric entities like
> "&#155;".  Currently `shr' and `gnus-w3m' render it as "\233", but
> it should be "›", i.e. U+203A (decimal 8250).  Here is a conversion table stolen
> from emacs-w3m (#155 is there as #x9B):

I think you're wrong about the real problem. &#155; is \233 if the
document is UTF-8. The problem you may have is that your document is not
UTF-8 and therefore #155 should not be \233 but › in the encoding you expect.

-- 
Julien Danjou
// ᐰ <julien@danjou.info>   http://julien.danjou.info



* Re: numeric entities
  2010-12-06  7:33 numeric entities Katsumi Yamaoka
  2010-12-06 11:49 ` Julien Danjou
@ 2010-12-06 14:44 ` Lars Magne Ingebrigtsen
  2010-12-07  5:06   ` Katsumi Yamaoka
  1 sibling, 1 reply; 8+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-12-06 14:44 UTC
  To: ding

Katsumi Yamaoka <yamaoka@jpl.org> writes:

> When reading html articles, I sometimes see numeric entities like
> "&#155;".  Currently `shr' and `gnus-w3m' render it as "\233", but
> it should be "›", i.e. U+203A (decimal 8250).  Here is a conversion table stolen
> from emacs-w3m (#155 is there as #x9B):
>
> (defvar mm-url-extra-numeric-entities

Isn't this mostly the same as `gnus-article-dumbquotes-map'?  Looks
somewhat bigger, though.  So perhaps that should be installed, and then
`article-treat-dumbquotes' could just use that map instead?

> I can implement it in mm-url.el, which would take effect for `gnus-w3m',
> but I hesitate to use it in `mm-shr' before calling
> `libxml-parse-html-region'.  WDYT? (IOW, isn't it better to make
> `libxml-parse-html-region' do it by itself?  It's too much for me
> though.)

I think `libxml-parse-html-region' should just mainly parse what it's
given, for greater flexibility.  But perhaps `mm-shr' and `gnus-w3m'
should just convert these automatically -- they never actually make much
sense. 

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen





* Re: numeric entities
  2010-12-06 11:49 ` Julien Danjou
@ 2010-12-06 18:35   ` Andreas Schwab
  2010-12-07  0:06     ` Katsumi Yamaoka
  0 siblings, 1 reply; 8+ messages in thread
From: Andreas Schwab @ 2010-12-06 18:35 UTC
  To: Katsumi Yamaoka; +Cc: ding

Julien Danjou <julien@danjou.info> writes:

> I think you're wrong about the real problem. &#155; is \233 if the
> document is UTF-8. The problem you may have is that your document is not
> UTF-8 and therefore #155 should not be \233 but › in the encoding you expect.

The interpretation of numeric entities is independent of the document
encoding.  HTML always uses ISO10646 as the document character set, thus
&#155 always refers to CONTROL SEQUENCE INTRODUCER (CSI).

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




* Re: numeric entities
  2010-12-06 18:35   ` Andreas Schwab
@ 2010-12-07  0:06     ` Katsumi Yamaoka
  2010-12-07  9:28       ` Julien Danjou
  0 siblings, 1 reply; 8+ messages in thread
From: Katsumi Yamaoka @ 2010-12-07  0:06 UTC
  To: ding

Andreas Schwab wrote:
> Julien Danjou <julien@danjou.info> writes:

>> I think you're wrong about the real problem. &#155; is \233 if the
>> document is UTF-8. The problem you may have is that your document is not
>> UTF-8 and therefore #155 should not be \233 but › in the encoding you expect.

> The interpretation of numeric entities is independent of the document
> encoding.  HTML always uses ISO10646 as the document character set, thus
> &#155 always refers to CONTROL SEQUENCE INTRODUCER (CSI).

Please see how Firefox, IE, or others render this utf-8 page:

http://www.jpl.org/numentities.html




* Re: numeric entities
  2010-12-06 14:44 ` Lars Magne Ingebrigtsen
@ 2010-12-07  5:06   ` Katsumi Yamaoka
  2010-12-16 17:42     ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 8+ messages in thread
From: Katsumi Yamaoka @ 2010-12-07  5:06 UTC
  To: ding

Lars Magne Ingebrigtsen wrote:
> Katsumi Yamaoka <yamaoka@jpl.org> writes:

>> When reading html articles, I sometimes see numeric entities like
>> "&#155;".  Currently `shr' and `gnus-w3m' render it as "\233", but
>> it should be "›", i.e. U+203A (decimal 8250).  Here is a conversion table stolen
>> from emacs-w3m (#155 is there as #x9B):
>>
>> (defvar mm-url-extra-numeric-entities

> Isn't this mostly the same as `gnus-article-dumbquotes-map'?  Looks
> somewhat bigger, though.  So perhaps that should be installed, and then
> `article-treat-dumbquotes' could just use that map instead?

`gnus-article-dumbquotes-map' uses only ASCII characters, so it
seems still helpful to people who use an old terminal emulator.
That is for normal text, not html, isn't it?  So, if we make
`article-treat-dumbquotes' do "\200;"->"€" things, it may have
to be for only environments that support such non-ASCII characters.
OTOH, those who use `shr' or `gnus-w3m' will probably use a modern
terminal or Emacs' display engine.

>> I can implement it in mm-url.el, which would take effect for `gnus-w3m',
>> but I hesitate to use it in `mm-shr' before calling
>> `libxml-parse-html-region'.  WDYT? (IOW, isn't it better to make
>> `libxml-parse-html-region' do it by itself?  It's too much for me
>> though.)

> I think `libxml-parse-html-region' should just mainly parse what it's
> given, for greater flexibility.  But perhaps `mm-shr' and `gnus-w3m'
> should just convert these automatically -- they never actually make much
> sense.

Done for `mm-shr' and `gnus-w3m'.  If it slows Gnus, making
`mm-extra-numeric-entities' a char-table may be better.  Maybe so
is `mm-url-html-entities'.
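
The char-table idea mentioned above could look roughly like the following. It is a sketch under assumptions, not the installed code: `my-extra-numeric-entity-table' and `my-extra-numeric-entity' are hypothetical names, only three entries are reproduced, and whether this is actually faster than `assq' on a 27-element alist has not been measured here.

```elisp
;; Hypothetical names; a three-entry excerpt of the conversion table.
(defvar my-extra-numeric-entity-table
  (let ((table (make-char-table 'my-extra-numeric-entities)))
    (dolist (pair '((#x80 . #x20AC) (#x8C . #x0152) (#x9B . #x203A)))
      ;; Index by the reference number, store the Unicode code point.
      (aset table (car pair) (cdr pair)))
    table)
  "Char-table mapping numeric references in 128-159 to code points.")

(defun my-extra-numeric-entity (code)
  "Return the replacement code point for CODE, or nil if there is none."
  ;; `characterp' guards against numbers outside the character range,
  ;; for which `aref' on a char-table would signal an error.
  (and (characterp code)
       (aref my-extra-numeric-entity-table code)))

;; Usage: 155 (#x9B) maps to #x203A; ?A has no entry and yields nil.
(list (my-extra-numeric-entity #x9B)
      (my-extra-numeric-entity ?A))
```

Unset slots of a char-table made this way default to nil, so a miss is cheap to detect without scanning a list.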




* Re: numeric entities
  2010-12-07  0:06     ` Katsumi Yamaoka
@ 2010-12-07  9:28       ` Julien Danjou
  0 siblings, 0 replies; 8+ messages in thread
From: Julien Danjou @ 2010-12-07  9:28 UTC
  To: Katsumi Yamaoka; +Cc: ding


On Tue, Dec 07 2010, Katsumi Yamaoka wrote:

> Please see how Firefox, IE, or others render this utf-8 page:
>
> http://www.jpl.org/numentities.html

I think there's a hack in the browser.

  http://www.fileformat.info/info/unicode/char/8c/index.htm

Shows the same thing, but the Unicode representation can be shown
directly.

See the spec for U+008C at:
    http://unicode.org/charts/PDF/U0080.pdf

Clearly, it's not Œ which is really 338 in Unicode.
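
The browser behaviour discussed here is consistent with reading a reference in the range 128-159 as a windows-1252 byte rather than as a Unicode code point, and the conversion table at the top of the thread contains the same mappings. A small sketch of that reading follows; `my-c1-reference-as-windows-1252' is a hypothetical helper, and it assumes an Emacs that provides `unibyte-string' and the `windows-1252' coding system.

```elisp
(defun my-c1-reference-as-windows-1252 (code)
  "Decode CODE as a single windows-1252 byte and return the character.
CODE is expected to be a reference number between 128 and 159."
  (aref (decode-coding-string (unibyte-string code) 'windows-1252) 0))

;; Usage: #x8C read as a windows-1252 byte is code point 338 (#x0152),
;; while #x8C read directly as a code point is a C1 control character.
(list (my-c1-reference-as-windows-1252 #x8C)
      (my-c1-reference-as-windows-1252 #x9B))
```

The sketch says nothing about bytes that windows-1252 leaves undefined (for example #x81); how those are decoded is not covered here.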

-- 
Julien Danjou
// ᐰ <julien@danjou.info>   http://julien.danjou.info



* Re: numeric entities
  2010-12-07  5:06   ` Katsumi Yamaoka
@ 2010-12-16 17:42     ` Lars Magne Ingebrigtsen
  0 siblings, 0 replies; 8+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-12-16 17:42 UTC
  To: ding

Katsumi Yamaoka <yamaoka@jpl.org> writes:

> `gnus-article-dumbquotes-map' uses only ASCII characters, so it
> seems still helpful to people who use an old terminal emulator.
> That is for normal text, not html, isn't it?  So, if we make
> `article-treat-dumbquotes' do "\200;"->"€" things, it may have
> to be for only environments that support such non-ASCII characters.
> OTOH, those who use `shr' or `gnus-w3m' will probably use a modern
> terminal or Emacs' display engine.

Yeah, that's true, so it may just be better to keep those separate.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen





Thread overview: 8+ messages
2010-12-06  7:33 numeric entities Katsumi Yamaoka
2010-12-06 11:49 ` Julien Danjou
2010-12-06 18:35   ` Andreas Schwab
2010-12-07  0:06     ` Katsumi Yamaoka
2010-12-07  9:28       ` Julien Danjou
2010-12-06 14:44 ` Lars Magne Ingebrigtsen
2010-12-07  5:06   ` Katsumi Yamaoka
2010-12-16 17:42     ` Lars Magne Ingebrigtsen