* numeric entities
@ 2010-12-06 7:33 Katsumi Yamaoka
2010-12-06 11:49 ` Julien Danjou
2010-12-06 14:44 ` Lars Magne Ingebrigtsen
0 siblings, 2 replies; 8+ messages in thread
From: Katsumi Yamaoka @ 2010-12-06 7:33 UTC
To: ding
When reading html articles, I sometimes see numeric entities like
"&#155;". Currently `shr' and `gnus-w3m' render it as "\233", but
it should be "›", i.e. &#8250; (U+203A). Here is a conversion table stolen
from emacs-w3m (#155 is there as #x9B):
(defvar mm-url-extra-numeric-entities
  (mapcar
   (lambda (item)
     (cons (car item) (mm-ucs-to-char (cdr item))))
   '((#x80 . #x20AC) (#x82 . #x201A) (#x83 . #x0192) (#x84 . #x201E)
     (#x85 . #x2026) (#x86 . #x2020) (#x87 . #x2021) (#x88 . #x02C6)
     (#x89 . #x2030) (#x8A . #x0160) (#x8B . #x2039) (#x8C . #x0152)
     (#x8E . #x017D) (#x91 . #x2018) (#x92 . #x2019) (#x93 . #x201C)
     (#x94 . #x201D) (#x95 . #x2022) (#x96 . #x2013) (#x97 . #x2014)
     (#x98 . #x02DC) (#x99 . #x2122) (#x9A . #x0161) (#x9B . #x203A)
     (#x9C . #x0153) (#x9E . #x017E) (#x9F . #x0178)))
  "*Alist of extra numeric entities and characters other than ISO 10646.")
I can implement it in mm-url.el, that is effective to `gnus-w3m',
but I hesitate to use it in `mm-shr' before calling
`libxml-parse-html-region'. WDYT? (IOW, isn't it better to make
`libxml-parse-html-region' do it by itself? It's too much for me
though.)
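
[Editor's sketch] To make the proposal concrete, here is a minimal,
self-contained illustration of how such a table could be applied to
decimal numeric entities in a string. It is not the code that went into
Gnus: `my-extra-numeric-entities' and `my-decode-numeric-entities' are
hypothetical names, the table is abridged, and the values are plain code
points rather than the result of `mm-ucs-to-char'.

```elisp
;; Sketch only: both names below are hypothetical, not part of Gnus.
(defvar my-extra-numeric-entities
  '((#x80 . #x20AC) (#x8B . #x2039) (#x99 . #x2122) (#x9B . #x203A))
  "Abridged alist mapping entity numbers to the intended code points.")

(defun my-decode-numeric-entities (string)
  "Return STRING with decimal entities such as \"&#155;\" decoded.
Numbers listed in `my-extra-numeric-entities' are remapped first;
all other numbers are taken as code points unchanged."
  (replace-regexp-in-string
   "&#\\([0-9]+\\);"
   (lambda (match)
     ;; The match data refers to MATCH, so group 1 is the digits.
     (let* ((code (string-to-number (match-string 1 match)))
            (mapped (cdr (assq code my-extra-numeric-entities))))
       (char-to-string (or mapped code))))
   string t t))
```

Under these assumptions, (my-decode-numeric-entities "a &#155; b")
yields the string with U+203A in place of the entity, while an entity
such as "&#65;" is decoded to "A" untouched by the table.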
* Re: numeric entities
2010-12-06 7:33 numeric entities Katsumi Yamaoka
@ 2010-12-06 11:49 ` Julien Danjou
2010-12-06 18:35 ` Andreas Schwab
2010-12-06 14:44 ` Lars Magne Ingebrigtsen
1 sibling, 1 reply; 8+ messages in thread
From: Julien Danjou @ 2010-12-06 11:49 UTC
To: Katsumi Yamaoka; +Cc: ding
On Mon, Dec 06 2010, Katsumi Yamaoka wrote:
> When reading html articles, I sometimes see numeric entities like
> "&#155;". Currently `shr' and `gnus-w3m' render it as "\233", but
> it should be "›", i.e. &#8250; (U+203A). Here is a conversion table stolen
> from emacs-w3m (#155 is there as #x9B):
I think you're wrong about the real problem. &#155; is \233 if the
document is UTF-8. The problem you may have is that your document is not
UTF-8 and therefore #155 should not be \233 but › in the encoding you expect.
--
Julien Danjou
// ᐰ <julien@danjou.info> http://julien.danjou.info
* Re: numeric entities
2010-12-06 11:49 ` Julien Danjou
@ 2010-12-06 18:35 ` Andreas Schwab
2010-12-07 0:06 ` Katsumi Yamaoka
0 siblings, 1 reply; 8+ messages in thread
From: Andreas Schwab @ 2010-12-06 18:35 UTC
To: Katsumi Yamaoka; +Cc: ding
Julien Danjou <julien@danjou.info> writes:
> I think you're wrong about the real problem. &#155; is \233 if the
> document is UTF-8. The problem you may have is that your document is not
> UTF-8 and therefore #155 should not be \233 but › in the encoding you expect.
The interpretation of numeric entities is independent of the document
encoding. HTML always uses ISO10646 as the document character set, thus
&#155; always refers to CONTROL SEQUENCE INTRODUCER (CSI).
Andreas.
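
[Editor's sketch] The distinction can be illustrated in Emacs Lisp. The
number in a numeric entity names a code point, so 155 is U+009B; the
character in the conversion table only appears when 155 is treated as a
byte in Windows-1252, which is the range the table covers. The helper
name `my-byte-as' is hypothetical, and the sketch assumes an Emacs that
provides the `windows-1252' coding system.

```elisp
;; Illustrative only; `my-byte-as' is a hypothetical helper name.
(defun my-byte-as (byte coding-system)
  "Return the code point obtained by decoding BYTE with CODING-SYSTEM."
  (aref (decode-coding-string (unibyte-string byte) coding-system) 0))

;; Read as an ISO 10646 / Latin-1 value, 155 stays the C1 control CSI.
(my-byte-as 155 'iso-8859-1)    ; code point #x9B

;; Read as a Windows-1252 byte, 155 is the character from the table.
(my-byte-as 155 'windows-1252)  ; code point #x203A
```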
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
* Re: numeric entities
2010-12-06 18:35 ` Andreas Schwab
@ 2010-12-07 0:06 ` Katsumi Yamaoka
2010-12-07 9:28 ` Julien Danjou
0 siblings, 1 reply; 8+ messages in thread
From: Katsumi Yamaoka @ 2010-12-07 0:06 UTC
To: ding
Andreas Schwab wrote:
> Julien Danjou <julien@danjou.info> writes:
>> I think you're wrong about the real problem. &#155; is \233 if the
>> document is UTF-8. The problem you may have is that your document is not
>> UTF-8 and therefore #155 should not be \233 but › in the encoding you expect.
> The interpretation of numeric entities is independent of the document
> encoding. HTML always uses ISO10646 as the document character set, thus
> &#155; always refers to CONTROL SEQUENCE INTRODUCER (CSI).
Please see how Firefox, IE, or others render this utf-8 page:
http://www.jpl.org/numentities.html
* Re: numeric entities
2010-12-06 7:33 numeric entities Katsumi Yamaoka
2010-12-06 11:49 ` Julien Danjou
@ 2010-12-06 14:44 ` Lars Magne Ingebrigtsen
2010-12-07 5:06 ` Katsumi Yamaoka
1 sibling, 1 reply; 8+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-12-06 14:44 UTC
To: ding
Katsumi Yamaoka <yamaoka@jpl.org> writes:
> When reading html articles, I sometimes see numeric entities like
> "&#155;". Currently `shr' and `gnus-w3m' render it as "\233", but
> it should be "›", i.e. &#8250; (U+203A). Here is a conversion table stolen
> from emacs-w3m (#155 is there as #x9B):
>
> (defvar mm-url-extra-numeric-entities
Isn't this mostly the same as `gnus-article-dumbquotes-map'? Looks
somewhat bigger, though. So perhaps that should be installed, and then
`article-treat-dumbquotes' could just use that map instead?
> I can implement it in mm-url.el, that is effective to `gnus-w3m',
> but I hesitate to use it in `mm-shr' before calling
> `libxml-parse-html-region'. WDYT? (IOW, isn't it better to make
> `libxml-parse-html-region' do it by itself? It's too much for me
> though.)
I think `libxml-parse-html-region' should just mainly parse what it's
given, for greater flexibility. But perhaps `mm-shr' and `gnus-w3m'
should just convert these automatically -- they never actually make much
sense.
--
(domestic pets only, the antidote for overdose, milk.)
larsi@gnus.org * Lars Magne Ingebrigtsen
* Re: numeric entities
2010-12-06 14:44 ` Lars Magne Ingebrigtsen
@ 2010-12-07 5:06 ` Katsumi Yamaoka
2010-12-16 17:42 ` Lars Magne Ingebrigtsen
0 siblings, 1 reply; 8+ messages in thread
From: Katsumi Yamaoka @ 2010-12-07 5:06 UTC
To: ding
Lars Magne Ingebrigtsen wrote:
> Katsumi Yamaoka <yamaoka@jpl.org> writes:
>> When reading html articles, I sometimes see numeric entities like
>> "&#155;". Currently `shr' and `gnus-w3m' render it as "\233", but
>> it should be "›", i.e. &#8250; (U+203A). Here is a conversion table stolen
>> from emacs-w3m (#155 is there as #x9B):
>>
>> (defvar mm-url-extra-numeric-entities
> Isn't this mostly the same as `gnus-article-dumbquotes-map'? Looks
> somewhat bigger, though. So perhaps that should be installed, and then
> `article-treat-dumbquotes' could just use that map instead?
`gnus-article-dumbquotes-map' uses only ASCII characters, so it
seems still helpful to people who use an old terminal emulator.
That is for normal text, not html, isn't it? So, if we make
`article-treat-dumbquotes' do "\200;"->"€" things, it may have
to be for only environments that support such non-ASCII characters.
OTOH, those who use `shr' or `gnus-w3m' will probably use a modern
terminal or Emacs' display engine.
>> I can implement it in mm-url.el, that is effective to `gnus-w3m',
>> but I hesitate to use it in `mm-shr' before calling
>> `libxml-parse-html-region'. WDYT? (IOW, isn't it better to make
>> `libxml-parse-html-region' do it by itself? It's too much for me
>> though.)
> I think `libxml-parse-html-region' should just mainly parse what it's
> given, for greater flexibility. But perhaps `mm-shr' and `gnus-w3m'
> should just convert these automatically -- they never actually make much
> sense.
Done for `mm-shr' and `gnus-w3m'. If it slows Gnus, making
`mm-extra-numeric-entities' a char-table may be better. Maybe so
is `mm-url-html-entities'.
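
[Editor's sketch] The char-table idea could look roughly like the
following. This is an assumption about one possible shape, not the code
in Gnus: `my-entity-alist', `my-entity-table' and `my-remap-code' are
hypothetical names and the table is abridged. A char-table gives indexed
lookup by code point instead of a linear `assq' scan over the alist.

```elisp
;; Sketch only: the names below are hypothetical, not variables in Gnus.
(defvar my-entity-alist
  '((#x80 . #x20AC) (#x8B . #x2039) (#x9B . #x203A))
  "Abridged alist from entity numbers to the intended code points.")

(defvar my-entity-table
  (let ((table (make-char-table 'my-entity-table)))
    ;; Entries default to nil; fill in only the remapped code points.
    (dolist (pair my-entity-alist)
      (aset table (car pair) (cdr pair)))
    table)
  "Char-table holding the same mapping as `my-entity-alist'.")

(defun my-remap-code (code)
  "Return the replacement for CODE, or CODE itself when none is listed."
  (or (aref my-entity-table code) code))
```

Whether this is measurably faster than the alist for a table of under
thirty entries would need profiling; the sketch only shows the shape of
the data structure.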
* Re: numeric entities
2010-12-07 5:06 ` Katsumi Yamaoka
@ 2010-12-16 17:42 ` Lars Magne Ingebrigtsen
0 siblings, 0 replies; 8+ messages in thread
From: Lars Magne Ingebrigtsen @ 2010-12-16 17:42 UTC
To: ding
Katsumi Yamaoka <yamaoka@jpl.org> writes:
> `gnus-article-dumbquotes-map' uses only ASCII characters, so it
> seems still helpful to people who use an old terminal emulator.
> That is for normal text, not html, isn't it? So, if we make
> `article-treat-dumbquotes' do "\200;"->"€" things, it may have
> to be for only environments that support such non-ASCII characters.
> OTOH, those who use `shr' or `gnus-w3m' will probably use a modern
> terminal or Emacs' display engine.
Yeah, that's true, so it may just be better to keep those separate.
--
(domestic pets only, the antidote for overdose, milk.)
larsi@gnus.org * Lars Magne Ingebrigtsen
Thread overview: 8+ messages
2010-12-06 7:33 numeric entities Katsumi Yamaoka
2010-12-06 11:49 ` Julien Danjou
2010-12-06 18:35 ` Andreas Schwab
2010-12-07 0:06 ` Katsumi Yamaoka
2010-12-07 9:28 ` Julien Danjou
2010-12-06 14:44 ` Lars Magne Ingebrigtsen
2010-12-07 5:06 ` Katsumi Yamaoka
2010-12-16 17:42 ` Lars Magne Ingebrigtsen