How do I get the HTML source from a Delphi TWebBrowser and detect the encoding of its stream?

2026-04-10 17:17

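Based on this question: How can I get HTML source code from TWebBrowser

If I run the code from that question against an HTML page that uses a Unicode code page, the result is gibberish, because TStringStream is not Unicode in D7. The page may be UTF-8 encoded or use some other (ANSI) code page.

How can I detect whether a TStream / IPersistStreamInit is Unicode / UTF-8 / ANSI?

How do I always return the correct WideString result from this function?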

function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString;

If I replace TStringStream with TMemoryStream and save the TMemoryStream to a file, everything is fine. The file may be Unicode / UTF-8 / ANSI. But I always want to return the stream as a WideString:

function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString;
var
  // LStream: TStringStream;
  LStream: TMemoryStream;
  Stream : IStream;
  LPersistStreamInit : IPersistStreamInit;
begin
  if not Assigned(WebBrowser.Document) then exit;
  // LStream := TStringStream.Create('');
  LStream := TMemoryStream.Create;
  try
    LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
    Stream := TStreamAdapter.Create(LStream,soReference);
    LPersistStreamInit.Save(Stream,true);
    // result := LStream.DataString;
    LStream.SaveToFile('c:\test\test.txt'); // test only - file is ok
    Result := ??? // WideString
  finally
    LStream.Free();
  end;
end;

Edit: I found this article: How to load and save documents in TWebBrowser in a Delphi-like way

It does exactly what I need, but it only works with the Unicode Delphi compilers (Delphi 2009 and later). See the Conclusion section:

There is obviously a lot more we could do. A couple of things
immediately spring to mind. We retro-fit some of the Unicode
functionality and support for non-ANSI encodings to the pre-Unicode
compiler code. The present code when compiled with anything earlier
than Delphi 2009 will not save document content to strings correctly
if the document character set is not ANSI.

The magic is apparently in the TEncoding class (TEncoding.GetBufferEncoding). But D7 has no TEncoding. Any ideas?
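For readers on pre-Unicode compilers, the byte-order-mark part of what TEncoding.GetBufferEncoding does can be approximated by hand. The following is only a minimal sketch: the unit name BomSniff, the type TDetectedEncoding and the function DetectBOM are hypothetical names introduced for illustration, and it assumes the saved stream may start with a BOM, which is not guaranteed. A missing BOM does not prove the data is ANSI.

```pascal
unit BomSniff; // hypothetical unit name, for illustration only

interface

uses
  Classes;

type
  // deNoBOM means "no byte order mark found"; the data may still be
  // UTF-8 or any ANSI code page.
  TDetectedEncoding = (deNoBOM, deUtf8, deUtf16LE, deUtf16BE);

function DetectBOM(Stream: TStream; out BOMSize: Integer): TDetectedEncoding;

implementation

function DetectBOM(Stream: TStream; out BOMSize: Integer): TDetectedEncoding;
var
  Buf: array[0..2] of Byte;
  Count: Integer;
  SavedPos: Int64;
begin
  Result := deNoBOM;
  BOMSize := 0;
  SavedPos := Stream.Position;
  try
    // Look at the first (up to) three bytes of the stream.
    Stream.Position := 0;
    Count := Stream.Read(Buf, SizeOf(Buf));
    if (Count >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF) then
    begin
      Result := deUtf8;      // EF BB BF
      BOMSize := 3;
    end
    else if (Count >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then
    begin
      Result := deUtf16LE;   // FF FE
      BOMSize := 2;
    end
    else if (Count >= 2) and (Buf[0] = $FE) and (Buf[1] = $FF) then
    begin
      Result := deUtf16BE;   // FE FF
      BOMSize := 2;
    end;
  finally
    // Leave the caller's stream position untouched.
    Stream.Position := SavedPos;
  end;
end;

end.
```

Because many pages are served without a BOM, a check like this is best treated as a first hint, with the charset reported by the document itself as the more reliable source.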

I use GpTextStream to handle the conversion (it should work with all Delphi versions):

// uses Windows, SysUtils, Classes, ActiveX, SHDocVw, MSHTML, GpTextStream;

function GetCodePageFromHTMLCharSet(Charset: WideString): Word;
const
  WIN_CHARSET = 'windows-';
  ISO_CHARSET = 'iso-';
var
  S: string;
begin
  Result := 0;
  if Charset = 'unicode' then
    Result := CP_UNICODE
  else if Charset = 'utf-8' then
    Result := CP_UTF8
  else if Pos(WIN_CHARSET, Charset) <> 0 then
  begin
    S := Copy(Charset, Length(WIN_CHARSET) + 1, Maxint);
    Result := StrToIntDef(S, 0);
  end
  else if Pos(ISO_CHARSET, Charset) <> 0 then // ISO-8859 (e.g. iso-8859-1: => 28591)
  begin
    S := Copy(Charset, Length(ISO_CHARSET) + 1, Maxint);
    S := Copy(S, Pos('-', S) + 1, 2);
    if S = '15' then // ISO-8859-15 (Latin 9)
      Result := 28605
    else
      Result := StrToIntDef('2859' + S, 0);
  end;
end;

function GetWebBrowserHTML(WebBrowser: TWebBrowser): WideString;
var
  LStream: TMemoryStream;
  Stream: IStream;
  LPersistStreamInit: IPersistStreamInit;
  TextStream: TGpTextStream;
  Charset: WideString;
  Buf: WideString;
  CodePage: Word;
  N: Integer;
begin
  Result := '';
  if not Assigned(WebBrowser.Document) then Exit;
  LStream := TMemoryStream.Create;
  try
    LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
    Stream := TStreamAdapter.Create(LStream, soReference);
    if Failed(LPersistStreamInit.Save(Stream, True)) then Exit;
    Charset := (WebBrowser.Document as IHTMLDocument2).charset;
    CodePage := GetCodePageFromHTMLCharSet(Charset);
    N := LStream.Size;
    SetLength(Buf, N);
    TextStream := TGpTextStream.Create(LStream, tsaccRead, [], CodePage);
    try
      N := TextStream.Read(Buf[1], N * SizeOf(WideChar)) div SizeOf(WideChar);
      SetLength(Buf, N);
      Result := Buf;
    finally
      TextStream.Free;
    end;
  finally
    LStream.Free();
  end;
end;
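If adding GpTextStream as a dependency is not an option, the same conversion step can probably be done with the Win32 MultiByteToWideChar call alone. This is a sketch of that alternative, not part of the answer above: the unit name BytesToWide and the function MemoryToWideString are hypothetical, and it assumes the code page comes from a lookup such as GetCodePageFromHTMLCharSet, where 0 falls back to the system ANSI code page. UTF-16 (code page 1200) is handled separately because MultiByteToWideChar does not convert from it.

```pascal
unit BytesToWide; // hypothetical unit name, for illustration only

interface

uses
  Windows, Classes;

function MemoryToWideString(Stream: TMemoryStream; CodePage: Word): WideString;

implementation

function MemoryToWideString(Stream: TMemoryStream; CodePage: Word): WideString;
const
  CP_UTF16LE = 1200; // same value GpTextStream exposes as CP_UNICODE
var
  Size, Len: Integer;
begin
  Result := '';
  Size := Stream.Size;
  if Size = 0 then
    Exit;

  if CodePage = CP_UTF16LE then
  begin
    // The bytes are already UTF-16LE, so copy them over unchanged.
    SetLength(Result, Size div SizeOf(WideChar));
    if Length(Result) > 0 then
      Move(Stream.Memory^, PWideChar(Result)^, Length(Result) * SizeOf(WideChar));
    Exit;
  end;

  // First call: ask how many WideChars the converted text needs.
  Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(Stream.Memory), Size, nil, 0);
  if Len <= 0 then
    Exit; // unknown code page or invalid input: return an empty string

  // Second call: perform the actual conversion into the result buffer.
  SetLength(Result, Len);
  MultiByteToWideChar(CodePage, 0, PAnsiChar(Stream.Memory), Size,
    PWideChar(Result), Len);
end;

end.
```

With this helper, the TGpTextStream block in GetWebBrowserHTML could be reduced to `Result := MemoryToWideString(LStream, CodePage);`. One caveat: the sketch does not strip a byte order mark, so if the saved stream starts with one, it would most likely appear as a leading U+FEFF character in the result.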
