<?xml version="1.0" encoding="utf-8" ?>
<?xml-stylesheet href="http://rss.egloos.com/style/blog.xsl" type="text/xsl" media="screen"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
	<title>Building a Search Engine with Delphi</title>
	<link>http://wyb330.egloos.com</link>
	<description>A blog about developing a search engine with Delphi</description>
	<language>ko</language>
	<pubDate>Wed, 04 Nov 2009 03:25:12 GMT</pubDate>
	<generator>Egloos</generator>
	<image>
		<title>Building a Search Engine with Delphi</title>
		<url>http://pds4.egloos.com/logo/200704/12/77/b0010977.png</url>
		<link>http://wyb330.egloos.com</link>
		<width>80</width>
		<height>98</height>
		<description>A blog about developing a search engine with Delphi</description>
	</image>
  	<item>
		<title><![CDATA[ Starting to Study Java ]]> </title>
		<link>http://wyb330.egloos.com/4269077</link>
		<guid>http://wyb330.egloos.com/4269077</guid>
		<description>
			<![CDATA[ 
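  Because Delphi .Net has been discontinued, I have started studying Java with the idea of<br />
building the search service in Java from now on. Other Lucene-related projects (Solr, Hadoop) are<br />
also written in Java, so I had already felt the need to study it.<br />
This is the first time I have looked at Java again since a brief look in the 1990s, in Java's<br />
early days, when I considered writing a desktop application with it.<br />
I bought [Head First Java] because its layout looked very fresh, but it does not seem to suit me. <br />
A technical-manual style seems to fit me better :)<br />
Java and C# have a lot in common, so I should save a lot of time learning it.<br />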
<br />
			 ]]> 
		</description>

		<comments>http://wyb330.egloos.com/4269077#comments</comments>
		<pubDate>Wed, 04 Nov 2009 03:25:12 GMT</pubDate>
		<dc:creator>미노</dc:creator>
	</item>
	<item>
		<title><![CDATA[ Ah... Delphi .Net Discontinued... ]]> </title>
		<link>http://wyb330.egloos.com/4254854</link>
		<guid>http://wyb330.egloos.com/4254854</guid>
		<description>
			<![CDATA[ 
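  I only recently learned that Delphi .Net is no longer supported as of Delphi 2009.<br />
I had been building the indexing and search features with Delphi .Net in RAD2006,<br />
so what am I supposed to do now...<br />
The last version that supports Delphi .Net is 2007, so there are two options: stay on<br />
version 2007, or use Delphi Prism, which the latest version supports. Neither looks like<br />
a good option. Delphi Prism is a different language from Delphi and its future is<br />
uncertain, so it is unlikely to be a real alternative to Delphi .Net.<br />
<br />
I will probably have to drop Delphi and move to Java.<br />
Lucene was originally written in Java and its related projects are in Java too, so using Java<br />
has several advantages. The server can also run Linux instead of Windows, which should<br />
reduce server maintenance costs.<br />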
<br />
<br />
			 ]]> 
		</description>
		<category>Delphi</category>

		<comments>http://wyb330.egloos.com/4254854#comments</comments>
		<pubDate>Wed, 14 Oct 2009 04:13:35 GMT</pubDate>
		<dc:creator>미노</dc:creator>
	</item>
	<item>
		<title><![CDATA[ A Near-Duplicate Document Detection Program Using min hash ]]> </title>
		<link>http://wyb330.egloos.com/4252806</link>
		<guid>http://wyb330.egloos.com/4252806</guid>
		<description>
			<![CDATA[ 
  Previously I used I-Match as the near-duplicate detection algorithm; this time I built a<br />
near-duplicate document detection program using shingling and min hash.<br />
<br />
<div style="text-align:center"><img class="image_mid" border="0" onmouseover="this.style.cursor='pointer'" alt="" src="http://pds16.egloos.com/pds/200910/11/77/b0010977_4ad1612c36b50.png" width="500" height="383.076923077" onclick="Control.Modal.openDialog(this, event, 'http://pds16.egloos.com/pds/200910/11/77/b0010977_4ad1612c36b50.png');" /></div><br />
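As the figure above shows, detecting near-duplicates among about 5,500 source files took roughly 16 minutes.<br />
That result is with an in-memory hash table; with a disk-based hash it took more than 40 minutes.<br />
For large document collections there is no choice but to go disk-based, and disk I/O then inevitably<br />
makes the run much longer. I will have to optimize the disk-based hash algorithm to cut the time. <br />
Because min hash is probabilistic, it has the drawback that repeated runs give different results, but the<br />
near-duplicate detection results are reasonably good.<br />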
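<br />
The shingling and min hash steps can be sketched in Python as follows. This is a minimal illustration, not the program described in this post: the 5-character shingles, the 64 hash functions, the MD5-based base hash, and the fixed seed are all assumptions made for the example.<br />
<br />
```python
import hashlib
import random

PRIME = 2**61 - 1  # a large Mersenne prime used as the hash modulus


def shingles(text, k=5):
    """Return the set of k-character shingles of a whitespace-normalized text."""
    text = " ".join(text.split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def base_hash(shingle):
    """Map a shingle to a stable 64-bit integer (Python's hash() varies per run)."""
    return int.from_bytes(hashlib.md5(shingle.encode("utf-8")).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes=64, seed=1):
    """Return one minimum per random hash function h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)  # a fixed seed makes repeated runs give the same result
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]
    values = [base_hash(s) for s in shingle_set]
    return [min((a * v + b) % PRIME for v in values) for a, b in coeffs]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = "procedure TForm1.Button1Click(Sender: TObject); begin ShowMessage('hello'); end;"
doc_b = "procedure TForm1.Button1Click(Sender: TObject); begin ShowMessage('hello!'); end;"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
print(estimated_similarity(sig_a, sig_b))
```
<br />
With a fixed seed the signatures are repeatable, so the run-to-run variation mentioned above would only appear if the hash functions were drawn differently each time.<br />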
<br />
<br />
<br />
			 ]]> 
		</description>
		<category>Search Engine</category>

		<comments>http://wyb330.egloos.com/4252806#comments</comments>
		<pubDate>Sun, 11 Oct 2009 04:45:15 GMT</pubDate>
		<dc:creator>미노</dc:creator>
	</item>
	<item>
		<title><![CDATA[ Shortcomings of StandardAnalyzer ]]> </title>
		<link>http://wyb330.egloos.com/4229705</link>
		<guid>http://wyb330.egloos.com/4229705</guid>
		<description>
			<![CDATA[ 
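  StandardAnalyzer is probably the most commonly used analyzer in Lucene.<br />
However, StandardTokenizer accepts only a few symbols as part of a token, so<br />
C++ and C# are both tokenized as C. It also fails to recognize a URL such as<br />
http://www.example.com as a single token.<br />
So, starting from the <a target="_blank" href="http://wyb330.egloos.com/3048784">StandardTokenizer</a> I wrote earlier, I built a tokenizer<br />
that meets these requirements. I also made it recognize some tokens that begin with a symbol.<br />
In exchange, I dropped support for a few of the NUM types that StandardTokenizer<br />
recognizes, because of the implementation complexity.<br />
<br />
To apply this to the index, everything indexed so far has to be reindexed, so <br />
indexing will probably take a long time.<br />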
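<br />
As a rough illustration of the two requirements above (keeping C++ and C# distinct from C, and keeping a URL as one token), here is a small regex-based sketch in Python. It is not the Delphi tokenizer from this post and it ignores most of what StandardTokenizer does; the pattern, the lowercasing, and the trailing-punctuation rule are assumptions made for the example.<br />
<br />
```python
import re

TOKEN_RE = re.compile(
    r"""
      https?://\S+              # a URL is kept as a single token
    | [A-Za-z]\w*(?:\+\+|\#)    # names ending in ++ or a hash sign, such as C++ and C#
    | \w+                       # ordinary words and numbers
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens, keeping URLs and names like C++ / C# intact."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        if "://" in token:
            token = token.rstrip(".,;:)")  # drop punctuation that trails the URL
        else:
            token = token.lower()
        tokens.append(token)
    return tokens


print(tokenize("C++ and C# are not C, see http://www.example.com."))
```
<br />
The alternatives are tried in order, so the URL and the ++ / # forms win over the plain-word rule at the same position.<br />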
<br />
<br />
<br />
<br />
			 ]]> 
		</description>
		<category>Search Engine</category>

		<comments>http://wyb330.egloos.com/4229705#comments</comments>
		<pubDate>Tue, 08 Sep 2009 05:40:26 GMT</pubDate>
		<dc:creator>미노</dc:creator>
	</item>
	<item>
		<title><![CDATA[ Registering a Search Site in the Browser with OpenSearch ]]> </title>
		<link>http://wyb330.egloos.com/4224419</link>
		<guid>http://wyb330.egloos.com/4224419</guid>
		<description>
			<![CDATA[ 
  To the right of the Firefox address bar there is a search box,<br />
and a standard called <a target="_blank" href="http://www.opensearch.org/">OpenSearch</a> lets you register your own search site there.<br />
First, write an XML file in the OpenSearch format.<br />
For example, for the search site I run, I wrote the following.<br />
<br />
&lt;?xml version="1.0" encoding="UTF-8" ?&gt;<br />
&lt;OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/"&gt;<br />
&lt;ShortName&gt;DevSearch&lt;/ShortName&gt;<br />
&lt;Description&gt;A search engine for developers&lt;/Description&gt;<br />
&lt;InputEncoding&gt;UTF-8&lt;/InputEncoding&gt;<br />
&lt;Image width="16" height="16"&gt;http://www.devsearch.co.kr/images/favicon.ico&lt;/Image&gt;<br />
&lt;Url type="text/html" method="GET" template="http://www.devsearch.co.kr/SearchCategory.aspx"&gt;<br />
&nbsp; &lt;Param name="query" value="{searchTerms}"/&gt;<br />
&lt;/Url&gt;<br />
&lt;SearchForm&gt;http://www.devsearch.co.kr/&lt;/SearchForm&gt;<br />
&lt;/OpenSearchDescription&gt;<br />
<br />
Next, add a link tag like the one below inside the head element of the main page.<br />
(This assumes the XML above was saved as OpenSearch.xml.)<br />
<br />
&lt;link rel="search" type="application/opensearchdescription+xml" title="DevSearch" href="OpenSearch.xml" /&gt;<br />
<br />
Then, when a user visits the site, an option to add the site appears in the browser's search engine drop-down menu,<br />
and once it is added, the user can search the site directly from the browser.<br />
<br />
<div style="text-align:center"><img class="image_mid" border="0" onmouseover="this.style.cursor='pointer'" alt="" src="http://pds16.egloos.com/pds/200909/01/77/b0010977_4a9cbdd1a3b03.png" width="170" height="227" onclick="Control.Modal.openDialog(this, event, 'http://pds16.egloos.com/pds/200909/01/77/b0010977_4a9cbdd1a3b03.png');" /></div><br />
OpenSearch is supported not only by Firefox but also by IE. Besides OpenSearch, Firefox has its own<br />
format called MozSearch, but IE does not support it, so the OpenSearch format is the better choice.<br />
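<br />
To make the role of {searchTerms} concrete, the following Python sketch builds the URL that would be requested for a query. It assumes the browser sends each Param entry as a query-string parameter on a GET request; build_search_url is a hypothetical helper written for this example, not part of any OpenSearch library.<br />
<br />
```python
from urllib.parse import urlencode


def build_search_url(template, params, search_terms):
    """Fill {searchTerms} into each parameter value and append them to the template URL."""
    query = {name: value.replace("{searchTerms}", search_terms)
             for name, value in params.items()}
    return template + "?" + urlencode(query)  # urlencode percent-encodes the values


url = build_search_url(
    "http://www.devsearch.co.kr/SearchCategory.aspx",
    {"query": "{searchTerms}"},
    "delphi lucene",
)
print(url)
```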
<br />
<br />
			 ]]> 
		</description>

		<comments>http://wyb330.egloos.com/4224419#comments</comments>
		<pubDate>Tue, 01 Sep 2009 06:26:29 GMT</pubDate>
		<dc:creator>미노</dc:creator>
	</item>
	<item>
		<title><![CDATA[ Building an RSS Collection Robot ]]> </title>
		<link>http://wyb330.egloos.com/4202899</link>
		<guid>http://wyb330.egloos.com/4202899</guid>
		<description>
			<![CDATA[ 
  I had already built an RSS collection robot in Delphi and have been using it,<br />
but I decided to try building one in Python as well.<br />
In Python, a library called FeedParser provides everything needed for RSS collection,<br />
so collecting RSS is much easier.<br />
FeedParser is available at http://www.feedparser.org/.<br />
<br />
<br />
import os, time, urllib2<br />
import feedparser<br />
import logfile<br />
<br />
class RSSRobot:<br />
&nbsp;&nbsp;&nbsp; def __init__(self, rssfile, savedir):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; self.rssfile = rssfile<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; self.savedir = savedir<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #assert os.path.exists(savedir), "save directory is not exists!"<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if not os.path.exists(savedir):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; os.makedirs(savedir)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; self.err = open(os.path.join(self.savedir, "error.log"), "a+")<br />
<br />
&nbsp;&nbsp;&nbsp; def __del__(self):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; self.err.close()<br />
<br />
&nbsp;&nbsp;&nbsp; # When an exception occurs, record the link that caused it in the log.<br />
&nbsp;&nbsp;&nbsp; def __addError(self, msg):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; self.err.write("[%s]%s\n" %(time.strftime("%Y-%m-%d %H:%M:%S"), msg))<br />
<br />
&nbsp;&nbsp;&nbsp; # Read the list of feed URLs from the file.<br />
&nbsp;&nbsp;&nbsp; def __getFeeds(self):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; f = open(self.rssfile)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; feeds = f.readlines()<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; f.close()<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return feeds<br />
<br />
&nbsp;&nbsp;&nbsp; # Return the feed item's information as a dictionary.<br />
&nbsp;&nbsp;&nbsp; def __feedItem(self, item):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; article = {}<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; article["link"] = item.link<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if item.has_key("created"): article["created"] = time.strftime("%Y-%m-%d %H:%M:%S", item.created_parsed)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; elif item.has_key("published"): article["created"] = time.strftime("%Y-%m-%d %H:%M:%S", item.published_parsed)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else: article["created"] = ""<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if item.has_key("updated") and (item.updated_parsed != None):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; article["updated"] = time.strftime("%Y-%m-%d %H:%M:%S", item.updated_parsed)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; article["title"] = item.title<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if item.has_key("summary"): article["summary"] = item.summary<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else: article["summary"] = ""<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return article<br />
<br />
&nbsp;&nbsp;&nbsp; def __readFeeds(self):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; articles = []<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; feedlist = self.__getFeeds()<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; count = 1<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urls = logfile.LogFile(os.path.join(self.savedir, "CrawedUrls.log"))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for feed in feedlist:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; url = feed.rstrip()<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print(u"Parsing feed... %s(%d / %d)" %(url, count, len(feedlist)))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; try:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; f = feedparser.parse(url)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; except:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; self.__addError("feed parsing error - %s" %url)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for e in f.entries:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; article = self.__feedItem(e)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # A URL collected earlier is not collected again; stop at the first known entry.<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if urls.isExists(article["link"]): break<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; articles.append(article)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # Record the collected URL.<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urls.add(article["link"])<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; count += 1<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return articles<br />
<br />
&nbsp;&nbsp;&nbsp; # Fetch the content at the URL.<br />
&nbsp;&nbsp;&nbsp; def __getURLContent(self, url):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; time.sleep(self.delay)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; try:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; content = urllib2.urlopen(url).read()<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return content<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; except:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; self.__addError("__getURLContent error - %s" %url)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return ""<br />
<br />
&nbsp;&nbsp;&nbsp; ## Save the feed item's information to a file.<br />
&nbsp;&nbsp;&nbsp; def __saveItem(self, item, filename):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; f = open(filename, "w")<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; try:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; f.write('&lt;?xml version="1.0" encoding="UTF-8"?&gt;\n')<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; f.write("&lt;Document&gt;\n")<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; f.write("&lt;Document.URL&gt; %s &lt;/Document.URL&gt;\n" %(item["link"]))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; f.write("&lt;Document.Date&gt; %s &lt;/Document.Date&gt;\n" %(item["created"]))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; f.write("&lt;Document.Title&gt; %s &lt;/Document.Title&gt;\n" %(self.__encode(item["title"])))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; f.write("&lt;Document.Summary&gt; %s &lt;/Document.Summary&gt;\n" %(self.__encode(item["summary"])))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; content = self.__getURLContent(item["link"])<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; f.write("&lt;Document.Contents&gt; %s &lt;/Document.Contents&gt;\n" %(content))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; f.write("&lt;/Document&gt;")<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; except:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print(item["link"])<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; self.__addError(item["link"])<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; f.close()<br />
<br />
&nbsp;&nbsp;&nbsp; def __encode(self, text):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return text.encode('utf8')<br />
<br />
&nbsp;&nbsp;&nbsp; ## Create the directory (yyyymmdd\hhmm) in which the RSS documents are saved.<br />
&nbsp;&nbsp;&nbsp; def __makeDir(self, dir):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if not os.path.exists(dir):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; os.makedirs(dir)<br />
<br />
&nbsp;&nbsp;&nbsp; def execute(self, delay=1):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; self.delay = delay<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print(u"Collection started... %s" %time.strftime("%Y-%m-%d %H:%M:%S"))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; t1 = time.time()<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; articles = self.__readFeeds()<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; count = 0<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; savepath = "%s\\%s" %(self.savedir, time.strftime("%Y%m%d\\%H%M"))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; self.__makeDir(savepath)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for article in articles:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; count += 1<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print("%s (%d)" %(article["link"], count))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; filename = "%s\\%d.xml" %(savepath, count)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; self.__saveItem(article, filename)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print(u"Collection finished... %s" %time.strftime("%Y-%m-%d %H:%M:%S"))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; t2 = time.time()<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print(u"Elapsed time... %s seconds" %str(t2 - t1))<br />
<br />
<br />
if __name__ == "__main__":<br />
&nbsp;&nbsp;&nbsp; robot = RSSRobot("blog.txt", "f:\\document\\test")<br />
&nbsp;&nbsp;&nbsp; robot.execute()<br />
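<br />
The logfile module imported by the robot is not shown in this post. The sketch below is a hypothetical stand-in for its LogFile class, written for Python 3 and inferred only from the two calls the robot makes (isExists and add); the one-URL-per-line file format is an assumption.<br />
<br />
```python
import os


class LogFile:
    """Hypothetical stand-in for logfile.LogFile: remembers URLs already collected."""

    def __init__(self, path):
        self.path = path
        self.seen = set()
        # Load URLs recorded by earlier runs, one per line.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.seen = {line.strip() for line in f if line.strip()}

    def isExists(self, url):
        """Return True if the URL was recorded in this run or an earlier one."""
        return url in self.seen

    def add(self, url):
        """Record a URL in memory and append it to the log file."""
        if url in self.seen:
            return
        self.seen.add(url)
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(url + "\n")
```
<br />
Keeping the URLs in a set makes each lookup cheap, and appending on every add means a crash in the middle of a run loses nothing already recorded.<br />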
<br />
<br />
			 ]]> 
		</description>
		<category>Python</category>

		<comments>http://wyb330.egloos.com/4202899#comments</comments>
		<pubDate>Mon, 03 Aug 2009 05:08:56 GMT</pubDate>
		<dc:creator>미노</dc:creator>
	</item>
	<item>
		<title><![CDATA[ Converting a Lucene Index to a Database ]]> </title>
		<link>http://wyb330.egloos.com/4197109</link>
		<guid>http://wyb330.egloos.com/4197109</guid>
		<description>
			<![CDATA[ 
  For management reasons, there are times when I wish a Lucene index could be converted to a DB.<br />
So I wrote a module that uses PyLucene to convert a Lucene index into a DB.<br />
PyLucene must be installed before this module can be used.<br />
PyLucene is explained well in the book "파이썬 3 Programming".<br />
<br />
<br />
## Convert a Lucene index to a database<br />
<br />
import sqlite3, os.path<br />
<br />
os.environ['PATH'] = os.path.join(os.environ['JAVA_HOME'], r'jre\bin\client') + ';' + os.environ['PATH']<br />
<br />
import lucene<br />
<br />
# Get the list of fields in the index<br />
def index_fields(indexdir):<br />
&nbsp;&nbsp;&nbsp; reader = lucene.IndexReader.open(indexdir)<br />
&nbsp;&nbsp;&nbsp; fields = reader.getFieldNames(lucene.IndexReader.FieldOption.ALL)<br />
&nbsp;&nbsp;&nbsp; names = []<br />
&nbsp;&nbsp;&nbsp; for field in fields:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; names.append(field)<br />
&nbsp;&nbsp;&nbsp; reader.close()<br />
&nbsp;&nbsp;&nbsp; return names<br />
<br />
<br />
# Create the table from the list of index fields<br />
# If the table already exists in the DB, drop it and create it again<br />
def create_table(db, table, fields):<br />
&nbsp;&nbsp;&nbsp; con = sqlite3.connect(db)<br />
&nbsp;&nbsp;&nbsp; cur = con.cursor()<br />
&nbsp;&nbsp;&nbsp; s = ""<br />
&nbsp;&nbsp;&nbsp; for i in range(len(fields)):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if i == len(fields)-1:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; s = s + fields[i] + " text"<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; s = s + fields[i] + " text, "<br />
&nbsp;&nbsp;&nbsp; #sql = "CREATE TABLE {0} ({1});".format(table, s)<br />
&nbsp;&nbsp;&nbsp; sql = "CREATE TABLE %s (%s);" %(table, s)<br />
&nbsp;&nbsp;&nbsp; print sql<br />
&nbsp;&nbsp;&nbsp; if os.path.exists(db):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; try:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; cur.execute("DROP TABLE %s" %table)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; except:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pass<br />
&nbsp;&nbsp;&nbsp; cur.execute(sql)<br />
<br />
# INSERT statement (INSERT INTO table_name (field list) VALUES (placeholders);)<br />
def insert_sql(table, fields):<br />
&nbsp;&nbsp;&nbsp; s = ""<br />
&nbsp;&nbsp;&nbsp; for i in range(len(fields)):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if i == len(fields)-1:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; s = s + fields[i]<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; s = s + fields[i] + ","<br />
&nbsp;&nbsp;&nbsp; param = '?,' * (len(fields)-1) + "?"<br />
&nbsp;&nbsp;&nbsp; return "INSERT INTO %s (%s) VALUES ( %s );" %(table, s, param)<br />
<br />
# Return the field values of document doc as a list.<br />
def doc_values(doc, fields):<br />
&nbsp;&nbsp;&nbsp; values = []<br />
&nbsp;&nbsp;&nbsp; for i in range(len(fields)):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; value = doc.get(fields[i])<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; values.append(value)<br />
&nbsp;&nbsp;&nbsp; return values<br />
<br />
# Copy the index into the DB<br />
def export_index(indexdir, db, table):<br />
&nbsp;&nbsp;&nbsp; reader = lucene.IndexReader.open(indexdir)<br />
&nbsp;&nbsp;&nbsp; count = lucene.IndexReader.numDocs(reader);<br />
&nbsp;&nbsp;&nbsp; print count<br />
&nbsp;&nbsp;&nbsp; con = sqlite3.connect(db)<br />
&nbsp;&nbsp;&nbsp; # Autocommit instead of using a transaction (disabled below)<br />
&nbsp;&nbsp;&nbsp; #con.isolation_level = None<br />
&nbsp;&nbsp;&nbsp; cur = con.cursor()<br />
&nbsp;&nbsp;&nbsp; # Get the list of fields in the index.<br />
&nbsp;&nbsp;&nbsp; fields = index_fields(indexdir)<br />
&nbsp;&nbsp;&nbsp; sql = insert_sql(table, fields)<br />
&nbsp;&nbsp;&nbsp; print sql<br />
&nbsp;&nbsp;&nbsp; for i in range(count):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; doc = lucene.IndexReader.document(reader, i)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; cur.execute(sql, doc_values(doc, fields) )<br />
&nbsp;&nbsp;&nbsp; con.commit()<br />
&nbsp;&nbsp;&nbsp; reader.close()<br />
<br />
def select_table(db, table):<br />
&nbsp;&nbsp;&nbsp; con = sqlite3.connect(db)<br />
&nbsp;&nbsp;&nbsp; cur = con.cursor()<br />
&nbsp;&nbsp;&nbsp; cur.execute("SELECT COUNT(*) FROM %s;" %table)<br />
&nbsp;&nbsp;&nbsp; #cur.fetchall()<br />
&nbsp;&nbsp;&nbsp; #print cur.rowcount<br />
&nbsp;&nbsp;&nbsp; for row in cur:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print row<br />
<br />
# Return the last component of the path (c:\dir1\dir2 =&gt; dir2)<br />
def extract_path(path):<br />
&nbsp;&nbsp;&nbsp; if path.endswith('\\'):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; path = path[0:-1]<br />
&nbsp;&nbsp;&nbsp; idx = path.rfind('\\')<br />
&nbsp;&nbsp;&nbsp; if idx == -1: return ""<br />
&nbsp;&nbsp;&nbsp; else:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; s = path[idx+1:]<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return s<br />
<br />
# Convert a Lucene index to a database<br />
def index2db(indexdir, db):<br />
&nbsp;&nbsp;&nbsp; lucene.initVM(lucene.CLASSPATH)<br />
&nbsp;&nbsp;&nbsp; if not os.path.exists(indexdir):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print u"Index (%s) does not exist." %(indexdir)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return<br />
&nbsp;&nbsp;&nbsp; table_name = extract_path(indexdir)<br />
&nbsp;&nbsp;&nbsp; fields = index_fields(indexdir)<br />
&nbsp;&nbsp;&nbsp; create_table(db, table_name, fields)<br />
&nbsp;&nbsp;&nbsp; export_index(indexdir, db, table_name)<br />
<br />
if __name__ == "__main__":<br />
&nbsp;&nbsp;&nbsp; indexdir = "e:\\index\\web"<br />
&nbsp;&nbsp;&nbsp; db = "c:\\test.db"<br />
&nbsp;&nbsp;&nbsp; index2db(indexdir, db)<br />
&nbsp;&nbsp;&nbsp; select_table(db, extract_path(indexdir))<br />
<br />
<br />
For the DB I use SQLite3, which Python supports out of the box, and it is very fast.<br />
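<br />
The table-creation and insert steps can be tried without Lucene. The sketch below, written for Python 3, uses plain dictionaries as stand-ins for Lucene documents and an in-memory SQLite database; quote_ident and export_rows are hypothetical helpers for this example, and quoting the identifiers is an addition that the module above does not do.<br />
<br />
```python
import sqlite3


def quote_ident(name):
    """Quote a table or column name so unusual field names are still valid SQL."""
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(table, fields):
    """Build CREATE TABLE with one text column per index field."""
    cols = ", ".join(quote_ident(f) + " text" for f in fields)
    return "CREATE TABLE %s (%s);" % (quote_ident(table), cols)


def insert_sql(table, fields):
    """Build a parameterized INSERT with one placeholder per field."""
    cols = ", ".join(quote_ident(f) for f in fields)
    marks = ", ".join("?" for _ in fields)
    return "INSERT INTO %s (%s) VALUES (%s);" % (quote_ident(table), cols, marks)


def export_rows(con, table, fields, docs):
    """Recreate the table and copy every document into it in one transaction."""
    con.execute("DROP TABLE IF EXISTS %s" % quote_ident(table))
    con.execute(create_table_sql(table, fields))
    rows = [[doc.get(f) for f in fields] for doc in docs]
    con.executemany(insert_sql(table, fields), rows)
    con.commit()
    return len(rows)


con = sqlite3.connect(":memory:")
docs = [
    {"url": "http://example.com/1", "title": "first"},
    {"url": "http://example.com/2"},  # a missing field is stored as NULL
]
print(export_rows(con, "web", ["url", "title"], docs))
```
<br />
Committing once after all inserts, as the module above also does, is a large part of why SQLite feels fast here; committing after every row is typically much slower.<br />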
<br />
<br />
			 ]]> 
		</description>
		<category>Python</category>

		<comments>http://wyb330.egloos.com/4197109#comments</comments>
		<pubDate>Sun, 26 Jul 2009 05:54:42 GMT</pubDate>
		<dc:creator>미노</dc:creator>
	</item>
	<item>
		<title><![CDATA[ Index Layout for Unified Search ]]> </title>
		<link>http://wyb330.egloos.com/4183755</link>
		<guid>http://wyb330.egloos.com/4183755</guid>
		<description>
			<![CDATA[ 
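  Unified search turned out slower than I expected, so I analyzed the cause.<br />
Say, for example, that a unified search covers web documents, blogs, and news.<br />
If each search takes 0.5 seconds, running the three one after another takes 1.5 seconds of pure search time.<br />
To solve this speed problem, one would expect that running the searches in multiple threads<br />
would cut the time. But when I actually ran the searches with threads (asynchronously),<br />
the result was little different from searching sequentially.<br />
Analyzing the cause showed a bottleneck in disk I/O. All the indexes were kept on a single<br />
disk, so even with several threads there was only one disk resource, and that is<br />
where the bottleneck occurred. So I split the indexes across two disks and searched again,<br />
and the speed improved noticeably.<br />
In conclusion, physically distributing the indexes works in favor of unified search.<br />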
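<br />
The threaded approach described above can be sketched in Python like this. The searchers here are hypothetical stand-ins that only sleep, so they overlap perfectly; real searches that all read from one disk would contend for it, which is exactly the bottleneck described in this post.<br />
<br />
```python
import time
from concurrent.futures import ThreadPoolExecutor


def unified_search(searchers, query):
    """Run every searcher on the query in its own thread; return results by source."""
    with ThreadPoolExecutor(max_workers=max(1, len(searchers))) as pool:
        futures = {name: pool.submit(search, query) for name, search in searchers.items()}
        return {name: future.result() for name, future in futures.items()}


def make_fake_searcher(source, delay):
    """Hypothetical stand-in for a real index search; it only sleeps."""
    def search(query):
        time.sleep(delay)
        return ["%s result for %s" % (source, query)]
    return search


searchers = {
    "web": make_fake_searcher("web", 0.05),
    "blog": make_fake_searcher("blog", 0.05),
    "news": make_fake_searcher("news", 0.05),
}
start = time.time()
results = unified_search(searchers, "lucene")
print(results, "%.2f s" % (time.time() - start))
```
<br />
Threads only hide waiting time when the waits are independent, so placing each index on its own physical disk is what lets the three searches proceed at the same time.<br />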
<br />
			 ]]> 
		</description>
		<category>Search Engine</category>

		<comments>http://wyb330.egloos.com/4183755#comments</comments>
		<pubDate>Wed, 08 Jul 2009 05:52:49 GMT</pubDate>
		<dc:creator>미노</dc:creator>
	</item>
	<item>
		<title><![CDATA[ BITNAMI Is Quite Something ]]> </title>
		<link>http://wyb330.egloos.com/4127465</link>
		<guid>http://wyb330.egloos.com/4127465</guid>
		<description>
			<![CDATA[ 
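  I had been wanting to try running a wiki myself, but installation was tricky and I never<br />
managed to set one up properly. Then I found out that there is a program that bundles this kind of<br />
open source software into a single executable and installs it automatically. <a href="http://bitnami.org">BITNAMI</a> is the one.<br />
I downloaded the DokuWiki and Trac installers from there, ran them, and they installed with one click.<br />
Incredible! This is a real find.<br />
It also supports installing many other open source tools such as Drupal, phpBB, and MediaWiki, so if you<br />
have been struggling to install open source tools, go and take a look. Highly recommended.<br />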
<br />
<br />
<br />
			 ]]> 
		</description>

		<comments>http://wyb330.egloos.com/4127465#comments</comments>
		<pubDate>Wed, 29 Apr 2009 11:51:01 GMT</pubDate>
		<dc:creator>미노</dc:creator>
	</item>
	<item>
		<title><![CDATA[ Detecting a Document's Encoding ]]> </title>
		<link>http://wyb330.egloos.com/4125422</link>
		<guid>http://wyb330.egloos.com/4125422</guid>
		<description>
			<![CDATA[ 
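  Delphi 2009 supports Unicode, so a document's encoding can be determined with<br />
TEncoding.GetBufferEncoding. However, TEncoding.GetBufferEncoding decides on Unicode based on the BOM,<br />
so it does not recognize UTF-8 without a BOM. To solve this problem, I pieced together the<br />
Unicode-related parts of UniSynEdit and made a function that identifies a document's encoding.<br />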
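<br />
For comparison with the Delphi function in this post, the same idea (check for a BOM first, then look for valid UTF-8 multi-byte sequences in a sample) can be sketched in Python 3. This is not a port: it validates the whole sample with Python's strict decoder instead of stopping at the first valid sequence, and detect_utf8 and its sample_size parameter are names made up for this example.<br />
<br />
```python
import codecs


def detect_utf8(data, sample_size=0x4000):
    """Return (is_utf8, with_bom) for a bytes object, judged from its first bytes."""
    sample = data[:sample_size]
    if sample.startswith(codecs.BOM_UTF8):
        return True, True
    if sample.isascii():
        # Pure ASCII (or empty) gives no evidence for UTF-8 specifically.
        return False, False
    decoder = codecs.getincrementaldecoder("utf-8")("strict")
    try:
        # final=False tolerates a multi-byte character cut off at the sample boundary.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False, False
    return True, False


print(detect_utf8("한글 문서".encode("utf-8")))
print(detect_utf8("한글 문서".encode("euc-kr")))
```
<br />
As with the Delphi version, a result of True without a BOM is a strong hint rather than a proof, because other encodings can occasionally produce byte sequences that are also valid UTF-8.<br />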
<br />
// checks for a BOM in UTF-8 format or searches the first 16384 bytes ($4000, see MaxBufferSize) for<br />
// typical UTF-8 octet sequences<br />
function IsUTF8(Stream: TStream; out WithBOM: Boolean): Boolean;<br />
const<br />
&nbsp; MinimumCountOfUTF8Strings = 1;<br />
&nbsp; MaxBufferSize = $4000;<br />
var<br />
&nbsp; Buffer: array of Byte;<br />
&nbsp; BufferSize, i, FoundUTF8Strings: Integer;<br />
<br />
&nbsp; // 3 trailing bytes are the maximum in valid UTF-8 streams,<br />
&nbsp; // so a count of 4 trailing bytes is enough to detect invalid UTF-8 streams<br />
&nbsp; function CountOfTrailingBytes: Integer;<br />
&nbsp; begin<br />
&nbsp;&nbsp;&nbsp; Result := 0;<br />
&nbsp;&nbsp;&nbsp; inc(i);<br />
&nbsp;&nbsp;&nbsp; while (i &lt; BufferSize) and (Result &lt; 4) do<br />
&nbsp;&nbsp;&nbsp; begin<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if Buffer[i] in [$80..$BF] then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(Result)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Break;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(i);<br />
&nbsp;&nbsp;&nbsp; end;<br />
&nbsp; end;<br />
<br />
begin<br />
&nbsp; // if Stream is nil, let Delphi raise the exception, by accessing Stream,<br />
&nbsp; // to signal an invalid result<br />
<br />
&nbsp; // start analysis at actual Stream.Position<br />
&nbsp; BufferSize := Min(MaxBufferSize, Stream.Size - Stream.Position);<br />
<br />
&nbsp; // if no special characteristics are found it is not UTF-8<br />
&nbsp; Result := False;<br />
&nbsp; WithBOM := False;<br />
<br />
&nbsp; if BufferSize &gt; 0 then<br />
&nbsp; begin<br />
&nbsp;&nbsp;&nbsp; SetLength(Buffer, BufferSize);<br />
&nbsp;&nbsp;&nbsp; Stream.ReadBuffer(Buffer[0], BufferSize);<br />
&nbsp;&nbsp;&nbsp; Stream.Seek(-BufferSize, soFromCurrent);<br />
<br />
&nbsp;&nbsp;&nbsp; { first search for BOM }<br />
<br />
&nbsp;&nbsp;&nbsp; if (BufferSize &gt;= Length(UTF8BOM)) and CompareMem(@Buffer[0], @UTF8BOM[0], Length(UTF8BOM)) then<br />
&nbsp;&nbsp;&nbsp; begin<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WithBOM := True;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Result := True;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Exit;<br />
&nbsp;&nbsp;&nbsp; end;<br />
<br />
&nbsp;&nbsp;&nbsp; { If no BOM was found, check for leading/trailing byte sequences,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; which are uncommon in usual non UTF-8 encoded text.<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; NOTE: There is no 100% safe way to detect UTF-8 streams. The bigger<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MinimumCountOfUTF8Strings, the lower is the probability of<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a false positive. On the other hand, a big MinimumCountOfUTF8Strings<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; makes it unlikely to detect files with only little usage of non<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; US-ASCII chars, like usual in European languages. }<br />
<br />
&nbsp;&nbsp;&nbsp; FoundUTF8Strings := 0;<br />
&nbsp;&nbsp;&nbsp; i := 0;<br />
&nbsp;&nbsp;&nbsp; while i &lt; BufferSize do<br />
&nbsp;&nbsp;&nbsp; begin<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; case Buffer[i] of<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $00..$7F: // skip US-ASCII characters as they could belong to various charsets<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $C2..$DF:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if CountOfTrailingBytes = 1 then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(FoundUTF8Strings)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Break;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $E0:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; begin<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(i);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (i &lt; BufferSize) and (Buffer[i] in [$A0..$BF]) and (CountOfTrailingBytes = 1) then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(FoundUTF8Strings)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Break;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; end;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $E1..$EC, $EE..$EF:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if CountOfTrailingBytes = 2 then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(FoundUTF8Strings)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Break;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $ED:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; begin<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(i);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (i &lt; BufferSize) and (Buffer[i] in [$80..$9F]) and (CountOfTrailingBytes = 1) then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(FoundUTF8Strings)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Break;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; end;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $F0:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; begin<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(i);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (i &lt; BufferSize) and (Buffer[i] in [$90..$BF]) and (CountOfTrailingBytes = 2) then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(FoundUTF8Strings)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Break;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; end;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $F1..$F3:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if CountOfTrailingBytes = 3 then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(FoundUTF8Strings)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Break;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $F4:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; begin<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(i);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (i &lt; BufferSize) and (Buffer[i] in [$80..$8F]) and (CountOfTrailingBytes = 2) then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(FoundUTF8Strings)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Break;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; end;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $C0, $C1, $F5..$FF: // invalid UTF-8 bytes<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Break;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $80..$BF: // trailing bytes are consumed when handling leading bytes,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // any occurrence of "orphaned" trailing bytes is invalid UTF-8<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Break;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; end;<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if FoundUTF8Strings = MinimumCountOfUTF8Strings then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; begin<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Result := True;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Break;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; end;<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inc(i);<br />
&nbsp;&nbsp;&nbsp; end;<br />
&nbsp; end;<br />
end;<br />
<br />
<br />
&nbsp; function GetEncoding: TEncoding;<br />
&nbsp; var<br />
&nbsp;&nbsp;&nbsp; BytesRead: Integer;<br />
&nbsp;&nbsp;&nbsp; ByteOrderMask: array[0..5] of Byte; // 6-byte buffer; the longest BOM is 5 bytes (cf. Wikipedia)<br />
&nbsp;&nbsp;&nbsp; WithBOM: Boolean;<br />
&nbsp; begin<br />
&nbsp;&nbsp;&nbsp; Result:= TEncoding.Default;<br />
&nbsp;&nbsp;&nbsp; Stream.Position:= 0;<br />
&nbsp;&nbsp;&nbsp; BytesRead := Stream.Read(ByteOrderMask[0], SizeOf(ByteOrderMask));<br />
<br />
&nbsp;&nbsp;&nbsp; // UTF16 LSB = Unicode LSB/LE<br />
&nbsp;&nbsp;&nbsp; if (BytesRead &gt;= 2) and (ByteOrderMask[0] = UTF16BOMLE[0])<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; and (ByteOrderMask[1] = UTF16BOMLE[1]) then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Result:= TEncoding.Unicode;<br />
<br />
&nbsp;&nbsp;&nbsp; // UTF16 MSB = Unicode MSB/BE<br />
&nbsp;&nbsp;&nbsp; if (BytesRead &gt;= 2) and (ByteOrderMask[0] = UTF16BOMBE[0])<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; and (ByteOrderMask[1] = UTF16BOMBE[1]) then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Result:= TEncoding.BigEndianUnicode;<br />
<br />
&nbsp;&nbsp;&nbsp; // UTF8<br />
&nbsp;&nbsp;&nbsp; if (BytesRead &gt;= 3) and (ByteOrderMask[0] = UTF8BOM[0])<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; and (ByteOrderMask[1] = UTF8BOM[1]) and (ByteOrderMask[2] = UTF8BOM[2]) then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Result:= TEncoding.UTF8;<br />
<br />
&nbsp;&nbsp;&nbsp; // no BOM matched (Result is still the ANSI default): scan the content for UTF-8<br />
&nbsp;&nbsp;&nbsp; if Result = TEncoding.Default then<br />
&nbsp;&nbsp;&nbsp; begin<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if IsUTF8(Stream, WithBom) then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Result:= TEncoding.UTF8;<br />
&nbsp;&nbsp;&nbsp; end;<br />
&nbsp; end;<br />
<br />
<br />
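The GetEncoding function reports a document's encoding, but it does not support UTF-32. <br />
Neither Delphi 2009 nor UniSynEdit checks for UTF-32, probably because the format<br />
uses 4 bytes per character, so in practice hardly any documents are stored in it.<br />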
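<br />
If UTF-32 detection were ever needed, a BOM check could look roughly like the sketch below. <br />
DetectUTF32BOM and TUTF32Kind are hypothetical names that do not appear in the code above, and the result<br />
is a separate enumeration because this sketch assumes there is no TEncoding value for UTF-32.<br />
The UTF-32 LE BOM (FF FE 00 00) begins with the same two bytes as the UTF-16 LE BOM, so a check like this<br />
would have to run before the UTF-16 tests in GetEncoding.<br />
<br />
type<br />
&nbsp; TUTF32Kind = (u32None, u32LE, u32BE); // hypothetical result type<br />
<br />
// Hypothetical helper (not part of the original code): inspects the first bytes read from the stream.<br />
function DetectUTF32BOM(const Bom: array of Byte; BytesRead: Integer): TUTF32Kind;<br />
begin<br />
&nbsp; Result := u32None;<br />
&nbsp; // a UTF-32 BOM is exactly 4 bytes long<br />
&nbsp; if (BytesRead &lt; 4) or (Length(Bom) &lt; 4) then<br />
&nbsp;&nbsp;&nbsp; Exit;<br />
<br />
&nbsp; // FF FE 00 00 = UTF-32 little endian<br />
&nbsp; if (Bom[0] = $FF) and (Bom[1] = $FE) and (Bom[2] = $00) and (Bom[3] = $00) then<br />
&nbsp;&nbsp;&nbsp; Result := u32LE<br />
&nbsp; // 00 00 FE FF = UTF-32 big endian<br />
&nbsp; else if (Bom[0] = $00) and (Bom[1] = $00) and (Bom[2] = $FE) and (Bom[3] = $FF) then<br />
&nbsp;&nbsp;&nbsp; Result := u32BE;<br />
end;<br />
<br />
Inside GetEncoding it could be called as DetectUTF32BOM(ByteOrderMask, BytesRead) right after the Stream.Read call.<br />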
<br />
			 ]]> 
		</description>
		<category>Delphi</category>

		<comments>http://wyb330.egloos.com/4125422#comments</comments>
		<pubDate>Mon, 27 Apr 2009 02:24:41 GMT</pubDate>
		<dc:creator>미노</dc:creator>
	</item>
</channel>
</rss>
