pywikipedia.saveHTML

Module pywikipedia.saveHTML

This bot downloads the HTML pages of articles and images
and saves the interesting parts, i.e. the article text
and the footer, to a file such as Hauptseite.txt.
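
The saving step described above can be sketched roughly as follows. This is a minimal Python 3 illustration, not the module's actual code: the name `save_page`, its parameters, and the choice to separate article and footer with a newline are all assumptions made for the example.

```python
import os


def save_page(title, article, footer, out_dir, overwrite=False):
    """Write article and footer to <out_dir>/<title>.txt.

    Hypothetical helper: returns the path written, or None when the
    file already exists and overwrite is False.
    """
    path = os.path.join(out_dir, title + ".txt")
    if os.path.exists(path) and not overwrite:
        return None  # keep the existing file untouched
    with open(path, "w", encoding="utf-8") as f:
        f.write(article)
        f.write("\n")
        f.write(footer)
    return path
```

Calling `save_page("Hauptseite", article, footer, out_dir)` would produce a file named Hauptseite.txt in `out_dir`; titles containing path separators are not handled in this sketch.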

TODO:
   change the paths in the HTML-file


Options:

      -o:                Specifies the output directory in which to save the files

      -images:           Download all images
      -overwrite:[I|A|B] Ignore existing Images|Article|Both and
                         download them even if they exist


Features, not bugs:
* Won't download the images of an article if you set -overwrite:A
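
To make the option syntax concrete, here is a small Python 3 sketch of how arguments in this style could be parsed. It is an illustration only: `parse_options`, the returned dictionary keys, and the treatment of any other argument as a page title are assumptions, not the module's documented behaviour.

```python
def parse_options(args):
    """Parse saveHTML-style options from a list of argument strings.

    Hypothetical helper; the key names are invented for this example.
    """
    opts = {"out_dir": "", "images": False, "overwrite": set(), "titles": []}
    for arg in args:
        if arg.startswith("-o:"):
            opts["out_dir"] = arg[len("-o:"):]
        elif arg == "-images":
            opts["images"] = True
        elif arg.startswith("-overwrite:"):
            mode = arg[len("-overwrite:"):].upper()
            if mode not in ("I", "A", "B"):
                raise ValueError("unknown -overwrite mode: %r" % mode)
            # B (Both) is expanded to Images plus Article.
            opts["overwrite"] = {"I", "A"} if mode == "B" else {mode}
        else:
            # Assumption: anything else names a page to download.
            opts["titles"].append(arg)
    return opts
```

For example, `parse_options(["-o:out", "-images", "-overwrite:B", "Hauptseite"])` yields an output directory of `out`, image download enabled, and both overwrite modes set.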

Function Summary
	`extractArticle(data)` takes a string containing the complete HTML file and returns the article, which is contained in <div id='article'>, and the pagestats, which contain information on the last change
	`extractImages(data)` takes a string containing the complete HTML file and returns the article, which is contained in <div id='article'>, and the pagestats, which contain information on the last change
	`html2txt(str)`
	`main()`

Imported modules: httplib, md5, os, re, string, StringIO, sys, pywikipedia.wikipedia
Imported variables: __version__, codepoint2name, entitydefs, name2codepoint

Function Details

extractArticle(data)

Takes a string containing the complete HTML file and returns the article, which is contained in <div id='article'>, and the pagestats, which contain information on the last change.
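
One way to pull the contents of a <div> with a given id out of a complete HTML string is sketched below in Python 3, using the standard library's `html.parser`. This is not the module's implementation, which is not shown here; `extract_div` and `_DivGrabber` are hypothetical names, and the sketch covers only the article part because the markup holding the pagestats is not described.

```python
from html.parser import HTMLParser


class _DivGrabber(HTMLParser):
    """Collects the inner HTML of the first <div> with the given id."""

    def __init__(self, div_id):
        super().__init__(convert_charrefs=False)
        self.div_id = div_id
        self.depth = 0      # <div> nesting depth inside the target; 0 = outside
        self.parts = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if tag == "div" and dict(attrs).get("id") == self.div_id:
                self.depth = 1  # entered the target div
            return
        if tag == "div":
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0 or self.done:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # closing tag of the target div
                return
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth and not self.done:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth and not self.done:
            self.parts.append("&#%s;" % name)


def extract_div(data, div_id):
    """Return the inner HTML of <div id=div_id>, or None if it is absent."""
    grabber = _DivGrabber(div_id)
    grabber.feed(data)
    grabber.close()
    return "".join(grabber.parts) if grabber.done else None
```

Tracking the nesting depth rather than matching with a single regular expression keeps nested <div> elements inside the article intact. Comments inside the div are dropped in this sketch.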

extractImages(data)

Takes a string containing the complete HTML file and returns the article, which is contained in <div id='article'>, and the pagestats, which contain information on the last change.

Generated by Epydoc 2.1 on Sun Jul 03 17:07:33 2005

http://epydoc.sf.net