
Module pywikipedia.saveHTML

This bot downloads the HTML pages of articles and images
and saves the interesting parts, i.e. the article text
and the footer, to a file like Hauptseite.txt.

TODO:
   change the paths in the HTML-file


Options:

      -o:                Specifies the output directory in which to save the files

      -images:           Download all images
      -overwrite:[I|A|B] Ignore existing Images|Article|Both and
                         download them even if they exist


Features, not bugs:
* Won't download the images of an article if you set -overwrite:A
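
The option syntax above could be parsed along the following lines. This is only a sketch, not the module's actual code: the function name parse_options, the dictionary keys, the assumption that the directory follows "-o:" directly, and the treatment of every other argument as a page title are all hypothetical.

```python
def parse_options(args):
    """Parse saveHTML-style options from a list of argument strings.

    Hypothetical helper: the names and the returned dict layout are
    illustrative and are not taken from pywikipedia.saveHTML.
    """
    options = {"output_dir": "", "images": False, "overwrite": "", "titles": []}
    for arg in args:
        if arg.startswith("-o:"):
            # Assumption: the directory is given right after the colon.
            options["output_dir"] = arg[len("-o:"):]
        elif arg == "-images":
            options["images"] = True
        elif arg.startswith("-overwrite:"):
            mode = arg[len("-overwrite:"):]
            # I = images, A = article, B = both, as listed under Options.
            if mode not in ("I", "A", "B"):
                raise ValueError("unknown -overwrite mode: %r" % mode)
            options["overwrite"] = mode
        else:
            # Assumption: anything else names a page to download.
            options["titles"].append(arg)
    return options


if __name__ == "__main__":
    print(parse_options(["-o:out", "-images", "-overwrite:B", "Hauptseite"]))
```

Rejecting unknown -overwrite modes early is a design choice of this sketch; the original module may handle them differently.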

Function Summary
  extractArticle(data)
Takes a string containing the complete HTML file and returns the article, which is contained in <div id='article'>, and the pagestats, which contain information on the last change.
  extractImages(data)
Takes a string containing the complete HTML file and returns the article, which is contained in <div id='article'>, and the pagestats, which contain information on the last change.
  html2txt(str)
  main()

Imported modules:
httplib, md5, os, re, string, StringIO, sys, pywikipedia.wikipedia
Imported variables:
__version__, codepoint2name, entitydefs, name2codepoint
Function Details

extractArticle(data)

Takes a string containing the complete HTML file and returns the article, which is contained in <div id='article'>, and the pagestats, which contain information on the last change.
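
The kind of extraction described here can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the helper extract_div and the wrapper extract_article_sketch are hypothetical names, and the assumption that the pagestats live in <div id='footer'> is a guess based on the module description mentioning the footer.

```python
import re


def extract_div(data, div_id):
    """Return the inner HTML of the first <div> with the given id, or None.

    Hypothetical helper. It counts nested <div> tags so that an inner
    </div> does not end the match too early.
    """
    pattern = r"<div\b[^>]*\bid=[\"']" + re.escape(div_id) + r"[\"'][^>]*>"
    open_tag = re.search(pattern, data)
    if open_tag is None:
        return None
    start = open_tag.end()
    depth = 1
    # Walk over every later opening or closing div tag, tracking depth.
    for tag in re.finditer(r"<(/?)div\b[^>]*>", data[start:]):
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            return data[start:start + tag.start()]
    return None  # the div was never closed


def extract_article_sketch(data):
    """Return (article, pagestats) from a complete HTML page.

    Assumption: the pagestats are in <div id='footer'>; the real page
    layout may differ.
    """
    return extract_div(data, "article"), extract_div(data, "footer")


if __name__ == "__main__":
    page = ("<html><div id='article'><p>Hi</p><div class='x'>in</div></div>"
            "<div id='footer'>Last change</div></html>")
    print(extract_article_sketch(page))
```

A regular-expression scan like this is fragile on malformed HTML; a real parser would be more robust, but the sketch keeps to the string-in, strings-out shape the description suggests.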

extractImages(data)

Takes a string containing the complete HTML file and returns the article, which is contained in <div id='article'>, and the pagestats, which contain information on the last change.

Generated by Epydoc 2.1 on Sun Jul 03 17:07:33 2005 http://epydoc.sf.net