Module pywikipedia.saveHTML
This bot downloads the HTML pages of articles and images
and saves the interesting parts, i.e. the article text
and the footer, to a file such as Hauptseite.txt.
TODO:
change the paths in the HTML-file
Options:
-o: Specifies the output-directory where to save the files
-images: Download all images
-overwrite:[I|A|B] Ignore existing Images|Article|Both and
download them even if they exist
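
The options above could be parsed along the following lines. This is a minimal sketch in Python 3, not the module's actual code: the function name parse_options, the keys of the returned dictionary, the `-o:DIR` spelling of the output option, and the treatment of all remaining arguments as page titles are assumptions made for illustration.

```python
def parse_options(argv):
    """Split saveHTML-style arguments into an options dict and a list of page titles."""
    options = {
        "output_directory": "",       # set by -o:DIR (assumed spelling)
        "save_images": False,         # set by -images
        "overwrite_images": False,    # set by -overwrite:I or -overwrite:B
        "overwrite_articles": False,  # set by -overwrite:A or -overwrite:B
    }
    titles = []
    for arg in argv:
        if arg.startswith("-o:"):
            options["output_directory"] = arg[len("-o:"):]
        elif arg == "-images":
            options["save_images"] = True
        elif arg.startswith("-overwrite:"):
            mode = arg[len("-overwrite:"):].upper()
            if mode not in ("I", "A", "B"):
                raise ValueError("-overwrite expects I, A or B, got %r" % mode)
            options["overwrite_images"] = mode in ("I", "B")
            options["overwrite_articles"] = mode in ("A", "B")
        else:
            # Anything that is not a recognised option is taken as a page title.
            titles.append(arg)
    return options, titles
```

For example, parse_options(["-o:out", "-images", "-overwrite:B", "Hauptseite"]) would select the directory "out", enable image download, and mark both articles and images for re-download.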
Features, not bugs:
* Won't download the images of an article if you set -overwrite:A
Function Summary:
extractArticle(data)
    takes a string with the complete HTML-file and returns the article
    which is contained in <div id='article'> and the pagestats which
    contain information on last change
extractImages(data)
    takes a string with the complete HTML-file and returns the article
    which is contained in <div id='article'> and the pagestats which
    contain information on last change
html2txt(str)
main()
Imported modules:
httplib, md5, os, re, string, StringIO, sys, pywikipedia.wikipedia
Imported variables:
__version__, codepoint2name, entitydefs, name2codepoint
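
The names codepoint2name, entitydefs and name2codepoint match those exported by the Python 2 standard module htmlentitydefs (html.entities in Python 3), which suggests they are used for character-entity handling, possibly in html2txt. How the module actually uses them is not documented here. As a sketch only, name2codepoint can turn named entities into characters like this; the helper name replace_named_entities is hypothetical and not part of saveHTML.

```python
import re
from html.entities import name2codepoint  # htmlentitydefs in Python 2

# Matches named entities such as &amp; or &ouml;
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def replace_named_entities(text):
    """Replace known named HTML entities with their characters; leave unknown ones as-is."""
    def substitute(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)  # unknown entity: keep the original text
    return _ENTITY.sub(substitute, text)
```

Unknown entity names are left untouched rather than dropped, so no text is lost if a page uses an entity the table does not list.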
extractArticle(data)
takes a string with the complete HTML-file and returns the article
which is contained in <div id='article'> and the pagestats which
contain information on last change
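
The extraction described above can be sketched as follows. This is an illustration in Python 3 under stated assumptions, not the module's implementation: the helper names extract_div and extract_article_sketch are hypothetical, and the id "footer" for the element holding the pagestats is an assumption (only <div id='article'> is named in the description).

```python
import re

# Matches any opening or closing div tag; group 1 is "/" for a closing tag.
_DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)

def extract_div(data, div_id):
    """Return the inner HTML of the first <div> with the given id, or None."""
    opening = re.search(
        r"<div\b[^>]*\bid=(['\"])%s\1[^>]*>" % re.escape(div_id),
        data,
        re.IGNORECASE,
    )
    if opening is None:
        return None
    depth = 1  # we are inside the matched div
    for tag in _DIV_TAG.finditer(data, opening.end()):
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            return data[opening.end():tag.start()]
    return None  # markup is unbalanced

def extract_article_sketch(data, article_id="article", stats_id="footer"):
    """Return (article HTML, pagestats HTML); stats_id is an assumed element id."""
    return extract_div(data, article_id), extract_div(data, stats_id)
```

Counting nested div tags, rather than stopping at the first </div>, keeps the result intact when the article itself contains further div elements.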
extractImages(data)
takes a string with the complete HTML-file and returns the article
which is contained in <div id='article'> and the pagestats which
contain information on last change
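
The description above is identical to that of extractArticle, so what extractImages actually returns is not documented here. Purely as a sketch, and assuming from the function name and the -images option that it collects the images referenced by a page, the src attributes of img tags could be gathered like this in Python 3. The names ImageCollector and extract_image_sources are hypothetical.

```python
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, without duplicates."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src and src not in self.sources:
                self.sources.append(src)

def extract_image_sources(data):
    """Return the image URLs found in the HTML string, in document order."""
    collector = ImageCollector()
    collector.feed(data)
    collector.close()
    return collector.sources
```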