Opened 14 years ago
Closed 14 years ago
#1348 closed defect (fixed)
infoSlicer not able to download new articles
Reported by: | walter | Owned by: | walter |
---|---|---|---|
Priority: | Unspecified by Maintainer | Milestone: | Unspecified |
Component: | InfoSlicer | Version: | Unspecified |
Severity: | Blocker | Keywords: | |
Cc: | jpichon | Distribution/OS: | Unspecified |
Bug Status: | New | | |
Description
It seems that the MediaWiki_parse function in InfoSlicer is having trouble with the format of pages being retrieved from MediaWikis. It is looking to split the downloaded content on the <text> field, but that field is not appearing in the pages. The result is a ValueError exception which falls through to the default exception handler, which in turn notifies the user that there is a problem with the connection.
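A minimal sketch of the failure mode, assuming the parser splits the downloaded page on the <text> tag and unpacks the result (the function below is hypothetical, not the actual InfoSlicer code):

```python
# Hypothetical sketch: splitting the downloaded page on the <text> tag and
# unpacking two pieces raises ValueError when the tag is absent.
def split_on_text_tag(page):
    header, body = page.split('<text', 1)  # ValueError if '<text' never appears
    return body

split_on_text_tag('<page><revision></revision></page>')  # -> ValueError
```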
More details, including logs, can be found in this thread:
http://lists.sugarlabs.org/archive/sugar-devel/2009-September/019262.html
Attachments (2)
Change History (7)
Changed 14 years ago by jpichon
comment:1 Changed 14 years ago by jpichon
- Cc jpichon added
I'm attaching a patch that fixes the article retrieval issue. I noticed afterwards that most headings were gone from articles downloaded from the English Wikipedia, and a few headings went missing from the other wikipedias as well; the second patch fixes this by treating more tags as having relevant content.
There's still another, aesthetic problem: there are a few blank lines at the top of newly downloaded articles. I haven't been able to fix that yet; I only know that it's related to the pre_parse function in HTML_Parser.py. The only workaround I have for now is to reinitialise self.input with BeautifulSoup after calling pre_parse(), but I'm not sure that would be appropriate for a patch. I still hope to figure out what the real problem is.
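A rough sketch of what that workaround might look like, assuming self.input holds a BeautifulSoup tree inside HTML_Parser (only pre_parse, self.input and HTML_Parser.py are named in the comment above; the rest is illustrative):

```python
# Hypothetical illustration of the workaround described above; assumes
# self.input is a BeautifulSoup document and parse() is where it is used.
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3, as InfoSlicer used then

class HTML_Parser:
    def pre_parse(self):
        ...  # existing pre-parse step that leaves blank lines behind

    def parse(self):
        self.pre_parse()
        # Workaround: re-create the soup so the stray blank lines left at
        # the top of the document by pre_parse() are dropped.
        self.input = BeautifulSoup(self.input.prettify())
        ...  # continue with the normal parsing steps
```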
comment:2 Changed 14 years ago by wadeb
An alternative approach for importing content from Wikipedia might be to borrow code from the WikiBrowse activity.
http://wiki.laptop.org/go/WikiBrowse
WikiBrowse uses the open source mwlib package to convert MediaWiki markup to HTML dynamically. If you used that, you could download the raw markup from WP and not be vulnerable to changes in its HTML.
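As an illustration of that approach, a minimal sketch (not WikiBrowse or InfoSlicer code) of fetching raw wikitext through the standard MediaWiki API, which could then be handed to a markup renderer such as mwlib:

```python
# Illustrative only: fetch raw wikitext via the MediaWiki API instead of
# scraping the rendered HTML. Assumes the classic (formatversion=1) response
# layout, where the wikitext sits under the '*' key of the first revision.
import json
import urllib.parse
import urllib.request

def fetch_wikitext(title, api='https://en.wikipedia.org/w/api.php'):
    params = urllib.parse.urlencode({
        'action': 'query',
        'prop': 'revisions',
        'rvprop': 'content',
        'format': 'json',
        'titles': title,
    })
    request = urllib.request.Request(
        api + '?' + params,
        headers={'User-Agent': 'InfoSlicer-sketch/0.1'})
    with urllib.request.urlopen(request) as response:
        data = json.loads(response.read().decode('utf-8'))
    page = next(iter(data['query']['pages'].values()))
    return page['revisions'][0]['*']
```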
comment:3 Changed 14 years ago by walter
- Bug Status changed from Unconfirmed to New
Thanks for tracking this down. I'll push the patches, and maybe if there is a window we can work on the mwlib port together.
comment:4 Changed 14 years ago by walter
- Status changed from new to accepted
comment:5 Changed 14 years ago by walter
- Resolution set to fixed
- Status changed from accepted to closed
pushed the patch to v6
Fix article retrieval