Opened 8 years ago

Closed 8 years ago

#1348 closed defect (fixed)

infoSlicer not able to download new articles

Reported by: walter Owned by: walter
Priority: Unspecified by Maintainer Milestone: Unspecified
Component: InfoSlicer Version: Unspecified
Severity: Blocker Keywords:
Cc: jpichon Distribution/OS: Unspecified
Bug Status: New

Description

It seems that the MediaWiki_parse function in InfoSlicer is having trouble with the format of pages being retrieved from MediaWikis. It tries to split the downloaded content on the <text> field, but that field is not appearing in the pages. The result is a ValueError exception which falls through to the default exception handler, which notifies the user that there is a problem with the connection.
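As a hypothetical illustration of a more robust approach (not the actual InfoSlicer code): parsing the export XML with a real XML parser instead of string-splitting on "<text>" means namespace or attribute changes in the wiki's output no longer break retrieval. The function and sample below are assumptions for demonstration only.

```python
import xml.etree.ElementTree as ET

def extract_wikitext(export_xml):
    """Extract the article wikitext from a MediaWiki Special:Export
    XML response, tolerating namespaces and extra attributes on
    the <text> element (e.g. xml:space="preserve")."""
    root = ET.fromstring(export_xml)
    for elem in root.iter():
        # Tags come back namespace-qualified, e.g. '{...export-0.4/}text'.
        if elem.tag == 'text' or elem.tag.endswith('}text'):
            return elem.text or ''
    # Raise a specific error instead of letting a split() failure
    # fall through to a generic connection-error handler.
    raise ValueError('no <text> element in export XML')

# Minimal sample response in the export-0.4 schema (illustrative).
sample = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Example</title>
    <revision>
      <text xml:space="preserve">Hello ''world''</text>
    </revision>
  </page>
</mediawiki>"""

print(extract_wikitext(sample))  # prints: Hello ''world''
```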

More details, including logs, can be found in this thread:

http://lists.sugarlabs.org/archive/sugar-devel/2009-September/019262.html

Attachments (2)

0001-Fix-article-retrieval-1348.patch (978 bytes) - added by jpichon 8 years ago.
Fix article retrieval
0002-Fix-missing-headings-when-retrieving-article.patch (1.0 KB) - added by jpichon 8 years ago.
Missing headings

Download all attachments as: .zip

Change History (7)

Changed 8 years ago by jpichon

Fix article retrieval

Changed 8 years ago by jpichon

Missing headings

comment:1 Changed 8 years ago by jpichon

  • Cc jpichon added

I'm attaching a patch that fixes the article retrieval issue. I noticed afterwards that most headings were gone from articles on the English Wikipedia, and a few headings went missing on the other wikipedias as well; the second patch fixes this by treating more tags as having relevant content.
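A minimal sketch of the idea behind the second patch (the tag sets below are hypothetical, not copied from the patch): when walking the parsed HTML, heading tags must be counted as content-bearing, or their text is silently dropped.

```python
# Hypothetical tag sets; the actual patch extends InfoSlicer's own list.
CONTENT_TAGS = {'p', 'ul', 'ol', 'blockquote'}
HEADING_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

def has_relevant_content(tag_name):
    """Return True if a tag should be kept when extracting article text."""
    return tag_name in CONTENT_TAGS | HEADING_TAGS

print(has_relevant_content('h2'))   # prints: True
print(has_relevant_content('div'))  # prints: False
```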

There is still another cosmetic problem: a few blank lines appear at the top of newly downloaded articles. I haven't been able to fix that yet; I only know it is related to the pre_parse function in HTML_Parser.py. The only workaround I have for now is to reinitialise self.input with BeautifulSoup after calling pre_parse(), and I'm not sure that would be appropriate for a patch. I still hope to figure out what the real problem is.

comment:2 Changed 8 years ago by wadeb

An alternative approach to importing content from Wikipedia might be to borrow code from the WikiBrowse activity.

http://wiki.laptop.org/go/WikiBrowse

WikiBrowse uses the open-source mwlib package to convert MediaWiki markup to HTML dynamically. If you used that, you could download the raw markup from Wikipedia and not be vulnerable to them changing the HTML.

comment:3 Changed 8 years ago by walter

  • Bug Status changed from Unconfirmed to New

Thanks for tracking this down. I'll push the patches, and maybe there will be a window in which we can work on the mwlib port together.

comment:4 Changed 8 years ago by walter

  • Status changed from new to accepted

comment:5 Changed 8 years ago by walter

  • Resolution set to fixed
  • Status changed from accepted to closed

Pushed the patches to v6.

Note: See TracTickets for help on using tickets.