Ticket #1348 (closed defect: fixed)

Opened 4 years ago

Last modified 4 years ago

infoSlicer not able to download new articles

Reported by: walter
Owned by: walter
Priority: Unspecified by Maintainer
Milestone: Unspecified by Release Team
Component: InfoSlicer
Version: Unspecified
Severity: Blocker
Keywords:
Cc: jpichon
Distribution/OS: Unspecified
Bug Status: New

Description

It seems that the MediaWiki_parse function in InfoSlicer has trouble with the format of pages retrieved from MediaWiki sites. It tries to split the downloaded content on the <text> field, but that field does not appear in the pages. The result is a ValueError exception that falls through to the default exception handler, which tells the user that there is a problem with the connection.

More details, including log excerpts, can be found in this thread:

 http://lists.sugarlabs.org/archive/sugar-devel/2009-September/019262.html
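
The failure mode described above can be illustrated with a minimal sketch. This is not InfoSlicer's actual MediaWiki_parse code: the helper name extract_text_field and the ArticleFormatError exception are hypothetical, and the sketch only assumes that the parser looks for a <text> element in the downloaded page. The point is that a missing marker raises ValueError, which can be turned into a format-specific error rather than being reported as a connection problem.

```python
class ArticleFormatError(Exception):
    """Raised when a downloaded page lacks the expected <text> element."""


def extract_text_field(page):
    """Return the content between <text ...> and </text>.

    Hypothetical helper for illustration; not the real MediaWiki_parse.
    """
    try:
        # str.index raises ValueError when the marker is absent,
        # which is the kind of failure described in this ticket.
        start = page.index("<text")
        start = page.index(">", start) + 1
        end = page.index("</text>", start)
    except ValueError:
        # Report a format problem instead of letting the error reach a
        # generic handler that blames the network connection.
        raise ArticleFormatError("no <text> element in downloaded page")
    return page[start:end]
```

A caller could then catch ArticleFormatError separately from genuine network errors and show the user a more accurate message.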

Attachments

0001-Fix-article-retrieval-1348.patch (1.0 KB) - added by jpichon 4 years ago.
Fix article retrieval
0002-Fix-missing-headings-when-retrieving-article.patch (1.0 KB) - added by jpichon 4 years ago.
Missing headings

Change History

Changed 4 years ago by jpichon

Fix article retrieval

Changed 4 years ago by jpichon

Missing headings

Changed 4 years ago by jpichon

  • cc jpichon added

I'm attaching a patch that fixes the article retrieval issue. I noticed afterwards that most headings were gone from articles from the English Wikipedia, and a few headings went missing in the other Wikipedias as well. The second patch fixes this by treating more tags as having relevant content.
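
As a rough illustration of what "treating more tags as having relevant content" can mean, here is a minimal sketch using Python's standard html.parser module. It is not the code from the attached patch: the CONTENT_TAGS set, the ContentExtractor class, and the extract function are all assumptions made for this example. Text is kept only when it sits inside one of the listed tags, so leaving heading tags out of the set makes headings disappear.

```python
from html.parser import HTMLParser

# Assumed tag set for illustration; the actual patch may use a different list.
CONTENT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li"}


class ContentExtractor(HTMLParser):
    """Collect text found inside tags treated as carrying article content."""

    def __init__(self, content_tags=CONTENT_TAGS):
        super().__init__()
        self.content_tags = content_tags
        self.depth = 0    # number of content tags currently open
        self.chunks = []  # collected text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.content_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.content_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-blank text only while inside a content tag.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract(html, content_tags=CONTENT_TAGS):
    """Return the text fragments found inside the given content tags."""
    parser = ContentExtractor(content_tags)
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling extract(html, {"p"}) on a page with headings drops the heading text, while the default set keeps it, which mirrors the symptom and the fix described above.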

There's still another, purely aesthetic, problem: a few blank lines appear at the top of newly downloaded articles. I haven't been able to fix that yet; I only know that it's related to the pre_parse function in HTML_Parser.py. The only workaround I have for now is to reinitialise self.input with BeautifulSoup after calling pre_parse(), and I'm not sure that would be appropriate for a patch. I still hope to figure out what the real problem is.

Changed 4 years ago by wadeb

An alternative for importing content from Wikipedia might be to borrow code from the WikiBrowse activity.

 http://wiki.laptop.org/go/WikiBrowse

WikiBrowse uses the open source mwlib package to convert MediaWiki markup to HTML dynamically. If you used that, you could download the raw markup from WP and not be vulnerable to them changing the HTML.
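
For the "download the raw markup" part of this suggestion, MediaWiki can serve a page's wikitext when index.php is called with action=raw. The sketch below only builds such a URL; the function name raw_markup_url is hypothetical, the default base path is the one used by the English Wikipedia and is an assumption for other wikis, and converting the markup with mwlib is deliberately left out.

```python
from urllib.parse import urlencode


def raw_markup_url(title, base="https://en.wikipedia.org/w/index.php"):
    """Build a URL that asks a MediaWiki site for a page's raw wikitext.

    Hypothetical helper; `base` is an assumption and differs between wikis.
    """
    query = urlencode({"title": title, "action": "raw"})
    return f"{base}?{query}"
```

Fetching that URL returns markup rather than rendered HTML, so changes to the site's HTML layout would not affect the import step.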

Changed 4 years ago by walter

  • status_field changed from Unconfirmed to New

Thanks for tracking this down. I'll push the patches, and maybe there is a window where we can work on the mwlib port together.

Changed 4 years ago by walter

  • status changed from new to accepted

Changed 4 years ago by walter

  • status changed from accepted to closed
  • resolution set to fixed

Pushed the patch to v6.
