#1348 closed defect (fixed)

infoSlicer not able to download new articles

Reported by:	walter	Owned by:	walter
Priority:	Unspecified by Maintainer	Milestone:	Unspecified
Component:	InfoSlicer	Version:	Unspecified
Severity:	Blocker	Keywords:
Cc:	jpichon	Distribution/OS:	Unspecified
Bug Status:	New

Description

It seems that the MediaWiki_parse function in InfoSlicer is having trouble with the format of pages retrieved from MediaWiki sites. It tries to split the downloaded content on the <text> field, but that field does not appear in the pages. The result is a ValueError exception that falls through to the default exception handler, which notifies the user that there is a problem with the connection.

More details, including log excerpts, can be found in this thread:
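
To illustrate the failure mode, here is a minimal sketch. It is not the actual InfoSlicer code: `extract_text` and `download_article` are hypothetical names, and the parsing is reduced to plain string splitting. It shows how unpacking the result of a split raises ValueError when the `<text` marker is absent, and how a dedicated handler could report a parse failure instead of a connection problem.

```python
def extract_text(page):
    """Return the body of the <text> element from an exported page.

    Hypothetical stand-in for the parsing step. If "<text" is absent,
    split() returns a single-item list and the unpacking raises ValueError.
    """
    _before, rest = page.split("<text", 1)
    body = rest.split(">", 1)[1]          # skip the attributes of <text ...>
    return body.split("</text>", 1)[0]    # keep everything up to the close tag


def download_article(page):
    """Parse a downloaded page and report failures by their actual cause."""
    try:
        return extract_text(page)
    except ValueError:
        # Parse failure: the page arrived but has an unexpected format.
        return "parse error: <text> field not found"
    except Exception:
        # Only here would a generic "connection problem" message be shown.
        return "connection problem"
```

With a handler like this, a page without a `<text>` field would produce a parse-related message rather than a misleading connection warning.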

http://lists.sugarlabs.org/archive/sugar-devel/2009-September/019262.html

Attachments (2)

0001-Fix-article-retrieval-1348.patch (978 bytes) - added by jpichon 15 years ago.: Fix article retrieval
0002-Fix-missing-headings-when-retrieving-article.patch (1.0 KB) - added by jpichon 15 years ago.: Missing headings


Change History (7)

Changed 15 years ago by jpichon

Attachment 0001-Fix-article-retrieval-1348.patch added

Fix article retrieval

Changed 15 years ago by jpichon

Attachment 0002-Fix-missing-headings-when-retrieving-article.patch added

Missing headings

comment:1 Changed 15 years ago by jpichon

Cc jpichon added

I'm attaching a patch that fixes the article retrieval issue. Afterwards I noticed that most headings were missing from articles from the English Wikipedia, and a few headings were missing in the other Wikipedias as well. The second patch fixes this by treating more tags as having relevant content.

There's still another aesthetic problem: a few blank lines appear at the top of newly downloaded articles. I haven't been able to fix that yet; I only know that it's related to the pre_parse function in HTML_Parser.py. The only workaround I have for now is to reinitialise self.input with BeautifulSoup after calling pre_parse(), and I'm not sure that would be appropriate for a patch. I still hope to figure out the real problem.
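
As a rough illustration of the idea behind the second patch (not the patch itself), the sketch below uses the standard library's `html.parser` rather than InfoSlicer's own parser. The class name `ContentExtractor`, the helper `extract`, and the tag sets are all assumptions made for this example. It shows how heading text is dropped when only some tags count as content, and kept once heading tags are added to the set.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect text found inside a configurable set of 'content' tags."""

    def __init__(self, content_tags):
        super().__init__()
        self.content_tags = set(content_tags)
        self.depth = 0      # how many content tags are currently open
        self.chunks = []    # text pieces kept, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.content_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.content_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside at least one content tag.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract(html, content_tags):
    """Return the text chunks found inside the given tags."""
    parser = ContentExtractor(content_tags)
    parser.feed(html)
    parser.close()
    return parser.chunks
```

For example, `extract(page, {"p"})` would skip an `<h2>` heading, while `extract(page, {"p", "h2"})` would keep it.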

comment:2 Changed 15 years ago by wadeb

An alternative for importing content from Wikipedia might be to steal code from the WikiBrowse activity.

http://wiki.laptop.org/go/WikiBrowse

WikiBrowse uses the open source mwlib package to convert MediaWiki markup to HTML dynamically. If you used that, you could download the raw markup from WP and not be vulnerable to them changing the HTML.
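
A minimal sketch of the "download the raw markup" half of this suggestion, assuming the wiki exposes the standard MediaWiki `index.php` entry point with `action=raw`. The function names and the example URL are illustrative, not part of InfoSlicer or WikiBrowse, and the mwlib conversion step is deliberately left out.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def raw_markup_url(index_php_url, title):
    """Build the URL that returns a page's wiki markup instead of HTML."""
    query = urlencode({"title": title, "action": "raw"})
    return f"{index_php_url}?{query}"


def fetch_raw_markup(index_php_url, title, timeout=10):
    """Download the raw markup for one page (performs a network request)."""
    with urlopen(raw_markup_url(index_php_url, title), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_raw_markup("https://en.wikipedia.org/w/index.php", "Sugar_Labs")` would return wiki markup, which could then be handed to a markup-to-HTML converter such as mwlib.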

comment:3 Changed 14 years ago by walter

Bug Status changed from Unconfirmed to New

Thanks for tracking this down. I'll push the patches, and maybe there is a window where we can work on the mwlib port together.

comment:4 Changed 14 years ago by walter

Status changed from new to accepted

comment:5 Changed 14 years ago by walter

Resolution set to fixed
Status changed from accepted to closed

pushed the patch to v6
