Opened 8 years ago

#3660 new defect

Wikipedia: Characters # and " in article titles break the data generation process

Reported by: godiard
Owned by: godiard
Priority: Unspecified by Maintainer
Milestone: Unspecified
Component: Wikipedia
Version: Unspecified
Severity: Unspecified
Keywords:
Cc:
Distribution/OS: Unspecified
Bug Status: Unconfirmed

Description

After running pages_parser.py, the .links file contains links with '"' and '#' characters, and after running make_selection.py these links are added to pages_selected-level-1.

The '"' character causes errors when inserting into the SQL database. The '#' character points to a section inside another article, so those links should be ignored.
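As a possible complement to filtering, the insert errors caused by '"' could be avoided by passing titles as query parameters instead of building the SQL string by hand. The sketch below assumes SQLite through Python's sqlite3 module; the ticket does not say which database engine or schema is used, so the `articles` table and the `insert_titles` function are hypothetical.

```python
import sqlite3


def insert_titles(conn, titles):
    """Insert titles with placeholders so quotes need no manual escaping."""
    # Hypothetical table; the real schema is not described in this ticket.
    conn.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT)")
    conn.executemany(
        "INSERT INTO articles (title) VALUES (?)",
        [(title,) for title in titles],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
insert_titles(conn, ['"Weird Al" Yankovic', "Python"])
rows = [row[0] for row in conn.execute("SELECT title FROM articles ORDER BY title")]
```

With placeholders the driver handles quoting, so a title containing '"' is stored unchanged.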

Some of the errors were fixed, and others were avoided by editing the pages_selected-level-1 file by hand, but these characters should be removed earlier in the process (probably in make_selection.py).
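A minimal sketch of the filtering proposed above, assuming the selection step handles titles as plain strings, one per entry. The function name `clean_titles` is hypothetical and is not taken from make_selection.py.

```python
def clean_titles(titles):
    """Drop section links and strip double quotes from article titles."""
    cleaned = []
    for title in titles:
        title = title.strip()
        # A '#' marks a link to a section inside another article: ignore it.
        if "#" in title:
            continue
        # Remove double quotes, which broke the SQL insert step.
        title = title.replace('"', "")
        if title:
            cleaned.append(title)
    return cleaned


selected = clean_titles(["Python", "Python#History", '"Weird Al" Yankovic', "#top"])
```

Stripping the quotes changes the stored title, so this only suits a pipeline where the cleaned form is used consistently for lookups as well.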

Change History (0)
