Opened 10 years ago

Closed 7 years ago

Last modified 6 years ago

#1568 closed defect (fixed)

Recover from broken key pair

Reported by: sayamindu Owned by: sascha_silbe
Priority: Unspecified by Maintainer Milestone:
Component: Sugar Version: Git as of bugdate
Severity: Major Keywords: olpc-test-passed
Cc: dsd, sascha_silbe Distribution/OS:
Bug Status: Unconfirmed

Description

We have had instances of damaged/unreadable ssh key files after a forced power off. It may make sense to call sync() to write these and everything else to the disk to ensure that the system does not remain in an unusable state (due to the damaged ssh key file, Sugar refused to start, sending the display manager in a loop).

Attachments (3)

dlo_9612.patch (940 bytes) - added by sayamindu 10 years ago.
Proposed patch
sugar-1568.patch (2.4 KB) - added by sascha_silbe 9 years ago.
recreate corrupted key pair
sugar-toolkit-1568.patch (1.2 KB) - added by sascha_silbe 9 years ago.
check private key syntax

Download all attachments as: .zip

Change History (25)

Changed 10 years ago by sayamindu

Proposed patch

comment:1 Changed 10 years ago by sayamindu

  • Keywords r? added

comment:2 Changed 10 years ago by dsd

I disagree with this approach. The system could still crash while writing, after writing but before sync, or during sync. Or the file could be corrupted for another reason (for example the system crashing while another file on the same eraseblock is being rewritten by the SD controller). So the bug will still be present, but just less common.

Moreover, this approach could lead to hundreds of sync() calls scattered throughout the codebase.

Sugar should simply be resilient against all corruption like this. If there is something totally fatal it should present an error message and not crash, but otherwise it should just clean it up, regenerating whatever is needed.

And if there is any data that would be a real pain to lose then it should be explicitly flushed on a per-file basis using fsync(), along with a fat comment explaining why it would be a huge pain to lose this. But sugar still has to be able to cope when this file is corrupt either way.

comment:3 Changed 10 years ago by dsd

  • Cc dsd added

comment:4 Changed 10 years ago by sayamindu

While I agree that calling sync() everywhere is not a nice idea, and fsync() is a better option (I have already proposed that we use fsync() at a couple of places in the datastore), there are certain areas which merit the use of a sync, because a large number of files may be in use, some of which are outside the scope of the sugar codebase (gconf databases, ssh keys, etc). For example, for the following log:

1257402816.022310 ERROR root: Keypair exists, skip generation.
1257402817.445231 ERROR dbus.proxies: Introspect error on org.laptop.Sugar.Presence:/org/laptop/Sugar/Presence: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.Spawn.ChildExited: Launch helper exited with unknown return code 1
1257402818.312790 WARNING sugar.presence.presenceservice: Unable to retrieve local user/owner
                   from presence service: org.freedesktop.DBus.Error.Spawn.ChildExited: Launch helper exited with unknown return code 1
Traceback (most recent call last):
  File "/usr/bin/sugar-session", line 173, in <module>
    main()
  File "/usr/bin/sugar-session", line 145, in main
    start_ui_service()
  File "/usr/bin/sugar-session", line 90, in start_ui_service
    ui_service.start()
  File "/usr/lib/python2.6/site-packages/jarabe/view/service.py", line 63, in start
    owner_model = owner.get_model()
  File "/usr/lib/python2.6/site-packages/jarabe/model/owner.py", line 112, in get_model
    _model = Owner()
  File "/usr/lib/python2.6/site-packages/jarabe/model/owner.py", line 77, in __init__
    self._invites = Invites()
  File "/usr/lib/python2.6/site-packages/jarabe/model/invites.py", line 78, in __init__
    owner = ps.get_owner()
  File "/usr/lib/python2.6/site-packages/sugar/presence/presenceservice.py", line 459, in get_owner
    raise RuntimeError("Could not get owner object.")
RuntimeError: Could not get owner object.

It is very difficult to recover from this gracefully.
To be even more fool-proof, I think we can take a look at the places we can recover gracefully (either by regenerating the data or by using sane defaults), but we can never say for sure that sugar will not break due to filesystem issues.

comment:5 follow-up: Changed 10 years ago by dsd

I think we should work through the bugs on a case-by-case issue and fix them. I don't see why the above issue would be difficult to fix (it may be, but theres nothing in the log that suggests to me that it would be). Sugar should aim to be able to cope with any corruption, or at least gracefully display an "I'm broken" error message.

comment:6 in reply to: ↑ 5 Changed 10 years ago by tomeu

Replying to dsd:

I think we should work through the bugs on a case-by-case issue and fix them. I don't see why the above issue would be difficult to fix (it may be, but theres nothing in the log that suggests to me that it would be). Sugar should aim to be able to cope with any corruption, or at least gracefully display an "I'm broken" error message.

This argument sounds good to me in principle, but I think we should be careful and avoid end up shipping less stable releases because we aimed for much we could tackle in the current release cycle.

That said, it may be interesting to see which files are written by sugar and test each of those for corruption and implement fallbacks for all those, instead of waiting for user reports about corruption. Any takers?

comment:7 Changed 10 years ago by tomeu

Oh, and we should try to never use sync() as it will cause stuff unrelated to Sugar to be committed, making our performance profile less predictable (apart from the user experience).

comment:8 Changed 9 years ago by erikos

I guess the outcome of this discussion is to test through all the files written by Sugar for corruptions. Does the submitter agree and we can remove the r? ?

comment:9 Changed 9 years ago by tomeu

  • Keywords r! added; r? removed

removing from the review queue until the submitter comments

Changed 9 years ago by sascha_silbe

recreate corrupted key pair

Changed 9 years ago by sascha_silbe

check private key syntax

comment:10 Changed 9 years ago by sascha_silbe

  • Keywords r? added; r! removed
  • Milestone changed from Unspecified by Release Team to 0.88
  • Owner changed from tomeu to sascha_silbe
  • Severity changed from Unspecified to Major
  • Status changed from new to assigned
  • Summary changed from Call sync() after profile creation to Recover from broken key pair

I've attached patches to recover from a broken key pair. Please open new tickets for anything that breaks due to some file that's written by Sugar getting corrupted.

comment:11 Changed 9 years ago by tomeu

  • Keywords r+ added; r? removed
tomeu: unmadindu: what do you think of the patch in http://bugs.sugarlabs.org/ticket/1568 ? do we also need to sync or that's fine with you?
unmadindu: tomeu: I think this is good enough

Patches look good to me, other than the lack of spaces around the + operator and having logging.error("Error parsing public key.") not log a backtrace (I know this is not introduced by this patch).

comment:12 Changed 9 years ago by erikos

The sugar-toolkit part looks good to me. We can maybe add the backtrace, like Tomeu suggested.

comment:13 Changed 9 years ago by erikos

  • Keywords olpc-0.84 added

comment:14 Changed 9 years ago by sascha_silbe

  • Resolution set to fixed
  • Status changed from assigned to closed

Pushed as bb323a4, thanks!

comment:15 Changed 9 years ago by sayamindu

Cherry-picked.

comment:16 Changed 9 years ago by sayamindu

  • Keywords olpc-0.84+ added; olpc-0.84 removed

comment:17 Changed 9 years ago by sayamindu

  • Keywords olpc-0.84+ removed

comment:18 Changed 7 years ago by sascha_silbe

  • Cc sascha_silbe added
  • Distribution/OS Unspecified deleted
  • Keywords r+ removed
  • Milestone changed from 0.88 to 0.96
  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopening because we only reviewed and fixed the sugar-toolkit part, but not sugar. The updated patch is currently being reviewed on sugar-devel.

comment:19 Changed 7 years ago by erikos

  • Keywords olpc-test-pending added
  • Resolution set to fixed
  • Status changed from reopened to closed

has been pushed: 516517a81ecfa0d3d2524f885f30b0cf2b0052c9

Is part of Sugar 0.95.6

comment:20 follow-up: Changed 7 years ago by greenfeld

  • Keywords olpc-test-passed added; olpc-test-pending removed

If the public key or the private key is missing or damaged (in certain manners), we regenerate the SSH keys used by Sugar in OLPC 12.1.0 os11.

However key regeneration also seems to force Sugar to go through it's first boot name/icon color prompt cycle. I do not know if this is intentional.

comment:21 in reply to: ↑ 20 Changed 7 years ago by sascha_silbe

Replying to greenfeld:

If the public key or the private key is missing or damaged (in certain manners), we regenerate the SSH keys used by Sugar in OLPC 12.1.0 os11.

However key regeneration also seems to force Sugar to go through it's first boot name/icon color prompt cycle. I do not know if this is intentional.

It's been that way before the patch already. We're showing the intro screen if any of nick name, colour or SSH key is missing (all) or corrupt (SSH key only).

I don't know whether there was a conscious choice before, but it's at least a reasonable thing to do: If the key is corrupted (presumably by the machine crashing early during first Sugar usage), there's a good chance the name and colour settings are corrupted as well.

comment:22 Changed 6 years ago by dnarvaez

  • Milestone 0.96 deleted

Milestone 0.96 deleted

Note: See TracTickets for help on using tickets.