RFC: vm: split no output from test machine errors #4808

Closed
wants to merge 1 commit into from

a-nogikh
Collaborator

@a-nogikh a-nogikh commented May 14, 2024

It's quite confusing that we in fact combine two kinds of problems into one:

  1. We are no longer able to execute syzkaller-generated programs.
  2. The VM has hung.

Let's use different error messages and different timeouts for these issues.

In the first case, keep on monitoring for the program execution logs. In the second case, look at any output from the VM. Additionally, attempt periodic SSH connections.

Now it becomes possible to test (2) using C reproducers.

TODO:

  • Update the documentation.
  • Update tests.

@a-nogikh a-nogikh requested a review from dvyukov May 14, 2024 17:21
vmDiagnosisStart = "\nVM DIAGNOSIS:\n"
lostConnectionCrash = "lost connection to test machine"
noOutputCrash = "no output from test machine"
executionStalledCrash = "execution stalled"
Collaborator

This needs to be understandable by kernel developers who know nothing about syzkaller. All they will get is this string and likely not much else (no reproducer, no report).

}

func (mon *monitor) monitorExecution() *report.Report {
	ticker := time.NewTicker(tickerPeriod * mon.inst.pool.timeouts.Scale)
	defer ticker.Stop()

	alive := make(chan bool)
Collaborator

This change needs a test.

@dvyukov
Collaborator

dvyukov commented May 15, 2024

> It's quite confusing that we in fact combine two kinds of problems into one:
>
>   1. We are no longer able to execute syzkaller-generated programs.
>   2. The VM has hung.
>
> Let's use different error messages and different timeouts for these issues.
>
> In the first case, keep on monitoring for the program execution logs. In the second case, look at any output from the VM. Additionally, attempt periodic SSH connections.
>
> Now it becomes possible to test (2) using C reproducers.

I don't understand the exact difference. In the first case the VM has also likely hung, and in the second case it's also not executing programs. So in both cases there is no output and the VM has hung.

We can create C reproducers for both of these failures already, no?

@a-nogikh
Collaborator Author

The periodic "executing program" output (more precisely, the absence of it) is not a 100% reliable indicator of a kernel problem.

  1. We may have hit some problem in syz-executor/syz-execprog/syz-fuzzer and it just stopped execution.
  2. We don't print "executing program" in C reproducers. So we should not be looking for "executing program" in the output in the first place: it will never be there.

Given (2), and given that we have explicitly tried to avoid running C reproducers for more than 5 minutes so as not to hit the no output timeout, it's very strange to see C reproducers in https://syzkaller.appspot.com/bug?extid=2e40940976be9f8fce8ba3d1d03b77aee9f4df9d. They should have remained syz reproducers.

So the idea here is to detect kernel hangs more precisely: there must be no output at all, and the VM must no longer accept any new connections. That is the actual "no output from machine" or, maybe, "machine stalled".

Independently from that, if we test a syz reproducer, we can check whether the output contains any "executing program" substrings. If there are signs that the VM is alive but there are no executions, it is a sign that there is a syzkaller bug we should look into. Prefixing it with SYZFATAL: would probably make that clearer.
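To make the proposed split concrete, here is a minimal Go sketch of the decision described above. It is not the code from this PR: the `classify` function, its parameters, and the `verdictOK` value are hypothetical names chosen for illustration; only the two crash titles are taken from the diff.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// Titles quoted from the diff under review.
	noOutputCrash         = "no output from test machine"
	executionStalledCrash = "execution stalled"
	// Hypothetical value for "nothing to report".
	verdictOK = "ok"
)

// classify decides what one monitoring window means (hypothetical helper).
//   output:       everything the VM printed during the window
//   acceptsConn:  whether a fresh connection attempt (e.g. SSH) succeeded
//   expectMarker: whether the workload is expected to print "executing program"
func classify(output string, acceptsConn, expectMarker bool) string {
	// A hang in the strict sense: complete silence and no new connections.
	if len(output) == 0 && !acceptsConn {
		return noOutputCrash
	}
	// The VM shows signs of life, but programs are no longer being executed.
	if expectMarker && !strings.Contains(output, "executing program") {
		return executionStalledCrash
	}
	return verdictOK
}

func main() {
	fmt.Println(classify("", false, true))
	fmt.Println(classify("some kernel log line\n", true, true))
	fmt.Println(classify("executing program\n", true, true))
}
```

With this split, a workload that never prints the marker can still be checked for the strict hang condition by passing `expectMarker=false`.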

@dvyukov
Collaborator

dvyukov commented May 15, 2024

We print "executing program" in C reproducers:

syzkaller$ git grep "executing program" pkg/csource/
pkg/csource/csource.go:                 fmt.Fprintf(buf, "\tif (write(1, \"executing program\\n\", sizeof(\"executing program\\n\") - 1)) {}\n")
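
For readers unfamiliar with the generator, the following small Go sketch shows which C statement that `Fprintf` call emits. Only the format string is copied from the grep output above; the `emitMarker` and `generated` helpers are hypothetical wrappers, and the conditions under which the real generator reaches this line are not shown here.

```go
package main

import (
	"bytes"
	"fmt"
)

// emitMarker writes the C statement using the exact format string
// quoted from pkg/csource/csource.go (hypothetical wrapper).
func emitMarker(buf *bytes.Buffer) {
	fmt.Fprintf(buf, "\tif (write(1, \"executing program\\n\", sizeof(\"executing program\\n\") - 1)) {}\n")
}

// generated returns the emitted C line as a string.
func generated() string {
	buf := new(bytes.Buffer)
	emitMarker(buf)
	return buf.String()
}

func main() {
	// Prints one line of C that writes the marker to stdout (fd 1).
	fmt.Print(generated())
}
```

The emitted C statement writes the marker to file descriptor 1, which is the substring the monitor looks for in the VM output.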

@a-nogikh
Collaborator Author

Hmm, interesting. I've just looked at ~10 random C reproducers from syzbot, and none of them contains "executing program".

@dvyukov
Collaborator

dvyukov commented May 15, 2024

Either they were not repeating, or something broke.

@a-nogikh
Collaborator Author

a-nogikh commented May 17, 2024

The flag is indeed set before we start reproduction:

Repro: true,

But then we clear it right after we have found a reproducer:

defer func() {
	if res != nil {
		res.Opts.Repro = false
	}
}()

We form the actual repro C code right before sending it to the dashboard, now with Repro=false.

cprog, err := csource.Write(repro.Prog, repro.Opts)

It has been this way for a very long time already (?), so we likely don't have "executing program" in the reproducers of any of the currently open bugs.
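
The sequence above can be reproduced in a small, self-contained Go sketch. The `Options`, `Result`, `findRepro`, `writeSource`, and `hasMarker` names are hypothetical stand-ins, not the real syzkaller types, and the assumption that the marker is emitted only when `Repro` is set is inferred from this thread rather than checked against the generator.

```go
package main

import (
	"fmt"
	"strings"
)

// Options and Result are hypothetical stand-ins for the reproducer types.
type Options struct{ Repro bool }

type Result struct{ Opts Options }

// writeSource stands in for C source generation. Assumption: the
// "executing program" marker is emitted only when Repro is set.
func writeSource(opts Options) string {
	var sb strings.Builder
	sb.WriteString("void loop(void) {\n")
	if opts.Repro {
		sb.WriteString("\tif (write(1, \"executing program\\n\", sizeof(\"executing program\\n\") - 1)) {}\n")
	}
	sb.WriteString("}\n")
	return sb.String()
}

// hasMarker reports whether the generated source prints the marker.
func hasMarker(src string) bool {
	return strings.Contains(src, "executing program")
}

// findRepro searches with Repro=true, but the deferred function clears
// the flag on the named result before the caller ever sees it.
func findRepro() (res *Result) {
	defer func() {
		if res != nil {
			res.Opts.Repro = false
		}
	}()
	return &Result{Opts: Options{Repro: true}}
}

func main() {
	res := findRepro()
	fmt.Println("Repro flag after search:", res.Opts.Repro)
	fmt.Println("marker in final source:", hasMarker(writeSource(res.Opts)))
}
```

Under these assumptions, the source generated from the returned options never contains the marker, which matches the observation about the C reproducers on the dashboard.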

UPD: sent #4816

@a-nogikh a-nogikh closed this May 23, 2024