The Problem
You're a senior cloud architect at a cybersecurity company. Your team needs to build an automated malware analysis pipeline on AWS that can process suspicious files submitted by customers and internal security teams.
Establishing design scope
Candidate: What’s the expected volume of file submissions?
Interviewer: We’re looking at about 10,000 files a day, but we want to be able to scale to 1 million files a day as we grow.
Candidate: Are there any performance requirements that I should be aware of?
Interviewer: Static analysis results should be available within 5 minutes. Dynamic analysis can take longer since it’s more resource-intensive - maybe 30 minutes. For API queries we need sub-2-second response times.
Candidate: Since we're dealing with malware, security is obviously critical. What are the key security constraints we should consider?
Interviewer: Malware samples must be completely isolated. They cannot affect our production systems or escape to the internet. All data needs to be encrypted, and we need a complete audit trail of who accessed what and when.
Candidate: Do we have any compliance considerations?
Interviewer: Yes, we need to retain analysis results and samples for 2 years for compliance purposes. Some customers also have data residency requirements.
Candidate: What kinds of files should we expect? Are we dealing primarily with executables?
Interviewer: It's a mix. About 60% are Windows PE files - executables, DLLs, that sort of thing. Another 25% are Office documents like Word and Excel files. 10% are PDFs, and the remaining 5% are various other formats - scripts, archives, etc.
Candidate: Do the samples try to communicate externally? Should I plan for network analysis?
Interviewer: Yes, many samples attempt to communicate with command and control servers. We definitely want to capture and analyze that network traffic.
Back of the envelope estimation
Scaling Requirements
10,000 files/day -> 1,000,000 files/day
FPS = 10,000 files / 86,400 seconds per day ≈ 0.12 files per second -> ~12 files per second at 1 million files/day
QPS - Let’s assume that the ratio of queries to submissions is 10:1
QPS = 1.2 -> 120 queries per second
Storage Requirements
Assume an average file size of 1-2 MB.
Storage requirements for 2 years: 1,000,000 files/day *365 * 2 = 730,000,000 files
730,000,000 files * 2 MB/file = 1.46 petabytes
Analysis Artifact Storage:
Screenshots: 2 MB (dynamic analysis)
Memory dumps: 500 MB (10% of files)
Network captures: 10 MB (dynamic analysis)
Reports/metadata: 100 KB
Weighted average artifact size per file: 2 MB + 10 MB + 0.1 MB + (10% × 500 MB) ≈ 62 MB/file
62 MB/file * 730,000,000 files ≈ 45 petabytes
Add the 1.46 petabytes of raw samples and we’re looking at roughly 47 petabytes over 2 years at full scale. At today’s 10,000 files/day that’s still about 470 terabytes. Yikes.
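To sanity-check the arithmetic, here’s a quick back-of-the-envelope script (the sizes and the 10% memory-dump rate are the same assumptions as above):

# Back-of-the-envelope storage check; sizes mirror the assumptions above.
FILES_PER_DAY = 1_000_000
DAYS = 365 * 2
files = FILES_PER_DAY * DAYS                       # 730,000,000 files

MB = 10**6                                         # decimal units are fine for an estimate
PB = 10**15

sample_storage = files * 2 * MB                    # raw samples at ~2 MB each
# Weighted artifact size per file: screenshots + pcap + report,
# plus a 500 MB memory dump for ~10% of samples.
artifact_per_file = (2 + 10 + 0.1 + 0.10 * 500) * MB
artifact_storage = files * artifact_per_file

print(f"samples:   {sample_storage / PB:.2f} PB")    # ~1.46 PB
print(f"artifacts: {artifact_storage / PB:.2f} PB")  # ~45.3 PB
print(f"total:     {(sample_storage + artifact_storage) / PB:.2f} PB")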
API Endpoints
To submit a sample, we’ll create a POST endpoint that returns a signed blob-storage URL so the client uploads the file directly to our storage (sketched after the endpoint list below). This keeps large uploads off our API servers and helps API performance.
POST api/v1/data/analyze
{
  "sample_id": "",
  "metadata": {
    "source": "email_attachment",
    "customer_id": "",
    "tags": ["phishing", "urgent"]
  }
}
Get status
GET api/v1/data/samples/{sample_id}/status
Get results
GET api/v1/data/samples/{sample_id}
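A minimal sketch of that submission endpoint using boto3 (the bucket name, key layout, and 15-minute expiry are all placeholder choices, not part of the spec):

import uuid
import boto3

s3 = boto3.client("s3")
SAMPLE_BUCKET = "malware-intake"   # placeholder bucket name

def create_submission(metadata: dict) -> dict:
    """Register a sample and hand back a short-lived upload URL."""
    sample_id = str(uuid.uuid4())
    upload_url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": SAMPLE_BUCKET, "Key": f"incoming/{sample_id}"},
        ExpiresIn=900,             # 15 minutes, then the URL stops working
    )
    # Persist sample_id + metadata to the metadata store here (omitted).
    return {"sample_id": sample_id, "upload_url": upload_url}

The client then PUTs the file straight to S3 with upload_url, so the bytes never touch our API tier.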
Design
Analysis pipeline
No reason to reinvent the wheel here, we can pull from the standard web-crawler workflow that we all know and love.
Flow:
User uploads the file into our S3 bucket with a short-lived pre-signed URL, which kicks off our file intake workflow.
Have we seen this content before? We can check against previously seen submissions using the SHA-256 hash of the file (MD5 is collision-prone, which matters when attackers control the input). This is where someone smarter would talk about putting a Bloom filter in front of the lookup so most first-time submissions never hit the key-value store. There’s a dedup sketch after this flow.
If we’ve found it, then we can just return the report!
If not, we dump into the analysis queue
Analysis queue is its own beast and where the bulk of our customization goes. We’ll go over this next.
Once analysis is done, we generate a report and return that to the user’s view.
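Here’s a rough dedup sketch, assuming a DynamoDB table keyed on the SHA-256 (the table and attribute names are made up for illustration):

import hashlib
import boto3

seen_table = boto3.resource("dynamodb").Table("seen-samples")   # hypothetical table, hash key "sha256"

def dedup_lookup(file_bytes: bytes):
    """Return the existing report id if we've already analyzed this exact file."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    item = seen_table.get_item(Key={"sha256": digest}).get("Item")
    if item:
        return item["report_id"]   # hit: short-circuit straight to the report
    return None                    # miss: drop the sample into the analysis queue

A Bloom filter in front of get_item would let most first-time submissions skip the lookup entirely.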
Storage strategy:
Metadata and System State
SQL all day.
Why?
Consistency is critical - analysis status updates need ACID properties
Clear relationships - Sample -> Analysis Job -> Module results
Complex joins -> “Show me all samples from customer X with a risk score > 7” (sketched below)
Audit trail - Who submitted what when?
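As a toy illustration of those relationships and that query (the table and column names are invented for the sketch; any relational engine looks the same):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE samples        (sample_id TEXT PRIMARY KEY, customer_id TEXT, submitted_at TEXT);
CREATE TABLE analysis_jobs  (job_id TEXT PRIMARY KEY, sample_id TEXT REFERENCES samples, status TEXT);
CREATE TABLE module_results (job_id TEXT REFERENCES analysis_jobs, module TEXT, risk_score INTEGER);
""")

# "Show me all samples from customer X with a risk score > 7"
rows = conn.execute("""
SELECT DISTINCT s.sample_id
FROM samples s
JOIN analysis_jobs j  ON j.sample_id = s.sample_id
JOIN module_results r ON r.job_id    = j.job_id
WHERE s.customer_id = ? AND r.risk_score > 7
""", ("customer-x",)).fetchall()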
Reports
Document DB
Why?
Read-heavy workload
Single item retrieval
Flexible schema
Auto-scaling
Caching?
We don’t need this yet, we’ll get there.
Analysis Modules
This is the most interesting section for the security design interview. In this section we need to accomplish a few things:
Static analysis -> easy
Behavioral analysis -> medium, because of all the executable types: macOS, Linux, Windows, etc.
Scalable to include new modules -> super easy with queues
Safe -> HARD
Flow
In this section we’re taking advantage of the result aggregator pattern.
Sample hits the analysis queue.
Sample controller is going to create “job” entries in DynamoDB to track the progress of all analysis jobs.
Then it fans out the analysis to all the different queue types. This is where we get our extensibility!
As those complete, the workers push artifacts into artifact storage and drop a “we’re done” message into our findings queue.
Once all the “jobs” are complete, the sample aggregator pulls together all the findings and ships them off to the report queue.
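A rough controller-side sketch of that fan-out, assuming one SQS queue per module and a DynamoDB jobs table (every name and URL below is a placeholder):

import json
import boto3

sqs = boto3.client("sqs")
jobs = boto3.resource("dynamodb").Table("analysis-jobs")   # placeholder table

MODULE_QUEUES = {                                          # placeholder queue URLs
    "static":  "https://sqs.us-east-1.amazonaws.com/123456789012/static-analysis",
    "dynamic": "https://sqs.us-east-1.amazonaws.com/123456789012/dynamic-analysis",
    "yara":    "https://sqs.us-east-1.amazonaws.com/123456789012/yara-scan",
}

def fan_out(sample_id: str) -> None:
    """Create one job record per module, then enqueue work for each module."""
    for module, queue_url in MODULE_QUEUES.items():
        jobs.put_item(Item={"sample_id": sample_id, "module": module, "status": "PENDING"})
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps({"sample_id": sample_id, "module": module}),
        )

The aggregator’s “are we done?” check is then just a query for any of this sample’s jobs that aren’t in a terminal state, and adding a new analysis module is just adding a queue and an entry to that map.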
Security Concerns
The main security concerns are malware containment and sandbox escape.
Mitigations
Isolated VPC - Sandbox VPC with NO route to production systems
Network segmentation - Sandbox subnets can't reach each other
Ephemeral instances - Terminate and rebuild after each analysis
Container isolation - Docker with strict resource limits + seccomp profiles
No persistent storage - Analysis artifacts uploaded to S3, then instance destroyed
Controlled internet access - NAT gateway with strict egress filtering
Dataflow INTO the Sandbox VPC
Sample Controller will pass a message into the sandbox VPC via an SQS queue. This message will include a pre-signed URL to pull in the malware sample and another pre-signed URL to upload the findings. The download happens through the sandbox NAT gateway via the S3 API, which means no data flows through the production VPC.
Dataflow OUT OF the Sandbox VPC
Sandbox worker uploads artifacts to S3 with that pre-signed URL. The production system gets the S3 event notification and processes the findings.
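Roughly what that controller-side handoff could look like (queue URL, bucket names, key layout, and expiries are all illustrative):

import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
SANDBOX_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/sandbox-work"   # placeholder

def dispatch_to_sandbox(sample_id: str) -> None:
    """Hand a sample to the sandbox VPC without opening any network path back to us."""
    download_url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "malware-intake", "Key": f"incoming/{sample_id}"},
        ExpiresIn=3600,
    )
    upload_url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "analysis-artifacts", "Key": f"{sample_id}/findings.zip"},
        ExpiresIn=3600,
    )
    sqs.send_message(
        QueueUrl=SANDBOX_QUEUE,
        MessageBody=json.dumps({
            "sample_id": sample_id,
            "download_url": download_url,   # worker pulls the sample with this
            "upload_url": upload_url,       # and pushes findings back with this
        }),
    )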
IAM Policies
Read-only access to malware samples
Write-only access to results bucket
NO access to production databases
NO Cross-VPC permissions
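A sketch of a sandbox-worker policy in that spirit (bucket names are placeholders; the explicit deny is belt-and-suspenders on top of simply not granting anything else):

import json

# Hypothetical least-privilege policy for the sandbox worker role.
SANDBOX_WORKER_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {   # read-only access to malware samples
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::malware-intake/incoming/*",
        },
        {   # write-only access to the results bucket
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::analysis-artifacts/*",
        },
        {   # explicit deny on production data stores
            "Effect": "Deny",
            "Action": ["dynamodb:*", "rds:*", "rds-data:*"],
            "Resource": "*",
        },
    ],
}

print(json.dumps(SANDBOX_WORKER_POLICY, indent=2))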
Front End Flow
We can mostly borrow from any system design for scaled storage. Go check out the Dropbox or Google Drive designs!
Follow-up Questions
How could we provide stronger sandboxing?
Use hardware-virtualized micro-VMs, not plain Docker. We could use something like Firecracker, or full HVM VMs with dedicated AMIs per OS (Windows/Linux/macOS).
Restore after analysis using immutable baseline + snapshot rollback.
Separate AWS accounts (or at least a dedicated sandbox account), plus dedicated subnets for the highest-risk analysis, to prevent accidental IAM lateral movement.
We mentioned Network Telemetry and safe internet emulation at the beginning. Where is that?
Use a simulated internet / controlled C2 sink (INetSim or a controlled proxy) so the malware believes it has connectivity back to its operators.
Capture full PCAPs via tcpdump in the sandbox and stream them to S3 via a VPC endpoint. Also capture DNS logs from the resolver and HTTP(S) logs - via TLS interception if needed, with care around encrypted traffic.
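A crude sketch of the capture loop inside the sandbox worker (interface name, paths, and duration are assumptions; upload_url is the pre-signed URL from the SQS message; a real worker would stream rather than buffer the whole pcap):

import subprocess
import urllib.request

def capture_and_ship(upload_url: str, duration_s: int = 300) -> None:
    """Run tcpdump for the length of the detonation, then PUT the pcap to S3."""
    pcap_path = "/tmp/detonation.pcap"
    subprocess.run(
        ["timeout", str(duration_s), "tcpdump", "-i", "eth0", "-w", pcap_path],
        check=False,   # timeout exiting non-zero is expected
    )
    with open(pcap_path, "rb") as fh:
        req = urllib.request.Request(upload_url, data=fh.read(), method="PUT")
        urllib.request.urlopen(req)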
Isolation, IAM and network hardening
VPC endpoints for S3 so data never transits the public internet. Block S3 access except via those endpoints.
Make sure we have least-privilege roles per worker type: a separate role for sample pull, another for artifact upload, and nothing else. Enforce IAM permissions boundaries and deny any access to production accounts via SCPs/policies.
Use AWS Organizations SCPs and separate accounts for telemetry, control plane, and sandbox execution.
Enable VPC Flow Logs, AWS CloudTrail, and S3 access logs and pipe them to your SIEM. Use S3 Object Lock to meet any legal/compliance retention needs.
Your storage strategy is bonkers expensive.
If the same file is submitted multiple times we could store only one copy, keyed by its hash. This might be complicated by customer agreements.
We could implement a tiered storage system. Hot data would mostly be reports and recent artifacts. Warm would be anything within some search window - 30 days? Then a cold tier for the compliance retention (lifecycle sketch below).
We could also consider compressing memory dumps and other large artifacts. Depends on the COGS there.
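A sketch of that tiering as an S3 lifecycle rule via boto3 (the bucket name and day thresholds are illustrative):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="analysis-artifacts",   # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-artifacts",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},   # warm
                    {"Days": 90, "StorageClass": "GLACIER"},       # cold
                ],
                "Expiration": {"Days": 730},                       # end of the 2-year retention window
            }
        ]
    },
)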
What about search?
No reason to roll our own right away. Let’s use something like Elasticsearch (or OpenSearch on AWS), or if we just want report lookups we already have our document database.
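If we do go the Elasticsearch route, indexing and querying reports is about this much work (sketch assumes the elasticsearch-py 8.x client and a local endpoint; the index and field names are made up):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder endpoint

def index_report(report: dict) -> None:
    """Make a finished report searchable by customer, verdict, risk, etc."""
    es.index(index="analysis-reports", id=report["sample_id"], document=report)

def find_high_risk(customer_id: str):
    """Example query: high-risk samples for one customer."""
    return es.search(
        index="analysis-reports",
        query={
            "bool": {
                "filter": [
                    {"term": {"customer_id": customer_id}},
                    {"range": {"risk_score": {"gt": 7}}},
                ]
            }
        },
    )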