AWK and DNA

2022-06-25 computer english blog blogging

Pesquisando sobre otimização de AWK eu encontro este post em que uma pessoa tenta tornar a análise de sequências de DNA na nuvem algo financeiramente e computacionalmente viável. Ela começa tentando o óbvio, usando SQL nas próprias estruturas de CSV hospedadas na S3, passa por tentativas de particionar os blocos, por entender como paralelizar a operação e termina em uma solução que usa AWK junto de GNU Parallel, redirecionando a saída para que a linguagem R processe.

select * from intensityData limit 10;
select * from intensityData where snp = 'rs123456';

"Eight minutes and 4+ terabytes of data queried later I had my results. Athena charges you by data searched at the reasonable rate of $5 per TB. So this single query cost $20 and eight minutes. If we ever wanted to run a model over all the data we better be ready to wait roughly 38 years and pay $50 million. Clearly this wasn’t going to work."

Lessons Learned

  • There's no cheap way to parse 25tb of data at once.
  • Sorting is hard, especially when data is distributed.
  • Never, ever, try and make 2.5 million partitions. (cost: $1k+ USD)
  • Sometimes bespoke data needs bespoke solutions.
  • Don't sleep on the basics. Someone probably solved your problem in the 80s.
  • gnu parallel is magic and everyone should use it.
  • Associative arrays in AWK are super powerful.

Example

yp1234,577,1,3
yp5678,577,3,5
yp9012,132,8,9

# This will create the two files 577.csv and 132.csv in your current directory.
awk -F, '{ print > $2 ".csv" }' file.csv

parallel --block 100M --pipe  \
        "awk -F '\t' '{print \$1\",...\"$30\">\"chunked/{#}_chr\"\$15\".csv\"}'"
[comment] [Gerenciamento de Memória em C/C++]