2
00:00:11,960 --> 00:00:18,240
who is going to be speaking about small

3
00:00:15,120 --> 00:00:22,920
footprint ETL please clap

4
00:00:18,240 --> 00:00:24,779
[Applause]

5
00:00:22,920 --> 00:00:26,699
thank you very much uh hi there I'm Noah

6
00:00:24,779 --> 00:00:28,680
Kantrowitz I'm an SRE at Geomagical Labs

7
00:00:26,699 --> 00:00:31,260
we do computer vision and augmented

8
00:00:28,680 --> 00:00:33,000
reality stuff for Ikea and I'm here to

9
00:00:31,260 --> 00:00:34,680
talk about building small footprint ETL

10
00:00:33,000 --> 00:00:37,620
systems as Katie so succinctly said

11
00:00:34,680 --> 00:00:39,300
using Django in particular but I'm not

12
00:00:37,620 --> 00:00:41,100
here to talk about work

13
00:00:39,300 --> 00:00:42,860
I'm going to use one of my side projects Farm RPG

14
00:00:41,100 --> 00:00:45,120
it's a free online web and mobile game

15
00:00:42,860 --> 00:00:46,379
and more important for our needs it has

16
00:00:45,120 --> 00:00:48,780
a lot of fun data and I don't have to

17
00:00:46,379 --> 00:00:50,340
run this past a million lawyers

18
00:00:48,780 --> 00:00:51,600
if you're not a big gamer don't worry

19
00:00:50,340 --> 00:00:53,219
the big thing to understand is that

20
00:00:51,600 --> 00:00:55,320
games tend to have highly interconnected

21
00:00:53,219 --> 00:00:56,760
data items drop from monsters and they're

22
00:00:55,320 --> 00:00:59,579
used in recipes and recipes come from

23
00:00:56,760 --> 00:01:01,379
quests etc etc in SQL terms this looks

24
00:00:59,579 --> 00:01:03,780
like every table has at least one

25
00:01:01,379 --> 00:01:05,820
foreign key usually three or four

26
00:01:03,780 --> 00:01:09,420
it's much more of a web structure than

27
00:01:05,820 --> 00:01:10,680
you'd have in a normal REST application

28
00:01:09,420 --> 00:01:12,180
and this isn't really part of the main

29
00:01:10,680 --> 00:01:14,040
topic in case anyone's wondering did he

30
00:01:12,180 --> 00:01:15,780
really spend a year building an ETL

31
00:01:14,040 --> 00:01:17,280
system for a free internet game yes I

32
00:01:15,780 --> 00:01:19,200
did because it was fun and it's a great

33
00:01:17,280 --> 00:01:21,420
way to learn these kinds of tools many

34
00:01:19,200 --> 00:01:22,920
of which I now use at my day job big

35
00:01:21,420 --> 00:01:24,600
shout out to fun side projects where you

36
00:01:22,920 --> 00:01:26,220
can move at your own speed and no one

37
00:01:24,600 --> 00:01:28,080
worries if you are down for a week

38
00:01:26,220 --> 00:01:29,759
anyway moving on

39
00:01:28,080 --> 00:01:31,259
all right so this talk's about ETL what

40
00:01:29,759 --> 00:01:33,240
does that even mean the core is quite

41
00:01:31,259 --> 00:01:34,259
literal extract data from somewhere run

42
00:01:33,240 --> 00:01:36,360
it through some kind of transformation

43
00:01:34,259 --> 00:01:38,340
and load it into a database

44
00:01:36,360 --> 00:01:39,540
not every ETL is a web scraper but the

45
00:01:38,340 --> 00:01:41,220
two are very similar so we can kind of

46
00:01:39,540 --> 00:01:42,720
think of them in the same terms maybe

47
00:01:41,220 --> 00:01:44,700
instead of an Internet website you're

48
00:01:42,720 --> 00:01:46,680
scraping an internal API or maybe it's a

49
00:01:44,700 --> 00:01:47,880
database instead of HTTP but they're all

50
00:01:46,680 --> 00:01:49,200
sort of the same structure if you think

51
00:01:47,880 --> 00:01:50,700
web scraper you're probably in the right

52
00:01:49,200 --> 00:01:52,259
ballpark

53
00:01:50,700 --> 00:01:53,579
to touch on it real briefly a lot of

54
00:01:52,259 --> 00:01:56,340
very fancy folks have been thought

55
00:01:53,579 --> 00:01:58,380
leadering about ELT instead of ETL the

56
00:01:56,340 --> 00:02:00,299
same core idea but instead of storing

57
00:01:58,380 --> 00:02:01,799
the transformed data first you store the

58
00:02:00,299 --> 00:02:03,840
raw data so you can re-transform it

59
00:02:01,799 --> 00:02:05,399
later if you need to if that's a feature

60
00:02:03,840 --> 00:02:07,020
that you need by all means pursue it if

61
00:02:05,399 --> 00:02:08,459
your raw data is very big though that's

62
00:02:07,020 --> 00:02:09,599
going to balloon your complexity and

63
00:02:08,459 --> 00:02:11,099
your storage requirements and since

64
00:02:09,599 --> 00:02:13,379
we're here to talk about small systems I

65
00:02:11,099 --> 00:02:15,120
don't think this is for us

66
00:02:13,379 --> 00:02:16,860
and also because we are talking about

67
00:02:15,120 --> 00:02:18,599
scraping I would be remiss if I did not

68
00:02:16,860 --> 00:02:19,739
remind everyone that hostile scraping is

69
00:02:18,599 --> 00:02:21,420
generally against the terms and

70
00:02:19,739 --> 00:02:23,099
conditions of websites make sure you

71
00:02:21,420 --> 00:02:25,500
have permission before you scrape things

72
00:02:23,099 --> 00:02:27,180
any website API or data source that you

73
00:02:25,500 --> 00:02:28,860
don't own if you have questions about

74
00:02:27,180 --> 00:02:31,980
what is allowed please talk to the owner

75
00:02:28,860 --> 00:02:34,680
or a trusted legal professional or both

76
00:02:31,980 --> 00:02:37,260
but all right scrapers aren't all of ETL

77
00:02:34,680 --> 00:02:38,340
we need the T and the L too the transforms

78
00:02:37,260 --> 00:02:40,140
in many of these systems are going to

79
00:02:38,340 --> 00:02:41,580
have two steps first you're going to

80
00:02:40,140 --> 00:02:43,760
want to parse stuff into structured data

81
00:02:41,580 --> 00:02:46,140
if you're lucky in simple cases this is

82
00:02:43,760 --> 00:02:47,340
json.loads if you're not lucky it's

83
00:02:46,140 --> 00:02:49,620
going to be something ugly with

84
00:02:47,340 --> 00:02:51,599
Beautiful Soup or a binary parser or who

85
00:02:49,620 --> 00:02:53,340
knows what then we need to take that

86
00:02:51,599 --> 00:02:54,780
structured data and mold it into a form

87
00:02:53,340 --> 00:02:56,940
that's going to be more useful for our

88
00:02:54,780 --> 00:02:58,319
queries later this can take the form of

89
00:02:56,940 --> 00:02:59,879
something like SQL normalization or

90
00:02:58,319 --> 00:03:01,620
denormalization meaning breaking things

91
00:02:59,879 --> 00:03:03,120
apart into smaller models or gluing them

92
00:03:01,620 --> 00:03:05,280
together into bigger models or more

93
00:03:03,120 --> 00:03:07,379
mundane stuff like just renaming fields

94
00:03:05,280 --> 00:03:09,660
or combining multiple data sources into

95
00:03:07,379 --> 00:03:10,920
a single model stuff like that
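
A rough sketch of the two transform steps described here: parse with `json.loads`, then rename fields and split nested data into separate rows. The payload, the field names, and the `transform` helper are all made up for illustration and are not taken from the talk.

```python
import json

# Hypothetical upstream payload; every field name here is invented for this sketch.
RAW = (
    '{"id": 7, "itemName": "Apple", "sellPrice": "12", '
    '"drops": [{"monster": "Slime", "rate": 0.25}]}'
)


def transform(raw: str) -> tuple[dict, list[dict]]:
    """Parse one JSON document and split it into an item row plus drop rows."""
    data = json.loads(raw)  # step 1: parse into structured data

    # Step 2a: rename fields and coerce types to match our own schema.
    item = {
        "id": data["id"],
        "name": data["itemName"],
        "value": int(data["sellPrice"]),
    }

    # Step 2b: normalization. Nested sub-objects become rows for their own
    # table, linked back to the item through a foreign key column.
    drops = [
        {"item_id": data["id"], "monster": d["monster"], "rate": d["rate"]}
        for d in data.get("drops", [])
    ]
    return item, drops


item, drops = transform(RAW)
```

In a Django project the two return values would typically feed two models, one for items and one for drops.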

96
00:03:09,660 --> 00:03:13,200
um

97
00:03:10,920 --> 00:03:15,120
the transforms can also sometimes be

98
00:03:13,200 --> 00:03:16,379
doing data aggregation at transform time

99
00:03:15,120 --> 00:03:17,640
we'll talk about this a little bit more

100
00:03:16,379 --> 00:03:19,019
later on some of the trade-offs in this

101
00:03:17,640 --> 00:03:21,780
but if you're being really rigorous

102
00:03:19,019 --> 00:03:23,640
about what ETL means then the transform

103
00:03:21,780 --> 00:03:25,440
would also be doing data collapse and

104
00:03:23,640 --> 00:03:27,540
aggregation as well

105
00:03:25,440 --> 00:03:29,879
but uh

106
00:03:27,540 --> 00:03:31,080
we also need to talk about the L and in

107
00:03:29,879 --> 00:03:32,700
that case it's mostly going to be the

108
00:03:31,080 --> 00:03:34,500
Django ORM which I assume most people

109
00:03:32,700 --> 00:03:36,120
here probably know how to use you could

110
00:03:34,500 --> 00:03:38,459
use more complex stuff like Django REST

111
00:03:36,120 --> 00:03:39,659
framework serializers or Pydantic but in

112
00:03:38,459 --> 00:03:41,700
most cases it's going to be something

113
00:03:39,659 --> 00:03:43,440
along those lines

114
00:03:41,700 --> 00:03:45,480
async in Django has been a long journey

115
00:03:43,440 --> 00:03:47,099
and that Journey isn't over yet but

116
00:03:45,480 --> 00:03:48,420
async Django is great and you can use it

117
00:03:47,099 --> 00:03:50,159
today for real production applications

118
00:03:48,420 --> 00:03:51,540
I'm going to touch a little bit more on

119
00:03:50,159 --> 00:03:53,700
some of the limitations later but the

120
00:03:51,540 --> 00:03:55,620
overall Top Line thing to uh to know is

121
00:03:53,700 --> 00:03:57,599
that you can really use this I highly

122
00:03:55,620 --> 00:04:00,239
recommend it

123
00:03:57,599 --> 00:04:02,220
why use async Django as the basis for an

124
00:04:00,239 --> 00:04:05,280
ETL system it lets us keep everything in

125
00:04:02,220 --> 00:04:07,019
one code base uh most ETL

126
00:04:05,280 --> 00:04:09,420
systems have a fairly well deserved

127
00:04:07,019 --> 00:04:10,680
reputation for being finicky tools one

128
00:04:09,420 --> 00:04:12,540
system falls out of sync with another

129
00:04:10,680 --> 00:04:14,400
and then the whole pipeline locks up

130
00:04:12,540 --> 00:04:16,799
until some sad on-call engineer gets

131
00:04:14,400 --> 00:04:18,120
paged to come out and fix it having fewer

132
00:04:16,799 --> 00:04:19,680
Services means that we have fewer things

133
00:04:18,120 --> 00:04:21,479
that can go wrong and when they do go

134
00:04:19,680 --> 00:04:23,160
wrong we have simpler solutions for them

135
00:04:21,479 --> 00:04:25,320
here we're probably only going to have

136
00:04:23,160 --> 00:04:27,000
two things Django and postgres and if

137
00:04:25,320 --> 00:04:28,259
Django breaks restart Django and we're

138
00:04:27,000 --> 00:04:30,780
back in business

139
00:04:28,259 --> 00:04:32,340
uh also having things inside one service

140
00:04:30,780 --> 00:04:33,960
makes it a lot easier to move them between

141
00:04:32,340 --> 00:04:35,340
different deployment tools and infrastructure

142
00:04:33,960 --> 00:04:37,380
providers and it makes local development

143
00:04:35,340 --> 00:04:39,000
a whole lot easier

144
00:04:37,380 --> 00:04:40,560
microservices have been the cool way to do

145
00:04:39,000 --> 00:04:42,120
things for a really long time now and

146
00:04:40,560 --> 00:04:43,740
they are certainly useful when projects

147
00:04:42,120 --> 00:04:45,720
get big and they cross team boundaries

148
00:04:43,740 --> 00:04:47,759
but we're doing small small doesn't need

149
00:04:45,720 --> 00:04:49,320
microservices

150
00:04:47,759 --> 00:04:51,180
you save a lot on the organizational

151
00:04:49,320 --> 00:04:53,520
complexity or the infrastructure

152
00:04:51,180 --> 00:04:56,060
complexity don't worry about it embrace

153
00:04:53,520 --> 00:04:56,060
the monolith

154
00:04:56,160 --> 00:05:00,300
thank you lots of talks as well as the

155
00:04:59,040 --> 00:05:01,680
Django documentation have

156
00:05:00,300 --> 00:05:03,900
covered the basics of starting an async

157
00:05:01,680 --> 00:05:05,639
app starting with 4.2 it's also

158
00:05:03,900 --> 00:05:08,400
basically the same as starting any

159
00:05:05,639 --> 00:05:10,620
Django app you build your your Django

160
00:05:08,400 --> 00:05:12,240
project you start the first app and then

161
00:05:10,620 --> 00:05:13,979
you start adding views when you add the

162
00:05:12,240 --> 00:05:15,540
view you decorate it as async and that's

163
00:05:13,979 --> 00:05:18,060
it you've got yourself an async Django

164
00:05:15,540 --> 00:05:19,560
project all of the core middleware is

165
00:05:18,060 --> 00:05:20,460
async compatible so you only have to

166
00:05:19,560 --> 00:05:22,500
worry about middleware if you start

167
00:05:20,460 --> 00:05:25,380
adding custom stuff and even if you have

168
00:05:22,500 --> 00:05:26,820
a project that is adding non-async

169
00:05:25,380 --> 00:05:29,699
compatible middleware Django will adapt

170
00:05:26,820 --> 00:05:31,199
it for you though at a performance cost

171
00:05:29,699 --> 00:05:32,759
one thing Django doesn't currently

172
00:05:31,199 --> 00:05:34,139
include is an async compatible web

173
00:05:32,759 --> 00:05:35,940
server we heard a little bit of this in

174
00:05:34,139 --> 00:05:37,259
the last talk as well fortunately the

175
00:05:35,940 --> 00:05:39,419
community has several for us to pick

176
00:05:37,259 --> 00:05:41,060
from I personally like uvicorn it has

177
00:05:39,419 --> 00:05:44,220
the same sort of reload on change

178
00:05:41,060 --> 00:05:46,500
development flow as the core runserver

179
00:05:44,220 --> 00:05:47,639
development server and you can use it in

180
00:05:46,500 --> 00:05:49,380
production if you just throw an extra

181
00:05:47,639 --> 00:05:51,479
couple flags on there turn on the access

182
00:05:49,380 --> 00:05:53,220
logging turn on TLS stuff like that but

183
00:05:51,479 --> 00:05:56,580
you can use the same project or the same

184
00:05:53,220 --> 00:05:58,259
server in both scenarios

185
00:05:56,580 --> 00:06:01,139
for any ETL system we're going to need

186
00:05:58,259 --> 00:06:02,639
background tasks the normal Django way

187
00:06:01,139 --> 00:06:04,860
to do background tasks is Celery and

188
00:06:02,639 --> 00:06:06,180
Celery beat but that's a whole big set

189
00:06:04,860 --> 00:06:08,100
of things to run you need to run all the

190
00:06:06,180 --> 00:06:10,139
Celery daemons and then a broker maybe a

191
00:06:08,100 --> 00:06:13,259
result store I don't really want to do

192
00:06:10,139 --> 00:06:15,240
that anymore at least if I can help it

193
00:06:13,259 --> 00:06:16,620
a simpler solution is cron with custom

194
00:06:15,240 --> 00:06:18,120
management commands but I think we can

195
00:06:16,620 --> 00:06:19,919
do even better

196
00:06:18,120 --> 00:06:21,180
one big advantage to the async system is

197
00:06:19,919 --> 00:06:23,580
that we can spawn additional background

198
00:06:21,180 --> 00:06:25,800
tasks to run concurrently inside the web

199
00:06:23,580 --> 00:06:27,960
application so let's do that

200
00:06:25,800 --> 00:06:28,860
the core of any looping task in async

201
00:06:27,960 --> 00:06:31,020
Python is going to look something like

202
00:06:28,860 --> 00:06:32,699
this run the extractor function sleep

203
00:06:31,020 --> 00:06:34,319
repeat

204
00:06:32,699 --> 00:06:35,580
to integrate this with Django though we

205
00:06:34,319 --> 00:06:37,560
need to hook it into the server startup

206
00:06:35,580 --> 00:06:39,539
process the easiest way to do that is

207
00:06:37,560 --> 00:06:41,639
with the ready callback in app configs

208
00:06:39,539 --> 00:06:42,900
so we want to launch our async task in

209
00:06:41,639 --> 00:06:44,759
the background while Django is

210
00:06:42,900 --> 00:06:46,259
initializing and then let Django

211
00:06:44,759 --> 00:06:48,600
continue initializing as the server

212
00:06:46,259 --> 00:06:49,560
starts up all of our code loads our

213
00:06:48,600 --> 00:06:51,120
stuff will just be running in the

214
00:06:49,560 --> 00:06:53,039
background firing off every 30 seconds

215
00:06:51,120 --> 00:06:54,360
or whatever we told it to create task

216
00:06:53,039 --> 00:06:56,720
does exactly this thing that we're

217
00:06:54,360 --> 00:06:56,720
looking for
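
A minimal sketch of the "run, sleep, repeat" loop and of launching it with `create_task` at startup. The names `looping_task`, `start_background`, `EtlConfig`, and `scrape_items` are hypothetical. It also assumes the ASGI server imports the Django app while its event loop is already running; when no loop is running (for example in a management command) the helper does nothing.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger(__name__)

# Keep strong references so running tasks are not garbage collected.
_background_tasks: set = set()


async def looping_task(extract: Callable[[], Awaitable[None]], interval: float) -> None:
    """Run the extractor, sleep for `interval` seconds, repeat forever."""
    while True:
        try:
            await extract()
        except Exception:
            # One failed run should not end the loop; the next pass retries.
            logger.exception("extractor failed")
        await asyncio.sleep(interval)


def start_background(
    extract: Callable[[], Awaitable[None]], interval: float = 30.0
) -> Optional[asyncio.Task]:
    """Schedule the loop on the running event loop, or return None without one."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        return None  # no event loop, e.g. a one-off management command
    task = loop.create_task(looping_task(extract, interval))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return task


# In a Django project this would be called from the app config, roughly:
#
#     class EtlConfig(AppConfig):
#         name = "etl"
#
#         def ready(self):
#             start_background(scrape_items)
```

Because `ready()` returns immediately after scheduling, Django carries on initializing while the loop runs in the background.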

218
00:06:56,880 --> 00:07:00,900
but we do need to talk about some of the

219
00:06:58,319 --> 00:07:02,460
downsides as well uh with Celery and

220
00:07:00,900 --> 00:07:04,620
RabbitMQ we have a much more durable system

221
00:07:02,460 --> 00:07:06,539
if a task gets accepted by the queue

222
00:07:04,620 --> 00:07:09,240
into celery it's going to run at least

223
00:07:06,539 --> 00:07:12,120
once no matter what barring any really

224
00:07:09,240 --> 00:07:14,280
weird bugs but with this if something

225
00:07:12,120 --> 00:07:15,539
crashes it's just gone okay well we can

226
00:07:14,280 --> 00:07:18,360
add exception handling we can add

227
00:07:15,539 --> 00:07:20,280
retries but still if our process crashes

228
00:07:18,360 --> 00:07:23,460
completely again it's just no longer

229
00:07:20,280 --> 00:07:25,560
there but this is usually okay in ETL

230
00:07:23,460 --> 00:07:27,900
systems if we miss a scrape one hour

231
00:07:25,560 --> 00:07:29,759
we'll get it the next hour no worries as

232
00:07:27,900 --> 00:07:31,319
long as each scrape is bringing in more

233
00:07:29,759 --> 00:07:32,819
than its interval worth of data so if

234
00:07:31,319 --> 00:07:34,080
you scrape every hour if every scrape is

235
00:07:32,819 --> 00:07:36,539
bringing in more than an hour of data

236
00:07:34,080 --> 00:07:39,000
you've got a buffer for failure

237
00:07:36,539 --> 00:07:41,039
this basically builds in the the fault

238
00:07:39,000 --> 00:07:42,960
tolerance for you in cases where you

239
00:07:41,039 --> 00:07:45,419
can't do that you can also build

240
00:07:42,960 --> 00:07:46,560
specific models to track the status of

241
00:07:45,419 --> 00:07:48,780
tasks just like we would have with

242
00:07:46,560 --> 00:07:50,220
rabbitmq so as an example rather than

243
00:07:48,780 --> 00:07:52,199
having a background task for sending an

244
00:07:50,220 --> 00:07:54,180
email we can instead store a database

245
00:07:52,199 --> 00:07:55,800
Row for each pending email we have a

246
00:07:54,180 --> 00:07:57,060
looping task that every few minutes will

247
00:07:55,800 --> 00:07:59,819
go through and try to send everything

248
00:07:57,060 --> 00:08:01,740
that is pending when they succeed it'll

249
00:07:59,819 --> 00:08:04,500
get flushed out of the database

250
00:08:01,740 --> 00:08:07,199
and if not it'll get retried this gives

251
00:08:04,500 --> 00:08:08,759
a similar level of safety to rabbitmq as

252
00:08:07,199 --> 00:08:11,160
long as something gets into that table

253
00:08:08,759 --> 00:08:12,660
it will be tried at least once

254
00:08:11,160 --> 00:08:14,220
this is more work though so as always

255
00:08:12,660 --> 00:08:16,199
you know you have to weigh the

256
00:08:14,220 --> 00:08:18,360
pros and cons of how much failure

257
00:08:16,199 --> 00:08:20,160
tolerance you want in each sort of piece

258
00:08:18,360 --> 00:08:22,319
of this system

259
00:08:20,160 --> 00:08:24,300
but okay back to async stuff general

260
00:08:22,319 --> 00:08:25,560
rule of the async orm is any method that

261
00:08:24,300 --> 00:08:29,220
would talk to the database is prefixed

262
00:08:25,560 --> 00:08:31,139
with an a aget afirst asave aupdate

263
00:08:29,220 --> 00:08:32,760
if you ever miss an A and call the

264
00:08:31,139 --> 00:08:34,380
synchronous method in an async context

265
00:08:32,760 --> 00:08:35,880
Django will raise an exception to remind

266
00:08:34,380 --> 00:08:38,459
you hey you really need to call the

267
00:08:35,880 --> 00:08:39,719
async version of this so don't worry too

268
00:08:38,459 --> 00:08:41,520
much about that

269
00:08:39,719 --> 00:08:43,560
there are two big limitations left in

270
00:08:41,520 --> 00:08:45,480
the async ORM transactions don't

271
00:08:43,560 --> 00:08:47,820
work in async code and queries can't

272
00:08:45,480 --> 00:08:49,680
overlap talk about the second one first

273
00:08:47,820 --> 00:08:51,600
the lack of overlap is mostly an

274
00:08:49,680 --> 00:08:53,339
internal detail so you can run multiple

275
00:08:51,600 --> 00:08:54,120
concurrent queries at the asyncio

276
00:08:53,339 --> 00:08:55,980
level

277
00:08:54,120 --> 00:08:58,019
set up a you know an asyncio gather

278
00:08:55,980 --> 00:08:59,279
whatever you want to do with it but from

279
00:08:58,019 --> 00:09:00,360
the database's point of view it's going

280
00:08:59,279 --> 00:09:01,740
to see those queries sequentially

281
00:09:00,360 --> 00:09:03,720
because they are run on a single

282
00:09:01,740 --> 00:09:05,880
background worker thread the end result

283
00:09:03,720 --> 00:09:06,899
of this is that you do not yet get a

284
00:09:05,880 --> 00:09:10,440
performance benefit from running

285
00:09:06,899 --> 00:09:12,420
multiple SQL queries in parallel

286
00:09:10,440 --> 00:09:14,459
async transactions require a little bit

287
00:09:12,420 --> 00:09:17,339
more complexity than usual so if we just

288
00:09:14,459 --> 00:09:20,399
naively did transactions in async code

289
00:09:17,339 --> 00:09:21,959
we run the risk of multiple database

290
00:09:20,399 --> 00:09:23,820
queries getting interleaved into the

291
00:09:21,959 --> 00:09:25,980
transaction so we have to pull it all

292
00:09:23,820 --> 00:09:28,200
into one synchronous block and run it

293
00:09:25,980 --> 00:09:29,940
using asgiref's sync_to_async helper

294
00:09:28,200 --> 00:09:32,820
this is definitely something the Django

295
00:09:29,940 --> 00:09:34,800
team is looking to improve uh in very

296
00:09:32,820 --> 00:09:36,120
recent releases of some of the

297
00:09:34,800 --> 00:09:37,800
underlying database

298
00:09:36,120 --> 00:09:39,600
libraries like psycopg 3 we've started

299
00:09:37,800 --> 00:09:41,519
seeing true async support that should

300
00:09:39,600 --> 00:09:44,519
open a lot of doors for better support

301
00:09:41,519 --> 00:09:46,140
in the async orm
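
A stand-in sketch of "pull the whole transaction into one synchronous block and run it from async code". It is not Django code: it uses `sqlite3` and `asyncio.to_thread` so it runs on its own, where a Django project would wrap a function using `transaction.atomic()` with asgiref's `sync_to_async`. The table and function names are hypothetical.

```python
import asyncio
import sqlite3


def replace_drops_sync(db_path: str, item_id: int, monsters: list) -> None:
    """Swap all drop rows for one item inside a single transaction."""
    conn = sqlite3.connect(db_path)
    try:
        # `with conn` commits if the block succeeds and rolls back if it
        # raises, so the delete and the inserts succeed or fail together.
        with conn:
            conn.execute("DELETE FROM item_drop WHERE item_id = ?", (item_id,))
            conn.executemany(
                "INSERT INTO item_drop (item_id, monster) VALUES (?, ?)",
                [(item_id, m) for m in monsters],
            )
    finally:
        conn.close()


async def replace_drops(db_path: str, item_id: int, monsters: list) -> None:
    # Every query of the transaction runs in the same worker thread call,
    # so no other coroutine's queries can be interleaved into it.
    # Django equivalent (roughly): await sync_to_async(replace_drops_sync)(...)
    await asyncio.to_thread(replace_drops_sync, db_path, item_id, monsters)
```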

302
00:09:44,519 --> 00:09:47,580
because extracting from web servers is

303
00:09:46,140 --> 00:09:49,140
such a common extractor you will very

304
00:09:47,580 --> 00:09:51,720
likely need an async compatible HTTP

305
00:09:49,140 --> 00:09:53,580
client my recommendation is httpx it's

306
00:09:51,720 --> 00:09:55,140
got a really simple API it's got great

307
00:09:53,580 --> 00:09:56,940
testing support via another package

308
00:09:55,140 --> 00:09:58,080
called RESPX and it's generally

309
00:09:56,940 --> 00:10:00,000
going to be really familiar to anyone

310
00:09:58,080 --> 00:10:02,399
that's used requests before there's

311
00:10:00,000 --> 00:10:04,800
another one available called aiohttp it

312
00:10:02,399 --> 00:10:07,440
has better sort of raw HTTP performance

313
00:10:04,800 --> 00:10:10,260
but it doesn't support HTTP/2 so that kind

314
00:10:07,440 --> 00:10:12,959
of negates the benefit of being faster

315
00:10:10,260 --> 00:10:15,180
also unlike the ORM both of these can

316
00:10:12,959 --> 00:10:17,279
fully overlap requests so you can speed

317
00:10:15,180 --> 00:10:19,080
up fetching by running multiple

318
00:10:17,279 --> 00:10:20,519
gets or posts or whatever in parallel

319
00:10:19,080 --> 00:10:22,019
though of course make sure that you do

320
00:10:20,519 --> 00:10:24,300
not overload the server that is on the

321
00:10:22,019 --> 00:10:25,740
other side of things
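
One way to overlap requests without overloading the other side is to cap how many are in flight. This sketch uses `asyncio.gather` with a semaphore; `fetch_all` and its `fetch` parameter are hypothetical, and `fetch` stands in for whatever async HTTP call you use (with HTTPX that could be the `get` method of an `httpx.AsyncClient`).

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrent: int = 5) -> list:
    """Fetch every URL concurrently with at most `max_concurrent` in flight."""
    limit = asyncio.Semaphore(max_concurrent)

    async def fetch_one(url):
        # Waits here whenever `max_concurrent` requests are already running.
        async with limit:
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch_one(url) for url in urls))
```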

322
00:10:24,300 --> 00:10:26,760
I know showing code on slides is a bit

323
00:10:25,740 --> 00:10:28,080
questionable but I want to go through

324
00:10:26,760 --> 00:10:29,339
some examples from my case study

325
00:10:28,080 --> 00:10:31,500
projects to show you just how simple

326
00:10:29,339 --> 00:10:34,560
this can really be in practice

327
00:10:31,500 --> 00:10:37,200
so this is the most basic version of ETL

328
00:10:34,560 --> 00:10:38,940
in Django make an HTTP request and then

329
00:10:37,200 --> 00:10:40,680
throw things into a database we wrap

330
00:10:38,940 --> 00:10:43,440
this using that looping call helper that

331
00:10:40,680 --> 00:10:45,540
we saw before and congrats that's a mini

332
00:10:43,440 --> 00:10:47,820
ETL system and maybe this is enough for

333
00:10:45,540 --> 00:10:49,920
you I have plenty of things in this ETL

334
00:10:47,820 --> 00:10:51,600
for the farm RPG project that look

335
00:10:49,920 --> 00:10:52,920
exactly like this

336
00:10:51,600 --> 00:10:54,079
um maybe with a little bit more error

337
00:10:52,920 --> 00:10:56,640
handling but that doesn't fit on the slide

338
00:10:54,079 --> 00:10:58,500
but maybe you want to make things more

339
00:10:56,640 --> 00:11:00,899
complex the nice thing about this being

340
00:10:58,500 --> 00:11:02,519
in code is that rather than being built

341
00:11:00,899 --> 00:11:04,740
out of dozens of different microservice

342
00:11:02,519 --> 00:11:08,000
config files we can adapt adjust and

343
00:11:04,740 --> 00:11:08,000
improve it really easily

344
00:11:08,579 --> 00:11:12,240
one example would be using

345
00:11:10,140 --> 00:11:13,800
Django REST framework to parse out the

346
00:11:12,240 --> 00:11:15,899
incoming data this is really useful if

347
00:11:13,800 --> 00:11:17,880
you're getting back deeply nested data

348
00:11:15,899 --> 00:11:18,899
that contains multiple sub-objects and

349
00:11:17,880 --> 00:11:21,540
you want to parse those into different

350
00:11:18,899 --> 00:11:23,880
tables DRF can handle that for you

351
00:11:21,540 --> 00:11:26,579
another thing to note DRF doesn't handle

352
00:11:23,880 --> 00:11:29,180
async on its own yet but sync_to_async

353
00:11:26,579 --> 00:11:29,180
has us covered

354
00:11:29,760 --> 00:11:33,480
another really common pattern in a lot

355
00:11:31,560 --> 00:11:34,980
of ETL systems is wanting to clean up

356
00:11:33,480 --> 00:11:36,959
values in the database that no longer

357
00:11:34,980 --> 00:11:39,480
exist Upstream so here's a really simple

358
00:11:36,959 --> 00:11:41,399
and honestly not very scalable version

359
00:11:39,480 --> 00:11:43,440
of that this will work fine up to a few

360
00:11:41,399 --> 00:11:44,760
thousand rows after that point it gets

361
00:11:43,440 --> 00:11:46,800
more complicated wouldn't really fit on

362
00:11:44,760 --> 00:11:48,839
a slide anymore but for smaller stuff

363
00:11:46,800 --> 00:11:51,360
this is great this is all you need for

364
00:11:48,839 --> 00:11:53,660
please keep me in sync with an upstream

365
00:11:51,360 --> 00:11:53,660
server
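
The slide code is not in the transcript, so this is only a guess at the shape of such a sync: compare primary keys upstream and locally, then create, update, or delete. `plan_sync` is a hypothetical helper, and like the version described it holds everything in memory, so it suits a few thousand rows at most.

```python
def plan_sync(upstream: dict, local: dict) -> tuple:
    """Work out what to change so `local` matches `upstream`.

    Both arguments map primary key -> row data. Returns rows to create,
    rows to update, and the primary keys to delete.
    """
    to_create = {pk: row for pk, row in upstream.items() if pk not in local}
    to_update = {
        pk: row for pk, row in upstream.items() if pk in local and local[pk] != row
    }
    # Anything we have locally that no longer exists upstream gets cleaned up.
    to_delete = sorted(local.keys() - upstream.keys())
    return to_create, to_update, to_delete
```

In Django the delete step would then be a single filtered delete on the model using the returned keys.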

366
00:11:53,700 --> 00:11:57,540
but all right setting things up in ready

367
00:11:55,500 --> 00:11:59,100
callbacks is a great trick and now you

368
00:11:57,540 --> 00:12:00,180
know that uh but sometimes we want

369
00:11:59,100 --> 00:12:02,700
something a little bit simpler something

370
00:12:00,180 --> 00:12:04,620
more like celery's uh task decorator

371
00:12:02,700 --> 00:12:06,300
fortunately Django has a helper method

372
00:12:04,620 --> 00:12:08,459
for this uh it is called autodiscover_modules

373
00:12:06,300 --> 00:12:09,959
it is a little bit complex to

374
00:12:08,459 --> 00:12:12,420
use but once you know how to use it it

375
00:12:09,959 --> 00:12:14,880
is great so it all starts with you need

376
00:12:12,420 --> 00:12:16,320
a field somewhere called underscore

377
00:12:14,880 --> 00:12:18,839
registry it must be called exactly

378
00:12:16,320 --> 00:12:21,480
underscore registry uh

379
00:12:18,839 --> 00:12:23,700
you then pass to autodiscover_modules

380
00:12:21,480 --> 00:12:26,220
the sub module inside each app that you

381
00:12:23,700 --> 00:12:28,440
want to look for and the place that the

382
00:12:26,220 --> 00:12:30,240
underscore registry object exists it

383
00:12:28,440 --> 00:12:32,339
will iterate over all of your Django

384
00:12:30,240 --> 00:12:33,540
apps look for that sub module and make

385
00:12:32,339 --> 00:12:36,120
sure that even if there are loading

386
00:12:33,540 --> 00:12:38,519
errors it does not corrupt the registry

387
00:12:36,120 --> 00:12:40,500
so if we want to use this we can make a

388
00:12:38,519 --> 00:12:42,360
decorator like we show here that adds

389
00:12:40,500 --> 00:12:44,700
things to the registry and we can make a

390
00:12:42,360 --> 00:12:46,740
single ready callback that Loops through

391
00:12:44,700 --> 00:12:48,959
the registry and calls create task and

392
00:12:46,740 --> 00:12:52,560
boom we've got exactly the same thing as

393
00:12:48,959 --> 00:12:54,480
Celery's @task decorator
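
A sketch of the registry-plus-decorator pattern in plain asyncio. `TaskRegistry`, `task`, `start_all`, and `scrape_items` are hypothetical names; only the `_registry` attribute name comes from the talk. In a Django project, `autodiscover_modules` from `django.utils.module_loading` would be what imports each app's task module so the decorators run.

```python
import asyncio


class TaskRegistry:
    """Collects coroutine functions that should run on a fixed interval."""

    def __init__(self):
        # Django's autodiscover_modules expects exactly this attribute name
        # on the object passed as `register_to`.
        self._registry = {}  # task name -> (coroutine function, interval)

    def task(self, interval: float):
        """Decorator that records the function and returns it unchanged."""

        def decorator(func):
            self._registry[func.__name__] = (func, interval)
            return func

        return decorator

    async def _loop(self, func, interval: float) -> None:
        while True:
            await func()
            await asyncio.sleep(interval)

    def start_all(self) -> list:
        """Create one background task per entry; needs a running event loop."""
        return [
            asyncio.create_task(self._loop(func, interval))
            for func, interval in self._registry.values()
        ]


registry = TaskRegistry()


@registry.task(interval=30)
async def scrape_items():
    """Placeholder extractor; a real one would fetch and store data."""


# In Django, roughly (inside a single AppConfig.ready):
#     autodiscover_modules("tasks", register_to=registry)
#     registry.start_all()
```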

394
00:12:52,560 --> 00:12:56,399
having a loop that runs every 30 seconds

395
00:12:54,480 --> 00:12:58,740
is really good for some cases and maybe

396
00:12:56,399 --> 00:13:00,480
we could make that 60 seconds or 100

397
00:12:58,740 --> 00:13:01,620
seconds or how many seconds we need but

398
00:13:00,480 --> 00:13:03,480
sometimes we want more complex

399
00:13:01,620 --> 00:13:05,399
scheduling something like we'd get from

400
00:13:03,480 --> 00:13:07,320
cron fortunately there's a fantastic

401
00:13:05,399 --> 00:13:09,000
library for this it is called croniter

402
00:13:07,320 --> 00:13:10,980
and it handles all of the math around

403
00:13:09,000 --> 00:13:13,380
timing all you need to do is track for

404
00:13:10,980 --> 00:13:17,279
each task the cron spec the like the

405
00:13:13,380 --> 00:13:19,200
string of star star whatever and the

406
00:13:17,279 --> 00:13:20,519
last run time for the task you can store

407
00:13:19,200 --> 00:13:22,920
that in a database model you can store

408
00:13:20,519 --> 00:13:25,560
it in memory whatever you want make a

409
00:13:22,920 --> 00:13:28,200
single ready callback that runs it Loops

410
00:13:25,560 --> 00:13:30,120
every second or every minute checks it

411
00:13:28,200 --> 00:13:31,920
passes the cron spec and the last

412
00:13:30,120 --> 00:13:34,620
runtime into croniter it'll tell you

413
00:13:31,920 --> 00:13:36,420
the next runtime that is calculated for

414
00:13:34,620 --> 00:13:39,120
that cron spec if that's in the past

415
00:13:36,420 --> 00:13:41,760
run a new version and you know update

416
00:13:39,120 --> 00:13:43,740
the last run time for it boom you've got

417
00:13:41,760 --> 00:13:45,959
a simple cron system this does mean

418
00:13:43,740 --> 00:13:47,639
duplicating some logic that a real cron

419
00:13:45,959 --> 00:13:49,620
system or celery beat would get you for

420
00:13:47,639 --> 00:13:51,480
free but in practice this is about 10

421
00:13:49,620 --> 00:13:54,380
lines of code and it saves you a ton of

422
00:13:51,480 --> 00:13:54,380
organizational complexity
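
The scheduling check described here could look roughly like this. `due_tasks` and its arguments are hypothetical; `next_after` is passed in so the sketch runs without third-party packages, and the comment shows how croniter could supply it.

```python
from datetime import datetime


def due_tasks(schedule: dict, last_run: dict, now: datetime, next_after) -> list:
    """Return the names of tasks whose next scheduled time has passed.

    schedule:   task name -> cron spec (for example "*/5 * * * *")
    last_run:   task name -> datetime of the previous run (updated in place)
    next_after: callable (spec, last) -> next scheduled datetime after `last`
    """
    due = []
    for name, spec in schedule.items():
        if next_after(spec, last_run[name]) <= now:
            due.append(name)
            last_run[name] = now  # record the run so it is not repeated
    return due


# With the croniter package installed, `next_after` could be:
#     from croniter import croniter
#     next_after = lambda spec, last: croniter(spec, last).get_next(datetime)
```

A single looping task would call `due_tasks` every second or every minute and launch whatever it returns.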

423
00:13:54,660 --> 00:13:57,959
ETL as an acronym technically only

424
00:13:55,920 --> 00:13:59,100
covers the ingestion but really if we

425
00:13:57,959 --> 00:14:00,540
are doing this we're probably going to

426
00:13:59,100 --> 00:14:03,420
do something with the data and that

427
00:14:00,540 --> 00:14:05,399
thing is usually querying it now those

428
00:14:03,420 --> 00:14:06,839
could be SQL queries but again probably

429
00:14:05,399 --> 00:14:08,940
all of you know how to use the orm

430
00:14:06,839 --> 00:14:09,839
already at least to a basic extent so

431
00:14:08,940 --> 00:14:10,980
let's look at something a little bit

432
00:14:09,839 --> 00:14:13,200
more fun

433
00:14:10,980 --> 00:14:14,540
I want to talk about graphql but first I

434
00:14:13,200 --> 00:14:17,459
have to again do some disclaimers

435
00:14:14,540 --> 00:14:19,260
graphql does not scale it is phenomenal

436
00:14:17,459 --> 00:14:22,260
for small scale systems I love it and I

437
00:14:19,260 --> 00:14:23,700
will talk about why but it is difficult

438
00:14:22,260 --> 00:14:25,019
verging on impossible to get good

439
00:14:23,700 --> 00:14:27,480
performance out of it when you are

440
00:14:25,019 --> 00:14:29,220
dealing with a large scale system if

441
00:14:27,480 --> 00:14:31,139
your queries are super cache friendly

442
00:14:29,220 --> 00:14:33,839
that can be a solution but otherwise

443
00:14:31,139 --> 00:14:35,459
beware the dragons

444
00:14:33,839 --> 00:14:37,079
I can't possibly go over everything that

445
00:14:35,459 --> 00:14:38,399
GraphQL includes because that's a whole

446
00:14:37,079 --> 00:14:41,699
other conference talk but the really

447
00:14:38,399 --> 00:14:43,500
quick version graphql queries take a set

448
00:14:41,699 --> 00:14:45,660
of nested fields that you would like to

449
00:14:43,500 --> 00:14:47,459
retrieve usually there's going to be a

450
00:14:45,660 --> 00:14:50,639
top level field which is the equivalent

451
00:14:47,459 --> 00:14:52,320
of a table in SQL and then you give it

452
00:14:50,639 --> 00:14:53,880
some filters if you don't want to get

453
00:14:52,320 --> 00:14:55,560
all the objects and then the fields

454
00:14:53,880 --> 00:14:57,899
inside that object that you would like

455
00:14:55,560 --> 00:14:59,699
to retrieve the key here is that that

456
00:14:57,899 --> 00:15:01,980
can be nested so you're not just asking

457
00:14:59,699 --> 00:15:04,800
for the columns on one table you can

458
00:15:01,980 --> 00:15:07,139
recurse through to other tables

459
00:15:04,800 --> 00:15:09,360
in Django terms this means that we can

460
00:15:07,139 --> 00:15:11,040
look at either columns or foreign Keys

461
00:15:09,360 --> 00:15:13,560
many-to-manys or reverse managers as

462
00:15:11,040 --> 00:15:14,160
fields so here's an example

463
00:15:13,560 --> 00:15:16,260
um

464
00:15:14,160 --> 00:15:18,839
if graphql has all these sharp downsides

465
00:15:16,260 --> 00:15:21,480
why do I like it at all it is super cool

466
00:15:18,839 --> 00:15:23,279
for deeply interlinked data so this is a

467
00:15:21,480 --> 00:15:25,980
graphql query that is looking at three

468
00:15:23,279 --> 00:15:28,079
different SQL tables items quests and a

469
00:15:25,980 --> 00:15:30,300
through table between them so we want to

470
00:15:28,079 --> 00:15:33,300
get name image and value for all items

471
00:15:30,300 --> 00:15:35,220
and then for every item look at all of

472
00:15:33,300 --> 00:15:36,899
the quests that use it get the title

473
00:15:35,220 --> 00:15:39,060
image and text of that Quest as well as

474
00:15:36,899 --> 00:15:42,240
the quantity used in that Quest and we

475
00:15:39,060 --> 00:15:44,579
can do all of that in one generic query

476
00:15:42,240 --> 00:15:46,560
we could of course make an API endpoint

477
00:15:44,579 --> 00:15:47,940
for this we could write some ORM code and

478
00:15:46,560 --> 00:15:51,300
it would be fairly simple like doing

479
00:15:47,940 --> 00:15:53,279
this in a Django view not a big deal but

480
00:15:51,300 --> 00:15:55,500
the idea of graphql is what if I don't

481
00:15:53,279 --> 00:15:57,959
want to write a dedicated view for every

482
00:15:55,500 --> 00:15:59,459
type of query I want to run

483
00:15:57,959 --> 00:16:01,680
we're going to make some trade-offs

484
00:15:59,459 --> 00:16:04,500
there are performance issues as I keep

485
00:16:01,680 --> 00:16:06,600
mentioning uh but it's a balance you

486
00:16:04,500 --> 00:16:09,360
know we don't have to let the clients

487
00:16:06,600 --> 00:16:11,339
deal with all of the join data or give

488
00:16:09,360 --> 00:16:14,279
them too much information

489
00:16:11,339 --> 00:16:15,959
but the flip side of that is the cost: it

490
00:16:14,279 --> 00:16:17,459
will be less performant than a

491
00:16:15,959 --> 00:16:20,279
handwritten query

492
00:16:17,459 --> 00:16:21,839
but in return we get a single view that

493
00:16:20,279 --> 00:16:24,240
can answer basically any question that

494
00:16:21,839 --> 00:16:25,680
we want to ask of our data

495
00:16:24,240 --> 00:16:27,420
other than large data sets though

496
00:16:25,680 --> 00:16:29,160
there's two other major places to avoid

497
00:16:27,420 --> 00:16:30,899
graphql so one is if you want to answer

498
00:16:29,160 --> 00:16:33,320
numeric queries and the other is poorly

499
00:16:30,899 --> 00:16:35,399
linked data on the numeric queries

500
00:16:33,320 --> 00:16:36,959
graphql gives you lots of stuff to

501
00:16:35,399 --> 00:16:38,459
control which fields are included as we

502
00:16:36,959 --> 00:16:40,860
just saw but what it doesn't have is

503
00:16:38,459 --> 00:16:43,019
sql's numeric aggregation support so for

504
00:16:40,860 --> 00:16:45,779
example I could say give me the value of

505
00:16:43,019 --> 00:16:48,360
all items of type Foo super easy query

506
00:16:45,779 --> 00:16:50,040
what I cannot do is tell it to give me

507
00:16:48,360 --> 00:16:51,779
the average value of all of those you'd

508
00:16:50,040 --> 00:16:54,000
have to compute that client side

509
00:16:51,779 --> 00:16:55,980
so if we loop back to talking about the

510
00:16:54,000 --> 00:16:58,019
transforms this is where you might want

511
00:16:55,980 --> 00:17:00,060
to instead pre-calculate those during

512
00:16:58,019 --> 00:17:01,860
the transform process and store them in

513
00:17:00,060 --> 00:17:05,939
their own database model say average

514
00:17:01,860 --> 00:17:08,160
value for type Foo is five and then you

515
00:17:05,939 --> 00:17:09,919
can expose that model through graphql

516
00:17:08,160 --> 00:17:12,720
use that instead

517
00:17:09,919 --> 00:17:13,799
for disconnected tables that would be

518
00:17:12,720 --> 00:17:15,120
things where there's just not a lot of

519
00:17:13,799 --> 00:17:16,980
foreign Keys there's not a lot of links

520
00:17:15,120 --> 00:17:18,360
between the tables so graphql is not

521
00:17:16,980 --> 00:17:20,100
really getting you anything like sure

522
00:17:18,360 --> 00:17:21,839
you can use it but there's way easier

523
00:17:20,100 --> 00:17:23,880
ways to build generic views for a single

524
00:17:21,839 --> 00:17:25,500
table don't don't burden yourself with

525
00:17:23,880 --> 00:17:27,059
this for that

526
00:17:25,500 --> 00:17:28,980
the best tool I found for graphql and

527
00:17:27,059 --> 00:17:30,240
Django is Strawberry: the core library

528
00:17:28,980 --> 00:17:32,220
implements schema management and

529
00:17:30,240 --> 00:17:34,320
dataflow and then strawberry Django adds

530
00:17:32,220 --> 00:17:36,059
adapters for loading data using the ORM

531
00:17:34,320 --> 00:17:37,980
and dealing with schema definitions for

532
00:17:36,059 --> 00:17:39,360
Django model types you'll see some

533
00:17:37,980 --> 00:17:40,980
guides referencing strawberry Django

534
00:17:39,360 --> 00:17:42,360
plus which was an enhancement library on

535
00:17:40,980 --> 00:17:45,000
top of these but it has been merged back

536
00:17:42,360 --> 00:17:47,039
to core so you don't need it anymore

537
00:17:45,000 --> 00:17:48,780
a Strawberry schema defines the top

538
00:17:47,039 --> 00:17:51,720
level of the query namespace just like a

539
00:17:48,780 --> 00:17:53,280
root urls.py does for HTTP it takes a

540
00:17:51,720 --> 00:17:55,080
root query type and that references

541
00:17:53,280 --> 00:17:56,520
other types those reference other types

542
00:17:55,080 --> 00:17:58,260
and each other and that slowly builds

543
00:17:56,520 --> 00:18:00,120
out the web of what can be queried

544
00:17:58,260 --> 00:18:01,740
through graphql

545
00:18:00,120 --> 00:18:02,940
a slight annoyance of Strawberry is

546
00:18:01,740 --> 00:18:04,500
having to restate all of our model

547
00:18:02,940 --> 00:18:06,240
definitions as Strawberry types but at

548
00:18:04,500 --> 00:18:07,919
least the majority of things can be

549
00:18:06,240 --> 00:18:09,059
inferred automatically from the Django

550
00:18:07,919 --> 00:18:10,919
model the only place where we need to

551
00:18:09,059 --> 00:18:14,100
get explicit is either implicit Fields

552
00:18:10,919 --> 00:18:16,679
like ID or interlinks between types

553
00:18:14,100 --> 00:18:19,200
where we have to give it the python type

554
00:18:16,679 --> 00:18:20,880
to link to

555
00:18:19,200 --> 00:18:22,080
for structuring these things I usually

556
00:18:20,880 --> 00:18:24,660
like to do it the same way we do with

557
00:18:22,080 --> 00:18:26,460
urls.py so I keep the graphql types in

558
00:18:24,660 --> 00:18:29,360
each app and then I import them into the

559
00:18:26,460 --> 00:18:29,360
big core query

560
00:18:29,460 --> 00:18:33,000
graphql doesn't offer much in terms of

561
00:18:31,260 --> 00:18:35,100
data slice and dice but there's a little

562
00:18:33,000 --> 00:18:36,720
bit so filters allow relatively basic

563
00:18:35,100 --> 00:18:39,120
WHERE checks we saw an example of that

564
00:18:36,720 --> 00:18:40,799
before orders let you set the sort order

565
00:18:39,120 --> 00:18:42,960
of the results although as a warning

566
00:18:40,799 --> 00:18:44,100
they are buggy in the latest release of

567
00:18:42,960 --> 00:18:46,980
strawberry Django and they will

568
00:18:44,100 --> 00:18:49,860
absolutely wreck your query performance

569
00:18:46,980 --> 00:18:51,419
and speaking of query performance we

570
00:18:49,860 --> 00:18:52,860
should maybe look at what happens when

571
00:18:51,419 --> 00:18:55,320
you actually run some of these so here's

572
00:18:52,860 --> 00:18:57,240
our simple query again

573
00:18:55,320 --> 00:18:58,679
and this is what it looks like in Django

574
00:18:57,240 --> 00:19:00,600
debug toolbar for those of you in the

575
00:18:58,679 --> 00:19:02,400
back who can't read this the first line

576
00:19:00,600 --> 00:19:03,660
is relatively simple it is running a

577
00:19:02,400 --> 00:19:04,860
select against the items table and

578
00:19:03,660 --> 00:19:07,740
pulling out the columns that we want

579
00:19:04,860 --> 00:19:09,600
totally normal that second line is where

580
00:19:07,740 --> 00:19:10,980
we get the problem it is querying

581
00:19:09,600 --> 00:19:13,380
against the quests in the through table

582
00:19:10,980 --> 00:19:15,600
but with an enormous item ID IN

583
00:19:13,380 --> 00:19:18,000
condition and what if we add a couple

584
00:19:15,600 --> 00:19:21,059
more Fields into our query

585
00:19:18,000 --> 00:19:22,200
this gets complicated very fast so this

586
00:19:21,059 --> 00:19:23,880
is running on my development server

587
00:19:22,200 --> 00:19:25,980
where each of these tables only has a

588
00:19:23,880 --> 00:19:27,660
few hundred rows if there were a million

589
00:19:25,980 --> 00:19:29,880
rows in each of these tables you can see

590
00:19:27,660 --> 00:19:31,980
why this gets to be a problem

591
00:19:29,880 --> 00:19:34,020
In fairness this only took 200

592
00:19:31,980 --> 00:19:36,780
milliseconds so like this isn't a huge

593
00:19:34,020 --> 00:19:40,140
problem even at relatively small

594
00:19:36,780 --> 00:19:42,059
scales it's fine at you know a couple

595
00:19:40,140 --> 00:19:47,360
thousand rows per table still no problem

596
00:19:42,059 --> 00:19:47,360
but watch it be careful with it uh

597
00:19:47,460 --> 00:19:50,940
before we stop talking about strawberry

598
00:19:48,960 --> 00:19:52,799
one more word of warning their async

599
00:19:50,940 --> 00:19:55,380
Django view is also currently a bit

600
00:19:52,799 --> 00:19:57,720
funky and gets intermittent data errors

601
00:19:55,380 --> 00:19:58,860
just use the synchronous view because of

602
00:19:57,720 --> 00:20:01,080
that thing that I mentioned before where

603
00:19:58,860 --> 00:20:02,520
you cannot overlap database queries

604
00:20:01,080 --> 00:20:04,320
there is not actually a performance

605
00:20:02,520 --> 00:20:07,200
benefit as far as I can tell to using

606
00:20:04,320 --> 00:20:09,419
the strawberry async view so hopefully

607
00:20:07,200 --> 00:20:10,679
that will be fixed soon though

608
00:20:09,419 --> 00:20:12,660
all right

609
00:20:10,679 --> 00:20:14,580
back to ETL stuff a particular place

610
00:20:12,660 --> 00:20:16,740
where ETL and graphql combine really

611
00:20:14,580 --> 00:20:18,179
well is static site generators Gatsby

612
00:20:16,740 --> 00:20:19,620
has deep native support for it and

613
00:20:18,179 --> 00:20:22,140
Pelican allows you to slot this in

614
00:20:19,620 --> 00:20:24,480
really easily as an HTTP data source

615
00:20:22,140 --> 00:20:26,700
so you can set up your static site

616
00:20:24,480 --> 00:20:28,679
generator set up a build running say

617
00:20:26,700 --> 00:20:30,120
every hour every day in your CI system

618
00:20:28,679 --> 00:20:33,299
of choice and you've got a really easy

619
00:20:30,120 --> 00:20:34,740
way to build dashboards for your data

620
00:20:33,299 --> 00:20:36,539
and another really useful feature of

621
00:20:34,740 --> 00:20:39,120
graphql is making queries and listening

622
00:20:36,539 --> 00:20:40,980
to live updates so with your dashboards

623
00:20:39,120 --> 00:20:43,140
this lets you immediately slot in

624
00:20:40,980 --> 00:20:44,940
automatic updates to your graphs without

625
00:20:43,140 --> 00:20:46,320
a very big code footprint for this to

626
00:20:44,940 --> 00:20:49,020
work in Strawberry you do need channels

627
00:20:46,320 --> 00:20:50,580
because we are only targeting a small

628
00:20:49,020 --> 00:20:52,080
server that's running inside a single

629
00:20:50,580 --> 00:20:54,240
process we can use the in-memory Channel

630
00:20:52,080 --> 00:20:55,860
layer although you can also use channels

631
00:20:54,240 --> 00:20:57,059
postgres if you would like check the

632
00:20:55,860 --> 00:20:59,340
strawberry docs you've got to make some

633
00:20:57,059 --> 00:21:00,900
config tweaks for this to work properly

634
00:20:59,340 --> 00:21:02,700
all right graphql certainly useful and

635
00:21:00,900 --> 00:21:04,080
interesting but not very fun what kind

636
00:21:02,700 --> 00:21:05,760
of weird and wonderful stuff can we do

637
00:21:04,080 --> 00:21:08,100
with our little ETL server

638
00:21:05,760 --> 00:21:09,480
integrating chatbots is fun right

639
00:21:08,100 --> 00:21:11,280
so the core of the integration is the

640
00:21:09,480 --> 00:21:12,419
same as we saw with ETL tasks this is

641
00:21:11,280 --> 00:21:14,760
the same general structure you're going

642
00:21:12,419 --> 00:21:17,160
to use for plugging anything in to async

643
00:21:14,760 --> 00:21:18,480
Django you make a Django app you spawn

644
00:21:17,160 --> 00:21:20,760
something from the ready callback and

645
00:21:18,480 --> 00:21:22,620
you go

646
00:21:20,760 --> 00:21:24,179
um a quick note when you are reading

647
00:21:22,620 --> 00:21:25,559
the docs for any async Library make sure

648
00:21:24,179 --> 00:21:27,360
that you know the difference between the

649
00:21:25,559 --> 00:21:28,679
blocking run the thing function and the

650
00:21:27,360 --> 00:21:30,000
underlying async task so what you're

651
00:21:28,679 --> 00:21:32,520
going to find in the tutorials for most

652
00:21:30,000 --> 00:21:34,740
async libraries is something that starts

653
00:21:32,520 --> 00:21:36,059
an event Loop and blocks forever we

654
00:21:34,740 --> 00:21:37,380
don't want that we're running an async

655
00:21:36,059 --> 00:21:38,940
server we've already got an event Loop

656
00:21:37,380 --> 00:21:41,039
going so make sure that you're using the

657
00:21:38,940 --> 00:21:42,659
correct one in the case of discord.py

658
00:21:41,039 --> 00:21:43,980
it's called client.start every library

659
00:21:42,659 --> 00:21:46,260
is going to call these different things

660
00:21:43,980 --> 00:21:47,640
but read the docs very carefully or

661
00:21:46,260 --> 00:21:49,559
you're going to have weird mysterious

662
00:21:47,640 --> 00:21:51,240
failures

663
00:21:49,559 --> 00:21:53,039
all right but with those basics in place

664
00:21:51,240 --> 00:21:55,980
we can make a chat bot that can reach

665
00:21:53,039 --> 00:21:58,140
into our ETL data inside Django and run

666
00:21:55,980 --> 00:21:59,640
things like Dynamic queries based on

667
00:21:58,140 --> 00:22:01,320
chat input

668
00:21:59,640 --> 00:22:02,820
we've got the full power of the orm here

669
00:22:01,320 --> 00:22:05,100
we can pull in any other libraries that

670
00:22:02,820 --> 00:22:06,720
we want all kinds of fun stuff we can

671
00:22:05,100 --> 00:22:08,460
also use it for logging notifications

672
00:22:06,720 --> 00:22:10,320
too or we could combine it with that

673
00:22:08,460 --> 00:22:12,000
cron pattern that we saw before and use

674
00:22:10,320 --> 00:22:14,100
this for sending say nightly reports to

675
00:22:12,000 --> 00:22:15,900
a chat Channel

676
00:22:14,100 --> 00:22:18,299
but chatbots are old news what if we

677
00:22:15,900 --> 00:22:20,039
want to SSH into our ETL server not into

678
00:22:18,299 --> 00:22:23,340
the server it's running on into the

679
00:22:20,039 --> 00:22:25,140
server itself async SSH contains a full

680
00:22:23,340 --> 00:22:27,059
async compatible SSH server

681
00:22:25,140 --> 00:22:29,520
implementation

682
00:22:27,059 --> 00:22:31,620
useful probably not but maybe there's an

683
00:22:29,520 --> 00:22:34,020
edge case where you could justify this

684
00:22:31,620 --> 00:22:35,580
uh and then for some of my projects I go

685
00:22:34,020 --> 00:22:37,799
all the way into the just you couldn't

686
00:22:35,580 --> 00:22:39,840
justify this for a work project

687
00:22:37,799 --> 00:22:41,760
um talking down to Hardware uh I have a

688
00:22:39,840 --> 00:22:42,900
stream deck controller that is async

689
00:22:41,760 --> 00:22:45,179
compatible and it's a lot of fun to play

690
00:22:42,900 --> 00:22:46,919
with with these things and I have a lot

691
00:22:45,179 --> 00:22:49,200
of as I mentioned I work for Ikea so I

692
00:22:46,919 --> 00:22:50,820
have a lot of Ikea iot devices and one

693
00:22:49,200 --> 00:22:52,620
of the ETL systems in my house can

694
00:22:50,820 --> 00:22:56,100
automatically twiddle those

695
00:22:52,620 --> 00:22:58,440
it's silly it's just for fun but cool

696
00:22:56,100 --> 00:23:00,059
uh all right back to reality I spent a

697
00:22:58,440 --> 00:23:01,740
lot of time singing the Praises of small

698
00:23:00,059 --> 00:23:03,480
systems and I will continue to do so but

699
00:23:01,740 --> 00:23:05,700
what if your system starts out small and

700
00:23:03,480 --> 00:23:07,260
then grows Django has you covered

701
00:23:05,700 --> 00:23:09,120
one common problem is there just being

702
00:23:07,260 --> 00:23:10,559
too much data to transform and load in a

703
00:23:09,120 --> 00:23:12,900
single process this is going to come up

704
00:23:10,559 --> 00:23:14,880
where your pre-transform data is huge

705
00:23:12,900 --> 00:23:16,980
and post transform is very small it's

706
00:23:14,880 --> 00:23:18,059
getting reduced compressed whatever it

707
00:23:16,980 --> 00:23:18,600
is

708
00:23:18,059 --> 00:23:20,340
um

709
00:23:18,600 --> 00:23:22,020
so that fitting all of the

710
00:23:20,340 --> 00:23:24,480
pre-transformed data in memory at once

711
00:23:22,020 --> 00:23:26,940
is really hard so simple solution here

712
00:23:24,480 --> 00:23:28,559
Shard your ingest if you have multiple

713
00:23:26,940 --> 00:23:31,080
URLs you can divvy them up: you give

714
00:23:28,559 --> 00:23:33,600
every server an ID and only process the

715
00:23:31,080 --> 00:23:35,460
matching ones on the matching server end

716
00:23:33,600 --> 00:23:36,960
of problem sharded systems can get

717
00:23:35,460 --> 00:23:38,880
really complex with hash rings and

718
00:23:36,960 --> 00:23:41,580
Vector clocks but as I keep saying start

719
00:23:38,880 --> 00:23:43,679
simple build what you need

720
00:23:41,580 --> 00:23:45,179
a fairly common thing for transforms in

721
00:23:43,679 --> 00:23:47,820
an ETL system to want to do is number

722
00:23:45,179 --> 00:23:49,559
crunching working with CPU in Python is

723
00:23:47,820 --> 00:23:51,360
about four talks on its own but the

724
00:23:49,559 --> 00:23:52,860
really short version uh if the call is

725
00:23:51,360 --> 00:23:55,620
something that drops the GIL like NumPy

726
00:23:52,860 --> 00:23:57,840
or PyTorch you can use sync_to_async

727
00:23:55,620 --> 00:23:59,580
with thread_sensitive set to False

728
00:23:57,840 --> 00:24:00,840
under the hood that'll move it into a

729
00:23:59,580 --> 00:24:02,760
background thread where it can chew up

730
00:24:00,840 --> 00:24:04,020
CPU to its heart's content eventually

731
00:24:02,760 --> 00:24:06,299
it'll finish and then it'll transfer

732
00:24:04,020 --> 00:24:07,559
control back to your async function if

733
00:24:06,299 --> 00:24:09,120
it is something that does not drop the

734
00:24:07,559 --> 00:24:11,100
GIL the options are a little bit more

735
00:24:09,120 --> 00:24:12,840
limited you'll want to look at either

736
00:24:11,100 --> 00:24:15,299
the ProcessPoolExecutor from the

737
00:24:12,840 --> 00:24:16,440
concurrent.futures library or aiomultiprocess

738
00:24:15,299 --> 00:24:18,059
which is an async wrapper around

739
00:24:16,440 --> 00:24:19,440
multiprocessing

740
00:24:18,059 --> 00:24:20,640
and if you need to grow beyond all of

741
00:24:19,440 --> 00:24:22,559
this the big tools are still right there

742
00:24:20,640 --> 00:24:24,120
you can still use them maybe you swap

743
00:24:22,559 --> 00:24:25,919
your homegrown sharded loader for some

744
00:24:24,120 --> 00:24:27,720
Hadoop And Hive you've got to rewrite a

745
00:24:25,919 --> 00:24:29,520
couple of Django ORM queries into

746
00:24:27,720 --> 00:24:30,960
PyHive but the rest of your code keeps on

747
00:24:29,520 --> 00:24:32,760
trucking

748
00:24:30,960 --> 00:24:34,799
so to recap what we've talked about here

749
00:24:32,760 --> 00:24:36,059
ETL systems let us move data around pull

750
00:24:34,799 --> 00:24:38,220
out the most important bits we need

751
00:24:36,059 --> 00:24:40,140
async Django isn't perfect but it's

752
00:24:38,220 --> 00:24:43,200
very usable and it's great for building

753
00:24:40,140 --> 00:24:45,480
small scale ETL systems graphql pairs

754
00:24:43,200 --> 00:24:47,880
well with both of them and as a generic

755
00:24:45,480 --> 00:24:49,740
query interface it can give us very easy

756
00:24:47,880 --> 00:24:51,780
queries with some performance issues at

757
00:24:49,740 --> 00:24:53,159
scale async Python has a lot of fun

758
00:24:51,780 --> 00:24:55,860
libraries to add

759
00:24:53,159 --> 00:24:57,419
and there is value in starting small and

760
00:24:55,860 --> 00:24:58,740
simple letting your tool grow with its

761
00:24:57,419 --> 00:25:00,720
use cases rather than investing in

762
00:24:58,740 --> 00:25:04,280
massive complexity up front

763
00:25:00,720 --> 00:25:04,280
thank you very much any questions

764
00:25:08,580 --> 00:25:13,940
I was going to offer just give the mic

765
00:25:10,500 --> 00:25:13,940
directly to Russell but that's fine

766
00:25:22,200 --> 00:25:26,400
a little bit off topic maybe but um do

767
00:25:24,360 --> 00:25:28,140
you have any advice for deploying such

768
00:25:26,400 --> 00:25:30,299
things

769
00:25:28,140 --> 00:25:32,400
I am

770
00:25:30,299 --> 00:25:34,679
Ultra of kubernetes so I am obviously

771
00:25:32,400 --> 00:25:36,960
heavily biased towards it um it is my

772
00:25:34,679 --> 00:25:40,200
weapon of choice for most things is it a

773
00:25:36,960 --> 00:25:42,179
bit heavyweight for a small ETL

774
00:25:40,200 --> 00:25:43,500
so the problem with kubernetes is if it

775
00:25:42,179 --> 00:25:45,539
was the only thing you were doing in

776
00:25:43,500 --> 00:25:49,200
kubernetes absolutely extreme Overkill

777
00:25:45,539 --> 00:25:50,400
way too much to learn uh I do so this

778
00:25:49,200 --> 00:25:52,860
this thing that we've been looking at is

779
00:25:50,400 --> 00:25:54,720
actually deployed in kubernetes on a on

780
00:25:52,860 --> 00:25:56,640
a small hosting company that gives me a

781
00:25:54,720 --> 00:25:58,860
little a little mini server that runs

782
00:25:56,640 --> 00:26:01,080
k3s but it's really easy for me because

783
00:25:58,860 --> 00:26:02,460
I know all of it if you were if your

784
00:26:01,080 --> 00:26:05,039
team doesn't and you were learning it

785
00:26:02,460 --> 00:26:07,500
from scratch again dramatic Overkill

786
00:26:05,039 --> 00:26:08,700
um the difficult thing is that finding

787
00:26:07,500 --> 00:26:11,100
places that are compatible with

788
00:26:08,700 --> 00:26:13,320
long-running processes is difficult this

789
00:26:11,100 --> 00:26:16,140
is not the model of things like say

790
00:26:13,320 --> 00:26:18,480
cloud run or Lambda where they want to

791
00:26:16,140 --> 00:26:19,500
control the the event Loop structure for

792
00:26:18,480 --> 00:26:21,480
you

793
00:26:19,500 --> 00:26:23,400
um you can make it work with those

794
00:26:21,480 --> 00:26:25,940
though and certainly if you can I highly

795
00:26:23,400 --> 00:26:25,940
recommend it

796
00:26:28,140 --> 00:26:30,919
hands

797
00:26:32,520 --> 00:26:37,679
you very subtly dropped a reference to

798
00:26:34,500 --> 00:26:40,260
pep 703 and then didn't mention it

799
00:26:37,679 --> 00:26:43,200
um can you mention it now what impact

800
00:26:40,260 --> 00:26:44,820
is PEP 703 going to have on those

801
00:26:43,200 --> 00:26:47,159
sort of optimization strategies that's a

802
00:26:44,820 --> 00:26:50,100
very good question uh

803
00:26:47,159 --> 00:26:52,320
stay tuned so the pep so pep 703 is

804
00:26:50,100 --> 00:26:54,120
python without a GIL this is very

805
00:26:52,320 --> 00:26:57,120
exciting to a lot of people myself

806
00:26:54,120 --> 00:27:01,140
included uh but it has not yet been

807
00:26:57,120 --> 00:27:02,900
actually accepted so I don't know has it

808
00:27:01,140 --> 00:27:05,159
yes

809
00:27:02,900 --> 00:27:08,700
steering committee said they are going

810
00:27:05,159 --> 00:27:10,919
to accept it they have not accepted it

811
00:27:08,700 --> 00:27:12,539
as they have said that with some changes

812
00:27:10,919 --> 00:27:14,100
they will accept it but I do not yet

813
00:27:12,539 --> 00:27:16,919
know what those changes will be because

814
00:27:14,100 --> 00:27:22,140
they have not accepted it yet

815
00:27:16,919 --> 00:27:24,240
so it will be accepted mostly As is uh

816
00:27:22,140 --> 00:27:26,400
the intent is that there will be a

817
00:27:24,240 --> 00:27:27,840
compile time flag that you can set that

818
00:27:26,400 --> 00:27:31,020
will build python without the GIL this

819
00:27:27,840 --> 00:27:32,820
means that if you just like go into a

820
00:27:31,020 --> 00:27:34,559
standard like Ubuntu server and you run

821
00:27:32,820 --> 00:27:36,539
python you're going to get python with

822
00:27:34,559 --> 00:27:38,279
the GIL same as we've always had it

823
00:27:36,539 --> 00:27:40,740
will be a build time thing you will need

824
00:27:38,279 --> 00:27:42,840
to use a specialized build of python and

825
00:27:40,740 --> 00:27:44,279
probably specialized libraries to some

826
00:27:42,840 --> 00:27:47,940
extent

827
00:27:44,279 --> 00:27:49,799
um but you will get all of the benefits

828
00:27:47,940 --> 00:27:51,659
of real threading like we have had in

829
00:27:49,799 --> 00:27:53,520
say Java or go

830
00:27:51,659 --> 00:27:56,960
exactly how that will end up working is

831
00:27:53,520 --> 00:27:56,960
still a large open question though

832
00:27:57,559 --> 00:28:03,059
hi when you were talking about uh web

833
00:28:00,299 --> 00:28:04,980
scraping uh you mentioned not breaking

834
00:28:03,059 --> 00:28:07,919
the terms and conditions of a website

835
00:28:04,980 --> 00:28:09,539
yes uh are you from what are the legal

836
00:28:07,919 --> 00:28:14,100
implications of breaking the terms of

837
00:28:09,539 --> 00:28:16,820
conditions except not being uh like

838
00:28:14,100 --> 00:28:19,380
allowed to access to the website anymore

839
00:28:16,820 --> 00:28:21,179
I really cannot answer that question I

840
00:28:19,380 --> 00:28:22,620
am sorry uh I would feel uncomfortable

841
00:28:21,179 --> 00:28:23,760
answering it I am if the accent

842
00:28:22,620 --> 00:28:27,240
doesn't give away I am not from

843
00:28:23,760 --> 00:28:29,820
Australia I super do not know your uh

844
00:28:27,240 --> 00:28:33,779
computer laws here and I would feel very

845
00:28:29,820 --> 00:28:35,279
poorly equipped to opine uh about things

846
00:28:33,779 --> 00:28:37,260
uh

847
00:28:35,279 --> 00:28:39,600
uh speak to a legal professional at your

848
00:28:37,260 --> 00:28:42,539
company or a friend uh that person

849
00:28:39,600 --> 00:28:46,520
should be more knowledgeable than me

850
00:28:42,539 --> 00:28:46,520
we have time for one more question

851
00:28:47,580 --> 00:28:52,860
no thank you very much everybody yay

852
00:28:50,290 --> 00:28:53,590
[Applause]

853
00:28:52,860 --> 00:28:57,240
thank you

854
00:28:53,590 --> 00:28:57,240
[Applause]