1 00:00:00,480 --> 00:00:03,480 foreign 2 00:00:11,960 --> 00:00:18,240 who is going to be speaking about small 3 00:00:15,120 --> 00:00:22,920 footprint ETL please clap 4 00:00:18,240 --> 00:00:24,779 [Applause] 5 00:00:22,920 --> 00:00:26,699 thank you very much uh hi there I'm Noah 6 00:00:24,779 --> 00:00:28,680 Kantrowitz I'm an SRE at Geomagical Labs 7 00:00:26,699 --> 00:00:31,260 we do computer vision and augmented 8 00:00:28,680 --> 00:00:33,000 reality stuff for IKEA and I'm here to 9 00:00:31,260 --> 00:00:34,680 talk about building small footprint ETL 10 00:00:33,000 --> 00:00:37,620 systems as Katie so succinctly said 11 00:00:34,680 --> 00:00:39,300 using Django in particular but I'm not 12 00:00:37,620 --> 00:00:41,100 here to talk about work 13 00:00:39,300 --> 00:00:42,860 I'll use one of my side projects Farm RPG 14 00:00:41,100 --> 00:00:45,120 it's a free online web and mobile game 15 00:00:42,860 --> 00:00:46,379 and more important for our needs it has 16 00:00:45,120 --> 00:00:48,780 a lot of fun data and I don't have to 17 00:00:46,379 --> 00:00:50,340 run this past a million lawyers 18 00:00:48,780 --> 00:00:51,600 if you're not a big gamer don't worry 19 00:00:50,340 --> 00:00:53,219 the big thing to understand is that 20 00:00:51,600 --> 00:00:55,320 games tend to have highly interconnected 21 00:00:53,219 --> 00:00:56,760 data items drop from monsters and they're 22 00:00:55,320 --> 00:00:59,579 used in recipes and recipes come from 23 00:00:56,760 --> 00:01:01,379 quests etc etc in SQL terms this looks 24 00:00:59,579 --> 00:01:03,780 like every table has at least one 25 00:01:01,379 --> 00:01:05,820 foreign key usually three or four 26 00:01:03,780 --> 00:01:09,420 it's much more of a web structure than 27 00:01:05,820 --> 00:01:10,680 you'd have in a normal REST application 28 00:01:09,420 --> 00:01:12,180 and this isn't really part of the main 29 00:01:10,680 --> 00:01:14,040 topic in case anyone's wondering did he 30 00:01:12,180 --> 00:01:15,780 
really spend a year building an ETL 31 00:01:14,040 --> 00:01:17,280 system for a free internet game yes I 32 00:01:15,780 --> 00:01:19,200 did because it was fun and it's a great 33 00:01:17,280 --> 00:01:21,420 way to learn these kinds of tools many 34 00:01:19,200 --> 00:01:22,920 of which I now use at my day job big 35 00:01:21,420 --> 00:01:24,600 shout out to fun side projects where you 36 00:01:22,920 --> 00:01:26,220 can move at your own speed and no one 37 00:01:24,600 --> 00:01:28,080 worries if you are down for a week 38 00:01:26,220 --> 00:01:29,759 anyway moving on 39 00:01:28,080 --> 00:01:31,259 all right so this talk's about ETL what 40 00:01:29,759 --> 00:01:33,240 does that even mean the core is quite 41 00:01:31,259 --> 00:01:34,259 literal extract data from somewhere run 42 00:01:33,240 --> 00:01:36,360 it through some kind of transformation 43 00:01:34,259 --> 00:01:38,340 and load it into a database 44 00:01:36,360 --> 00:01:39,540 not every ETL is a web scraper but the 45 00:01:38,340 --> 00:01:41,220 two are very similar so we can kind of 46 00:01:39,540 --> 00:01:42,720 think of them in the same terms maybe 47 00:01:41,220 --> 00:01:44,700 instead of an internet website you're 48 00:01:42,720 --> 00:01:46,680 scraping an internal API or maybe it's a 49 00:01:44,700 --> 00:01:47,880 database instead of HTTP but they're all 50 00:01:46,680 --> 00:01:49,200 sort of the same structure if you think 51 00:01:47,880 --> 00:01:50,700 web scraper you're probably in the right 52 00:01:49,200 --> 00:01:52,259 ballpark 53 00:01:50,700 --> 00:01:53,579 to touch on it real briefly a lot of 54 00:01:52,259 --> 00:01:56,340 very fancy folks have been thought 55 00:01:53,579 --> 00:01:58,380 leadering about ELT instead of ETL it's the 56 00:01:56,340 --> 00:02:00,299 same core idea but instead of storing 57 00:01:58,380 --> 00:02:01,799 the transformed data first you store the 58 00:02:00,299 --> 00:02:03,840 raw data so you can re-transform it 59 00:02:01,799 --> 
00:02:05,399 later if you need to if that's a feature 60 00:02:03,840 --> 00:02:07,020 that you need by all means pursue it if 61 00:02:05,399 --> 00:02:08,459 your raw data is very big though that's 62 00:02:07,020 --> 00:02:09,599 going to balloon your complexity and 63 00:02:08,459 --> 00:02:11,099 your storage requirements and since 64 00:02:09,599 --> 00:02:13,379 we're here to talk about small systems I 65 00:02:11,099 --> 00:02:15,120 don't think this is for us 66 00:02:13,379 --> 00:02:16,860 and also because we are talking about 67 00:02:15,120 --> 00:02:18,599 scraping I would be remiss if I did not 68 00:02:16,860 --> 00:02:19,739 remind everyone that hostile scraping is 69 00:02:18,599 --> 00:02:21,420 generally against the terms and 70 00:02:19,739 --> 00:02:23,099 conditions of websites make sure you 71 00:02:21,420 --> 00:02:25,500 have permission before you scrape 72 00:02:23,099 --> 00:02:27,180 any website API or data source that you 73 00:02:25,500 --> 00:02:28,860 don't own if you have questions about 74 00:02:27,180 --> 00:02:31,980 what is allowed please talk to the owner 75 00:02:28,860 --> 00:02:34,680 or a trusted legal professional or both 76 00:02:31,980 --> 00:02:37,260 but all right scrapers aren't all of ETL 77 00:02:34,680 --> 00:02:38,340 we need the T and the L too the transforms 78 00:02:37,260 --> 00:02:40,140 in many of these systems are going to 79 00:02:38,340 --> 00:02:41,580 have two steps first you're going to 80 00:02:40,140 --> 00:02:43,760 want to parse stuff into structured data 81 00:02:41,580 --> 00:02:46,140 if you're lucky in simple cases this is 82 00:02:43,760 --> 00:02:47,340 json.loads if you're not lucky it's 83 00:02:46,140 --> 00:02:49,620 going to be something ugly with 84 00:02:47,340 --> 00:02:51,599 Beautiful Soup or a binary parser or who 85 00:02:49,620 --> 00:02:53,340 knows what then we need to take that 86 00:02:51,599 --> 00:02:54,780 structured data and mold it into a form 87 00:02:53,340 --> 
00:02:56,940 that's going to be more useful for our 88 00:02:54,780 --> 00:02:58,319 queries later this can take the form of 89 00:02:56,940 --> 00:02:59,879 something like SQL normalization or 90 00:02:58,319 --> 00:03:01,620 denormalization meaning breaking things 91 00:02:59,879 --> 00:03:03,120 apart into smaller models or gluing them 92 00:03:01,620 --> 00:03:05,280 together into bigger models or more 93 00:03:03,120 --> 00:03:07,379 mundane stuff like just renaming fields 94 00:03:05,280 --> 00:03:09,660 or combining multiple data sources into 95 00:03:07,379 --> 00:03:10,920 a single model stuff like that 96 00:03:09,660 --> 00:03:13,200 um 97 00:03:10,920 --> 00:03:15,120 the transforms can also sometimes be 98 00:03:13,200 --> 00:03:16,379 doing data aggregation at transform time 99 00:03:15,120 --> 00:03:17,640 we'll talk about this a little bit more 100 00:03:16,379 --> 00:03:19,019 later on some of the trade-offs in this 101 00:03:17,640 --> 00:03:21,780 but if you're being really rigorous 102 00:03:19,019 --> 00:03:23,640 about what ETL means then the transform 103 00:03:21,780 --> 00:03:25,440 would also be doing data collapse and 104 00:03:23,640 --> 00:03:27,540 aggregation as well 105 00:03:25,440 --> 00:03:29,879 but uh 106 00:03:27,540 --> 00:03:31,080 we also need to talk about the L and in 107 00:03:29,879 --> 00:03:32,700 that case it's mostly going to be the 108 00:03:31,080 --> 00:03:34,500 Django ORM which I assume most people 109 00:03:32,700 --> 00:03:36,120 here probably know how to use you could 110 00:03:34,500 --> 00:03:38,459 use more complex stuff like Django REST 111 00:03:36,120 --> 00:03:39,659 framework serializers or Pydantic but in 112 00:03:38,459 --> 00:03:41,700 most cases it's going to be something 113 00:03:39,659 --> 00:03:43,440 along those lines 114 00:03:41,700 --> 00:03:45,480 async in Django has been a long journey 115 00:03:43,440 --> 00:03:47,099 and that journey isn't over yet but 116 00:03:45,480 --> 
00:03:48,420 async Django is great and you can use it 117 00:03:47,099 --> 00:03:50,159 today for real production applications 118 00:03:48,420 --> 00:03:51,540 I'm going to touch a little bit more on 119 00:03:50,159 --> 00:03:53,700 some of the limitations later but the 120 00:03:51,540 --> 00:03:55,620 overall top line thing to uh to know is 121 00:03:53,700 --> 00:03:57,599 that you can really use this I highly 122 00:03:55,620 --> 00:04:00,239 recommend it 123 00:03:57,599 --> 00:04:02,220 why use async Django as the basis for an 124 00:04:00,239 --> 00:04:05,280 ETL system it lets us keep everything in 125 00:04:02,220 --> 00:04:07,019 one code base uh most ETL 126 00:04:05,280 --> 00:04:09,420 systems have a fairly well deserved 127 00:04:07,019 --> 00:04:10,680 reputation for being finicky tools one 128 00:04:09,420 --> 00:04:12,540 system falls out of sync with another 129 00:04:10,680 --> 00:04:14,400 and then the whole pipeline locks up 130 00:04:12,540 --> 00:04:16,799 until some sad on-call engineer gets 131 00:04:14,400 --> 00:04:18,120 paged to come out and fix it having fewer 132 00:04:16,799 --> 00:04:19,680 services means that we have fewer things 133 00:04:18,120 --> 00:04:21,479 that can go wrong and when they do go 134 00:04:19,680 --> 00:04:23,160 wrong we have simpler solutions for them 135 00:04:21,479 --> 00:04:25,320 here we're probably only going to have 136 00:04:23,160 --> 00:04:27,000 two things Django and Postgres and if 137 00:04:25,320 --> 00:04:28,259 Django breaks restart Django and we're 138 00:04:27,000 --> 00:04:30,780 back in business 139 00:04:28,259 --> 00:04:32,340 uh also having things inside one service 140 00:04:30,780 --> 00:04:33,960 makes it a lot easier to move them between 141 00:04:32,340 --> 00:04:35,340 different deployment tools and infrastructure 142 00:04:33,960 --> 00:04:37,380 providers and it makes local development 143 00:04:35,340 --> 00:04:39,000 a whole lot easier 144 00:04:37,380 --> 00:04:40,560 
microservices have been the cool way to do 145 00:04:39,000 --> 00:04:42,120 things for a really long time now and 146 00:04:40,560 --> 00:04:43,740 they are certainly useful when projects 147 00:04:42,120 --> 00:04:45,720 get big and they cross team boundaries 148 00:04:43,740 --> 00:04:47,759 but we're doing small small doesn't need 149 00:04:45,720 --> 00:04:49,320 microservices 150 00:04:47,759 --> 00:04:51,180 you save a lot on the organizational 151 00:04:49,320 --> 00:04:53,520 complexity and on the infrastructure 152 00:04:51,180 --> 00:04:56,060 complexity don't worry about it embrace 153 00:04:53,520 --> 00:04:56,060 the monolith 154 00:04:56,160 --> 00:05:00,300 thank you lots of talks as well as the 155 00:04:59,040 --> 00:05:01,680 Django documentation have 156 00:05:00,300 --> 00:05:03,900 covered the basics of starting an async 157 00:05:01,680 --> 00:05:05,639 app starting with 4.2 it's also 158 00:05:03,900 --> 00:05:08,400 basically the same as starting any 159 00:05:05,639 --> 00:05:10,620 Django app you build your Django 160 00:05:08,400 --> 00:05:12,240 project you start the first app and then 161 00:05:10,620 --> 00:05:13,979 you start adding views when you add the 162 00:05:12,240 --> 00:05:15,540 view you declare it as async def and that's 163 00:05:13,979 --> 00:05:18,060 it you've got yourself an async Django 164 00:05:15,540 --> 00:05:19,560 project all of the core middleware is 165 00:05:18,060 --> 00:05:20,460 async compatible so you only have to 166 00:05:19,560 --> 00:05:22,500 worry about middleware if you start 167 00:05:20,460 --> 00:05:25,380 adding custom stuff and even if you have 168 00:05:22,500 --> 00:05:26,820 a project that is adding non-async 169 00:05:25,380 --> 00:05:29,699 compatible middleware Django will adapt 170 00:05:26,820 --> 00:05:31,199 it for you though at a performance cost 171 00:05:29,699 --> 00:05:32,759 one thing Django doesn't currently 172 00:05:31,199 --> 00:05:34,139 include is an async 
compatible web 173 00:05:32,759 --> 00:05:35,940 server we heard a little bit of this in 174 00:05:34,139 --> 00:05:37,259 the last talk as well fortunately the 175 00:05:35,940 --> 00:05:39,419 community has several for us to pick 176 00:05:37,259 --> 00:05:41,060 from I personally like uvicorn it has 177 00:05:39,419 --> 00:05:44,220 the same sort of reload on change 178 00:05:41,060 --> 00:05:46,500 development flow as the core runserver 179 00:05:44,220 --> 00:05:47,639 development server and you can use it in 180 00:05:46,500 --> 00:05:49,380 production if you just throw an extra 181 00:05:47,639 --> 00:05:51,479 couple flags on there turn on the access 182 00:05:49,380 --> 00:05:53,220 logging turn on TLS stuff like that but 183 00:05:51,479 --> 00:05:56,580 you can use the same project or the same 184 00:05:53,220 --> 00:05:58,259 server in both scenarios 185 00:05:56,580 --> 00:06:01,139 for any ETL system we're going to need 186 00:05:58,259 --> 00:06:02,639 background tasks the normal Django way 187 00:06:01,139 --> 00:06:04,860 to do background tasks is Celery and 188 00:06:02,639 --> 00:06:06,180 Celery beat but that's a whole big set 189 00:06:04,860 --> 00:06:08,100 of things to run you need to run all the 190 00:06:06,180 --> 00:06:10,139 Celery daemons and then a broker maybe a 191 00:06:08,100 --> 00:06:13,259 result store I don't really want to do 192 00:06:10,139 --> 00:06:15,240 that anymore at least if I can help it 193 00:06:13,259 --> 00:06:16,620 a simpler solution is cron with custom 194 00:06:15,240 --> 00:06:18,120 management commands but I think we can 195 00:06:16,620 --> 00:06:19,919 do even better 196 00:06:18,120 --> 00:06:21,180 one big advantage to the async system is 197 00:06:19,919 --> 00:06:23,580 that we can spawn additional background 198 00:06:21,180 --> 00:06:25,800 tasks to run concurrently inside the web 199 00:06:23,580 --> 00:06:27,960 application so let's do that 200 00:06:25,800 --> 00:06:28,860 the core of any 
looping task in async 201 00:06:27,960 --> 00:06:31,020 Python is going to look something like 202 00:06:28,860 --> 00:06:32,699 this run the extractor function sleep 203 00:06:31,020 --> 00:06:34,319 repeat 204 00:06:32,699 --> 00:06:35,580 to integrate this with Django though we 205 00:06:34,319 --> 00:06:37,560 need to hook it into the server startup 206 00:06:35,580 --> 00:06:39,539 process the easiest way to do that is 207 00:06:37,560 --> 00:06:41,639 with the ready callback in AppConfigs 208 00:06:39,539 --> 00:06:42,900 so we want to launch our async task in 209 00:06:41,639 --> 00:06:44,759 the background while Django is 210 00:06:42,900 --> 00:06:46,259 initializing and then let Django 211 00:06:44,759 --> 00:06:48,600 continue initializing as the server 212 00:06:46,259 --> 00:06:49,560 starts up all of our code loads our 213 00:06:48,600 --> 00:06:51,120 stuff will just be running in the 214 00:06:49,560 --> 00:06:53,039 background firing off every 30 seconds 215 00:06:51,120 --> 00:06:54,360 or whatever we told it to create_task 216 00:06:53,039 --> 00:06:56,720 does exactly this thing that we're 217 00:06:54,360 --> 00:06:56,720 looking for 218 00:06:56,880 --> 00:07:00,900 but we do need to talk about some of the 219 00:06:58,319 --> 00:07:02,460 downsides as well uh with Celery and 220 00:07:00,900 --> 00:07:04,620 RabbitMQ we have a much more durable system 221 00:07:02,460 --> 00:07:06,539 if a task gets accepted by the queue 222 00:07:04,620 --> 00:07:09,240 into Celery it's going to run at least 223 00:07:06,539 --> 00:07:12,120 once no matter what barring any really 224 00:07:09,240 --> 00:07:14,280 weird bugs but with this if something 225 00:07:12,120 --> 00:07:15,539 crashes it's just gone okay well we can 226 00:07:14,280 --> 00:07:18,360 add exception handling we can add 227 00:07:15,539 --> 00:07:20,280 retries but still if our process crashes 228 00:07:18,360 --> 00:07:23,460 completely again it's just no longer 229 00:07:20,280 --> 
00:07:25,560 there but this is usually okay in ETL 230 00:07:23,460 --> 00:07:27,900 systems if we miss a scrape one hour 231 00:07:25,560 --> 00:07:29,759 we'll get it the next hour no worries as 232 00:07:27,900 --> 00:07:31,319 long as each scrape is bringing in more 233 00:07:29,759 --> 00:07:32,819 than its interval worth of data so if 234 00:07:31,319 --> 00:07:34,080 you scrape every hour if every scrape is 235 00:07:32,819 --> 00:07:36,539 bringing in more than an hour of data 236 00:07:34,080 --> 00:07:39,000 you've got a buffer for failure 237 00:07:36,539 --> 00:07:41,039 this basically builds in the fault 238 00:07:39,000 --> 00:07:42,960 tolerance for you in cases where you 239 00:07:41,039 --> 00:07:45,419 can't do that you can also build 240 00:07:42,960 --> 00:07:46,560 specific models to track the status of 241 00:07:45,419 --> 00:07:48,780 tasks just like we would have with 242 00:07:46,560 --> 00:07:50,220 RabbitMQ so as an example rather than 243 00:07:48,780 --> 00:07:52,199 having a background task for sending an 244 00:07:50,220 --> 00:07:54,180 email we can instead store a database 245 00:07:52,199 --> 00:07:55,800 row for each pending email we have a 246 00:07:54,180 --> 00:07:57,060 looping task that every few minutes will 247 00:07:55,800 --> 00:07:59,819 go through and try to send everything 248 00:07:57,060 --> 00:08:01,740 that is pending when they succeed they'll 249 00:07:59,819 --> 00:08:04,500 get flushed out of the database 250 00:08:01,740 --> 00:08:07,199 and if not they'll get retried this gives 251 00:08:04,500 --> 00:08:08,759 a similar level of safety to RabbitMQ as 252 00:08:07,199 --> 00:08:11,160 long as something gets into that table 253 00:08:08,759 --> 00:08:12,660 it will be tried at least once 254 00:08:11,160 --> 00:08:14,220 this is more work though you always have to 255 00:08:12,660 --> 00:08:16,199 decide you know you have to weigh the 256 00:08:14,220 --> 00:08:18,360 pros and cons of how much failure 257 
00:08:16,199 --> 00:08:20,160 tolerance you want in each sort of piece 258 00:08:18,360 --> 00:08:22,319 of this system 259 00:08:20,160 --> 00:08:24,300 but okay back to async stuff the general 260 00:08:22,319 --> 00:08:25,560 rule of the async ORM is any method that 261 00:08:24,300 --> 00:08:29,220 would talk to the database is prefixed 262 00:08:25,560 --> 00:08:31,139 with an a aget afirst asave aupdate 263 00:08:29,220 --> 00:08:32,760 if you ever miss an a and call the 264 00:08:31,139 --> 00:08:34,380 synchronous method in an async context 265 00:08:32,760 --> 00:08:35,880 Django will raise an exception to remind 266 00:08:34,380 --> 00:08:38,459 you hey you really need to call the 267 00:08:35,880 --> 00:08:39,719 async version of this so don't worry too 268 00:08:38,459 --> 00:08:41,520 much about that 269 00:08:39,719 --> 00:08:43,560 there are two big limitations left in 270 00:08:41,520 --> 00:08:45,480 the async ORM transactions don't 271 00:08:43,560 --> 00:08:47,820 work in async code and queries can't 272 00:08:45,480 --> 00:08:49,680 overlap let's talk about the second one first 273 00:08:47,820 --> 00:08:51,600 the lack of overlap is mostly an 274 00:08:49,680 --> 00:08:53,339 internal detail so you can run multiple 275 00:08:51,600 --> 00:08:54,120 concurrent queries at the asyncio 276 00:08:53,339 --> 00:08:55,980 level 277 00:08:54,120 --> 00:08:58,019 set up a you know an asyncio.gather 278 00:08:55,980 --> 00:08:59,279 whatever you want to do with it but from 279 00:08:58,019 --> 00:09:00,360 the database's point of view it's going 280 00:08:59,279 --> 00:09:01,740 to see those queries sequentially 281 00:09:00,360 --> 00:09:03,720 because they are run on a single 282 00:09:01,740 --> 00:09:05,880 background worker thread the end result 283 00:09:03,720 --> 00:09:06,899 of this is that you do not yet get a 284 00:09:05,880 --> 00:09:10,440 performance benefit from running 285 00:09:06,899 --> 00:09:12,420 multiple SQL queries in parallel 286 
00:09:10,440 --> 00:09:14,459 async transactions require a little bit 287 00:09:12,420 --> 00:09:17,339 more complexity than usual so if we just 288 00:09:14,459 --> 00:09:20,399 naively did transactions in async code 289 00:09:17,339 --> 00:09:21,959 we run the risk of multiple database 290 00:09:20,399 --> 00:09:23,820 queries getting interleaved into the 291 00:09:21,959 --> 00:09:25,980 transaction so we have to pull it all 292 00:09:23,820 --> 00:09:28,200 into one synchronous block and run it 293 00:09:25,980 --> 00:09:29,940 using asgiref's sync_to_async helper 294 00:09:28,200 --> 00:09:32,820 this is definitely something the Django 295 00:09:29,940 --> 00:09:34,800 team is looking to improve uh in very 296 00:09:32,820 --> 00:09:36,120 recent releases of some of the 297 00:09:34,800 --> 00:09:37,800 underlying database 298 00:09:36,120 --> 00:09:39,600 libraries like psycopg3 we've started 299 00:09:37,800 --> 00:09:41,519 seeing true async support that should 300 00:09:39,600 --> 00:09:44,519 open a lot of doors for better support 301 00:09:41,519 --> 00:09:46,140 in the async ORM 302 00:09:44,519 --> 00:09:47,580 because extracting from web servers is 303 00:09:46,140 --> 00:09:49,140 such a common extractor you will very 304 00:09:47,580 --> 00:09:51,720 likely need an async compatible HTTP 305 00:09:49,140 --> 00:09:53,580 client my recommendation is httpx it's 306 00:09:51,720 --> 00:09:55,140 got a really simple API it's got great 307 00:09:53,580 --> 00:09:56,940 testing support via another package 308 00:09:55,140 --> 00:09:58,080 called respx and it's generally 309 00:09:56,940 --> 00:10:00,000 going to be really familiar to anyone 310 00:09:58,080 --> 00:10:02,399 that's used requests before there's 311 00:10:00,000 --> 00:10:04,800 another one available called aiohttp it 312 00:10:02,399 --> 00:10:07,440 has better sort of raw HTTP performance 313 00:10:04,800 --> 00:10:10,260 but it doesn't support HTTP/2 so that kind 
314 00:10:07,440 --> 00:10:12,959 of negates the benefit of being faster 315 00:10:10,260 --> 00:10:15,180 also unlike the ORM both of these can 316 00:10:12,959 --> 00:10:17,279 fully overlap requests so you can speed 317 00:10:15,180 --> 00:10:19,080 up fetching by running multiple 318 00:10:17,279 --> 00:10:20,519 gets or posts or whatever in parallel 319 00:10:19,080 --> 00:10:22,019 though of course make sure that you do 320 00:10:20,519 --> 00:10:24,300 not overload the server that is on the 321 00:10:22,019 --> 00:10:25,740 other side of things 322 00:10:24,300 --> 00:10:26,760 I know showing code on slides is a bit 323 00:10:25,740 --> 00:10:28,080 questionable but I want to go through 324 00:10:26,760 --> 00:10:29,339 some examples from my case study 325 00:10:28,080 --> 00:10:31,500 project to show you just how simple 326 00:10:29,339 --> 00:10:34,560 this can really be in practice 327 00:10:31,500 --> 00:10:37,200 so this is the most basic version of ETL 328 00:10:34,560 --> 00:10:38,940 in Django make an HTTP request and then 329 00:10:37,200 --> 00:10:40,680 throw things into a database we wrap 330 00:10:38,940 --> 00:10:43,440 this using that looping call helper that 331 00:10:40,680 --> 00:10:45,540 we saw before and congrats that's a mini 332 00:10:43,440 --> 00:10:47,820 ETL system and maybe this is enough for 333 00:10:45,540 --> 00:10:49,920 you I have plenty of things in the ETL 334 00:10:47,820 --> 00:10:51,600 for the Farm RPG project that look 335 00:10:49,920 --> 00:10:52,920 exactly like this 336 00:10:51,600 --> 00:10:54,079 um maybe with a little bit more error 337 00:10:52,920 --> 00:10:56,640 handling that doesn't fit on the slide 338 00:10:54,079 --> 00:10:58,500 but maybe you want to make things more 339 00:10:56,640 --> 00:11:00,899 complex the nice thing about this being 340 00:10:58,500 --> 00:11:02,519 in code is that rather than being built 341 00:11:00,899 --> 00:11:04,740 out of dozens of different microservice 342 
00:11:02,519 --> 00:11:08,000 config files we can adapt adjust and 343 00:11:04,740 --> 00:11:08,000 improve it really easily 344 00:11:08,579 --> 00:11:12,240 one possibility for example would be using 345 00:11:10,140 --> 00:11:13,800 Django REST framework to parse out the 346 00:11:12,240 --> 00:11:15,899 incoming data this is really useful if 347 00:11:13,800 --> 00:11:17,880 you're getting back deeply nested data 348 00:11:15,899 --> 00:11:18,899 that contains multiple sub-objects and 349 00:11:17,880 --> 00:11:21,540 you want to parse those into different 350 00:11:18,899 --> 00:11:23,880 tables DRF can handle that for you 351 00:11:21,540 --> 00:11:26,579 another thing to note DRF doesn't handle 352 00:11:23,880 --> 00:11:29,180 async on its own yet but sync_to_async 353 00:11:26,579 --> 00:11:29,180 has us covered 354 00:11:29,760 --> 00:11:33,480 another really common pattern in a lot 355 00:11:31,560 --> 00:11:34,980 of ETL systems is wanting to clean up 356 00:11:33,480 --> 00:11:36,959 values in the database that no longer 357 00:11:34,980 --> 00:11:39,480 exist upstream so here's a really simple 358 00:11:36,959 --> 00:11:41,399 and honestly not very scalable version 359 00:11:39,480 --> 00:11:43,440 of that this will work fine up to a few 360 00:11:41,399 --> 00:11:44,760 thousand rows after that point it gets 361 00:11:43,440 --> 00:11:46,800 more complicated and wouldn't really fit on 362 00:11:44,760 --> 00:11:48,839 a slide anymore but for smaller stuff 363 00:11:46,800 --> 00:11:51,360 this is great this is all you need for 364 00:11:48,839 --> 00:11:53,660 please keep me in sync with an upstream 365 00:11:51,360 --> 00:11:53,660 server 366 00:11:53,700 --> 00:11:57,540 but all right setting things up in ready 367 00:11:55,500 --> 00:11:59,100 callbacks is a great trick and now you 368 00:11:57,540 --> 00:12:00,180 know that uh but sometimes we want 369 00:11:59,100 --> 00:12:02,700 something a little bit simpler something 370 00:12:00,180 --> 00:12:04,620 
more like Celery's uh task decorator 371 00:12:02,700 --> 00:12:06,300 fortunately Django has a helper method 372 00:12:04,620 --> 00:12:08,459 for this uh it is called autodiscover_modules 373 00:12:06,300 --> 00:12:09,959 it is a little bit complex to 374 00:12:08,459 --> 00:12:12,420 use but once you know how to use it it 375 00:12:09,959 --> 00:12:14,880 is great so it all starts with you need 376 00:12:12,420 --> 00:12:16,320 a field somewhere called _registry 377 00:12:14,880 --> 00:12:18,839 it must be called exactly 378 00:12:16,320 --> 00:12:21,480 _registry uh 379 00:12:18,839 --> 00:12:23,700 you then pass to autodiscover_modules 380 00:12:21,480 --> 00:12:26,220 the submodule inside each app that you 381 00:12:23,700 --> 00:12:28,440 want to look for and the place that the 382 00:12:26,220 --> 00:12:30,240 _registry object exists it 383 00:12:28,440 --> 00:12:32,339 will iterate over all of your Django 384 00:12:30,240 --> 00:12:33,540 apps look for that submodule and make 385 00:12:32,339 --> 00:12:36,120 sure that even if there are loading 386 00:12:33,540 --> 00:12:38,519 errors it does not corrupt the registry 387 00:12:36,120 --> 00:12:40,500 so if we want to use this we can make a 388 00:12:38,519 --> 00:12:42,360 decorator like we show here that adds 389 00:12:40,500 --> 00:12:44,700 things to the registry and we can make a 390 00:12:42,360 --> 00:12:46,740 single ready callback that loops through 391 00:12:44,700 --> 00:12:48,959 the registry and calls create_task and 392 00:12:46,740 --> 00:12:52,560 boom we've got exactly the same thing as 393 00:12:48,959 --> 00:12:54,480 Celery's @task decorator 394 00:12:52,560 --> 00:12:56,399 having a loop that runs every 30 seconds 395 00:12:54,480 --> 00:12:58,740 is really good for some cases and maybe 396 00:12:56,399 --> 00:13:00,480 we could make that 60 seconds or 100 397 00:12:58,740 --> 00:13:01,620 seconds or however many seconds we need but 398 00:13:00,480 --> 
00:13:03,480 sometimes we want more complex 399 00:13:01,620 --> 00:13:05,399 scheduling something like we'd get from 400 00:13:03,480 --> 00:13:07,320 cron fortunately there's a fantastic 401 00:13:05,399 --> 00:13:09,000 library for this it is called croniter 402 00:13:07,320 --> 00:13:10,980 and it handles all of the math around 403 00:13:09,000 --> 00:13:13,380 timing all you need to do is track for 404 00:13:10,980 --> 00:13:17,279 each task the cron spec the like the 405 00:13:13,380 --> 00:13:19,200 string of star star whatever and the 406 00:13:17,279 --> 00:13:20,519 last run time for the task you can store 407 00:13:19,200 --> 00:13:22,920 that in a database model you can store 408 00:13:20,519 --> 00:13:25,560 it in memory whatever you want make a 409 00:13:22,920 --> 00:13:28,200 single ready callback that runs it loops 410 00:13:25,560 --> 00:13:30,120 every second or every minute checks it 411 00:13:28,200 --> 00:13:31,920 passes the cron spec and the last 412 00:13:30,120 --> 00:13:34,620 run time into croniter it'll tell you 413 00:13:31,920 --> 00:13:36,420 the next run time that is calculated for 414 00:13:34,620 --> 00:13:39,120 that cron spec if that's in the past 415 00:13:36,420 --> 00:13:41,760 run a new version and you know update 416 00:13:39,120 --> 00:13:43,740 the last run time for it boom you've got 417 00:13:41,760 --> 00:13:45,959 a simple cron system this does mean 418 00:13:43,740 --> 00:13:47,639 duplicating some logic that a real cron 419 00:13:45,959 --> 00:13:49,620 system or Celery beat would get you for 420 00:13:47,639 --> 00:13:51,480 free but in practice this is about 10 421 00:13:49,620 --> 00:13:54,380 lines of code and it saves you a ton of 422 00:13:51,480 --> 00:13:54,380 organizational complexity 423 00:13:54,660 --> 00:13:57,959 ETL as an acronym technically only 424 00:13:55,920 --> 00:13:59,100 covers the ingestion but really if we 425 00:13:57,959 --> 00:14:00,540 are doing this we're probably going to 426 
00:13:59,100 --> 00:14:03,420 do something with the data and that 427 00:14:00,540 --> 00:14:05,399 thing is usually querying it now those 428 00:14:03,420 --> 00:14:06,839 could be SQL queries but again probably 429 00:14:05,399 --> 00:14:08,940 all of you know how to use the ORM 430 00:14:06,839 --> 00:14:09,839 already at least to a basic extent so 431 00:14:08,940 --> 00:14:10,980 let's look at something a little bit 432 00:14:09,839 --> 00:14:13,200 more fun 433 00:14:10,980 --> 00:14:14,540 I want to talk about GraphQL but first I 434 00:14:13,200 --> 00:14:17,459 have to again do some disclaimers 435 00:14:14,540 --> 00:14:19,260 GraphQL does not scale it is phenomenal 436 00:14:17,459 --> 00:14:22,260 for small scale systems I love it and I 437 00:14:19,260 --> 00:14:23,700 will talk about why but it is difficult 438 00:14:22,260 --> 00:14:25,019 verging on impossible to get good 439 00:14:23,700 --> 00:14:27,480 performance out of it when you are 440 00:14:25,019 --> 00:14:29,220 dealing with a large scale system if 441 00:14:27,480 --> 00:14:31,139 your queries are super cache friendly 442 00:14:29,220 --> 00:14:33,839 that can be a solution but otherwise 443 00:14:31,139 --> 00:14:35,459 beware the dragons 444 00:14:33,839 --> 00:14:37,079 I can't possibly go over everything that 445 00:14:35,459 --> 00:14:38,399 GraphQL includes because that's a whole 446 00:14:37,079 --> 00:14:41,699 other conference talk but the really 447 00:14:38,399 --> 00:14:43,500 quick version GraphQL queries take a set 448 00:14:41,699 --> 00:14:45,660 of nested fields that you would like to 449 00:14:43,500 --> 00:14:47,459 retrieve usually there's going to be a 450 00:14:45,660 --> 00:14:50,639 top level field which is the equivalent 451 00:14:47,459 --> 00:14:52,320 of a table in SQL and then you give it 452 00:14:50,639 --> 00:14:53,880 some filters if you don't want to get 453 00:14:52,320 --> 00:14:55,560 all the objects and then the fields 454 00:14:53,880 --> 
00:14:57,899 inside that object that you would like 455 00:14:55,560 --> 00:14:59,699 to retrieve the key here is that that 456 00:14:57,899 --> 00:15:01,980 can be nested so you're not just asking 457 00:14:59,699 --> 00:15:04,800 for the columns on one table you can 458 00:15:01,980 --> 00:15:07,139 recurse through to other tables 459 00:15:04,800 --> 00:15:09,360 in Django terms this means that we can 460 00:15:07,139 --> 00:15:11,040 look at either columns or foreign keys 461 00:15:09,360 --> 00:15:13,560 many-to-manys or reverse managers as 462 00:15:11,040 --> 00:15:14,160 fields so here's an example 463 00:15:13,560 --> 00:15:16,260 um 464 00:15:14,160 --> 00:15:18,839 if GraphQL has all these sharp downsides 465 00:15:16,260 --> 00:15:21,480 why do I like it at all it is super cool 466 00:15:18,839 --> 00:15:23,279 for deeply interlinked data so this is a 467 00:15:21,480 --> 00:15:25,980 GraphQL query that is looking at three 468 00:15:23,279 --> 00:15:28,079 different SQL tables items quests and a 469 00:15:25,980 --> 00:15:30,300 through table between them so we want to 470 00:15:28,079 --> 00:15:33,300 get name image and value for all items 471 00:15:30,300 --> 00:15:35,220 and then for every item look at all of 472 00:15:33,300 --> 00:15:36,899 the quests that use it get the title 473 00:15:35,220 --> 00:15:39,060 image and text of that quest as well as 474 00:15:36,899 --> 00:15:42,240 the quantity used in that quest and we 475 00:15:39,060 --> 00:15:44,579 can do all of that in one generic query 476 00:15:42,240 --> 00:15:46,560 we could of course make an API endpoint 477 00:15:44,579 --> 00:15:47,940 for this we could write some ORM code and 478 00:15:46,560 --> 00:15:51,300 it would be fairly simple like doing 479 00:15:47,940 --> 00:15:53,279 this in a Django view not a big deal but 480 00:15:51,300 --> 00:15:55,500 the idea of GraphQL is what if I don't 481 00:15:53,279 --> 00:15:57,959 want to write a dedicated view for every 482 00:15:55,500 
--> 00:15:59,459 type of query I want to run 483 00:15:57,959 --> 00:16:01,680 we're going to make some trade-offs 484 00:15:59,459 --> 00:16:04,500 there are performance issues as I keep 485 00:16:01,680 --> 00:16:06,600 mentioning uh but it's a balance you 486 00:16:04,500 --> 00:16:09,360 know we don't have to let the clients 487 00:16:06,600 --> 00:16:11,339 deal with all of the join data or give 488 00:16:09,360 --> 00:16:14,279 them too much information 489 00:16:11,339 --> 00:16:15,959 but on the flip side that comes at the cost that it 490 00:16:14,279 --> 00:16:17,459 will be less performant than a 491 00:16:15,959 --> 00:16:20,279 handwritten query 492 00:16:17,459 --> 00:16:21,839 but in return we get a single view that 493 00:16:20,279 --> 00:16:24,240 can answer basically any question that 494 00:16:21,839 --> 00:16:25,680 we want to ask of our data 495 00:16:24,240 --> 00:16:27,420 other than large data sets though 496 00:16:25,680 --> 00:16:29,160 there's two other major places to avoid 497 00:16:27,420 --> 00:16:30,899 GraphQL so one is if you want to answer 498 00:16:29,160 --> 00:16:33,320 numeric queries and the other is poorly 499 00:16:30,899 --> 00:16:35,399 linked data on the numeric queries 500 00:16:33,320 --> 00:16:36,959 GraphQL gives you lots of stuff to 501 00:16:35,399 --> 00:16:38,459 control which fields are included as we 502 00:16:36,959 --> 00:16:40,860 just saw but what it doesn't have is 503 00:16:38,459 --> 00:16:43,019 SQL's numeric aggregation support so for 504 00:16:40,860 --> 00:16:45,779 example I could say give me the value of 505 00:16:43,019 --> 00:16:48,360 all items of type Foo super easy query 506 00:16:45,779 --> 00:16:50,040 what I cannot do is tell it to give me 507 00:16:48,360 --> 00:16:51,779 the average value of all of those you'd 508 00:16:50,040 --> 00:16:54,000 have to compute that client side 509 00:16:51,779 --> 00:16:55,980 so if we loop back to talking about the 510 00:16:54,000 --> 00:16:58,019 transforms this is where
you might want 511 00:16:55,980 --> 00:17:00,060 to instead pre-calculate those during 512 00:16:58,019 --> 00:17:01,860 the transform process and store them in 513 00:17:00,060 --> 00:17:05,939 their own database model say average 514 00:17:01,860 --> 00:17:08,160 value for type Foo is five and then you 515 00:17:05,939 --> 00:17:09,919 can expose that model through GraphQL 516 00:17:08,160 --> 00:17:12,720 use that instead 517 00:17:09,919 --> 00:17:13,799 for disconnected tables that would be 518 00:17:12,720 --> 00:17:15,120 things where there's just not a lot of 519 00:17:13,799 --> 00:17:16,980 foreign keys there's not a lot of links 520 00:17:15,120 --> 00:17:18,360 between the tables so GraphQL is not 521 00:17:16,980 --> 00:17:20,100 really getting you anything like sure 522 00:17:18,360 --> 00:17:21,839 you can use it but there's way easier 523 00:17:20,100 --> 00:17:23,880 ways to build generic views for a single 524 00:17:21,839 --> 00:17:25,500 table don't don't burden yourself with 525 00:17:23,880 --> 00:17:27,059 this for that 526 00:17:25,500 --> 00:17:28,980 the best tool I found for GraphQL and 527 00:17:27,059 --> 00:17:30,240 Django is Strawberry the core library 528 00:17:28,980 --> 00:17:32,220 implements schema management and 529 00:17:30,240 --> 00:17:34,320 data flow and then Strawberry Django adds 530 00:17:32,220 --> 00:17:36,059 adapters for loading data using the ORM 531 00:17:34,320 --> 00:17:37,980 and dealing with schema definitions for 532 00:17:36,059 --> 00:17:39,360 Django model types you'll see some 533 00:17:37,980 --> 00:17:40,980 guides referencing Strawberry Django 534 00:17:39,360 --> 00:17:42,360 Plus which was an enhancement library on 535 00:17:40,980 --> 00:17:45,000 top of these but it has been merged back 536 00:17:42,360 --> 00:17:47,039 to core so you don't need it anymore 537 00:17:45,000 --> 00:17:48,780 a Strawberry schema defines the top 538 00:17:47,039 --> 00:17:51,720 level of the query namespace just
like a 539 00:17:48,780 --> 00:17:53,280 root urls.py does for HTTP it takes a 540 00:17:51,720 --> 00:17:55,080 root query type and that references 541 00:17:53,280 --> 00:17:56,520 other types those reference other types 542 00:17:55,080 --> 00:17:58,260 and each other and that slowly builds 543 00:17:56,520 --> 00:18:00,120 out the web of what can be queried 544 00:17:58,260 --> 00:18:01,740 through GraphQL 545 00:18:00,120 --> 00:18:02,940 a slight annoyance of Strawberry is 546 00:18:01,740 --> 00:18:04,500 having to restate all of our model 547 00:18:02,940 --> 00:18:06,240 definitions as Strawberry types but at 548 00:18:04,500 --> 00:18:07,919 least the majority of things can be 549 00:18:06,240 --> 00:18:09,059 inferred automatically from the Django 550 00:18:07,919 --> 00:18:10,919 model the only place where we need to 551 00:18:09,059 --> 00:18:14,100 get explicit is either implicit fields 552 00:18:10,919 --> 00:18:16,679 like ID or interlinks between types 553 00:18:14,100 --> 00:18:19,200 where we have to give it the Python type 554 00:18:16,679 --> 00:18:20,880 to link to 555 00:18:19,200 --> 00:18:22,080 for structuring these things I usually 556 00:18:20,880 --> 00:18:24,660 like to do it the same way we do with 557 00:18:22,080 --> 00:18:26,460 urls.py so I keep the GraphQL types in 558 00:18:24,660 --> 00:18:29,360 each app and then I import them into the 559 00:18:26,460 --> 00:18:29,360 big core query 560 00:18:29,460 --> 00:18:33,000 GraphQL doesn't offer much in terms of 561 00:18:31,260 --> 00:18:35,100 data slice and dice but there's a little 562 00:18:33,000 --> 00:18:36,720 bit so filters allow relatively basic 563 00:18:35,100 --> 00:18:39,120 WHERE checks we saw an example of that 564 00:18:36,720 --> 00:18:40,799 before orders let you set the sort order 565 00:18:39,120 --> 00:18:42,960 of the results although as a warning 566 00:18:40,799 --> 00:18:44,100 they are buggy in the latest release of 567 00:18:42,960 --> 00:18:46,980
Strawberry Django and they will 568 00:18:44,100 --> 00:18:49,860 absolutely wreck your query performance 569 00:18:46,980 --> 00:18:51,419 and speaking of query performance we 570 00:18:49,860 --> 00:18:52,860 should maybe look at what happens when 571 00:18:51,419 --> 00:18:55,320 you actually run some of these so here's 572 00:18:52,860 --> 00:18:57,240 our simple query again 573 00:18:55,320 --> 00:18:58,679 and this is what it looks like in Django 574 00:18:57,240 --> 00:19:00,600 Debug Toolbar for those of you in the 575 00:18:58,679 --> 00:19:02,400 back who can't read this the first line 576 00:19:00,600 --> 00:19:03,660 is relatively simple it is running a 577 00:19:02,400 --> 00:19:04,860 SELECT against the items table and 578 00:19:03,660 --> 00:19:07,740 pulling out the columns that we want 579 00:19:04,860 --> 00:19:09,600 totally normal that second line is where 580 00:19:07,740 --> 00:19:10,980 we get the problem it is querying 581 00:19:09,600 --> 00:19:13,380 against the quests and the through table 582 00:19:10,980 --> 00:19:15,600 but with an enormous item ID IN 583 00:19:13,380 --> 00:19:18,000 condition and what if we add a couple 584 00:19:15,600 --> 00:19:21,059 more fields into our query 585 00:19:18,000 --> 00:19:22,200 this gets complicated very fast so this 586 00:19:21,059 --> 00:19:23,880 is running on my development server 587 00:19:22,200 --> 00:19:25,980 where each of these tables only has a 588 00:19:23,880 --> 00:19:27,660 few hundred rows if there were a million 589 00:19:25,980 --> 00:19:29,880 rows in each of these tables you can see 590 00:19:27,660 --> 00:19:31,980 why this gets to be a problem 591 00:19:29,880 --> 00:19:34,020 in fairness this only took 200 592 00:19:31,980 --> 00:19:36,780 milliseconds so like this isn't a huge 593 00:19:34,020 --> 00:19:40,140 problem even at relatively small 594 00:19:36,780 --> 00:19:42,059 scales it's fine at you know a couple 595 00:19:40,140 --> 00:19:47,360 thousand rows per table
still no problem 596 00:19:42,059 --> 00:19:47,360 but watch it be careful with it uh 597 00:19:47,460 --> 00:19:50,940 before we stop talking about Strawberry 598 00:19:48,960 --> 00:19:52,799 one more word of warning their async 599 00:19:50,940 --> 00:19:55,380 Django view is also currently a bit 600 00:19:52,799 --> 00:19:57,720 funky and gets intermittent data errors 601 00:19:55,380 --> 00:19:58,860 just use the synchronous view because of 602 00:19:57,720 --> 00:20:01,080 that thing that I mentioned before where 603 00:19:58,860 --> 00:20:02,520 you cannot overlap database queries 604 00:20:01,080 --> 00:20:04,320 there is not actually a performance 605 00:20:02,520 --> 00:20:07,200 benefit as far as I can tell to using 606 00:20:04,320 --> 00:20:09,419 the Strawberry async view so hopefully 607 00:20:07,200 --> 00:20:10,679 that will be fixed soon though 608 00:20:09,419 --> 00:20:12,660 all right 609 00:20:10,679 --> 00:20:14,580 back to ETL stuff a particular place 610 00:20:12,660 --> 00:20:16,740 where ETL and GraphQL combine really 611 00:20:14,580 --> 00:20:18,179 well is static site generators Gatsby 612 00:20:16,740 --> 00:20:19,620 has deep native support for it and 613 00:20:18,179 --> 00:20:22,140 Pelican allows you to slot this in 614 00:20:19,620 --> 00:20:24,480 really easily as an HTTP data source 615 00:20:22,140 --> 00:20:26,700 so you can set up your static site 616 00:20:24,480 --> 00:20:28,679 generator set up a build running say 617 00:20:26,700 --> 00:20:30,120 every hour every day in your CI system 618 00:20:28,679 --> 00:20:33,299 of choice and you've got a really easy 619 00:20:30,120 --> 00:20:34,740 way to build dashboards for your data 620 00:20:33,299 --> 00:20:36,539 and another really useful feature of 621 00:20:34,740 --> 00:20:39,120 GraphQL is making queries and listening 622 00:20:36,539 --> 00:20:40,980 to live updates so with your dashboards 623 00:20:39,120 --> 00:20:43,140 this lets you immediately slot in 624
00:20:40,980 --> 00:20:44,940 automatic updates to your graphs without 625 00:20:43,140 --> 00:20:46,320 a very big code footprint for this to 626 00:20:44,940 --> 00:20:49,020 work in Strawberry you do need Channels 627 00:20:46,320 --> 00:20:50,580 because we are only targeting a small 628 00:20:49,020 --> 00:20:52,080 server that's running inside a single 629 00:20:50,580 --> 00:20:54,240 process we can use the in-memory channel 630 00:20:52,080 --> 00:20:55,860 layer although you can also use channels 631 00:20:54,240 --> 00:20:57,059 postgres if you would like check the 632 00:20:55,860 --> 00:20:59,340 Strawberry docs you've got to make some 633 00:20:57,059 --> 00:21:00,900 config tweaks for this to work properly 634 00:20:59,340 --> 00:21:02,700 all right GraphQL certainly useful and 635 00:21:00,900 --> 00:21:04,080 interesting but not very fun what kind 636 00:21:02,700 --> 00:21:05,760 of weird and wonderful stuff can we do 637 00:21:04,080 --> 00:21:08,100 with our little ETL server 638 00:21:05,760 --> 00:21:09,480 integrating chatbots fun right 639 00:21:08,100 --> 00:21:11,280 so the core of the integration is the 640 00:21:09,480 --> 00:21:12,419 same as we saw with ETL tasks this is 641 00:21:11,280 --> 00:21:14,760 the same general structure you're going 642 00:21:12,419 --> 00:21:17,160 to use for plugging anything in to async 643 00:21:14,760 --> 00:21:18,480 Django you make a Django app you spawn 644 00:21:17,160 --> 00:21:20,760 something from the ready callback and 645 00:21:18,480 --> 00:21:22,620 you go 646 00:21:20,760 --> 00:21:24,179 um a quick note when you are reading 647 00:21:22,620 --> 00:21:25,559 the docs for any async library make sure 648 00:21:24,179 --> 00:21:27,360 that you know the difference between the 649 00:21:25,559 --> 00:21:28,679 blocking run the thing function and the 650 00:21:27,360 --> 00:21:30,000 underlying async task so what you're 651 00:21:28,679 --> 00:21:32,520 going to find in the tutorials for most 652
00:21:30,000 --> 00:21:34,740 async libraries is something that starts 653 00:21:32,520 --> 00:21:36,059 an event loop and blocks forever we 654 00:21:34,740 --> 00:21:37,380 don't want that we're running an async 655 00:21:36,059 --> 00:21:38,940 server we've already got an event loop 656 00:21:37,380 --> 00:21:41,039 going so make sure that you're using the 657 00:21:38,940 --> 00:21:42,659 correct one in the case of discord.py 658 00:21:41,039 --> 00:21:43,980 it's called client.start every library 659 00:21:42,659 --> 00:21:46,260 is going to call these different things 660 00:21:43,980 --> 00:21:47,640 but read the docs very carefully or 661 00:21:46,260 --> 00:21:49,559 you're going to have weird mysterious 662 00:21:47,640 --> 00:21:51,240 failures 663 00:21:49,559 --> 00:21:53,039 all right but with those basics in place 664 00:21:51,240 --> 00:21:55,980 we can make a chatbot that can reach 665 00:21:53,039 --> 00:21:58,140 into our ETL data inside Django and run 666 00:21:55,980 --> 00:21:59,640 things like dynamic queries based on 667 00:21:58,140 --> 00:22:01,320 chat input 668 00:21:59,640 --> 00:22:02,820 we've got the full power of the ORM here 669 00:22:01,320 --> 00:22:05,100 we can pull in any other libraries that 670 00:22:02,820 --> 00:22:06,720 we want all kinds of fun stuff we can 671 00:22:05,100 --> 00:22:08,460 also use it for logging notifications 672 00:22:06,720 --> 00:22:10,320 too or we could combine it with that 673 00:22:08,460 --> 00:22:12,000 cron pattern that we saw before and use 674 00:22:10,320 --> 00:22:14,100 this for sending say nightly reports to 675 00:22:12,000 --> 00:22:15,900 a chat channel 676 00:22:14,100 --> 00:22:18,299 but chatbots are old news what if we 677 00:22:15,900 --> 00:22:20,039 want to SSH into our ETL server not into 678 00:22:18,299 --> 00:22:23,340 the server it's running on into the 679 00:22:20,039 --> 00:22:25,140 server itself AsyncSSH contains a full 680 00:22:23,340 --> 00:22:27,059 async
compatible SSH server 681 00:22:25,140 --> 00:22:29,520 implementation 682 00:22:27,059 --> 00:22:31,620 useful probably not but maybe there's an 683 00:22:29,520 --> 00:22:34,020 edge case where you could justify this 684 00:22:31,620 --> 00:22:35,580 uh and then for some of my projects I go 685 00:22:34,020 --> 00:22:37,799 all the way into the just you couldn't 686 00:22:35,580 --> 00:22:39,840 justify this for a work project 687 00:22:37,799 --> 00:22:41,760 um talking down to hardware uh I have a 688 00:22:39,840 --> 00:22:42,900 Stream Deck controller that is async 689 00:22:41,760 --> 00:22:45,179 compatible and it's a lot of fun to play 690 00:22:42,900 --> 00:22:46,919 with with these things and I have a lot 691 00:22:45,179 --> 00:22:49,200 of as I mentioned I work for Ikea so I 692 00:22:46,919 --> 00:22:50,820 have a lot of Ikea IoT devices and one 693 00:22:49,200 --> 00:22:52,620 of the ETL systems in my house can 694 00:22:50,820 --> 00:22:56,100 automatically twiddle those 695 00:22:52,620 --> 00:22:58,440 it's silly it's just for fun but cool 696 00:22:56,100 --> 00:23:00,059 uh all right back to reality I spent a 697 00:22:58,440 --> 00:23:01,740 lot of time singing the praises of small 698 00:23:00,059 --> 00:23:03,480 systems and I will continue to do so but 699 00:23:01,740 --> 00:23:05,700 what if your system starts out small and 700 00:23:03,480 --> 00:23:07,260 then grows Django has you covered 701 00:23:05,700 --> 00:23:09,120 one common problem is there just being 702 00:23:07,260 --> 00:23:10,559 too much data to transform and load in a 703 00:23:09,120 --> 00:23:12,900 single process this is going to come up 704 00:23:10,559 --> 00:23:14,880 where your pre-transform data is huge 705 00:23:12,900 --> 00:23:16,980 and post-transform is very small it's 706 00:23:14,880 --> 00:23:18,059 getting reduced compressed whatever it 707 00:23:16,980 --> 00:23:18,600 is 708 00:23:18,059 --> 00:23:20,340 um 709 00:23:18,600 --> 00:23:22,020 so that
fitting all of the 710 00:23:20,340 --> 00:23:24,480 pre-transformed data in memory at once 711 00:23:22,020 --> 00:23:26,940 is really hard so simple solution here 712 00:23:24,480 --> 00:23:28,559 shard your ingest if you have multiple 713 00:23:26,940 --> 00:23:31,080 URLs you can divvy them up give 714 00:23:28,559 --> 00:23:33,600 every server an ID and only process the 715 00:23:31,080 --> 00:23:35,460 matching ones on the matching server end 716 00:23:33,600 --> 00:23:36,960 of problem sharded systems can get 717 00:23:35,460 --> 00:23:38,880 really complex with hash rings and 718 00:23:36,960 --> 00:23:41,580 vector clocks but as I keep saying start 719 00:23:38,880 --> 00:23:43,679 simple build what you need 720 00:23:41,580 --> 00:23:45,179 a fairly common thing for transforms in 721 00:23:43,679 --> 00:23:47,820 an ETL system to want to do is number 722 00:23:45,179 --> 00:23:49,559 crunching working with CPU and Python is 723 00:23:47,820 --> 00:23:51,360 about four talks on its own but the 724 00:23:49,559 --> 00:23:52,860 really short version uh if the call is 725 00:23:51,360 --> 00:23:55,620 something that drops the GIL like NumPy 726 00:23:52,860 --> 00:23:57,840 or PyTorch you can use sync_to_async 727 00:23:55,620 --> 00:23:59,580 with thread_sensitive set to False 728 00:23:57,840 --> 00:24:00,840 under the hood that'll move it into a 729 00:23:59,580 --> 00:24:02,760 background thread where it can chew up 730 00:24:00,840 --> 00:24:04,020 CPU to its heart's content eventually 731 00:24:02,760 --> 00:24:06,299 it'll finish and then it'll transfer 732 00:24:04,020 --> 00:24:07,559 control back to your async function if 733 00:24:06,299 --> 00:24:09,120 it is something that does not drop the 734 00:24:07,559 --> 00:24:11,100 GIL the options are a little bit more 735 00:24:09,120 --> 00:24:12,840 limited you'll want to look at either 736 00:24:11,100 --> 00:24:15,299 the ProcessPoolExecutor from the 737 00:24:12,840 --> 00:24:16,440
futures library or aiomultiprocess 738 00:24:15,299 --> 00:24:18,059 which is an async wrapper around 739 00:24:16,440 --> 00:24:19,440 multiprocessing 740 00:24:18,059 --> 00:24:20,640 and if you need to grow beyond all of 741 00:24:19,440 --> 00:24:22,559 this the big tools are still right there 742 00:24:20,640 --> 00:24:24,120 you can still use them maybe you swap 743 00:24:22,559 --> 00:24:25,919 your homegrown sharded loader for some 744 00:24:24,120 --> 00:24:27,720 Hadoop and Hive you've got to rewrite a 745 00:24:25,919 --> 00:24:29,520 couple of Django ORM queries into 746 00:24:27,720 --> 00:24:30,960 PyHive but the rest of your code keeps on 747 00:24:29,520 --> 00:24:32,760 trucking 748 00:24:30,960 --> 00:24:34,799 so to recap what we've talked about here 749 00:24:32,760 --> 00:24:36,059 ETL systems let us move data around pull 750 00:24:34,799 --> 00:24:38,220 out the most important bits we need 751 00:24:36,059 --> 00:24:40,140 async Django isn't perfect but it's 752 00:24:38,220 --> 00:24:43,200 very usable and it's great for building 753 00:24:40,140 --> 00:24:45,480 small scale ETL systems GraphQL pairs 754 00:24:43,200 --> 00:24:47,880 well with both of them and as a generic 755 00:24:45,480 --> 00:24:49,740 query interface it can give us very easy 756 00:24:47,880 --> 00:24:51,780 queries with some performance issues at 757 00:24:49,740 --> 00:24:53,159 scale async Python has a lot of fun 758 00:24:51,780 --> 00:24:55,860 libraries to add 759 00:24:53,159 --> 00:24:57,419 and there is value in starting small and 760 00:24:55,860 --> 00:24:58,740 simple letting your tool grow with its 761 00:24:57,419 --> 00:25:00,720 use cases rather than investing in 762 00:24:58,740 --> 00:25:04,280 massive complexity up front 763 00:25:00,720 --> 00:25:04,280 thank you very much any questions 764 00:25:08,580 --> 00:25:13,940 I was going to offer to just give the mic 765 00:25:10,500 --> 00:25:13,940 directly to Russell but that's fine 766 00:25:22,200 -->
00:25:26,400 a little bit off topic maybe but um do 767 00:25:24,360 --> 00:25:28,140 you have any advice for deploying such 768 00:25:26,400 --> 00:25:30,299 things 769 00:25:28,140 --> 00:25:32,400 I am 770 00:25:30,299 --> 00:25:34,679 Ultra of Kubernetes so I am obviously 771 00:25:32,400 --> 00:25:36,960 heavily biased towards it um it is my 772 00:25:34,679 --> 00:25:40,200 weapon of choice for most things is it a 773 00:25:36,960 --> 00:25:42,179 bit heavyweight for a small ETL 774 00:25:40,200 --> 00:25:43,500 so the problem with Kubernetes is if it 775 00:25:42,179 --> 00:25:45,539 was the only thing you were doing in 776 00:25:43,500 --> 00:25:49,200 Kubernetes absolutely extreme overkill 777 00:25:45,539 --> 00:25:50,400 way too much to learn uh I do so this 778 00:25:49,200 --> 00:25:52,860 this thing that we've been looking at is 779 00:25:50,400 --> 00:25:54,720 actually deployed in Kubernetes on a on 780 00:25:52,860 --> 00:25:56,640 a small hosting company that gives me a 781 00:25:54,720 --> 00:25:58,860 little a little mini server that runs 782 00:25:56,640 --> 00:26:01,080 k3s but it's really easy for me because 783 00:25:58,860 --> 00:26:02,460 I know all of it if you were if your 784 00:26:01,080 --> 00:26:05,039 team doesn't and you were learning it 785 00:26:02,460 --> 00:26:07,500 from scratch again dramatic overkill 786 00:26:05,039 --> 00:26:08,700 um the difficult thing is that finding 787 00:26:07,500 --> 00:26:11,100 places that are compatible with 788 00:26:08,700 --> 00:26:13,320 long-running processes is difficult this 789 00:26:11,100 --> 00:26:16,140 is not the model of things like say 790 00:26:13,320 --> 00:26:18,480 Cloud Run or Lambda where they want to 791 00:26:16,140 --> 00:26:19,500 control the the event loop structure for 792 00:26:18,480 --> 00:26:21,480 you 793 00:26:19,500 --> 00:26:23,400 um you can make it work with those 794 00:26:21,480 --> 00:26:25,940 though and certainly if you can I highly 795 00:26:23,400 -->
00:26:25,940 recommend it 796 00:26:28,140 --> 00:26:30,919 hands 797 00:26:32,520 --> 00:26:37,679 you very subtly dropped a reference to 798 00:26:34,500 --> 00:26:40,260 PEP 703 and then didn't mention it 799 00:26:37,679 --> 00:26:43,200 um can you mention it now what impact 800 00:26:40,260 --> 00:26:44,820 is PEP 703 going to have on on those 801 00:26:43,200 --> 00:26:47,159 sort of optimization strategies that's a 802 00:26:44,820 --> 00:26:50,100 very good question uh 803 00:26:47,159 --> 00:26:52,320 stay tuned so the PEP so PEP 703 is 804 00:26:50,100 --> 00:26:54,120 Python without a GIL this is very 805 00:26:52,320 --> 00:26:57,120 exciting to a lot of people myself 806 00:26:54,120 --> 00:27:01,140 included uh but it has not yet been 807 00:26:57,120 --> 00:27:02,900 actually accepted so I don't know has it 808 00:27:01,140 --> 00:27:05,159 yes 809 00:27:02,900 --> 00:27:08,700 steering committee said they are going 810 00:27:05,159 --> 00:27:10,919 to accept it they have not accepted it 811 00:27:08,700 --> 00:27:12,539 as they have said that with some changes 812 00:27:10,919 --> 00:27:14,100 they will accept it but I do not yet 813 00:27:12,539 --> 00:27:16,919 know what those changes will be because 814 00:27:14,100 --> 00:27:22,140 they have not accepted it yet 815 00:27:16,919 --> 00:27:24,240 so it will be accepted mostly as is uh 816 00:27:22,140 --> 00:27:26,400 the intent is that there will be a 817 00:27:24,240 --> 00:27:27,840 compile time flag that you can set that 818 00:27:26,400 --> 00:27:31,020 will build Python without the GIL this 819 00:27:27,840 --> 00:27:32,820 means that if you just like go into a 820 00:27:31,020 --> 00:27:34,559 standard like Ubuntu server and you run 821 00:27:32,820 --> 00:27:36,539 Python you're going to get Python with 822 00:27:34,559 --> 00:27:38,279 the GIL same as we've always had it 823 00:27:36,539 --> 00:27:40,740 will be a build time thing you will need 824 00:27:38,279 --> 00:27:42,840
to use a specialized build of Python and 825 00:27:40,740 --> 00:27:44,279 probably specialized libraries to some 826 00:27:42,840 --> 00:27:47,940 extent 827 00:27:44,279 --> 00:27:49,799 um but you will get all of the benefits 828 00:27:47,940 --> 00:27:51,659 of real threading like we have had in 829 00:27:49,799 --> 00:27:53,520 say Java or Go 830 00:27:51,659 --> 00:27:56,960 exactly how that will end up working is 831 00:27:53,520 --> 00:27:56,960 still a large open question though 832 00:27:57,559 --> 00:28:03,059 hi when you were talking about uh web 833 00:28:00,299 --> 00:28:04,980 scraping uh you mentioned not breaking 834 00:28:03,059 --> 00:28:07,919 the terms and conditions of a website 835 00:28:04,980 --> 00:28:09,539 yes uh are you from what are the legal 836 00:28:07,919 --> 00:28:14,100 implications of breaking the terms and 837 00:28:09,539 --> 00:28:16,820 conditions except not being uh like 838 00:28:14,100 --> 00:28:19,380 allowed to access the website anymore 839 00:28:16,820 --> 00:28:21,179 I really cannot answer that question I 840 00:28:19,380 --> 00:28:22,620 am sorry uh I would feel uncomfortable 841 00:28:21,179 --> 00:28:23,760 answering it I am if the accent 842 00:28:22,620 --> 00:28:27,240 doesn't give it away I am not from 843 00:28:23,760 --> 00:28:29,820 Australia I super do not know your uh 844 00:28:27,240 --> 00:28:33,779 computer laws here and I would feel very 845 00:28:29,820 --> 00:28:35,279 poorly equipped to opine uh about things 846 00:28:33,779 --> 00:28:37,260 uh 847 00:28:35,279 --> 00:28:39,600 uh speak to a legal professional at your 848 00:28:37,260 --> 00:28:42,539 company or a friend uh that person 849 00:28:39,600 --> 00:28:46,520 should be more knowledgeable than me 850 00:28:42,539 --> 00:28:46,520 we have time for one more question 851 00:28:47,580 --> 00:28:52,860 no thank you very much everybody yay 852 00:28:50,290 --> 00:28:53,590 [Applause] 853 00:28:52,860 --> 00:28:57,240 thank you 854 00:28:53,590
--> 00:28:57,240 [Applause]