00:00
This material is made available to you by or on behalf of the University of Melbourne under section 113 of the Copyright Act 1968. It may be subject to copyright. For more information, visit the University copyright website. So there are two places... Okay, we're going to make a start. So welcome again, and thank you all for turning up on Friday. I know it's been a long week, it's the first week of the semester, so I'm very glad to see quite a lot of you turning up. On Wednesday I was running a little bit out of time, so I didn't get the opportunity to introduce our student representatives.
01:17
Are you here? Please come to the front. Please come. So these are the two students, and I'll let them introduce themselves. For this subject, if you have any feedback, any concerns, or anything that you want to let us know, you can also contact them. So, do you want to introduce yourself, and how they can contact you? Hi guys, my name's Isa. My email's up there, so if you ever want to give any feedback, or if you want to talk about something that's bothering you, send me an email and I'll be in coordination with the staff on EODP. So yeah, it's great to meet all of you, and thank you.
02:17
Lucy, is Lucy here? No? Okay. So, if we have time I'll talk about assessment and other things later, but let's just start the lecture content for this week. Okay, so.
03:01
Wednesday was a little bit of motivation, letting you understand the overall context of what this subject is about. If you recall, we started with being a data wrangler, trying to do data wrangling. Everything starts with the data, okay? Data comes from everywhere, so we are going to start this lecture by introducing you to some very basic data formats. So, just as a motivation, this is a very, very simple scenario. Your manager might ask you to do this kind of thing. It's your first week in the job, and this is what you need to do. You've got maybe until Friday, or a week, and after that you have to actually, um,
04:04
produce a report about customer segmentation, or produce an analysis of the customer profiles. And the data can be stored anywhere, depending on the types of data attributes you want to include for the customer. This is a simplified example, but what I want to show you is that when you start to deal with data, you are faced with different formats. This is an introductory subject, so we are only going to cover a few very basic and well-known ones, and these few data formats are very common ones that you are going to see. Okay, and then of course, if you recall, on Wednesday I showed you these open data websites.
05:00
So, from the Melbourne council, or the Victorian government, or the government website: if you actually found time to go into those websites, you will realize that the data formats are a lot richer than what we are going to introduce to you. But as a minimum you should be competent with these basic ones. These are the kinds of formats that are very easy to interpret, even with the human eye. Okay, and some of you might be very familiar with all of these already. Okay, so let me... Is that my computer? Okay. Raise your hand if you know what CSV is. Have you ever dealt with CSV files? Yes, good. XML? Okay, good. JSON?
06:02
Okay, so that's good, because at least for the first topic you're going to find it quite familiar. But before we go into that, let me give you an overall structure of how we generally categorize data. All the data formats that I've just mentioned, and that we are going to cover in the first broad topic of this subject, fall into different categories. Okay, so there is structured data. Does anybody know what structured data means? No? Okay. It's a Friday, that's okay, I ask for too much. So structured data is like relational data. Do you have any experience with MySQL, that kind of thing? Yes? So the data is structured, and by structured we mean that the,
07:11
okay, how do I define structure? What needs to be in the data set, the attributes, and the relationships between different attributes are well defined beforehand, before you populate the data. And because it has a structure, you can have very sophisticated methods to manage it, to optimize it, to make it fast. Okay, and for the attributes, at any location when you see the data, you know exactly what the constraint is and what the meaning is. So that's structured data. And so... yeah, that's interesting. Okay, so CSV is also an example of structured data. The other extreme is unstructured. For example, in some of your assignments maybe you need to write a report.
08:10
Right, you need to write an essay. Apart from needing to have paragraphs, nobody tells you how you need to format it. So it's unstructured: you are free to use your creativity, you're free to write whatever you need to write. It's free text. There's no constraint about how you can format it, or how you can make up the data set as a collection of text. And semi-structured is what we're going to introduce. There is some structure in there, but it is not very rigid. Okay, so that's semi-structured. So why do we need different formats?
09:01
Why do we have to have data in all these different formats? Yes, so there are different sources. As I keep going back to, in reality, when you want to source the data for your analysis, it doesn't live in one place. And initially the data might have been collected or generated for different purposes, and that's why you have different kinds of forms. Quite common sense. So we will introduce a few of them in this lecture and the next. But before I start: we are not going to cover the relational database, because the relational database on its own deserves the length of a whole subject to cover it.
10:00
And so, a relational database, like you saw early on in the previous slide, has got structure. Basically you have tables. A table is a conceptual entity that you decide on; the columns are normally the attributes, and the rows are the instances. Okay, and this kind of table allows you to query a cell, a row, or a column quite easily. That's just an example. So if you look at this example, a column is an attribute. If this is a table, I have one, two, three, four, five, six, seven rows, and these are the columns. So what can you observe about the columns, the attributes? What are the attributes for these columns?
11:23
So if you look at customer ID, the values generally have the same format. Right, so for example in this case they are probably all text, but maybe if I add another attribute, price, you'll see that those are numeric values. So the structure includes the data type, and perhaps behind the scenes there is also a constraint. For example, maybe the customer street address is mapped to something that is only a valid address in Australia. Okay, so there are some constraints that define what values you can have in the attributes.
12:05
And that's another different kind of design. A relational database normally lives under a database management system, so it's highly optimized, and as a result the design means that you don't have a giant flat file that contains everything. It is a very memory-space-efficient way of storage. Right, so it's separated into multiple tables. Okay, and this is called normalized. And you can do all sorts of things with it. One of the tables here has been created with SQL syntax like that.
13:02
And then you can query it using SQL. SQL is basically based on some set theory, so there is a very precise definition of how you can construct the logic and then retrieve all the matching records, or a subset of records, from the database. So when I say structured, it is like this. There is a whole subject on it, if you take that. This is a very relevant aspect of the different data sources for data wrangling, but we are not covering it, because it is a completely big topic. Okay, so it includes how I can internally have a pointer to get data access quickly, and underneath, how I apply compression to keep the file in the most memory-efficient form, and how I optimize for maybe thousands, even hundreds of thousands, of people concurrently accessing the database at the same time.
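As a rough illustration of the table ideas above (typed attributes as columns, a constraint, normalized tables, and a SQL query), here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are made up for illustration; they are not the tables shown on the slide.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two normalized tables instead of one giant flat file (hypothetical schema).
conn.execute("""
    CREATE TABLE customer (
        customer_id TEXT PRIMARY KEY,
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE purchase (
        purchase_id INTEGER PRIMARY KEY,
        customer_id TEXT NOT NULL REFERENCES customer(customer_id),
        price       REAL NOT NULL CHECK (price >= 0)
    )
""")

conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [("C001", "Alice"), ("C002", "Bob")])
conn.executemany("INSERT INTO purchase (customer_id, price) VALUES (?, ?)",
                 [("C001", 19.5), ("C001", 5.0), ("C002", 42.0)])

# The constraint is enforced by the database, not by the person typing.
try:
    conn.execute("INSERT INTO purchase (customer_id, price) VALUES (?, ?)",
                 ("C002", -1.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# A query that joins the two tables and retrieves the matching records.
rows = conn.execute("""
    SELECT c.name, SUM(p.price)
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.name
""").fetchall()

print(rejected)  # True: the negative price was refused
print(rows)      # [('Alice', 24.5), ('Bob', 42.0)]
```

The point of the sketch is only that the structure lives in the database itself: the schema is declared before any data is entered, and a value that breaks the constraint cannot be stored.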
14:06
Okay, so that is a very complicated thing. So this is the structure for a relational database. Nowadays, in the big data world, you also have streaming data and distributed databases, and in terms of structure we don't just use SQL, we also use NoSQL. NoSQL means "not only SQL", and it is designed to suit distributed databases. Say I want to join two tables: customer information is stored on some server in Australia, and their purchase record information is stored somewhere else, maybe in the US, physically. Joining becomes quite hard if I have to fetch the entire table from far away, down through the network to Australia. Because of these distributed considerations, there will be some other designs. This is just to illustrate the sort of things that you will learn in database systems.
15:29
Okay, so the mechanism is quite complex, and you just need to be aware that there is a subject there. Is anyone taking that subject this semester? Good, okay, so not many. You don't have to in order to do this subject well, but it's certainly going to help you to increase the tools that you are going to have. So, enough about structured data. We are going to
16:06
look at data that is less well defined, and then see how we can apply our skills to be able to make use of the data. Okay, so for example, text and HTML: they don't have that well-defined schema like the relational database has. But as you can see, apart from those online systems, there are still many, many data sources that actually have unstructured data or semi-structured data, or data that is not so well defined. They may be in CSV or Excel, lying around everywhere. That is what we're going to cover, in Python.
17:01
So in the lecture I'm going to introduce you to what the different formats are, and the sorts of tricks that people use to process them. But in your workshop, with your tutors, you're going to have hands-on experience trying to manipulate data in different formats. Just an example of CSV. You're all very familiar with this. CSV, okay, so. What do you notice about this example? If you open it in Excel, it probably looks like this.
18:00
If it's in Notepad, it probably looks like that. Okay. So is CSV structured or not structured? Structured? I didn't prepare a poll for this question. Structured, raise your hand. And not structured? Okay, good. I think both answers are correct. The thing is, if you are very disciplined, maybe it's structured. You agree? As you can see in this example, early on when I showed you the relational tables, I was showing you the attributes as columns, and that's what people are used to working with. But with CSV you are free to do whatever you want. Nothing is stopping me from entering something here that does not conform to, you know, a valid value for glucose.
19:06
Yeah, you agree? Okay, so that's CSV. I know maybe you are all very familiar with CSV, but when you are dealing with data like that you still have to do some error checking, which we'll cover later on. You still have to worry about it, because the structure is really free. The software doesn't give you the structure; the structure is in your head. So you need to apply your understanding and do error checking. But it is really easy to use. It's not like the relational database I introduced earlier: with the relational database, if it is a numeric value and you are putting in a string, you can't even enter the data. Okay, so that's the thing, but CSV will allow you to do that.
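To make the error-checking point concrete, here is a small sketch using Python's standard csv module. The column names (patient_id, glucose) and the sample values are assumptions for illustration, not the data on the slide; the idea is simply that the file format accepts anything, so the check has to be written by hand.

```python
import csv
import io

# Hypothetical CSV content; the second data row has a non-numeric glucose value.
raw = """patient_id,glucose
p1,5.4
p2,high
p3,6.1
"""

def check_glucose(text):
    """Return (valid_values, problems) for the 'glucose' column."""
    valid, problems = [], []
    reader = csv.DictReader(io.StringIO(text))
    # Data rows start at line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        try:
            valid.append(float(row["glucose"]))
        except (TypeError, ValueError):
            problems.append((line_no, row["glucose"]))
    return valid, problems

valid, problems = check_glucose(raw)
print(valid)     # [5.4, 6.1]
print(problems)  # [(3, 'high')]
```

A real check would probably also test for a plausible range, but even this much catches the kind of entry that a relational database would have refused outright.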
20:03
And this is another example. This looks quite familiar to me. I often get data with a few thousand rows, a few hundred thousand rows, of this kind of thing. We'll see how to deal with it later on. Now let's look at unstructured data. So that's an example: an essay, okay, or just some notes that you take. So is this structured or not structured?
21:08
Mm. So if I say this is a collection of records, can you see how many records there are on the screen? Yeah, I'd probably say three, right? And so if I then ask you, I actually want to summarize the data, or the heart rate, how are you going to do that? I'm showing only three, but suppose I've got a thousand patients, a thousand people doing this measurement, and I say this is how I recorded my data in a text file. Now tell me, what's the average
22:00
blood pressure of the thousand people? How are we doing that? Start from the second line and take every four rows. Okay, I like that, this is you thinking in an algorithmic way. So that's an algorithm. Practically, how do I do that? Yeah, good. So you're kind of assuming this is a very careful person entering the data, right? What happens if one patient actually forgot to put in one of the things? Let me see which one. Say for the seventy-first patient out of my one thousand patients, I forgot to put in the date.
23:09
So the algorithm will break, right? You say "every four lines", and then you're going to get the wrong things, because one line is missing and everything is out of sync. But you're familiar with Word: when you are trying to find an error, or I mistyped a course code, I do Ctrl+F. Yes, so do Ctrl+F, and then maybe I replace everything. So that's the sort of pattern recognition that you are probably familiar with. You have experience with Word, you do that, right? So we are going to try and do that. It is hard to index, hard to organize, because we are all human beings, and human beings very easily make mistakes.
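The "start from the second line, then every four rows" idea, and the way it falls out of sync, can be sketched in a few lines of Python. The record layout here (name, blood pressure, heart rate, date) and the values are assumptions made up for illustration.

```python
# Hypothetical layout: each record is 4 lines (name, blood pressure, heart rate, date).
complete = [
    "Patient A", "BP 120", "HR 70", "2019-03-01",
    "Patient B", "BP 130", "HR 80", "2019-03-02",
    "Patient C", "BP 110", "HR 65", "2019-03-03",
]

def every_fourth_from_second(lines):
    """Start from the second line and take every fourth line after it."""
    return lines[1::4]

careful = every_fourth_from_second(complete)
print(careful)  # ['BP 120', 'BP 130', 'BP 110']

# Same notes, but the date for Patient A was never entered.
missing_date = [line for line in complete if line != "2019-03-01"]
broken = every_fourth_from_second(missing_date)
print(broken)   # ['BP 120', 'HR 80', 'HR 65']
```

After the missing line, every later pick lands on the wrong field, which is why a fixed position is only safe when the layout is guaranteed.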
24:07
So if you are certain the internal structure is very well formed, then you can do the fixed-row approach and grab the result. But we are going to go for a different method. Okay, so. When you do Ctrl+F and then enter the word, what happens? What's the restriction? Do you find that sometimes it's hard to do? If you spell one word wrong and you know what the wrong spelling is, then maybe you can do that. But suppose you are trying to find some more abstract patterns, where maybe some of the errors are like: you type a letter a, and then maybe
25:02
the error is anything that is more than one a. Sometimes maybe the keyboard is faulty or something, so in some of the words it is one a, in some of them it is two or three, but anything that's more than one a is wrong. And if it is a large document and I don't know what the maximum number of repetitions is, then it's really hard. I have to do Ctrl+F, "aa" is wrong, Ctrl+F, "aaa" is wrong, so you kind of have to go through all the possible exact misspellings that you made, right? But there is a more powerful way to search for a pattern in text. That's called a regular expression. Okay, so for example, in my large text document, or maybe thousands of text documents, I want to find whether it contains any IP addresses.
26:08
Okay, and for example in the intelligence context, maybe I want to find all the telephone numbers or all the account details. If it's a free-text document, how am I going to do it? It's similar to what someone there was saying early on: I need to be able to express the pattern. Right. And with regular expressions you can search for it, and you can do more than that. After you have found the pattern you can do something about it: you can either correct it or remove it, you can do something to fix your data. So what we're going to do is go through the very basic rules about how you can use this powerful expression to find patterns in free text.
27:10
Okay, so let's just go through these. Each dot point you see is one of the main rules that you need to know. Okay, first of all, for now just think of it as Ctrl+F in your Word document, where you are correcting your spelling or finding some patterns. In my Ctrl+F, if I type a dot, I want the word processor to actually match any character. That's what a regular expression gives you: if you type a dot, it will match any character in the text.
28:00
And this symbol, the caret (^), will match the start of the string. So if you have ten lines and you type that, you will find the beginning of the line. These are the building blocks for you to make up powerful expressions. And then its friend is the dollar sign, which matches the end of the line or string. And then if I put a star after anything, that means I'm specifying that that particular character, or some pattern, has to repeat zero or more times.
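The first few building blocks (the dot, the start and end anchors, and the star) can be tried out with Python's standard re module. This is only a small sketch; the sample strings are made up, and re.MULTILINE is used so that the anchors apply to each line rather than only to the whole string.

```python
import re

text = "cat\ncot\ncut scat"

# '.' matches any single character (except a newline by default).
any_char = re.findall(r"c.t", text)

# '^' anchors to the start, '$' to the end; re.MULTILINE makes them work per line.
line_start = re.findall(r"^c.t", text, flags=re.MULTILINE)
line_end = re.findall(r"c.t$", text, flags=re.MULTILINE)

# '*' means the item just before it repeats zero or more times.
zero_or_more = re.findall(r"co*t", "ct cot coot cat")

print(any_char)      # ['cat', 'cot', 'cut', 'cat']
print(line_start)    # ['cat', 'cot', 'cut']
print(line_end)      # ['cat', 'cot', 'cat']
print(zero_or_more)  # ['ct', 'cot', 'coot']
```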
29:00
Okay, I can have zero or more repetitions. And the plus is one or more repetitions. And then this vertical bar is the "or" operator, so you can say I want to match "morning", put the bar there, or "afternoon". Okay, a or b, so you put a bar. And sometimes I get lazy and don't want to type everything: a square bracket gives you a set of characters. And one thing to be careful about, because I think you will need to use this quite a lot in your practice: in the second point, when I say this, what do you call this, caret character matches the start of the string, that is only when it is outside of the square bracket. When I have a square bracket where I want to specify a set of characters, then if I put the same symbol at the front inside the bracket, it means negation.
30:08
Okay, it means "not". So for example, this means that I'm matching any character that is not a to z. And what's commonly used together with the "or" operator is the round bracket. The round bracket groups the "or" operator. So "x y vertical bar z" means I'm trying to match the pattern x followed by y, or the pattern with the single character z.
31:02
But if I put a round bracket there, you know, between x and y, this means I'm looking for a pattern where x has to be there, followed by the options, either y or z. Okay, so the thing is, it is kind of training your logical mind a little bit to try and construct the semantics of the pattern that you are looking for. And there are a lot more detailed, more complex rules that you can learn; you can go to this website and learn more. But I think what we introduce in class, what you learn in the workshop, and some of the things you have to do in the assignment will give you some basics, and then you can start working on data like that until you hit something you don't know how to handle. By that time, I believe we will have given you enough foundations for you to look it up and understand the explanations
32:16
online yourself. Okay, so let me show you. Generally, when you put in a string it is like Ctrl+F in Word. It is exactly the same pattern. For example, "ja" means I am looking for a pattern that is j immediately followed by a. So the sequence is as you type it. And the second one. Okay, this is a point where we need to make sure that we don't get confused, because in a lot of other software, commercial software, star means wildcard.
33:02
So this "ja star". Okay, please hear the whole thing at once, don't chop it up and then think I told you the wrong thing. In a lot of commercial software, "ja star" will match "ja" followed by anything else, right? But this is not the case in regular expressions. "ja star": remember that the star means zero or more occurrences, and because it is a quantifying operator and it follows a, I'm saying that a can appear zero or more times. Okay, so the difference is whether a is there. So that is an important one. And then the quantifier can be placed after a bracket as well.
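The difference between the two readings of the star can be checked directly in Python. In this sketch the standard fnmatch module stands in for "other software" where the star is a wildcard (that choice is an assumption for illustration), while re gives the regular-expression reading. The word list is made up.

```python
import fnmatch
import re

words = ["j", "ja", "jaaa", "jam"]

# Wildcard reading: '*' stands for "anything", so 'ja*' needs 'ja' first.
wildcard_hits = [w for w in words if fnmatch.fnmatchcase(w, "ja*")]

# Regular-expression reading: '*' quantifies the 'a' before it (zero or more a's).
regex_hits = [w for w in words if re.fullmatch(r"ja*", w)]

print(wildcard_hits)  # ['ja', 'jaaa', 'jam']
print(regex_hits)     # ['j', 'ja', 'jaaa']
```

Note that a lone "j" matches the regular expression (zero a's) but not the wildcard, and "jam" does the opposite.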
34:04
Okay, so that will mean anything I specify inside the bracket can be repeated zero or more times. Is it clear? Does it make sense? So inside the bracket is "joa", right, and I'm saying that "joa" can appear zero or more times. And this one is one or more repetitions. Okay, so let me show you. Let me give you one.
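A short sketch of quantifying a bracketed group, together with the earlier rule about round brackets and the "or" bar, using Python's re module. re.fullmatch is used so that the whole test string has to fit the pattern; the test strings are made up.

```python
import re

candidates = ["", "joa", "joajoa", "jo"]

# '(joa)*': the whole group may repeat zero or more times.
star = [s for s in candidates if re.fullmatch(r"(joa)*", s)]

# '(joa)+': the group must appear at least once.
plus = [s for s in candidates if re.fullmatch(r"(joa)+", s)]

# Without brackets the bar splits the whole pattern: 'xy' or 'z'.
ungrouped = [s for s in ["xy", "xz", "z"] if re.fullmatch(r"xy|z", s)]

# With brackets the bar only applies inside them: 'x', then 'y' or 'z'.
grouped = [s for s in ["xy", "xz", "z"] if re.fullmatch(r"x(y|z)", s)]

print(star)       # ['', 'joa', 'joajoa']
print(plus)       # ['joa', 'joajoa']
print(ungrouped)  # ['xy', 'z']
print(grouped)    # ['xy', 'xz']
```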
35:11
Maybe I can show you some first. This is what I prepared earlier. This is a very cool website called regex101.com. In here you can start to test out your regular expressions. Okay, so for example I can do "a". Can you see? The highlight shows me what I'm matching, so this is kind of an advanced version of Ctrl+F. What else can I do? I can do "ae". Nothing: the document I have doesn't have "ae". I can do an "r", "e". So I have three matches for "re" as a pattern.
36:01
Right, and I can do, maybe... what else? I can say an r followed by any of a, e, i, o, u. Okay, so in your own time you can go and practice your regular expressions. So this means I'm matching two letters, right? The first one has to be r, and the second one can be anything in the bracket. Okay.
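The "r followed by any vowel" pattern from the demonstration can be reproduced offline with Python's re module. The sample sentence is made up; it is not the document used on the website.

```python
import re

sample = "red roses grow near the river bridge"  # hypothetical text

# 'r' followed by exactly one character from the set a, e, i, o, u.
matches = re.findall(r"r[aeiou]", sample)

print(matches)  # ['re', 'ro', 'ro', 'ri', 'ri']
```

The r at the end of "near" and "river" is not matched, because the character after it is a space rather than one of the letters in the bracket.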
37:07
Okay, so that's that. So, just a final one. What do you think this is doing? What is the regular expression in that dot point? Email. Yeah, okay. So if we decompose this, we see this "at" sign, that's a character, and I have some patterns that I specify before that and some patterns I specify after that. Okay, so, oh sorry, I should make it... Okay, so for example, this is the "at".
38:08
This is an "at", and before it I specify a class of characters that I'm allowing as the name of the email, right? And then after it a class of characters followed by a dot. Okay, so this one is kind of special, because remember I told you dot means any character. Because it is such a special character, if I want to specify exactly the dot symbol itself, I have to escape it. That's why I have a backslash character before it. So these two together mean the dot character itself. Okay, so this is a regular expression. Let's see. Let's see whether... I should have copied it. Okay, let me copy this.
39:10
So let me type this in; this is what's on the slide. And then I put in my sort of made-up email address there. My personal one matches: paulingdotlin@gmail.com. Okay, so that's good. But if I do this one, that's my university account email address. So what can you see? What's the problem? It is not quite matching, right? It matches "unimelb.edu", and then it can't match the rest, the ".au", or dot something dot something. Sometimes the email is really long.
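The partial match can be reproduced with a pattern written in the spirit of the one described: a class of name characters, an "at" sign, a class of characters, an escaped dot, and one more class of characters. The exact expression on the slide is not in the transcript, so the pattern and the addresses below are assumptions for illustration only.

```python
import re

# Assumed pattern (not the slide's exact expression):
# name characters, '@', a domain label, an escaped dot, one more label.
pattern = r"[a-z.]+@[a-z]+\.[a-z]+"

short_match = re.search(pattern, "someone@example.com").group()
long_match = re.search(pattern, "someone@example.edu.au").group()

print(short_match)  # someone@example.com
print(long_match)   # someone@example.edu
```

The second address is only matched up to ".edu": the pattern allows exactly one "dot something" after the first label, so the trailing ".au" is left out.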
40:03
Okay, so how do we improve it? Any suggestions for how we can improve this? Because it is almost right, but not quite: it can match "unimelb.edu", but it doesn't match "unimelb.edu.au" or dot something more. Yes? [unclear] You mean at the end? Here? Okay, so if I change this to... okay, so maybe, in the interest of, yeah, maybe I can tell you. Here the plus is really for this
41:02
set of characters. So I'm saying this plus sign is for the characters that come after the dot. It means that you can have more than one character. Yeah, so remember the quantifying operators, the star or the plus, can quantify a bracket, you know, a sub-pattern if you want. If we remember the earlier example, "unimelb.edu" is okay, but what I want is for it to continue to match dot something, dot something, dot something, right? So this is the part that gives me the last dot something. Yeah, and then I just want to have it
42:05
repeatable. Yeah, so that gives me the whole thing. So that's for you to practice; this is just to illustrate to you that writing regular expressions takes practice, and it's quite interesting sometimes. Any questions so far about regular expressions? Yes? Um, is there no difference inside or outside of the bracket? You mean the round bracket or the square bracket? Bracket.
43:02
Yeah, sorry, yes, okay, that's a very good point. So inside the square bracket, the dot... if I go back to these. Okay, most of these special meanings of the patterns do not work inside the square bracket. Inside the square bracket you specify exactly the literal characters that qualify. So for example this dot: if I put it outside, it matches any character; if I put it inside the square bracket, it just means itself. And if I want a literal dot outside of a square bracket, then I need to escape it with a backslash. So that's the whole thing about it, yeah.
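Both points, the dot inside versus outside a square bracket and the repeatable "dot something" part of an email pattern, can be checked with Python's re module. The email pattern and the address below are assumptions for illustration, not the expression from the slide.

```python
import re

# Outside square brackets an unescaped dot matches any character.
unescaped = re.findall(r"a.c", "abc a.c")

# Escaped, or inside square brackets, it only matches a literal dot.
escaped = re.findall(r"a\.c", "abc a.c")
in_class = re.findall(r"a[.]c", "abc a.c")

# Putting "dot something" in a round bracket and quantifying it with '+'
# lets it repeat one or more times (assumed pattern).
pattern = r"[a-z.]+@[a-z]+(\.[a-z]+)+"
whole_address = re.fullmatch(pattern, "someone@example.edu.au") is not None

print(unescaped)      # ['abc', 'a.c']
print(escaped)        # ['a.c']
print(in_class)       # ['a.c']
print(whole_address)  # True
```

This is still a loose pattern (real email addresses allow many more characters), so it is a sketch of the quantified-group idea rather than a validator.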
44:03
It is any one of those. Yes, any order. In fact it just means any one of them. Then you can use the quantifying operator to say how many you want, yeah. Okay, so if there are no more questions, we don't have a lot of time. So this one is "to be or not to be", but don't worry about that. Let me do HTML. You all know HTML, it is a document language; this is an example of HTML. Okay, are you familiar with HTML documents? It's the markup language that allows you to display web pages. Okay, so it looks like this. So in general HTML is for marking up the presentation.
45:05
It tells the browser how to display the information on the web page, so all the tags are about presentation. Okay, so for example this h2 is a header 2, and then I'm going to display the rest of the information as a list, an unordered list. So that is what happens with HTML. It's designed purely for presentation. It's hard for us to read; it's easy for the browser to read. So that's what you need to know about HTML. And if you've learned the HTML markup language, you know that all you need to do is learn a set of tags that is already predefined. So p means paragraph, b means bold,
46:02
table and headers and all those things; it's a predefined set, and you just use it to tell the browser how you're going to display the web page. Okay, so that's HTML, nothing special. But what you need to be aware of is that HTML should be well formed as well. Normally if you have an opening tag you should have a closing tag to complete an instruction for presentation. However, as you can see, nowadays browsers have to serve so many different people that they are tolerant. If you get it wrong, they can still display it. The format might not be what you expected, but somehow it will still render; it won't crash.
47:02
Okay, so that's HTML. In Python there are libraries with which you can take an HTML page and scrape off all the markup tags, and just extract the content information out of it. Okay, and the next one is Extensible Markup Language; in short it's called XML. I'm just going to use an example. The syntax is quite simple, but there is a very distinct difference between HTML and XML. Okay, so this is what I just copied from the subject handbook. This is our subject: the year of offer is 2019, it is at undergraduate level 2, campus, blah. So that's what I see on our website, and if I forget about all the formatting, this is the plain version.
48:06
Okay. And its HTML version looks like this. In here I have already simplified some of it; I'm going to show you in the later slides. But basically it is a table. The tags are about how I am presenting it. Okay, a table, and then these are table rows; inside the table rows I've got a header cell and a data cell. That's all it is. Right, and along with that there are class instructions about how I'm going to treat the cells; these are the sort of things for how I can make the web page more beautiful. And as you can see, this is how I can change the color, you know, the alternating color to beautify the table.
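To connect the table markup with the earlier remark that Python libraries can strip the tags and keep only the content, here is a minimal sketch using the standard html.parser module. The markup is a made-up fragment loosely modelled on a handbook table, and the class names in it are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical markup: a table with rows, header cells and data cells.
page = """
<table class="details">
  <tr class="odd"><th>Year</th><td>2019</td></tr>
  <tr class="even"><th>Level</th><td>Undergraduate</td></tr>
</table>
"""

class CellText(HTMLParser):
    """Collect the text found inside <th> and <td> cells, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        # Keep only non-empty text that sits inside a cell.
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

parser = CellText()
parser.feed(page)
cells = parser.cells
print(cells)  # ['Year', '2019', 'Level', 'Undergraduate']
```

The presentation details (the class attributes and row styling) are simply ignored; only the text content survives.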
49:01
And that's the whole thing. Okay, without going through the rules, I'm just showing you how I can mark the same thing up with XML. So this is an example of what I can do in XML with the same subject guide. Okay, so what can you observe that is different from HTML? What is it? Yeah, so you actually have more meaning in the tags, right? And I actually made them up.
50:04
Okay, these tags are something that makes sense to me, because I am creating the subject guide. It's an extensible markup language, so you are free to define whatever tags you need that make sense for a particular collection of XML documents. Okay, so the tags are chosen by the domain experts, and they are going to make sense to you. You are not bound to p, table, h1, h2, all that kind of presentation-style tag. And the same rule applies: you have to have very well-formed tags for it to be processed properly by programmers or programs. You can see in the example: you have to start with a declaration at the beginning to say that this is an XML document.
51:06
And then after that, the first tag I specify is university. In an XML document you need to have a root element. If you visualize it, it is like a tree structure: you start with the root, and then you can branch out. Within the root you can have a first-level branch, a second-level branch, and this is how the structure looks. You have to have a root, and every tag has to be properly closed with a closing tag. Anything else you notice in this document? What about this green text? What is it? They are attributes. You can have attributes in XML documents. The way to do it is to have an attribute name, an equals sign, and then the attribute value in quotes.
52:14
That's about it. I think rather than going through each individual rule, just learn by example: this is what an XML document looks like. So to recap, you have to have a declaration saying that this is an XML document. Then you start with the root element, and then build a tree-like structure. And each element has to be well formed, you open it and close it, and it can optionally have one or more attributes.
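The recap (declaration, one root element, a tree of properly closed tags, optional attributes) can be seen in a small sketch using Python's standard xml.etree.ElementTree module. The document below is made up: apart from the university root mentioned in the lecture, the tag names, attribute names and values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: declaration first, then a single root element.
doc = (
    '<?xml version="1.0"?>'
    '<university name="Example University">'
    '<subject code="ABC123" year="2019">'
    '<title>Example Subject</title>'
    '<level>Undergraduate</level>'
    '</subject>'
    '</university>'
)

root = ET.fromstring(doc)
root_tag = root.tag                      # the root of the tree
subject = root.find("subject")           # a first-level branch
attributes = subject.attrib              # attributes as a dict
title = subject.find("title").text       # text inside a second-level branch

# A tag that is never closed is not well formed, so parsing fails.
try:
    ET.fromstring("<university><subject></university>")
    well_formed = True
except ET.ParseError:
    well_formed = False

print(root_tag)     # university
print(attributes)   # {'code': 'ABC123', 'year': '2019'}
print(title)        # Example Subject
print(well_formed)  # False
```

Unlike a tolerant browser rendering HTML, the XML parser refuses the badly formed document outright.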
53:04
And then finally, okay, just another thirty seconds and I'll finish. The last point is: sometimes you want to communicate with other programmers or other people who are reading the documents, but you don't want that to become part of the content of the XML. That's when comments come in. They are not considered part of the XML document's content. So you could start using them to chat with people. Don't do that, but you know. The format of the comment is like this: you open it with a less-than sign, an exclamation mark and two dashes, and close it with two dashes and a greater-than sign. Okay, and we will continue next week, on Wednesday. Any questions?
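As a small check of the point that comments are not treated as content, here is a sketch with Python's standard xml.etree.ElementTree module, whose default parser discards comments. The document and its tag names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document containing one comment and one real element.
doc = (
    "<subject>"
    "<!-- note to other readers: check the title -->"
    "<title>Example Subject</title>"
    "</subject>"
)

root = ET.fromstring(doc)

# Only real elements show up as children; the comment is dropped.
children = [child.tag for child in root]
print(children)  # ['title']
```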