Time | 20.07 | Day 1 21.07 | Day 2 22.07 | Day 3 23.07 | Day 4 24.07 | Day 5 25.07 | Day 6 26.07 | Day 7 27.07 | 28.07 |
---|---|---|---|---|---|---|---|---|---|
08:00-09:00 | Breakfast | ||||||||
09:00-11:30 | Statistics for data science | Basics of machine learning (ML) | Prompting techniques and LLM APIs | Time series: Forecasting, XAI, and databases | LLM-based data processing | LLM-based agents for data science | Data pipeline scheduling on the computing continuum | Departures | |
Linear models for classification | Data-driven decision making | Graph data management | Conversational AI | Findable, Accessible, Interoperable, Reusable (FAIR) data | Operationalizing data and ML pipelines | ||||
11:30-12:00 | Unlocking Business Value with Data Science – OMV Petrom’s Approach | ||||||||
12:00-12:30 | Arrivals | Lunch break | Lunch break | ||||||
12:30-14:00 | Lunch break | ||||||||
14:00-17:00 | Statistical learning | Applied Deep Learning | Causal AI | Graph data analytics | Social event | Best practices in data sharing | Deployment, orchestration, monitoring of data and ML pipelines | ||
Basics of Large Language Models (LLMs) | Time series analysis and forecasting | Data enrichment | High performance data processing | ||||||
17:00-19:00 | Intro event | Free time | |||||||
19:00-21:00 | Dinner |
Detailed information about the activities
Statistics for data science (Dan Nicolae)
- A data science case study
- Foundations of data analysis
- Statistical inference with resampling methods
Statistical learning (Dan Nicolae)
- Probability and simulations
- Regression models and inference
- Model Complexity
- Prediction and classification
Basics of machine learning (ML) (Razvan Bunescu)
- Feature vector representations
- Occam’s razor for ML, intelligence, and science
- Overfitting, underfitting, generalization, and regularization
- ML experiments: training, validation, and testing
Linear models for classification (Razvan Bunescu)
- Logistic regression, softmax, and temperature
- ML algorithms in Python: the sklearn library
- Linear vs. non-linear classification and deep learning
Applied Deep Learning (Gabriel Terejanu)
- From linear models to neural networks
- Why deep neural networks?
- What is an embedding?
- How to make use of pre-trained models?
Basics of Large Language Models (LLMs) (Razvan Bunescu)
- Subword tokenization, word embeddings, and neural language models (LMs)
- Encoder, encoder-decoder, and decoder LMs
- Pre-training and fine-tuning
Prompting techniques and LLM APIs (Razvan Bunescu)
- Zero-shot and few-shot in-context learning
- Chain-of-thought prompting, retrieval augmented generation, and ReAct
- The chat completion API and LangChain
Data-driven decision making (Gabriel Terejanu)
- Introduction to A/B testing for decision making
- Designing effective A/B tests
- Analytical techniques in A/B testing
Causal AI (Gabriel Terejanu)
- Importance of causality in AI
- What is a causal model?
- What is an intervention?
- How to estimate causal effects?
Time series analysis and forecasting (Jože Rožanec)
- Introduction to time series
- Analysis tools and real-world examples
- Time series forecasting
Time series: Forecasting, XAI, and databases (Jože Rožanec)
- Using network models to represent and forecast time series
- Introduction to explainability methods
- Introduction to time series databases
Graph data management (Dumitru Roman)
- Intro to graph data structure
- Knowledge Graphs
- Graph data management (graph databases with Neo4j, graph data model, graph construction and querying)
Graph data analytics (Daniel Schroeder, Dumitru Roman)
- Complex Network and Graph Analysis in IGraph
- Intro to Graph Neural Networks in PyG
- Graph data visualization
Data enrichment (Roberto Avogadro, Dumitru Roman)
- Data linking
- Tabular data enrichment
- Human-in-the-Loop (HITL) for data enrichment
LLM-based data processing (Ioan Toma)
- Introduction to LLM-based Data Processing
- Knowledge Extraction using LLMs
- Document Classification, Summarization and Comparison using LLMs
Conversational AI (Ioan Toma)
- Conversational AI setup and designing a chatbot interface
- Semantic Knowledge Graphs and their role in Conversational AI
- Building a chatbot using Onlim Conversational AI framework
LLM-based agents for data science (Hui Song)
- Use of ChatGPT Data Analyst to process data files and generate data processing code
- Development of LLM-based agents for multi-phase data processing tasks
- Multi-agents for complex and collaborative data processing
Findable, Accessible, Interoperable, Reusable (FAIR) data (Anna Fensel)
- Introduction to FAIR data. Examples from agri-food and health domains
- How to make data FAIR? Open data, closed data and everything in between
- Research data infrastructures
Best practices in data sharing (Anna Fensel)
- Legal compliance (GDPR, AI Act, Data Act)
- Consent, contracts and licenses, empowered with knowledge graphs
- Incentivising data sharing
High performance data processing (Radu Prodan)
- Parallel computing architectures
- Multiprocessing
- Parallel algorithms
- Parallel computing for AI and data science
Data pipeline scheduling on the computing continuum (Nikolay Nikolov)
- Introduction to pipeline scheduling in the context of big data and distributed applications
- The importance and challenges of pipeline scheduling
- Solutions and practical approaches to pipeline scheduling
Operationalizing data and ML pipelines (Wiktor Sowinski-Mydlarz)
- Contemporary Data Processing
- GATE Institute Data Platform
- Alternatives and Decisions
- The Lifecycle
Deployment, orchestration, monitoring of data and ML pipelines (Wiktor Sowinski-Mydlarz)
- Data Spaces: Decentralized Supply and Consumption of Data Services
- Private Cloud for Big Data Processing
- Platform support
- Resources: Free products and software bibles
Software (preliminary): Software tools/services to be used during the sessions include:
- Anaconda (https://www.anaconda.com): Installation instructions for various platforms can be found at: https://docs.anaconda.com/anaconda/install
- A number of relevant tools and libraries that we will use can be configured from Anaconda: Python 3, NumPy, SciPy, Matplotlib, Jupyter Notebook, IPython, pandas, and scikit-learn.
- Other Python packages: statsmodels, transformers, lingam
- Onlim Platform (https://app.onlim.com/): Conversational and Knowledge Graph Platform. Accounts can be created at https://auth.onlim.com/auth/realms/onlim/login-actions/registration?client_id=onlim&tab_id=gmTCMEh3-6U
- Neo4j (https://neo4j.com): Installation instructions and documentation can be found at https://neo4j.com/developer/get-started. We will use the online sandbox service provided at https://neo4j.com/sandbox, so no installation on local machines is needed for experimenting with Neo4j. Alternatively, you can download and install Neo4j Desktop, which provides a convenient way for developers to work with local Neo4j databases (this can be downloaded from https://neo4j.com/download-center/#desktop). We will also use Neo4j Graph Data Science (https://neo4j.com/product/graph-data-science), which comes with Neo4j.
- Docker (https://www.docker.com): An open-source containerization platform that will be used for ML pipelines. Installation instructions can be found at https://docs.docker.com/engine/install.
- SIM-PIPE (https://github.com/DataCloud-project/SIM-PIPE): An open-source tool for dry running of Big Data Pipelines using sample data. The tool allows evaluating pipeline performance and resource requirements at scale. An open version of the tool is available on
.
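
Once Anaconda and the additional Python packages listed above are installed, a quick check along the following lines may help confirm that the environment is ready before the sessions. This is only a minimal sketch, not an official setup script: the import names in `PACKAGES` (for example `sklearn` for scikit-learn and `notebook` for Jupyter Notebook) are assumptions based on each project's usual top-level module, and `check_packages` is a hypothetical helper introduced here for illustration.

```python
import importlib.util

# Display name -> import name. The import names are assumptions based on
# each project's usual top-level module, not taken from the programme.
PACKAGES = {
    "NumPy": "numpy",
    "SciPy": "scipy",
    "Matplotlib": "matplotlib",
    "Jupyter Notebook": "notebook",
    "IPython": "IPython",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "statsmodels": "statsmodels",
    "transformers": "transformers",
    "lingam": "lingam",
}


def check_packages(packages):
    """Return {display name: bool}, True when the module can be located."""
    status = {}
    for display_name, module_name in packages.items():
        try:
            # find_spec locates a top-level module without importing it.
            status[display_name] = importlib.util.find_spec(module_name) is not None
        except (ImportError, ValueError):
            # Treat any lookup problem as "not available".
            status[display_name] = False
    return status


if __name__ == "__main__":
    for name, found in check_packages(PACKAGES).items():
        print(f"{name:18s} {'found' if found else 'MISSING'}")
```

Running the script in the Anaconda environment prints one line per package. Anything reported as `MISSING` can usually be installed with `conda install` or `pip install` under the package's distribution name, which may differ from the import name used above.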